Re: [PR] [SPARK-47474][CORE] Revert SPARK-47461 and add some comments [spark]
LuciferYang closed pull request #45602: [SPARK-47474][CORE] Revert SPARK-47461 and add some comments
URL: https://github.com/apache/spark/pull/45602
Re: [PR] [SPARK-47141][CORE] Support enabling migration of shuffle data directly to external storage using config parameter. [spark]
dongjoon-hyun commented on code in PR #45228:
URL: https://github.com/apache/spark/pull/45228#discussion_r1533285590

## core/src/test/scala/org/apache/spark/storage/FallbackStorageSuite.scala:

```diff
@@ -172,61 +172,64 @@ class FallbackStorageSuite extends SparkFunSuite with LocalSparkContext {
   }

   test("migrate shuffle data to fallback storage") {
-    val conf = new SparkConf(false)
```

Review Comment: Please preserve the existing test coverage as much as possible. We had better add new feature test coverage separately while keeping the existing test coverage unchanged, in order to prevent any regression.
Re: [PR] [SPARK-47141][CORE] Support enabling migration of shuffle data directly to external storage using config parameter. [spark]
dongjoon-hyun commented on code in PR #45228:
URL: https://github.com/apache/spark/pull/45228#discussion_r1533284633

## core/src/main/scala/org/apache/spark/storage/ShuffleBlockFetcherIterator.scala:

```diff
@@ -1558,6 +1587,7 @@ object ShuffleBlockFetcherIterator {
       mapIndex: Int,
       address: BlockManagerId,
       size: Long,
+      timeTaken: Long,
```

Review Comment: May I ask the reason why we add a new field here? If this is only for pure logging, please revert it.
Re: [PR] [SPARK-47495][CORE] Fix primary resource jar added to spark.jars twice under k8s cluster mode [spark]
mridulm commented on PR #45607:
URL: https://github.com/apache/spark/pull/45607#issuecomment-2011272331

+CC @zhouyejoe
Re: [PR] [SPARK-47141][CORE] Support enabling migration of shuffle data directly to external storage using config parameter. [spark]
dongjoon-hyun commented on code in PR #45228:
URL: https://github.com/apache/spark/pull/45228#discussion_r1533283117

## core/src/main/scala/org/apache/spark/storage/FallbackStorage.scala:

```diff
@@ -188,15 +271,15 @@ private[spark] object FallbackStorage extends Logging {
     val name = ShuffleDataBlockId(shuffleId, mapId, NOOP_REDUCE_ID).name
     val hash = JavaUtils.nonNegativeHash(name)
     val dataFile = new Path(fallbackPath, s"$appId/$shuffleId/$hash/$name")
-    val f = fallbackFileSystem.open(dataFile)
     val size = nextOffset - offset
     logDebug(s"To byte array $size")
     val array = new Array[Byte](size.toInt)
     val startTimeNs = System.nanoTime()
-    f.seek(offset)
-    f.readFully(array)
-    logDebug(s"Took ${(System.nanoTime() - startTimeNs) / (1000 * 1000)}ms")
-    f.close()
+    Utils.tryWithResource(fallbackFileSystem.open(dataFile)) { f =>
+      f.seek(offset)
+      f.readFully(array)
+      logDebug(s"Took ${(System.nanoTime() - startTimeNs) / (1000 * 1000)}ms")
+    }
```

Review Comment: Could you spin off this `Utils.tryWithResource`-related change to a new JIRA? We can merge that first.
Re: [PR] [SPARK-47141][CORE] Support enabling migration of shuffle data directly to external storage using config parameter. [spark]
dongjoon-hyun commented on code in PR #45228:
URL: https://github.com/apache/spark/pull/45228#discussion_r1533282286

## core/src/main/scala/org/apache/spark/storage/FallbackStorage.scala:

```diff
@@ -147,6 +222,14 @@ private[spark] object FallbackStorage extends Logging {
     }
   }

+  def stopThreadPool(conf: SparkConf): Unit = {
+    logInfo(s" Stopping thread pool")
+    if (getFallbackStorage(conf).isDefined &&
+      getFallbackStorage(conf).get.fetchThreadPool.isDefined) {
```

Review Comment: indentation.
Re: [PR] [SPARK-47141][CORE] Support enabling migration of shuffle data directly to external storage using config parameter. [spark]
dongjoon-hyun commented on code in PR #45228:
URL: https://github.com/apache/spark/pull/45228#discussion_r1533281979

## core/src/main/scala/org/apache/spark/storage/FallbackStorage.scala:

```diff
@@ -92,6 +95,57 @@ private[storage] class FallbackStorage(conf: SparkConf) extends Logging {
     val hash = JavaUtils.nonNegativeHash(filename)
     fallbackFileSystem.exists(new Path(fallbackPath, s"$appId/$shuffleId/$hash/$filename"))
   }
+
+  private val fetchThreadPool: Option[ThreadPoolExecutor] = {
+    val numShuffleThreads = FallbackStorage.getNumReadThreads(conf)
+    if (numShuffleThreads > 0) {
+      logInfo(s"FallbackStorage created thread pool using ${numShuffleThreads} thread(s)")
+      Some(ThreadUtils.newDaemonCachedThreadPool(
+        "FetchFromFallbackStorage-threadPool", numShuffleThreads))
+    } else {
+      logInfo("FallbackStorage thread pool not created")
+      None
+    }
+  }
+
+  def fetchBlocks(
+      blockManager: BlockManager,
+      blocks: collection.Seq[FetchBlockInfo],
+      address: BlockManagerId,
+      listener: BlockFetchingListener): Unit = {
+    fetchThreadPool match {
+      case Some(p) if !p.isShutdown =>
+        blocks.foreach(block =>
+          p.submit(new Runnable {
+            override def run(): Unit = {
+              fetchShuffleBlocks(block, blockManager, listener)
+            }
+          })
+        )
+      case _ =>
+        logInfo(s" fetchThreadPool does not exists for $address or shutdown")
+        blocks.foreach(block => fetchShuffleBlocks(block, blockManager, listener))
+    }
+  }
+
+  private def fetchShuffleBlocks(
+      block: FetchBlockInfo,
```

Review Comment: indentation.

Review Comment: indentation?
Re: [PR] [SPARK-47141][CORE] Support enabling migration of shuffle data directly to external storage using config parameter. [spark]
dongjoon-hyun commented on code in PR #45228:
URL: https://github.com/apache/spark/pull/45228#discussion_r1533281749

## core/src/main/scala/org/apache/spark/executor/CoarseGrainedExecutorBackend.scala:

```diff
@@ -354,7 +355,8 @@ private[spark] class CoarseGrainedExecutorBackend(
           // We can only trust allBlocksMigrated boolean value if there were no tasks running
           // since the start of computing it.
           if (allBlocksMigrated && (migrationTime > lastTaskFinishTime.get())) {
-            logInfo("No running tasks, all blocks migrated, stopping.")
+            val timeTakenMs = (System.nanoTime() - startTime) / (1000 * 1000)
+            logInfo(s"No running tasks, all blocks migrated in $timeTakenMs ms, stopping.")
```

Review Comment: Please revert the log-related change of this file because it's irrelevant to the functionality. We can handle this separately if needed.
Re: [PR] [SPARK-47496][SQL] Java SPI Support for dynamic JDBC dialect registering [spark]
yaooqinn commented on code in PR #45626:
URL: https://github.com/apache/spark/pull/45626#discussion_r1533280693

## sql/core/src/main/scala/org/apache/spark/sql/jdbc/OracleDialect.scala:

```diff
@@ -230,3 +224,11 @@ private case object OracleDialect extends JdbcDialect {

   override def supportsOffset: Boolean = true
 }
+
+private[jdbc] object OracleDialect {
```

Review Comment: Oops, the errors occurred:

```
- simple scan with LIMIT *** FAILED *** (65 milliseconds)
[info]   org.apache.spark.SparkException: Task not serializable
```
Re: [PR] [SPARK-45393][BUILD][FOLLOWUP] Update IsolatedClientLoader fallback Hadoop version to 3.4.0 [spark]
pan3793 commented on PR #45628:
URL: https://github.com/apache/spark/pull/45628#issuecomment-2011263855

cc @dongjoon-hyun, it seems this place was missed; or was it intended?
[PR] [SPARK-45393][BUILD][FOLLOWUP] Update IsolatedClientLoader fallback Hadoop version to 3.4.0 [spark]
pan3793 opened a new pull request, #45628:
URL: https://github.com/apache/spark/pull/45628

### What changes were proposed in this pull request?
Update IsolatedClientLoader fallback Hadoop version to 3.4.0.

### Why are the changes needed?
Sync with the default Hadoop version.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Pass GA.

### Was this patch authored or co-authored using generative AI tooling?
No.
Re: [PR] [SPARK-47141][CORE] Support enabling migration of shuffle data directly to external storage using config parameter. [spark]
dongjoon-hyun commented on code in PR #45228:
URL: https://github.com/apache/spark/pull/45228#discussion_r1533275990

## core/src/main/scala/org/apache/spark/internal/config/package.scala:

```diff
@@ -551,6 +551,22 @@ package object config {
       .checkValue(_.endsWith(java.io.File.separator), "Path should end with separator.")
       .createOptional

+  private[spark] val STORAGE_FALLBACK_STORAGE_NUM_THREADS_FOR_SHUFFLE_READ =
+    ConfigBuilder("spark.storage.fallbackStorage.num.threads.for.shuffle.read")
```

Review Comment: Every `.` introduces a new namespace in the Apache Spark configuration namespace scheme. For example, this line introduces four new namespaces. Please avoid using `.`:
- spark.storage.fallbackStorage.num.*
- spark.storage.fallbackStorage.num.threads.*
- spark.storage.fallbackStorage.num.threads.for.*
- spark.storage.fallbackStorage.num.threads.for.shuffle.*
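For illustration, a flatter key following that guidance might look like the sketch below; the exact name, doc text, and default are assumptions, not part of the PR.

```scala
// Hypothetical flatter key: a single camelCase leaf, no extra namespaces.
private[spark] val STORAGE_FALLBACK_STORAGE_SHUFFLE_READ_NUM_THREADS =
  ConfigBuilder("spark.storage.fallbackStorage.shuffleReadNumThreads")
    .doc("Number of threads used to read shuffle blocks from the fallback storage. " +
      "0 disables the dedicated thread pool.")
    .version("4.0.0")
    .intConf
    .createWithDefault(0)
```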
Re: [PR] [SPARK-47496][SQL] Java SPI Support for dynamic JDBC dialect registering [spark]
yaooqinn commented on PR #45626:
URL: https://github.com/apache/spark/pull/45626#issuecomment-2011261846

Thank you @dongjoon-hyun. I addressed the comments.
Re: [PR] [SPARK-47496][SQL] Java SPI Support for dynamic JDBC dialect registering [spark]
yaooqinn commented on code in PR #45626:
URL: https://github.com/apache/spark/pull/45626#discussion_r1533274803

## sql/core/src/main/scala/org/apache/spark/sql/jdbc/OracleDialect.scala:

```diff
@@ -230,3 +224,11 @@ private case object OracleDialect extends JdbcDialect {

   override def supportsOffset: Boolean = true
 }
+
+private[jdbc] object OracleDialect {
```

Review Comment: This is only a place for containing some const ints, which represent some variants from Oracle JDBC extensions. Thus, `object` is enough.
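As a rough sketch of that layout; the constant names and values below are illustrative assumptions, not copied from the PR:

```scala
// A plain object is enough to namespace driver-specific type codes.
private[jdbc] object OracleDialect {
  // Type codes from Oracle's JDBC extensions (illustrative values).
  final val BINARY_FLOAT = 100
  final val BINARY_DOUBLE = 101
  final val TIMESTAMP_TZ = -101
}
```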
Re: [PR] [SPARK-47141][CORE] Support enabling migration of shuffle data directly to external storage using config parameter. [spark]
dongjoon-hyun commented on code in PR #45228:
URL: https://github.com/apache/spark/pull/45228#discussion_r1533274587

## core/src/test/scala/org/apache/spark/storage/ShuffleBlockFetcherIteratorSuite.scala:

```diff
@@ -447,7 +462,30 @@ class ShuffleBlockFetcherIteratorSuite extends SparkFunSuite with PrivateMethodT
     assert(!iterator.hasNext)
   }

-  test("fetch continuous blocks in batch successful 3 local + 4 host local + 2 remote reads") {
+  def createShuffleFile(shuffleId: Int, mapId: Int, reducerId: Int, conf: SparkConf): Unit = {
+    var name = ShuffleIndexBlockId(shuffleId, mapId, reducerId).name
+    var hash = JavaUtils.nonNegativeHash(name)
+    val hadoopConf = SparkHadoopUtil.get.newConfiguration(conf)
+    val appId = conf.getAppId
+    val path: String = conf.get(STORAGE_DECOMMISSION_FALLBACK_STORAGE_PATH).get
+    val indexFile = new Path(path, s"$appId/$shuffleId/$hash/$name")
+    name = ShuffleDataBlockId(shuffleId, mapId, reducerId).name
+    hash = JavaUtils.nonNegativeHash(name)
+    val dataFile = new Path(path, s"$appId/$shuffleId/$hash/$name")
+    val fallbackFileSystem = org.apache.hadoop.fs.FileSystem.get(indexFile.toUri, hadoopConf)
+    val indexOut = fallbackFileSystem.create(indexFile)
+    indexOut.writeLong(0L) // offset
+    indexOut.writeLong(10L) // next offset
+    indexOut.writeLong(10L) // next offset
+    indexOut.writeLong(10L) // next offset
+    val dataOut = fallbackFileSystem.create(dataFile)
+    dataOut.writeBytes("some data to write")
+    indexOut.close()
+    dataOut.close()
+  }
+
+  test("fetch continuous blocks in batch successful 3 local + 4 host local + 2 remote reads " +
+    "+ 2 read from external storage") {
```

Review Comment: This PR seems to introduce a test failure (or flakiness) at this test case. Could you take a look at the CI failure?
- https://github.com/maheshk114/spark/actions/runs/8369424178/job/22915159186

```
[info] ShuffleBlockFetcherIteratorSuite:
...
[info] - fetch continuous blocks in batch successful 3 local + 4 host local + 2 remote reads + 2 read from external storage *** FAILED *** (107 milliseconds)
[info]   org.mockito.exceptions.verification.TooFewActualInvocations:
[info] blockManager.getLocalBlockData();
[info] Wanted 2 times:
[info] -> at org.apache.spark.storage.BlockManager.getLocalBlockData(BlockManager.scala:721)
[info] But was 1 time:
[info] -> at org.apache.spark.storage.ShuffleBlockFetcherIterator.fetchLocalBlocks(ShuffleBlockFetcherIterator.scala:592)
```
Re: [PR] [SPARK-47141][CORE] Support enabling migration of shuffle data directly to external storage using config parameter. [spark]
dongjoon-hyun commented on PR #45228:
URL: https://github.com/apache/spark/pull/45228#issuecomment-2011254687

Thank you for rebasing, @maheshk114.
Re: [PR] [Don't Review] Only for list python packages for branch-3.4 [spark]
dongjoon-hyun commented on PR #45600:
URL: https://github.com/apache/spark/pull/45600#issuecomment-2011253985

Thank you for the investigation. Yeah, it's really weird flakiness.
Re: [PR] [SPARK-47086][BUILD][CORE][SQL][UI] Migrate from Jetty 11 to Jetty 12 [spark]
dongjoon-hyun commented on PR #45500:
URL: https://github.com/apache/spark/pull/45500#issuecomment-2011251140

Thank you for all of your contributions here, @HiuKwok!
Re: [PR] [SPARK-47482] Add HiveDialect to sql module [spark]
dongjoon-hyun commented on code in PR #45609:
URL: https://github.com/apache/spark/pull/45609#discussion_r1533265715

## sql/core/src/main/scala/org/apache/spark/sql/jdbc/HiveDialect.scala:

```diff
@@ -0,0 +1,32 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.jdbc
+
+import java.util.Locale
+
+import org.apache.spark.sql.catalyst.SQLConfHelper
+
+private case object HiveDialect extends JdbcDialect with SQLConfHelper {
+
+  override def canHandle(url : String): Boolean =
+    url.toLowerCase(Locale.ROOT).startsWith("jdbc:hive2")
```

Review Comment:

> Actually, it's not. I used sbin/start-thriftserver.sh in the production environment.

If you are trying to use this for Spark Thrift Server, this should be `SparkDialect` in the Spark community. However, in that case, it will look very weird because Apache Spark **needs** a dialect to access itself. That's why we don't want to add any `SparkDialect` or `HiveDialect`.
Re: [PR] [SPARK-47256][SQL] Assign names to error classes _LEGACY_ERROR_TEMP_102[4-7] [spark]
MaxGekk commented on PR #45622:
URL: https://github.com/apache/spark/pull/45622#issuecomment-2011239760

@davidm-db Could you trigger GitHub Actions w/ tests?
Re: [PR] [SPARK-46654][SQL][PYTHON] Make `to_csv` explicitly indicate that it does not support some types of data [spark]
cloud-fan commented on PR #44665:
URL: https://github.com/apache/spark/pull/44665#issuecomment-2011239475

It's OK to not support round-trip; array/struct/map are not part of the CSV standard either. I think the principle is to avoid breaking changes if possible. Since `to_csv` already supports array/struct/map, let's keep it but fix the inconsistency issue.
Re: [PR] [SPARK-47256][SQL] Assign names to error classes _LEGACY_ERROR_TEMP_102[4-7] [spark]
MaxGekk commented on code in PR #45622:
URL: https://github.com/apache/spark/pull/45622#discussion_r1533260149

## sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/analysis/AnalysisErrorSuite.scala:

```diff
@@ -295,10 +295,50 @@ class AnalysisErrorSuite extends AnalysisTest with DataTypeErrorsBase {
         "aggregate(array(1, 2, 3), 0, (acc, x) -> acc + x) FILTER (WHERE c > 1)", 7, 76)))
   }

-  errorTest(
-    "non-deterministic filter predicate in aggregate functions",
-    CatalystSqlParser.parsePlan("SELECT count(a) FILTER (WHERE rand(int(c)) > 1) FROM TaBlE2"),
-    "FILTER expression is non-deterministic, it cannot be used in aggregate functions" :: Nil)
+  test("SPARK-47256: non deterministic FILTER expression in an aggregate function") {
+    val plan =
+      CatalystSqlParser.parsePlan("SELECT count(a) FILTER (WHERE rand(int(c)) > 1) FROM TaBlE2")
```

Review Comment: @davidm-db Could you reproduce the errors using public APIs, please? If that is not possible, we should consider converting the error to internal errors (see `SparkException.internalError()`).

## common/utils/src/main/resources/error/error-classes.json:

```diff
@@ -4627,6 +4627,26 @@
     ],
     "sqlState" : "42KDF"
   },
+  "AGGREGATE_FILTER_EXPRESSION_NON_DETERMINISTIC" : {
```

Review Comment: Please create an error class w/ a common error message, and its sub-classes w/ specific error messages.
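For reference, a minimal sketch of the `SparkException.internalError()` route mentioned above; the helper name and message are illustrative assumptions.

```scala
import org.apache.spark.SparkException

// If the invalid state cannot be reached through public APIs, raise an
// internal error rather than defining a user-facing error class.
def failNonDeterministicFilter(filterSql: String): Nothing = {
  throw SparkException.internalError(
    s"Non-deterministic FILTER expression should have been rejected earlier: $filterSql")
}
```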
Re: [PR] [SPARK-47496][SQL] Java SPI Support for dynamic JDBC dialect registering [spark]
dongjoon-hyun commented on code in PR #45626:
URL: https://github.com/apache/spark/pull/45626#discussion_r1533254888

## sql/core/src/test/scala/org/apache/spark/sql/jdbc/JDBCSuite.scala:

```diff
@@ -2154,4 +2155,13 @@ class JDBCSuite extends QueryTest with SharedSparkSession {
     val expected = Map("percentile_approx_val" -> 49)
     assert(namedObservation.get === expected)
   }
+
+
```

Review Comment: extra empty line?
Re: [PR] [SPARK-46920][YARN] Improve executor exit error message on YARN [spark]
LuciferYang commented on PR #44951:
URL: https://github.com/apache/spark/pull/44951#issuecomment-2011228756

Merged into master for Spark 4.0. Thanks @pan3793 @srowen @tgravescs @mridulm
Re: [PR] [SPARK-47496][SQL] Java SPI Support for dynamic JDBC dialect registering [spark]
dongjoon-hyun commented on code in PR #45626:
URL: https://github.com/apache/spark/pull/45626#discussion_r1533254166

## sql/core/src/main/scala/org/apache/spark/sql/jdbc/OracleDialect.scala:

```diff
@@ -230,3 +224,11 @@ private case object OracleDialect extends JdbcDialect {

   override def supportsOffset: Boolean = true
 }
+
+private[jdbc] object OracleDialect {
```

Review Comment: Ur, actually, `case object OracleDialect` has more features (serializable, hashCode, toString) than `object OracleDialect`. Do we need to shrink features like this?
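A small REPL sketch of the difference (plain Scala, independent of the PR):

```scala
// case objects automatically mix in Product and java.io.Serializable,
// and get a stable toString; plain objects get none of these by default.
case object A
object B

A.isInstanceOf[Serializable] // true
B.isInstanceOf[Serializable] // false, unless declared: object C extends Serializable
A.toString                   // "A" (case-object toString)
```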
Re: [PR] [SPARK-46920][YARN] Improve executor exit error message on YARN [spark]
LuciferYang closed pull request #44951: [SPARK-46920][YARN] Improve executor exit error message on YARN
URL: https://github.com/apache/spark/pull/44951
[PR] [SPARK-41888][PYTHON][CONNECT][TESTS] Enable doctest for `DataFrame.observe` [spark]
zhengruifeng opened a new pull request, #45627:
URL: https://github.com/apache/spark/pull/45627

### What changes were proposed in this pull request?
Enable doctest for `DataFrame.observe`

### Why are the changes needed?
for test coverage

### Does this PR introduce _any_ user-facing change?
no, test-only

### How was this patch tested?
ci

### Was this patch authored or co-authored using generative AI tooling?
no
Re: [PR] [MINOR][CORE] Fix a comment typo `slf4j-to-jul` to `jul-to-slf4j` [spark]
dongjoon-hyun closed pull request #45625: [MINOR][CORE] Fix a comment typo `slf4j-to-jul` to `jul-to-slf4j`
URL: https://github.com/apache/spark/pull/45625
Re: [PR] [MINOR][CORE] Fix a comment typo `slf4j-to-jul` to `jul-to-slf4j` [spark]
dongjoon-hyun commented on PR #45625:
URL: https://github.com/apache/spark/pull/45625#issuecomment-2011222356

Thank you, @yaooqinn. Merged to master.
Re: [PR] [MINOR][CORE] Fix a comment typo `slf4j-to-jul` to `jul-to-slf4j` [spark]
yaooqinn commented on PR #45625:
URL: https://github.com/apache/spark/pull/45625#issuecomment-2011211719

LGTM
[PR] [SPARK-47496][SQL] Java SPI Support for dynamic JDBC dialect registering [spark]
yaooqinn opened a new pull request, #45626:
URL: https://github.com/apache/spark/pull/45626

### What changes were proposed in this pull request?
This PR brings Java ServiceProvider Interface (SPI) support for dynamic JDBC dialect registering. A custom JDBC dialect can be registered easily instead of calling `JdbcDialects.registerDialect` manually.

### Why are the changes needed?
For pure SQL and other non-Java API users, it's difficult to register a custom JDBC dialect. With this patch, this can be done whenever the jar containing the dialect class is visible to the Spark classloader.

### Does this PR introduce _any_ user-facing change?
Yes, but mostly for third-party developers

### How was this patch tested?
new tests

### Was this patch authored or co-authored using generative AI tooling?
no
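A hedged sketch of what SPI registration typically looks like for this kind of change; the service-file name below assumes the service interface is `org.apache.spark.sql.jdbc.JdbcDialect`, which should be checked against the final patch.

```scala
package com.example

import org.apache.spark.sql.jdbc.JdbcDialect

// A third-party dialect discovered via java.util.ServiceLoader instead of
// an explicit JdbcDialects.registerDialect(...) call.
class MyDialect extends JdbcDialect {
  override def canHandle(url: String): Boolean = url.startsWith("jdbc:mydb")
}
```

Plus a provider file on the classpath, `META-INF/services/org.apache.spark.sql.jdbc.JdbcDialect`, containing the single line `com.example.MyDialect`.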
Re: [PR] [SPARK-47086][BUILD][CORE][SQL][UI] Migrate from Jetty 11 to Jetty 12 [spark]
HiuKwok commented on PR #45500:
URL: https://github.com/apache/spark/pull/45500#issuecomment-2011184474

Left a comment on the JIRA ticket; I'm closing this PR for now.
Re: [PR] [SPARK-47086][BUILD][CORE][SQL][UI] Migrate from Jetty 11 to Jetty 12 [spark]
HiuKwok closed pull request #45500: [SPARK-47086][BUILD][CORE][SQL][UI] Migrate from Jetty 11 to Jetty 12
URL: https://github.com/apache/spark/pull/45500
Re: [PR] [SPARK-46654][SQL][PYTHON] Make `to_csv` explicitly indicate that it does not support some types of data [spark]
LuciferYang commented on PR #44665:
URL: https://github.com/apache/spark/pull/44665#issuecomment-2011183252

I don't oppose this resolution, but if `to_csv` can handle complex data structures, what would our subsequent plans be? Would `from_csv` also need to support complex data structures to make these paired functions more self-consistent? In the long term, do we need to enhance the read and write capabilities for CSV data sources to support complex data structures?
Re: [PR] [SPARK-47141] [Core]: Support enabling migration of shuffle data directly to external storage using config parameter. [spark]
maheshk114 commented on PR #45228:
URL: https://github.com/apache/spark/pull/45228#issuecomment-2011171597

> To @maheshk114 , could you rebase this PR to the `master` branch once more to test on top of Apache Hadoop 3.4.0? The following is merged to Apache Spark `master` branch.
>
> * [[SPARK-45393][BUILD] Upgrade Hadoop to 3.4.0 #45583](https://github.com/apache/spark/pull/45583)

done
Re: [PR] [SPARK-46654][SQL][PYTHON] Make `to_csv` explicitly indicate that it does not support some types of data [spark]
panbingkun commented on PR #44665:
URL: https://github.com/apache/spark/pull/44665#issuecomment-2011171313

Use `ToPrettyString.eval` to generate pretty strings for these types in `to_csv`: I have also tried it in this PR. Below:
https://github.com/apache/spark/pull/44665/commits/22a7afb6951b9274cc428f71e675df233de74a8a
(screenshot: https://github.com/apache/spark/assets/15246973/961242a9-d26d-4afe-9b01-ad7ea6ab25d9)

(PS: Maybe some details about using `ToPrettyString.eval` still need to be confirmed.)

If the final discussion result is to fix their consistency, this approach should also be possible. WDYT @MaxGekk @HyukjinKwon @LuciferYang @xinrong-meng @srowen ?
[PR] [MINOR][CORE] Fix a typo `slf4j-to-jul` to `jul-to-slf4j` [spark]
dongjoon-hyun opened a new pull request, #45625:
URL: https://github.com/apache/spark/pull/45625

### What changes were proposed in this pull request?

### Why are the changes needed?

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

### Was this patch authored or co-authored using generative AI tooling?
Re: [PR] [SPARK-47486][CONNECT] Remove unused private `ArrowDeserializers.getString` method [spark]
LuciferYang commented on PR #45610:
URL: https://github.com/apache/spark/pull/45610#issuecomment-2011168276

Thanks @dongjoon-hyun ~
Re: [PR] [SPARK-47141] [Core]: Support enabling migration of shuffle data directly to external storage using config parameter. [spark]
maheshk114 commented on PR #45228:
URL: https://github.com/apache/spark/pull/45228#issuecomment-2011167980

> To @maheshk114 , I have a big concern on the AS-IS PR title because it is misleading the users by ignoring the existing Apache Spark 3.2 features.
>
> > [SPARK-47141] [Core]: Support shuffle migration to external storage.
>
> At least, as the author of the following Apache Spark 3.2 patches, I believe Apache Spark supports "Shuffle Migration to External Storage" already. WDYT, @maheshk114 ? Could you revise the PR title to narrow it down to this PR's exact contribution?
>
> * [[SPARK-33545][CORE] Support Fallback Storage during Worker decommission #30492](https://github.com/apache/spark/pull/30492)
> * [[SPARK-34142][CORE] Support Fallback Storage Cleanup during stopping SparkContext #31215](https://github.com/apache/spark/pull/31215)

Thanks a lot @dongjoon-hyun for looking into the PR. Yes, it makes sense. I have updated the PR title to reflect the exact changes done in this PR.
Re: [PR] [SPARK-47007][SQL][PYTHON][R][CONNECT] Add the `map_sort` function [spark]
dongjoon-hyun commented on PR #45069:
URL: https://github.com/apache/spark/pull/45069#issuecomment-2011161784

+1 for Wenchen's decision. Thank you for reverting.
Re: [PR] [SPARK-46654][SQL][PYTHON] Make `to_csv` explicitly indicate that it does not support some types of data [spark]
cloud-fan commented on PR #44665:
URL: https://github.com/apache/spark/pull/44665#issuecomment-2011158182

I'd prefer to fix the inconsistency. Can we use `ToPrettyString.eval` to generate pretty strings for these types in `to_csv`?
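A rough sketch of that idea; the constructor and `eval` details are assumptions and should be checked against the Catalyst source.

```scala
import org.apache.spark.sql.catalyst.expressions.{Literal, ToPrettyString}

// Render a complex value the way Dataset.show() does, e.g. "[1, 2, 3]".
val arrayLit = Literal.create(Seq(1, 2, 3))
val pretty = ToPrettyString(arrayLit, timeZoneId = Some("UTC")).eval()
```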
Re: [PR] [SPARK-47487][SQL] Simplify code in AnsiTypeCoercion [spark]
gengliangwang commented on code in PR #45612:
URL: https://github.com/apache/spark/pull/45612#discussion_r1533192344

## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/AnsiTypeCoercion.scala:

```diff
@@ -180,56 +180,27 @@ object AnsiTypeCoercion extends TypeCoercionBase {
       // cast the input to decimal.
       case (n: NumericType, DecimalType) => Some(DecimalType.forType(n))

-      // Cast null type (usually from null literals) into target types
-      // By default, the result type is `target.defaultConcreteType`. When the target type is
-      // `TypeCollection`, there is another branch to find the "closet convertible data type" below.
-      case (NullType, target) if !target.isInstanceOf[TypeCollection] =>
-        Some(target.defaultConcreteType)
-
       // If a function expects a StringType, no StringType instance should be implicitly cast to
       // StringType with a collation that's not accepted (aka. lockdown unsupported collations).
       case (_: StringType, StringType) => None
       case (_: StringType, _: StringTypeCollated) => None

-      // This type coercion system will allow implicit converting String type as other
-      // primitive types, in case of breaking too many existing Spark SQL queries.
-      case (StringType, a: AtomicType) =>
-        Some(a)
-
-      // If the target type is any Numeric type, convert the String type as Double type.
-      case (StringType, NumericType) =>
-        Some(DoubleType)
-
-      // If the target type is any Decimal type, convert the String type as the default
-      // Decimal type.
-      case (StringType, DecimalType) =>
-        Some(DecimalType.SYSTEM_DEFAULT)
-
-      // If the target type is any timestamp type, convert the String type as the default
-      // Timestamp type.
-      case (StringType, AnyTimestampType) =>
-        Some(AnyTimestampType.defaultConcreteType)
-
-      case (DateType, AnyTimestampType) =>
-        Some(AnyTimestampType.defaultConcreteType)
-
-      case (_, target: DataType) =>
-        if (Cast.canANSIStoreAssign(inType, target)) {
-          Some(target)
+      // Ideally the implicit cast rule should be the same as `Cast.canANSIStoreAssign` so that it's
+      // consistent with table insertion. To avoid breaking too many existing Spark SQL queries,
+      // we make the system to allow implicitly converting String type as other primitive types.
+      case (StringType, a @ (_: AtomicType | NumericType | DecimalType | AnyTimestampType)) =>
+        Some(a.defaultConcreteType)
+
+      // When the target type is `TypeCollection`, there is another branch to find the
+      // "closet convertible data type" below.
+      case (_, target) if !target.isInstanceOf[TypeCollection] =>
+        val concreteType = target.defaultConcreteType
```

Review Comment: Making it `final` also sounds good.
Re: [PR] [SPARK-47495][CORE] Fix primary resource jar added to spark.jars twice under k8s cluster mode [spark]
leletan commented on PR #45607:
URL: https://github.com/apache/spark/pull/45607#issuecomment-2011103330

@dongjoon-hyun Updated the PR and associated it with the new JIRA https://issues.apache.org/jira/browse/SPARK-47495. Please let me know if this looks good to you. Thanks!
Re: [PR] [SPARK-47487][SQL] Simplify code in AnsiTypeCoercion [spark]
cloud-fan commented on code in PR #45612:
URL: https://github.com/apache/spark/pull/45612#discussion_r1533184546

## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/AnsiTypeCoercion.scala:

```diff
@@ -180,56 +180,27 @@ object AnsiTypeCoercion extends TypeCoercionBase {
       // cast the input to decimal.
      case (n: NumericType, DecimalType) => Some(DecimalType.forType(n))

-      // Cast null type (usually from null literals) into target types
-      // By default, the result type is `target.defaultConcreteType`. When the target type is
-      // `TypeCollection`, there is another branch to find the "closet convertible data type" below.
-      case (NullType, target) if !target.isInstanceOf[TypeCollection] =>
-        Some(target.defaultConcreteType)
-
       // If a function expects a StringType, no StringType instance should be implicitly cast to
       // StringType with a collation that's not accepted (aka. lockdown unsupported collations).
       case (_: StringType, StringType) => None
       case (_: StringType, _: StringTypeCollated) => None

-      // This type coercion system will allow implicit converting String type as other
-      // primitive types, in case of breaking too many existing Spark SQL queries.
-      case (StringType, a: AtomicType) =>
-        Some(a)
-
-      // If the target type is any Numeric type, convert the String type as Double type.
-      case (StringType, NumericType) =>
-        Some(DoubleType)
-
-      // If the target type is any Decimal type, convert the String type as the default
-      // Decimal type.
-      case (StringType, DecimalType) =>
-        Some(DecimalType.SYSTEM_DEFAULT)
-
-      // If the target type is any timestamp type, convert the String type as the default
-      // Timestamp type.
-      case (StringType, AnyTimestampType) =>
-        Some(AnyTimestampType.defaultConcreteType)
-
-      case (DateType, AnyTimestampType) =>
-        Some(AnyTimestampType.defaultConcreteType)
-
-      case (_, target: DataType) =>
-        if (Cast.canANSIStoreAssign(inType, target)) {
-          Some(target)
+      // Ideally the implicit cast rule should be the same as `Cast.canANSIStoreAssign` so that it's
+      // consistent with table insertion. To avoid breaking too many existing Spark SQL queries,
+      // we make the system to allow implicitly converting String type as other primitive types.
+      case (StringType, a @ (_: AtomicType | NumericType | DecimalType | AnyTimestampType)) =>
+        Some(a.defaultConcreteType)
+
+      // When the target type is `TypeCollection`, there is another branch to find the
+      // "closet convertible data type" below.
+      case (_, target) if !target.isInstanceOf[TypeCollection] =>
+        val concreteType = target.defaultConcreteType
```

Review Comment: It's the default implementation in `DataType`:

```scala
override private[sql] def defaultConcreteType: DataType = this
```

Maybe I should just make it `final`?
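A self-contained analogue of that option; the names are simplified and this is not the actual Catalyst hierarchy.

```scala
// Sketch: the base trait leaves the method abstract, the middle class pins
// a final default, and concrete leaves can no longer override it.
trait AbstractDataType {
  def defaultConcreteType: AbstractDataType
}

abstract class DataType extends AbstractDataType {
  final override def defaultConcreteType: DataType = this
}

case object ExampleType extends DataType // inherits the pinned default
```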
Re: [PR] [SPARK-47007][SQL][PYTHON][R][CONNECT] Add the `map_sort` function [spark]
cloud-fan commented on PR #45069:
URL: https://github.com/apache/spark/pull/45069#issuecomment-2011092172

I can't find it in other systems, and it does not make sense as map elements are order-less. I'm reverting it, please re-submit it without exposing the function publicly.
Re: [PR] [SPARK-47495][CORE] Fix primary resource jar added to spark.jars twice under k8s cluster mode [spark]
leletan commented on code in PR #45607:
URL: https://github.com/apache/spark/pull/45607#discussion_r1533169526

## core/src/main/scala/org/apache/spark/internal/config/package.scala:

```diff
@@ -1458,6 +1458,18 @@ package object config {
       .doubleConf
       .createWithDefault(1.5)

+  private[spark] val KUBERNETES_AVOID_JAR_DOWNLOAD_SCHEMES =
+    ConfigBuilder("spark.kubernetes.jars.avoidDownloadSchemes")
+      .doc("Comma-separated list of schemes for which jars will not be downloaded to the " +
+        "driver local disk prior to be distributed to executors, only for kubernetes deployment. " +
+        "For use in cases when the jars are big and executor counts are high, " +
+        "concurrent download causes network saturation and timeouts. " +
+        "Wildcard '*' is denoted to not downloading jars for any the schemes.")
+      .version("2.3.0")
```

Review Comment: Will fix and move this to another JIRA & PR.
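For context, a hedged usage sketch of the config under review; the scheme list is illustrative.

```scala
import org.apache.spark.SparkConf

// Skip the driver-side jar download for these URI schemes ("*" would match all).
val conf = new SparkConf()
  .set("spark.kubernetes.jars.avoidDownloadSchemes", "s3a,hdfs")
```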
[PR] [Don't Review] Only for address pyspark-errors for branch-3.4 [spark]
panbingkun opened a new pull request, #45624:
URL: https://github.com/apache/spark/pull/45624

### What changes were proposed in this pull request?

### Why are the changes needed?

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

### Was this patch authored or co-authored using generative AI tooling?
Re: [PR] [Don't Review] Only for list python packages for branch-3.4 [spark]
panbingkun commented on PR #45600: URL: https://github.com/apache/spark/pull/45600#issuecomment-2011079125 For the record: 1. success ``` (base) panbingkun:~/Developer/spark/spark-trunk$pip install --upgrade --no-cache-dir --force-reinstall ./python/dist/pyspark-3.4.3.dev0.tar.gz Processing ./python/dist/pyspark-3.4.3.dev0.tar.gz Preparing metadata (setup.py) ... done Collecting py4j==0.10.9.7 (from pyspark==3.4.3.dev0) Downloading py4j-0.10.9.7-py2.py3-none-any.whl.metadata (1.5 kB) Downloading py4j-0.10.9.7-py2.py3-none-any.whl (200 kB) 200.5/200.5 kB 134.2 kB/s eta 0:00:00 Building wheels for collected packages: pyspark Building wheel for pyspark (setup.py) ... done Created wheel for pyspark: filename=pyspark-3.4.3.dev0-py2.py3-none-any.whl size=626013326 sha256=e423146045a43d8c90ac471489dd5b6fb3702def94274f6db968961c8981731f Stored in directory: /private/var/folders/kl/p0fxfzp53g5cc38qjxqwk_cmgp/T/pip-ephem-wheel-cache-73zlf_ix/wheels/0d/86/cd/947463e4be802583aec12b9dd00109a2a65ea91253d3ffd32d Successfully built pyspark Installing collected packages: py4j, pyspark Attempting uninstall: pyspark Found existing installation: pyspark 4.0.0.dev0 Uninstalling pyspark-4.0.0.dev0: Successfully uninstalled pyspark-4.0.0.dev0 Successfully installed py4j-0.10.9.7 pyspark-3.4.3.dev0 ``` 2. fail ``` Writing pyspark-3.4.3.dev0/setup.cfg creating dist Creating tar archive removing 'pyspark-3.4.3.dev0' (and everything under it) Installing dist into virtual env Processing ./python/dist/pyspark-3.4.3.dev0.tar.gz Preparing metadata (setup.py) ... done Collecting py4j==0.10.9.7 (from pyspark==3.4.3.dev0) Downloading py4j-0.10.9.7-py2.py3-none-any.whl.metadata (1.5 kB) Downloading py4j-0.10.9.7-py2.py3-none-any.whl (200 kB) 0.0/200.5 kB ? eta -:--:-- 200.5/200.5 kB 9.1 MB/s eta 0:00:00 Building wheels for collected packages: pyspark ``` -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-47482] Add HiveDialect to sql module [spark]
xleoken commented on PR #45609: URL: https://github.com/apache/spark/pull/45609#issuecomment-2011074139 > Thanks for pinging me @dongjoon-hyun. > > I know that it's technically feasible. But we have a much more efficient and direct way to access hive tables. I don't see the necessity for adding it as a built-in dialect. Welcome @yaooqinn, can you explain in detail? The key to this patch is to override the `quoteIdentifier` method. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
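For reference, the kind of override being discussed would look roughly like the sketch below (assuming backtick quoting, which matches HiveQL identifier syntax; this is illustrative, not the exact PR code, and the object name is made up).

```
import java.util.Locale
import org.apache.spark.sql.jdbc.JdbcDialect

case object HiveDialectSketch extends JdbcDialect {
  // claim JDBC URLs that target HiveServer2
  override def canHandle(url: String): Boolean =
    url.toLowerCase(Locale.ROOT).startsWith("jdbc:hive2")

  // quote identifiers with backticks instead of the default double quotes
  override def quoteIdentifier(colName: String): String =
    s"`$colName`"
}
```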
Re: [PR] [SPARK-47488][core][k8s] Driver stuck when thread pool is not shut down [spark]
yaooqinn commented on PR #45613: URL: https://github.com/apache/spark/pull/45613#issuecomment-2011072653 I checked your example code in the JIRA and it looks weird. Ideally, you should stop the non-daemon threads for the JVM to exit, and stop the SparkContext to terminate the app. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
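A sketch of the pattern being described (names are illustrative): a thread pool's non-daemon threads keep the JVM alive after `main` returns, so the pool must be shut down, and the SparkContext stopped, for the app to terminate.

```
import java.util.concurrent.Executors

object Example {
  def main(args: Array[String]): Unit = {
    val pool = Executors.newFixedThreadPool(4) // threads are non-daemon by default
    try {
      // ... submit work to the pool ...
    } finally {
      pool.shutdown() // without this, the JVM may never exit
      // likewise, call sc.stop() on the SparkContext to terminate the app
    }
  }
}
```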
Re: [PR] [SPARK-47462][SQL] Align mappings of other unsigned numeric types with TINYINT in MySQLDialect [spark]
yaooqinn commented on PR #45588: URL: https://github.com/apache/spark/pull/45588#issuecomment-2011059165 Thank you, @dongjoon-hyun and @cloud-fan. I will send follow-ups for the migration guides. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-47474][CORE] Revert SPARK-47461 and add some comments [spark]
LuciferYang commented on PR #45602: URL: https://github.com/apache/spark/pull/45602#issuecomment-2011052480 > BTW, @LuciferYang , as you requested in the above already, I guess we need to wait for @mridulm and @tgravescs 's review and approval. I'll leave this to them. OK -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-47482] Add HiveDialect to sql module [spark]
yaooqinn commented on PR #45609: URL: https://github.com/apache/spark/pull/45609#issuecomment-2011049532 Thanks for pinging me @dongjoon-hyun. I know that it's technically feasible. But we have a much more efficient and direct way to access hive tables. I don't see the necessity for adding it as a built-in dialect. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-47493][SQL] Disable spark.sql.parquet.inferTimestampNTZ.enabled by default [spark]
RussellSpitzer commented on PR #45621: URL: https://github.com/apache/spark/pull/45621#issuecomment-2011026947 Yes, those parameters only change the Iceberg schema. What you are seeing is the note that the Iceberg Schema cannot be read into the Spark Type without a special parameter. Changing the default for the writer still doesn't affect us. We had that issue previously because Trino and other engines have no problem creating a table with that type. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-47493][SQL] Disable spark.sql.parquet.inferTimestampNTZ.enabled by default [spark]
viirya commented on PR #45621: URL: https://github.com/apache/spark/pull/45621#issuecomment-2011021425 > @viirya Thanks, could you provide more details about the Iceberg workaround? > > As for Delta, users have to either disable `spark.sql.parquet.inferTimestampNTZ.enabled`, or enable the TimestampNTZ feature via > > ``` > ALTER TABLE table_name SET TBLPROPERTIES ('delta.feature.timestampNtz' = 'supported') > ``` > > After enabling the feature, the table won't be readable by older Delta versions. So disabling `spark.sql.parquet.inferTimestampNTZ.enabled` is a better solution in such a case. Not sure if they are directly related, but I remember there are configs like ``` .config("spark.sql.iceberg.use-timestamp-without-timezone-in-new-tables", "true") .config("spark.sql.iceberg.handle-timestamp-without-timezone", "true") ``` -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-47493][SQL] Disable spark.sql.parquet.inferTimestampNTZ.enabled by default [spark]
gengliangwang commented on PR #45621: URL: https://github.com/apache/spark/pull/45621#issuecomment-2011010486 @RussellSpitzer The same pattern can happen on iceberg if there are multiple Spark versions in the pipeline On Spark 3.3+ ``` // The following can be Replace table as select as well spark.read.parquet(...).write.mode("overwrite").format("iceberg").saveAsTable("iceberg_table") ``` On Spark 3.2 and earlier: ``` > select * from iceberg_ts java.lang.IllegalArgumentException: Cannot handle timestamp without timezone fields in Spark. Spark does not natively support this type but if you would like to handle all timestamps as timestamp with timezone set 'spark.sql.iceberg.handle-timestamp-without-timezone' to true. This will not change the underlying values stored but will change their displayed values in Spark. For more information please see https://docs.databricks.com/spark/latest/dataframes-datasets/dates-timestamps.html#ansi-sql-and-spark-sql-timestamps ``` -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-47372][SS] Add support for range scan based key state encoder for use with state store provider [spark]
neilramaswamy commented on code in PR #45503: URL: https://github.com/apache/spark/pull/45503#discussion_r1533103809 ## sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/state/RocksDBStateEncoder.scala: ## @@ -39,13 +41,21 @@ sealed trait RocksDBValueStateEncoder { } object RocksDBStateEncoder { - def getKeyEncoder( - keySchema: StructType, - numColsPrefixKey: Int): RocksDBKeyStateEncoder = { -if (numColsPrefixKey > 0) { - new PrefixKeyScanStateEncoder(keySchema, numColsPrefixKey) -} else { - new NoPrefixKeyStateEncoder(keySchema) + def getKeyEncoder(keyStateEncoderSpec: KeyStateEncoderSpec): RocksDBKeyStateEncoder = { +// Return the key state encoder based on the requested type +keyStateEncoderSpec match { + case NoPrefixKeyStateEncoderSpec(keySchema) => +new NoPrefixKeyStateEncoder(keySchema) + + case PrefixKeyScanStateEncoderSpec(keySchema, numColsPrefixKey) => +new PrefixKeyScanStateEncoder(keySchema, numColsPrefixKey) + + case RangeKeyScanStateEncoderSpec(keySchema, numColsPrefixKey) => Review Comment: `numOrderingCols`? ## sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/state/RocksDBStateEncoder.scala: ## @@ -192,6 +199,233 @@ class PrefixKeyScanStateEncoder( override def supportPrefixKeyScan: Boolean = true } +/** + * RocksDB Key Encoder for UnsafeRow that supports range scan for fixed size fields Review Comment: Nowhere in this do we mention why we cannot range scan non-fixed size fields (I understand, but maybe future readers won't). It's not mentioned in the error message either. Can we do that? I also think a quick comment explaining how this actually works would be helpful, i.e. "To encode a row `r` for a range scan, we first project the first `numOrderingCols` needed for the range scan into an `UnsafeRow`; we then rewrite that `UnsafeRow`'s fields in big endian. Then, for the rest of the fields, we project those into an `UnsafeRow`. We then effectively join these latter two `UnsafeRow`s together, and finally take those bytes to get the resulting row." For example, NoPrefixKeyStateEncoder has a comment like this. ## sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/state/RocksDBStateEncoder.scala: ## @@ -192,6 +199,233 @@ class PrefixKeyScanStateEncoder( override def supportPrefixKeyScan: Boolean = true } +/** + * RocksDB Key Encoder for UnsafeRow that supports range scan for fixed size fields + * Note that for range scan, we have to encode the ordering columns using BIG_ENDIAN + * encoding to allow for scanning keys in sorted order using the byte-wise comparison + * method that RocksDB uses.
+ * + * @param keySchema - schema of the key to be encoded + * @param numColsPrefixKey - number of columns to be used for prefix key Review Comment: nit typo ## sql/core/src/test/scala/org/apache/spark/sql/execution/streaming/state/RocksDBStateStoreSuite.scala: ## @@ -158,14 +161,360 @@ class RocksDBStateStoreSuite extends StateStoreSuiteBase[RocksDBStateStoreProvid } } + testWithColumnFamilies("rocksdb range scan validation - invalid num columns", +TestWithBothChangelogCheckpointingEnabledAndDisabled) { colFamiliesEnabled => +// zero ordering cols +val ex1 = intercept[SparkUnsupportedOperationException] { + tryWithProviderResource(newStoreProvider(keySchemaWithRangeScan, +RangeKeyScanStateEncoderSpec(keySchemaWithRangeScan, 0), +colFamiliesEnabled)) { provider => +provider.getStore(0) + } +} +checkError( + ex1, + errorClass = "STATE_STORE_INCORRECT_NUM_ORDERING_COLS_FOR_RANGE_SCAN", + parameters = Map( +"numOrderingCols" -> "0" + ), + matchPVals = true +) + +// ordering cols greater than schema cols +val ex2 = intercept[SparkUnsupportedOperationException] { + tryWithProviderResource(newStoreProvider(keySchemaWithRangeScan, +RangeKeyScanStateEncoderSpec(keySchemaWithRangeScan, keySchemaWithRangeScan.length + 1), +colFamiliesEnabled)) { provider => +provider.getStore(0) + } +} +checkError( + ex2, + errorClass = "STATE_STORE_INCORRECT_NUM_ORDERING_COLS_FOR_RANGE_SCAN", + parameters = Map( +"numOrderingCols" -> (keySchemaWithRangeScan.length + 1).toString + ), + matchPVals = true +) + } + + testWithColumnFamilies("rocksdb range scan validation - variable sized columns", +TestWithBothChangelogCheckpointingEnabledAndDisabled) { colFamiliesEnabled => +val keySchemaWithVariableSizeCols: StructType = StructType( + Seq(StructField("key1", StringType, false), StructField("key2", StringType, false))) + +val ex = intercept[SparkUnsupportedOperationException] { + tryWithProviderResource(newStoreProvider(keySchemaWithVariableSizeCols, +
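As a small aside on why the BIG_ENDIAN encoding mentioned above enables range scans (an illustration, not the actual encoder): with big-endian byte order, unsigned byte-wise comparison of the encoded keys agrees with numeric ordering, at least for non-negative fixed-width values, and byte-wise comparison is exactly what RocksDB performs.

```
import java.nio.{ByteBuffer, ByteOrder}
import java.util.Arrays

// encode a long in big-endian byte order
def encodeLong(v: Long): Array[Byte] =
  ByteBuffer.allocate(java.lang.Long.BYTES).order(ByteOrder.BIG_ENDIAN).putLong(v).array()

// unsigned byte-wise comparison of the encodings matches 5L < 300L,
// which would not hold with little-endian encoding
assert(Arrays.compareUnsigned(encodeLong(5L), encodeLong(300L)) < 0)
```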
Re: [PR] [SPARK-46654][SQL][PYTHON] Make `to_csv` explicitly indicate that it does not support some types of data [spark]
panbingkun commented on PR #44665: URL: https://github.com/apache/spark/pull/44665#issuecomment-2010991612 - Why is the result displayed through `to_csv` inconsistent in Scala and Python for this case? Because this case is on the `python side`, it ultimately uses `GenericArrayData`, which happens to implement the method `toString`, so `to_csv` displays readable text. https://github.com/apache/spark/blob/11247d804cd370aaeb88736a706c587e7f5c83b3/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/GenericArrayData.scala#L85 However, on the `scala side`, it ultimately uses `UnsafeArrayData`. `Unfortunately`, it does not implement the method `toString` (using the default `Object.toString` method), so the final `to_csv` displays `the address of the object`. - During the implementation of this PR, it could display `non-standard but pretty strings`, as follows: https://github.com/apache/spark/pull/44665/commits/9695e975f3299556e7c268918ecd51be7a03c157 (screenshot: https://github.com/apache/spark/assets/15246973/fd07dc0a-4d61-4663-8631-daff518da278) The `disadvantage` of this is that it `cannot` be `read back` through `from_csv` `at present`. If the final outcome of the discussion is that this is acceptable, it should be easy to bring this feature back. - Another possible compromise solution is to add a configuration (by default, it would `not` support displaying data of type [Array, Map, Struct ...] as `non-standard but pretty strings` through `to_csv`). If the user enables this configuration, we restore the original behavior? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
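A self-contained illustration of the `toString` difference described above (the class names are made up, standing in for `GenericArrayData` vs. `UnsafeArrayData`): only the class that overrides `toString` yields readable text; the other falls back to `Object.toString`'s identity string, which is what leaked into the `to_csv` output on the Scala side.

```
class Readable(xs: Seq[Int]) {
  override def toString: String = xs.mkString("[", ", ", "]")
}
class Opaque(xs: Seq[Int]) // inherits Object.toString

println(new Readable(Seq(1, 2, 3))) // [1, 2, 3]
println(new Opaque(Seq(1, 2, 3)))   // something like Opaque@5e2de80c
```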
Re: [PR] [SPARK-47493][SQL] Disable spark.sql.parquet.inferTimestampNTZ.enabled by default [spark]
RussellSpitzer commented on PR #45621: URL: https://github.com/apache/spark/pull/45621#issuecomment-2010991405 I should add we have custom readers and writers for parquet, so this parameter doesn't affect us. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-47493][SQL] Disable spark.sql.parquet.inferTimestampNTZ.enabled by default [spark]
RussellSpitzer commented on PR #45621: URL: https://github.com/apache/spark/pull/45621#issuecomment-2010990874 I don't think this is an issue for Iceberg; we would always store values the same way and then mark the difference in type in the schema. Readers would then get values based on the table schema and not what was in the file. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-47482] Add HiveDialect to sql module [spark]
xleoken commented on code in PR #45609: URL: https://github.com/apache/spark/pull/45609#discussion_r1533099686 ## sql/core/src/main/scala/org/apache/spark/sql/jdbc/HiveDialect.scala: ## @@ -0,0 +1,32 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.sql.jdbc + +import java.util.Locale + +import org.apache.spark.sql.catalyst.SQLConfHelper + +private case object HiveDialect extends JdbcDialect with SQLConfHelper { + + override def canHandle(url : String): Boolean = +url.toLowerCase(Locale.ROOT).startsWith("jdbc:hive2") Review Comment: hi @dongjoon-hyun, thanks for your review. Your considerations are correct, but this patch is applicable to both `Hive Thrift Server` and `Spark Thrift Server`. > You want to achieve to introduce Hive-specific syntax via this HiveDialect instead of Spark Thrift Server, right? Actually, it's not. I used `sbin/start-thriftserver.sh` in the production environment. > I'm not sure HiveDialect is a valid name in Apache Spark community OK, `HiveDialect` seems better for `jdbc:hive2`. In the future, if we encounter a Hive-specific or Spark SQL-specific syntax issue, we can distinguish between Hive and Spark in specific methods. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-47493][SQL] Disable spark.sql.parquet.inferTimestampNTZ.enabled by default [spark]
gengliangwang commented on PR #45621: URL: https://github.com/apache/spark/pull/45621#issuecomment-201099 @viirya Thanks, could you provide more details about the Iceberg workaround? As for Delta, users have to either disable `spark.sql.parquet.inferTimestampNTZ.enabled`, or enable the TimestampNTZ feature via ``` ALTER TABLE table_name SET TBLPROPERTIES ('delta.feature.timestampNtz' = 'supported') ``` After enabling the feature, the table won't be readable by older Delta versions. So disabling `spark.sql.parquet.inferTimestampNTZ.enabled` is a better solution in such a case. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SS][SPARK-47363] Initial State without state reader implementation for State API v2. [spark]
anishshri-db commented on code in PR #45467: URL: https://github.com/apache/spark/pull/45467#discussion_r1533102287 ## sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/TransformWithStateExec.scala: ## @@ -271,57 +320,111 @@ case class TransformWithStateExec( case _ => } -if (isStreaming) { - child.execute().mapPartitionsWithStateStore[InternalRow]( +if (hasInitialState) { + val storeConf = new StateStoreConf(session.sqlContext.sessionState.conf) + val hadoopConfBroadcast = sparkContext.broadcast( Review Comment: Why do we need to do this? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SS][SPARK-47363] Initial State without state reader implementation for State API v2. [spark]
anishshri-db commented on code in PR #45467: URL: https://github.com/apache/spark/pull/45467#discussion_r1533098209 ## common/utils/src/main/resources/error/error-classes.json: ## @@ -3503,6 +3503,12 @@ ], "sqlState" : "42802" }, + "STATEFUL_PROCESSOR_CANNOT_REINITIALIZE_STATE_ON_KEY" : { +"message" : [ + "Cannot re-initialize state on the same grouping key during initial state handling for stateful processor. Invalid grouping key = ." Review Comment: nit: remove the space before/after equal ``` Invalid grouping key= ``` -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-47493][SQL] Disable spark.sql.parquet.inferTimestampNTZ.enabled by default [spark]
viirya commented on PR #45621: URL: https://github.com/apache/spark/pull/45621#issuecomment-2010971958 > The result table can't be read by older version engines anymore due to a lack of TimestampNTZ support. The behavior change happens silently and it is hard for users to figure out how to fix it. Hmm, I vaguely recall that in Iceberg there seem to be some configs that can be used to read such fields as a workaround? I'm not sure about Delta on this. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [WIP][SPARK-41811][PYTHON][CONNECT] Implement `SQLStringFormatter` with `WithRelations` [spark]
xinrong-meng commented on code in PR #45614: URL: https://github.com/apache/spark/pull/45614#discussion_r1532525957 ## connector/connect/common/src/main/protobuf/spark/connect/relations.proto: ## @@ -131,6 +132,23 @@ message SQL { repeated Expression pos_arguments = 5; } +// Relation of type [[WithRelations]]. +// +// This relation contains a root plan, and one or more references that are used by the root plan. +// There are two ways of referencing a relation, by name (through a subquery alias), or by plan_id +// (using RelationCommon.plan_id). +// +// This relation can be used to implement CTEs, describe DAGs, or to reduce tree depth. +message WithRelations { Review Comment: May I ask which APIs are associated with the relation? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [WIP][SPARK-46349] Prevent nested SortOrder instances in SortOrder expressions [spark]
github-actions[bot] closed pull request #44283: [WIP][SPARK-46349] Prevent nested SortOrder instances in SortOrder expressions URL: https://github.com/apache/spark/pull/44283 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-47493][SQL] Disable spark.sql.parquet.inferTimestampNTZ.enabled by default [spark]
gengliangwang commented on PR #45621: URL: https://github.com/apache/spark/pull/45621#issuecomment-2010929308 @dongjoon-hyun yes, it is a tough decision. I am now marking this PR as WIP and will decide whether we should continue on this. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-47475][CORE] Fix Executors Scaling Issues Caused by Jar Download Under K8s Cluster Mode [spark]
leletan commented on code in PR #45607: URL: https://github.com/apache/spark/pull/45607#discussion_r1533047660 ## core/src/main/scala/org/apache/spark/internal/config/package.scala: ## @@ -1458,6 +1458,18 @@ package object config { .doubleConf .createWithDefault(1.5) + private[spark] val KUBERNETES_AVOID_JAR_DOWNLOAD_SCHEMES = +ConfigBuilder("spark.kubernetes.jars.avoidDownloadSchemes") + .doc("Comma-separated list of schemes for which jars will not be downloaded to the " + +"driver local disk prior to being distributed to executors, only for Kubernetes deployment. " + +"For use in cases where the jars are big and executor counts are high, " + +"and concurrent download causes network saturation and timeouts. " + +"Wildcard '*' means that jars are not downloaded for any scheme.") + .version("2.3.0") Review Comment: Good catch. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SS][SPARK-47363] Initial State without state reader implementation for State API v2. [spark]
jingz-db commented on code in PR #45467: URL: https://github.com/apache/spark/pull/45467#discussion_r1531068752 ## sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/TransformWithStateExec.scala: ## @@ -127,13 +152,53 @@ case class TransformWithStateExec( mappedIterator } + private def processInitialStateRows( + keyRow: UnsafeRow, + initStateIter: Iterator[InternalRow]): Unit = { +val getKeyObj = + ObjectOperator.deserializeRowToObject(keyDeserializer, groupingAttributes) + +val getStateValueObj = + ObjectOperator.deserializeRowToObject(initialStateDeserializer, initialStateDataAttrs) + +val keyObj = getKeyObj(keyRow) // convert key to objects +ImplicitGroupingKeyTracker.setImplicitKey(keyObj) +val initStateObjIter = initStateIter.map(getStateValueObj.apply) + +initStateObjIter.foreach { initState => + statefulProcessor +.asInstanceOf[StatefulProcessorWithInitialState[Any, Any, Any, Any]] +.handleInitialState(keyObj, initState) +} +ImplicitGroupingKeyTracker.removeImplicitKey() + } + private def processNewData(dataIter: Iterator[InternalRow]): Iterator[InternalRow] = { val groupedIter = GroupedIterator(dataIter, groupingAttributes, child.output) groupedIter.flatMap { case (keyRow, valueRowIter) => val keyUnsafeRow = keyRow.asInstanceOf[UnsafeRow] handleInputRows(keyUnsafeRow, valueRowIter) } } +// TODO double check this + private def processNewDataWithInitialState( + dataIter: Iterator[InternalRow], + initStateIter: Iterator[InternalRow]): Iterator[InternalRow] = { + +val groupedChildDataIter = GroupedIterator(dataIter, groupingAttributes, child.output) +val groupedInitialStateIter = + GroupedIterator(initStateIter, initialStateGroupingAttrs, initialState.output) + +// Create a CoGroupedIterator that will group the two iterators together for every key group. +new CoGroupedIterator( + groupedChildDataIter, groupedInitialStateIter, groupingAttributes).flatMap { + case (keyRow, valueRowIter, initialStateRowIter) => +// TODO in design doc: trying to re-initialize state for the same +// grouping key will result in an error? Review Comment: Not sure if I understand the scenario correctly: does it mean the user tries to reassign state variable values in the user-defined function `handleInitialState()`? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-47256][SQL] Assign names to error classes _LEGACY_ERROR_TEMP_102[4-7] [spark]
davidm-db closed pull request #45622: [SPARK-47256][SQL] Assign names to error classes _LEGACY_ERROR_TEMP_102[4-7] URL: https://github.com/apache/spark/pull/45622 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-47494][Doc] Add migration doc for the behavior change of Parquet timestamp inference since Spark 3.3 [spark]
dongjoon-hyun commented on PR #45623: URL: https://github.com/apache/spark/pull/45623#issuecomment-2010746381 Merged to master/3.5/3.4. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-47494][Doc] Add migration doc for the behavior change of Parquet timestamp inference since Spark 3.3 [spark]
dongjoon-hyun closed pull request #45623: [SPARK-47494][Doc] Add migration doc for the behavior change of Parquet timestamp inference since Spark 3.3 URL: https://github.com/apache/spark/pull/45623 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-47493][SQL] Disable spark.sql.parquet.inferTimestampNTZ.enabled by default [spark]
gengliangwang commented on code in PR #45621: URL: https://github.com/apache/spark/pull/45621#discussion_r1532951714 ## docs/sql-migration-guide.md: ## @@ -121,6 +122,8 @@ license: | - Since Spark 3.3, the `unbase64` function throws error for a malformed `str` input. Use `try_to_binary(, 'base64')` to tolerate malformed input and return NULL instead. In Spark 3.2 and earlier, the `unbase64` function returns a best-efforts result for a malformed `str` input. - Since Spark 3.3.1 and 3.2.3, for `SELECT ... GROUP BY a GROUPING SETS (b)`-style SQL statements, `grouping__id` returns different values from Apache Spark 3.2.0, 3.2.1, 3.2.2, and 3.3.0. It computes based on user-given group-by expressions plus grouping set columns. To restore the behavior before 3.3.1 and 3.2.3, you can set `spark.sql.legacy.groupingIdWithAppendedUserGroupBy`. For details, see [SPARK-40218](https://issues.apache.org/jira/browse/SPARK-40218) and [SPARK-40562](https://issues.apache.org/jira/browse/SPARK-40562). + + - In Spark 3.3/3.4/3.5 releases, when reading Parquet files that were not produced by Spark, Parquet timestamp columns with annotation `isAdjustedToUTC = false` are inferred as TIMESTAMP_NTZ type during schema inference. In Spark 3.2 and earlier, these columns are inferred as TIMESTAMP type. To restore the behavior before Spark 3.3, you can set `spark.sql.parquet.inferTimestampNTZ.enabled` to `false`. Note that this is a behavior change, and it will be disabled after Spark 4.0 release. Review Comment: Sure, created https://github.com/apache/spark/pull/45623 for this change. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[PR] [SPARK-47494][Doc] Add migration doc for the behavior change of Parquet timestamp inference since Spark 3.3 [spark]
gengliangwang opened a new pull request, #45623: URL: https://github.com/apache/spark/pull/45623 ### What changes were proposed in this pull request? Add a migration doc for the behavior change of Parquet timestamp inference since Spark 3.3 ### Why are the changes needed? Show the behavior change to users. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? It's just a doc change ### Was this patch authored or co-authored using generative AI tooling? Yes, there are some doc suggestions from copilot in docs/sql-migration-guide.md -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-47490][SS] Fix RocksDB Logger constructor use to avoid deprecation warning [spark]
dongjoon-hyun closed pull request #45616: [SPARK-47490][SS] Fix RocksDB Logger constructor use to avoid deprecation warning URL: https://github.com/apache/spark/pull/45616 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-47486][CONNECT] Remove unused private `ArrowDeserializers.getString` method [spark]
dongjoon-hyun closed pull request #45610: [SPARK-47486][CONNECT] Remove unused private `ArrowDeserializers.getString` method URL: https://github.com/apache/spark/pull/45610 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-47475][CORE] Fix Executors Scaling Issues Caused by Jar Download Under K8s Cluster Mode [spark]
dongjoon-hyun commented on code in PR #45607: URL: https://github.com/apache/spark/pull/45607#discussion_r1532943327 ## core/src/main/scala/org/apache/spark/internal/config/package.scala: ## @@ -1458,6 +1458,18 @@ package object config { .doubleConf .createWithDefault(1.5) + private[spark] val KUBERNETES_AVOID_JAR_DOWNLOAD_SCHEMES = +ConfigBuilder("spark.kubernetes.jars.avoidDownloadSchemes") + .doc("Comma-separated list of schemes for which jars will not be downloaded to the " + +"driver local disk prior to be distributed to executors, only for kubernetes deployment. " + +"For use in cases when the jars are big and executor counts are high, " + +"concurrent download causes network saturation and timeouts. " + +"Wildcard '*' is denoted to not downloading jars for any the schemes.") + .version("2.3.0") Review Comment: This should be `4.0.0`. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-47475][CORE] Fix Executors Scaling Issues Caused by Jar Download Under K8s Cluster Mode [spark]
dongjoon-hyun commented on PR #45607: URL: https://github.com/apache/spark/pull/45607#issuecomment-2010726489 Ack, @dbtsai . -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-47493][SQL] Disable spark.sql.parquet.inferTimestampNTZ.enabled by default [spark]
dongjoon-hyun commented on PR #45621: URL: https://github.com/apache/spark/pull/45621#issuecomment-2010725475 To @gengliangwang , I believe it would be better if you can send a short email to dev@spark in order not to surprise anyone. > I have been considering this for a while because of reported Spark upgrade failures. I am open to any suggestions. cc @sadikovi as well. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-47493][SQL] Disable spark.sql.parquet.inferTimestampNTZ.enabled by default [spark]
dongjoon-hyun commented on PR #45621: URL: https://github.com/apache/spark/pull/45621#issuecomment-2010724234 I understand the exposed surface is limited as mentioned by @gengliangwang , but any Parquet change has a significant impact on all Spark users because Apache Parquet is the default format. cc @mridulm , @sunchao , @viirya , @peter-toth , @RussellSpitzer , too. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-47493][SQL] Disable spark.sql.parquet.inferTimestampNTZ.enabled by default [spark]
dongjoon-hyun commented on code in PR #45621: URL: https://github.com/apache/spark/pull/45621#discussion_r1532937164 ## docs/sql-migration-guide.md: ## @@ -42,6 +42,7 @@ license: | - Since Spark 4.0, the function `to_csv` no longer supports input with the data type `STRUCT`, `ARRAY`, `MAP`, `VARIANT` and `BINARY` (because the `CSV specification` does not have standards for these data types and cannot be read back using `from_csv`), Spark will throw `DATATYPE_MISMATCH.UNSUPPORTED_INPUT_TYPE` exception. - Since Spark 4.0, JDBC read option `preferTimestampNTZ=true` will not convert Postgres TIMESTAMP WITH TIME ZONE and TIME WITH TIME ZONE data types to TimestampNTZType, which is available in Spark 3.5. - Since Spark 4.0, JDBC read option `preferTimestampNTZ=true` will not convert MySQL TIMESTAMP to TimestampNTZType, which is available in Spark 3.5. MySQL DATETIME is not affected. +- Since Spark 4.0, the SQL config `spark.sql.parquet.inferTimestampNTZ.enabled` is turned off by default. Consequently, when reading Parquet files that were not produced by Spark, the Parquet reader will no longer automatically recognize data as the TIMESTAMP_NTZ data type. This change ensures backward compatibility with releases of Spark version 3.2 and earlier. It also aligns the behavior of schema inference for Parquet files with that of other data sources such as CSV, JSON, ORC, and JDBC, enhancing consistency across the data sources. To revert to the previous behavior where TIMESTAMP_NTZ types were inferred, set `spark.sql.parquet.inferTimestampNTZ.enabled` to true. Review Comment: We need a discussion about this breaking change in Apache Spark 4.0.0, @gengliangwang . -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-47493][SQL] Disable spark.sql.parquet.inferTimestampNTZ.enabled by default [spark]
dongjoon-hyun commented on code in PR #45621: URL: https://github.com/apache/spark/pull/45621#discussion_r1532935416 ## docs/sql-migration-guide.md: ## @@ -121,6 +122,8 @@ license: | - Since Spark 3.3, the `unbase64` function throws error for a malformed `str` input. Use `try_to_binary(, 'base64')` to tolerate malformed input and return NULL instead. In Spark 3.2 and earlier, the `unbase64` function returns a best-efforts result for a malformed `str` input. - Since Spark 3.3.1 and 3.2.3, for `SELECT ... GROUP BY a GROUPING SETS (b)`-style SQL statements, `grouping__id` returns different values from Apache Spark 3.2.0, 3.2.1, 3.2.2, and 3.3.0. It computes based on user-given group-by expressions plus grouping set columns. To restore the behavior before 3.3.1 and 3.2.3, you can set `spark.sql.legacy.groupingIdWithAppendedUserGroupBy`. For details, see [SPARK-40218](https://issues.apache.org/jira/browse/SPARK-40218) and [SPARK-40562](https://issues.apache.org/jira/browse/SPARK-40562). + + - In Spark 3.3/3.4/3.5 releases, when reading Parquet files that were not produced by Spark, Parquet timestamp columns with annotation `isAdjustedToUTC = false` are inferred as TIMESTAMP_NTZ type during schema inference. In Spark 3.2 and earlier, these columns are inferred as TIMESTAMP type. To restore the behavior before Spark 3.3, you can set `spark.sql.parquet.inferTimestampNTZ.enabled` to `false`. Note that this is a behavior change, and it will be disabled after Spark 4.0 release. Review Comment: Well, could you handle this as an independent JIRA first, @gengliangwang ? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-47493][SQL] Disable spark.sql.parquet.inferTimestampNTZ.enabled by default [spark]
gengliangwang commented on code in PR #45621: URL: https://github.com/apache/spark/pull/45621#discussion_r1532932026 ## docs/sql-migration-guide.md: ## @@ -121,6 +122,8 @@ license: | - Since Spark 3.3, the `unbase64` function throws error for a malformed `str` input. Use `try_to_binary(, 'base64')` to tolerate malformed input and return NULL instead. In Spark 3.2 and earlier, the `unbase64` function returns a best-efforts result for a malformed `str` input. - Since Spark 3.3.1 and 3.2.3, for `SELECT ... GROUP BY a GROUPING SETS (b)`-style SQL statements, `grouping__id` returns different values from Apache Spark 3.2.0, 3.2.1, 3.2.2, and 3.3.0. It computes based on user-given group-by expressions plus grouping set columns. To restore the behavior before 3.3.1 and 3.2.3, you can set `spark.sql.legacy.groupingIdWithAppendedUserGroupBy`. For details, see [SPARK-40218](https://issues.apache.org/jira/browse/SPARK-40218) and [SPARK-40562](https://issues.apache.org/jira/browse/SPARK-40562). + + - In Spark 3.3/3.4/3.5 releases, when reading Parquet files that were not produced by Spark, Parquet timestamp columns with annotation `isAdjustedToUTC = false` are inferred as TIMESTAMP_NTZ type during schema inference. In Spark 3.2 and earlier, these columns are inferred as TIMESTAMP type. To restore the behavior before Spark 3.3, you can set `spark.sql.parquet.inferTimestampNTZ.enabled` to `false`. Note that this is a behavior change, and it will be disabled after Spark 4.0 release. Review Comment: Another option is to change the conf in the latest 3.3/3.4/3.5 releases. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-47475][CORE] Fix Executors Scaling Issues Caused by Jar Download Under K8s Cluster Mode [spark]
dbtsai commented on PR #45607: URL: https://github.com/apache/spark/pull/45607#issuecomment-2010712682 cc @dongjoon-hyun -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-47493][SQL] Disable spark.sql.parquet.inferTimestampNTZ.enabled by default [spark]
gengliangwang commented on PR #45621: URL: https://github.com/apache/spark/pull/45621#issuecomment-2010712125 I have been considering this for a while because of reported Spark upgrade failures. I am open to any suggestions. cc @sadikovi as well. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[PR] [SPARK-47493][SQL] Disable spark.sql.parquet.inferTimestampNTZ.enabled by default [spark]
gengliangwang opened a new pull request, #45621: URL: https://github.com/apache/spark/pull/45621 ### What changes were proposed in this pull request? Since Spark 4.0, the SQL config `spark.sql.parquet.inferTimestampNTZ.enabled` is turned off by default. Consequently, when reading Parquet files that were not produced by Spark, the Parquet reader will no longer automatically recognize data as the TIMESTAMP_NTZ data type. This change ensures backward compatibility with releases of Spark version 3.2 and earlier. It also aligns the behavior of schema inference for Parquet files with that of other data sources such as CSV, JSON, ORC, and JDBC, enhancing consistency across the data sources. To revert to the previous behavior where TIMESTAMP_NTZ types were inferred, set `spark.sql.parquet.inferTimestampNTZ.enabled` to true. Note: With https://issues.apache.org/jira/browse/SPARK-47368 and https://issues.apache.org/jira/browse/SPARK-47447, this behavior change won't break the current workloads that are using Spark 3.3/3.4/3.5. ### Why are the changes needed? 1. Consistency with the behavior of CSV, JSON, ORC, and JDBC data sources. This also makes the schema inference behavior simpler. 2. When using `insert overwrite` or `replace table` over external parquet files into a Delta or Iceberg table, the result schema is changed from TimestampType to TimestampNTZType. The result table can't be read by older version engines anymore due to a lack of TimestampNTZ support. The behavior change happens silently and it is hard for users to figure out how to fix it. ### Does this PR introduce _any_ user-facing change? Yes, since Spark 4.0, the SQL config `spark.sql.parquet.inferTimestampNTZ.enabled` is turned off by default. Consequently, when reading Parquet files that were not produced by Spark, the Parquet reader will no longer automatically recognize data as the TIMESTAMP_NTZ data type. With https://issues.apache.org/jira/browse/SPARK-47368 and https://issues.apache.org/jira/browse/SPARK-47447, this behavior change won't break the current workloads that are using Spark 3.3/3.4/3.5. ### How was this patch tested? Existing UTs ### Was this patch authored or co-authored using generative AI tooling? Yes, there are some doc suggestions from copilot in `docs/sql-migration-guide.md` -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
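For users who rely on the old inference, the escape hatch named in the PR description is a one-line session setting (sketch; assumes `spark` is an active SparkSession):

```
// opt back in to inferring TIMESTAMP_NTZ from isAdjustedToUTC = false columns
spark.conf.set("spark.sql.parquet.inferTimestampNTZ.enabled", "true")
```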
Re: [PR] Bump black from 23.9.1 to 24.3.0 in /dev [spark]
dependabot[bot] commented on PR #45617: URL: https://github.com/apache/spark/pull/45617#issuecomment-2010663402
OK, I won't notify you about black again, unless you re-open this PR.
Re: [PR] Bump black from 23.9.1 to 24.3.0 in /dev [spark]
dependabot[bot] closed pull request #45617: Bump black from 23.9.1 to 24.3.0 in /dev URL: https://github.com/apache/spark/pull/45617
Re: [PR] Bump black from 23.9.1 to 24.3.0 in /dev [spark]
dongjoon-hyun commented on PR #45617: URL: https://github.com/apache/spark/pull/45617#issuecomment-2010663347
@dependabot ignore this dependency
Re: [PR] [WIP][SPARK-47492] Widen whitespace rules in lexer [spark]
dongjoon-hyun commented on code in PR #45620: URL: https://github.com/apache/spark/pull/45620#discussion_r1532892591

## sql/core/src/test/scala/org/apache/spark/sql/execution/command/ParserSuite.scala:
@@ -0,0 +1,51 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.execution.command
+
+import org.apache.spark.sql.catalyst.analysis.AnalysisTest
+import org.apache.spark.sql.execution.SparkSqlParser
+import org.apache.spark.sql.test.SharedSparkSession
+
+class ParserSuite extends AnalysisTest with SharedSparkSession {
+  private lazy val parser = new SparkSqlParser()
+
+  test("verify whitespace handling") {

Review Comment: Please move this to the existing `PlanParserSuite.scala` (catalyst module) or `SparkSqlParserSuite` (sql/core module).
Re: [PR] [WIP][SPARK-47492] Widen whitespace rules in lexer [spark]
dongjoon-hyun commented on code in PR #45620: URL: https://github.com/apache/spark/pull/45620#discussion_r1532892591

## sql/core/src/test/scala/org/apache/spark/sql/execution/command/ParserSuite.scala:
@@ -0,0 +1,51 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.execution.command
+
+import org.apache.spark.sql.catalyst.analysis.AnalysisTest
+import org.apache.spark.sql.execution.SparkSqlParser
+import org.apache.spark.sql.test.SharedSparkSession
+
+class ParserSuite extends AnalysisTest with SharedSparkSession {
+  private lazy val parser = new SparkSqlParser()
+
+  test("verify whitespace handling") {

Review Comment: Please move this to the existing `SparkSqlParserSuite`.
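The quoted diff cuts off before the body of `test("verify whitespace handling")`, so the following is only a hedged sketch of the kind of check that could be folded into the existing `SparkSqlParserSuite`, assuming SPARK-47492 widens the lexer to treat Unicode whitespace such as U+00A0 as a token separator. The test name and assertion are illustrative, not taken from the PR.

```scala
// Hypothetical sketch: reuses the `parser` val (a SparkSqlParser) from the
// suite quoted above. The PR elides the real test body.
test("SPARK-47492: lexer accepts non-ASCII whitespace between tokens") {
  val nbsp = "\u00A0" // no-break space
  // If the widened lexer treats U+00A0 like an ordinary space, both
  // statements should parse to structurally equal logical plans.
  assert(parser.parsePlan(s"SELECT${nbsp}1") == parser.parsePlan("SELECT 1"))
}
```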
Re: [PR] [WIP][SPARK-47492] Widen whitespace rules in lexer [spark]
dongjoon-hyun commented on code in PR #45620: URL: https://github.com/apache/spark/pull/45620#discussion_r1532892353

## sql/core/src/test/scala/org/apache/spark/sql/execution/command/ParserSuite.scala:
@@ -0,0 +1,51 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.execution.command
+
+import org.apache.spark.sql.catalyst.analysis.AnalysisTest
+import org.apache.spark.sql.execution.SparkSqlParser
+import org.apache.spark.sql.test.SharedSparkSession
+
+class ParserSuite extends AnalysisTest with SharedSparkSession {

Review Comment: It's too much for a single test case.
Re: [PR] [WIP][SPARK-47492] Widen whitespace rules in lexer [spark]
dongjoon-hyun commented on code in PR #45620: URL: https://github.com/apache/spark/pull/45620#discussion_r1532891606

## sql/core/src/test/scala/org/apache/spark/sql/execution/command/ParserSuite.scala:
@@ -0,0 +1,51 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.execution.command

Review Comment: Why is this under `command` package?
Re: [PR] [SPARK-27950][DSTREAMS][Kinesis] Configurable dynamodb url so kinesis-asl works with localstack [spark]
dongjoon-hyun commented on PR #45619: URL: https://github.com/apache/spark/pull/45619#issuecomment-2010655319
Could you set `ENABLE_KINESIS_TESTS=1` and run the tests in your environment? It would be great if you could paste the results into the PR description.
[PR] [WIP][SPARK-47492] Widen whitespace rules in lexer [spark]
srielau opened a new pull request, #45620: URL: https://github.com/apache/spark/pull/45620

### What changes were proposed in this pull request?

### Why are the changes needed?

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

### Was this patch authored or co-authored using generative AI tooling?