Re: [PR] [SPARK-47474][CORE] Revert SPARK-47461 and add some comments [spark]

2024-03-20 Thread via GitHub


LuciferYang closed pull request #45602: [SPARK-47474][CORE] Revert SPARK-47461 
and add some comments
URL: https://github.com/apache/spark/pull/45602





Re: [PR] [SPARK-47141][CORE] Support enabling migration of shuffle data directly to external storage using config parameter. [spark]

2024-03-20 Thread via GitHub


dongjoon-hyun commented on code in PR #45228:
URL: https://github.com/apache/spark/pull/45228#discussion_r1533285590


##
core/src/test/scala/org/apache/spark/storage/FallbackStorageSuite.scala:
##
@@ -172,61 +172,64 @@ class FallbackStorageSuite extends SparkFunSuite with 
LocalSparkContext {
   }
 
   test("migrate shuffle data to fallback storage") {
-val conf = new SparkConf(false)

Review Comment:
   Please preserve the existing test coverage as much as possible. We had 
better add new feature test coverage separately while keeping the existing test 
coverage unchanged in order to prevent any regression.






Re: [PR] [SPARK-47141][CORE] Support enabling migration of shuffle data directly to external storage using config parameter. [spark]

2024-03-20 Thread via GitHub


dongjoon-hyun commented on code in PR #45228:
URL: https://github.com/apache/spark/pull/45228#discussion_r1533284633


##
core/src/main/scala/org/apache/spark/storage/ShuffleBlockFetcherIterator.scala:
##
@@ -1558,6 +1587,7 @@ object ShuffleBlockFetcherIterator {
   mapIndex: Int,
   address: BlockManagerId,
   size: Long,
+  timeTaken: Long,

Review Comment:
   May I ask the reason why we add a new field here? If this is only for pure logging, please revert this.






Re: [PR] [SPARK-47495][CORE] Fix primary resource jar added to spark.jars twice under k8s cluster mode [spark]

2024-03-20 Thread via GitHub


mridulm commented on PR #45607:
URL: https://github.com/apache/spark/pull/45607#issuecomment-2011272331

   +CC @zhouyejoe 





Re: [PR] [SPARK-47141][CORE] Support enabling migration of shuffle data directly to external storage using config parameter. [spark]

2024-03-20 Thread via GitHub


dongjoon-hyun commented on code in PR #45228:
URL: https://github.com/apache/spark/pull/45228#discussion_r1533283117


##
core/src/main/scala/org/apache/spark/storage/FallbackStorage.scala:
##
@@ -188,15 +271,15 @@ private[spark] object FallbackStorage extends Logging {
 val name = ShuffleDataBlockId(shuffleId, mapId, NOOP_REDUCE_ID).name
 val hash = JavaUtils.nonNegativeHash(name)
 val dataFile = new Path(fallbackPath, s"$appId/$shuffleId/$hash/$name")
-val f = fallbackFileSystem.open(dataFile)
 val size = nextOffset - offset
 logDebug(s"To byte array $size")
 val array = new Array[Byte](size.toInt)
 val startTimeNs = System.nanoTime()
-f.seek(offset)
-f.readFully(array)
-logDebug(s"Took ${(System.nanoTime() - startTimeNs) / (1000 * 
1000)}ms")
-f.close()
+Utils.tryWithResource(fallbackFileSystem.open(dataFile)) { f =>
+  f.seek(offset)
+  f.readFully(array)
+  logDebug(s"Took ${(System.nanoTime() - startTimeNs) / (1000 * 
1000)}ms")
+}

Review Comment:
   Could you spin off this `Utils.tryWithResource` related change to a new 
JIRA? We can merge that first.






Re: [PR] [SPARK-47141][CORE] Support enabling migration of shuffle data directly to external storage using config parameter. [spark]

2024-03-20 Thread via GitHub


dongjoon-hyun commented on code in PR #45228:
URL: https://github.com/apache/spark/pull/45228#discussion_r1533282286


##
core/src/main/scala/org/apache/spark/storage/FallbackStorage.scala:
##
@@ -147,6 +222,14 @@ private[spark] object FallbackStorage extends Logging {
 }
   }
 
+  def stopThreadPool(conf: SparkConf): Unit = {
+logInfo(s" Stopping thread pool")
+if (getFallbackStorage(conf).isDefined &&
+  getFallbackStorage(conf).get.fetchThreadPool.isDefined) {

Review Comment:
   indentation.
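   A minimal sketch of one way to avoid the wrapped condition altogether, operating on the `Option`s directly (names follow the PR's code; the actual fix is up to the author):

   ```scala
   // Sketch only, assuming the surrounding FallbackStorage companion object:
   def stopThreadPool(conf: SparkConf): Unit = {
     logInfo("Stopping thread pool")
     getFallbackStorage(conf).flatMap(_.fetchThreadPool).foreach(_.shutdown())
   }
   ```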






Re: [PR] [SPARK-47141][CORE] Support enabling migration of shuffle data directly to external storage using config parameter. [spark]

2024-03-20 Thread via GitHub


dongjoon-hyun commented on code in PR #45228:
URL: https://github.com/apache/spark/pull/45228#discussion_r1533281979


##
core/src/main/scala/org/apache/spark/storage/FallbackStorage.scala:
##
@@ -92,6 +95,57 @@ private[storage] class FallbackStorage(conf: SparkConf) 
extends Logging {
 val hash = JavaUtils.nonNegativeHash(filename)
 fallbackFileSystem.exists(new Path(fallbackPath, 
s"$appId/$shuffleId/$hash/$filename"))
   }
+
+  private val fetchThreadPool: Option[ThreadPoolExecutor] = {
+val numShuffleThreads = FallbackStorage.getNumReadThreads(conf)
+if (numShuffleThreads > 0) {
+  logInfo(s"FallbackStorage created thread pool using  
${numShuffleThreads} thread(s)")
+  Some(ThreadUtils.newDaemonCachedThreadPool(
+"FetchFromFallbackStorage-threadPool", numShuffleThreads))
+} else {
+  logInfo("FallbackStorage thread pool not created")
+  None
+}
+  }
+
+  def fetchBlocks(
+  blockManager: BlockManager,
+  blocks: collection.Seq[FetchBlockInfo],
+  address: BlockManagerId,
+  listener: BlockFetchingListener): Unit = {
+fetchThreadPool match {
+  case Some(p) if !p.isShutdown =>
+blocks.foreach(block =>
+  p.submit(new Runnable {
+override def run(): Unit = {
+  fetchShuffleBlocks(block, blockManager, listener)
+}
+  })
+)
+  case _ =>
+logInfo(s" fetchThreadPool does not exists for $address or shutdown")
+blocks.foreach(block => fetchShuffleBlocks(block, blockManager, 
listener))
+}
+  }
+
+  private def fetchShuffleBlocks(
+  block: FetchBlockInfo,

Review Comment:
   indentation.
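   For reference, Spark's Scala style indents a wrapped parameter list by four spaces and the method body by two, roughly like this (a sketch reusing the PR's names):

   ```scala
   // Sketch only: four-space continuation indent for parameters, two-space body.
   private def fetchShuffleBlocks(
       block: FetchBlockInfo,
       blockManager: BlockManager,
       listener: BlockFetchingListener): Unit = {
     // ...
   }
   ```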



##
core/src/main/scala/org/apache/spark/storage/FallbackStorage.scala:
##
@@ -92,6 +95,57 @@ private[storage] class FallbackStorage(conf: SparkConf) 
extends Logging {
 val hash = JavaUtils.nonNegativeHash(filename)
 fallbackFileSystem.exists(new Path(fallbackPath, 
s"$appId/$shuffleId/$hash/$filename"))
   }
+
+  private val fetchThreadPool: Option[ThreadPoolExecutor] = {
+val numShuffleThreads = FallbackStorage.getNumReadThreads(conf)
+if (numShuffleThreads > 0) {
+  logInfo(s"FallbackStorage created thread pool using  
${numShuffleThreads} thread(s)")
+  Some(ThreadUtils.newDaemonCachedThreadPool(
+"FetchFromFallbackStorage-threadPool", numShuffleThreads))
+} else {
+  logInfo("FallbackStorage thread pool not created")
+  None
+}
+  }
+
+  def fetchBlocks(
+  blockManager: BlockManager,
+  blocks: collection.Seq[FetchBlockInfo],
+  address: BlockManagerId,
+  listener: BlockFetchingListener): Unit = {
+fetchThreadPool match {
+  case Some(p) if !p.isShutdown =>
+blocks.foreach(block =>
+  p.submit(new Runnable {
+override def run(): Unit = {
+  fetchShuffleBlocks(block, blockManager, listener)
+}
+  })
+)
+  case _ =>
+logInfo(s" fetchThreadPool does not exists for $address or shutdown")
+blocks.foreach(block => fetchShuffleBlocks(block, blockManager, 
listener))
+}
+  }
+
+  private def fetchShuffleBlocks(
+  block: FetchBlockInfo,

Review Comment:
   indentation?






Re: [PR] [SPARK-47141][CORE] Support enabling migration of shuffle data directly to external storage using config parameter. [spark]

2024-03-20 Thread via GitHub


dongjoon-hyun commented on code in PR #45228:
URL: https://github.com/apache/spark/pull/45228#discussion_r1533281749


##
core/src/main/scala/org/apache/spark/executor/CoarseGrainedExecutorBackend.scala:
##
@@ -354,7 +355,8 @@ private[spark] class CoarseGrainedExecutorBackend(
 // We can only trust allBlocksMigrated boolean value if there 
were no tasks running
 // since the start of computing it.
 if (allBlocksMigrated && (migrationTime > 
lastTaskFinishTime.get())) {
-  logInfo("No running tasks, all blocks migrated, stopping.")
+  val timeTakenMs = (System.nanoTime() - startTime) / (1000 * 
1000)
+  logInfo(s"No running tasks, all blocks migrated in 
$timeTakenMs ms, stopping.")

Review Comment:
   Please revert the log-related change of this file because it's irrelevant to 
the functionality. We can handle this separately if needed.






Re: [PR] [SPARK-47496][SQL] Java SPI Support for dynamic JDBC dialect registering [spark]

2024-03-20 Thread via GitHub


yaooqinn commented on code in PR #45626:
URL: https://github.com/apache/spark/pull/45626#discussion_r1533280693


##
sql/core/src/main/scala/org/apache/spark/sql/jdbc/OracleDialect.scala:
##
@@ -230,3 +224,11 @@ private case object OracleDialect extends JdbcDialect {
 
   override def supportsOffset: Boolean = true
 }
+
+private[jdbc] object OracleDialect {

Review Comment:
   Oops, the errors occurred:
   ```
   - simple scan with LIMIT *** FAILED *** (65 milliseconds)
   [info]   org.apache.spark.SparkException: Task not serializable
   ```






Re: [PR] [SPARK-45393][BUILD][FOLLOWUP] Update IsolatedClientLoader fallback Hadoop version to 3.4.0 [spark]

2024-03-20 Thread via GitHub


pan3793 commented on PR #45628:
URL: https://github.com/apache/spark/pull/45628#issuecomment-2011263855

   cc @dongjoon-hyun it seems this place was missed, or was it intended?





[PR] [SPARK-45393][BUILD][FOLLOWUP] Update IsolatedClientLoader fallback Hadoop version to 3.4.0 [spark]

2024-03-20 Thread via GitHub


pan3793 opened a new pull request, #45628:
URL: https://github.com/apache/spark/pull/45628

   
   
   ### What changes were proposed in this pull request?
   
   Update IsolatedClientLoader fallback Hadoop version to 3.4.0
   
   ### Why are the changes needed?
   
   Sync with the default Hadoop version
   
   ### Does this PR introduce _any_ user-facing change?
   
   No
   
   ### How was this patch tested?
   
   Pass GA.
   
   ### Was this patch authored or co-authored using generative AI tooling?
   
   No





Re: [PR] [SPARK-47141][CORE] Support enabling migration of shuffle data directly to external storage using config parameter. [spark]

2024-03-20 Thread via GitHub


dongjoon-hyun commented on code in PR #45228:
URL: https://github.com/apache/spark/pull/45228#discussion_r1533275990


##
core/src/main/scala/org/apache/spark/internal/config/package.scala:
##
@@ -551,6 +551,22 @@ package object config {
   .checkValue(_.endsWith(java.io.File.separator), "Path should end with 
separator.")
   .createOptional
 
+  private[spark] val STORAGE_FALLBACK_STORAGE_NUM_THREADS_FOR_SHUFFLE_READ =
+ConfigBuilder("spark.storage.fallbackStorage.num.threads.for.shuffle.read")

Review Comment:
   Every `.` means a new namespace in the Apache Spark configuration namespace scheme. For example, this line introduces 4 new namespaces. Please avoid using `.`.
   - spark.storage.fallbackStorage.num.*
   - spark.storage.fallbackStorage.num.threads.*
   - spark.storage.fallbackStorage.num.threads.for.*
   - spark.storage.fallbackStorage.num.threads.for.shuffle.*
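   A minimal sketch of a flatter key that stays on the existing `spark.storage.fallbackStorage.*` leaf (the key and default shown here are hypothetical, not the PR's final choice):

   ```scala
   // Sketch only: one camelCase leaf instead of four extra dotted namespaces.
   private[spark] val STORAGE_FALLBACK_STORAGE_NUM_THREADS_FOR_SHUFFLE_READ =
     ConfigBuilder("spark.storage.fallbackStorage.shuffleReadNumThreads")
       .version("4.0.0")
       .intConf
       .createWithDefault(0)
   ```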






Re: [PR] [SPARK-47496][SQL] Java SPI Support for dynamic JDBC dialect registering [spark]

2024-03-20 Thread via GitHub


yaooqinn commented on PR #45626:
URL: https://github.com/apache/spark/pull/45626#issuecomment-2011261846

   Thank you @dongjoon-hyun.
   
   I addressed the comments





Re: [PR] [SPARK-47496][SQL] Java SPI Support for dynamic JDBC dialect registering [spark]

2024-03-20 Thread via GitHub


yaooqinn commented on code in PR #45626:
URL: https://github.com/apache/spark/pull/45626#discussion_r1533274803


##
sql/core/src/main/scala/org/apache/spark/sql/jdbc/OracleDialect.scala:
##
@@ -230,3 +224,11 @@ private case object OracleDialect extends JdbcDialect {
 
   override def supportsOffset: Boolean = true
 }
+
+private[jdbc] object OracleDialect {

Review Comment:
   This is only a place for containing some const ints, which represent some 
variants from Oracle JDBC extensions. Thus, `object` is enough






Re: [PR] [SPARK-47496][SQL] Java SPI Support for dynamic JDBC dialect registering [spark]

2024-03-20 Thread via GitHub


yaooqinn commented on code in PR #45626:
URL: https://github.com/apache/spark/pull/45626#discussion_r1533274803


##
sql/core/src/main/scala/org/apache/spark/sql/jdbc/OracleDialect.scala:
##
@@ -230,3 +224,11 @@ private case object OracleDialect extends JdbcDialect {
 
   override def supportsOffset: Boolean = true
 }
+
+private[jdbc] object OracleDialect {

Review Comment:
   This is only a place for containing some const ints, which represent some variants from Oracle JDBC extensions. Thus, `object` is enough






Re: [PR] [SPARK-47141][CORE] Support enabling migration of shuffle data directly to external storage using config parameter. [spark]

2024-03-20 Thread via GitHub


dongjoon-hyun commented on code in PR #45228:
URL: https://github.com/apache/spark/pull/45228#discussion_r1533274587


##
core/src/test/scala/org/apache/spark/storage/ShuffleBlockFetcherIteratorSuite.scala:
##
@@ -447,7 +462,30 @@ class ShuffleBlockFetcherIteratorSuite extends 
SparkFunSuite with PrivateMethodT
 assert(!iterator.hasNext)
   }
 
-  test("fetch continuous blocks in batch successful 3 local + 4 host local + 2 
remote reads") {
+  def createShuffleFile(shuffleId: Int, mapId: Int, reducerId: Int, conf: 
SparkConf): Unit = {
+var name = ShuffleIndexBlockId(shuffleId, mapId, reducerId).name
+var hash = JavaUtils.nonNegativeHash(name)
+val hadoopConf = SparkHadoopUtil.get.newConfiguration(conf)
+val appId = conf.getAppId
+val path: String = conf.get(STORAGE_DECOMMISSION_FALLBACK_STORAGE_PATH).get
+val indexFile = new Path(path, s"$appId/$shuffleId/$hash/$name")
+name = ShuffleDataBlockId(shuffleId, mapId, reducerId).name
+hash = JavaUtils.nonNegativeHash(name)
+val dataFile = new Path(path, s"$appId/$shuffleId/$hash/$name")
+val fallbackFileSystem = 
org.apache.hadoop.fs.FileSystem.get(indexFile.toUri, hadoopConf)
+val indexOut = fallbackFileSystem.create(indexFile)
+indexOut.writeLong(0L) // offset
+indexOut.writeLong(10L) // next offset
+indexOut.writeLong(10L) // next offset
+indexOut.writeLong(10L) // next offset
+val dataOut = fallbackFileSystem.create(dataFile)
+dataOut.writeBytes("some data to write")
+indexOut.close()
+dataOut.close()
+  }
+
+  test("fetch continuous blocks in batch successful 3 local + 4 host local + 2 
remote reads " +
+"+ 2 read from external storage") {

Review Comment:
   This PR seems to introduce a test failure (or flakiness) at this test case. 
Could you take a look at the CI failure?
   - https://github.com/maheshk114/spark/actions/runs/8369424178/job/22915159186
   ```
   [info] ShuffleBlockFetcherIteratorSuite:
   ...
   [info] - fetch continuous blocks in batch successful 3 local + 4 host local 
+ 2 remote reads + 2 read from external storage *** FAILED *** (107 
milliseconds)
   [info]   org.mockito.exceptions.verification.TooFewActualInvocations: 
blockManager.getLocalBlockData();
   [info] Wanted 2 times:
   [info] -> at 
org.apache.spark.storage.BlockManager.getLocalBlockData(BlockManager.scala:721)
   [info] But was 1 time:
   [info] -> at 
org.apache.spark.storage.ShuffleBlockFetcherIterator.fetchLocalBlocks(ShuffleBlockFetcherIterator.scala:592)
   ```






Re: [PR] [SPARK-47141][CORE] Support enabling migration of shuffle data directly to external storage using config parameter. [spark]

2024-03-20 Thread via GitHub


dongjoon-hyun commented on PR #45228:
URL: https://github.com/apache/spark/pull/45228#issuecomment-2011254687

   Thank you for rebasing, @maheshk114 .





Re: [PR] [Don't Review] Only for list python packages for branch-3.4 [spark]

2024-03-20 Thread via GitHub


dongjoon-hyun commented on PR #45600:
URL: https://github.com/apache/spark/pull/45600#issuecomment-2011253985

   Thank you for the investigation. Ya, it's really weird flakiness.





Re: [PR] [SPARK-47086][BUILD][CORE][SQL][UI] Migrate from Jetty 11 to Jetty 12 [spark]

2024-03-20 Thread via GitHub


dongjoon-hyun commented on PR #45500:
URL: https://github.com/apache/spark/pull/45500#issuecomment-2011251140

   Thank you for all of your contributions here, @HiuKwok !





Re: [PR] [SPARK-47482] Add HiveDialect to sql module [spark]

2024-03-20 Thread via GitHub


dongjoon-hyun commented on code in PR #45609:
URL: https://github.com/apache/spark/pull/45609#discussion_r1533265715


##
sql/core/src/main/scala/org/apache/spark/sql/jdbc/HiveDialect.scala:
##
@@ -0,0 +1,32 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.jdbc
+
+import java.util.Locale
+
+import org.apache.spark.sql.catalyst.SQLConfHelper
+
+private case object HiveDialect extends JdbcDialect with SQLConfHelper {
+
+  override def canHandle(url : String): Boolean =
+url.toLowerCase(Locale.ROOT).startsWith("jdbc:hive2")

Review Comment:
   If you are trying to use this for Spark Thrift Server, this should be `SparkDialect` in the Spark community. However, in that case, it would look very weird because Apache Spark would **need** a dialect to access itself. That is the reason why we don't want to add any `SparkDialect` or `HiveDialect`.
   > Actually, it's not. I used sbin/start-thriftserver.sh in the production 
environment.






Re: [PR] [SPARK-47256][SQL] Assign names to error classes _LEGACY_ERROR_TEMP_102[4-7] [spark]

2024-03-20 Thread via GitHub


MaxGekk commented on PR #45622:
URL: https://github.com/apache/spark/pull/45622#issuecomment-2011239760

   @davidm-db Could you trigger GitHub Actions w/ tests?





Re: [PR] [SPARK-46654][SQL][PYTHON] Make `to_csv` explicitly indicate that it does not support some types of data [spark]

2024-03-20 Thread via GitHub


cloud-fan commented on PR #44665:
URL: https://github.com/apache/spark/pull/44665#issuecomment-2011239475

   It's ok to not support round-trip; array/struct/map are not part of the CSV standard either. I think the principle is to avoid breaking changes if possible. Since `to_csv` already supports array/struct/map, let's keep it but fix the inconsistency issue.





Re: [PR] [SPARK-47256][SQL] Assign names to error classes _LEGACY_ERROR_TEMP_102[4-7] [spark]

2024-03-20 Thread via GitHub


MaxGekk commented on code in PR #45622:
URL: https://github.com/apache/spark/pull/45622#discussion_r1533260149


##
sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/analysis/AnalysisErrorSuite.scala:
##
@@ -295,10 +295,50 @@ class AnalysisErrorSuite extends AnalysisTest with 
DataTypeErrorsBase {
 "aggregate(array(1, 2, 3), 0, (acc, x) -> acc + x) FILTER (WHERE c > 
1)", 7, 76)))
   }
 
-  errorTest(
-"non-deterministic filter predicate in aggregate functions",
-CatalystSqlParser.parsePlan("SELECT count(a) FILTER (WHERE rand(int(c)) > 
1) FROM TaBlE2"),
-"FILTER expression is non-deterministic, it cannot be used in aggregate 
functions" :: Nil)
+  test("SPARK-47256: non deterministic FILTER expression in an aggregate 
function") {
+val plan =
+  CatalystSqlParser.parsePlan("SELECT count(a) FILTER (WHERE rand(int(c)) 
> 1) FROM TaBlE2")

Review Comment:
   @davidm-db Could you reproduce the errors using public APIs, please? If that is not possible, we should consider converting the error to internal errors (see `SparkException.internalError()`).
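   i.e., roughly the following if the path truly cannot be reached through public APIs (a sketch, not the final wording):

   ```scala
   // Sketch only: report an internal error instead of a user-facing error class.
   throw SparkException.internalError(
     "FILTER expression is non-deterministic; this should have been rejected earlier.")
   ```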



##
common/utils/src/main/resources/error/error-classes.json:
##
@@ -4627,6 +4627,26 @@
 ],
 "sqlState" : "42KDF"
   },
+  "AGGREGATE_FILTER_EXPRESSION_NON_DETERMINISTIC" : {

Review Comment:
   Please, create an error class w/ a common error message, and its sub-classes 
w/ specific error messages.






Re: [PR] [SPARK-47496][SQL] Java SPI Support for dynamic JDBC dialect registering [spark]

2024-03-20 Thread via GitHub


dongjoon-hyun commented on code in PR #45626:
URL: https://github.com/apache/spark/pull/45626#discussion_r1533254888


##
sql/core/src/test/scala/org/apache/spark/sql/jdbc/JDBCSuite.scala:
##
@@ -2154,4 +2155,13 @@ class JDBCSuite extends QueryTest with 
SharedSparkSession {
 val expected = Map("percentile_approx_val" -> 49)
 assert(namedObservation.get === expected)
   }
+
+

Review Comment:
   extra empty line?






Re: [PR] [SPARK-46920][YARN] Improve executor exit error message on YARN [spark]

2024-03-20 Thread via GitHub


LuciferYang commented on PR #44951:
URL: https://github.com/apache/spark/pull/44951#issuecomment-2011228756

   Merged into master for Spark 4.0. Thanks @pan3793 @srowen @tgravescs 
@mridulm 





Re: [PR] [SPARK-47496][SQL] Java SPI Support for dynamic JDBC dialect registering [spark]

2024-03-20 Thread via GitHub


dongjoon-hyun commented on code in PR #45626:
URL: https://github.com/apache/spark/pull/45626#discussion_r1533254166


##
sql/core/src/main/scala/org/apache/spark/sql/jdbc/OracleDialect.scala:
##
@@ -230,3 +224,11 @@ private case object OracleDialect extends JdbcDialect {
 
   override def supportsOffset: Boolean = true
 }
+
+private[jdbc] object OracleDialect {

Review Comment:
   Ur, actually, `case object OracleDialect` has more features (serializable, 
hashCode, toString) than `object OracleDialect`. Do we need to shrink features 
like this?
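   For context, a small illustration of what the `case` modifier adds to a singleton object in Scala:

   ```scala
   object PlainDialect       // default Object toString/hashCode, not Serializable
   case object CaseDialect   // extends Product with Serializable

   CaseDialect.toString                     // "CaseDialect" (compiler-generated)
   CaseDialect.isInstanceOf[Serializable]   // true
   PlainDialect.isInstanceOf[Serializable]  // false
   ```

   Losing `Serializable` is what surfaces as the `Task not serializable` failure reported elsewhere in this thread.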






Re: [PR] [SPARK-46920][YARN] Improve executor exit error message on YARN [spark]

2024-03-20 Thread via GitHub


LuciferYang closed pull request #44951: [SPARK-46920][YARN] Improve executor 
exit error message on YARN
URL: https://github.com/apache/spark/pull/44951





Re: [PR] [SPARK-47496][SQL] Java SPI Support for dynamic JDBC dialect registering [spark]

2024-03-20 Thread via GitHub


dongjoon-hyun commented on code in PR #45626:
URL: https://github.com/apache/spark/pull/45626#discussion_r1533254166


##
sql/core/src/main/scala/org/apache/spark/sql/jdbc/OracleDialect.scala:
##
@@ -230,3 +224,11 @@ private case object OracleDialect extends JdbcDialect {
 
   override def supportsOffset: Boolean = true
 }
+
+private[jdbc] object OracleDialect {

Review Comment:
   Ur, actually, `case object` has more features (serializable, hashCode, 
toString) than `object`. Do we need to shrink features like this?






[PR] [SPARK-41888][PYTHON][CONNECT][TESTS] Enable doctest for `DataFrame.observe` [spark]

2024-03-20 Thread via GitHub


zhengruifeng opened a new pull request, #45627:
URL: https://github.com/apache/spark/pull/45627

   ### What changes were proposed in this pull request?
   Enable doctest for `DataFrame.observe`
   
   
   ### Why are the changes needed?
   for test coverage
   
   
   ### Does this PR introduce _any_ user-facing change?
   no, test-only
   
   
   ### How was this patch tested?
   ci
   
   
   ### Was this patch authored or co-authored using generative AI tooling?
   no





Re: [PR] [MINOR][CORE] Fix a comment typo `slf4j-to-jul` to `jul-to-slf4j` [spark]

2024-03-20 Thread via GitHub


dongjoon-hyun closed pull request #45625: [MINOR][CORE] Fix a comment typo 
`slf4j-to-jul` to `jul-to-slf4j`
URL: https://github.com/apache/spark/pull/45625





Re: [PR] [MINOR][CORE] Fix a comment typo `slf4j-to-jul` to `jul-to-slf4j` [spark]

2024-03-20 Thread via GitHub


dongjoon-hyun commented on PR #45625:
URL: https://github.com/apache/spark/pull/45625#issuecomment-2011222356

   Thank you, @yaooqinn . Merged to master.





Re: [PR] [MINOR][CORE] Fix a comment typo `slf4j-to-jul` to `jul-to-slf4j` [spark]

2024-03-20 Thread via GitHub


yaooqinn commented on PR #45625:
URL: https://github.com/apache/spark/pull/45625#issuecomment-2011211719

   LGTM





[PR] [SPARK-47496][SQL] Java SPI Support for dynamic JDBC dialect registering [spark]

2024-03-20 Thread via GitHub


yaooqinn opened a new pull request, #45626:
URL: https://github.com/apache/spark/pull/45626

   
   
   ### What changes were proposed in this pull request?
   
   
   This PR brings the Java ServiceProvider Interface (SPI) Support for dynamic 
JDBC dialect registering.
   
   A custom JDBC dialect can now be registered automatically, instead of calling `JdbcDialects.registerDialect` manually.
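   For context, the manual registration path that this change makes optional looks roughly like this (the dialect below is a made-up example, not part of the PR):

   ```scala
   import org.apache.spark.sql.jdbc.{JdbcDialect, JdbcDialects}

   // Hypothetical third-party dialect; today it has to be registered in code.
   object MyDialect extends JdbcDialect {
     override def canHandle(url: String): Boolean = url.startsWith("jdbc:mydb")
   }

   JdbcDialects.registerDialect(MyDialect)
   ```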
   
   ### Why are the changes needed?
   
   
   For pure SQL and other non-Java API users, it's difficult to register a custom JDBC dialect. With this patch, it can be done whenever the jar containing the dialect class is visible to the Spark classloader.
   
   ### Does this PR introduce _any_ user-facing change?
   
   
   Yes, but mostly for third-party developers
   
   ### How was this patch tested?
   
   new tests
   
   
   ### Was this patch authored or co-authored using generative AI tooling?
   no
   





Re: [PR] [SPARK-47086][BUILD][CORE][SQL][UI] Migrate from Jetty 11 to Jetty 12 [spark]

2024-03-20 Thread via GitHub


HiuKwok commented on PR #45500:
URL: https://github.com/apache/spark/pull/45500#issuecomment-2011184474

   Left a comment on the Jira ticket and I'm closing this MR for now. 





Re: [PR] [SPARK-47086][BUILD][CORE][SQL][UI] Migrate from Jetty 11 to Jetty 12 [spark]

2024-03-20 Thread via GitHub


HiuKwok closed pull request #45500: [SPARK-47086][BUILD][CORE][SQL][UI] Migrate 
from Jetty 11 to Jetty 12
URL: https://github.com/apache/spark/pull/45500





Re: [PR] [SPARK-46654][SQL][PYTHON] Make `to_csv` explicitly indicate that it does not support some types of data [spark]

2024-03-20 Thread via GitHub


LuciferYang commented on PR #44665:
URL: https://github.com/apache/spark/pull/44665#issuecomment-2011183252

   I don't oppose this resolution, but if `to_csv` can handle complex data 
structures, what would our subsequent plans be? Would `from_csv` also need to 
support complex data structures to make these paired functions more 
self-consistent? In the long term, do we need to enhance the read and write 
capabilities for CSV data sources to support complex data structures?





Re: [PR] [SPARK-47141] [Core]: Support enabling migration of shuffle data directly to external storage using config parameter. [spark]

2024-03-20 Thread via GitHub


maheshk114 commented on PR #45228:
URL: https://github.com/apache/spark/pull/45228#issuecomment-2011171597

   > To @maheshk114 , could you rebase this PR to the `master` branch once more 
to test on top of Apache Hadoop 3.4.0? The following is merged to Apache Spark 
`master` branch.
   > 
   > * [[SPARK-45393][BUILD] Upgrade Hadoop to 3.4.0 
#45583](https://github.com/apache/spark/pull/45583)
   
   done





Re: [PR] [SPARK-46654][SQL][PYTHON] Make `to_csv` explicitly indicate that it does not support some types of data [spark]

2024-03-20 Thread via GitHub


panbingkun commented on PR #44665:
URL: https://github.com/apache/spark/pull/44665#issuecomment-2011171313

   Using `ToPrettyString.eval` to generate pretty strings for these types in `to_csv` is something I have also tried in this PR. Below:
   https://github.com/apache/spark/pull/44665/commits/22a7afb6951b9274cc428f71e675df233de74a8a
   (screenshot: https://github.com/apache/spark/assets/15246973/961242a9-d26d-4afe-9b01-ad7ea6ab25d9)
   (PS: Maybe some details about `using ToPrettyString.eval` still need to be 
confirmed.)
   
   If the final discussion result is to `fix their consistency`, this approach 
should also be possible.
   WDYT @MaxGekk @HyukjinKwon @LuciferYang @xinrong-meng @srowen ?
   
   





[PR] [MINOR][CORE] Fix a typo `slf4j-to-jul` to `jul-to-slf4j` [spark]

2024-03-20 Thread via GitHub


dongjoon-hyun opened a new pull request, #45625:
URL: https://github.com/apache/spark/pull/45625

   
   
   ### What changes were proposed in this pull request?
   
   
   
   ### Why are the changes needed?
   
   
   
   ### Does this PR introduce _any_ user-facing change?
   
   
   
   ### How was this patch tested?
   
   
   
   ### Was this patch authored or co-authored using generative AI tooling?
   
   





Re: [PR] [SPARK-47486][CONNECT] Remove unused private `ArrowDeserializers.getString` method [spark]

2024-03-20 Thread via GitHub


LuciferYang commented on PR #45610:
URL: https://github.com/apache/spark/pull/45610#issuecomment-2011168276

   Thanks @dongjoon-hyun ~





Re: [PR] [SPARK-47141] [Core]: Support enabling migration of shuffle data directly to external storage using config parameter. [spark]

2024-03-20 Thread via GitHub


maheshk114 commented on PR #45228:
URL: https://github.com/apache/spark/pull/45228#issuecomment-2011167980

   > To @maheshk114 , I have a big concern on the AS-IS PR title because it is 
misleading the users by ignoring the existing Apache Spark 3.2 features.
   > 
   > > [SPARK-47141] [Core]: Support shuffle migration to external storage.
   > 
   > At least, as the author of the following Apache Spark 3.2 patches, I 
believe Apache Spark "Support Shuffle Migration to External Storage" already. 
WDTY, @maheshk114 ? Could you revise the PR title to narrow down to this PR's 
exact contribution?
   > 
   > * [[SPARK-33545][CORE] Support Fallback Storage during Worker decommission 
#30492](https://github.com/apache/spark/pull/30492)
   > * [[SPARK-34142][CORE] Support Fallback Storage Cleanup during stopping 
SparkContext #31215](https://github.com/apache/spark/pull/31215)
   
   Thanks a lot @dongjoon-hyun for looking into the PR. Yes, it makes sense. I have updated the PR title to reflect the exact changes done in this PR.





Re: [PR] [SPARK-47007][SQL][PYTHON][R][CONNECT] Add the `map_sort` function [spark]

2024-03-20 Thread via GitHub


dongjoon-hyun commented on PR #45069:
URL: https://github.com/apache/spark/pull/45069#issuecomment-2011161784

   +1 for Wenchen's decision. Thank you for reverting.





Re: [PR] [SPARK-46654][SQL][PYTHON] Make `to_csv` explicitly indicate that it does not support some types of data [spark]

2024-03-20 Thread via GitHub


cloud-fan commented on PR #44665:
URL: https://github.com/apache/spark/pull/44665#issuecomment-2011158182

   I'd prefer to fix the inconsistency. Can we use `ToPrettyString.eval` to 
generate pretty strings for these types in `to_csv`?





Re: [PR] [SPARK-47487][SQL] Simplify code in AnsiTypeCoercion [spark]

2024-03-20 Thread via GitHub


gengliangwang commented on code in PR #45612:
URL: https://github.com/apache/spark/pull/45612#discussion_r1533192344


##
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/AnsiTypeCoercion.scala:
##
@@ -180,56 +180,27 @@ object AnsiTypeCoercion extends TypeCoercionBase {
   // cast the input to decimal.
   case (n: NumericType, DecimalType) => Some(DecimalType.forType(n))
 
-  // Cast null type (usually from null literals) into target types
-  // By default, the result type is `target.defaultConcreteType`. When the 
target type is
-  // `TypeCollection`, there is another branch to find the "closet 
convertible data type" below.
-  case (NullType, target) if !target.isInstanceOf[TypeCollection] =>
-Some(target.defaultConcreteType)
-
   // If a function expects a StringType, no StringType instance should be 
implicitly cast to
   // StringType with a collation that's not accepted (aka. lockdown 
unsupported collations).
   case (_: StringType, StringType) => None
   case (_: StringType, _: StringTypeCollated) => None
 
-  // This type coercion system will allow implicit converting String type 
as other
-  // primitive types, in case of breaking too many existing Spark SQL 
queries.
-  case (StringType, a: AtomicType) =>
-Some(a)
-
-  // If the target type is any Numeric type, convert the String type as 
Double type.
-  case (StringType, NumericType) =>
-Some(DoubleType)
-
-  // If the target type is any Decimal type, convert the String type as 
the default
-  // Decimal type.
-  case (StringType, DecimalType) =>
-Some(DecimalType.SYSTEM_DEFAULT)
-
-  // If the target type is any timestamp type, convert the String type as 
the default
-  // Timestamp type.
-  case (StringType, AnyTimestampType) =>
-Some(AnyTimestampType.defaultConcreteType)
-
-  case (DateType, AnyTimestampType) =>
-Some(AnyTimestampType.defaultConcreteType)
-
-  case (_, target: DataType) =>
-if (Cast.canANSIStoreAssign(inType, target)) {
-  Some(target)
+  // Ideally the implicit cast rule should be the same as 
`Cast.canANSIStoreAssign` so that it's
+  // consistent with table insertion. To avoid breaking too many existing 
Spark SQL queries,
+  // we make the system to allow implicitly converting String type as 
other primitive types.
+  case (StringType, a @ (_: AtomicType | NumericType | DecimalType | 
AnyTimestampType)) =>
+Some(a.defaultConcreteType)
+
+  // When the target type is `TypeCollection`, there is another branch to 
find the
+  // "closet convertible data type" below.
+  case (_, target) if !target.isInstanceOf[TypeCollection] =>
+val concreteType = target.defaultConcreteType

Review Comment:
   Making it `final` also sounds good.






Re: [PR] [SPARK-47495][CORE] Fix primary resource jar added to spark.jars twice under k8s cluster mode [spark]

2024-03-20 Thread via GitHub


leletan commented on PR #45607:
URL: https://github.com/apache/spark/pull/45607#issuecomment-2011103330

   @dongjoon-hyun Updated the PR and associated it with the new JIRA 
https://issues.apache.org/jira/browse/SPARK-47495. Please let me know if this 
looks good to you. Thanks! 





Re: [PR] [SPARK-47487][SQL] Simplify code in AnsiTypeCoercion [spark]

2024-03-20 Thread via GitHub


cloud-fan commented on code in PR #45612:
URL: https://github.com/apache/spark/pull/45612#discussion_r1533184546


##
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/AnsiTypeCoercion.scala:
##
@@ -180,56 +180,27 @@ object AnsiTypeCoercion extends TypeCoercionBase {
   // cast the input to decimal.
   case (n: NumericType, DecimalType) => Some(DecimalType.forType(n))
 
-  // Cast null type (usually from null literals) into target types
-  // By default, the result type is `target.defaultConcreteType`. When the 
target type is
-  // `TypeCollection`, there is another branch to find the "closet 
convertible data type" below.
-  case (NullType, target) if !target.isInstanceOf[TypeCollection] =>
-Some(target.defaultConcreteType)
-
   // If a function expects a StringType, no StringType instance should be 
implicitly cast to
   // StringType with a collation that's not accepted (aka. lockdown 
unsupported collations).
   case (_: StringType, StringType) => None
   case (_: StringType, _: StringTypeCollated) => None
 
-  // This type coercion system will allow implicit converting String type 
as other
-  // primitive types, in case of breaking too many existing Spark SQL 
queries.
-  case (StringType, a: AtomicType) =>
-Some(a)
-
-  // If the target type is any Numeric type, convert the String type as 
Double type.
-  case (StringType, NumericType) =>
-Some(DoubleType)
-
-  // If the target type is any Decimal type, convert the String type as 
the default
-  // Decimal type.
-  case (StringType, DecimalType) =>
-Some(DecimalType.SYSTEM_DEFAULT)
-
-  // If the target type is any timestamp type, convert the String type as 
the default
-  // Timestamp type.
-  case (StringType, AnyTimestampType) =>
-Some(AnyTimestampType.defaultConcreteType)
-
-  case (DateType, AnyTimestampType) =>
-Some(AnyTimestampType.defaultConcreteType)
-
-  case (_, target: DataType) =>
-if (Cast.canANSIStoreAssign(inType, target)) {
-  Some(target)
+  // Ideally the implicit cast rule should be the same as 
`Cast.canANSIStoreAssign` so that it's
+  // consistent with table insertion. To avoid breaking too many existing 
Spark SQL queries,
+  // we make the system to allow implicitly converting String type as 
other primitive types.
+  case (StringType, a @ (_: AtomicType | NumericType | DecimalType | 
AnyTimestampType)) =>
+Some(a.defaultConcreteType)
+
+  // When the target type is `TypeCollection`, there is another branch to 
find the
+  // "closet convertible data type" below.
+  case (_, target) if !target.isInstanceOf[TypeCollection] =>
+val concreteType = target.defaultConcreteType

Review Comment:
   It's the default implementation in `DataType`:
   ```
   override private[sql] def defaultConcreteType: DataType = this
   ```
   
   maybe I should just make it `final`?
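   i.e., roughly (a sketch; whether this is safe depends on whether any subclass overrides it today):

   ```scala
   final override private[sql] def defaultConcreteType: DataType = this
   ```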






Re: [PR] [SPARK-47007][SQL][PYTHON][R][CONNECT] Add the `map_sort` function [spark]

2024-03-20 Thread via GitHub


cloud-fan commented on PR #45069:
URL: https://github.com/apache/spark/pull/45069#issuecomment-2011092172

   I can't find it in other systems, and it does not make sense because map elements 
are order-less. I'm reverting it; please re-submit it without exposing the 
function publicly.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-47495][CORE] Fix primary resource jar added to spark.jars twice under k8s cluster mode [spark]

2024-03-20 Thread via GitHub


leletan commented on code in PR #45607:
URL: https://github.com/apache/spark/pull/45607#discussion_r1533169526


##
core/src/main/scala/org/apache/spark/internal/config/package.scala:
##
@@ -1458,6 +1458,18 @@ package object config {
   .doubleConf
   .createWithDefault(1.5)
 
+  private[spark] val KUBERNETES_AVOID_JAR_DOWNLOAD_SCHEMES =
+ConfigBuilder("spark.kubernetes.jars.avoidDownloadSchemes")
+  .doc("Comma-separated list of schemes for which jars will not be 
downloaded to the " +
+"driver local disk prior to be distributed to executors, only for 
kubernetes deployment. " +
+"For use in cases when the jars are big and executor counts are high, 
" +
+"concurrent download causes network saturation and timeouts. " +
+"Wildcard '*' is denoted to not downloading jars for any the schemes.")
+  .version("2.3.0")

Review Comment:
   Will fix and move this to another JIRA & PR.
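   (For context, a minimal sketch of how the config proposed in the diff above might be set from application code; the config name is taken from the diff, while the scheme values `s3a,gs` are purely illustrative and not part of this PR:)
   ```scala
   import org.apache.spark.SparkConf

   // Hypothetical usage of the proposed config; "s3a,gs" is an example value only.
   val conf = new SparkConf()
     .set("spark.kubernetes.jars.avoidDownloadSchemes", "s3a,gs")
   ```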



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[PR] [Don't Review] Only for address pyspark-errors for branch-3.4 [spark]

2024-03-20 Thread via GitHub


panbingkun opened a new pull request, #45624:
URL: https://github.com/apache/spark/pull/45624

   
   
   ### What changes were proposed in this pull request?
   
   
   
   ### Why are the changes needed?
   
   
   
   ### Does this PR introduce _any_ user-facing change?
   
   
   
   ### How was this patch tested?
   
   
   
   ### Was this patch authored or co-authored using generative AI tooling?
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [Don't Review] Only for list python packages for branch-3.4 [spark]

2024-03-20 Thread via GitHub


panbingkun commented on PR #45600:
URL: https://github.com/apache/spark/pull/45600#issuecomment-2011079125

   For the record:
   
   1.success
   ```
   (base) panbingkun:~/Developer/spark/spark-trunk$pip install --upgrade 
--no-cache-dir --force-reinstall ./python/dist/pyspark-3.4.3.dev0.tar.gz
   Processing ./python/dist/pyspark-3.4.3.dev0.tar.gz
 Preparing metadata (setup.py) ... done
   Collecting py4j==0.10.9.7 (from pyspark==3.4.3.dev0)
 Downloading py4j-0.10.9.7-py2.py3-none-any.whl.metadata (1.5 kB)
   Downloading py4j-0.10.9.7-py2.py3-none-any.whl (200 kB)
   200.5/200.5 kB 134.2 kB/s eta 0:00:00
   Building wheels for collected packages: pyspark
 Building wheel for pyspark (setup.py) ... done
 Created wheel for pyspark: 
filename=pyspark-3.4.3.dev0-py2.py3-none-any.whl size=626013326 
sha256=e423146045a43d8c90ac471489dd5b6fb3702def94274f6db968961c8981731f
 Stored in directory: 
/private/var/folders/kl/p0fxfzp53g5cc38qjxqwk_cmgp/T/pip-ephem-wheel-cache-73zlf_ix/wheels/0d/86/cd/947463e4be802583aec12b9dd00109a2a65ea91253d3ffd32d
   Successfully built pyspark
   Installing collected packages: py4j, pyspark
 Attempting uninstall: pyspark
   Found existing installation: pyspark 4.0.0.dev0
   Uninstalling pyspark-4.0.0.dev0:
 Successfully uninstalled pyspark-4.0.0.dev0
   Successfully installed py4j-0.10.9.7 pyspark-3.4.3.dev0
   ```
   
   2.fail
   ```
   Writing pyspark-3.4.3.dev0/setup.cfg
   creating dist
   Creating tar archive
   removing 'pyspark-3.4.3.dev0' (and everything under it)
   Installing dist into virtual env
   Processing ./python/dist/pyspark-3.4.3.dev0.tar.gz
     Preparing metadata (setup.py) ... done
   Collecting py4j==0.10.9.7 (from pyspark==3.4.3.dev0)
     Downloading py4j-0.10.9.7-py2.py3-none-any.whl.metadata (1.5 kB)
   Downloading py4j-0.10.9.7-py2.py3-none-any.whl (200 kB)
     0.0/200.5 kB ? eta -:--:--
     200.5/200.5 kB 9.1 MB/s eta 0:00:00
   Building wheels for collected packages: pyspark
   ```


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-47495][CORE] Fix primary resource jar added to spark.jars twice under k8s cluster mode [spark]

2024-03-20 Thread via GitHub


leletan commented on code in PR #45607:
URL: https://github.com/apache/spark/pull/45607#discussion_r1533169526


##
core/src/main/scala/org/apache/spark/internal/config/package.scala:
##
@@ -1458,6 +1458,18 @@ package object config {
   .doubleConf
   .createWithDefault(1.5)
 
+  private[spark] val KUBERNETES_AVOID_JAR_DOWNLOAD_SCHEMES =
+ConfigBuilder("spark.kubernetes.jars.avoidDownloadSchemes")
+  .doc("Comma-separated list of schemes for which jars will not be 
downloaded to the " +
+"driver local disk prior to be distributed to executors, only for 
kubernetes deployment. " +
+"For use in cases when the jars are big and executor counts are high, 
" +
+"concurrent download causes network saturation and timeouts. " +
+"Wildcard '*' is denoted to not downloading jars for any the schemes.")
+  .version("2.3.0")

Review Comment:
   Will move this to another JIRA & PR.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-47482] Add HiveDialect to sql module [spark]

2024-03-20 Thread via GitHub


xleoken commented on PR #45609:
URL: https://github.com/apache/spark/pull/45609#issuecomment-2011074139

   > Thanks for pinging me @dongjoon-hyun.
   > 
   > I know that it's technically feasible. But we have a much more efficient 
and direct way to access hive tables. I don't see the necessity for adding it 
as a built-in dialect.
   
   Welcome @yaooqinn, could you explain in more detail? The key to this patch is 
overriding the `quoteIdentifier` method.
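   (For readers following along, a rough sketch of the kind of override being referred to; this is illustrative only, since the PR's full diff is not quoted here, and the object name below is hypothetical:)
   ```scala
   import java.util.Locale

   import org.apache.spark.sql.jdbc.JdbcDialect

   // Hive quotes identifiers with backticks rather than the ANSI double quotes
   // used by the default JdbcDialect implementation.
   private case object HiveDialectSketch extends JdbcDialect {
     override def canHandle(url: String): Boolean =
       url.toLowerCase(Locale.ROOT).startsWith("jdbc:hive2")

     override def quoteIdentifier(colName: String): String =
       s"`$colName`"
   }
   ```
   When not built in, a dialect like this would typically be picked up via `JdbcDialects.registerDialect`.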


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-47488][core][k8s] Driver stuck when thread pool is not shut down [spark]

2024-03-20 Thread via GitHub


yaooqinn commented on PR #45613:
URL: https://github.com/apache/spark/pull/45613#issuecomment-2011072653

   I checked your example code in the JIRA and it looks weird. Ideally, you should 
stop the non-daemon threads so that the JVM can exit, and stop the SparkContext to 
terminate the app.
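   (A minimal sketch of that pattern; the names and pool size are illustrative and not taken from the JIRA example:)
   ```scala
   import java.util.concurrent.Executors

   import org.apache.spark.sql.SparkSession

   object Demo {
     def main(args: Array[String]): Unit = {
       val spark = SparkSession.builder().appName("demo").getOrCreate()
       // Executors.newFixedThreadPool creates non-daemon threads by default,
       // so the JVM cannot exit until the pool is shut down.
       val pool = Executors.newFixedThreadPool(4)
       try {
         // ... submit work to the pool and wait for it ...
       } finally {
         pool.shutdown() // let the worker threads terminate
         spark.stop()    // stop the SparkContext so the application can finish
       }
     }
   }
   ```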


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-47462][SQL] Align mappings of other unsigned numeric types with TINYINT in MySQLDialect [spark]

2024-03-20 Thread via GitHub


yaooqinn commented on PR #45588:
URL: https://github.com/apache/spark/pull/45588#issuecomment-2011059165

   Thank you, @dongjoon-hyun and @cloud-fan.
   
   I will send follow-ups for the migration guides.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-47474][CORE] Revert SPARK-47461 and add some comments [spark]

2024-03-20 Thread via GitHub


LuciferYang commented on PR #45602:
URL: https://github.com/apache/spark/pull/45602#issuecomment-2011052480

   > BTW, @LuciferYang , as you requested in the above already, I guess we need 
to wait for @mridulm and @tgravescs 's review and approval. I'll leave this to 
them.
   
   OK


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-47482] Add HiveDialect to sql module [spark]

2024-03-20 Thread via GitHub


yaooqinn commented on PR #45609:
URL: https://github.com/apache/spark/pull/45609#issuecomment-2011049532

   Thanks for pinging me @dongjoon-hyun.
   
   I know that it's technically feasible. But we have a much more efficient and 
direct way to access hive tables. I don't see the necessity for adding it as a 
built-in dialect.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-47493][SQL] Disable spark.sql.parquet.inferTimestampNTZ.enabled by default [spark]

2024-03-20 Thread via GitHub


RussellSpitzer commented on PR #45621:
URL: https://github.com/apache/spark/pull/45621#issuecomment-2011026947

   Yes, those parameters only change the Iceberg schema. What you are seeing is 
the note that the Iceberg schema cannot be read into the Spark type without a 
special parameter. Changing the default for the writer still doesn't affect us. 
   
   We had that issue previously because Trino and other engines have no problem 
creating a table with that type.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-47493][SQL] Disable spark.sql.parquet.inferTimestampNTZ.enabled by default [spark]

2024-03-20 Thread via GitHub


viirya commented on PR #45621:
URL: https://github.com/apache/spark/pull/45621#issuecomment-2011021425

   > @viirya Thanks, could you provide more details about the iceberg 
workaround?
   > 
   > As for Delta, users have to either disable 
`spark.sql.parquet.inferTimestampNTZ.enabled`, or enable TimestampNTZ feature 
via
   > 
   > ```
   > ALTER TABLE table_name SET TBLPROPERTIES ('delta.feature.timestampNtz' = 'supported')
   > ```
   > 
   > After enabling the feature, the table won't be able to be read by older Delta 
versions. So disabling `spark.sql.parquet.inferTimestampNTZ.enabled` is a better 
solution in such a case.
   
   Not sure if there are definitely related, but I remember there are configs 
like
   
   ```
   .config("spark.sql.iceberg.use-timestamp-without-timezone-in-new-tables", 
"true")
   .config("spark.sql.iceberg.handle-timestamp-without-timezone", "true")
   ```


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-47493][SQL] Disable spark.sql.parquet.inferTimestampNTZ.enabled by default [spark]

2024-03-20 Thread via GitHub


gengliangwang commented on PR #45621:
URL: https://github.com/apache/spark/pull/45621#issuecomment-2011010486

   @RussellSpitzer The same pattern can happen on Iceberg if there are multiple 
Spark versions in the pipeline.
   On Spark 3.3+:
   ```
   // The following can be Replace table as select as well
   
spark.read.parquet(...).write.mode("overwrite").format("iceberg").saveAsTable("iceberg_table")
   ```
   
   On Spark 3.2 and earlier:
   ```
   > select * from iceberg_ts 
   
   java.lang.IllegalArgumentException: Cannot handle timestamp without timezone 
fields in Spark. Spark does not natively support this type but if you would 
like to handle all timestamps as timestamp with timezone set 
'spark.sql.iceberg.handle-timestamp-without-timezone' to true. This will not 
change the underlying values stored but will change their displayed values in 
Spark. For more information please see 
https://docs.databricks.com/spark/latest/dataframes-datasets/dates-timestamps.html#ansi-sql-and-spark-sql-timestamps
   ```


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-47372][SS] Add support for range scan based key state encoder for use with state store provider [spark]

2024-03-20 Thread via GitHub


neilramaswamy commented on code in PR #45503:
URL: https://github.com/apache/spark/pull/45503#discussion_r1533103809


##
sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/state/RocksDBStateEncoder.scala:
##
@@ -39,13 +41,21 @@ sealed trait RocksDBValueStateEncoder {
 }
 
 object RocksDBStateEncoder {
-  def getKeyEncoder(
-  keySchema: StructType,
-  numColsPrefixKey: Int): RocksDBKeyStateEncoder = {
-if (numColsPrefixKey > 0) {
-  new PrefixKeyScanStateEncoder(keySchema, numColsPrefixKey)
-} else {
-  new NoPrefixKeyStateEncoder(keySchema)
+  def getKeyEncoder(keyStateEncoderSpec: KeyStateEncoderSpec): 
RocksDBKeyStateEncoder = {
+// Return the key state encoder based on the requested type
+keyStateEncoderSpec match {
+  case NoPrefixKeyStateEncoderSpec(keySchema) =>
+new NoPrefixKeyStateEncoder(keySchema)
+
+  case PrefixKeyScanStateEncoderSpec(keySchema, numColsPrefixKey) =>
+new PrefixKeyScanStateEncoder(keySchema, numColsPrefixKey)
+
+  case RangeKeyScanStateEncoderSpec(keySchema, numColsPrefixKey) =>

Review Comment:
   `numOrderingCols` ?



##
sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/state/RocksDBStateEncoder.scala:
##
@@ -192,6 +199,233 @@ class PrefixKeyScanStateEncoder(
   override def supportPrefixKeyScan: Boolean = true
 }
 
+/**
+ * RocksDB Key Encoder for UnsafeRow that supports range scan for fixed size 
fields

Review Comment:
   Nowhere in this do we mention why we cannot range scan non-fixed size fields 
(I understand, but maybe future readers won't). It's not mentioned in the error 
message either. Can we do that?
   
   I think also a quick comment explaining how this actually works would be 
helpful, i.e. "To encode a row `r` for a range scan, we first project the first 
`numOrderingCols` needed for the range scan into an `UnsafeRow`; we then 
rewrite that `UnsafeRow`'s fields in big endian. Then, for the rest of the 
fields, we project those into an `UnsafeRow`. We then effectively join these 
latter two `UnsafeRow`s together, and finally take those bytes to get the 
resulting row." For example, NoPrefixKeyStateEncoder has a comment like this.
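   (A minimal, illustrative sketch of the byte-ordering property this comment relies on, assuming non-negative fixed-size integers; the helper name below is not from the PR:)
   ```scala
   import java.nio.ByteBuffer

   // ByteBuffer writes in big-endian order by default, so the encoded bytes of
   // non-negative fixed-size integers sort the same way the numbers do under
   // RocksDB's byte-wise key comparison.
   def bigEndianBytes(v: Long): Array[Byte] =
     ByteBuffer.allocate(java.lang.Long.BYTES).putLong(v).array()

   val cmp = java.util.Arrays.compareUnsigned(bigEndianBytes(5L), bigEndianBytes(300L))
   // cmp < 0, matching 5 < 300; a little-endian encoding would not preserve this order.
   ```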



##
sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/state/RocksDBStateEncoder.scala:
##
@@ -192,6 +199,233 @@ class PrefixKeyScanStateEncoder(
   override def supportPrefixKeyScan: Boolean = true
 }
 
+/**
+ * RocksDB Key Encoder for UnsafeRow that supports range scan for fixed size 
fields
+ * Note that for range scan, we have to encode the ordering columns using 
BIG_ENDIAN
+ * encoding to allow for scanning keys in sorted order using the byte-wise 
comparison
+ * method that RocksDB uses.
+ *
+ * @param keySchema - schema of the key to be encoded
+ * @param numColsPrefixKey - number of columns to be used for prefix key

Review Comment:
   nit typo



##
sql/core/src/test/scala/org/apache/spark/sql/execution/streaming/state/RocksDBStateStoreSuite.scala:
##
@@ -158,14 +161,360 @@ class RocksDBStateStoreSuite extends 
StateStoreSuiteBase[RocksDBStateStoreProvid
 }
   }
 
+  testWithColumnFamilies("rocksdb range scan validation - invalid num columns",
+TestWithBothChangelogCheckpointingEnabledAndDisabled) { colFamiliesEnabled 
=>
+// zero ordering cols
+val ex1 = intercept[SparkUnsupportedOperationException] {
+  tryWithProviderResource(newStoreProvider(keySchemaWithRangeScan,
+RangeKeyScanStateEncoderSpec(keySchemaWithRangeScan, 0),
+colFamiliesEnabled)) { provider =>
+provider.getStore(0)
+  }
+}
+checkError(
+  ex1,
+  errorClass = "STATE_STORE_INCORRECT_NUM_ORDERING_COLS_FOR_RANGE_SCAN",
+  parameters = Map(
+"numOrderingCols" -> "0"
+  ),
+  matchPVals = true
+)
+
+// ordering cols greater than schema cols
+val ex2 = intercept[SparkUnsupportedOperationException] {
+  tryWithProviderResource(newStoreProvider(keySchemaWithRangeScan,
+RangeKeyScanStateEncoderSpec(keySchemaWithRangeScan, 
keySchemaWithRangeScan.length + 1),
+colFamiliesEnabled)) { provider =>
+provider.getStore(0)
+  }
+}
+checkError(
+  ex2,
+  errorClass = "STATE_STORE_INCORRECT_NUM_ORDERING_COLS_FOR_RANGE_SCAN",
+  parameters = Map(
+"numOrderingCols" -> (keySchemaWithRangeScan.length + 1).toString
+  ),
+  matchPVals = true
+)
+  }
+
+  testWithColumnFamilies("rocksdb range scan validation - variable sized 
columns",
+TestWithBothChangelogCheckpointingEnabledAndDisabled) { colFamiliesEnabled 
=>
+val keySchemaWithVariableSizeCols: StructType = StructType(
+  Seq(StructField("key1", StringType, false), StructField("key2", 
StringType, false)))
+
+val ex = intercept[SparkUnsupportedOperationException] {
+  tryWithProviderResource(newStoreProvider(keySchemaWithVariableSizeCols,
+

Re: [PR] [SPARK-46654][SQL][PYTHON] Make `to_csv` explicitly indicate that it does not support some types of data [spark]

2024-03-20 Thread via GitHub


panbingkun commented on PR #44665:
URL: https://github.com/apache/spark/pull/44665#issuecomment-2010991612

   - Why is the result displayed through `to_csv` inconsistent between Scala and 
Python for this case?
     Because this case is on the `python side`, it ultimately uses 
`GenericArrayData`, which happens to implement the `toString` method, so 
`to_csv` displays readable text.
   
https://github.com/apache/spark/blob/11247d804cd370aaeb88736a706c587e7f5c83b3/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/GenericArrayData.scala#L85
   
   However, on the `scala side`, it ultimately uses `UnsafeArrayData`. 
Unfortunately, that class does not implement `toString` (it falls back to the 
default `Object.toString` method), so `to_csv` ends up displaying the address of 
the object (see the small snippet after this list).
   
   - During the implementation of this PR, it could display `non-standard but 
pretty strings`, as follows:
     https://github.com/apache/spark/pull/44665/commits/9695e975f3299556e7c268918ecd51be7a03c157
     https://github.com/apache/spark/assets/15246973/fd07dc0a-4d61-4663-8631-daff518da278
     The disadvantage is that such output cannot currently be read back through 
`from_csv`.
     If the final result of the discussion is acceptable, it should be easy to 
bring this feature back.
   
   - Another possible compromise is to add a configuration (by default, `to_csv` 
would not support displaying data of type [Array, Map, Struct ...] as 
`non-standard but pretty strings`). If the user enables this configuration, the 
original behavior is restored.
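   (As referenced above, a small illustrative snippet of the `toString` difference; it assumes a Scala Spark build on the classpath and is not part of this PR:)
   ```scala
   import org.apache.spark.sql.catalyst.expressions.UnsafeArrayData
   import org.apache.spark.sql.catalyst.util.GenericArrayData

   val generic = new GenericArrayData(Array(1, 2, 3))
   println(generic)   // "[1,2,3]" - GenericArrayData overrides toString

   val unsafe = UnsafeArrayData.fromPrimitiveArray(Array(1, 2, 3))
   println(unsafe)    // something like "UnsafeArrayData@1a2b3c4d" - default Object.toString
   ```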


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-47493][SQL] Disable spark.sql.parquet.inferTimestampNTZ.enabled by default [spark]

2024-03-20 Thread via GitHub


RussellSpitzer commented on PR #45621:
URL: https://github.com/apache/spark/pull/45621#issuecomment-2010991405

   I should add that we have custom readers and writers for Parquet, so this 
parameter doesn't affect us.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-47493][SQL] Disable spark.sql.parquet.inferTimestampNTZ.enabled by default [spark]

2024-03-20 Thread via GitHub


RussellSpitzer commented on PR #45621:
URL: https://github.com/apache/spark/pull/45621#issuecomment-2010990874

   I don't think this is an issue for Iceberg; we would always store values the 
same way and then mark the difference in type in the schema. Readers would then get 
values based on the table schema and not what was in the file.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-47482] Add HiveDialect to sql module [spark]

2024-03-20 Thread via GitHub


xleoken commented on code in PR #45609:
URL: https://github.com/apache/spark/pull/45609#discussion_r1533099686


##
sql/core/src/main/scala/org/apache/spark/sql/jdbc/HiveDialect.scala:
##
@@ -0,0 +1,32 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.jdbc
+
+import java.util.Locale
+
+import org.apache.spark.sql.catalyst.SQLConfHelper
+
+private case object HiveDialect extends JdbcDialect with SQLConfHelper {
+
+  override def canHandle(url : String): Boolean =
+url.toLowerCase(Locale.ROOT).startsWith("jdbc:hive2")

Review Comment:
   hi @dongjoon-hyun, thanks for your review. 
   Your considerations are correct, but this patch is applicable to both the `Hive 
Thrift Server` and the `Spark Thrift Server`.
   
   > You want to achieve to introduce Hive-specific syntax via this HiveDialect 
instead of Spark Thrift Server, right?
   
   Actually, it's not. I used `sbin/start-thriftserver.sh` in the production 
environment.
   
   > I'm not sure HiveDialect is a valid name in Apache Spark community
   
   OK, `HiveDialect` seems better for `jdbc:hive2`. In the future, if we encounter 
Hive-specific or Spark SQL-specific syntax issues, we can distinguish between Hive 
and Spark in the specific methods.
   
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-47493][SQL] Disable spark.sql.parquet.inferTimestampNTZ.enabled by default [spark]

2024-03-20 Thread via GitHub


gengliangwang commented on PR #45621:
URL: https://github.com/apache/spark/pull/45621#issuecomment-201099

   @viirya Thanks, could you provide more details about the iceberg workaround? 
   
   As for Delta, users have to either disable 
`spark.sql.parquet.inferTimestampNTZ.enabled`, or enable TimestampNTZ feature 
via
   ```
   ALTER TABLE table_name SET TBLPROPERTIES ('delta.feature.timestampNtz' = 'supported')
   ```
   
   After enabling the feature, the table won't be able to be read by older Delta 
versions. So disabling `spark.sql.parquet.inferTimestampNTZ.enabled` is a better 
solution in such a case.
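   (For reference, that workaround can be expressed as a one-line session setting, assuming a `SparkSession` named `spark`; the config name is from this thread, and the snippet itself is only an illustration:)
   ```scala
   // Disable NTZ inference for the current session instead of changing the table.
   spark.conf.set("spark.sql.parquet.inferTimestampNTZ.enabled", "false")
   ```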


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SS][SPARK-47363] Initial State without state reader implementation for State API v2. [spark]

2024-03-20 Thread via GitHub


anishshri-db commented on code in PR #45467:
URL: https://github.com/apache/spark/pull/45467#discussion_r1533102287


##
sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/TransformWithStateExec.scala:
##
@@ -271,57 +320,111 @@ case class TransformWithStateExec(
   case _ =>
 }
 
-if (isStreaming) {
-  child.execute().mapPartitionsWithStateStore[InternalRow](
+if (hasInitialState) {
+  val storeConf = new StateStoreConf(session.sqlContext.sessionState.conf)
+  val hadoopConfBroadcast = sparkContext.broadcast(

Review Comment:
   Why do we need to do this ?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-47482] Add HiveDialect to sql module [spark]

2024-03-20 Thread via GitHub


xleoken commented on code in PR #45609:
URL: https://github.com/apache/spark/pull/45609#discussion_r1533099686


##
sql/core/src/main/scala/org/apache/spark/sql/jdbc/HiveDialect.scala:
##
@@ -0,0 +1,32 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.jdbc
+
+import java.util.Locale
+
+import org.apache.spark.sql.catalyst.SQLConfHelper
+
+private case object HiveDialect extends JdbcDialect with SQLConfHelper {
+
+  override def canHandle(url : String): Boolean =
+url.toLowerCase(Locale.ROOT).startsWith("jdbc:hive2")

Review Comment:
   hi @dongjoon-hyun, thanks for your review. 
   Your considerations are correct, but this patch is applicable to both the `Hive 
Thrift Server` and the `Spark Thrift Server`.
   
   > You want to achieve to introduce Hive-specific syntax via this HiveDialect 
instead of Spark Thrift Server, right?
   Actually, it's not. I used `sbin/start-thriftserver.sh` in the production 
environment.
   
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SS][SPARK-47363] Initial State without state reader implementation for State API v2. [spark]

2024-03-20 Thread via GitHub


anishshri-db commented on code in PR #45467:
URL: https://github.com/apache/spark/pull/45467#discussion_r1533098209


##
common/utils/src/main/resources/error/error-classes.json:
##
@@ -3503,6 +3503,12 @@
 ],
 "sqlState" : "42802"
   },
+  "STATEFUL_PROCESSOR_CANNOT_REINITIALIZE_STATE_ON_KEY" : {
+"message" : [
+  "Cannot re-initialize state on the same grouping key during initial 
state handling for stateful processor. Invalid grouping key = ."

Review Comment:
   nit: remove the space before/after equal
   
   ```
   Invalid grouping key=
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-47493][SQL] Disable spark.sql.parquet.inferTimestampNTZ.enabled by default [spark]

2024-03-20 Thread via GitHub


viirya commented on PR #45621:
URL: https://github.com/apache/spark/pull/45621#issuecomment-2010971958

   > The result table can't be read by older version engines anymore due to a 
lack of TimestampNTZ support. The behavior change happens silently and it is 
hard for users to figure out how to fix it.
   
   Hmm, I recall that in Iceberg there seem to be some configs that can be used to 
read such fields as a workaround? I'm not sure about Delta on this.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [WIP][SPARK-41811][PYTHON][CONNECT] Implement `SQLStringFormatter` with `WithRelations` [spark]

2024-03-20 Thread via GitHub


xinrong-meng commented on code in PR #45614:
URL: https://github.com/apache/spark/pull/45614#discussion_r1532525957


##
connector/connect/common/src/main/protobuf/spark/connect/relations.proto:
##
@@ -131,6 +132,23 @@ message SQL {
   repeated Expression pos_arguments = 5;
 }
 
+// Relation of type [[WithRelations]].
+//
+// This relation contains a root plan, and one or more references that are 
used by the root plan.
+// There are two ways of referencing a relation, by name (through a subquery 
alias), or by plan_id
+// (using RelationCommon.plan_id).
+//
+// This relation can be used to implement CTEs, describe DAGs, or to reduce 
tree depth.
+message WithRelations {

Review Comment:
   May I ask which APIs are associated with the relation?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [WIP][SPARK-46349] Prevent nested SortOrder instances in SortOrder expressions [spark]

2024-03-20 Thread via GitHub


github-actions[bot] closed pull request #44283: [WIP][SPARK-46349] Prevent 
nested SortOrder instances in SortOrder expressions
URL: https://github.com/apache/spark/pull/44283


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-47493][SQL] Disable spark.sql.parquet.inferTimestampNTZ.enabled by default [spark]

2024-03-20 Thread via GitHub


gengliangwang commented on PR #45621:
URL: https://github.com/apache/spark/pull/45621#issuecomment-2010929308

   @dongjoon-hyun yes, it is a tough decision. I am now marking this PR as WIP 
and will decide whether we should continue on this.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-47475][CORE] Fix Executors Scaling Issues Caused by Jar Download Under K8s Cluster Mode [spark]

2024-03-20 Thread via GitHub


leletan commented on code in PR #45607:
URL: https://github.com/apache/spark/pull/45607#discussion_r1533047660


##
core/src/main/scala/org/apache/spark/internal/config/package.scala:
##
@@ -1458,6 +1458,18 @@ package object config {
   .doubleConf
   .createWithDefault(1.5)
 
+  private[spark] val KUBERNETES_AVOID_JAR_DOWNLOAD_SCHEMES =
+ConfigBuilder("spark.kubernetes.jars.avoidDownloadSchemes")
+  .doc("Comma-separated list of schemes for which jars will not be 
downloaded to the " +
+"driver local disk prior to be distributed to executors, only for 
kubernetes deployment. " +
+"For use in cases when the jars are big and executor counts are high, 
" +
+"concurrent download causes network saturation and timeouts. " +
+"Wildcard '*' is denoted to not downloading jars for any the schemes.")
+  .version("2.3.0")

Review Comment:
   Good catch.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SS][SPARK-47363] Initial State without state reader implementation for State API v2. [spark]

2024-03-20 Thread via GitHub


jingz-db commented on code in PR #45467:
URL: https://github.com/apache/spark/pull/45467#discussion_r1531068752


##
sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/TransformWithStateExec.scala:
##
@@ -127,13 +152,53 @@ case class TransformWithStateExec(
 mappedIterator
   }
 
+  private def processInitialStateRows(
+  keyRow: UnsafeRow,
+  initStateIter: Iterator[InternalRow]): Unit = {
+val getKeyObj =
+  ObjectOperator.deserializeRowToObject(keyDeserializer, 
groupingAttributes)
+
+val getStateValueObj =
+  ObjectOperator.deserializeRowToObject(initialStateDeserializer, 
initialStateDataAttrs)
+
+val keyObj = getKeyObj(keyRow) // convert key to objects
+ImplicitGroupingKeyTracker.setImplicitKey(keyObj)
+val initStateObjIter = initStateIter.map(getStateValueObj.apply)
+
+initStateObjIter.foreach { initState =>
+  statefulProcessor
+.asInstanceOf[StatefulProcessorWithInitialState[Any, Any, Any, Any]]
+.handleInitialState(keyObj, initState)
+}
+ImplicitGroupingKeyTracker.removeImplicitKey()
+  }
+
   private def processNewData(dataIter: Iterator[InternalRow]): 
Iterator[InternalRow] = {
 val groupedIter = GroupedIterator(dataIter, groupingAttributes, 
child.output)
 groupedIter.flatMap { case (keyRow, valueRowIter) =>
   val keyUnsafeRow = keyRow.asInstanceOf[UnsafeRow]
   handleInputRows(keyUnsafeRow, valueRowIter)
 }
   }
+// TODO double check this
+  private def processNewDataWithInitialState(
+  dataIter: Iterator[InternalRow],
+  initStateIter: Iterator[InternalRow]): Iterator[InternalRow] = {
+
+val groupedChildDataIter = GroupedIterator(dataIter, groupingAttributes, 
child.output)
+val groupedInitialStateIter =
+  GroupedIterator(initStateIter, initialStateGroupingAttrs, 
initialState.output)
+
+// Create a CoGroupedIterator that will group the two iterators together 
for every key group.
+new CoGroupedIterator(
+  groupedChildDataIter, groupedInitialStateIter, 
groupingAttributes).flatMap {
+  case (keyRow, valueRowIter, initialStateRowIter) =>
+// TODO in design doc: trying to re-initialize state for the same
+// grouping key will result in an error?

Review Comment:
   Not sure if I understand the scenario correctly: does it mean the user tries to 
reassign state variable values in the user-defined function `handleInitialState()`?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-47256][SQL] Assign names to error classes _LEGACY_ERROR_TEMP_102[4-7] [spark]

2024-03-20 Thread via GitHub


davidm-db closed pull request #45622: [SPARK-47256][SQL] Assign names to error 
classes _LEGACY_ERROR_TEMP_102[4-7]
URL: https://github.com/apache/spark/pull/45622


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-47494][Doc] Add migration doc for the behavior change of Parquet timestamp inference since Spark 3.3 [spark]

2024-03-20 Thread via GitHub


dongjoon-hyun commented on PR #45623:
URL: https://github.com/apache/spark/pull/45623#issuecomment-2010746381

   Merged to master/3.5/3.4.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-47494][Doc] Add migration doc for the behavior change of Parquet timestamp inference since Spark 3.3 [spark]

2024-03-20 Thread via GitHub


dongjoon-hyun closed pull request #45623: [SPARK-47494][Doc] Add migration doc 
for the behavior change of Parquet timestamp inference since Spark 3.3
URL: https://github.com/apache/spark/pull/45623


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-47493][SQL] Disable spark.sql.parquet.inferTimestampNTZ.enabled by default [spark]

2024-03-20 Thread via GitHub


gengliangwang commented on code in PR #45621:
URL: https://github.com/apache/spark/pull/45621#discussion_r1532951714


##
docs/sql-migration-guide.md:
##
@@ -121,6 +122,8 @@ license: |
   - Since Spark 3.3, the `unbase64` function throws error for a malformed 
`str` input. Use `try_to_binary(, 'base64')` to tolerate malformed input 
and return NULL instead. In Spark 3.2 and earlier, the `unbase64` function 
returns a best-efforts result for a malformed `str` input.
 
   - Since Spark 3.3.1 and 3.2.3, for `SELECT ... GROUP BY a GROUPING SETS 
(b)`-style SQL statements, `grouping__id` returns different values from Apache 
Spark 3.2.0, 3.2.1, 3.2.2, and 3.3.0. It computes based on user-given group-by 
expressions plus grouping set columns. To restore the behavior before 3.3.1 and 
3.2.3, you can set `spark.sql.legacy.groupingIdWithAppendedUserGroupBy`. For 
details, see [SPARK-40218](https://issues.apache.org/jira/browse/SPARK-40218) 
and [SPARK-40562](https://issues.apache.org/jira/browse/SPARK-40562).
+
+  - In Spark 3.3/3.4/3.5 releases, when reading Parquet files that were not 
produced by Spark, Parquet timestamp columns with annotation `isAdjustedToUTC = 
false` are inferred as TIMESTAMP_NTZ type during schema inference. In Spark 3.2 
and earlier, these columns are inferred as TIMESTAMP type. To restore the 
behavior before Spark 3.3, you can set 
`spark.sql.parquet.inferTimestampNTZ.enabled` to `false`. Note that this is a 
behavior change, and it will be disabled after Spark 4.0 release. 

Review Comment:
   Sure, created https://github.com/apache/spark/pull/45623 for this change.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[PR] [SPARK-47494][Doc] Add migration doc for the behavior change of Parquet timestamp inference since Spark 3.3 [spark]

2024-03-20 Thread via GitHub


gengliangwang opened a new pull request, #45623:
URL: https://github.com/apache/spark/pull/45623

   
   
   ### What changes were proposed in this pull request?
   
   Add migration doc for the behavior change of Parquet timestamp inference 
since Spark 3.3
   
   ### Why are the changes needed?
   
   Show the behavior change to users.
   
   ### Does this PR introduce _any_ user-facing change?
   
   No
   
   ### How was this patch tested?
   
   It's just doc change
   
   ### Was this patch authored or co-authored using generative AI tooling?
   
   Yes, there are some doc suggestions from Copilot in 
docs/sql-migration-guide.md


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-47490][SS] Fix RocksDB Logger constructor use to avoid deprecation warning [spark]

2024-03-20 Thread via GitHub


dongjoon-hyun closed pull request #45616: [SPARK-47490][SS] Fix RocksDB Logger 
constructor use to avoid deprecation warning
URL: https://github.com/apache/spark/pull/45616


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-47486][CONNECT] Remove unused private `ArrowDeserializers.getString` method [spark]

2024-03-20 Thread via GitHub


dongjoon-hyun closed pull request #45610: [SPARK-47486][CONNECT] Remove unused 
private `ArrowDeserializers.getString` method
URL: https://github.com/apache/spark/pull/45610


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-47475][CORE] Fix Executors Scaling Issues Caused by Jar Download Under K8s Cluster Mode [spark]

2024-03-20 Thread via GitHub


dongjoon-hyun commented on code in PR #45607:
URL: https://github.com/apache/spark/pull/45607#discussion_r1532943327


##
core/src/main/scala/org/apache/spark/internal/config/package.scala:
##
@@ -1458,6 +1458,18 @@ package object config {
   .doubleConf
   .createWithDefault(1.5)
 
+  private[spark] val KUBERNETES_AVOID_JAR_DOWNLOAD_SCHEMES =
+ConfigBuilder("spark.kubernetes.jars.avoidDownloadSchemes")
+  .doc("Comma-separated list of schemes for which jars will not be 
downloaded to the " +
+"driver local disk prior to be distributed to executors, only for 
kubernetes deployment. " +
+"For use in cases when the jars are big and executor counts are high, 
" +
+"concurrent download causes network saturation and timeouts. " +
+"Wildcard '*' is denoted to not downloading jars for any the schemes.")
+  .version("2.3.0")

Review Comment:
   This should be `4.0.0`.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-47475][CORE] Fix Executors Scaling Issues Caused by Jar Download Under K8s Cluster Mode [spark]

2024-03-20 Thread via GitHub


dongjoon-hyun commented on PR #45607:
URL: https://github.com/apache/spark/pull/45607#issuecomment-2010726489

   Ack, @dbtsai .


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-47493][SQL] Disable spark.sql.parquet.inferTimestampNTZ.enabled by default [spark]

2024-03-20 Thread via GitHub


dongjoon-hyun commented on PR #45621:
URL: https://github.com/apache/spark/pull/45621#issuecomment-2010725475

   To @gengliangwang , I believe it would be better if you could send a short 
email to dev@spark in order not to surprise anyone.
   
   > I have been considering this for a while because of reported Spark upgrade 
failures. I am open to any suggestions. cc @sadikovi as well.
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-47493][SQL] Disable spark.sql.parquet.inferTimestampNTZ.enabled by default [spark]

2024-03-20 Thread via GitHub


dongjoon-hyun commented on PR #45621:
URL: https://github.com/apache/spark/pull/45621#issuecomment-2010724234

   I understand the exposed surface is limited as mentioned by @gengliangwang , 
but any Parquet change causes significant impact to all Spark users because 
Apache Parquet is the default format. 
   
   cc @mridulm , @sunchao , @viirya , @peter-toth , @RussellSpitzer , too.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-47493][SQL] Disable spark.sql.parquet.inferTimestampNTZ.enabled by default [spark]

2024-03-20 Thread via GitHub


dongjoon-hyun commented on code in PR #45621:
URL: https://github.com/apache/spark/pull/45621#discussion_r1532937164


##
docs/sql-migration-guide.md:
##
@@ -42,6 +42,7 @@ license: |
 - Since Spark 4.0, the function `to_csv` no longer supports input with the 
data type `STRUCT`, `ARRAY`, `MAP`, `VARIANT` and `BINARY` (because the `CSV 
specification` does not have standards for these data types and cannot be read 
back using `from_csv`), Spark will throw 
`DATATYPE_MISMATCH.UNSUPPORTED_INPUT_TYPE` exception.
 - Since Spark 4.0, JDBC read option `preferTimestampNTZ=true` will not convert 
Postgres TIMESTAMP WITH TIME ZONE and TIME WITH TIME ZONE data types to 
TimestampNTZType, which is available in Spark 3.5. 
 - Since Spark 4.0, JDBC read option `preferTimestampNTZ=true` will not convert 
MySQL TIMESTAMP to TimestampNTZType, which is available in Spark 3.5. MySQL 
DATETIME is not affected.
+- Since Spark 4.0, the SQL config 
`spark.sql.parquet.inferTimestampNTZ.enabled` is turned off by default. 
Consequently, when reading Parquet files that were not produced by Spark, the 
Parquet reader will no longer automatically recognize data as the TIMESTAMP_NTZ 
data type. This change ensures backward compatibility with releases of Spark 
version 3.2 and earlier. It also aligns the behavior of schema inference for 
Parquet files with that of other data sources such as CSV, JSON, ORC, and JDBC, 
enhancing consistency across the data sources. To revert to the previous 
behavior where TIMESTAMP_NTZ types were inferred, set 
`spark.sql.parquet.inferTimestampNTZ.enabled` to true.

Review Comment:
   We need a discussion about this breaking change in Apache Spark 4.0.0, 
@gengliangwang .



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-47493][SQL] Disable spark.sql.parquet.inferTimestampNTZ.enabled by default [spark]

2024-03-20 Thread via GitHub


dongjoon-hyun commented on code in PR #45621:
URL: https://github.com/apache/spark/pull/45621#discussion_r1532935416


##
docs/sql-migration-guide.md:
##
@@ -121,6 +122,8 @@ license: |
   - Since Spark 3.3, the `unbase64` function throws error for a malformed 
`str` input. Use `try_to_binary(, 'base64')` to tolerate malformed input 
and return NULL instead. In Spark 3.2 and earlier, the `unbase64` function 
returns a best-efforts result for a malformed `str` input.
 
   - Since Spark 3.3.1 and 3.2.3, for `SELECT ... GROUP BY a GROUPING SETS 
(b)`-style SQL statements, `grouping__id` returns different values from Apache 
Spark 3.2.0, 3.2.1, 3.2.2, and 3.3.0. It computes based on user-given group-by 
expressions plus grouping set columns. To restore the behavior before 3.3.1 and 
3.2.3, you can set `spark.sql.legacy.groupingIdWithAppendedUserGroupBy`. For 
details, see [SPARK-40218](https://issues.apache.org/jira/browse/SPARK-40218) 
and [SPARK-40562](https://issues.apache.org/jira/browse/SPARK-40562).
+
+  - In Spark 3.3/3.4/3.5 releases, when reading Parquet files that were not 
produced by Spark, Parquet timestamp columns with annotation `isAdjustedToUTC = 
false` are inferred as TIMESTAMP_NTZ type during schema inference. In Spark 3.2 
and earlier, these columns are inferred as TIMESTAMP type. To restore the 
behavior before Spark 3.3, you can set 
`spark.sql.parquet.inferTimestampNTZ.enabled` to `false`. Note that this is a 
behavior change, and it will be disabled after Spark 4.0 release. 

Review Comment:
   Well, could you handle this as an independent JIRA first, @gengliangwang ?






Re: [PR] [SPARK-47493][SQL] Disable spark.sql.parquet.inferTimestampNTZ.enabled by default [spark]

2024-03-20 Thread via GitHub


gengliangwang commented on code in PR #45621:
URL: https://github.com/apache/spark/pull/45621#discussion_r1532932026


##
docs/sql-migration-guide.md:
##
@@ -121,6 +122,8 @@ license: |
   - Since Spark 3.3, the `unbase64` function throws error for a malformed 
`str` input. Use `try_to_binary(<str>, 'base64')` to tolerate malformed input 
and return NULL instead. In Spark 3.2 and earlier, the `unbase64` function 
returns a best-efforts result for a malformed `str` input.
 
   - Since Spark 3.3.1 and 3.2.3, for `SELECT ... GROUP BY a GROUPING SETS 
(b)`-style SQL statements, `grouping__id` returns different values from Apache 
Spark 3.2.0, 3.2.1, 3.2.2, and 3.3.0. It computes based on user-given group-by 
expressions plus grouping set columns. To restore the behavior before 3.3.1 and 
3.2.3, you can set `spark.sql.legacy.groupingIdWithAppendedUserGroupBy`. For 
details, see [SPARK-40218](https://issues.apache.org/jira/browse/SPARK-40218) 
and [SPARK-40562](https://issues.apache.org/jira/browse/SPARK-40562).
+
+  - In Spark 3.3/3.4/3.5 releases, when reading Parquet files that were not 
produced by Spark, Parquet timestamp columns with annotation `isAdjustedToUTC = 
false` are inferred as TIMESTAMP_NTZ type during schema inference. In Spark 3.2 
and earlier, these columns are inferred as TIMESTAMP type. To restore the 
behavior before Spark 3.3, you can set 
`spark.sql.parquet.inferTimestampNTZ.enabled` to `false`. Note that this is a 
behavior change, and it will be disabled after Spark 4.0 release. 

Review Comment:
   Another option is to change the conf in the latest 3.3/3.4/3.5 releases.
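
   Related illustration: a minimal sketch of how a job on 3.3/3.4/3.5 can
already pin the option explicitly, so its behavior does not depend on whichever
default a given release ships; the application name and master are placeholders:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("pin-ntz-inference")   // placeholder
  .master("local[*]")             // placeholder
  // Keep the current 3.3-3.5 inference behavior, or set "false" to adopt the
  // proposed Spark 4.0 default ahead of time.
  .config("spark.sql.parquet.inferTimestampNTZ.enabled", "true")
  .getOrCreate()
```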






Re: [PR] [SPARK-47475][CORE] Fix Executors Scaling Issues Caused by Jar Download Under K8s Cluster Mode [spark]

2024-03-20 Thread via GitHub


dbtsai commented on PR #45607:
URL: https://github.com/apache/spark/pull/45607#issuecomment-2010712682

   cc @dongjoon-hyun 





Re: [PR] [SPARK-47493][SQL] Disable spark.sql.parquet.inferTimestampNTZ.enabled by default [spark]

2024-03-20 Thread via GitHub


gengliangwang commented on PR #45621:
URL: https://github.com/apache/spark/pull/45621#issuecomment-2010712125

   I have been considering this for a while because of reported Spark upgrade 
failures. I am open to any suggestions.
   cc @sadikovi as well.
   





[PR] [SPARK-47493][SQL] Disable spark.sql.parquet.inferTimestampNTZ.enabled by default [spark]

2024-03-20 Thread via GitHub


gengliangwang opened a new pull request, #45621:
URL: https://github.com/apache/spark/pull/45621

   
   
   ### What changes were proposed in this pull request?
   
   Since Spark 4.0, the SQL config 
`spark.sql.parquet.inferTimestampNTZ.enabled` is turned off by default. 
Consequently, when reading Parquet files that were not produced by Spark, the 
Parquet reader will no longer automatically recognize data as the TIMESTAMP_NTZ 
data type. This change ensures backward compatibility with releases of Spark 
version 3.2 and earlier. It also aligns the behavior of schema inference for 
Parquet files with that of other data sources such as CSV, JSON, ORC, and JDBC, 
enhancing consistency across the data sources. To revert to the previous 
behavior where TIMESTAMP_NTZ types were inferred, set 
`spark.sql.parquet.inferTimestampNTZ.enabled` to true.
   
   Note: With https://issues.apache.org/jira/browse/SPARK-47368 and 
https://issues.apache.org/jira/browse/SPARK-47447, this behavior change won't 
break the current workloads which are using Spark 3.3/3.4/3.5.
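
   A hedged walk-through of the inference difference described above; the file
path and column name are made up, and an existing `SparkSession` named `spark`
is assumed:

```scala
// "/tmp/vendor.parquet" stands for a Parquet file not written by Spark whose
// timestamp column "event_time" carries isAdjustedToUTC = false.
spark.conf.set("spark.sql.parquet.inferTimestampNTZ.enabled", "false")  // proposed 4.0 default
spark.read.parquet("/tmp/vendor.parquet").schema("event_time").dataType
// -> TimestampType, matching Spark 3.2 and earlier

spark.conf.set("spark.sql.parquet.inferTimestampNTZ.enabled", "true")   // pre-4.0 default
spark.read.parquet("/tmp/vendor.parquet").schema("event_time").dataType
// -> TimestampNTZType
```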
   
   
   ### Why are the changes needed?
   
   1. Consistency with the behavior of CSV, JSON, ORC, and JDBC data sources. 
This also makes the schema inference behavior simpler.
   2. When using `insert overwrite` or `replace table` over external parquet 
files into a Delta or Iceberg table, the result schema is changed from 
TimestampType to TimestampNTZType. The result table can no longer be read by 
older engine versions due to the lack of TimestampNTZ support. The behavior 
change happens silently and it is hard for users to figure out how to fix it.
   
   
   ### Does this PR introduce _any_ user-facing change?
   
   Yes, since Spark 4.0, the SQL config 
`spark.sql.parquet.inferTimestampNTZ.enabled` is turned off by default. 
Consequently, when reading Parquet files that were not produced by Spark, the 
Parquet reader will no longer automatically recognize data as the TIMESTAMP_NTZ 
data type.
   
   With https://issues.apache.org/jira/browse/SPARK-47368 and 
https://issues.apache.org/jira/browse/SPARK-47447, this behavior change won't 
break the current workloads which are using Spark 3.3/3.4/3.5.
   
   ### How was this patch tested?
   
   Existing UTs
   
   ### Was this patch authored or co-authored using generative AI tooling?
   
   Yes, there are some doc suggestions from Copilot in 
`docs/sql-migration-guide.md`





Re: [PR] Bump black from 23.9.1 to 24.3.0 in /dev [spark]

2024-03-20 Thread via GitHub


dependabot[bot] commented on PR #45617:
URL: https://github.com/apache/spark/pull/45617#issuecomment-2010663402

   OK, I won't notify you about black again, unless you re-open this PR.





Re: [PR] Bump black from 23.9.1 to 24.3.0 in /dev [spark]

2024-03-20 Thread via GitHub


dependabot[bot] closed pull request #45617: Bump black from 23.9.1 to 24.3.0 in 
/dev
URL: https://github.com/apache/spark/pull/45617





Re: [PR] Bump black from 23.9.1 to 24.3.0 in /dev [spark]

2024-03-20 Thread via GitHub


dongjoon-hyun commented on PR #45617:
URL: https://github.com/apache/spark/pull/45617#issuecomment-2010663347

   @dependabot ignore this dependency





Re: [PR] [WIP][SPARK-47492] Widen whitespace rules in lexer [spark]

2024-03-20 Thread via GitHub


dongjoon-hyun commented on code in PR #45620:
URL: https://github.com/apache/spark/pull/45620#discussion_r1532892591


##
sql/core/src/test/scala/org/apache/spark/sql/execution/command/ParserSuite.scala:
##
@@ -0,0 +1,51 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.execution.command
+
+import org.apache.spark.sql.catalyst.analysis.AnalysisTest
+import org.apache.spark.sql.execution.SparkSqlParser
+import org.apache.spark.sql.test.SharedSparkSession
+
+class ParserSuite extends AnalysisTest with SharedSparkSession {
+  private lazy val parser = new SparkSqlParser()
+
+  test("verify whitespace handling") {

Review Comment:
   Please move this to the existing `PlanParserSuite.scala` (catalyst module) or 
`SparkSqlParserSuite` (sql/core module).
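
   For context, a rough, hypothetical sketch of what such a check could look
like once folded into `SparkSqlParserSuite`; the whitespace characters and
assertions are illustrative guesses, not taken from this PR:

```scala
import org.apache.spark.sql.execution.SparkSqlParser

// Inside the suite this would sit in a test("...") block; shown standalone here.
val parser = new SparkSqlParser()
// Separators to exercise; the PR may widen this set with further Unicode spaces.
Seq(" ", "\t", "\n").foreach { ws =>
  val plan = parser.parsePlan(s"SELECT${ws}1")
  assert(plan != null, s"failed to parse with separator code point(s) ${ws.map(_.toInt)}")
}
```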






Re: [PR] [WIP][SPARK-47492] Widen whitespace rules in lexer [spark]

2024-03-20 Thread via GitHub


dongjoon-hyun commented on code in PR #45620:
URL: https://github.com/apache/spark/pull/45620#discussion_r1532892591


##
sql/core/src/test/scala/org/apache/spark/sql/execution/command/ParserSuite.scala:
##
@@ -0,0 +1,51 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.execution.command
+
+import org.apache.spark.sql.catalyst.analysis.AnalysisTest
+import org.apache.spark.sql.execution.SparkSqlParser
+import org.apache.spark.sql.test.SharedSparkSession
+
+class ParserSuite extends AnalysisTest with SharedSparkSession {
+  private lazy val parser = new SparkSqlParser()
+
+  test("verify whitespace handling") {

Review Comment:
   Please move this to the existing `SparkSqlParserSuite`.






Re: [PR] [WIP][SPARK-47492] Widen whitespace rules in lexer [spark]

2024-03-20 Thread via GitHub


dongjoon-hyun commented on code in PR #45620:
URL: https://github.com/apache/spark/pull/45620#discussion_r1532892353


##
sql/core/src/test/scala/org/apache/spark/sql/execution/command/ParserSuite.scala:
##
@@ -0,0 +1,51 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.execution.command
+
+import org.apache.spark.sql.catalyst.analysis.AnalysisTest
+import org.apache.spark.sql.execution.SparkSqlParser
+import org.apache.spark.sql.test.SharedSparkSession
+
+class ParserSuite extends AnalysisTest with SharedSparkSession {

Review Comment:
   It's too much for a single test case.






Re: [PR] [WIP][SPARK-47492] Widen whitespace rules in lexer [spark]

2024-03-20 Thread via GitHub


dongjoon-hyun commented on code in PR #45620:
URL: https://github.com/apache/spark/pull/45620#discussion_r1532891606


##
sql/core/src/test/scala/org/apache/spark/sql/execution/command/ParserSuite.scala:
##
@@ -0,0 +1,51 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.execution.command

Review Comment:
   Why is this under `command` package?






Re: [PR] [SPARK-27950][DSTREAMS][Kinesis] Configurable dynamodb url so kinesis-asl works with localstack [spark]

2024-03-20 Thread via GitHub


dongjoon-hyun commented on PR #45619:
URL: https://github.com/apache/spark/pull/45619#issuecomment-2010655319

   Could you set `ENABLE_KINESIS_TESTS=1` and run the tests in your 
environment? It would be great if you could paste your results into the PR 
description.





[PR] [WIP][SPARK-47492] Widen whitespace rules in lexer [spark]

2024-03-20 Thread via GitHub


srielau opened a new pull request, #45620:
URL: https://github.com/apache/spark/pull/45620

   
   
   ### What changes were proposed in this pull request?
   
   
   
   ### Why are the changes needed?
   
   
   
   ### Does this PR introduce _any_ user-facing change?
   
   
   
   ### How was this patch tested?
   
   
   
   ### Was this patch authored or co-authored using generative AI tooling?
   
   




