Re: [PR] [SPARK-47144][CONNECT][SQL] Fix Spark Connect collation error by adding collateId protobuf field [spark]

2024-02-26 Thread via GitHub


dbatomic commented on code in PR #45233:
URL: https://github.com/apache/spark/pull/45233#discussion_r1503780793


##
connector/connect/common/src/main/protobuf/spark/connect/types.proto:
##
@@ -100,6 +100,7 @@ message DataType {
 
   message String {
 uint32 type_variation_reference = 1;
+optional uint32 collation_id = 2;

Review Comment:
   Yes, StringType will always have a collation_id, so making it required sounds good.
   
   The thing that worries me here is backwards compatibility. E.g., what happens if the other side is on an old version without collation support?
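   
   For illustration, a minimal sketch of how the server side could decode this defensively (a hypothetical helper, not this PR's code; it assumes a `StringType(collationId)` constructor and that collation id 0 denotes the default collation, since proto3 scalar fields read back as 0 when the sender never set them):
   
   ```scala
   import org.apache.spark.sql.types.StringType
   
   // Hypothetical decoder: an old client without collation support never sets
   // collation_id, so the getter returns 0 and we fall back to plain StringType.
   def toCatalystStringType(collationId: Int): StringType =
     if (collationId == 0) StringType     // old client or default collation
     else StringType(collationId)         // collation-aware client
   ```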






[PR] [SPARK-41873][PYTHON][CONNECT][TESTS] Enable `DataFrameParityTests.test_pandas_api` [spark]

2024-02-26 Thread via GitHub


zhengruifeng opened a new pull request, #45279:
URL: https://github.com/apache/spark/pull/45279

   ### What changes were proposed in this pull request?
   Enable `DataFrameParityTests.test_pandas_api`
   
   
   ### Why are the changes needed?
   For test parity; this method has already been implemented.
   
   
   ### Does this PR introduce _any_ user-facing change?
   no, test-only
   
   
   ### How was this patch tested?
   Enabled the unit test.
   
   
   ### Was this patch authored or co-authored using generative AI tooling?
   no





Re: [PR] [SPARK-39771][CORE] Add a warning msg in `Dependency` when a too large number of shuffle blocks is to be created. [spark]

2024-02-26 Thread via GitHub


pan3793 commented on code in PR #45266:
URL: https://github.com/apache/spark/pull/45266#discussion_r1503740095


##
core/src/main/scala/org/apache/spark/Dependency.scala:
##
@@ -206,6 +206,21 @@ class ShuffleDependency[K: ClassTag, V: ClassTag, C: 
ClassTag](
 finalizeTask = Option(task)
   }
 
+  // Set the threshold to 1 billion which represents approximately 1GB of memory
+  // allocated to map output statuses.

Review Comment:
   Is the assumption based on `CompressedMapStatus`? When the shuffle partition number exceeds `spark.shuffle.minNumPartitionsToHighlyCompress` (default is 2000), `HighlyCompressedMapStatus` is used; is the estimate still correct then?
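   
   For context, a back-of-the-envelope sketch of where the ~1 GB figure could come from under a `CompressedMapStatus`-style assumption of roughly one byte per shuffle block (the per-block size is an assumption, not something stated in the PR):
   
   ```scala
   // The warning threshold of 1L << 30 blocks corresponds to roughly 1 GiB
   // of map output status memory at one byte per (map, reduce) pair.
   val thresholdBlocks = 1L << 30            // 1073741824 shuffle blocks
   val approxBytes = thresholdBlocks * 1L    // ~1 GiB at one byte per block
   println(approxBytes == (1L << 30))        // true
   ```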






Re: [PR] [SPARK-46913][SS] Add support for processing/event time based timers with transformWithState operator [spark]

2024-02-26 Thread via GitHub


anishshri-db commented on code in PR #45051:
URL: https://github.com/apache/spark/pull/45051#discussion_r1503733977


##
sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/TransformWithStateExec.scala:
##
@@ -69,8 +70,20 @@ case class TransformWithStateExec(
 
   override def shortName: String = "transformWithStateExec"
 
-  // TODO: update this to run no-data batches when timer support is added
-  override def shouldRunAnotherBatch(newInputWatermark: Long): Boolean = false
+  override def shouldRunAnotherBatch(newInputWatermark: Long): Boolean = {
+timeoutMode match {
+  // TODO: check if we can return true only if actual timers are registered

Review Comment:
   Yea, but at the time this is called, I believe we don't have the storeRDD or store instance available directly. We could potentially track some count, but I haven't tried that change/optimization yet.






Re: [PR] [SPARK-46913][SS] Add support for processing/event time based timers with transformWithState operator [spark]

2024-02-26 Thread via GitHub


anishshri-db commented on code in PR #45051:
URL: https://github.com/apache/spark/pull/45051#discussion_r1503733123


##
sql/api/src/main/scala/org/apache/spark/sql/streaming/StatefulProcessor.scala:
##
@@ -48,16 +48,19 @@ private[sql] trait StatefulProcessor[K, I, O] extends 
Serializable {
* @param inputRows - iterator of input rows associated with grouping key
* @param timerValues - instance of TimerValues that provides access to 
current processing/event
*time if available
+   * @param expiredTimerInfo - instance of ExpiredTimerInfo that provides 
access to expired timer
+   * if applicable
* @return - Zero or more output rows
*/
   def handleInputRows(
   key: K,
   inputRows: Iterator[I],
-  timerValues: TimerValues): Iterator[O]
+  timerValues: TimerValues,
+  expiredTimerInfo: ExpiredTimerInfo): Iterator[O]

Review Comment:
   Yea, I thought of that, but we would have the same problem with the Java API in the future.
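   
   For illustration, a minimal sketch of a processor written against the new signature from the diffs in this thread; the class name and emitted strings are hypothetical, and other lifecycle methods such as `init` are omitted for brevity:
   
   ```scala
   // handleInputRows is invoked both for input rows and, with a valid
   // ExpiredTimerInfo, for expired timers (per the trait change above).
   class EchoProcessor extends StatefulProcessor[String, String, String] {
     override def handleInputRows(
         key: String,
         inputRows: Iterator[String],
         timerValues: TimerValues,
         expiredTimerInfo: ExpiredTimerInfo): Iterator[String] = {
       if (expiredTimerInfo.isValid()) {
         // Timer expiry path: no new input rows in this invocation.
         Iterator(s"$key expired at ${expiredTimerInfo.getExpiryTimeInMs()}")
       } else {
         inputRows.map(row => s"$key: $row")
       }
     }
   }
   ```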






Re: [PR] [SPARK-46913][SS] Add support for processing/event time based timers with transformWithState operator [spark]

2024-02-26 Thread via GitHub


anishshri-db commented on code in PR #45051:
URL: https://github.com/apache/spark/pull/45051#discussion_r1503732422


##
sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/StatefulProcessorHandleImpl.scala:
##
@@ -121,6 +123,42 @@ class StatefulProcessorHandleImpl(
 
   override def getQueryInfo(): QueryInfo = currQueryInfo
 
+  private def getTimerState[T](): TimerStateImpl[T] = {
+new TimerStateImpl[T](store, timeoutMode, keyEncoder)
+  }
+
+  private val timerState = getTimerState[Boolean]()
+
+  override def registerTimer(expiryTimestampMs: Long): Unit = {
+verify(timeoutMode == ProcessingTime || timeoutMode == EventTime,

Review Comment:
   The existing one is tied to `GroupState`, so I did not want to reuse that. Also, in case we add more modes here in the future, I thought it's better to keep this generic and separate for this API.






Re: [PR] [SPARK-46913][SS] Add support for processing/event time based timers with transformWithState operator [spark]

2024-02-26 Thread via GitHub


anishshri-db commented on code in PR #45051:
URL: https://github.com/apache/spark/pull/45051#discussion_r1503731606


##
sql/api/src/main/scala/org/apache/spark/sql/streaming/ExpiredTimerInfo.scala:
##
@@ -0,0 +1,45 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.streaming
+
+import java.io.Serializable
+
+import org.apache.spark.annotation.{Evolving, Experimental}
+
+/**
+ * Class used to provide access to expired timer's expiry time and timeout 
mode. These values
+ * are only relevant if the ExpiredTimerInfo is valid.
+ */
+@Experimental
+@Evolving
+private[sql] trait ExpiredTimerInfo extends Serializable {
+  /**
+   * Check if provided ExpiredTimerInfo is valid.
+   */
+  def isValid(): Boolean
+
+  /**
+   * Get the expired timer's expiry time as milliseconds in epoch time.
+   */
+  def getExpiryTimeInMs(): Long
+
+  /**
+   * Get the expired timer's timeout mode.
+   */
+  def getTimeoutMode(): TimeoutMode

Review Comment:
   But that is not available to the `StatefulProcessor`, though. Do you prefer to pass the timeout mode in the `init` method instead?






Re: [PR] [SPARK-47182][BUILD] Exclude `commons-(io|lang3)` transitive dependencies from `commons-compress` and `avro*` [spark]

2024-02-26 Thread via GitHub


dongjoon-hyun commented on PR #45278:
URL: https://github.com/apache/spark/pull/45278#issuecomment-1965901217

   Merged to master for Apache Spark 4.0.0.





Re: [PR] [SPARK-47182][BUILD] Exclude `commons-(io|lang3)` transitive dependencies from `commons-compress` and `avro*` [spark]

2024-02-26 Thread via GitHub


dongjoon-hyun closed pull request #45278: [SPARK-47182][BUILD] Exclude 
`commons-(io|lang3)` transitive dependencies from `commons-compress` and `avro*`
URL: https://github.com/apache/spark/pull/45278





Re: [PR] [SPARK-39771][CORE] Add a warning msg in `Dependency` when a too large number of shuffle blocks is to be created. [spark]

2024-02-26 Thread via GitHub


dongjoon-hyun commented on code in PR #45266:
URL: https://github.com/apache/spark/pull/45266#discussion_r1503714884


##
core/src/main/scala/org/apache/spark/Dependency.scala:
##
@@ -206,6 +206,21 @@ class ShuffleDependency[K: ClassTag, V: ClassTag, C: 
ClassTag](
 finalizeTask = Option(task)
   }
 
+  // Set the threshold to 1 billion which represents approximately 1GB of memory
+  // allocated to map output statuses.
+  // A large number of shuffle blocks may crash the driver with an OOM error.
+  private val SHUFFLE_BLOCK_NUMBER_WARNING_THRESHOLD: Long = 1L << 30
+  private val numberOfShuffleBlocks = numPartitions.toLong * partitioner.numPartitions.toLong
+
+  if (numberOfShuffleBlocks > SHUFFLE_BLOCK_NUMBER_WARNING_THRESHOLD) {

Review Comment:
   Just a question. Is there a chance of overflow at line 213?
   > private val numberOfShuffleBlocks = numPartitions.toLong * partitioner.numPartitions.toLong
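   
   For what it's worth, since both operands are widened with `.toLong` before the multiplication, the product is computed in `Long` arithmetic and cannot overflow for `Int`-sized partition counts (at most ~4.6e18, below `Long.MaxValue`). A quick sketch of the difference:
   
   ```scala
   // Int multiplication overflows silently; widening first keeps the product exact.
   val n = 100000
   println(n * n)               // 1410065408  (10^10 wrapped around 2^32)
   println(n.toLong * n.toLong) // 10000000000 (exact)
   ```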






Re: [PR] [SPARK-47094][SQL] SPJ : Dynamically rebalance number of buckets when they are not equal [spark]

2024-02-26 Thread via GitHub


sunchao commented on PR #45267:
URL: https://github.com/apache/spark/pull/45267#issuecomment-1965888375

   Thanks @szehon-ho ! will take a look in a few days.





Re: [PR] [SPARK-39771][CORE] Add a warning msg in `Dependency` when a too large number of shuffle blocks is to be created. [spark]

2024-02-26 Thread via GitHub


y-wei commented on code in PR #45266:
URL: https://github.com/apache/spark/pull/45266#discussion_r1503712574


##
core/src/test/scala/org/apache/spark/ShuffleSuite.scala:
##
@@ -483,6 +483,17 @@ abstract class ShuffleSuite extends SparkFunSuite with 
Matchers with LocalRootDi
 }
 assert(e.getMessage.contains("corrupted due to DISK_ISSUE"))
   }
+
+  test("SPARK-39771: warn when shuffle block number is too large") {
+sc = new SparkContext("local", "test", conf)
+val logAppender = new LogAppender("deprecated Avro option 'ignoreExtension'")

Review Comment:
   True, thanks for catching this






Re: [PR] [SPARK-47176][SQL] Have a ResolveAllExpressionsUpWithPruning helper function [spark]

2024-02-26 Thread via GitHub


cloud-fan closed pull request #45270: [SPARK-47176][SQL] Have a 
ResolveAllExpressionsUpWithPruning helper function
URL: https://github.com/apache/spark/pull/45270





Re: [PR] [MINOR][SS] Minor string representation improvement in AssertOnQuery [spark]

2024-02-26 Thread via GitHub


HyukjinKwon closed pull request #45275: [MINOR][SS] Minor string representation 
improvement in AssertOnQuery
URL: https://github.com/apache/spark/pull/45275





Re: [PR] [MINOR][SS] Minor string representation improvement in AssertOnQuery [spark]

2024-02-26 Thread via GitHub


HyukjinKwon commented on PR #45275:
URL: https://github.com/apache/spark/pull/45275#issuecomment-1965884273

   Merged to master.





Re: [PR] [MINOR][SS] Minor string representation improvement in AssertOnQuery [spark]

2024-02-26 Thread via GitHub


HyukjinKwon commented on code in PR #45275:
URL: https://github.com/apache/spark/pull/45275#discussion_r1503711045


##
sql/core/src/test/scala/org/apache/spark/sql/streaming/StreamTest.scala:
##
@@ -284,7 +284,11 @@ trait StreamTest extends QueryTest with SharedSparkSession 
with TimeLimits with
   /** Assert that a condition on the active query is true */
   class AssertOnQuery(val condition: StreamExecution => Boolean, val message: 
String)
 extends StreamAction with StreamMustBeRunning {
-override def toString: String = s"AssertOnQuery(, $message)"
+override def toString: String = if (message == "") {

Review Comment:
   It won't be :-). This is for the case below:
   
   ```
    def apply(condition: StreamExecution => Boolean, message: String = ""): AssertOnQuery = {
 new AssertOnQuery(condition, message)
   }
   ```
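   
   A sketch of what the complete change plausibly looks like (the diff above truncates at the `if`; the else branch here is an assumption based on the original `toString`):
   
   ```scala
   override def toString: String = if (message == "") {
     "AssertOnQuery(condition)"
   } else {
     s"AssertOnQuery(condition, $message)"
   }
   ```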






Re: [PR] [SPARK-46913][SS] Add support for processing/event time based timers with transformWithState operator [spark]

2024-02-26 Thread via GitHub


sahnib commented on code in PR #45051:
URL: https://github.com/apache/spark/pull/45051#discussion_r1503581091


##
sql/api/src/main/scala/org/apache/spark/sql/streaming/ExpiredTimerInfo.scala:
##
@@ -0,0 +1,45 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.streaming
+
+import java.io.Serializable
+
+import org.apache.spark.annotation.{Evolving, Experimental}
+
+/**
+ * Class used to provide access to expired timer's expiry time and timeout 
mode. These values
+ * are only relevant if the ExpiredTimerInfo is valid.
+ */
+@Experimental
+@Evolving
+private[sql] trait ExpiredTimerInfo extends Serializable {
+  /**
+   * Check if provided ExpiredTimerInfo is valid.
+   */
+  def isValid(): Boolean
+
+  /**
+   * Get the expired timer's expiry time as milliseconds in epoch time.
+   */
+  def getExpiryTimeInMs(): Long
+
+  /**
+   * Get the expired timer's timeout mode.
+   */
+  def getTimeoutMode(): TimeoutMode

Review Comment:
   Would this ever be different from the timeout mode provided to the `transformWithState` API? If not, do we need this here?



##
sql/api/src/main/scala/org/apache/spark/sql/streaming/StatefulProcessor.scala:
##
@@ -48,16 +48,19 @@ private[sql] trait StatefulProcessor[K, I, O] extends 
Serializable {
* @param inputRows - iterator of input rows associated with grouping key
* @param timerValues - instance of TimerValues that provides access to 
current processing/event
*time if available
+   * @param expiredTimerInfo - instance of ExpiredTimerInfo that provides 
access to expired timer
+   * if applicable
* @return - Zero or more output rows
*/
   def handleInputRows(
   key: K,
   inputRows: Iterator[I],
-  timerValues: TimerValues): Iterator[O]
+  timerValues: TimerValues,
+  expiredTimerInfo: ExpiredTimerInfo): Iterator[O]
 
   /**
* Function called as the last method that allows for users to perform
* any cleanup or teardown operations.
*/
-  def close (): Unit
+  def close (): Unit = {}

Review Comment:
   Nice idea to provide a default implementation here. 



##
sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/StatefulProcessorHandleImpl.scala:
##
@@ -121,6 +123,42 @@ class StatefulProcessorHandleImpl(
 
   override def getQueryInfo(): QueryInfo = currQueryInfo
 
+  private def getTimerState[T](): TimerStateImpl[T] = {
+new TimerStateImpl[T](store, timeoutMode, keyEncoder)
+  }
+
+  private val timerState = getTimerState[Boolean]()
+
+  override def registerTimer(expiryTimestampMs: Long): Unit = {
+verify(timeoutMode == ProcessingTime || timeoutMode == EventTime,
+s"Cannot register timers with incorrect TimeoutMode")
+verify(currState == INITIALIZED || currState == DATA_PROCESSED,
+s"Cannot register timers with " +
+  s"expiryTimestampMs=$expiryTimestampMs in current state=$currState")

Review Comment:
   We should use the NERF framework for these user errors. 



##
sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/StatefulProcessorHandleImpl.scala:
##
@@ -121,6 +123,42 @@ class StatefulProcessorHandleImpl(
 
   override def getQueryInfo(): QueryInfo = currQueryInfo
 
+  private def getTimerState[T](): TimerStateImpl[T] = {
+new TimerStateImpl[T](store, timeoutMode, keyEncoder)
+  }
+
+  private val timerState = getTimerState[Boolean]()
+
+  override def registerTimer(expiryTimestampMs: Long): Unit = {
+verify(timeoutMode == ProcessingTime || timeoutMode == EventTime,

Review Comment:
   Shouldn't this be the same as the timeoutMode in the transformWithState API?



##
sql/api/src/main/scala/org/apache/spark/sql/streaming/StatefulProcessorHandle.scala:
##
@@ -51,6 +51,25 @@ private[sql] trait StatefulProcessorHandle extends 
Serializable {
   /** Function to return queryInfo for currently running task */
   def getQueryInfo(): QueryInfo
 
+  /**
+   * Function to register a processing/event time based timer for given 
implicit key

Review Comment:
   [nit] `implicit key` -> `implicit grouping key`. 




Re: [PR] [SPARK-47176][SQL] Have a ResolveAllExpressionsUpWithPruning helper function [spark]

2024-02-26 Thread via GitHub


cloud-fan commented on PR #45270:
URL: https://github.com/apache/spark/pull/45270#issuecomment-1965883152

   thanks, merging to master!





Re: [PR] [SPARK-47164][SQL] Make Default Value From Wider Type Narrow Literal of v2 behave the same as v1 [spark]

2024-02-26 Thread via GitHub


cloud-fan commented on PR #45254:
URL: https://github.com/apache/spark/pull/45254#issuecomment-1965882323

   late LGTM





Re: [PR] [SPARK-39771][CORE] Add a warning msg in `Dependency` when a too large number of shuffle blocks is to be created. [spark]

2024-02-26 Thread via GitHub


dongjoon-hyun commented on code in PR #45266:
URL: https://github.com/apache/spark/pull/45266#discussion_r1503709429


##
core/src/test/scala/org/apache/spark/ShuffleSuite.scala:
##
@@ -483,6 +483,17 @@ abstract class ShuffleSuite extends SparkFunSuite with 
Matchers with LocalRootDi
 }
 assert(e.getMessage.contains("corrupted due to DISK_ISSUE"))
   }
+
+  test("SPARK-39771: warn when shuffle block number is too large") {
+sc = new SparkContext("local", "test", conf)
+val logAppender = new LogAppender("deprecated Avro option 'ignoreExtension'")

Review Comment:
   Maybe this is simply a copy-and-paste mistake from here?
   
https://github.com/apache/spark/blob/1a408033daf458f1ceebbe14a560355a1a2c0a70/connector/avro/src/test/scala/org/apache/spark/sql/avro/AvroSuite.scala#L2229






Re: [PR] [SPARK-39771][CORE] Add a warning msg in `Dependency` when a too large number of shuffle blocks is to be created. [spark]

2024-02-26 Thread via GitHub


dongjoon-hyun commented on code in PR #45266:
URL: https://github.com/apache/spark/pull/45266#discussion_r1503708211


##
core/src/test/scala/org/apache/spark/ShuffleSuite.scala:
##
@@ -483,6 +483,17 @@ abstract class ShuffleSuite extends SparkFunSuite with 
Matchers with LocalRootDi
 }
 assert(e.getMessage.contains("corrupted due to DISK_ISSUE"))
   }
+
+  test("SPARK-39771: warn when shuffle block number is too large") {
+sc = new SparkContext("local", "test", conf)
+val logAppender = new LogAppender("deprecated Avro option 'ignoreExtension'")

Review Comment:
   `"deprecated Avro option 'ignoreExtension'"` seems wrong to me. Is this intentional, @y-wei?






Re: [PR] [SPARK-47182][BUILD] Exclude `commons-(io|lang3)` transitive dependencies from `commons-compress` and `avro*` [spark]

2024-02-26 Thread via GitHub


dongjoon-hyun commented on PR #45278:
URL: https://github.com/apache/spark/pull/45278#issuecomment-1965865838

   Thank you, @LuciferYang. I fully understand your concern and have the same feeling about this kind of tedious code change, which sometimes looks redundant. This is just a way of communicating, in order to give clarity on the Apache Spark 4.0.0 dependency issues.





Re: [PR] [SPARK-47182][BUILD] Exclude `commons-(io|lang3)` transitive dependencies from `commons-compress` and `avro*` [spark]

2024-02-26 Thread via GitHub


LuciferYang commented on PR #45278:
URL: https://github.com/apache/spark/pull/45278#issuecomment-1965863663

   > To give you the context, I'm currently working on this for Apache Spark 
4.0.0 preparation, @LuciferYang .
   > 
   > * [SPARK-47046](https://issues.apache.org/jira/browse/SPARK-47046) Apache 
Spark 4.0.0 Dependency Audit and Cleanup
   >   
   >   * https://github.com/apache/spark/security
   
   Thanks @dongjoon-hyun! This is great work!!!





Re: [PR] [SPARK-47182][BUILD] Exclude `commons-(io|lang3)` transitive dependencies from `commons-compress` and `avro*` [spark]

2024-02-26 Thread via GitHub


dongjoon-hyun commented on PR #45278:
URL: https://github.com/apache/spark/pull/45278#issuecomment-1965862146

   No, I don't think so, unless some other CVE-related issue bites us again, @LuciferYang.
   > Okay, I understand your intention. So, will there be more similar PRs in the future to explicitly exclude the transitive dependencies declared by Spark?





Re: [PR] [SPARK-47182][BUILD] Exclude `commons-(io|lang3)` transitive dependencies from `commons-compress` and `avro*` [spark]

2024-02-26 Thread via GitHub


dongjoon-hyun commented on PR #45278:
URL: https://github.com/apache/spark/pull/45278#issuecomment-1965861469

   To give you the context, I'm currently working on this for Apache Spark 
4.0.0 preparation, @LuciferYang . 
   - SPARK-47046 Apache Spark 4.0.0 Dependency Audit and Cleanup
 - https://github.com/apache/spark/security 





Re: [PR] [SPARK-47182][BUILD] Exclude `commons-(io|lang3)` transitive dependencies from `commons-compress` and `avro*` [spark]

2024-02-26 Thread via GitHub


LuciferYang commented on PR #45278:
URL: https://github.com/apache/spark/pull/45278#issuecomment-1965861359

   Okay, I understand your intention. So, will there be more similar PRs in the 
future to explicitly exclude the transitive dependencies declared by Spark?





Re: [PR] [SPARK-47182][BUILD] Exclude `commons-(io|lang3)` transitive dependencies from `commons-compress` and `avro*` [spark]

2024-02-26 Thread via GitHub


dongjoon-hyun commented on PR #45278:
URL: https://github.com/apache/spark/pull/45278#issuecomment-1965859447

   Specifically, the following dependency clarification is the goal of this PR.
   
   **BEFORE**
   ```
   $ build/sbt "core/dependencyTree" | grep -C2 commons-compress | tail -n3
   Using SPARK_LOCAL_IP=localhost
   [info]   +-org.apache.commons:commons-compress:1.26.0
   [info]   | +-commons-io:commons-io:2.15.1
   [info]   | +-org.apache.commons:commons-lang3:3.14.0
   ```
   
   **AFTER**
   ```
   $ build/sbt "core/dependencyTree" | grep -C2 commons-compress | tail -n3
   Using SPARK_LOCAL_IP=localhost
   [info]   +-org.apache.commons:commons-compress:1.26.0
   [info]   +-org.apache.commons:commons-crypto:1.1.0
   [info]   +-org.apache.commons:commons-lang3:3.14.0
   ```
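   
   For reference, this kind of exclusion is typically expressed as follows in an sbt build (a generic sketch, not this PR's actual diff):
   
   ```scala
   // Drop commons-io and commons-lang3 arriving transitively via
   // commons-compress, relying on Spark's own directly declared versions.
   libraryDependencies += ("org.apache.commons" % "commons-compress" % "1.26.0")
     .exclude("commons-io", "commons-io")
     .exclude("org.apache.commons", "commons-lang3")
   ```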





Re: [PR] [SPARK-47153][CORE] Guard serialize/deserialize in JavaSerializer with try-with-resource block [spark]

2024-02-26 Thread via GitHub


LuciferYang commented on code in PR #45238:
URL: https://github.com/apache/spark/pull/45238#discussion_r1503688318


##
core/src/main/scala/org/apache/spark/serializer/JavaSerializer.scala:
##
@@ -118,22 +118,24 @@ private[spark] class JavaSerializerInstance(
 
   override def serialize[T: ClassTag](t: T): ByteBuffer = {
 val bos = new ByteBufferOutputStream()
-val out = serializeStream(bos)
-out.writeObject(t)
-out.close()
+Utils.tryWithResource(serializeStream(bos)) { out =>

Review Comment:
   There should be other similar cases, for example:
   
   
https://github.com/apache/spark/blob/1a408033daf458f1ceebbe14a560355a1a2c0a70/core/src/main/scala/org/apache/spark/scheduler/TaskDescription.scala#L92-L94
   
   
https://github.com/apache/spark/blob/1a408033daf458f1ceebbe14a560355a1a2c0a70/sql/core/src/main/scala/org/apache/spark/sql/execution/arrow/ArrowConverters.scala#L473
   
   cc @mridulm 
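   
   For reference, the change applies the classic loan pattern; a minimal self-contained sketch (Spark's `Utils.tryWithResource` is analogous, though its exact signature may differ):
   
   ```scala
   // Close the resource on both the success and the failure path.
   def tryWithResource[R <: java.io.Closeable, T](createResource: => R)(f: R => T): T = {
     val resource = createResource
     try f(resource) finally resource.close()
   }
   
   // Usage, mirroring the serialize() change in the diff:
   //   tryWithResource(serializeStream(bos)) { out => out.writeObject(t) }
   ```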






Re: [PR] [SPARK-47182][BUILD] Exclude `commons-(io|lang3)` transitive dependencies from `commons-compress` and `avro*` [spark]

2024-02-26 Thread via GitHub


dongjoon-hyun commented on PR #45278:
URL: https://github.com/apache/spark/pull/45278#issuecomment-1965855776

   Currently, we are in sync with the latest versions, but the latest versions might differ from Apache Spark's own. For both Maven and SBT, this PR wants to exclude the transitive dependencies explicitly, without depending on their dependency resolution, @LuciferYang.





Re: [PR] [MINOR][SS] Minor string representation improvement in AssertOnQuery [spark]

2024-02-26 Thread via GitHub


LuciferYang commented on code in PR #45275:
URL: https://github.com/apache/spark/pull/45275#discussion_r1503682301


##
sql/core/src/test/scala/org/apache/spark/sql/streaming/StreamTest.scala:
##
@@ -284,7 +284,11 @@ trait StreamTest extends QueryTest with SharedSparkSession 
with TimeLimits with
   /** Assert that a condition on the active query is true */
   class AssertOnQuery(val condition: StreamExecution => Boolean, val message: 
String)
 extends StreamAction with StreamMustBeRunning {
-override def toString: String = s"AssertOnQuery(, $message)"
+override def toString: String = if (message == "") {

Review Comment:
   Is it only for `""`? What if the message is `""`? 






Re: [PR] [SPARK-47182][BUILD] Exclude `commons-(io|lang3)` transitive dependencies from `commons-compress` and `avro*` [spark]

2024-02-26 Thread via GitHub


LuciferYang commented on PR #45278:
URL: https://github.com/apache/spark/pull/45278#issuecomment-1965844769

   If the groupId and artifactId are the same, Maven should choose the latest 
version among multiple versions. What problems could occur if we don't exclude 
it?





Re: [PR] [SPARK-47179][SQL] Improve error message from SparkThrowableSuite for better debuggability [spark]

2024-02-26 Thread via GitHub


itholic commented on code in PR #45273:
URL: https://github.com/apache/spark/pull/45273#discussion_r1503675810


##
core/src/test/scala/org/apache/spark/SparkThrowableSuite.scala:
##
@@ -328,7 +328,9 @@ class SparkThrowableSuite extends SparkFunSuite {
 }
   } else {
 assert(subErrorDoc.trim == errorsInDoc.trim,
-  "The error class document is not up to date. Please regenerate it.")
+  "The error class document is not up to date. Please regenerate it by running " +
+"`SPARK_GENERATE_GOLDEN_FILES=1 build/sbt \"core/testOnly *SparkThrowableSuite -- -t " +
+"\\\"Error classes match with document\\\"\"`")

Review Comment:
   Sounds good. Will update






Re: [PR] [SPARK-41811][PYTHON][CONNECT] Implement `SQLStringFormatter` [spark]

2024-02-26 Thread via GitHub


zhengruifeng commented on PR #45277:
URL: https://github.com/apache/spark/pull/45277#issuecomment-1965829385

   @HyukjinKwon @MaxGekk @grundprinzip @hvanhovell would you mind taking a 
look? thanks!





[PR] [SPARK-47182][BUILD] Exclude `commons-(io|lang3)` transitive dependencies from `commons-compress` and `avro-*` [spark]

2024-02-26 Thread via GitHub


dongjoon-hyun opened a new pull request, #45278:
URL: https://github.com/apache/spark/pull/45278

   …
   
   
   
   ### What changes were proposed in this pull request?
   
   
   
   ### Why are the changes needed?
   
   
   
   ### Does this PR introduce _any_ user-facing change?
   
   
   
   ### How was this patch tested?
   
   
   
   ### Was this patch authored or co-authored using generative AI tooling?
   
   





Re: [PR] [SPARK-41811][PYTHON][CONNECT] Implement `SQLStringFormatter` [spark]

2024-02-26 Thread via GitHub


zhengruifeng commented on PR #45277:
URL: https://github.com/apache/spark/pull/45277#issuecomment-1965815610

   I found that introducing a CTE message may make this more complex, since a `SQLCommand` is executed eagerly.
   ```
   UnresolvedWith (CTE)
- SQLCommand
- cteRelations
   ```
   
   So I switched to adding view fields to the existing `SQLCommand` instead.





[PR] [SPARK-41811][PYTHON][CONNECT] Implement `SQLStringFormatter` [spark]

2024-02-26 Thread via GitHub


zhengruifeng opened a new pull request, #45277:
URL: https://github.com/apache/spark/pull/45277

   ### What changes were proposed in this pull request?
   Implement SQLStringFormatter for Python Client
   
   
   ### Why are the changes needed?
   for parity
   
   
   ### Does this PR introduce _any_ user-facing change?
   yes, new feature
   
   ```
   In [1]: mydf = spark.range(10)
   
   In [2]: spark.sql("SELECT {col} FROM {mydf} WHERE id IN {x}", col=mydf.id, 
mydf=mydf, x=tuple(range(4))).show()
   +---+
   | id|
   +---+
   |  0|
   |  1|
   |  2|
   |  3|
   +---+
   ```
   
   
   
   ### How was this patch tested?
   enabled doc tests
   
   
   ### Was this patch authored or co-authored using generative AI tooling?
   no





Re: [PR] [SPARK-47179][SQL] Improve error message from SparkThrowableSuite for better debuggability [spark]

2024-02-26 Thread via GitHub


amaliujia commented on code in PR #45273:
URL: https://github.com/apache/spark/pull/45273#discussion_r1503632310


##
core/src/test/scala/org/apache/spark/SparkThrowableSuite.scala:
##
@@ -328,7 +328,9 @@ class SparkThrowableSuite extends SparkFunSuite {
 }
   } else {
 assert(subErrorDoc.trim == errorsInDoc.trim,
-  "The error class document is not up to date. Please regenerate it.")
+  "The error class document is not up to date. Please regenerate it by running " +
+"`SPARK_GENERATE_GOLDEN_FILES=1 build/sbt \"core/testOnly *SparkThrowableSuite -- -t " +
+"\\\"Error classes match with document\\\"\"`")

Review Comment:
   E.g.
   ```
 /* Used to regenerate the error class file. Run:
  {{{
     SPARK_GENERATE_GOLDEN_FILES=1 build/sbt \
       "core/testOnly *SparkThrowableSuite -- -t \"Error classes are correctly formatted\""
  }}}

  To regenerate the error class document. Run:
  {{{
     SPARK_GENERATE_GOLDEN_FILES=1 build/sbt \
       "core/testOnly *SparkThrowableSuite -- -t \"Error classes match with document\""
  }}}
  */
 val error_message_instruction = "Please regenerate it by running " +
   "`SPARK_GENERATE_GOLDEN_FILES=1 build/sbt \"core/testOnly *SparkThrowableSuite -- -t " +
   "\\\"Error classes match with document\\\"\"`"
   ```

   Then use the val in the error message.









Re: [PR] [SPARK-47179][SQL] Improve error message from SparkThrowableSuite for better debuggability [spark]

2024-02-26 Thread via GitHub


amaliujia commented on code in PR #45273:
URL: https://github.com/apache/spark/pull/45273#discussion_r1503631655


##
core/src/test/scala/org/apache/spark/SparkThrowableSuite.scala:
##
@@ -328,7 +328,9 @@ class SparkThrowableSuite extends SparkFunSuite {
 }
   } else {
 assert(subErrorDoc.trim == errorsInDoc.trim,
-  "The error class document is not up to date. Please regenerate it.")
+  "The error class document is not up to date. Please regenerate it by running " +
+"`SPARK_GENERATE_GOLDEN_FILES=1 build/sbt \"core/testOnly *SparkThrowableSuite -- -t " +
+"\\\"Error classes match with document\\\"\"`")

Review Comment:
   Can we use a string variable to replace this hardcoded instruction, and put that val at the beginning of this class, under the comment on how to regenerate the golden file?
   
   The reason this is better is that the regeneration instructions may be updated while the developer fails to update the copy embedded in messages like this one.






Re: [PR] [MINOR][SQL] Tweak column error names and text [spark]

2024-02-26 Thread via GitHub


amaliujia commented on PR #45276:
URL: https://github.com/apache/spark/pull/45276#issuecomment-1965775599

   cc @MaxGekk 





Re: [PR] [SPARK-47144][CONNECT][SQL] Fix Spark Connect collation error by adding collateId protobuf field [spark]

2024-02-26 Thread via GitHub


amaliujia commented on code in PR #45233:
URL: https://github.com/apache/spark/pull/45233#discussion_r1503625723


##
connector/connect/server/src/test/scala/org/apache/spark/sql/connect/service/SparkConnectServiceE2ESuite.scala:
##
@@ -202,4 +202,17 @@ class SparkConnectServiceE2ESuite extends 
SparkConnectServerTest {
   }
 }
   }
+
+  test("SPARK-47144: Verify collations work with Connect client") {
+withClient { client =>
+  Seq("UCS_BASIC", "UCS_BASIC_LCASE", "UNICODE", "UNICODE_CI").foreach(collationName => {
+val query1 = client.execute(buildPlan(s"select 'abc' collate '$collationName'"))
+query1.hasNext // guarantees the request was received by server.
+
+val query2 = client.execute(buildPlan(s"select collation('abc' collate '$collationName')"))
+query2.hasNext // guarantees the request was received by server.
+  })
+  client.releaseSession()

Review Comment:
   Looks like the purpose of your test here is to make sure the request passes through without an error. You do not need this test.
   
   What you need is to add a test in `PlanGenerationTestSuite`, since your purpose is to verify that the client generates a correct plan. You can check the class doc of `PlanGenerationTestSuite` to understand how to add such a test.






Re: [PR] [SPARK-47144][CONNECT][SQL] Fix Spark Connect collation error by adding collateId protobuf field [spark]

2024-02-26 Thread via GitHub


amaliujia commented on code in PR #45233:
URL: https://github.com/apache/spark/pull/45233#discussion_r1503622539


##
connector/connect/common/src/main/protobuf/spark/connect/types.proto:
##
@@ -100,6 +100,7 @@ message DataType {
 
   message String {
 uint32 type_variation_reference = 1;
+optional uint32 collation_id = 2;

Review Comment:
   The String type should always have a collation_id, so this field should be non-optional? The default value is 0, which is what the proto field reads as when it is not set.






Re: [PR] [SPARK-47144][CONNECT][SQL] Fix Spark Connect collation error by adding collateId protobuf field [spark]

2024-02-26 Thread via GitHub


amaliujia commented on code in PR #45233:
URL: https://github.com/apache/spark/pull/45233#discussion_r1503622112


##
connector/connect/common/src/main/protobuf/spark/connect/types.proto:
##
@@ -100,6 +100,7 @@ message DataType {
 
   message String {
 uint32 type_variation_reference = 1;
+optional uint32 collation_id = 2;

Review Comment:
   The String type should always have a collation_id, so this field should be non-`optional`? The default value is 0.









Re: [PR] [SPARK-47144][CONNECT][SQL] Fix Spark Connect collation error by adding collateId protobuf field [spark]

2024-02-26 Thread via GitHub


amaliujia commented on code in PR #45233:
URL: https://github.com/apache/spark/pull/45233#discussion_r1503622112


##
connector/connect/common/src/main/protobuf/spark/connect/types.proto:
##
@@ -100,6 +100,7 @@ message DataType {
 
   message String {
 uint32 type_variation_reference = 1;
+optional uint32 collation_id = 2;

Review Comment:
   The String type should always have a collation_id, so might this field be non 
`optional`? The default value is 0. 



##
connector/connect/common/src/main/protobuf/spark/connect/types.proto:
##
@@ -100,6 +100,7 @@ message DataType {
 
   message String {
 uint32 type_variation_reference = 1;
+optional uint32 collation_id = 2;

Review Comment:
   The String type should always have a collation_id, so should this field be 
non-`optional`? The default value is 0. 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-47181][CORE][TESTS] Fix `MasterSuite` to validate the number of registered workers [spark]

2024-02-26 Thread via GitHub


dongjoon-hyun closed pull request #45274: [SPARK-47181][CORE][TESTS] Fix 
`MasterSuite` to validate the number of registered workers
URL: https://github.com/apache/spark/pull/45274


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-47181][CORE][TESTS] Fix `MasterSuite` to validate the number of registered workers [spark]

2024-02-26 Thread via GitHub


dongjoon-hyun commented on PR #45274:
URL: https://github.com/apache/spark/pull/45274#issuecomment-1965764570

   Thank you, @yaooqinn , @HyukjinKwon , @viirya .
   Merged to master.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-45527][CORE] Use fraction to do the resource calculation [spark]

2024-02-26 Thread via GitHub


srowen commented on PR #44690:
URL: https://github.com/apache/spark/pull/44690#issuecomment-1965751067

   Wait, this was merged to master? Let's not merge it to 3.5, no, because I don't 
think this actually fixes the problem (see the thread here; I didn't grok that 
this was a backport). @tgravescs, not sure about this one in master.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-45527][CORE] Use fraction to do the resource calculation [spark]

2024-02-26 Thread via GitHub


HeartSaVioR commented on PR #44690:
URL: https://github.com/apache/spark/pull/44690#issuecomment-1965745567

   @srowen No, this was merged to master via 
https://github.com/apache/spark/pull/43494. I'm just speculating, as the change 
does not look trivial enough to be worth the risk of adopting it in a 
maintenance version.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-45527][CORE] Use fraction to do the resource calculation [spark]

2024-02-26 Thread via GitHub


srowen closed pull request #44690: [SPARK-45527][CORE] Use fraction to do the 
resource calculation
URL: https://github.com/apache/spark/pull/44690


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-45527][CORE] Use fraction to do the resource calculation [spark]

2024-02-26 Thread via GitHub


srowen commented on PR #44690:
URL: https://github.com/apache/spark/pull/44690#issuecomment-1965741499

   No, he's asking why it's opened against the 3.5 branch rather than master. It 
is not going into master and so is not to be backported either. The main reason 
is that this is not a fix.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [MINOR][SQL] Tweak column error names and text [spark]

2024-02-26 Thread via GitHub


nchammas commented on code in PR #45276:
URL: https://github.com/apache/spark/pull/45276#discussion_r1503601045


##
common/utils/src/main/resources/error/error-classes.json:
##
@@ -492,15 +492,15 @@
 },
 "sqlState" : "54000"
   },
-  "COLUMN_ALIASES_IS_NOT_ALLOWED" : {
+  "COLUMN_ALIASES_NOT_ALLOWED" : {

Review Comment:
   I considered naming this `COLUMN_ALIASES_ARE_NOT_ALLOWED`, but since these 
are slugs they don't really need proper grammar.
   
   This change at least corrects the incorrect combination of the plural 
"aliases" with the singular "is".



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[PR] [MINOR][SQL] Tweak column error names and text [spark]

2024-02-26 Thread via GitHub


nchammas opened a new pull request, #45276:
URL: https://github.com/apache/spark/pull/45276

   ### What changes were proposed in this pull request?
   
   Tweak the names and text for a few errors so they read more naturally (and 
correctly).
   
   ### Why are the changes needed?
   
   Just minor English improvements.
   
   ### Does this PR introduce _any_ user-facing change?
   
   Yes, these are user-facing error messages.
   
   ### How was this patch tested?
   
   No testing apart from CI.
   
   ### Was this patch authored or co-authored using generative AI tooling?
   
   No.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-47178][PYTHON][TESTS] Add a test case for createDataFrame with dataclasses [spark]

2024-02-26 Thread via GitHub


HyukjinKwon closed pull request #45271: [SPARK-47178][PYTHON][TESTS] Add a test 
case for createDataFrame with dataclasses
URL: https://github.com/apache/spark/pull/45271


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-47178][PYTHON][TESTS] Add a test case for createDataFrame with dataclasses [spark]

2024-02-26 Thread via GitHub


HyukjinKwon commented on PR #45271:
URL: https://github.com/apache/spark/pull/45271#issuecomment-1965738440

   Merged to master.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [MINOR][SS] Minor string representation improvement in AssertOnQuery [spark]

2024-02-26 Thread via GitHub


HyukjinKwon commented on PR #45275:
URL: https://github.com/apache/spark/pull/45275#issuecomment-1965738139

   cc @HeartSaVioR FYI. I found this out while debugging something :-)


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[PR] [MINOR][SS] Minor string representation improvement in AssertOnQuery [spark]

2024-02-26 Thread via GitHub


HyukjinKwon opened a new pull request, #45275:
URL: https://github.com/apache/spark/pull/45275

   ### What changes were proposed in this pull request?
   
   This PR changes `AssertOnQuery(<condition>, <message>)` to 
`AssertOnQuery(<condition>)` when the message is empty.
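   
   A hedged sketch of the shape of such a change, assuming the representation is 
built in a `toString`-style method (the `condition` and `message` names are 
illustrative, not the actual field names):
   
   ```scala
   // Drop the trailing ", <message>" part when no message was supplied, so the
   // streaming test output reads AssertOnQuery(<condition>) rather than
   // AssertOnQuery(<condition>, ).
   override def toString: String =
 if (message.isEmpty) s"AssertOnQuery($condition)"
 else s"AssertOnQuery($condition, $message)"
   ```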
   
   ### Why are the changes needed?
   
   Just to make it a little bit prettier and more readable.
   
   ### Does this PR introduce _any_ user-facing change?
   
   No, dev-only.
   
   ### How was this patch tested?
   
   Manually.
   
   ### Was this patch authored or co-authored using generative AI tooling?
   
   No.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-45527][CORE] Use fraction to do the resource calculation [spark]

2024-02-26 Thread via GitHub


wbo4958 commented on PR #44690:
URL: https://github.com/apache/spark/pull/44690#issuecomment-1965726911

   Hi @HeartSaVioR, in essence it's actually a genuine bug in Spark rather 
than a feature request. If we attempt to restrict execution to only one 
task per executor by modifying the GPU resources, it won't work as expected. 
This scenario is quite common in ETL (Extract, Transform, Load) + ML (Machine 
Learning) workflows.
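   
   For readers unfamiliar with the setup being described, a hedged illustration 
of that intent (these are Spark's standard resource config keys; the values are 
illustrative):
   
   ```scala
   import org.apache.spark.sql.SparkSession
   
   // Each executor advertises one GPU and each task claims the whole GPU, so
   // only one task should run per executor at a time; the bug under discussion
   // is that this restriction did not behave as expected.
   val spark = SparkSession.builder()
 .config("spark.executor.resource.gpu.amount", "1")
 .config("spark.task.resource.gpu.amount", "1")
 .getOrCreate()
   ```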


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-46881][CORE] Support `spark.deploy.workerSelectionPolicy` [spark]

2024-02-26 Thread via GitHub


dongjoon-hyun commented on code in PR #44906:
URL: https://github.com/apache/spark/pull/44906#discussion_r1503590785


##
core/src/test/scala/org/apache/spark/deploy/master/MasterSuite.scala:
##
@@ -661,6 +662,52 @@ class MasterSuite extends SparkFunSuite
 scheduleExecutorsForAppWithMultiRPs(withMaxCores = true)
   }
 
+
+  private val workerSelectionPolicyTestCases = Seq(
+(CORES_FREE_ASC, true, List("10001", "10002")),
+(CORES_FREE_ASC, false, List("10001")),
+(CORES_FREE_DESC, true, List("10004", "10005")),
+(CORES_FREE_DESC, false, List("10005")),
+(MEMORY_FREE_ASC, true, List("10001", "10005")),
+(MEMORY_FREE_ASC, false, List("10001")),
+(MEMORY_FREE_DESC, true, List("10002", "10003")),
+(MEMORY_FREE_DESC, false, Seq("10002")),
+(WORKER_ID, true, Seq("10001", "10002")),
+(WORKER_ID, false, Seq("10001")))
+
+  workerSelectionPolicyTestCases.foreach { case (policy, spreadOut, expected) 
=>
+test(s"SPARK-46881: scheduling with workerSelectionPolicy - $policy 
($spreadOut)") {

Review Comment:
   To @HyukjinKwon , here is the PR.
   - #45274



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-46881][CORE] Support `spark.deploy.workerSelectionPolicy` [spark]

2024-02-26 Thread via GitHub


dongjoon-hyun commented on code in PR #44906:
URL: https://github.com/apache/spark/pull/44906#discussion_r1503590785


##
core/src/test/scala/org/apache/spark/deploy/master/MasterSuite.scala:
##
@@ -661,6 +662,52 @@ class MasterSuite extends SparkFunSuite
 scheduleExecutorsForAppWithMultiRPs(withMaxCores = true)
   }
 
+
+  private val workerSelectionPolicyTestCases = Seq(
+(CORES_FREE_ASC, true, List("10001", "10002")),
+(CORES_FREE_ASC, false, List("10001")),
+(CORES_FREE_DESC, true, List("10004", "10005")),
+(CORES_FREE_DESC, false, List("10005")),
+(MEMORY_FREE_ASC, true, List("10001", "10005")),
+(MEMORY_FREE_ASC, false, List("10001")),
+(MEMORY_FREE_DESC, true, List("10002", "10003")),
+(MEMORY_FREE_DESC, false, Seq("10002")),
+(WORKER_ID, true, Seq("10001", "10002")),
+(WORKER_ID, false, Seq("10001")))
+
+  workerSelectionPolicyTestCases.foreach { case (policy, spreadOut, expected) 
=>
+test(s"SPARK-46881: scheduling with workerSelectionPolicy - $policy 
($spreadOut)") {

Review Comment:
   To @HyukjinKwon , here is the follow-up.
   - #45274



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-47181][CORE][TESTS] Fix `MasterSuite` to validate the number of registered workers [spark]

2024-02-26 Thread via GitHub


dongjoon-hyun commented on PR #45274:
URL: https://github.com/apache/spark/pull/45274#issuecomment-1965726176

   cc @HyukjinKwon 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[PR] [SPARK-47181][CORE][TESTS] Fix `MasterSuite` to validate the number of registered workers [spark]

2024-02-26 Thread via GitHub


dongjoon-hyun opened a new pull request, #45274:
URL: https://github.com/apache/spark/pull/45274

   …
   
   
   
   ### What changes were proposed in this pull request?
   
   
   
   ### Why are the changes needed?
   
   
   
   ### Does this PR introduce _any_ user-facing change?
   
   
   
   ### How was this patch tested?
   
   
   
   ### Was this patch authored or co-authored using generative AI tooling?
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-46862][SQL] Disable CSV column pruning in the multi-line mode [spark]

2024-02-26 Thread via GitHub


nchammas commented on PR #44872:
URL: https://github.com/apache/spark/pull/44872#issuecomment-1965720034

   I've filed SPARK-47180 to track potentially migrating off of Univocity to 
something else.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-47166][INFRA] Improves the HINTs of merge_spark_pr.py [spark]

2024-02-26 Thread via GitHub


yaooqinn commented on PR #45256:
URL: https://github.com/apache/spark/pull/45256#issuecomment-1965706216

   Thank you @dongjoon-hyun, merged to master


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-47166][INFRA] Improves the HINTs of merge_spark_pr.py [spark]

2024-02-26 Thread via GitHub


yaooqinn closed pull request #45256: [SPARK-47166][INFRA] Improves the HINTs of 
merge_spark_pr.py
URL: https://github.com/apache/spark/pull/45256


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[PR] [SPARK-47179][SQL] Improve error message from SparkThrowableSuite for better debuggability [spark]

2024-02-26 Thread via GitHub


itholic opened a new pull request, #45273:
URL: https://github.com/apache/spark/pull/45273

   ### What changes were proposed in this pull request?
   
   This PR proposes to improve the error message from SparkThrowableSuite for 
better debuggability.
   
   ### Why are the changes needed?
   
   The current error message is not very actionable for developers who need 
to regenerate the error class documentation.
   
   
   ### Does this PR introduce _any_ user-facing change?
   
   No API change, but the error message is changed:
   
   **Before**
   ```
   The error class document is not up to date. Please regenerate it.
   ```
   
   **After**
   ```
   The error class document is not up to date. Please regenerate it by running 
`SPARK_GENERATE_GOLDEN_FILES=1 build/sbt "core/testOnly *SparkThrowableSuite -- 
-t \"Error classes match with document\""`
   ```
   
   ### How was this patch tested?
   
   The existing CI should pass.
   
   ### Was this patch authored or co-authored using generative AI tooling?
   
   No.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-45527][CORE][TESTS][FOLLOW-UP] Reduce the number of test cases in fraction resource calculation [spark]

2024-02-26 Thread via GitHub


dongjoon-hyun closed pull request #45272: [SPARK-45527][CORE][TESTS][FOLLOW-UP] 
Reduce the number of test cases in fraction resource calculation
URL: https://github.com/apache/spark/pull/45272


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-45527][CORE][TESTS][FOLLOW-UP] Reduce the number of test cases in fraction resource calculation [spark]

2024-02-26 Thread via GitHub


dongjoon-hyun commented on PR #45272:
URL: https://github.com/apache/spark/pull/45272#issuecomment-1965695504

   ```
   [info] TaskSchedulerImplSuite:
   [info] - SPARK-32653: Decommissioned host/executor should be considered as 
inactive (409 milliseconds)
   [info] - Scheduler does not always schedule tasks on the same workers (295 
milliseconds)
   [info] - Scheduler correctly accounts for multiple CPUs per task (26 
milliseconds)
   [info] - SPARK-18886 - partial offers (isAllFreeResources = false) reset 
timer before any resources have been rejected (27 milliseconds)
   [info] - SPARK-18886 - delay scheduling timer is reset when it accepts all 
resources offered when isAllFreeResources = true (22 milliseconds)
   [info] - SPARK-18886 - task set with no locality requirements should not 
starve one with them (21 milliseconds)
   [info] - SPARK-18886 - partial resource offers (isAllFreeResources = false) 
reset time if last full resource offer (isAllResources = true) was accepted as 
well as any following partial resource offers (20 milliseconds)
   [info] - SPARK-18886 - partial resource offers (isAllFreeResources = false) 
do not reset time if any offer was rejected since last full offer was fully 
accepted (19 milliseconds)
   [info] - Scheduler does not crash when tasks are not serializable (23 
milliseconds)
   [info] - concurrent attempts for the same stage only have one active taskset 
(21 milliseconds)
   [info] - don't schedule more tasks after a taskset is zombie (21 
milliseconds)
   [info] - if a zombie attempt finishes, continue scheduling tasks for 
non-zombie attempts (20 milliseconds)
   [info] - tasks are not re-scheduled while executor loss reason is pending 
(24 milliseconds)
   OpenJDK 64-Bit Server VM warning: Sharing is only supported for boot loader 
classes because bootstrap classpath has been appended
   [info] - scheduled tasks obey task and stage excludelist (731 milliseconds)
   [info] - scheduled tasks obey node and executor excludelists (49 
milliseconds)
   [info] - abort stage when all executors are excluded and we cannot acquire 
new executor (38 milliseconds)
   [info] - SPARK-22148 abort timer should kick in when task is completely 
excluded & no new executor can be acquired (32 milliseconds)
   [info] - SPARK-22148 try to acquire a new executor when task is 
unschedulable with 1 executor (29 milliseconds)
   [info] - SPARK-22148 abort timer should clear 
unschedulableTaskSetToExpiryTime for all TaskSets (45 milliseconds)
   [info] - SPARK-22148 Ensure we don't abort the taskSet if we haven't been 
completely excluded (31 milliseconds)
   [info] - SPARK-31418 abort timer should kick in when task is completely 
excluded & allocation manager could not acquire a new executor before the 
timeout (24 milliseconds)
   [info] - Excluded node for entire task set prevents per-task exclusion 
checks: iteration 0 (50 milliseconds)
   [info] - Excluded node for entire task set prevents per-task exclusion 
checks: iteration 1 (37 milliseconds)
   [info] - Excluded node for entire task set prevents per-task exclusion 
checks: iteration 2 (36 milliseconds)
   [info] - Excluded node for entire task set prevents per-task exclusion 
checks: iteration 3 (37 milliseconds)
   [info] - Excluded node for entire task set prevents per-task exclusion 
checks: iteration 4 (35 milliseconds)
   [info] - Excluded node for entire task set prevents per-task exclusion 
checks: iteration 5 (33 milliseconds)
   [info] - Excluded node for entire task set prevents per-task exclusion 
checks: iteration 6 (44 milliseconds)
   [info] - Excluded node for entire task set prevents per-task exclusion 
checks: iteration 7 (33 milliseconds)
   [info] - Excluded node for entire task set prevents per-task exclusion 
checks: iteration 8 (33 milliseconds)
   [info] - Excluded node for entire task set prevents per-task exclusion 
checks: iteration 9 (31 milliseconds)
   [info] - Excluded executor for entire task set prevents per-task exclusion 
checks: iteration 0 (34 milliseconds)
   [info] - Excluded executor for entire task set prevents per-task exclusion 
checks: iteration 1 (30 milliseconds)
   [info] - Excluded executor for entire task set prevents per-task exclusion 
checks: iteration 2 (31 milliseconds)
   [info] - Excluded executor for entire task set prevents per-task exclusion 
checks: iteration 3 (29 milliseconds)
   [info] - Excluded executor for entire task set prevents per-task exclusion 
checks: iteration 4 (31 milliseconds)
   [info] - Excluded executor for entire task set prevents per-task exclusion 
checks: iteration 5 (30 milliseconds)
   [info] - Excluded executor for entire task set prevents per-task exclusion 
checks: iteration 6 (29 milliseconds)
   [info] - Excluded executor for entire task set prevents per-task exclusion 
checks: iteration 7 (27 milliseconds)
   [info] - Excluded executor for entire task set prevents per-task exclusion 
checks: iteration 8 (39 milliseconds)
   [info] - Excluded 

Re: [PR] [SPARK-45527][CORE][TESTS][FOLLOW-UP] Reduce the number of test cases in fraction resource calculation [spark]

2024-02-26 Thread via GitHub


HyukjinKwon commented on PR #45272:
URL: https://github.com/apache/spark/pull/45272#issuecomment-1965693056

   @dongjoon-hyun I am sorry. This PR fixes the leftover.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[PR] [SPARK-45527][CORE][TESTS][FOLLOW-UP] Reduce the number of test cases in fraction resource calculation [spark]

2024-02-26 Thread via GitHub


HyukjinKwon opened a new pull request, #45272:
URL: https://github.com/apache/spark/pull/45272

   ### What changes were proposed in this pull request?
   
   There are two more instances that were mistakenly missed in 
https://github.com/apache/spark/pull/45268. This PR fixes 
both.
   
   ### Why are the changes needed?
   
   See https://github.com/apache/spark/pull/45268
   
   ### Does this PR introduce _any_ user-facing change?
   
   No, test-only.
   
   ### How was this patch tested?
   
   Manually
   
   ### Was this patch authored or co-authored using generative AI tooling?
   
   No.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-45527][CORE][TESTS][FOLLOW-UP] Reduce the number of test cases in fraction resource calculation [spark]

2024-02-26 Thread via GitHub


HyukjinKwon commented on PR #45268:
URL: https://github.com/apache/spark/pull/45268#issuecomment-1965690534

   ahh.. I am sorry for rushing it. I will make a follow-up.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-46881][CORE] Support `spark.deploy.workerSelectionPolicy` [spark]

2024-02-26 Thread via GitHub


dongjoon-hyun commented on code in PR #44906:
URL: https://github.com/apache/spark/pull/44906#discussion_r1503563961


##
core/src/test/scala/org/apache/spark/deploy/master/MasterSuite.scala:
##
@@ -661,6 +662,52 @@ class MasterSuite extends SparkFunSuite
 scheduleExecutorsForAppWithMultiRPs(withMaxCores = true)
   }
 
+
+  private val workerSelectionPolicyTestCases = Seq(
+(CORES_FREE_ASC, true, List("10001", "10002")),
+(CORES_FREE_ASC, false, List("10001")),
+(CORES_FREE_DESC, true, List("10004", "10005")),
+(CORES_FREE_DESC, false, List("10005")),
+(MEMORY_FREE_ASC, true, List("10001", "10005")),
+(MEMORY_FREE_ASC, false, List("10001")),
+(MEMORY_FREE_DESC, true, List("10002", "10003")),
+(MEMORY_FREE_DESC, false, Seq("10002")),
+(WORKER_ID, true, Seq("10001", "10002")),
+(WORKER_ID, false, Seq("10001")))
+
+  workerSelectionPolicyTestCases.foreach { case (policy, spreadOut, expected) 
=>
+test(s"SPARK-46881: scheduling with workerSelectionPolicy - $policy 
($spreadOut)") {

Review Comment:
   Thanks. Ya, let me take a look.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-47158][SQL] Assign proper name and `sqlState` to `_LEGACY_ERROR_TEMP_(2134|2231)` [spark]

2024-02-26 Thread via GitHub


itholic commented on PR #45244:
URL: https://github.com/apache/spark/pull/45244#issuecomment-1965685388

   Thanks, @MaxGekk, for the review! Just adjusted the comments.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-47094][SQL] SPJ : Dynamically rebalance number of buckets when they are not equal [spark]

2024-02-26 Thread via GitHub


himadripal commented on PR #45267:
URL: https://github.com/apache/spark/pull/45267#issuecomment-1965681861

   I was trying this test case:
   `val partition1 = Array(Expressions.years("ts"),
 bucket(2, "id"))
   val partition2 = Array(Expressions.days("ts"),
 bucket(4, "id"))`
   
   This throws an exception when partiallyClustered = false and 
allowCompatibleTransform = true. 
   
   Exception:
   `class java.lang.Integer cannot be cast to class java.lang.Long 
(java.lang.Integer and java.lang.Long are in module java.base of loader 
'bootstrap')
   java.lang.ClassCastException: class java.lang.Integer cannot be cast to 
class java.lang.Long (java.lang.Integer and java.lang.Long are in module 
java.base of loader 'bootstrap')
	at scala.runtime.BoxesRunTime.unboxToLong(BoxesRunTime.java:103)
	at 
org.apache.spark.sql.catalyst.expressions.BaseGenericInternalRow.getLong(rows.scala:42)`


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-45527][CORE][TESTS][FOLLOW-UP] Reduce the number of test cases in fraction resource calculation [spark]

2024-02-26 Thread via GitHub


dongjoon-hyun commented on PR #45268:
URL: https://github.com/apache/spark/pull/45268#issuecomment-1965681672

   Just for the record, after this commit, the following is the result on the 
`master` branch. It reduced a lot, but we still have more.
   
   ```
   [info] TaskSchedulerImplSuite:
   [info] - SPARK-32653: Decommissioned host/executor should be considered as 
inactive (414 milliseconds)
   [info] - Scheduler does not always schedule tasks on the same workers (304 
milliseconds)
   [info] - Scheduler correctly accounts for multiple CPUs per task (23 
milliseconds)
   [info] - SPARK-18886 - partial offers (isAllFreeResources = false) reset 
timer before any resources have been rejected (27 milliseconds)
   [info] - SPARK-18886 - delay scheduling timer is reset when it accepts all 
resources offered when isAllFreeResources = true (21 milliseconds)
   [info] - SPARK-18886 - task set with no locality requirements should not 
starve one with them (21 milliseconds)
   [info] - SPARK-18886 - partial resource offers (isAllFreeResources = false) 
reset time if last full resource offer (isAllResources = true) was accepted as 
well as any following partial resource offers (21 milliseconds)
   [info] - SPARK-18886 - partial resource offers (isAllFreeResources = false) 
do not reset time if any offer was rejected since last full offer was fully 
accepted (20 milliseconds)
   [info] - Scheduler does not crash when tasks are not serializable (22 
milliseconds)
   [info] - concurrent attempts for the same stage only have one active taskset 
(20 milliseconds)
   [info] - don't schedule more tasks after a taskset is zombie (21 
milliseconds)
   [info] - if a zombie attempt finishes, continue scheduling tasks for 
non-zombie attempts (20 milliseconds)
   [info] - tasks are not re-scheduled while executor loss reason is pending 
(24 milliseconds)
   OpenJDK 64-Bit Server VM warning: Sharing is only supported for boot loader 
classes because bootstrap classpath has been appended
   [info] - scheduled tasks obey task and stage excludelist (726 milliseconds)
   [info] - scheduled tasks obey node and executor excludelists (47 
milliseconds)
   [info] - abort stage when all executors are excluded and we cannot acquire 
new executor (37 milliseconds)
   [info] - SPARK-22148 abort timer should kick in when task is completely 
excluded & no new executor can be acquired (30 milliseconds)
   [info] - SPARK-22148 try to acquire a new executor when task is 
unschedulable with 1 executor (30 milliseconds)
   [info] - SPARK-22148 abort timer should clear 
unschedulableTaskSetToExpiryTime for all TaskSets (48 milliseconds)
   [info] - SPARK-22148 Ensure we don't abort the taskSet if we haven't been 
completely excluded (38 milliseconds)
   [info] - SPARK-31418 abort timer should kick in when task is completely 
excluded & allocation manager could not acquire a new executor before the 
timeout (25 milliseconds)
   [info] - Excluded node for entire task set prevents per-task exclusion 
checks: iteration 0 (42 milliseconds)
   [info] - Excluded node for entire task set prevents per-task exclusion 
checks: iteration 1 (36 milliseconds)
   [info] - Excluded node for entire task set prevents per-task exclusion 
checks: iteration 2 (35 milliseconds)
   [info] - Excluded node for entire task set prevents per-task exclusion 
checks: iteration 3 (38 milliseconds)
   [info] - Excluded node for entire task set prevents per-task exclusion 
checks: iteration 4 (35 milliseconds)
   [info] - Excluded node for entire task set prevents per-task exclusion 
checks: iteration 5 (33 milliseconds)
   [info] - Excluded node for entire task set prevents per-task exclusion 
checks: iteration 6 (43 milliseconds)
   [info] - Excluded node for entire task set prevents per-task exclusion 
checks: iteration 7 (34 milliseconds)
   [info] - Excluded node for entire task set prevents per-task exclusion 
checks: iteration 8 (33 milliseconds)
   [info] - Excluded node for entire task set prevents per-task exclusion 
checks: iteration 9 (31 milliseconds)
   [info] - Excluded executor for entire task set prevents per-task exclusion 
checks: iteration 0 (33 milliseconds)
   [info] - Excluded executor for entire task set prevents per-task exclusion 
checks: iteration 1 (31 milliseconds)
   [info] - Excluded executor for entire task set prevents per-task exclusion 
checks: iteration 2 (29 milliseconds)
   [info] - Excluded executor for entire task set prevents per-task exclusion 
checks: iteration 3 (28 milliseconds)
   [info] - Excluded executor for entire task set prevents per-task exclusion 
checks: iteration 4 (30 milliseconds)
   [info] - Excluded executor for entire task set prevents per-task exclusion 
checks: iteration 5 (29 milliseconds)
   [info] - Excluded executor for entire task set prevents per-task exclusion 
checks: iteration 6 (29 milliseconds)
   [info] - Excluded executor for entire task set prevents per-task exclusion 
checks: iteration 7 (30 milliseconds)
   [info] 

Re: [PR] [SPARK-47178][PYTHON][TESTS] Add a test case for createDataFrame with dataclasses [spark]

2024-02-26 Thread via GitHub


HyukjinKwon commented on code in PR #45271:
URL: https://github.com/apache/spark/pull/45271#discussion_r1503557832


##
python/pyspark/sql/session.py:
##
@@ -1299,7 +1299,7 @@ def createDataFrame(  # type: ignore[misc]
 --
 data : :class:`RDD` or iterable
 an RDD of any kind of SQL data representation (:class:`Row`,
-:class:`tuple`, ``int``, ``boolean``, etc.), or :class:`list`,
+:class:`tuple`, ``int``, ``boolean``, ``dict`` etc.), or 
:class:`list`,

Review Comment:
   ```suggestion
   :class:`tuple`, ``int``, ``boolean``, ``dict``, etc.), or 
:class:`list`,
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-47178][PYTHON][TESTS] Add a test case for createDataFrame with dataclasses [spark]

2024-02-26 Thread via GitHub


HyukjinKwon commented on code in PR #45271:
URL: https://github.com/apache/spark/pull/45271#discussion_r1503557591


##
python/pyspark/sql/session.py:
##
@@ -1299,7 +1299,7 @@ def createDataFrame(  # type: ignore[misc]
 --
 data : :class:`RDD` or iterable
 an RDD of any kind of SQL data representation (:class:`Row`,
-:class:`tuple`, ``int``, ``boolean``, etc.), or :class:`list`,
+:class:`tuple`, ``int``, ``boolean``, ``dict``` etc.), or 
:class:`list`,

Review Comment:
   ```suggestion
   :class:`tuple`, ``int``, ``boolean``, ``dict`` etc.), or 
:class:`list`,
   ```



##
python/pyspark/sql/session.py:
##
@@ -1299,7 +1299,7 @@ def createDataFrame(  # type: ignore[misc]
 --
 data : :class:`RDD` or iterable
 an RDD of any kind of SQL data representation (:class:`Row`,
-:class:`tuple`, ``int``, ``boolean``, etc.), or :class:`list`,
+:class:`tuple`, ``int``, ``boolean``, ``dict`` etc.), or 
:class:`list`,

Review Comment:
   ```suggestion
   :class:`tuple`, ``int``, ``boolean``, ``dict``, etc.), or 
:class:`list`,
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-47178][PYTHON][TESTS] Add a test case for createDataFrame with dataclasses [spark]

2024-02-26 Thread via GitHub


HyukjinKwon commented on code in PR #45271:
URL: https://github.com/apache/spark/pull/45271#discussion_r1503557591


##
python/pyspark/sql/session.py:
##
@@ -1299,7 +1299,7 @@ def createDataFrame(  # type: ignore[misc]
 --
 data : :class:`RDD` or iterable
 an RDD of any kind of SQL data representation (:class:`Row`,
-:class:`tuple`, ``int``, ``boolean``, etc.), or :class:`list`,
+:class:`tuple`, ``int``, ``boolean``, ``dict``` etc.), or 
:class:`list`,

Review Comment:
   ```suggestion
   :class:`tuple`, ``int``, ``boolean``, ``dict`` etc.), or 
:class:`list`,
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-45527][CORE][TESTS][FOLLOW-UP] Reduce the number of test cases in fraction resource calculation [spark]

2024-02-26 Thread via GitHub


dongjoon-hyun commented on PR #45268:
URL: https://github.com/apache/spark/pull/45268#issuecomment-1965679830

   Ur, this seems to be a partial fix. Could you reduce these together, 
@HyukjinKwon?
   
   ```
(1 to 20).foreach { taskNum =>
 val gpuTaskAmount = 
ResourceAmountUtils.toFractionalResource(ONE_ENTIRE_RESOURCE / taskNum)
 test(s"SPARK-45527 TaskResourceProfile with 
task.gpu.amount=${gpuTaskAmount} can " +
   ```
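   
   One possible shape of the reduction, sketched under the assumption that the 
boundary values plus one representative fraction give equivalent coverage (the 
exact subset is a judgment call, not the author's stated plan):
   
   ```scala
   // Cover the extremes and one middle value instead of all 20 fractions;
   // the test body stays the same as in the original loop.
   Seq(1, 2, 20).foreach { taskNum =>
 val gpuTaskAmount =
   ResourceAmountUtils.toFractionalResource(ONE_ENTIRE_RESOURCE / taskNum)
 test(s"SPARK-45527 TaskResourceProfile with task.gpu.amount=$gpuTaskAmount can " +
   "...") { // unchanged test body elided
 }
   }
   ```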


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-45527][CORE][TESTS][FOLLOW-UP] Reduce the number of test cases in fraction resource calculation [spark]

2024-02-26 Thread via GitHub


HyukjinKwon commented on PR #45268:
URL: https://github.com/apache/spark/pull/45268#issuecomment-1965679171

   I manually ran the tests and the linter to make sure.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-45527][CORE][TESTS][FOLLOW-UP] Reduce the number of test cases in fraction resource calculation [spark]

2024-02-26 Thread via GitHub


HyukjinKwon closed pull request #45268: [SPARK-45527][CORE][TESTS][FOLLOW-UP] 
Reduce the number of test cases in fraction resource calculation
URL: https://github.com/apache/spark/pull/45268


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-45527][CORE][TESTS][FOLLOW-UP] Reduce the number of test cases in fraction resource calculation [spark]

2024-02-26 Thread via GitHub


HyukjinKwon commented on PR #45268:
URL: https://github.com/apache/spark/pull/45268#issuecomment-1965678623

   Merged to master.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[PR] [SPARK-47178][PYTHON][TESTS] Add a test case for createDataFrame with dataclasses [spark]

2024-02-26 Thread via GitHub


HyukjinKwon opened a new pull request, #45271:
URL: https://github.com/apache/spark/pull/45271

   ### What changes were proposed in this pull request?
   
   This PR adds a test case for dataclasses with createDataFrame, which is 
already supported:
   
   ```python
   from dataclasses import dataclass
   
   @dataclass
   class User:
   name: str
   age: int
   is_active: bool
   
   user = User(name="John", age=30, is_active=True)
   spark.createDataFrame([user]).show()
   ```
   
   ### Why are the changes needed?
   
   To make sure dataclasses work with PySpark.
   
   ### Does this PR introduce _any_ user-facing change?
   
   No, test-only
   
   ### How was this patch tested?
   
   Added test case.
   
   ### Was this patch authored or co-authored using generative AI tooling?
   
   No.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-47166][INFRA] Improves the HINTs of merge_spark_pr.py [spark]

2024-02-26 Thread via GitHub


dongjoon-hyun commented on PR #45256:
URL: https://github.com/apache/spark/pull/45256#issuecomment-1965673466

   Thank you, @yaooqinn .


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-47158][SQL] Assign proper name and `sqlState` to `_LEGACY_ERROR_TEMP_(2134|2231)` [spark]

2024-02-26 Thread via GitHub


itholic commented on code in PR #45244:
URL: https://github.com/apache/spark/pull/45244#discussion_r1503550776


##
common/utils/src/main/resources/error/error-classes.json:
##
@@ -298,6 +298,12 @@
 ],
 "sqlState" : "22018"
   },
+  "CANNOT_PARSE_STRING_AS_DATATYPE" : {

Review Comment:
   According to https://github.com/apache/spark/pull/45244/files#r1503550041, 
let me just turn this into an internal error.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-47175][SS][TESTS] Remove ZOOKEEPER-1844 comment from `KafkaTestUtils` [spark]

2024-02-26 Thread via GitHub


dongjoon-hyun commented on PR #45265:
URL: https://github.com/apache/spark/pull/45265#issuecomment-1965673018

   Thank you, @HyukjinKwon and @yaooqinn !


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-47158][SQL] Assign proper name and `sqlState` to `_LEGACY_ERROR_TEMP_(2134|2231)` [spark]

2024-02-26 Thread via GitHub


itholic commented on code in PR #45244:
URL: https://github.com/apache/spark/pull/45244#discussion_r1503550041


##
sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/util/TimestampFormatterSuite.scala:
##
@@ -502,4 +504,25 @@ class TimestampFormatterSuite extends 
DatetimeFormatterSuite {
 assert(formatter.parseOptional("-12-31 23:59:59.999").isEmpty)
 assert(formatter.parseWithoutTimeZoneOptional("-12-31 23:59:59.999", 
true).isEmpty)
   }
+
+  test("fail to parse string as TimestampNTZ with invalid format") {
+val zoneId = ZoneId.systemDefault()
+val locale = Locale.getDefault()
+val formatter = new DefaultTimestampFormatter(
+  zoneId, locale, LENIENT_SIMPLE_DATE_FORMAT, isParsing = true)
+
+val invalidTimestampStr = "2021-13-01T25:61:61"
+
+checkError(
+  exception = intercept[SparkRuntimeException] {
+formatter.parseWithoutTimeZone(invalidTimestampStr, allowTimeZone = 
false)

Review Comment:
   I just realized the comment from the source code below:
   
   ```
   * The formatter for timestamps which doesn't require users to specify a 
pattern. While formatting,
* it uses the default pattern [[TimestampFormatter.defaultPattern()]]. In 
parsing, it follows
* the CAST logic in conversion of strings to Catalyst's TimestampType.
   ```
   
   so I suspect this cannot be reproduced from user space and we should 
mark it as an internal error.
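   
   A minimal sketch of the internal-error route, assuming the 
`SparkException.internalError` helper used elsewhere in the codebase (the 
message text is illustrative):
   
   ```scala
   // Raised instead of a user-facing error class, since DefaultTimestampFormatter
   // follows CAST semantics and this parse failure should not be reachable with a
   // user-supplied pattern.
   throw SparkException.internalError(
 s"Failed to parse the timestamp string '$invalidTimestampStr' without time zone.")
   ```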



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-47158][SQL] Assign proper name and `sqlState` to `_LEGACY_ERROR_TEMP_(2134|2231)` [spark]

2024-02-26 Thread via GitHub


itholic commented on code in PR #45244:
URL: https://github.com/apache/spark/pull/45244#discussion_r1503550041


##
sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/util/TimestampFormatterSuite.scala:
##
@@ -502,4 +504,25 @@ class TimestampFormatterSuite extends 
DatetimeFormatterSuite {
 assert(formatter.parseOptional("-12-31 23:59:59.999").isEmpty)
 assert(formatter.parseWithoutTimeZoneOptional("-12-31 23:59:59.999", 
true).isEmpty)
   }
+
+  test("fail to parse string as TimestampNTZ with invalid format") {
+val zoneId = ZoneId.systemDefault()
+val locale = Locale.getDefault()
+val formatter = new DefaultTimestampFormatter(
+  zoneId, locale, LENIENT_SIMPLE_DATE_FORMAT, isParsing = true)
+
+val invalidTimestampStr = "2021-13-01T25:61:61"
+
+checkError(
+  exception = intercept[SparkRuntimeException] {
+formatter.parseWithoutTimeZone(invalidTimestampStr, allowTimeZone = 
false)

Review Comment:
   I just realized the comment from the source code below:
   
   ```
   * The formatter for timestamps which doesn't require users to specify a 
pattern. While formatting,
* it uses the default pattern [[TimestampFormatter.defaultPattern()]]. In 
parsing, it follows
* the CAST logic in conversion of strings to Catalyst's TimestampType.
   ```
   
   so I believe this cannot be reproduced from user space and we should mark it 
as an internal error.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-47158][SQL] Assign proper name and `sqlState` to `_LEGACY_ERROR_TEMP_(2134|2231)` [spark]

2024-02-26 Thread via GitHub


itholic commented on code in PR #45244:
URL: https://github.com/apache/spark/pull/45244#discussion_r1503547976


##
common/utils/src/main/resources/error/error-classes.json:
##
@@ -1272,6 +1278,12 @@
 ],
 "sqlState" : "42710"
   },
+  "FIELD_INDEX_ON_ROW_WITHOUT_SCHEMA" : {
+"message" : [
+  "fieldIndex on a Row without schema is undefined."
+],
+"sqlState" : "42000"

Review Comment:
   Sounds good. I think we can use `42601` (a character, token, or clause is 
invalid or missing).
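   
   For completeness, a hedged reproduction of the error being classified here, 
using the standard `Row` API (the values are illustrative):
   
   ```scala
   import org.apache.spark.sql.Row
   
   // A Row built directly, without an attached schema: looking a field up by
   // name has no defined meaning, which is exactly the error in question.
   val row = Row(1, "a")
   row.fieldIndex("a") // throws: fieldIndex on a Row without schema is undefined.
   ```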



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-47101][SQL] Allow comma to be used in top-level column names and remove check nested type definition in `HiveExternalCatalog.verifyDataSchema` [spark]

2024-02-26 Thread via GitHub


yaooqinn commented on code in PR #45180:
URL: https://github.com/apache/spark/pull/45180#discussion_r1503547763


##
sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/HiveDDLSuite.scala:
##
@@ -2876,22 +2876,39 @@ class HiveDDLSuite
 }
   }
 
-  test("SPARK-24681 checks if nested column names do not include ',', ':', and 
';'") {
-Seq("nested,column", "nested:column", "nested;column").foreach { 
nestedColumnName =>
+  test("SPARK-47101 checks if nested column names do not include invalid 
characters") {
+// delimiter characters
+Seq(",", ":").foreach { c =>
+  val typ = s"array>"
+  // The regex is from HiveClientImpl.getSparkSQLDataType, please keep 
them in sync.
+  val replaced = typ.replaceAll("`", 
"").replaceAll("(?<=struct<|,)([^,<:]+)(?=:)", "`$1`")
+  withTable("t") {
+checkError(
+  exception = intercept[SparkException] {
+sql(s"CREATE TABLE t (a $typ) USING hive")

Review Comment:
   Still fine
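   
   As a side note for readers, the quoting regex cited in the diff can be checked 
standalone (a hedged sketch; the input type string is illustrative):
   
   ```scala
   // Backtick-quote each struct field name so delimiter characters in names
   // survive Hive's parsing; this mirrors the replaceAll chain in the test.
   val typ = "array<struct<x:int,y:string>>"
   val replaced = typ
 .replaceAll("`", "")
 .replaceAll("(?<=struct<|,)([^,<:]+)(?=:)", "`$1`")
   // replaced == "array<struct<`x`:int,`y`:string>>"
   ```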



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [WIP][SPARK-24815] [CORE] Trigger Interval based DRA for Structured Streaming [spark]

2024-02-26 Thread via GitHub


pkotikalapudi commented on PR #42352:
URL: https://github.com/apache/spark/pull/42352#issuecomment-1965668989

   > If you'd like to revive this PR, please reopen it and ask a committer to 
remove the Stale tag!
   
   @rdblue, can you please re-open the PR and remove the Stale tag? I think the
   GitHub bot will auto-close it if the tag exists.
   
   Thank you


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-47175][SS][TESTS] Remove ZOOKEEPER-1844 comment from `KafkaTestUtils` [spark]

2024-02-26 Thread via GitHub


yaooqinn commented on PR #45265:
URL: https://github.com/apache/spark/pull/45265#issuecomment-1965663502

   Thanks @dongjoon-hyun @HyukjinKwon, merged to master


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-47175][SS][TESTS] Remove ZOOKEEPER-1844 comment from `KafkaTestUtils` [spark]

2024-02-26 Thread via GitHub


yaooqinn closed pull request #45265: [SPARK-47175][SS][TESTS] Remove 
ZOOKEEPER-1844 comment from `KafkaTestUtils`
URL: https://github.com/apache/spark/pull/45265


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-47165][SQL][TESTS] Pull docker image only when it's absent [spark]

2024-02-26 Thread via GitHub


yaooqinn commented on PR #45255:
URL: https://github.com/apache/spark/pull/45255#issuecomment-1965662096

   Thanks @dongjoon-hyun, merged to master
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-47165][SQL][TESTS] Pull docker image only when it's absent [spark]

2024-02-26 Thread via GitHub


yaooqinn closed pull request #45255: [SPARK-47165][SQL][TESTS] Pull docker 
image only when it's absent
URL: https://github.com/apache/spark/pull/45255


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-47165][SQL][TESTS] Pull docker image only when it's absent [spark]

2024-02-26 Thread via GitHub


yaooqinn commented on code in PR #45255:
URL: https://github.com/apache/spark/pull/45255#discussion_r1503540284


##
connector/docker-integration-tests/src/test/scala/org/apache/spark/sql/jdbc/DockerJDBCIntegrationSuite.scala:
##
@@ -148,10 +148,6 @@ abstract class DockerJDBCIntegrationSuite
   pulled = true
   }
 
-  docker.pullImageCmd(db.imageName)
-.start()
-.awaitCompletion(connectionTimeout.value.toSeconds, TimeUnit.SECONDS)

Review Comment:
   @dongjoon-hyun, I am not quite sure :)
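   
   For the record, a sketch of the "pull only when absent" idea this PR goes
   for, assuming the docker-java client API already used in this suite (the
   repo-tag matching below is my guess, not code from the PR):
   
   ```
   import java.util.concurrent.TimeUnit
   import scala.jdk.CollectionConverters._
   
   // Only pull the image if no locally cached image carries the wanted tag.
   val imagePresent = docker.listImagesCmd().exec().asScala.exists { img =>
 Option(img.getRepoTags).exists(_.contains(db.imageName))
   }
   if (!imagePresent) {
 docker.pullImageCmd(db.imageName)
   .start()
   .awaitCompletion(connectionTimeout.value.toSeconds, TimeUnit.SECONDS)
   }
   ```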



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-47094][SQL] SPJ : Dynamically rebalance number of buckets when they are not equal [spark]

2024-02-26 Thread via GitHub


szehon-ho commented on code in PR #45267:
URL: https://github.com/apache/spark/pull/45267#discussion_r1503537598


##
sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala:
##
@@ -1537,6 +1537,18 @@ object SQLConf {
   .booleanConf
   .createWithDefault(false)
 
+  val V2_BUCKETING_ALLOW_COMPATIBLE_TRANSFORMS =
+buildConf("spark.sql.sources.v2.bucketing.allow.enabled")

Review Comment:
   Thanks, yeah, it is a copy-and-paste error.
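   
   For illustration, a hypothetical corrected definition; the key name below is
   my guess at the intent, not necessarily the name this PR will settle on:
   
   ```
   val V2_BUCKETING_ALLOW_COMPATIBLE_TRANSFORMS =
 buildConf("spark.sql.sources.v2.bucketing.allowCompatibleTransforms.enabled")
   .doc("Whether to allow storage-partitioned join when the partition " +
 "transforms of the two sides are compatible but not identical.")
   .version("4.0.0")
   .booleanConf
   .createWithDefault(false)
   ```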



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-47166][INFRA] Improves the HINTs of merge_spark_pr.py [spark]

2024-02-26 Thread via GitHub


yaooqinn commented on code in PR #45256:
URL: https://github.com/apache/spark/pull/45256#discussion_r1503537067


##
dev/merge_spark_pr.py:
##
@@ -72,6 +72,13 @@
 BRANCH_PREFIX = "PR_TOOL"
 
 
+def print_error(msg):
+print("\033[91m%s\033[0m" % msg)
+
+
+def bold_input(prompt) -> str:
+return input("\033[1m%s\033[0m" % prompt)
+

Review Comment:
   fixed



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-39771][CORE] Add a warning msg in `Dependency` when a too large number of shuffle blocks is to be created. [spark]

2024-02-26 Thread via GitHub


JoshRosen commented on code in PR #45266:
URL: https://github.com/apache/spark/pull/45266#discussion_r1503532857


##
core/src/main/scala/org/apache/spark/Dependency.scala:
##
@@ -206,6 +206,21 @@ class ShuffleDependency[K: ClassTag, V: ClassTag, C: 
ClassTag](
 finalizeTask = Option(task)
   }
 
+  // Set the threshold to 1 billion which represents approximately 1GB of memory
+  // allocated to map output statuses.
+  // A large number of shuffle blocks may crash the driver with an OOM error.
+  private val SHUFFLE_BLOCK_NUMBER_WARNING_THRESHOLD: Long = 1000L * 1000 * 1000
+  private val numberOfShuffleBlocks = numPartitions.toLong * partitioner.numPartitions.toLong
+
+  if (numberOfShuffleBlocks > SHUFFLE_BLOCK_NUMBER_WARNING_THRESHOLD) {
+logWarning(
+  s"The number of shuffle blocks (${numberOfShuffleBlocks}) for ${_rdd} " +

Review Comment:
   Maybe we could also log the shuffle ID here, e.g.:
   
   ```
   "The number of shuffle blocks (...) for shuffleId $shuffleId for ${_rdd} 
with ${numPartition} partitions "
   ```
   
   or something along those lines?
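   
   For illustration, one way the final message could read, assuming `shuffleId`
   is in scope on the dependency (names and wording here are a sketch, not the
   PR's final text):
   
   ```
   logWarning(
 s"The number of shuffle blocks ($numberOfShuffleBlocks) for shuffle $shuffleId " +
 s"for ${_rdd} with $numPartitions partitions is very large, which may crash " +
 "the driver; consider reducing the number of map or reduce partitions.")
   ```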



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-47166][INFRA] Improves the HINTs of merge_spark_pr.py [spark]

2024-02-26 Thread via GitHub


yaooqinn commented on code in PR #45256:
URL: https://github.com/apache/spark/pull/45256#discussion_r1503535482


##
dev/merge_spark_pr.py:
##
@@ -72,6 +72,13 @@
 BRANCH_PREFIX = "PR_TOOL"
 
 
+def print_error(msg):
+print("\033[91m%s\033[0m" % msg)

Review Comment:
   https://en.wikipedia.org/wiki/ANSI_escape_code#Colors
   
   Not all terminals and operating systems support ANSI colors, but most of them do.
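   
   A quick way to eyeball the two sequences the script uses (same escape codes,
   shown here as a Scala REPL check rather than the script's Python):
   
   ```
   println("\u001b[91mthis line should render red\u001b[0m")   // 91 = bright red
   println("\u001b[1mthis line should render bold\u001b[0m")   // 1  = bold
   ```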



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org


