Re: [PR] [SPARK-47144][CONNECT][SQL] Fix Spark Connect collation error by adding collateId protobuf field [spark]
dbatomic commented on code in PR #45233: URL: https://github.com/apache/spark/pull/45233#discussion_r1503780793 ## connector/connect/common/src/main/protobuf/spark/connect/types.proto: ## @@ -100,6 +100,7 @@ message DataType { message String { uint32 type_variation_reference = 1; +optional uint32 collation_id = 2; Review Comment: Yes, StringType will always have a collation_id so making it required sounds good. The thing that worries me here is backwards compatibility. E.g. what happens if other side is on an old version without collation support? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[PR] [SPARK-41873][PYTHON][CONNECT][TESTS] Enable `DataFrameParityTests.test_pandas_api` [spark]
zhengruifeng opened a new pull request, #45279: URL: https://github.com/apache/spark/pull/45279 ### What changes were proposed in this pull request? Enable `DataFrameParityTests.test_pandas_api` ### Why are the changes needed? for testing parity, this method had already been implemented ### Does this PR introduce _any_ user-facing change? no, test-only ### How was this patch tested? enabled ut ### Was this patch authored or co-authored using generative AI tooling? no -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-39771][CORE] Add a warning msg in `Dependency` when a too large number of shuffle blocks is to be created. [spark]
pan3793 commented on code in PR #45266: URL: https://github.com/apache/spark/pull/45266#discussion_r1503740095 ## core/src/main/scala/org/apache/spark/Dependency.scala: ## @@ -206,6 +206,21 @@ class ShuffleDependency[K: ClassTag, V: ClassTag, C: ClassTag]( finalizeTask = Option(task) } + // Set the threshold to 1 billion which represents approximately 1GB of memory + // allocated to map output statuses. Review Comment: is the assumption based on `CompressedMapStatus`? For shuffle partition number exceeds `spark.shuffle.minNumPartitionsToHighlyCompress`(default is 2000), `HighlyCompressedMapStatus` is used, is it correct then? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-46913][SS] Add support for processing/event time based timers with transformWithState operator [spark]
anishshri-db commented on code in PR #45051: URL: https://github.com/apache/spark/pull/45051#discussion_r1503733977 ## sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/TransformWithStateExec.scala: ## @@ -69,8 +70,20 @@ case class TransformWithStateExec( override def shortName: String = "transformWithStateExec" - // TODO: update this to run no-data batches when timer support is added - override def shouldRunAnotherBatch(newInputWatermark: Long): Boolean = false + override def shouldRunAnotherBatch(newInputWatermark: Long): Boolean = { +timeoutMode match { + // TODO: check if we can return true only if actual timers are registered Review Comment: Yea - but the time at which this is called - I believe we don't have the storeRDD or store instance available directly. We could potentially track some count, but haven't tried the change/optimization yet. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-46913][SS] Add support for processing/event time based timers with transformWithState operator [spark]
anishshri-db commented on code in PR #45051: URL: https://github.com/apache/spark/pull/45051#discussion_r1503733123 ## sql/api/src/main/scala/org/apache/spark/sql/streaming/StatefulProcessor.scala: ## @@ -48,16 +48,19 @@ private[sql] trait StatefulProcessor[K, I, O] extends Serializable { * @param inputRows - iterator of input rows associated with grouping key * @param timerValues - instance of TimerValues that provides access to current processing/event *time if available + * @param expiredTimerInfo - instance of ExpiredTimerInfo that provides access to expired timer + * if applicable * @return - Zero or more output rows */ def handleInputRows( key: K, inputRows: Iterator[I], - timerValues: TimerValues): Iterator[O] + timerValues: TimerValues, + expiredTimerInfo: ExpiredTimerInfo): Iterator[O] Review Comment: Yea thought of that - but we would have the same problem with the Java API in the future -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-46913][SS] Add support for processing/event time based timers with transformWithState operator [spark]
anishshri-db commented on code in PR #45051: URL: https://github.com/apache/spark/pull/45051#discussion_r1503732422 ## sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/StatefulProcessorHandleImpl.scala: ## @@ -121,6 +123,42 @@ class StatefulProcessorHandleImpl( override def getQueryInfo(): QueryInfo = currQueryInfo + private def getTimerState[T](): TimerStateImpl[T] = { +new TimerStateImpl[T](store, timeoutMode, keyEncoder) + } + + private val timerState = getTimerState[Boolean]() + + override def registerTimer(expiryTimestampMs: Long): Unit = { +verify(timeoutMode == ProcessingTime || timeoutMode == EventTime, Review Comment: The existing one is tied to `GroupState`, do did not want to reuse that. Also, if we add more modes here in the future - thought that its better to keep this generic and separate to this API -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-46913][SS] Add support for processing/event time based timers with transformWithState operator [spark]
anishshri-db commented on code in PR #45051: URL: https://github.com/apache/spark/pull/45051#discussion_r1503731606 ## sql/api/src/main/scala/org/apache/spark/sql/streaming/ExpiredTimerInfo.scala: ## @@ -0,0 +1,45 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.sql.streaming + +import java.io.Serializable + +import org.apache.spark.annotation.{Evolving, Experimental} + +/** + * Class used to provide access to expired timer's expiry time and timeout mode. These values + * are only relevant if the ExpiredTimerInfo is valid. + */ +@Experimental +@Evolving +private[sql] trait ExpiredTimerInfo extends Serializable { + /** + * Check if provided ExpiredTimerInfo is valid. + */ + def isValid(): Boolean + + /** + * Get the expired timer's expiry time as milliseconds in epoch time. + */ + def getExpiryTimeInMs(): Long + + /** + * Get the expired timer's timeout mode. + */ + def getTimeoutMode(): TimeoutMode Review Comment: But that is not available to the `StatefulProcessor` though. Do you prefer to pass the timeout mode in the `init` method instead ? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-47182][BUILD] Exclude `commons-(io|lang3)` transitive dependencies from `commons-compress` and `avro*` [spark]
dongjoon-hyun commented on PR #45278: URL: https://github.com/apache/spark/pull/45278#issuecomment-1965901217 Merged to master for Apache Spark 4.0.0. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-47182][BUILD] Exclude `commons-(io|lang3)` transitive dependencies from `commons-compress` and `avro*` [spark]
dongjoon-hyun closed pull request #45278: [SPARK-47182][BUILD] Exclude `commons-(io|lang3)` transitive dependencies from `commons-compress` and `avro*` URL: https://github.com/apache/spark/pull/45278 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-39771][CORE] Add a warning msg in `Dependency` when a too large number of shuffle blocks is to be created. [spark]
dongjoon-hyun commented on code in PR #45266: URL: https://github.com/apache/spark/pull/45266#discussion_r1503714884 ## core/src/main/scala/org/apache/spark/Dependency.scala: ## @@ -206,6 +206,21 @@ class ShuffleDependency[K: ClassTag, V: ClassTag, C: ClassTag]( finalizeTask = Option(task) } + // Set the threshold to 1 billion which represents approximately 1GB of memory + // allocated to map output statuses. + // A large number of shuffle blocks may crash the driver with an OOM error. + private val SHUFFLE_BLOCK_NUMBER_WARNING_THRESHOLD: Long = 1L << 30 + private val numberOfShuffleBlocks = numPartitions.toLong * partitioner.numPartitions.toLong + + if (numberOfShuffleBlocks > SHUFFLE_BLOCK_NUMBER_WARNING_THRESHOLD) { Review Comment: Just a question. Is there a chance of overflow at line 213? > private val numberOfShuffleBlocks = numPartitions.toLong * partitioner.numPartitions.toLong -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-47094][SQL] SPJ : Dynamically rebalance number of buckets when they are not equal [spark]
sunchao commented on PR #45267: URL: https://github.com/apache/spark/pull/45267#issuecomment-1965888375 Thanks @szehon-ho ! will take a look in a few days. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-39771][CORE] Add a warning msg in `Dependency` when a too large number of shuffle blocks is to be created. [spark]
y-wei commented on code in PR #45266: URL: https://github.com/apache/spark/pull/45266#discussion_r1503712574 ## core/src/test/scala/org/apache/spark/ShuffleSuite.scala: ## @@ -483,6 +483,17 @@ abstract class ShuffleSuite extends SparkFunSuite with Matchers with LocalRootDi } assert(e.getMessage.contains("corrupted due to DISK_ISSUE")) } + + test("SPARK-39771: warn when shuffle block number is too large") { +sc = new SparkContext("local", "test", conf) +val logAppender = new LogAppender("deprecated Avro option 'ignoreExtension'") Review Comment: True, thanks for catching this -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-47176][SQL] Have a ResolveAllExpressionsUpWithPruning helper function [spark]
cloud-fan closed pull request #45270: [SPARK-47176][SQL] Have a ResolveAllExpressionsUpWithPruning helper function URL: https://github.com/apache/spark/pull/45270 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [MINOR][SS] Minor string representation improvement in AssertOnQuery [spark]
HyukjinKwon closed pull request #45275: [MINOR][SS] Minor string representation improvement in AssertOnQuery URL: https://github.com/apache/spark/pull/45275 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [MINOR][SS] Minor string representation improvement in AssertOnQuery [spark]
HyukjinKwon commented on PR #45275: URL: https://github.com/apache/spark/pull/45275#issuecomment-1965884273 Merged to master. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [MINOR][SS] Minor string representation improvement in AssertOnQuery [spark]
HyukjinKwon commented on code in PR #45275: URL: https://github.com/apache/spark/pull/45275#discussion_r1503711045 ## sql/core/src/test/scala/org/apache/spark/sql/streaming/StreamTest.scala: ## @@ -284,7 +284,11 @@ trait StreamTest extends QueryTest with SharedSparkSession with TimeLimits with /** Assert that a condition on the active query is true */ class AssertOnQuery(val condition: StreamExecution => Boolean, val message: String) extends StreamAction with StreamMustBeRunning { -override def toString: String = s"AssertOnQuery(, $message)" +override def toString: String = if (message == "") { Review Comment: It won't be :-). This is for the case below: ``` def apply(condition: StreamExecution => Boolean, message: String = ""): AssertOnQuery = { new AssertOnQuery(condition, message) } ``` -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-46913][SS] Add support for processing/event time based timers with transformWithState operator [spark]
sahnib commented on code in PR #45051: URL: https://github.com/apache/spark/pull/45051#discussion_r1503581091 ## sql/api/src/main/scala/org/apache/spark/sql/streaming/ExpiredTimerInfo.scala: ## @@ -0,0 +1,45 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.sql.streaming + +import java.io.Serializable + +import org.apache.spark.annotation.{Evolving, Experimental} + +/** + * Class used to provide access to expired timer's expiry time and timeout mode. These values + * are only relevant if the ExpiredTimerInfo is valid. + */ +@Experimental +@Evolving +private[sql] trait ExpiredTimerInfo extends Serializable { + /** + * Check if provided ExpiredTimerInfo is valid. + */ + def isValid(): Boolean + + /** + * Get the expired timer's expiry time as milliseconds in epoch time. + */ + def getExpiryTimeInMs(): Long + + /** + * Get the expired timer's timeout mode. + */ + def getTimeoutMode(): TimeoutMode Review Comment: Would this ever be different than the timeout mode provided to `transformWithState` API? If no, do we need this here? ## sql/api/src/main/scala/org/apache/spark/sql/streaming/StatefulProcessor.scala: ## @@ -48,16 +48,19 @@ private[sql] trait StatefulProcessor[K, I, O] extends Serializable { * @param inputRows - iterator of input rows associated with grouping key * @param timerValues - instance of TimerValues that provides access to current processing/event *time if available + * @param expiredTimerInfo - instance of ExpiredTimerInfo that provides access to expired timer + * if applicable * @return - Zero or more output rows */ def handleInputRows( key: K, inputRows: Iterator[I], - timerValues: TimerValues): Iterator[O] + timerValues: TimerValues, + expiredTimerInfo: ExpiredTimerInfo): Iterator[O] /** * Function called as the last method that allows for users to perform * any cleanup or teardown operations. */ - def close (): Unit + def close (): Unit = {} Review Comment: Nice idea to provide a default implementation here. ## sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/StatefulProcessorHandleImpl.scala: ## @@ -121,6 +123,42 @@ class StatefulProcessorHandleImpl( override def getQueryInfo(): QueryInfo = currQueryInfo + private def getTimerState[T](): TimerStateImpl[T] = { +new TimerStateImpl[T](store, timeoutMode, keyEncoder) + } + + private val timerState = getTimerState[Boolean]() + + override def registerTimer(expiryTimestampMs: Long): Unit = { +verify(timeoutMode == ProcessingTime || timeoutMode == EventTime, +s"Cannot register timers with incorrect TimeoutMode") +verify(currState == INITIALIZED || currState == DATA_PROCESSED, +s"Cannot register timers with " + + s"expiryTimestampMs=$expiryTimestampMs in current state=$currState") Review Comment: We should use the NERF framework for these user errors. ## sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/StatefulProcessorHandleImpl.scala: ## @@ -121,6 +123,42 @@ class StatefulProcessorHandleImpl( override def getQueryInfo(): QueryInfo = currQueryInfo + private def getTimerState[T](): TimerStateImpl[T] = { +new TimerStateImpl[T](store, timeoutMode, keyEncoder) + } + + private val timerState = getTimerState[Boolean]() + + override def registerTimer(expiryTimestampMs: Long): Unit = { +verify(timeoutMode == ProcessingTime || timeoutMode == EventTime, Review Comment: Shouldn't this be same as the timeoutMode in transformWithState API? ## sql/api/src/main/scala/org/apache/spark/sql/streaming/StatefulProcessorHandle.scala: ## @@ -51,6 +51,25 @@ private[sql] trait StatefulProcessorHandle extends Serializable { /** Function to return queryInfo for currently running task */ def getQueryInfo(): QueryInfo + /** + * Function to register a processing/event time based timer for given implicit key Review Comment: [nit] `implicit key` -> `implicit grouping key`. ##
Re: [PR] [SPARK-47176][SQL] Have a ResolveAllExpressionsUpWithPruning helper function [spark]
cloud-fan commented on PR #45270: URL: https://github.com/apache/spark/pull/45270#issuecomment-1965883152 thanks, merging to master! -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-47164][SQL] Make Default Value From Wider Type Narrow Literal of v2 behave the same as v1 [spark]
cloud-fan commented on PR #45254: URL: https://github.com/apache/spark/pull/45254#issuecomment-1965882323 late LGTM -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-39771][CORE] Add a warning msg in `Dependency` when a too large number of shuffle blocks is to be created. [spark]
dongjoon-hyun commented on code in PR #45266: URL: https://github.com/apache/spark/pull/45266#discussion_r1503709429 ## core/src/test/scala/org/apache/spark/ShuffleSuite.scala: ## @@ -483,6 +483,17 @@ abstract class ShuffleSuite extends SparkFunSuite with Matchers with LocalRootDi } assert(e.getMessage.contains("corrupted due to DISK_ISSUE")) } + + test("SPARK-39771: warn when shuffle block number is too large") { +sc = new SparkContext("local", "test", conf) +val logAppender = new LogAppender("deprecated Avro option 'ignoreExtension'") Review Comment: Maybe, is this simply a copy-and-paste mistake from here? https://github.com/apache/spark/blob/1a408033daf458f1ceebbe14a560355a1a2c0a70/connector/avro/src/test/scala/org/apache/spark/sql/avro/AvroSuite.scala#L2229 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-39771][CORE] Add a warning msg in `Dependency` when a too large number of shuffle blocks is to be created. [spark]
dongjoon-hyun commented on code in PR #45266: URL: https://github.com/apache/spark/pull/45266#discussion_r1503708211 ## core/src/test/scala/org/apache/spark/ShuffleSuite.scala: ## @@ -483,6 +483,17 @@ abstract class ShuffleSuite extends SparkFunSuite with Matchers with LocalRootDi } assert(e.getMessage.contains("corrupted due to DISK_ISSUE")) } + + test("SPARK-39771: warn when shuffle block number is too large") { +sc = new SparkContext("local", "test", conf) +val logAppender = new LogAppender("deprecated Avro option 'ignoreExtension'") Review Comment: `"deprecated Avro option 'ignoreExtension'"` seems to be wrong to me. Is this intentional, @y-wei ? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-47182][BUILD] Exclude `commons-(io|lang3)` transitive dependencies from `commons-compress` and `avro*` [spark]
dongjoon-hyun commented on PR #45278: URL: https://github.com/apache/spark/pull/45278#issuecomment-1965865838 Thank you, @LuciferYang . I fully understand your concern and have the same feeling about this kind of tedious code changes which sometimes looks redundant. This is just a way of communication in order to give a clarify of Apache Spark 4.0.0 dependency issues. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-47182][BUILD] Exclude `commons-(io|lang3)` transitive dependencies from `commons-compress` and `avro*` [spark]
LuciferYang commented on PR #45278: URL: https://github.com/apache/spark/pull/45278#issuecomment-1965863663 > To give you the context, I'm currently working on this for Apache Spark 4.0.0 preparation, @LuciferYang . > > * [SPARK-47046](https://issues.apache.org/jira/browse/SPARK-47046) Apache Spark 4.0.0 Dependency Audit and Cleanup > > * https://github.com/apache/spark/security Thanks @dongjoon-hyun This is a great work!!! -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-47182][BUILD] Exclude `commons-(io|lang3)` transitive dependencies from `commons-compress` and `avro*` [spark]
dongjoon-hyun commented on PR #45278: URL: https://github.com/apache/spark/pull/45278#issuecomment-1965862146 No, I don't think so if there is no other CVE related stuff bites us again, @LuciferYang . > Okay, I understand your intention. So, will there be more similar PRs in the future to explicitly exclude the transitive dependencies declared by Spark? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-47182][BUILD] Exclude `commons-(io|lang3)` transitive dependencies from `commons-compress` and `avro*` [spark]
dongjoon-hyun commented on PR #45278: URL: https://github.com/apache/spark/pull/45278#issuecomment-1965861469 To give you the context, I'm currently working on this for Apache Spark 4.0.0 preparation, @LuciferYang . - SPARK-47046 Apache Spark 4.0.0 Dependency Audit and Cleanup - https://github.com/apache/spark/security -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-47182][BUILD] Exclude `commons-(io|lang3)` transitive dependencies from `commons-compress` and `avro*` [spark]
LuciferYang commented on PR #45278: URL: https://github.com/apache/spark/pull/45278#issuecomment-1965861359 Okay, I understand your intention. So, will there be more similar PRs in the future to explicitly exclude the transitive dependencies declared by Spark? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-47182][BUILD] Exclude `commons-(io|lang3)` transitive dependencies from `commons-compress` and `avro*` [spark]
dongjoon-hyun commented on PR #45278: URL: https://github.com/apache/spark/pull/45278#issuecomment-1965859447 Specifically, the following dependency clarification is the goal of this PR. **BEFORE** ``` $ build/sbt "core/dependencyTree" | grep -C2 commons-compress | tail -n3 Using SPARK_LOCAL_IP=localhost [info] +-org.apache.commons:commons-compress:1.26.0 [info] | +-commons-io:commons-io:2.15.1 [info] | +-org.apache.commons:commons-lang3:3.14.0 ``` **AFTER** ``` $ build/sbt "core/dependencyTree" | grep -C2 commons-compress | tail -n3 Using SPARK_LOCAL_IP=localhost [info] +-org.apache.commons:commons-compress:1.26.0 [info] +-org.apache.commons:commons-crypto:1.1.0 [info] +-org.apache.commons:commons-lang3:3.14.0 ``` -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-47153][CORE] Guard serialize/deserialize in JavaSerializer with try-with-resource block [spark]
LuciferYang commented on code in PR #45238: URL: https://github.com/apache/spark/pull/45238#discussion_r1503688318 ## core/src/main/scala/org/apache/spark/serializer/JavaSerializer.scala: ## @@ -118,22 +118,24 @@ private[spark] class JavaSerializerInstance( override def serialize[T: ClassTag](t: T): ByteBuffer = { val bos = new ByteBufferOutputStream() -val out = serializeStream(bos) -out.writeObject(t) -out.close() +Utils.tryWithResource(serializeStream(bos)) { out => Review Comment: There should be other similar cases, for example: https://github.com/apache/spark/blob/1a408033daf458f1ceebbe14a560355a1a2c0a70/core/src/main/scala/org/apache/spark/scheduler/TaskDescription.scala#L92-L94 https://github.com/apache/spark/blob/1a408033daf458f1ceebbe14a560355a1a2c0a70/sql/core/src/main/scala/org/apache/spark/sql/execution/arrow/ArrowConverters.scala#L473 cc @mridulm -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-47182][BUILD] Exclude `commons-(io|lang3)` transitive dependencies from `commons-compress` and `avro*` [spark]
dongjoon-hyun commented on PR #45278: URL: https://github.com/apache/spark/pull/45278#issuecomment-1965855776 Currently, we are in sync with the latest versions, but the latest version might be different from the Apache Spark's versions. For both Maven and SBT, this PR want to exclude the transitive dependency clearly without depending on the dependency resolution for them, @LuciferYang . -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [MINOR][SS] Minor string representation improvement in AssertOnQuery [spark]
LuciferYang commented on code in PR #45275: URL: https://github.com/apache/spark/pull/45275#discussion_r1503682301 ## sql/core/src/test/scala/org/apache/spark/sql/streaming/StreamTest.scala: ## @@ -284,7 +284,11 @@ trait StreamTest extends QueryTest with SharedSparkSession with TimeLimits with /** Assert that a condition on the active query is true */ class AssertOnQuery(val condition: StreamExecution => Boolean, val message: String) extends StreamAction with StreamMustBeRunning { -override def toString: String = s"AssertOnQuery(, $message)" +override def toString: String = if (message == "") { Review Comment: Is it only for `""`? What if the message is `""`? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-47182][BUILD] Exclude `commons-(io|lang3)` transitive dependencies from `commons-compress` and `avro*` [spark]
LuciferYang commented on PR #45278: URL: https://github.com/apache/spark/pull/45278#issuecomment-1965844769 If the groupId and artifactId are the same, Maven should choose the latest version among multiple versions. What problems could occur if we don't exclude it? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-47179][SQL] Improve error message from SparkThrowableSuite for better debuggability [spark]
itholic commented on code in PR #45273: URL: https://github.com/apache/spark/pull/45273#discussion_r1503675810 ## core/src/test/scala/org/apache/spark/SparkThrowableSuite.scala: ## @@ -328,7 +328,9 @@ class SparkThrowableSuite extends SparkFunSuite { } } else { assert(subErrorDoc.trim == errorsInDoc.trim, - "The error class document is not up to date. Please regenerate it.") + "The error class document is not up to date. Please regenerate it by running " + +"`SPARK_GENERATE_GOLDEN_FILES=1 build/sbt \"core/testOnly *SparkThrowableSuite -- -t " + +"\\\"Error classes match with document\\\"\"`") Review Comment: Sounds good. Will update -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-41811][PYTHON][CONNECT] Implement `SQLStringFormatter` [spark]
zhengruifeng commented on PR #45277: URL: https://github.com/apache/spark/pull/45277#issuecomment-1965829385 @HyukjinKwon @MaxGekk @grundprinzip @hvanhovell would you mind taking a look? thanks! -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[PR] [SPARK-47182][BUILD] Exclude `commons-(io|lang3)` transitive dependencies from `commons-compress` and `avro-*` [spark]
dongjoon-hyun opened a new pull request, #45278: URL: https://github.com/apache/spark/pull/45278 … ### What changes were proposed in this pull request? ### Why are the changes needed? ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? ### Was this patch authored or co-authored using generative AI tooling? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-41811][PYTHON][CONNECT] Implement `SQLStringFormatter` [spark]
zhengruifeng commented on PR #45277: URL: https://github.com/apache/spark/pull/45277#issuecomment-1965815610 I found introducing a CTE message may make it more complex, since a `SQLCommand` is executed eagerly. ``` UnresolvedWith (CTE) - SQLCommand - cteRelations ``` So I switch to adding views fields in existing `SQLCommand` -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[PR] [SPARK-41811][PYTHON][CONNECT] Implement `SQLStringFormatter` [spark]
zhengruifeng opened a new pull request, #45277: URL: https://github.com/apache/spark/pull/45277 ### What changes were proposed in this pull request? Implement SQLStringFormatter for Python Client ### Why are the changes needed? for parity ### Does this PR introduce _any_ user-facing change? yes, new feature ``` In [1]: mydf = spark.range(10) In [2]: spark.sql("SELECT {col} FROM {mydf} WHERE id IN {x}", col=mydf.id, mydf=mydf, x=tuple(range(4))).show() +---+ | id| +---+ | 0| | 1| | 2| | 3| +---+ ``` ### How was this patch tested? enabled doc tests ### Was this patch authored or co-authored using generative AI tooling? no -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-47179][SQL] Improve error message from SparkThrowableSuite for better debuggability [spark]
amaliujia commented on code in PR #45273: URL: https://github.com/apache/spark/pull/45273#discussion_r1503632310 ## core/src/test/scala/org/apache/spark/SparkThrowableSuite.scala: ## @@ -328,7 +328,9 @@ class SparkThrowableSuite extends SparkFunSuite { } } else { assert(subErrorDoc.trim == errorsInDoc.trim, - "The error class document is not up to date. Please regenerate it.") + "The error class document is not up to date. Please regenerate it by running " + +"`SPARK_GENERATE_GOLDEN_FILES=1 build/sbt \"core/testOnly *SparkThrowableSuite -- -t " + +"\\\"Error classes match with document\\\"\"`") Review Comment: E.g. ``` /* Used to regenerate the error class file. Run: {{{ SPARK_GENERATE_GOLDEN_FILES=1 build/sbt \ "core/testOnly *SparkThrowableSuite -- -t \"Error classes are correctly formatted\"" }}} To regenerate the error class document. Run: {{{ SPARK_GENERATE_GOLDEN_FILES=1 build/sbt \ "core/testOnly *SparkThrowableSuite -- -t \"Error classes match with document\"" }}} */ val error_message_instrcution = "Please regenerate it by running " + "`SPARK_GENERATE_GOLDEN_FILES=1 build/sbt \"core/testOnly *SparkThrowableSuite -- -t " + "\\\"Error classes match with document\\\"\" ``` The use the val in the error message. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-47179][SQL] Improve error message from SparkThrowableSuite for better debuggability [spark]
amaliujia commented on code in PR #45273: URL: https://github.com/apache/spark/pull/45273#discussion_r1503632310 ## core/src/test/scala/org/apache/spark/SparkThrowableSuite.scala: ## @@ -328,7 +328,9 @@ class SparkThrowableSuite extends SparkFunSuite { } } else { assert(subErrorDoc.trim == errorsInDoc.trim, - "The error class document is not up to date. Please regenerate it.") + "The error class document is not up to date. Please regenerate it by running " + +"`SPARK_GENERATE_GOLDEN_FILES=1 build/sbt \"core/testOnly *SparkThrowableSuite -- -t " + +"\\\"Error classes match with document\\\"\"`") Review Comment: E.g. ``` /* Used to regenerate the error class file. Run: {{{ SPARK_GENERATE_GOLDEN_FILES=1 build/sbt \ "core/testOnly *SparkThrowableSuite -- -t \"Error classes are correctly formatted\"" }}} To regenerate the error class document. Run: {{{ SPARK_GENERATE_GOLDEN_FILES=1 build/sbt \ "core/testOnly *SparkThrowableSuite -- -t \"Error classes match with document\"" }}} */ val error_message_instrcution = "Please regenerate it by running " + "`SPARK_GENERATE_GOLDEN_FILES=1 build/sbt \"core/testOnly *SparkThrowableSuite -- -t " + "\\\"Error classes match with document\\\"\" ``` -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-47179][SQL] Improve error message from SparkThrowableSuite for better debuggability [spark]
amaliujia commented on code in PR #45273: URL: https://github.com/apache/spark/pull/45273#discussion_r1503632310 ## core/src/test/scala/org/apache/spark/SparkThrowableSuite.scala: ## @@ -328,7 +328,9 @@ class SparkThrowableSuite extends SparkFunSuite { } } else { assert(subErrorDoc.trim == errorsInDoc.trim, - "The error class document is not up to date. Please regenerate it.") + "The error class document is not up to date. Please regenerate it by running " + +"`SPARK_GENERATE_GOLDEN_FILES=1 build/sbt \"core/testOnly *SparkThrowableSuite -- -t " + +"\\\"Error classes match with document\\\"\"`") Review Comment: E.g. ``` /* Used to regenerate the error class file. Run: {{{ SPARK_GENERATE_GOLDEN_FILES=1 build/sbt \ "core/testOnly *SparkThrowableSuite -- -t \"Error classes are correctly formatted\"" }}} To regenerate the error class document. Run: {{{ SPARK_GENERATE_GOLDEN_FILES=1 build/sbt \ "core/testOnly *SparkThrowableSuite -- -t \"Error classes match with document\"" }}} */ val error_message_instrcution = Please regenerate it by running " + "`SPARK_GENERATE_GOLDEN_FILES=1 build/sbt \"core/testOnly *SparkThrowableSuite -- -t " + "\\\"Error classes match with document\\\"\" ``` -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-47179][SQL] Improve error message from SparkThrowableSuite for better debuggability [spark]
amaliujia commented on code in PR #45273: URL: https://github.com/apache/spark/pull/45273#discussion_r1503631655 ## core/src/test/scala/org/apache/spark/SparkThrowableSuite.scala: ## @@ -328,7 +328,9 @@ class SparkThrowableSuite extends SparkFunSuite { } } else { assert(subErrorDoc.trim == errorsInDoc.trim, - "The error class document is not up to date. Please regenerate it.") + "The error class document is not up to date. Please regenerate it by running " + +"`SPARK_GENERATE_GOLDEN_FILES=1 build/sbt \"core/testOnly *SparkThrowableSuite -- -t " + +"\\\"Error classes match with document\\\"\"`") Review Comment: Can we use a variable of string to replace this hardcoded instruction and put that val at the begging of this class and under the comment of how to re-generate the golden file? The reason it is better because it is possible that the instruction of how to re-gen is updated but the developer fail to update the instruction in the such messages. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [MINOR][SQL] Tweak column error names and text [spark]
amaliujia commented on PR #45276: URL: https://github.com/apache/spark/pull/45276#issuecomment-1965775599 cc @MaxGekk -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-47144][CONNECT][SQL] Fix Spark Connect collation error by adding collateId protobuf field [spark]
amaliujia commented on code in PR #45233: URL: https://github.com/apache/spark/pull/45233#discussion_r1503625723 ## connector/connect/server/src/test/scala/org/apache/spark/sql/connect/service/SparkConnectServiceE2ESuite.scala: ## @@ -202,4 +202,17 @@ class SparkConnectServiceE2ESuite extends SparkConnectServerTest { } } } + + test("SPARK-47144: Verify collations work with Connect client") { +withClient { client => + Seq("UCS_BASIC", "UCS_BASIC_LCASE", "UNICODE", "UNICODE_CI").foreach(collationName => { +val query1 = client.execute(buildPlan(s"select 'abc' collate '$collationName'")) +query1.hasNext // guarantees the request was received by server. + +val query2 = client.execute(buildPlan(s"select collation('abc' collate '$collationName')")) +query2.hasNext // guarantees the request was received by server. + }) + client.releaseSession() Review Comment: Looks like the purpose of your test here is to make sure the request is passing through without an error. You do not need this test. What you need is to add a test in `PlanGenerationTestSuite` as your purpose is to verify the client generated a correct plan. You can check the class doc of `PlanGenerationTestSuite` to understand how to add such a test. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-47144][CONNECT][SQL] Fix Spark Connect collation error by adding collateId protobuf field [spark]
amaliujia commented on code in PR #45233: URL: https://github.com/apache/spark/pull/45233#discussion_r1503622539 ## connector/connect/common/src/main/protobuf/spark/connect/types.proto: ## @@ -100,6 +100,7 @@ message DataType { message String { uint32 type_variation_reference = 1; +optional uint32 collation_id = 2; Review Comment: The String type should always have a collation_id so this field should be non optional? The default value is 0 which is the default value for the proto field if that is not set. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-47144][CONNECT][SQL] Fix Spark Connect collation error by adding collateId protobuf field [spark]
amaliujia commented on code in PR #45233: URL: https://github.com/apache/spark/pull/45233#discussion_r1503622112 ## connector/connect/common/src/main/protobuf/spark/connect/types.proto: ## @@ -100,6 +100,7 @@ message DataType { message String { uint32 type_variation_reference = 1; +optional uint32 collation_id = 2; Review Comment: The String type should always have a collation_id so this field should be non `optional`? The default value is 0. ## connector/connect/common/src/main/protobuf/spark/connect/types.proto: ## @@ -100,6 +100,7 @@ message DataType { message String { uint32 type_variation_reference = 1; +optional uint32 collation_id = 2; Review Comment: The String type should always have a collation_id so this field should be non optional? The default value is 0. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-47144][CONNECT][SQL] Fix Spark Connect collation error by adding collateId protobuf field [spark]
amaliujia commented on code in PR #45233: URL: https://github.com/apache/spark/pull/45233#discussion_r1503622112 ## connector/connect/common/src/main/protobuf/spark/connect/types.proto: ## @@ -100,6 +100,7 @@ message DataType { message String { uint32 type_variation_reference = 1; +optional uint32 collation_id = 2; Review Comment: The String type should always have a collation_id so this field might be non `optional`? The default value is 0. ## connector/connect/common/src/main/protobuf/spark/connect/types.proto: ## @@ -100,6 +100,7 @@ message DataType { message String { uint32 type_variation_reference = 1; +optional uint32 collation_id = 2; Review Comment: The String type should always have a collation_id so this field should be non `optional`? The default value is 0. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-47181][CORE][TESTS] Fix `MasterSuite` to validate the number of registered workers [spark]
dongjoon-hyun closed pull request #45274: [SPARK-47181][CORE][TESTS] Fix `MasterSuite` to validate the number of registered workers URL: https://github.com/apache/spark/pull/45274 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-47181][CORE][TESTS] Fix `MasterSuite` to validate the number of registered workers [spark]
dongjoon-hyun commented on PR #45274: URL: https://github.com/apache/spark/pull/45274#issuecomment-1965764570 Thank you, @yaooqinn , @HyukjinKwon , @viirya . Merged to master. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-45527][CORE] Use fraction to do the resource calculation [spark]
srowen commented on PR #44690: URL: https://github.com/apache/spark/pull/44690#issuecomment-1965751067 Wait, this was merged to master? Let's not merge to 3.5, no, because i don't think this actually fixes the problem (see the thread here - I didn't grok that this was a backport) @tgravescs not sure about this one in master -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-45527][CORE] Use fraction to do the resource calculation [spark]
HeartSaVioR commented on PR #44690: URL: https://github.com/apache/spark/pull/44690#issuecomment-1965745567 @srowen No, this was merged to master via https://github.com/apache/spark/pull/43494. I'm just speculating as the change does not look to be trivial enough to worth taking the risk on adopting this change in maintenance version. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-45527][CORE] Use fraction to do the resource calculation [spark]
srowen closed pull request #44690: [SPARK-45527][CORE] Use fraction to do the resource calculation URL: https://github.com/apache/spark/pull/44690 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-45527][CORE] Use fraction to do the resource calculation [spark]
srowen commented on PR #44690: URL: https://github.com/apache/spark/pull/44690#issuecomment-1965741499 No, he's asking why it's open vs the 3.5 branch not master. It is not going into master and so is not to be back ported either. The main reason is that this is not a fix. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [MINOR][SQL] Tweak column error names and text [spark]
nchammas commented on code in PR #45276: URL: https://github.com/apache/spark/pull/45276#discussion_r1503601045 ## common/utils/src/main/resources/error/error-classes.json: ## @@ -492,15 +492,15 @@ }, "sqlState" : "54000" }, - "COLUMN_ALIASES_IS_NOT_ALLOWED" : { + "COLUMN_ALIASES_NOT_ALLOWED" : { Review Comment: I considered naming this `COLUMN_ALIASES_ARE_NOT_ALLOWED`, but since these are slugs they don't really need proper grammar. This change at least corrects the incorrect combination of the plural "aliases" with the singular "is". -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[PR] [MINOR][SQL] Tweak column error names and text [spark]
nchammas opened a new pull request, #45276: URL: https://github.com/apache/spark/pull/45276 ### What changes were proposed in this pull request? Tweak the names and text for a few errors so they read more naturally (and correctly). ### Why are the changes needed? Just minor English improvements. ### Does this PR introduce _any_ user-facing change? Yes, these are user-facing error messages. ### How was this patch tested? No testing apart from CI. ### Was this patch authored or co-authored using generative AI tooling? No. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-47178][PYTHON][TESTS] Add a test case for createDataFrame with dataclasses [spark]
HyukjinKwon closed pull request #45271: [SPARK-47178][PYTHON][TESTS] Add a test case for createDataFrame with dataclasses URL: https://github.com/apache/spark/pull/45271 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-47178][PYTHON][TESTS] Add a test case for createDataFrame with dataclasses [spark]
HyukjinKwon commented on PR #45271: URL: https://github.com/apache/spark/pull/45271#issuecomment-1965738440 Merged to master. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [MINOR][SS] Minor string representation improvement in AssertOnQuery [spark]
HyukjinKwon commented on PR #45275: URL: https://github.com/apache/spark/pull/45275#issuecomment-1965738139 cc @HeartSaVioR FYI. I found this out while debugging sth :-) -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[PR] [MINOR][SS] Minor string representation improvement in AssertOnQuery [spark]
HyukjinKwon opened a new pull request, #45275: URL: https://github.com/apache/spark/pull/45275 ### What changes were proposed in this pull request? This PR changes `AssertOnQuery(, )` to `AssertOnQuery()` when the message is empty. ### Why are the changes needed? Just to make it a little bit more prettier and readale. ### Does this PR introduce _any_ user-facing change? No, dev-inly. ### How was this patch tested? Manually. ### Was this patch authored or co-authored using generative AI tooling? No. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-45527][CORE] Use fraction to do the resource calculation [spark]
wbo4958 commented on PR #44690: URL: https://github.com/apache/spark/pull/44690#issuecomment-1965726911 Hi @HeartSaVioR, In essence, it's actually a genuine bug in Spark rather than a feature request. If we attempt to restrict the execution of only one task per executor by modifying the GPU resources, it won't work as expected. This scenario is quite common in ETL (Extract, Transform, Load) + ML (Machine Learning) workflows. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-46881][CORE] Support `spark.deploy.workerSelectionPolicy` [spark]
dongjoon-hyun commented on code in PR #44906: URL: https://github.com/apache/spark/pull/44906#discussion_r1503590785 ## core/src/test/scala/org/apache/spark/deploy/master/MasterSuite.scala: ## @@ -661,6 +662,52 @@ class MasterSuite extends SparkFunSuite scheduleExecutorsForAppWithMultiRPs(withMaxCores = true) } + + private val workerSelectionPolicyTestCases = Seq( +(CORES_FREE_ASC, true, List("10001", "10002")), +(CORES_FREE_ASC, false, List("10001")), +(CORES_FREE_DESC, true, List("10004", "10005")), +(CORES_FREE_DESC, false, List("10005")), +(MEMORY_FREE_ASC, true, List("10001", "10005")), +(MEMORY_FREE_ASC, false, List("10001")), +(MEMORY_FREE_DESC, true, List("10002", "10003")), +(MEMORY_FREE_DESC, false, Seq("10002")), +(WORKER_ID, true, Seq("10001", "10002")), +(WORKER_ID, false, Seq("10001"))) + + workerSelectionPolicyTestCases.foreach { case (policy, spreadOut, expected) => +test(s"SPARK-46881: scheduling with workerSelectionPolicy - $policy ($spreadOut)") { Review Comment: To @HyukjinKwon , here is the PR. - #45274 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-46881][CORE] Support `spark.deploy.workerSelectionPolicy` [spark]
dongjoon-hyun commented on code in PR #44906: URL: https://github.com/apache/spark/pull/44906#discussion_r1503590785 ## core/src/test/scala/org/apache/spark/deploy/master/MasterSuite.scala: ## @@ -661,6 +662,52 @@ class MasterSuite extends SparkFunSuite scheduleExecutorsForAppWithMultiRPs(withMaxCores = true) } + + private val workerSelectionPolicyTestCases = Seq( +(CORES_FREE_ASC, true, List("10001", "10002")), +(CORES_FREE_ASC, false, List("10001")), +(CORES_FREE_DESC, true, List("10004", "10005")), +(CORES_FREE_DESC, false, List("10005")), +(MEMORY_FREE_ASC, true, List("10001", "10005")), +(MEMORY_FREE_ASC, false, List("10001")), +(MEMORY_FREE_DESC, true, List("10002", "10003")), +(MEMORY_FREE_DESC, false, Seq("10002")), +(WORKER_ID, true, Seq("10001", "10002")), +(WORKER_ID, false, Seq("10001"))) + + workerSelectionPolicyTestCases.foreach { case (policy, spreadOut, expected) => +test(s"SPARK-46881: scheduling with workerSelectionPolicy - $policy ($spreadOut)") { Review Comment: To @HyukjinKwon , here is the follow-up. - #45274 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-47181][CORE][TESTS] Fix `MasterSuite` to validate the number of registered workers [spark]
dongjoon-hyun commented on PR #45274: URL: https://github.com/apache/spark/pull/45274#issuecomment-1965726176 cc @HyukjinKwon -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[PR] [SPARK-47181][CORE][TESTS] Fix `MasterSuite` to validate the number of registered workers [spark]
dongjoon-hyun opened a new pull request, #45274: URL: https://github.com/apache/spark/pull/45274 … ### What changes were proposed in this pull request? ### Why are the changes needed? ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? ### Was this patch authored or co-authored using generative AI tooling? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-46862][SQL] Disable CSV column pruning in the multi-line mode [spark]
nchammas commented on PR #44872: URL: https://github.com/apache/spark/pull/44872#issuecomment-1965720034 I've filed SPARK-47180 to track potentially migrating off of Univocity to something else. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-47166][INFRA] Improves the HINTs of merge_spark_pr.py [spark]
yaooqinn commented on PR #45256: URL: https://github.com/apache/spark/pull/45256#issuecomment-1965706216 Thank you @dongjoon-hyun, merged to master -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-47166][INFRA] Improves the HINTs of merge_spark_pr.py [spark]
yaooqinn closed pull request #45256: [SPARK-47166][INFRA] Improves the HINTs of merge_spark_pr.py URL: https://github.com/apache/spark/pull/45256 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[PR] [SPARK-47179][SQL] Improve error message from SparkThrowableSuite for better debuggability [spark]
itholic opened a new pull request, #45273: URL: https://github.com/apache/spark/pull/45273 ### What changes were proposed in this pull request? This PR proposes to improve error message from SparkThrowableSuite for better debuggability ### Why are the changes needed? The current error message is not very actionable for developer who need regenerating the error class documentation. ### Does this PR introduce _any_ user-facing change? No API change, but the error message is changed: **Before** ``` The error class document is not up to date. Please regenerate it. ``` **After** ``` he error class document is not up to date. Please regenerate it by running `SPARK_GENERATE_GOLDEN_FILES=1 build/sbt "core/testOnly *SparkThrowableSuite -- -t \"Error classes match with document\""` ``` ### How was this patch tested? The existing CI should pass. ### Was this patch authored or co-authored using generative AI tooling? No. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-45527][CORE][TESTS][FOLLOW-UP] Reduce the number of test cases in fraction resource calculation [spark]
dongjoon-hyun closed pull request #45272: [SPARK-45527][CORE][TESTS][FOLLOW-UP] Reduce the number of test cases in fraction resource calculation URL: https://github.com/apache/spark/pull/45272 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-45527][CORE][TESTS][FOLLOW-UP] Reduce the number of test cases in fraction resource calculation [spark]
dongjoon-hyun commented on PR #45272: URL: https://github.com/apache/spark/pull/45272#issuecomment-1965695504 ``` [info] TaskSchedulerImplSuite: [info] - SPARK-32653: Decommissioned host/executor should be considered as inactive (409 milliseconds) [info] - Scheduler does not always schedule tasks on the same workers (295 milliseconds) [info] - Scheduler correctly accounts for multiple CPUs per task (26 milliseconds) [info] - SPARK-18886 - partial offers (isAllFreeResources = false) reset timer before any resources have been rejected (27 milliseconds) [info] - SPARK-18886 - delay scheduling timer is reset when it accepts all resources offered when isAllFreeResources = true (22 milliseconds) [info] - SPARK-18886 - task set with no locality requirements should not starve one with them (21 milliseconds) [info] - SPARK-18886 - partial resource offers (isAllFreeResources = false) reset time if last full resource offer (isAllResources = true) was accepted as well as any following partial resource offers (20 milliseconds) [info] - SPARK-18886 - partial resource offers (isAllFreeResources = false) do not reset time if any offer was rejected since last full offer was fully accepted (19 milliseconds) [info] - Scheduler does not crash when tasks are not serializable (23 milliseconds) [info] - concurrent attempts for the same stage only have one active taskset (21 milliseconds) [info] - don't schedule more tasks after a taskset is zombie (21 milliseconds) [info] - if a zombie attempt finishes, continue scheduling tasks for non-zombie attempts (20 milliseconds) [info] - tasks are not re-scheduled while executor loss reason is pending (24 milliseconds) OpenJDK 64-Bit Server VM warning: Sharing is only supported for boot loader classes because bootstrap classpath has been appended [info] - scheduled tasks obey task and stage excludelist (731 milliseconds) [info] - scheduled tasks obey node and executor excludelists (49 milliseconds) [info] - abort stage when all executors are excluded and we cannot acquire new executor (38 milliseconds) [info] - SPARK-22148 abort timer should kick in when task is completely excluded & no new executor can be acquired (32 milliseconds) [info] - SPARK-22148 try to acquire a new executor when task is unschedulable with 1 executor (29 milliseconds) [info] - SPARK-22148 abort timer should clear unschedulableTaskSetToExpiryTime for all TaskSets (45 milliseconds) [info] - SPARK-22148 Ensure we don't abort the taskSet if we haven't been completely excluded (31 milliseconds) [info] - SPARK-31418 abort timer should kick in when task is completely excluded manager could not acquire a new executor before the timeout (24 milliseconds) [info] - Excluded node for entire task set prevents per-task exclusion checks: iteration 0 (50 milliseconds) [info] - Excluded node for entire task set prevents per-task exclusion checks: iteration 1 (37 milliseconds) [info] - Excluded node for entire task set prevents per-task exclusion checks: iteration 2 (36 milliseconds) [info] - Excluded node for entire task set prevents per-task exclusion checks: iteration 3 (37 milliseconds) [info] - Excluded node for entire task set prevents per-task exclusion checks: iteration 4 (35 milliseconds) [info] - Excluded node for entire task set prevents per-task exclusion checks: iteration 5 (33 milliseconds) [info] - Excluded node for entire task set prevents per-task exclusion checks: iteration 6 (44 milliseconds) [info] - Excluded node for entire task set prevents per-task exclusion checks: iteration 7 (33 milliseconds) [info] - Excluded node for entire task set prevents per-task exclusion checks: iteration 8 (33 milliseconds) [info] - Excluded node for entire task set prevents per-task exclusion checks: iteration 9 (31 milliseconds) [info] - Excluded executor for entire task set prevents per-task exclusion checks: iteration 0 (34 milliseconds) [info] - Excluded executor for entire task set prevents per-task exclusion checks: iteration 1 (30 milliseconds) [info] - Excluded executor for entire task set prevents per-task exclusion checks: iteration 2 (31 milliseconds) [info] - Excluded executor for entire task set prevents per-task exclusion checks: iteration 3 (29 milliseconds) [info] - Excluded executor for entire task set prevents per-task exclusion checks: iteration 4 (31 milliseconds) [info] - Excluded executor for entire task set prevents per-task exclusion checks: iteration 5 (30 milliseconds) [info] - Excluded executor for entire task set prevents per-task exclusion checks: iteration 6 (29 milliseconds) [info] - Excluded executor for entire task set prevents per-task exclusion checks: iteration 7 (27 milliseconds) [info] - Excluded executor for entire task set prevents per-task exclusion checks: iteration 8 (39 milliseconds) [info] - Excluded
Re: [PR] [SPARK-45527][CORE][TESTS][FOLLOW-UP] Reduce the number of test cases in fraction resource calculation [spark]
HyukjinKwon commented on PR #45272: URL: https://github.com/apache/spark/pull/45272#issuecomment-1965693056 @dongjoon-hyun I am sorry. This PR fixes the leftover. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[PR] [SPARK-45527][CORE][TESTS][FOLLOW-UP] Reduce the number of test cases in fraction resource calculation [spark]
HyukjinKwon opened a new pull request, #45272: URL: https://github.com/apache/spark/pull/45272 ### What changes were proposed in this pull request? There are two more instances to fix in https://github.com/apache/spark/pull/45268 mistakenly missed. This PR fixes both. ### Why are the changes needed? See https://github.com/apache/spark/pull/45268 ### Does this PR introduce _any_ user-facing change? No, test-only. ### How was this patch tested? Manually ### Was this patch authored or co-authored using generative AI tooling? No. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-45527][CORE][TESTS][FOLLOW-UP] Reduce the number of test cases in fraction resource calculation [spark]
HyukjinKwon commented on PR #45268: URL: https://github.com/apache/spark/pull/45268#issuecomment-1965690534 ahh .. I am sorry for rushing it. I will make a followup. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-46881][CORE] Support `spark.deploy.workerSelectionPolicy` [spark]
dongjoon-hyun commented on code in PR #44906: URL: https://github.com/apache/spark/pull/44906#discussion_r1503563961 ## core/src/test/scala/org/apache/spark/deploy/master/MasterSuite.scala: ## @@ -661,6 +662,52 @@ class MasterSuite extends SparkFunSuite scheduleExecutorsForAppWithMultiRPs(withMaxCores = true) } + + private val workerSelectionPolicyTestCases = Seq( +(CORES_FREE_ASC, true, List("10001", "10002")), +(CORES_FREE_ASC, false, List("10001")), +(CORES_FREE_DESC, true, List("10004", "10005")), +(CORES_FREE_DESC, false, List("10005")), +(MEMORY_FREE_ASC, true, List("10001", "10005")), +(MEMORY_FREE_ASC, false, List("10001")), +(MEMORY_FREE_DESC, true, List("10002", "10003")), +(MEMORY_FREE_DESC, false, Seq("10002")), +(WORKER_ID, true, Seq("10001", "10002")), +(WORKER_ID, false, Seq("10001"))) + + workerSelectionPolicyTestCases.foreach { case (policy, spreadOut, expected) => +test(s"SPARK-46881: scheduling with workerSelectionPolicy - $policy ($spreadOut)") { Review Comment: Thanks. Ya, let me take a look at. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-47158][SQL] Assign proper name and `sqlState` to `_LEGACY_ERROR_TEMP_(2134|2231)` [spark]
itholic commented on PR #45244: URL: https://github.com/apache/spark/pull/45244#issuecomment-1965685388 Thanks, @MaxGekk for the review! Just adjusted comments -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-47094][SQL] SPJ : Dynamically rebalance number of buckets when they are not equal [spark]
himadripal commented on PR #45267: URL: https://github.com/apache/spark/pull/45267#issuecomment-1965681861 was trying this test case, `val partition1 = Array(Expressions.years("ts"), bucket(2, "id")) val partition2 = Array(Expressions.days("ts"), bucket(4, "id"))` this throws an exception when paritiallyClustered = false and allowCompatibleTransform=true. exception : `class java.lang.Integer cannot be cast to class java.lang.Long (java.lang.Integer and java.lang.Long are in module java.base of loader 'bootstrap') java.lang.ClassCastException: class java.lang.Integer cannot be cast to class java.lang.Long (java.lang.Integer and java.lang.Long are in module java.base of loader 'bootstrap') at scala.runtime.BoxesRunTime.unboxToLong(BoxesRunTime.java:103) at org.apache.spark.sql.catalyst.expressions.BaseGenericInternalRow.getLong(rows.scala:42)` -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-45527][CORE][TESTS][FOLLOW-UP] Reduce the number of test cases in fraction resource calculation [spark]
dongjoon-hyun commented on PR #45268: URL: https://github.com/apache/spark/pull/45268#issuecomment-1965681672 Just for the record, after this commit, the following is the result on `master` branch. It reduced much but we still have more. ``` [info] TaskSchedulerImplSuite: [info] - SPARK-32653: Decommissioned host/executor should be considered as inactive (414 milliseconds) [info] - Scheduler does not always schedule tasks on the same workers (304 milliseconds) [info] - Scheduler correctly accounts for multiple CPUs per task (23 milliseconds) [info] - SPARK-18886 - partial offers (isAllFreeResources = false) reset timer before any resources have been rejected (27 milliseconds) [info] - SPARK-18886 - delay scheduling timer is reset when it accepts all resources offered when isAllFreeResources = true (21 milliseconds) [info] - SPARK-18886 - task set with no locality requirements should not starve one with them (21 milliseconds) [info] - SPARK-18886 - partial resource offers (isAllFreeResources = false) reset time if last full resource offer (isAllResources = true) was accepted as well as any following partial resource offers (21 milliseconds) [info] - SPARK-18886 - partial resource offers (isAllFreeResources = false) do not reset time if any offer was rejected since last full offer was fully accepted (20 milliseconds) [info] - Scheduler does not crash when tasks are not serializable (22 milliseconds) [info] - concurrent attempts for the same stage only have one active taskset (20 milliseconds) [info] - don't schedule more tasks after a taskset is zombie (21 milliseconds) [info] - if a zombie attempt finishes, continue scheduling tasks for non-zombie attempts (20 milliseconds) [info] - tasks are not re-scheduled while executor loss reason is pending (24 milliseconds) OpenJDK 64-Bit Server VM warning: Sharing is only supported for boot loader classes because bootstrap classpath has been appended [info] - scheduled tasks obey task and stage excludelist (726 milliseconds) [info] - scheduled tasks obey node and executor excludelists (47 milliseconds) [info] - abort stage when all executors are excluded and we cannot acquire new executor (37 milliseconds) [info] - SPARK-22148 abort timer should kick in when task is completely excluded & no new executor can be acquired (30 milliseconds) [info] - SPARK-22148 try to acquire a new executor when task is unschedulable with 1 executor (30 milliseconds) [info] - SPARK-22148 abort timer should clear unschedulableTaskSetToExpiryTime for all TaskSets (48 milliseconds) [info] - SPARK-22148 Ensure we don't abort the taskSet if we haven't been completely excluded (38 milliseconds) [info] - SPARK-31418 abort timer should kick in when task is completely excluded manager could not acquire a new executor before the timeout (25 milliseconds) [info] - Excluded node for entire task set prevents per-task exclusion checks: iteration 0 (42 milliseconds) [info] - Excluded node for entire task set prevents per-task exclusion checks: iteration 1 (36 milliseconds) [info] - Excluded node for entire task set prevents per-task exclusion checks: iteration 2 (35 milliseconds) [info] - Excluded node for entire task set prevents per-task exclusion checks: iteration 3 (38 milliseconds) [info] - Excluded node for entire task set prevents per-task exclusion checks: iteration 4 (35 milliseconds) [info] - Excluded node for entire task set prevents per-task exclusion checks: iteration 5 (33 milliseconds) [info] - Excluded node for entire task set prevents per-task exclusion checks: iteration 6 (43 milliseconds) [info] - Excluded node for entire task set prevents per-task exclusion checks: iteration 7 (34 milliseconds) [info] - Excluded node for entire task set prevents per-task exclusion checks: iteration 8 (33 milliseconds) [info] - Excluded node for entire task set prevents per-task exclusion checks: iteration 9 (31 milliseconds) [info] - Excluded executor for entire task set prevents per-task exclusion checks: iteration 0 (33 milliseconds) [info] - Excluded executor for entire task set prevents per-task exclusion checks: iteration 1 (31 milliseconds) [info] - Excluded executor for entire task set prevents per-task exclusion checks: iteration 2 (29 milliseconds) [info] - Excluded executor for entire task set prevents per-task exclusion checks: iteration 3 (28 milliseconds) [info] - Excluded executor for entire task set prevents per-task exclusion checks: iteration 4 (30 milliseconds) [info] - Excluded executor for entire task set prevents per-task exclusion checks: iteration 5 (29 milliseconds) [info] - Excluded executor for entire task set prevents per-task exclusion checks: iteration 6 (29 milliseconds) [info] - Excluded executor for entire task set prevents per-task exclusion checks: iteration 7 (30 milliseconds) [info]
Re: [PR] [SPARK-47178][PYTHON][TESTS] Add a test case for createDataFrame with dataclasses [spark]
HyukjinKwon commented on code in PR #45271: URL: https://github.com/apache/spark/pull/45271#discussion_r1503557832 ## python/pyspark/sql/session.py: ## @@ -1299,7 +1299,7 @@ def createDataFrame( # type: ignore[misc] -- data : :class:`RDD` or iterable an RDD of any kind of SQL data representation (:class:`Row`, -:class:`tuple`, ``int``, ``boolean``, etc.), or :class:`list`, +:class:`tuple`, ``int``, ``boolean``, ``dict`` etc.), or :class:`list`, Review Comment: ```suggestion :class:`tuple`, ``int``, ``boolean``, ``dict``, etc.), or :class:`list`, ``` -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-47178][PYTHON][TESTS] Add a test case for createDataFrame with dataclasses [spark]
HyukjinKwon commented on code in PR #45271: URL: https://github.com/apache/spark/pull/45271#discussion_r1503557591 ## python/pyspark/sql/session.py: ## @@ -1299,7 +1299,7 @@ def createDataFrame( # type: ignore[misc] -- data : :class:`RDD` or iterable an RDD of any kind of SQL data representation (:class:`Row`, -:class:`tuple`, ``int``, ``boolean``, etc.), or :class:`list`, +:class:`tuple`, ``int``, ``boolean``, ``dict``` etc.), or :class:`list`, Review Comment: ```suggestion :class:`tuple`, ``int``, ``boolean``, ``dict`` etc.), or :class:`list`, ``` ## python/pyspark/sql/session.py: ## @@ -1299,7 +1299,7 @@ def createDataFrame( # type: ignore[misc] -- data : :class:`RDD` or iterable an RDD of any kind of SQL data representation (:class:`Row`, -:class:`tuple`, ``int``, ``boolean``, etc.), or :class:`list`, +:class:`tuple`, ``int``, ``boolean``, ``dict`` etc.), or :class:`list`, Review Comment: ```suggestion :class:`tuple`, ``int``, ``boolean``, ``dict``, etc.), or :class:`list`, ``` -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-47178][PYTHON][TESTS] Add a test case for createDataFrame with dataclasses [spark]
HyukjinKwon commented on code in PR #45271: URL: https://github.com/apache/spark/pull/45271#discussion_r1503557591 ## python/pyspark/sql/session.py: ## @@ -1299,7 +1299,7 @@ def createDataFrame( # type: ignore[misc] -- data : :class:`RDD` or iterable an RDD of any kind of SQL data representation (:class:`Row`, -:class:`tuple`, ``int``, ``boolean``, etc.), or :class:`list`, +:class:`tuple`, ``int``, ``boolean``, ``dict``` etc.), or :class:`list`, Review Comment: ```suggestion :class:`tuple`, ``int``, ``boolean``, ``dict`` etc.), or :class:`list`, ``` -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-45527][CORE][TESTS][FOLLOW-UP] Reduce the number of test cases in fraction resource calculation [spark]
dongjoon-hyun commented on PR #45268: URL: https://github.com/apache/spark/pull/45268#issuecomment-1965679830 Ur, this seems to be a partial fix. Could you reduce these together, @HyukjinKwon ? ``` (1 to 20).foreach { taskNum => val gpuTaskAmount = ResourceAmountUtils.toFractionalResource(ONE_ENTIRE_RESOURCE / taskNum) test(s"SPARK-45527 TaskResourceProfile with task.gpu.amount=${gpuTaskAmount} can " + ``` -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-45527][CORE][TESTS][FOLLOW-UP] Reduce the number of test cases in fraction resource calculation [spark]
HyukjinKwon commented on PR #45268: URL: https://github.com/apache/spark/pull/45268#issuecomment-1965679171 I manually ran the tests, and linter to make sure. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-45527][CORE][TESTS][FOLLOW-UP] Reduce the number of test cases in fraction resource calculation [spark]
HyukjinKwon closed pull request #45268: [SPARK-45527][CORE][TESTS][FOLLOW-UP] Reduce the number of test cases in fraction resource calculation URL: https://github.com/apache/spark/pull/45268 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-45527][CORE][TESTS][FOLLOW-UP] Reduce the number of test cases in fraction resource calculation [spark]
HyukjinKwon commented on PR #45268: URL: https://github.com/apache/spark/pull/45268#issuecomment-1965678623 Merged to master. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[PR] [SPARK-47178][PYTHON][TESTS]Add a test case for createDataFrame with dataclasses [spark]
HyukjinKwon opened a new pull request, #45271: URL: https://github.com/apache/spark/pull/45271 ### What changes were proposed in this pull request? This PR adds a test case for dataclasses with createDataFrame that is already supported: ```python from dataclasses import dataclass @dataclass class User: name: str age: int is_active: bool user = User(name="John", age=30, is_active=True) spark.createDataFrame([user]).show() ``` ### Why are the changes needed? To make sure dataclasses work with PySpark. ### Does this PR introduce _any_ user-facing change? No, test-only ### How was this patch tested? Added test case. ### Was this patch authored or co-authored using generative AI tooling? No. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-47166][INFRA] Improves the HINTs of merge_spark_pr.py [spark]
dongjoon-hyun commented on PR #45256: URL: https://github.com/apache/spark/pull/45256#issuecomment-1965673466 Thank you, @yaooqinn . -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-47158][SQL] Assign proper name and `sqlState` to `_LEGACY_ERROR_TEMP_(2134|2231)` [spark]
itholic commented on code in PR #45244: URL: https://github.com/apache/spark/pull/45244#discussion_r1503550776 ## common/utils/src/main/resources/error/error-classes.json: ## @@ -298,6 +298,12 @@ ], "sqlState" : "22018" }, + "CANNOT_PARSE_STRING_AS_DATATYPE" : { Review Comment: According to https://github.com/apache/spark/pull/45244/files#r1503550041, let me just turn this into an internal error. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-47175][SS][TESTS] Remove ZOOKEEPER-1844 comment from `KafkaTestUtils` [spark]
dongjoon-hyun commented on PR #45265: URL: https://github.com/apache/spark/pull/45265#issuecomment-1965673018 Thank you, @HyukjinKwon and @yaooqinn ! -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-47158][SQL] Assign proper name and `sqlState` to `_LEGACY_ERROR_TEMP_(2134|2231)` [spark]
itholic commented on code in PR #45244: URL: https://github.com/apache/spark/pull/45244#discussion_r1503550041 ## sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/util/TimestampFormatterSuite.scala: ## @@ -502,4 +504,25 @@ class TimestampFormatterSuite extends DatetimeFormatterSuite { assert(formatter.parseOptional("-12-31 23:59:59.999").isEmpty) assert(formatter.parseWithoutTimeZoneOptional("-12-31 23:59:59.999", true).isEmpty) } + + test("fail to parse string as TimestampNTZ with invalid format") { +val zoneId = ZoneId.systemDefault() +val locale = Locale.getDefault() +val formatter = new DefaultTimestampFormatter( + zoneId, locale, LENIENT_SIMPLE_DATE_FORMAT, isParsing = true) + +val invalidTimestampStr = "2021-13-01T25:61:61" + +checkError( + exception = intercept[SparkRuntimeException] { +formatter.parseWithoutTimeZone(invalidTimestampStr, allowTimeZone = false) Review Comment: I just realized the comment from the source code below: ``` * The formatter for timestamps which doesn't require users to specify a pattern. While formatting, * it uses the default pattern [[TimestampFormatter.defaultPattern()]]. In parsing, it follows * the CAST logic in conversion of strings to Catalyst's TimestampType. ``` so I suspect that this cannot be reproduced from user space, so we should mark it as an internal error. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-47158][SQL] Assign proper name and `sqlState` to `_LEGACY_ERROR_TEMP_(2134|2231)` [spark]
itholic commented on code in PR #45244: URL: https://github.com/apache/spark/pull/45244#discussion_r1503550041 ## sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/util/TimestampFormatterSuite.scala: ## @@ -502,4 +504,25 @@ class TimestampFormatterSuite extends DatetimeFormatterSuite { assert(formatter.parseOptional("-12-31 23:59:59.999").isEmpty) assert(formatter.parseWithoutTimeZoneOptional("-12-31 23:59:59.999", true).isEmpty) } + + test("fail to parse string as TimestampNTZ with invalid format") { +val zoneId = ZoneId.systemDefault() +val locale = Locale.getDefault() +val formatter = new DefaultTimestampFormatter( + zoneId, locale, LENIENT_SIMPLE_DATE_FORMAT, isParsing = true) + +val invalidTimestampStr = "2021-13-01T25:61:61" + +checkError( + exception = intercept[SparkRuntimeException] { +formatter.parseWithoutTimeZone(invalidTimestampStr, allowTimeZone = false) Review Comment: I just realized the comment from the source code below: ``` * The formatter for timestamps which doesn't require users to specify a pattern. While formatting, * it uses the default pattern [[TimestampFormatter.defaultPattern()]]. In parsing, it follows * the CAST logic in conversion of strings to Catalyst's TimestampType. ``` so I believe this cannot be reproduced from user space, so we should mark it as an internal error. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-47158][SQL] Assign proper name and `sqlState` to `_LEGACY_ERROR_TEMP_(2134|2231)` [spark]
itholic commented on code in PR #45244: URL: https://github.com/apache/spark/pull/45244#discussion_r1503547976 ## common/utils/src/main/resources/error/error-classes.json: ## @@ -1272,6 +1278,12 @@ ], "sqlState" : "42710" }, + "FIELD_INDEX_ON_ROW_WITHOUT_SCHEMA" : { +"message" : [ + "fieldIndex on a Row without schema is undefined." +], +"sqlState" : "42000" Review Comment: Sounds good. I think we can use `42601` - A character, token, or clause is invalid or missing - -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-47101][SQL] Allow comma to be used in top-level column names and remove check nested type definition in `HiveExternalCatalog.verifyDataSchema` [spark]
yaooqinn commented on code in PR #45180: URL: https://github.com/apache/spark/pull/45180#discussion_r1503547763 ## sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/HiveDDLSuite.scala: ## @@ -2876,22 +2876,39 @@ class HiveDDLSuite } } - test("SPARK-24681 checks if nested column names do not include ',', ':', and ';'") { -Seq("nested,column", "nested:column", "nested;column").foreach { nestedColumnName => + test("SPARK-47101 checks if nested column names do not include invalid characters") { +// delimiter characters +Seq(",", ":").foreach { c => + val typ = s"array>" + // The regex is from HiveClientImpl.getSparkSQLDataType, please keep them in sync. + val replaced = typ.replaceAll("`", "").replaceAll("(?<=struct<|,)([^,<:]+)(?=:)", "`$1`") + withTable("t") { +checkError( + exception = intercept[SparkException] { +sql(s"CREATE TABLE t (a $typ) USING hive") Review Comment: Still fine -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [WIP][SPARK-24815] [CORE] Trigger Interval based DRA for Structured Streaming [spark]
pkotikalapudi commented on PR #42352: URL: https://github.com/apache/spark/pull/42352#issuecomment-1965668989 > If you'd like to revive this PR, please reopen it and ask a committer to remove the Stale tag! @rdblue can you please re-open the PR and remove the stale tag. I think github bot will auto-close it if the tag exists. Thank you -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-47175][SS][TESTS] Remove ZOOKEEPER-1844 comment from `KafkaTestUtils` [spark]
yaooqinn commented on PR #45265: URL: https://github.com/apache/spark/pull/45265#issuecomment-1965663502 Thanks @dongjoon-hyun @HyukjinKwon, merged to master -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-47175][SS][TESTS] Remove ZOOKEEPER-1844 comment from `KafkaTestUtils` [spark]
yaooqinn closed pull request #45265: [SPARK-47175][SS][TESTS] Remove ZOOKEEPER-1844 comment from `KafkaTestUtils` URL: https://github.com/apache/spark/pull/45265 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-47165][SQL][TESTS] Pull docker image only when its' absent [spark]
yaooqinn commented on PR #45255: URL: https://github.com/apache/spark/pull/45255#issuecomment-1965662096 Thanks, @dongjoon-hyun merged to master -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-47165][SQL][TESTS] Pull docker image only when its' absent [spark]
yaooqinn closed pull request #45255: [SPARK-47165][SQL][TESTS] Pull docker image only when its' absent URL: https://github.com/apache/spark/pull/45255 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-47165][SQL][TESTS] Pull docker image only when its' absent [spark]
yaooqinn commented on code in PR #45255: URL: https://github.com/apache/spark/pull/45255#discussion_r1503540284 ## connector/docker-integration-tests/src/test/scala/org/apache/spark/sql/jdbc/DockerJDBCIntegrationSuite.scala: ## @@ -148,10 +148,6 @@ abstract class DockerJDBCIntegrationSuite pulled = true } - docker.pullImageCmd(db.imageName) -.start() -.awaitCompletion(connectionTimeout.value.toSeconds, TimeUnit.SECONDS) Review Comment: @dongjoon-hyun, I am not quite sure:) -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-47094][SQL] SPJ : Dynamically rebalance number of buckets when they are not equal [spark]
szehon-ho commented on code in PR #45267: URL: https://github.com/apache/spark/pull/45267#discussion_r1503537598 ## sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala: ## @@ -1537,6 +1537,18 @@ object SQLConf { .booleanConf .createWithDefault(false) + val V2_BUCKETING_ALLOW_COMPATIBLE_TRANSFORMS = +buildConf("spark.sql.sources.v2.bucketing.allow.enabled") Review Comment: Thanks, yea it is a copy and paste error. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-47166][INFRA] Improves the HINTs of merge_spark_pr.py [spark]
yaooqinn commented on code in PR #45256: URL: https://github.com/apache/spark/pull/45256#discussion_r1503537067 ## dev/merge_spark_pr.py: ## @@ -72,6 +72,13 @@ BRANCH_PREFIX = "PR_TOOL" +def print_error(msg): +print("\033[91m%s\033[0m" % msg) + + +def bold_input(prompt) -> str: +return input("\033[1m%s\033[0m" % prompt) + Review Comment: fixed -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-39771][CORE] Add a warning msg in `Dependency` when a too large number of shuffle blocks is to be created. [spark]
JoshRosen commented on code in PR #45266: URL: https://github.com/apache/spark/pull/45266#discussion_r1503532857 ## core/src/main/scala/org/apache/spark/Dependency.scala: ## @@ -206,6 +206,21 @@ class ShuffleDependency[K: ClassTag, V: ClassTag, C: ClassTag]( finalizeTask = Option(task) } + // Set the threshold to 1 billion which represents approximately 1GB of memory + // allocated to map output statuses. + // A large number of shuffle blocks may crash the driver with an OOM error. + private val SHUFFLE_BLOCK_NUMBER_WARNING_THRESHOLD: Long = 10L + private val numberOfShuffleBlocks = numPartitions.toLong * partitioner.numPartitions.toLong + + if (numberOfShuffleBlocks > SHUFFLE_BLOCK_NUMBER_WARNING_THRESHOLD) { +logWarning( + s"The number of shuffle blocks (${numberOfShuffleBlocks}) for ${_rdd} " + Review Comment: Maybe we could also log the shuffle ID here as well? e.g. ``` "The number of shuffle blocks (...) for shuffleId $shuffleId for ${_rdd} with ${numPartition} partitions " ``` or something along those lines? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-47166][INFRA] Improves the HINTs of merge_spark_pr.py [spark]
yaooqinn commented on code in PR #45256: URL: https://github.com/apache/spark/pull/45256#discussion_r1503535482 ## dev/merge_spark_pr.py: ## @@ -72,6 +72,13 @@ BRANCH_PREFIX = "PR_TOOL" +def print_error(msg): +print("\033[91m%s\033[0m" % msg) Review Comment: https://en.wikipedia.org/wiki/ANSI_escape_code#Colors Not all terminals and operating systems, but most of them support ANSI colors -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org