Re: [PR] [SPARK-45777][CORE] Support `spark.test.appId` in `LocalSchedulerBackend` [spark]
dongjoon-hyun commented on code in PR #43645: URL: https://github.com/apache/spark/pull/43645#discussion_r1381186064

## core/src/main/scala/org/apache/spark/scheduler/local/LocalSchedulerBackend.scala:

```diff
@@ -110,7 +110,7 @@ private[spark] class LocalSchedulerBackend(
     val totalCores: Int)
   extends SchedulerBackend with ExecutorBackend with Logging {

-  private val appId = "local-" + System.currentTimeMillis
+  private val appId = sys.props.getOrElse("spark.test.appId", "local-" + System.currentTimeMillis)
```

Review Comment: Although I did this to avoid the `SparkConf(false)` case, in my use case I always use environment variables, so it would be okay to load it from `SparkConf`. Do you want me to change it, @yaooqinn ?
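For illustration, a minimal sketch of the `SparkConf`-based alternative being discussed (hypothetical code, not the PR's change; it relies on `loadDefaults = true` copying `spark.*` JVM system properties into the conf):

```scala
import org.apache.spark.SparkConf

// Sketch: resolve the test app ID through SparkConf rather than sys.props.
// A SparkConf built with loadDefaults = true picks up every spark.* JVM
// system property, so setting -Dspark.test.appId=... keeps working.
val conf = new SparkConf(loadDefaults = true)
val appId = conf
  .getOption("spark.test.appId")
  .getOrElse("local-" + System.currentTimeMillis)
```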
Re: [PR] [MINOR][DOCS] Fixed typo [spark]
yaooqinn commented on code in PR #43634: URL: https://github.com/apache/spark/pull/43634#discussion_r1381181952

## sql/README.md:

```diff
@@ -6,7 +6,7 @@ This module provides support for executing relational queries expressed in eithe
 Spark SQL is broken up into four subprojects:
  - Catalyst (sql/catalyst) - An implementation-agnostic framework for manipulating trees of relational operators and expressions.
  - Execution (sql/core) - A query planner / execution engine for translating Catalyst's logical query plans into Spark RDDs. This component also includes a new public interface, SQLContext, that allows users to execute SQL or LINQ statements against existing RDDs and Parquet files.
- - Hive Support (sql/hive) - Includes extensions that allow users to write queries using a subset of HiveQL and access data from a Hive Metastore using Hive SerDes. There are also wrappers that allow users to run queries that include Hive UDFs, UDAFs, and UDTFs.
+ - Hive Support (sql/hive) - Includes extensions that allow users to write queries using a subset of HiveSQL and access data from a Hive Metastore using Hive SerDes. There are also wrappers that allow users to run queries that include Hive UDFs, UDAFs, and UDTFs.
```

Review Comment: From what I have gathered, HiveQL appears to be a widely accepted term, e.g. https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/EMRforDynamoDB.ProcessingHiveQL.html
Re: [PR] [MINOR][DOCS] Fixed typo [spark]
yaooqinn commented on PR #43634: URL: https://github.com/apache/spark/pull/43634#issuecomment-1791921276

Hi @YuanHanzhong, thanks for your first contribution to Spark. However, could you please include #43636 in this patch as well? Also, kindly answer the questions in the PR template instead of removing them all.
Re: [PR] [SPARK-45740][SQL] Relax the node prefix of SparkPlanGraphCluster [spark]
yaooqinn commented on PR #43602: URL: https://github.com/apache/spark/pull/43602#issuecomment-1791914386

Can you attach a screenshot of the UI after this change?
Re: [PR] [SPARK-45762][CORE] Support shuffle managers defined in user jars by changing startup order [spark]
mridulm commented on code in PR #43627: URL: https://github.com/apache/spark/pull/43627#discussion_r1381167046

## core/src/main/scala/org/apache/spark/SparkEnv.scala:

```diff
@@ -71,6 +69,10 @@ class SparkEnv (
     val outputCommitCoordinator: OutputCommitCoordinator,
     val conf: SparkConf) extends Logging {

+  // We initialize the ShuffleManager later, in SparkContext and Executor, to allow
+  // user jars to define custom ShuffleManagers.
+  var shuffleManager: ShuffleManager = _
```

Review Comment: Given `SparkEnv` is a `DeveloperApi`, let us not expose this for mutation.

```suggestion
  private var _shuffleManager: ShuffleManager = _

  def shuffleManager: ShuffleManager = _shuffleManager
```
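For context, a minimal sketch of the resulting accessor pattern (the `private[spark]` setter here is an assumption for illustration, not part of the suggestion):

```scala
package org.apache.spark

import org.apache.spark.shuffle.ShuffleManager

class SparkEnvSketch {
  // Mutable only inside the spark package; external DeveloperApi consumers
  // see just the read-only accessor below.
  private var _shuffleManager: ShuffleManager = _

  def shuffleManager: ShuffleManager = _shuffleManager

  // Hypothetical setter used by SparkContext / Executor once user jars
  // (and any custom ShuffleManager they define) can be loaded.
  private[spark] def initShuffleManager(sm: ShuffleManager): Unit =
    _shuffleManager = sm
}
```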
Re: [PR] [SPARK-45730][CORE] Make ReloadingX509TrustManagerSuite less flaky [spark]
mridulm closed pull request #43596: [SPARK-45730][CORE] Make ReloadingX509TrustManagerSuite less flaky URL: https://github.com/apache/spark/pull/43596
Re: [PR] [SPARK-45730][CORE] Make ReloadingX509TrustManagerSuite less flaky [spark]
mridulm commented on PR #43596: URL: https://github.com/apache/spark/pull/43596#issuecomment-1791900605

Merged to master. Thanks for fixing this @hasnain-db !
Re: [PR] [SPARK-45730][CORE] Make ReloadingX509TrustManagerSuite less flaky [spark]
mridulm commented on code in PR #43596: URL: https://github.com/apache/spark/pull/43596#discussion_r1381165616

## common/network-common/src/test/java/org/apache/spark/network/ssl/ReloadingX509TrustManagerSuite.java:

```diff
@@ -161,14 +161,17 @@ public void testReload() throws Exception {
     // At this point we haven't reloaded, just the initial load
     assertEquals(0, tm.reloadCount);

+    // Wait so that the file modification time is different
+    Thread.sleep((tm.getReloadInterval() + 200));
+
     // Add another cert
     Map certs = new HashMap();
     certs.put("cert1", cert1);
     certs.put("cert2", cert2);
     createTrustStore(trustStore, "password", certs);

-    // Wait up to 5s until we reload
-    waitForReloadCount(tm, 1, 50);
+    // Wait up to 10s until we reload
+    waitForReloadCount(tm, 1, 100);
```

Review Comment: That should do it :-) I would expect a lot more tests (outside of this effort) to be impacted on machines that are under load!
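For context, a hedged Scala sketch of the poll-and-retry shape a helper like `waitForReloadCount` typically has (hypothetical code, not the suite's actual Java implementation; it shows why 100 attempts at ~100 ms each gives the 10 s budget):

```scala
// Hypothetical polling helper: check up to `attempts` times with a 100 ms
// pause between checks, so attempts = 100 allows roughly 10 seconds.
def waitForReloadCount(currentCount: () => Int, expected: Int, attempts: Int): Unit = {
  var i = 0
  while (i < attempts && currentCount() < expected) {
    Thread.sleep(100)
    i += 1
  }
  assert(currentCount() >= expected, s"reload count did not reach $expected")
}
```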
Re: [PR] [SPARK-45777][CORE] Support `spark.test.appId` in `LocalSchedulerBackend` [spark]
yaooqinn commented on code in PR #43645: URL: https://github.com/apache/spark/pull/43645#discussion_r1381162154

## core/src/main/scala/org/apache/spark/scheduler/local/LocalSchedulerBackend.scala:

```diff
@@ -110,7 +110,7 @@ private[spark] class LocalSchedulerBackend(
     val totalCores: Int)
   extends SchedulerBackend with ExecutorBackend with Logging {

-  private val appId = "local-" + System.currentTimeMillis
+  private val appId = sys.props.getOrElse("spark.test.appId", "local-" + System.currentTimeMillis)
```

Review Comment: Is there a particular reason to retrieve this from the system properties instead of the SparkConf?
Re: [PR] [SPARK-45774][CORE][UI] Support `spark.ui.historyServerUrl` in `ApplicationPage` [spark]
yaooqinn commented on code in PR #43643: URL: https://github.com/apache/spark/pull/43643#discussion_r1381161116

## core/src/main/scala/org/apache/spark/deploy/master/ui/ApplicationPage.scala:

```diff
@@ -98,6 +98,11 @@ private[ui] class ApplicationPage(parent: MasterWebUI) extends WebUIPage("app")
           Application Detail UI

+      } else if (parent.master.historyServerUrl.nonEmpty) {
+
```

Review Comment: The history server is configured per application, not on the Spark master. So, IMO, we should get the history server URL from the app itself instead.
Re: [PR] [SPARK-45777][CORE] Support `spark.test.appId` in `LocalSchedulerBackend` [spark]
dongjoon-hyun commented on PR #43645: URL: https://github.com/apache/spark/pull/43645#issuecomment-1791879563

When you have some time, could you review this test conf PR, too, @LuciferYang ?
Re: [PR] [SPARK-45527][core] Use fraction to do the resource calculation [spark]
wbo4958 commented on code in PR #43494: URL: https://github.com/apache/spark/pull/43494#discussion_r1379714430

## core/src/main/scala/org/apache/spark/resource/ResourceAllocator.scala:

```diff
@@ -29,59 +65,54 @@ private[spark] trait ResourceAllocator {
   protected def resourceName: String
   protected def resourceAddresses: Seq[String]
-  protected def slotsPerAddress: Int

   /**
-   * Map from an address to its availability, a value > 0 means the address is available,
-   * while value of 0 means the address is fully assigned.
-   *
-   * For task resources ([[org.apache.spark.scheduler.ExecutorResourceInfo]]), this value
-   * can be a multiple, such that each address can be allocated up to [[slotsPerAddress]]
-   * times.
+   * Map from an address to its availability default to RESOURCE_TOTAL_AMOUNT, a value > 0 means
+   * the address is available, while value of 0 means the address is fully assigned.
    */
   private lazy val addressAvailabilityMap = {
-    mutable.HashMap(resourceAddresses.map(_ -> slotsPerAddress): _*)
+    mutable.HashMap(resourceAddresses.map(address => address -> RESOURCE_TOTAL_AMOUNT): _*)
   }

   /**
-   * Sequence of currently available resource addresses.
-   *
-   * With [[slotsPerAddress]] greater than 1, [[availableAddrs]] can contain duplicate addresses
-   * e.g. with [[slotsPerAddress]] == 2, availableAddrs for addresses 0 and 1 can look like
-   * Seq("0", "0", "1"), where address 0 has two assignments available, and 1 has one.
+   * Get the resources and its amounts.
+   * @return the resources amounts
+   */
+  def resourcesAmounts: Map[String, Double] = addressAvailabilityMap.map {
```

Review Comment: Hi Tom, regarding "if we start getting into the requests where user could ask for 25 resources then we could hit overflow issues": I couldn't understand why we would hit the overflow issue.

Review Comment: Yes, leaving it in the Long should be more efficient.
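For illustration, a hedged sketch of the Long-based fixed-point idea being discussed (the scale of 10000 and the helper names are assumptions, not the PR's actual constants):

```scala
// Sketch: track fractional availability as a Long in fixed-point units
// instead of a Double, so repeated additions/subtractions of task fractions
// cannot accumulate floating-point error.
val RESOURCE_TOTAL_AMOUNT: Long = 10000L // assume 1.0 == 10000 internal units

def toInternal(amount: Double): Long = (amount * RESOURCE_TOTAL_AMOUNT).toLong
def toPublic(internal: Long): Double = internal.toDouble / RESOURCE_TOTAL_AMOUNT

// Acquiring a 0.25 slice three times leaves exactly 0.25 available:
var avail: Long = RESOURCE_TOTAL_AMOUNT
(1 to 3).foreach(_ => avail -= toInternal(0.25))
assert(toPublic(avail) == 0.25)
```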
Re: [PR] [SPARK-45687][SQL] Fix `Passing an explicit array value to a Scala varargs method is deprecated` [spark]
ivoson commented on PR #43642: URL: https://github.com/apache/spark/pull/43642#issuecomment-1791865923

> Could you check again? IIRC, there should be more than 40+ files involved in this issue...

Thanks @LuciferYang ... checking
Re: [PR] [SPARK-45774][CORE][UI] Support `spark.ui.historyServerUrl` in `ApplicationPage` [spark]
dongjoon-hyun commented on PR #43643: URL: https://github.com/apache/spark/pull/43643#issuecomment-1791863587

Could you review this, @LuciferYang ?
[PR] [SPARK-45776][CORE] Remove the defensive null check for `MapOutputTrackerMaster#unregisterShuffle` added in SPARK-39553 [spark]
LuciferYang opened a new pull request, #43644: URL: https://github.com/apache/spark/pull/43644

### What changes were proposed in this pull request?
This PR removes the defensive null check for `MapOutputTrackerMaster#unregisterShuffle` added in SPARK-39553.

### Why are the changes needed?
https://github.com/scala/bug/issues/12613 has been fixed in Scala 2.13.9.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Existing tests such as `SPARK-39553: Multi-thread unregister shuffle shouldn't throw NPE` in `MapOutputTrackerSuite`

### Was this patch authored or co-authored using generative AI tooling?
No
Re: [PR] [SPARK-45758][SQL] Introduce a mapper for hadoop compression codecs [spark]
LuciferYang commented on code in PR #43620: URL: https://github.com/apache/spark/pull/43620#discussion_r1381132422

## sql/core/src/test/scala/org/apache/spark/sql/DataFrameSuite.scala:

```diff
@@ -2766,7 +2767,8 @@ class DataFrameSuite extends QueryTest
       // The data set has 2 partitions, so Spark will write at least 2 json files.
       // Use a non-splittable compression (gzip), to make sure the json scan RDD has at least 2
       // partitions.
-      .write.partitionBy("p").option("compression", "gzip").json(path.getCanonicalPath)
+      .write.partitionBy("p")
+      .option("compression", GZIP.lowerCaseName()).json(path.getCanonicalPath)
```

Review Comment: @pan3793 Could these cases possibly manifest as compilation warnings? Or are they merely suggestions for best practices?
[PR] [SPARK-45774][CORE][UI] Support `spark.ui.historyServerUrl` in `ApplicationPage` [spark]
dongjoon-hyun opened a new pull request, #43643: URL: https://github.com/apache/spark/pull/43643

…

### What changes were proposed in this pull request?

### Why are the changes needed?

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

### Was this patch authored or co-authored using generative AI tooling?
Re: [PR] [SPARK-45694][SPARK-45695][SQL] Clean up deprecated API usage `View.force` and `ScalaNumberProxy.signum` [spark]
LuciferYang commented on PR #43637: URL: https://github.com/apache/spark/pull/43637#issuecomment-1791845304

Merged into master for Spark 4.0. Thanks @ivoson and @dongjoon-hyun ~
Re: [PR] [SPARK-45694][SPARK-45695][SQL] Clean up deprecated API usage `View.force` and `ScalaNumberProxy.signum` [spark]
LuciferYang closed pull request #43637: [SPARK-45694][SPARK-45695][SQL] Clean up deprecated API usage `View.force` and `ScalaNumberProxy.signum` URL: https://github.com/apache/spark/pull/43637
Re: [PR] [SPARK-45687][SQL] Fix `Passing an explicit array value to a Scala varargs method is deprecated` [spark]
LuciferYang commented on PR #43642: URL: https://github.com/apache/spark/pull/43642#issuecomment-1791842647

for example:

https://github.com/apache/spark/blob/eda9911057b893e42f49dbd7448f20f91f2798c4/core/src/main/scala/org/apache/spark/api/python/PythonRDD.scala#L182

https://github.com/apache/spark/blob/eda9911057b893e42f49dbd7448f20f91f2798c4/core/src/main/scala/org/apache/spark/rdd/RDD.scala#L1047

and

https://github.com/apache/spark/blob/eda9911057b893e42f49dbd7448f20f91f2798c4/examples/src/main/scala/org/apache/spark/examples/graphx/Analytics.scala#L54
Re: [PR] [SPARK-45687][SQL] Fix `Passing an explicit array value to a Scala varargs method is deprecated` [spark]
LuciferYang commented on PR #43642: URL: https://github.com/apache/spark/pull/43642#issuecomment-1791840682

Could you check again? IIRC, there should be more than 40+ files involved in this issue...
Re: [PR] [SPARK-45687][SQL] Fix `Passing an explicit array value to a Scala varargs method is deprecated` [spark]
ivoson commented on PR #43642: URL: https://github.com/apache/spark/pull/43642#issuecomment-1791831105

cc @LuciferYang please take a look at this PR. Thanks.
[PR] [SPARK-45687][SQL] Fix `Passing an explicit array value to a Scala varargs method is deprecated` [spark]
ivoson opened a new pull request, #43642: URL: https://github.com/apache/spark/pull/43642

### What changes were proposed in this pull request?
Fix the deprecated behavior below:

`Passing an explicit array value to a Scala varargs method is deprecated (since 2.13.0) and will result in a defensive copy; Use the more efficient non-copying ArraySeq.unsafeWrapArray or an explicit toIndexedSeq call`

It exists in two test suites, `AggregationQuerySuite` and `ObjectHashAggregateSuite`; for the cases in these two suites, we can take the non-copying method `ArraySeq.unsafeWrapArray`.

### Why are the changes needed?
Eliminate compile warnings and no longer use deprecated Scala APIs.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Pass GA.

### Was this patch authored or co-authored using generative AI tooling?
No.
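For illustration, a minimal sketch of the warning and the two fixes named above (generic names, not taken from the test suites):

```scala
import scala.collection.immutable.ArraySeq

object VarargsSketch {
  def printAll(xs: String*): Unit = xs.foreach(println)

  def main(args: Array[String]): Unit = {
    val arr = Array("a", "b", "c")

    // Deprecated since 2.13.0: passing an Array straight to a varargs
    // parameter forces a defensive copy and emits the warning above.
    // printAll(arr: _*)

    // Non-copying fix: wrap the array without copying its elements.
    printAll(ArraySeq.unsafeWrapArray(arr): _*)

    // Copying fix, safer if the array may be mutated afterwards.
    printAll(arr.toIndexedSeq: _*)
  }
}
```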
Re: [PR] [SPARK-45694][SPARK-45695][SQL] Clean up deprecated API usage `View.force` and `ScalaNumberProxy.signum` [spark]
ivoson commented on PR #43637: URL: https://github.com/apache/spark/pull/43637#issuecomment-1791823766

> https://user-images.githubusercontent.com/1475305/280160599-56b53eae-d7fc-4b21-9424-317c5a3daee5.png
> @ivoson Please update the pr description.

Thanks, @LuciferYang. Done.
Re: [PR] [SPARK-45694][SPARK-45695][SQL] Clean up deprecated API usage `View.force` and `ScalaNumberProxy.signum` [spark]
ivoson commented on PR #43637: URL: https://github.com/apache/spark/pull/43637#issuecomment-1791823590

> Could you re-trigger the failed test pipeline, @ivoson ?

Hi @dongjoon-hyun, done. The test passed now.
Re: [PR] [SPARK-45758][SQL] Introduce a mapper for hadoop compression codecs [spark]
beliefer commented on code in PR #43620: URL: https://github.com/apache/spark/pull/43620#discussion_r1381060163

## sql/core/src/test/scala/org/apache/spark/sql/DataFrameSuite.scala:

```diff
@@ -2766,7 +2767,8 @@ class DataFrameSuite extends QueryTest
       // The data set has 2 partitions, so Spark will write at least 2 json files.
       // Use a non-splittable compression (gzip), to make sure the json scan RDD has at least 2
       // partitions.
-      .write.partitionBy("p").option("compression", "gzip").json(path.getCanonicalPath)
+      .write.partitionBy("p")
+      .option("compression", GZIP.lowerCaseName()).json(path.getCanonicalPath)
```

Review Comment: Because there are already a lot of Java no-arg methods called with `()`. Let's update them if we really have a strong requirement.
Re: [PR] [SPARK-45758][SQL] Introduce a mapper for hadoop compression codecs [spark]
beliefer commented on code in PR #43620: URL: https://github.com/apache/spark/pull/43620#discussion_r1381041537

## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/CompressionCodecs.scala:

```diff
@@ -21,19 +21,15 @@ import java.util.Locale
 
 import org.apache.hadoop.conf.Configuration
 import org.apache.hadoop.io.SequenceFile.CompressionType
-import org.apache.hadoop.io.compress._
 
 import org.apache.spark.util.Utils
 
 object CompressionCodecs {
-  private val shortCompressionCodecNames = Map(
-    "none" -> null,
-    "uncompressed" -> null,
-    "bzip2" -> classOf[BZip2Codec].getName,
-    "deflate" -> classOf[DeflateCodec].getName,
-    "gzip" -> classOf[GzipCodec].getName,
-    "lz4" -> classOf[Lz4Codec].getName,
-    "snappy" -> classOf[SnappyCodec].getName)
+  private val shortCompressionCodecNames = HadoopCompressionCodec.values().map { codec =>
+    val className =
+      if (codec.getCompressionCodec == null) null else codec.getCompressionCodec.getClass.getName
+    codec.lowerCaseName() -> className
```

Review Comment: Thank you.
Re: [PR] [SPARK-45758][SQL] Introduce a mapper for hadoop compression codecs [spark]
beliefer commented on code in PR #43620: URL: https://github.com/apache/spark/pull/43620#discussion_r1381028768

## sql/catalyst/src/main/java/org/apache/spark/sql/catalyst/util/HadoopCompressionCodec.java:

```diff
@@ -0,0 +1,63 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.catalyst.util;
+
+import java.util.Arrays;
+import java.util.Locale;
+import java.util.Map;
+import java.util.stream.Collectors;
+
+import org.apache.hadoop.io.compress.BZip2Codec;
+import org.apache.hadoop.io.compress.CompressionCodec;
+import org.apache.hadoop.io.compress.DeflateCodec;
+import org.apache.hadoop.io.compress.GzipCodec;
+import org.apache.hadoop.io.compress.Lz4Codec;
+import org.apache.hadoop.io.compress.SnappyCodec;
+
+/**
+ * A mapper class from Spark supported hadoop compression codecs to hadoop compression codecs.
+ */
+public enum HadoopCompressionCodec {
+  NONE(null),
+  UNCOMPRESSED(null),
+  BZIP2(new BZip2Codec()),
+  DEFLATE(new DeflateCodec()),
+  GZIP(new GzipCodec()),
+  LZ4(new Lz4Codec()),
+  SNAPPY(new SnappyCodec());
+
+  // TODO supports ZStandardCodec
+
+  private final CompressionCodec compressionCodec;
+
+  HadoopCompressionCodec(CompressionCodec compressionCodec) {
+    this.compressionCodec = compressionCodec;
+  }
+
+  public CompressionCodec getCompressionCodec() {
+    return this.compressionCodec;
+  }
+
+  private static final Map<String, String> codecNameMap =
+    Arrays.stream(HadoopCompressionCodec.values()).collect(
+      Collectors.toMap(codec -> codec.name(), codec -> codec.name().toLowerCase(Locale.ROOT)));
```

Review Comment: I feel like both are OK.
Re: [PR] [SPARK-45688][SPARK-45693][CORE] Clean up the deprecated API usage related to `MapOps` & Fix `method += in trait Growable is deprecated` [spark]
panbingkun commented on PR #43578: URL: https://github.com/apache/spark/pull/43578#issuecomment-1791770240

> let's rebase this one @panbingkun

Done.
Re: [PR] [SPARK-45688][SPARK-45693][CORE] Clean up the deprecated API usage related to `MapOps` & Fix `method += in trait Growable is deprecated` [spark]
panbingkun commented on code in PR #43578: URL: https://github.com/apache/spark/pull/43578#discussion_r1380960587

## sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/DescribeNamespaceExec.scala:

```diff
@@ -46,7 +46,8 @@ case class DescribeNamespaceExec(
     }
 
     if (isExtended) {
-      val properties = metadata.asScala -- CatalogV2Util.NAMESPACE_RESERVED_PROPERTIES
+      val properties = metadata.asScala.filterNot(
```

Review Comment: Good suggestion!
Re: [PR] [SPARK-45688][SPARK-45693][CORE] Clean up the deprecated API usage related to `MapOps` & Fix `method += in trait Growable is deprecated` [spark]
panbingkun commented on code in PR #43578: URL: https://github.com/apache/spark/pull/43578#discussion_r1380960343

## sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/v2/V2SessionCatalogSuite.scala:

```diff
@@ -825,9 +825,8 @@ class V2SessionCatalogNamespaceSuite extends V2SessionCatalogBaseSuite {
       expected: scala.collection.Map[String, String],
       actual: scala.collection.Map[String, String]): Unit = {
     // remove location and comment that are automatically added by HMS unless they are expected
-    val toRemove =
-      CatalogV2Util.NAMESPACE_RESERVED_PROPERTIES.filter(expected.contains)
-    assert(expected -- toRemove === actual)
+    val toRemove = CatalogV2Util.NAMESPACE_RESERVED_PROPERTIES.toSet
```

Review Comment: Okay.
Re: [PR] [SPARK-45694][SPARK-45695][SQL] Clean up deprecated API usage `View.force` and `ScalaNumberProxy.signum` [spark]
LuciferYang commented on PR #43637: URL: https://github.com/apache/spark/pull/43637#issuecomment-1791748313

https://github.com/apache/spark/assets/1475305/56b53eae-d7fc-4b21-9424-317c5a3daee5

@ivoson Please update the pr description.
Re: [PR] [SPARK-45768][SQL][PYTHON] Make faulthandler a runtime configuration for Python execution in SQL [spark]
HyukjinKwon closed pull request #43635: [SPARK-45768][SQL][PYTHON] Make faulthandler a runtime configuration for Python execution in SQL URL: https://github.com/apache/spark/pull/43635
Re: [PR] [SPARK-45768][SQL][PYTHON] Make faulthandler a runtime configuration for Python execution in SQL [spark]
HyukjinKwon commented on PR #43635: URL: https://github.com/apache/spark/pull/43635#issuecomment-1791743103

All related tests passed. Merged to master.
[PR] [SPARK-45742][CORE][FOLLOWUP] Remove unnecessary null check from `ArrayImplicits.SparkArrayOps#toImmutableArraySeq` [spark]
LuciferYang opened a new pull request, #43641: URL: https://github.com/apache/spark/pull/43641

### What changes were proposed in this pull request?
The implementation of the `immutable.ArraySeq.unsafeWrapArray` function is as follows:

```scala
def unsafeWrapArray[T](x: Array[T]): ArraySeq[T] = ((x: @unchecked) match {
  case null              => null
  case x: Array[AnyRef]  => new ofRef[AnyRef](x)
  case x: Array[Int]     => new ofInt(x)
  case x: Array[Double]  => new ofDouble(x)
  case x: Array[Long]    => new ofLong(x)
  case x: Array[Float]   => new ofFloat(x)
  case x: Array[Char]    => new ofChar(x)
  case x: Array[Byte]    => new ofByte(x)
  case x: Array[Short]   => new ofShort(x)
  case x: Array[Boolean] => new ofBoolean(x)
  case x: Array[Unit]    => new ofUnit(x)
}).asInstanceOf[ArraySeq[T]]
```

The first case of the match is `null`, so there is no need for another manual null check; this PR removes it.

### Why are the changes needed?
Remove an unnecessary null check from `ArrayImplicits.SparkArrayOps#toImmutableArraySeq`.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Existing test cases, such as `ArrayImplicitsSuite`.

### Was this patch authored or co-authored using generative AI tooling?
No
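For context, a hedged sketch of what the simplified extension could look like after this change (hypothetical shape; the actual Spark helper may differ):

```scala
import scala.collection.immutable

object ArrayImplicitsSketch {
  implicit class SparkArrayOps[T](private val xs: Array[T]) extends AnyVal {
    // unsafeWrapArray already maps null to null, so no extra guard is needed.
    def toImmutableArraySeq: immutable.ArraySeq[T] =
      immutable.ArraySeq.unsafeWrapArray(xs)
  }
}
```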
Re: [PR] [SPARK-45639][SQL][PYTHON] Support loading Python data sources in DataFrameReader [spark]
HyukjinKwon commented on code in PR #43630: URL: https://github.com/apache/spark/pull/43630#discussion_r1380920238

## sql/core/src/main/scala/org/apache/spark/sql/DataFrameReader.scala:

```diff
@@ -208,10 +209,45 @@ class DataFrameReader private[sql](sparkSession: SparkSession) extends Logging {
       throw QueryCompilationErrors.pathOptionNotSetCorrectlyWhenReadingError()
     }
 
-    DataSource.lookupDataSourceV2(source, sparkSession.sessionState.conf).flatMap { provider =>
-      DataSourceV2Utils.loadV2Source(sparkSession, provider, userSpecifiedSchema, extraOptions,
-        source, paths: _*)
-    }.getOrElse(loadV1Source(paths: _*))
+    val isUserDefinedDataSource =
```

Review Comment: Let's at least separate the logic into a separate function if possible.
Re: [PR] [SPARK-45770][SQL][PYTHON][CONNECT] Fix column resolution in `DataFrame.drop` [spark]
zhengruifeng commented on code in PR #43632: URL: https://github.com/apache/spark/pull/43632#discussion_r1380919985

## sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala:

```diff
@@ -3064,13 +3064,31 @@ class Dataset[T] private[sql](
    */
   @scala.annotation.varargs
   def drop(col: Column, cols: Column*): DataFrame = {
-    val allColumns = col +: cols
-    val expressions = (for (col <- allColumns) yield col match {
+    val expressions = (col +: cols).map {
       case Column(u: UnresolvedAttribute) =>
-        queryExecution.analyzed.resolveQuoted(
```

Review Comment: I think a new plan, `DropColumns`, would be a better approach.
Re: [PR] [SPARK-45639][SQL][PYTHON] Support loading Python data sources in DataFrameReader [spark]
HyukjinKwon commented on code in PR #43630: URL: https://github.com/apache/spark/pull/43630#discussion_r1380919626

## sql/core/src/main/scala/org/apache/spark/sql/DataFrameReader.scala:

```diff
@@ -208,10 +209,45 @@ class DataFrameReader private[sql](sparkSession: SparkSession) extends Logging {
       throw QueryCompilationErrors.pathOptionNotSetCorrectlyWhenReadingError()
     }
 
-    DataSource.lookupDataSourceV2(source, sparkSession.sessionState.conf).flatMap { provider =>
-      DataSourceV2Utils.loadV2Source(sparkSession, provider, userSpecifiedSchema, extraOptions,
-        source, paths: _*)
-    }.getOrElse(loadV1Source(paths: _*))
+    val isUserDefinedDataSource =
```

Review Comment: @cloud-fan @allisonwang-db we want to support this data source via the `USING` syntax, unlike DSv2, right? If that's the case, the logic for loading the DataSource has to live within `DataSource.lookupDataSource` and/or `DataSource.providingInstance`. I don't think we should mix the logic here with DSv2.
Re: [PR] [SPARK-44419][SQL] Support to extract partial filters of datasource v2 table and push them down [spark]
github-actions[bot] closed pull request #42000: [SPARK-44419][SQL] Support to extract partial filters of datasource v2 table and push them down URL: https://github.com/apache/spark/pull/42000
Re: [PR] [SPARK-44065][SQL] Optimize BroadcastHashJoin skew in OptimizeSkewedJoin [spark]
github-actions[bot] commented on PR #41609: URL: https://github.com/apache/spark/pull/41609#issuecomment-1791732437

We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable. If you'd like to revive this PR, please reopen it and ask a committer to remove the Stale tag!
Re: [PR] [SPARK-44426][SQL] Optimize adaptive skew join for ExistenceJoin [spark]
github-actions[bot] closed pull request #42003: [SPARK-44426][SQL] Optimize adaptive skew join for ExistenceJoin URL: https://github.com/apache/spark/pull/42003
Re: [PR] [SPARK-44517][SQL] Respect ignorenulls and child's nullability in first [spark]
github-actions[bot] commented on PR #42117: URL: https://github.com/apache/spark/pull/42117#issuecomment-1791732388

We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable. If you'd like to revive this PR, please reopen it and ask a committer to remove the Stale tag!
Re: [PR] [WIP][SPARK-4836][UI] Show all stage attempts on UI's job details page [spark]
JoshRosen commented on PR #43640: URL: https://github.com/apache/spark/pull/43640#issuecomment-1791732011

This is an attempt at fixing a nearly nine-year-old Spark UI nit (https://issues.apache.org/jira/browse/SPARK-4836). I'm opening this draft PR early to get feedback on a couple of design questions:

1. Should this be configuration-flaggable?
2. How should we handle sorting of the stage table? It looks like stage sorting is done server-side and the table can only be sorted on a single column. By default the table is sorted on Stage ID.
3. Will the "initially-skipped-but-subsequently-retried" stages semantic be confusing to users?
   - See [the screenshot](https://user-images.githubusercontent.com/50748/280157322-90abb73a-c8f0-41fc-98ba-71de2300b349.png) for an example: in that case, I'm running a job that re-uses the shuffle output of a previous job and hits a fetch failure, causing a recomputation of that initially-skipped stage.
Re: [PR] [SPARK-45773][PYTHON][DOCS] Refine docstring of SparkSession.builder.config [spark]
HyukjinKwon commented on code in PR #43639: URL: https://github.com/apache/spark/pull/43639#discussion_r1380909244

## python/pyspark/sql/session.py:

```diff
@@ -253,20 +253,32 @@ def config(
         -------
         :class:`SparkSession.Builder`
 
+        See Also
+        --------
+        :class:`SparkConf`
+
         Examples
         --------
-        For an existing class:`SparkConf`, use `conf` parameter.
+        For an existing :class:`SparkConf`, use `conf` parameter.
 
         >>> from pyspark.conf import SparkConf
-        >>> SparkSession.builder.config(conf=SparkConf())
+        >>> conf = SparkConf().setAppName("example").setMaster("local")
+        >>> SparkSession.builder.config(conf=conf)
 
         >>> SparkSession.builder.config("spark.some.config.option", "some-value")
 
+        >>> SparkSession.builder \\
+        ...     .config("spark.some.config.number", 123) \\
+        ...     .config("spark.some.config.float", 0.123)
```

Review Comment: Using backslashes is sort of implicitly discouraged in PEP 8.
Re: [PR] [SPARK-45773][PYTHON][DOCS] Refine docstring of SparkSession.builder.config [spark]
HyukjinKwon commented on code in PR #43639: URL: https://github.com/apache/spark/pull/43639#discussion_r1380909115

## python/pyspark/sql/session.py:

```diff
@@ -253,20 +253,32 @@ def config(
         -------
         :class:`SparkSession.Builder`
 
+        See Also
+        --------
+        :class:`SparkConf`
+
         Examples
         --------
-        For an existing class:`SparkConf`, use `conf` parameter.
+        For an existing :class:`SparkConf`, use `conf` parameter.
 
         >>> from pyspark.conf import SparkConf
-        >>> SparkSession.builder.config(conf=SparkConf())
+        >>> conf = SparkConf().setAppName("example").setMaster("local")
+        >>> SparkSession.builder.config(conf=conf)
 
         >>> SparkSession.builder.config("spark.some.config.option", "some-value")
 
+        >>> SparkSession.builder \\
+        ...     .config("spark.some.config.number", 123) \\
+        ...     .config("spark.some.config.float", 0.123)
```

Review Comment:

```suggestion
        >>> SparkSession.builder.config(
        ...     "spark.some.config.number", 123).config(
        ...     "spark.some.config.float", 0.123)
```
[PR] [SPARK-45773][PYTHON][DOCS] Refine docstring of SparkSession.builder.config [spark]
allisonwang-db opened a new pull request, #43639: URL: https://github.com/apache/spark/pull/43639

### What changes were proposed in this pull request?
This PR refines the docstring of the method `SparkSession.builder.config`.

### Why are the changes needed?
To improve PySpark documentation.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
doc test

### Was this patch authored or co-authored using generative AI tooling?
No
Re: [PR] [SPARK-45768][SQL][PYTHON] Make faulthandler a runtime configuration for Python execution in SQL [spark]
HyukjinKwon commented on PR #43635: URL: https://github.com/apache/spark/pull/43635#issuecomment-1791657207

Thanks all, addressed!
Re: [PR] [SPARK-45688][SPARK-45693][CORE] Clean up the deprecated API usage related to `MapOps` & Fix `method += in trait Growable is deprecated` [spark]
panbingkun commented on PR #43578: URL: https://github.com/apache/spark/pull/43578#issuecomment-1791642047

> let's rebase this one @panbingkun

Okay, thank you very much.
Re: [PR] [SPARK-33393][SQL] Support SHOW TABLE EXTENDED in v2 [spark]
panbingkun commented on PR #37588: URL: https://github.com/apache/spark/pull/37588#issuecomment-1791622292

> I think it's pretty close now, thanks for your patience!

I will update it again today. Thank you very much for your patience and seriousness, which has been a great help to me!
Re: [PR] [SPARK-45694][SPARK-45695][SQL] Clean up deprecated API usage `View.force` and `ScalaNumberProxy.signum` [spark]
dongjoon-hyun commented on PR #43637: URL: https://github.com/apache/spark/pull/43637#issuecomment-1791604791

Could you re-trigger the failed test pipeline, @ivoson ?
Re: [PR] [SPARK-45694][SPARK-45695][SQL] Clean up deprecated API usage View.force and ScalaNumberProxy.signum [spark]
dongjoon-hyun commented on code in PR #43637: URL: https://github.com/apache/spark/pull/43637#discussion_r1380825014

## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/trees/TreeNode.scala:

```diff
@@ -362,10 +362,8 @@ abstract class TreeNode[BaseType <: TreeNode[BaseType]]
       case s: Seq[_] =>
         s.map(mapChild)
       case m: Map[_, _] =>
-        // `map.mapValues().view.force` return `Map` in Scala 2.12 but return `IndexedSeq` in Scala
-        // 2.13, call `toMap` method manually to compatible with Scala 2.12 and Scala 2.13
-        // `mapValues` is lazy and we need to force it to materialize
-        m.view.mapValues(mapChild).view.force.toMap
+        // `mapValues` is lazy and we need to force it to materialize by converting to Map
```

Review Comment: Is this, `we need to force it to materialize`, correct with the new code?
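For context, a small illustration of the laziness in question (plain Scala 2.13 behavior, not Spark code):

```scala
object MapValuesLaziness extends App {
  var calls = 0
  val m = Map(1 -> "a", 2 -> "b")

  // view.mapValues is lazy: the function has not run yet.
  val lazyView = m.view.mapValues { v => calls += 1; v.toUpperCase }
  assert(calls == 0)

  // toMap materializes the view, running the function once per entry.
  val materialized = lazyView.toMap
  assert(calls == 2)
  assert(materialized == Map(1 -> "A", 2 -> "B"))
}
```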
Re: [PR] Keep the taskBinary in Stage to avoid deserializing it multiple times, and send binary to executor instead of task. [spark]
mridulm commented on PR #43621: URL: https://github.com/apache/spark/pull/43621#issuecomment-1791582645 The broadcast variable is read once, not n times, and deserializing it n times prevents side effects between tasks; moving away from that would be a behavior change. On the performance aspect, as currently formulated, I would expect a negligible (if any) difference, though I would be happy to see numbers to the contrary to evaluate effectiveness! Additionally, given the possibility of side effects, the benefits would need to be very compelling to justify going down this path.
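To illustrate the side-effect concern: if a task closure carries mutable state, deserializing it once per task isolates that state, while sharing a single deserialized copy would let tasks observe each other's mutations. A hypothetical sketch using plain Java serialization (not Spark's actual task machinery):

```scala
import java.io.{ByteArrayInputStream, ByteArrayOutputStream, ObjectInputStream, ObjectOutputStream}
import scala.collection.mutable.ArrayBuffer

// Hypothetical task function carrying mutable state in its closure.
class TaskFunc extends Serializable {
  val seen = ArrayBuffer.empty[Int]
  def run(x: Int): Int = { seen += x; seen.size }
}

object TaskIsolationDemo extends App {
  def deserializeCopy[T <: AnyRef](obj: T): T = {
    val buf = new ByteArrayOutputStream()
    val out = new ObjectOutputStream(buf)
    out.writeObject(obj)
    out.flush()
    new ObjectInputStream(new ByteArrayInputStream(buf.toByteArray))
      .readObject().asInstanceOf[T]
  }

  val taskBinary = new TaskFunc

  // Per-task deserialization (current behavior): each task gets a fresh copy.
  val task1 = deserializeCopy(taskBinary)
  val task2 = deserializeCopy(taskBinary)
  assert(task1.run(1) == 1 && task2.run(2) == 1) // no cross-task interference

  // One shared deserialized copy: the second run sees the first's mutation.
  val shared = deserializeCopy(taskBinary)
  assert(shared.run(1) == 1 && shared.run(2) == 2)
}
```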
Re: [PR] [SPARK-45757][ML] Avoid re-computation of NNZ in Binarizer [spark]
dongjoon-hyun commented on PR #43619: URL: https://github.com/apache/spark/pull/43619#issuecomment-1791561507 Merged to master for Apache Spark 4.0.0.
Re: [PR] [SPARK-45757][ML] Avoid re-computation of NNZ in Binarizer [spark]
dongjoon-hyun closed pull request #43619: [SPARK-45757][ML] Avoid re-computation of NNZ in Binarizer URL: https://github.com/apache/spark/pull/43619
Re: [PR] [SPARK-45768][SQL][PYTHON] Make faulthandler a runtime configuration for Python execution in SQL [spark]
dongjoon-hyun commented on code in PR #43635: URL: https://github.com/apache/spark/pull/43635#discussion_r1380796957

## python/pyspark/sql/tests/test_udf.py:

@@ -22,6 +22,8 @@
 import unittest
 import datetime

+from py4j.protocol import Py4JJavaError

Review Comment: This seems to be a leftover. Shall we remove?
```
./python/pyspark/sql/tests/test_udf.py:25:1: F401 'py4j.protocol.Py4JJavaError' imported but unused
from py4j.protocol import Py4JJavaError
^
1     F401 'py4j.protocol.Py4JJavaError' imported but unused
```
Re: [PR] [SPARK-45718][PS] Remove remaining deprecated Pandas features from Spark 3.4.0 [spark]
dongjoon-hyun commented on PR #43581: URL: https://github.com/apache/spark/pull/43581#issuecomment-1791556755 Right. Finally, it's fixed and passed. Merged to master. Thank you, @itholic and @HyukjinKwon.
Re: [PR] [SPARK-45718][PS] Remove remaining deprecated Pandas features from Spark 3.4.0 [spark]
dongjoon-hyun closed pull request #43581: [SPARK-45718][PS] Remove remaining deprecated Pandas features from Spark 3.4.0 URL: https://github.com/apache/spark/pull/43581
Re: [PR] [SPARK-41454][PYTHON] Support Python 3.11 [spark]
dongjoon-hyun commented on PR #38987: URL: https://github.com/apache/spark/pull/38987#issuecomment-1791554878 > I've created a task for reverting that change for versions below 3.4. I will notify you in this thread with further information. Thank you for letting us know, @mdhont. That's great news and will be helpful. BTW, just FYI, Apache Spark 3.3 will reach End-Of-Support next month (2023-12-15). The Apache Spark community is currently focused on Apache Spark 3.4.2, 3.5.1, and 4.0.0 (next year).
Re: [PR] [SPARK-45511][SS] State Data Source - Reader [spark]
rangadi commented on code in PR #43425: URL: https://github.com/apache/spark/pull/43425#discussion_r1380760045 ## sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/state/StatePartitionReader.scala: ## @@ -0,0 +1,89 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package org.apache.spark.sql.execution.datasources.v2.state + +import org.apache.spark.sql.catalyst.InternalRow +import org.apache.spark.sql.catalyst.expressions.{GenericInternalRow, JoinedRow, UnsafeRow} +import org.apache.spark.sql.connector.read.PartitionReader +import org.apache.spark.sql.execution.datasources.v2.state.utils.SchemaUtil +import org.apache.spark.sql.execution.streaming.state.{StateStore, StateStoreConf, StateStoreId, StateStoreProviderId} +import org.apache.spark.sql.types.StructType +import org.apache.spark.util.SerializableConfiguration + +class StatePartitionReader( Review Comment: There is a comment, but it does not give any details about what it does. We could have some information useful for future readers of this code (and current reviewers :)).
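To make the ask concrete, a hypothetical sketch of the kind of class-level documentation being requested (wording invented here, not taken from the PR):

```scala
/**
 * Hypothetical illustration only; not the PR's actual comment.
 *
 * Reads a single partition of a streaming operator's state store for the
 * state data source: given the checkpoint location, operator id, store name,
 * and partition id, load the matching state store, iterate its key-value
 * pairs, and emit each pair as one (key, value) row of the scan.
 */
trait StatePartitionReaderDocSketch
```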
Re: [PR] [SPARK-45511][SS] State Data Source - Reader [spark]
anishshri-db commented on code in PR #43425: URL: https://github.com/apache/spark/pull/43425#discussion_r1380729958 ## sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/state/StateDataSource.scala: ## @@ -0,0 +1,216 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package org.apache.spark.sql.execution.datasources.v2.state + +import java.util +import java.util.UUID + +import scala.util.control.NonFatal + +import org.apache.hadoop.conf.Configuration +import org.apache.hadoop.fs.Path + +import org.apache.spark.sql.{RuntimeConfig, SparkSession} +import org.apache.spark.sql.connector.catalog.{Table, TableProvider} +import org.apache.spark.sql.connector.expressions.Transform +import org.apache.spark.sql.execution.datasources.v2.state.StateDataSource.JoinSideValues.JoinSideValues +import org.apache.spark.sql.execution.streaming.{CommitLog, OffsetSeqLog, OffsetSeqMetadata} +import org.apache.spark.sql.execution.streaming.StreamingSymmetricHashJoinHelper.{LeftSide, RightSide} +import org.apache.spark.sql.execution.streaming.state.{StateSchemaCompatibilityChecker, StateStore, StateStoreConf, StateStoreId, StateStoreProviderId} +import org.apache.spark.sql.sources.DataSourceRegister +import org.apache.spark.sql.types.StructType +import org.apache.spark.sql.util.CaseInsensitiveStringMap + +/** + * An implementation of [[TableProvider]] with [[DataSourceRegister]] for State Store data source. + */ +class StateDataSource extends TableProvider with DataSourceRegister { + import StateDataSource._ + + private lazy val session: SparkSession = SparkSession.active + + private lazy val hadoopConf: Configuration = session.sessionState.newHadoopConf() + + override def shortName(): String = "statestore" + + override def getTable( + schema: StructType, + partitioning: Array[Transform], + properties: util.Map[String, String]): Table = { +val sourceOptions = StateSourceOptions.apply(session, hadoopConf, properties) +val stateConf = buildStateStoreConf(sourceOptions.resolvedCpLocation, sourceOptions.batchId) +new StateTable(session, schema, sourceOptions, stateConf) + } + + override def inferSchema(options: CaseInsensitiveStringMap): StructType = { +val partitionId = StateStore.PARTITION_ID_TO_CHECK_SCHEMA +val sourceOptions = StateSourceOptions.apply(session, hadoopConf, options) +if (sourceOptions.joinSide != JoinSideValues.none && +sourceOptions.storeName != StateStoreId.DEFAULT_STORE_NAME) { + throw new IllegalArgumentException(s"The options '$PARAM_JOIN_SIDE' and " + +s"'$PARAM_STORE_NAME' cannot be specified together. 
Please specify either one.") +} + +val stateCheckpointLocation = sourceOptions.stateCheckpointLocation + +try { + val (keySchema, valueSchema) = sourceOptions.joinSide match { +case JoinSideValues.left => + StreamStreamJoinStateHelper.readKeyValueSchema(session, stateCheckpointLocation.toString, +sourceOptions.operatorId, LeftSide) + +case JoinSideValues.right => + StreamStreamJoinStateHelper.readKeyValueSchema(session, stateCheckpointLocation.toString, +sourceOptions.operatorId, RightSide) + +case JoinSideValues.none => + val storeId = new StateStoreId(stateCheckpointLocation.toString, sourceOptions.operatorId, +partitionId, sourceOptions.storeName) + val providerId = new StateStoreProviderId(storeId, UUID.randomUUID()) + val manager = new StateSchemaCompatibilityChecker(providerId, hadoopConf) + manager.readSchemaFile() + } + + new StructType() +.add("key", keySchema) +.add("value", valueSchema) +} catch { + case NonFatal(e) => +throw new IllegalArgumentException("Fail to read the state schema. Either the file " + Review Comment: Nit: `Failed to read the` ## sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/v2/state/StateDataSourceReadSuite.scala: ## @@ -0,0 +1,779 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + *
Re: [PR] [SPARK-45771][CORE] Enable `spark.eventLog.rolling.enabled` by default [spark]
dongjoon-hyun closed pull request #43638: [SPARK-45771][CORE] Enable `spark.eventLog.rolling.enabled` by default URL: https://github.com/apache/spark/pull/43638
Re: [PR] [SPARK-45771][CORE] Enable `spark.eventLog.rolling.enabled` by default [spark]
viirya commented on PR #43638: URL: https://github.com/apache/spark/pull/43638#issuecomment-1791474994 Oh, got it, the existing one looks good. I couldn't see it from the diff, so I thought `SingleEventLogFileWriter` wasn't tested.
Re: [PR] [SPARK-45771][CORE] Enable `spark.eventLog.rolling.enabled` by default [spark]
dongjoon-hyun commented on PR #43638: URL: https://github.com/apache/spark/pull/43638#issuecomment-1791475462 Thank you for your confirmation! Merged to master for Apache Spark 4.0.0.
Re: [PR] [SPARK-45771][CORE] Enable `spark.eventLog.rolling.enabled` by default [spark]
dongjoon-hyun commented on PR #43638: URL: https://github.com/apache/spark/pull/43638#issuecomment-1791468911 Just for your confirmation. I keep the existing test structure.
```
buildWriterAndVerify(conf, classOf[RollingEventLogFilesWriter])
conf.set(EVENT_LOG_ENABLE_ROLLING, true)
buildWriterAndVerify(conf, classOf[RollingEventLogFilesWriter])
conf.set(EVENT_LOG_ENABLE_ROLLING, false)
buildWriterAndVerify(conf, classOf[SingleEventLogFileWriter])
```
Re: [PR] [SPARK-45771][CORE] Enable `spark.eventLog.rolling.enabled` by default [spark]
dongjoon-hyun commented on code in PR #43638: URL: https://github.com/apache/spark/pull/43638#discussion_r1380713182

## core/src/test/scala/org/apache/spark/scheduler/EventLoggingListenerSuite.scala:

@@ -163,6 +163,7 @@ class EventLoggingListenerSuite extends SparkFunSuite with LocalSparkContext wit
   test("SPARK-31764: isBarrier should be logged in event log") {
     val conf = new SparkConf()
     conf.set(EVENT_LOG_ENABLED, true)
+    conf.set(EVENT_LOG_ENABLE_ROLLING, false)

Review Comment: Yes, this test case tries to read the event log file.
Re: [PR] [SPARK-45771][CORE] Enable `spark.eventLog.rolling.enabled` by default [spark]
viirya commented on code in PR #43638: URL: https://github.com/apache/spark/pull/43638#discussion_r1380704945

## core/src/test/scala/org/apache/spark/deploy/history/EventLogFileWritersSuite.scala:

@@ -66,7 +66,7 @@ abstract class EventLogFileWritersSuite extends SparkFunSuite with LocalSparkCon
     conf.set(EVENT_LOG_DIR, testDir.toString)

     // default config
-    buildWriterAndVerify(conf, classOf[SingleEventLogFileWriter])
+    buildWriterAndVerify(conf, classOf[RollingEventLogFilesWriter])
     conf.set(EVENT_LOG_ENABLE_ROLLING, true)

Review Comment: Or do we want to:
```suggestion
    conf.set(EVENT_LOG_ENABLE_ROLLING, false)
    buildWriterAndVerify(conf, classOf[SingleEventLogFileWriter])
```
Re: [PR] [SPARK-45771][CORE] Enable `spark.eventLog.rolling.enabled` by default [spark]
viirya commented on code in PR #43638: URL: https://github.com/apache/spark/pull/43638#discussion_r1380705437

## core/src/test/scala/org/apache/spark/scheduler/EventLoggingListenerSuite.scala:

@@ -163,6 +163,7 @@ class EventLoggingListenerSuite extends SparkFunSuite with LocalSparkContext wit
   test("SPARK-31764: isBarrier should be logged in event log") {
     val conf = new SparkConf()
     conf.set(EVENT_LOG_ENABLED, true)
+    conf.set(EVENT_LOG_ENABLE_ROLLING, false)

Review Comment: Does it fail without setting this to false?
Re: [PR] [SPARK-45771][CORE] Enable `spark.eventLog.rolling.enabled` by default [spark]
viirya commented on code in PR #43638: URL: https://github.com/apache/spark/pull/43638#discussion_r1380704238

## core/src/test/scala/org/apache/spark/deploy/history/EventLogFileWritersSuite.scala:

@@ -66,7 +66,7 @@ abstract class EventLogFileWritersSuite extends SparkFunSuite with LocalSparkCon
     conf.set(EVENT_LOG_DIR, testDir.toString)

     // default config
-    buildWriterAndVerify(conf, classOf[SingleEventLogFileWriter])
+    buildWriterAndVerify(conf, classOf[RollingEventLogFilesWriter])
     conf.set(EVENT_LOG_ENABLE_ROLLING, true)
     buildWriterAndVerify(conf, classOf[RollingEventLogFilesWriter])

Review Comment:
```suggestion
```

## core/src/test/scala/org/apache/spark/deploy/history/EventLogFileWritersSuite.scala:

@@ -66,7 +66,7 @@ abstract class EventLogFileWritersSuite extends SparkFunSuite with LocalSparkCon
     conf.set(EVENT_LOG_DIR, testDir.toString)

     // default config
-    buildWriterAndVerify(conf, classOf[SingleEventLogFileWriter])
+    buildWriterAndVerify(conf, classOf[RollingEventLogFilesWriter])
     conf.set(EVENT_LOG_ENABLE_ROLLING, true)

Review Comment: Is it redundant then?
```suggestion
```
Re: [PR] [SPARK-45771][CORE] Enable `spark.eventLog.rolling.enabled` by default [spark]
viirya commented on code in PR #43638: URL: https://github.com/apache/spark/pull/43638#discussion_r1380703869

## core/src/test/scala/org/apache/spark/deploy/history/EventLogFileWritersSuite.scala:

@@ -66,7 +66,7 @@ abstract class EventLogFileWritersSuite extends SparkFunSuite with LocalSparkCon
     conf.set(EVENT_LOG_DIR, testDir.toString)

     // default config
-    buildWriterAndVerify(conf, classOf[SingleEventLogFileWriter])
+    buildWriterAndVerify(conf, classOf[RollingEventLogFilesWriter])
     conf.set(EVENT_LOG_ENABLE_ROLLING, true)

Review Comment: Is it redundant then?
```suggestion
    conf.set(EVENT_LOG_ENABLE_ROLLING, true)
```
Re: [PR] [SPARK-45771][CORE] Enable `spark.eventLog.rolling.enabled` by default [spark]
dongjoon-hyun commented on PR #43638: URL: https://github.com/apache/spark/pull/43638#issuecomment-1791433846 The AppVeyor failure (SparkR) is unrelated to this PR. Could you review this PR when you have some time, @viirya?
Re: [PR] [SPARK-45733][CONNECT][PYTHON] Support multiple retry policies [spark]
juliuszsompolski commented on code in PR #43591: URL: https://github.com/apache/spark/pull/43591#discussion_r1380491812 ## python/pyspark/sql/connect/client/core.py: ## @@ -550,13 +545,11 @@ def fromProto(cls, pb: pb2.ConfigResponse) -> "ConfigResult": ) -class SparkConnectClient(object): -""" -Conceptually the remote spark session that communicates with the server -""" +class DefaultPolicy(RetryPolicy): Review Comment: nit: DefaultRetryPolicy ## python/pyspark/sql/connect/client/core.py: ## @@ -688,6 +695,14 @@ def enable_reattachable_execute(self) -> "SparkConnectClient": self._use_reattachable_execute = True return self +def register_retry_policy(self, policy: RetryPolicy): +if policy.name in self._known_retry_policies: +raise ValueError("Already known policy") +self._known_retry_policies[policy.name] = policy + +def set_retry_policies(self, policies: List[str]): +self._retry_policies = [self._known_retry_policies[name] for name in policies] Review Comment: do we need the functionality of registering policies that we are not going to be using? is there a use case for adding and removing policies? unless there's a good reason to have a separate `register` and `set`, maybe simplify it and just set is enough? ## python/pyspark/sql/connect/client/retries.py: ## @@ -0,0 +1,228 @@ +# +# Licensed to the Apache Software Foundation (ASF) under one or more +# contributor license agreements. See the NOTICE file distributed with +# this work for additional information regarding copyright ownership. +# The ASF licenses this file to You under the Apache License, Version 2.0 +# (the "License"); you may not use this file except in compliance with +# the License. You may obtain a copy of the License at +# +#http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +# + +import random +import time +import typing +from typing import Optional, Callable, Generator, List, Type +from types import TracebackType +from pyspark.sql.connect.client.logging import logger + +""" +This module contains retry system. The system is designed to be +significantly customizable. + +A key aspect of retries is RetryPolicy class, describing a single policy. +There can be more than one policy defined at the same time. Each policy +determines which error types it can retry and how exactly. + +For instance, networking errors should likely be retried differently that +remote resource being available. + +Given a sequence of policies, retry logic applies all of them in sequential +order, keeping track of different policies budgets. +""" + + +class RetryPolicy: +""" +Describes key aspects of RetryPolicy. + +It's advised that different policies are implemented as different subclasses. 
+""" + +def __init__( +self, +max_retries: Optional[int] = None, +initial_backoff: int = 1000, +max_backoff: Optional[int] = None, +backoff_multiplier: float = 1.0, +jitter: int = 0, +min_jitter_threshold: int = 0, +): +self.max_retries = max_retries +self.initial_backoff = initial_backoff +self.max_backoff = max_backoff +self.backoff_multiplier = backoff_multiplier +self.jitter = jitter +self.min_jitter_threshold = min_jitter_threshold + +@property +def name(self): +return self.__class__.__name__ + +def can_retry(self, exception: BaseException): +return False + +def to_state(self) -> "RetryPolicyState": +return RetryPolicyState(self) + + +class RetryPolicyState: +""" +This class represents stateful part of the specific policy. +""" + +def __init__(self, policy: RetryPolicy): +self._policy = policy + +# Will allow attempts [0, self._policy.max_retries) +self._attempt = 0 +self._next_wait: float = self._policy.initial_backoff + +@property +def policy(self): +return self._policy + +@property +def name(self): +return self.policy.name + +def can_retry(self, exception: BaseException): +return self.policy.can_retry(exception) + +def next_attempt(self) -> Optional[int]: +""" +Returns +--- +Randomized time (in milliseconds) to wait until this attempt +or None if this policy doesn't allow more retries. +""" + +if self.policy.max_retries is not None and self._attempt >= self.policy.max_retries: +# No more retries under this policy +return None + +
Re: [PR] [SPARK-45768][SQL][PYTHON] Make faulthandler a runtime configuration for Python execution in SQL [spark]
ueshin commented on code in PR #43635: URL: https://github.com/apache/spark/pull/43635#discussion_r1380579484

## python/pyspark/sql/tests/test_udf.py:

@@ -1020,6 +1022,15 @@ def test_udf(a):
         with self.assertRaisesRegex(PythonException, "StopIteration"):
             self.spark.range(10).select(test_udf(col("id"))).show()

+    def test_python_udf_segfault(self):
+        with self.sql_conf({"spark.sql.execution.pyspark.udf.faulthandler.enabled": True}):
+            try:
+                import ctypes
+
+                self.spark.range(1).select(udf(lambda x: ctypes.string_at(0))("id")).collect()
+            except Exception as e:
+                self.assertRegex(str(e), "Segmentation fault")

Review Comment: nit: wondering if we can use `with self.assertRaisesRegex( ... )` here?
Re: [PR] [SPARK-45771][CORE] Enable `spark.eventLog.rolling.enabled` by default [spark]
dongjoon-hyun commented on code in PR #43638: URL: https://github.com/apache/spark/pull/43638#discussion_r1380456215

## core/src/test/scala/org/apache/spark/deploy/history/EventLogTestHelper.scala:

@@ -38,6 +38,7 @@ object EventLogTestHelper {
   def getLoggingConf(logDir: Path, compressionCodec: Option[String] = None): SparkConf = {
     val conf = new SparkConf
     conf.set(EVENT_LOG_ENABLED, true)
+    conf.set(EVENT_LOG_ENABLE_ROLLING, false)

Review Comment: This is consistent with the existing function description. https://github.com/apache/spark/blob/5970d353360d4fb6647c8fbc10f733abf009eca1/core/src/test/scala/org/apache/spark/deploy/history/EventLogTestHelper.scala#L35
Re: [PR] [SPARK-45688][SPARK-45693][CORE] Clean up the deprecated API usage related to `MapOps` & Fix `method += in trait Growable is deprecated` [spark]
LuciferYang commented on PR #43578: URL: https://github.com/apache/spark/pull/43578#issuecomment-1791089211 let's rebase this one @panbingkun
Re: [PR] [SPARK-45694][SPARK-45695][SQL] Clean up deprecated API usage View.force and ScalaNumberProxy.signum [spark]
ivoson commented on code in PR #43637: URL: https://github.com/apache/spark/pull/43637#discussion_r1380412775

## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/trees/TreeNode.scala:

@@ -362,10 +362,8 @@ abstract class TreeNode[BaseType <: TreeNode[BaseType]]
     case s: Seq[_] => s.map(mapChild)
     case m: Map[_, _] =>
-      // `map.mapValues().view.force` return `Map` in Scala 2.12 but return `IndexedSeq` in Scala
-      // 2.13, call `toMap` method manually to compatible with Scala 2.12 and Scala 2.13
-      // `mapValues` is lazy and we need to force it to materialize
-      m.view.mapValues(mapChild).view.force.toMap
+      // `mapValues` is lazy and we need to force it to materialize by invoking `toIndexedSeq`
+      m.view.mapValues(mapChild).view.toIndexedSeq.toMap

Review Comment: Yeah, it looks better. Thanks, done.

## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/trees/TreeNode.scala:

@@ -784,13 +782,12 @@ abstract class TreeNode[BaseType <: TreeNode[BaseType]]
       arg.asInstanceOf[BaseType].clone()
     case Some(arg: TreeNode[_]) if containsChild(arg) =>
       Some(arg.asInstanceOf[BaseType].clone())
-    // `map.mapValues().view.force` return `Map` in Scala 2.12 but return `IndexedSeq` in Scala
-    // 2.13, call `toMap` method manually to compatible with Scala 2.12 and Scala 2.13
+    // `mapValues` is lazy and we need to force it to materialize by invoking `toIndexedSeq`
     case m: Map[_, _] => m.view.mapValues {
       case arg: TreeNode[_] if containsChild(arg) => arg.asInstanceOf[BaseType].clone()
       case other => other
-    }.view.force.toMap // `mapValues` is lazy and we need to force it to materialize
+    }.view.toIndexedSeq.toMap

Review Comment: done
Re: [PR] [SPARK-45758][SQL] Introduce a mapper for hadoop compression codecs [spark]
pan3793 commented on code in PR #43620: URL: https://github.com/apache/spark/pull/43620#discussion_r1380365514

## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/CompressionCodecs.scala:

@@ -21,19 +21,15 @@ import java.util.Locale

 import org.apache.hadoop.conf.Configuration
 import org.apache.hadoop.io.SequenceFile.CompressionType
-import org.apache.hadoop.io.compress._

 import org.apache.spark.util.Utils

 object CompressionCodecs {
-  private val shortCompressionCodecNames = Map(
-    "none" -> null,
-    "uncompressed" -> null,
-    "bzip2" -> classOf[BZip2Codec].getName,
-    "deflate" -> classOf[DeflateCodec].getName,
-    "gzip" -> classOf[GzipCodec].getName,
-    "lz4" -> classOf[Lz4Codec].getName,
-    "snappy" -> classOf[SnappyCodec].getName)
+  private val shortCompressionCodecNames = HadoopCompressionCodec.values().map { codec =>
+    val className =
+      if (codec.getCompressionCodec == null) null else codec.getCompressionCodec.getClass.getName
+    codec.lowerCaseName() -> className

Review Comment:
```suggestion
    val className =
      if (codec.getCompressionCodec == null) null else codec.getCompressionCodec.getClass.getName
    codec.lowerCaseName() -> Option(codec.getCompressionCodec).map(_.getClass.getName).orNull
```
Re: [PR] [SPARK-45758][SQL] Introduce a mapper for hadoop compression codecs [spark]
pan3793 commented on code in PR #43620: URL: https://github.com/apache/spark/pull/43620#discussion_r1380368351

## sql/core/src/test/scala/org/apache/spark/sql/DataFrameSuite.scala:

@@ -2766,7 +2767,8 @@ class DataFrameSuite extends QueryTest
       // The data set has 2 partitions, so Spark will write at least 2 json files.
       // Use a non-splittable compression (gzip), to make sure the json scan RDD has at least 2
       // partitions.
-      .write.partitionBy("p").option("compression", "gzip").json(path.getCanonicalPath)
+      .write.partitionBy("p")
+      .option("compression", GZIP.lowerCaseName()).json(path.getCanonicalPath)

Review Comment: It's recommended to omit the `()` when calling a Java no-arg method that has no side effects.
```suggestion
      .option("compression", GZIP.lowerCaseName).json(path.getCanonicalPath)
```
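A tiny sketch of the convention being cited:

```scala
// Scala style: omit `()` for zero-arg calls with no side effects,
// keep `()` for calls that mutate state.
object ParensConventionDemo extends App {
  val name = "spark"
  val len = name.length          // pure accessor (a Java no-arg method): no parens

  val sb = new StringBuilder("x")
  sb.clear()                     // mutating call: keep the parens
  assert(len == 5 && sb.isEmpty)
}
```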
Re: [PR] [SPARK-45762][CORE] Support shuffle managers defined in user jars by changing startup order [spark]
abellina commented on PR #43627: URL: https://github.com/apache/spark/pull/43627#issuecomment-1791035949 @tgravescs thanks for the review. I have handled your comments in this commit: https://github.com/apache/spark/pull/43627/commits/0bd7e990812d23166509ad6585c8d352f78e569f
[PR] [SPARK-45771][CORE] Enable spark.eventLog.rolling.enabled by default [spark]
dongjoon-hyun opened a new pull request, #43638: URL: https://github.com/apache/spark/pull/43638

### What changes were proposed in this pull request?

### Why are the changes needed?

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

### Was this patch authored or co-authored using generative AI tooling?
Re: [PR] [SPARK-45767][CORE] Delete `TimeStampedHashMap` and its UT [spark]
LuciferYang commented on PR #43633: URL: https://github.com/apache/spark/pull/43633#issuecomment-1790910694 Merged into master for Spark 4.0. Thanks @panbingkun @dongjoon-hyun and @beliefer
Re: [PR] [SPARK-45758][SQL] Introduce a mapper for hadoop compression codecs [spark]
pan3793 commented on code in PR #43620: URL: https://github.com/apache/spark/pull/43620#discussion_r1380365514

## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/CompressionCodecs.scala:

@@ -21,19 +21,15 @@ import java.util.Locale

 import org.apache.hadoop.conf.Configuration
 import org.apache.hadoop.io.SequenceFile.CompressionType
-import org.apache.hadoop.io.compress._

 import org.apache.spark.util.Utils

 object CompressionCodecs {
-  private val shortCompressionCodecNames = Map(
-    "none" -> null,
-    "uncompressed" -> null,
-    "bzip2" -> classOf[BZip2Codec].getName,
-    "deflate" -> classOf[DeflateCodec].getName,
-    "gzip" -> classOf[GzipCodec].getName,
-    "lz4" -> classOf[Lz4Codec].getName,
-    "snappy" -> classOf[SnappyCodec].getName)
+  private val shortCompressionCodecNames = HadoopCompressionCodec.values().map { codec =>
+    val className =
+      if (codec.getCompressionCodec == null) null else codec.getCompressionCodec.getClass.getName
+    codec.lowerCaseName() -> className

Review Comment:
```suggestion
    codec.lowerCaseName() -> Option(codec.getCompressionCodec).map(_.getClass.getName).orNull
```
Re: [PR] [SPARK-45730][CORE] Make ReloadingX509TrustManagerSuite less flaky [spark]
hasnain-db commented on code in PR #43596: URL: https://github.com/apache/spark/pull/43596#discussion_r1380349554

## common/network-common/src/test/java/org/apache/spark/network/ssl/ReloadingX509TrustManagerSuite.java:

@@ -161,14 +161,17 @@ public void testReload() throws Exception {
   // At this point we haven't reloaded, just the initial load
   assertEquals(0, tm.reloadCount);

+  // Wait so that the file modification time is different
+  Thread.sleep((tm.getReloadInterval() + 200));

Review Comment: Thanks, fixing. Also, for future readers: the reason this is needed is that in certain scenarios this test can run too fast, so the modification time of the original trust store and the one we create right after this is the same, in which case we do not reload the file, and the test fails.

## common/network-common/src/test/java/org/apache/spark/network/ssl/ReloadingX509TrustManagerSuite.java:

@@ -161,14 +161,17 @@ public void testReload() throws Exception {
   // At this point we haven't reloaded, just the initial load
   assertEquals(0, tm.reloadCount);

+  // Wait so that the file modification time is different
+  Thread.sleep((tm.getReloadInterval() + 200));
+
   // Add another cert
   Map certs = new HashMap();
   certs.put("cert1", cert1);
   certs.put("cert2", cert2);
   createTrustStore(trustStore, "password", certs);

-  // Wait up to 5s until we reload
-  waitForReloadCount(tm, 1, 50);
+  // Wait up to 10s until we reload
+  waitForReloadCount(tm, 1, 100);

Review Comment: Yeah, on a heavily loaded system (~40 load average on ~40 cores) this can take longer than 5s.
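A minimal sketch of the underlying race, assuming (as described above) that the reloader only re-reads the store when the file's modification time changes:

```scala
import java.nio.file.{Files, Path}

object MtimeGranularityDemo extends App {
  val f: Path = Files.createTempFile("truststore", ".jks")
  Files.write(f, Array[Byte](1))
  val t1 = Files.getLastModifiedTime(f)

  Files.write(f, Array[Byte](1, 2)) // rewritten almost immediately afterwards
  val t2 = Files.getLastModifiedTime(f)

  // On filesystems with coarse mtime granularity, t1 == t2 is possible here,
  // and an mtime-keyed reloader would silently keep the stale contents.
  println(s"first=$t1 second=$t2 equal=${t1 == t2}")
  Files.deleteIfExists(f)
}
```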
Re: [PR] [SPARK-33393][SQL] Support SHOW TABLE EXTENDED in v2 [spark]
cloud-fan commented on code in PR #37588: URL: https://github.com/apache/spark/pull/37588#discussion_r1380287495 ## sql/core/src/test/scala/org/apache/spark/sql/execution/command/ShowTablesSuiteBase.scala: ## @@ -126,4 +158,258 @@ trait ShowTablesSuiteBase extends QueryTest with DDLCommandTestUtils { } } } + + test("show table in a not existing namespace") { +checkError( + exception = intercept[AnalysisException] { +sql(s"SHOW TABLES IN $catalog.nonexist") + }, + errorClass = "SCHEMA_NOT_FOUND", + parameters = Map("schemaName" -> "`nonexist`")) + } + + test("show table extended in a not existing namespace") { +checkError( + exception = intercept[AnalysisException] { +sql(s"SHOW TABLE EXTENDED IN $catalog.nonexist LIKE '*tbl*'") + }, + errorClass = "SCHEMA_NOT_FOUND", + parameters = Map("schemaName" -> "`nonexist`")) + } + + test("show table extended in a not existing table") { +val namespace = "ns1" +val table = "nonexist" +withNamespaceAndTable(namespace, table, catalog) { _ => + val result = sql(s"SHOW TABLE EXTENDED IN $catalog.$namespace LIKE '*$table*'") + assert(result.schema.fieldNames === +Seq("namespace", "tableName", "isTemporary", "information")) + assert(result.collect().isEmpty) +} + } + + test("show table extended in a not existing partition") { +val namespace = "ns1" +val table = "tbl" +withNamespaceAndTable(namespace, table, catalog) { tbl => + sql(s"CREATE TABLE $tbl (id bigint, data string) $defaultUsing PARTITIONED BY (id)") + sql(s"ALTER TABLE $tbl ADD PARTITION (id = 1)") + checkError( +exception = intercept[AnalysisException] { + sql(s"SHOW TABLE EXTENDED IN $catalog.$namespace LIKE '$table' PARTITION(id = 2)") +}, +errorClass = "PARTITIONS_NOT_FOUND", +parameters = Map( + "partitionList" -> "PARTITION (`id` = 2)", + "tableName" -> "`ns1`.`tbl`" +) + ) +} + } + + test("show table extended in non-partitioned table") { +val namespace = "ns1" +val table = "tbl" +withNamespaceAndTable(namespace, table, catalog) { tbl => + sql(s"CREATE TABLE $tbl (id bigint, data string) $defaultUsing") + val e = intercept[AnalysisException] { +sql(s"SHOW TABLE EXTENDED IN $catalog.$namespace LIKE '$table' PARTITION(id = 1)") + } + val (errorClass, parameters) = extendedPartInNonPartedTableError(catalog, namespace, table) + checkError(exception = e, errorClass = errorClass, parameters = parameters) +} + } + + test("show table extended in multi partition key - " + +"the command's partition parameters are complete") { +val namespace = "ns1" +val table = "tbl" +withNamespaceAndTable(namespace, table, catalog) { tbl => + sql(s"CREATE TABLE $tbl (id1 bigint, id2 bigint, data string) " + +s"$defaultUsing PARTITIONED BY (id1, id2)") + sql(s"ALTER TABLE $tbl ADD PARTITION (id1 = 1, id2 = 2)") + + val result = sql(s"SHOW TABLE EXTENDED FROM $catalog.$namespace " + +s"LIKE '$table' PARTITION(id1 = 1, id2 = 2)") + assert(result.schema.fieldNames === +Seq("namespace", "tableName", "isTemporary", "information")) + val resultCollect = result.collect() + assert(resultCollect(0).length == 4) + assert(resultCollect(0)(0) === namespace) + assert(resultCollect(0)(1) === table) + assert(resultCollect(0)(2) === false) + val actualResult = replace(resultCollect(0)(3).toString) + assert(actualResult === extendedPartExpectedResult) +} + } + + test("show table extended in multi partition key - " + +"the command's partition parameters are incomplete") { +val namespace = "ns1" +val table = "tbl" +withNamespaceAndTable(namespace, table, catalog) { tbl => + sql(s"CREATE TABLE $tbl (id1 bigint, id2 bigint, data string) 
" + +s"$defaultUsing PARTITIONED BY (id1, id2)") + sql(s"ALTER TABLE $tbl ADD PARTITION (id1 = 1, id2 = 2)") + + checkError( +exception = intercept[AnalysisException] { + sql(s"SHOW TABLE EXTENDED IN $catalog.$namespace " + +s"LIKE '$table' PARTITION(id1 = 1)") +}, +errorClass = "_LEGACY_ERROR_TEMP_1232", +parameters = Map( + "specKeys" -> "id1", + "partitionColumnNames" -> "id1, id2", + "tableName" -> s"`$catalog`.`$namespace`.`$table`") + ) +} + } + + test("show table extended in multi tables") { +val namespace = "ns1" +val table = "tbl" +withNamespaceAndTable(namespace, table, catalog) { _ => + sql(s"CREATE TABLE $catalog.$namespace.$table (id bigint, data string) " + +s"$defaultUsing PARTITIONED BY (id)") + val table1 = "tbl1" + val table2 = "tbl2" + withTable(table1, table2) { +sql(s"CREATE TABLE $catalog.$namespace.$table1 (id1 bigint,
Re: [PR] [SPARK-45614][SQL] Assign names to error _LEGACY_ERROR_TEMP_215[6,7,8] [spark]
cloud-fan commented on code in PR #43481: URL: https://github.com/apache/spark/pull/43481#discussion_r1380320255

## sql/core/src/test/scala/org/apache/spark/sql/errors/QueryExecutionErrorsSuite.scala:

@@ -36,6 +36,7 @@
 import org.apache.spark.sql.catalyst.expressions.codegen.CodegenContext
 import org.apache.spark.sql.catalyst.expressions.objects.InitializeJavaBean
 import org.apache.spark.sql.catalyst.rules.RuleIdCollection
 import org.apache.spark.sql.catalyst.util.BadRecordException
+import org.apache.spark.sql.errors.DataTypeErrorsBase

Review Comment: nit: We are already in the package `org.apache.spark.sql.errors` and this import is useless.
Re: [PR] [SPARK-33393][SQL] Support SHOW TABLE EXTENDED in v2 [spark]
cloud-fan commented on code in PR #37588: URL: https://github.com/apache/spark/pull/37588#discussion_r1380274949

## sql/core/src/test/scala/org/apache/spark/sql/execution/command/ShowTablesSuiteBase.scala:

@@ -40,6 +40,38 @@ trait ShowTablesSuiteBase extends QueryTest with DDLCommandTestUtils {
     checkAnswer(df, expected)
   }

+  // the error class & error parameters of
+  // `SHOW TABLE EXTENDED ... PARTITION ... in non-partitioned table`
+  protected def extendedPartInNonPartedTableError(
+      catalog: String,
+      namespace: String,
+      table: String): (String, Map[String, String])
+
+  protected def extendedPartExpectedResult: String =
+    "Partition Values: [id1=1, id2=2]"
+
+  protected def namespaceKey: String = "Database"
+
+  protected def extendedTableExpectedResultDiff: String
+
+  private def extendedTableExpectedResult(
+      catalog: String,
+      namespaceName: String,

Review Comment: does this need to be a parameter? isn't it always the value of `def namespaceKey`?
Re: [PR] [SPARK-45758][SQL] Introduce a mapper for hadoop compression codecs [spark]
LuciferYang commented on code in PR #43620: URL: https://github.com/apache/spark/pull/43620#discussion_r1380243958 ## sql/catalyst/src/main/java/org/apache/spark/sql/catalyst/util/HadoopCompressionCodec.java: ## @@ -0,0 +1,63 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.sql.catalyst.util; + +import java.util.Arrays; +import java.util.Locale; +import java.util.Map; +import java.util.stream.Collectors; + +import org.apache.hadoop.io.compress.BZip2Codec; +import org.apache.hadoop.io.compress.CompressionCodec; +import org.apache.hadoop.io.compress.DeflateCodec; +import org.apache.hadoop.io.compress.GzipCodec; +import org.apache.hadoop.io.compress.Lz4Codec; +import org.apache.hadoop.io.compress.SnappyCodec; + +/** + * A mapper class from Spark supported hadoop compression codecs to hadoop compression codecs. + */ +public enum HadoopCompressionCodec { + NONE(null), + UNCOMPRESSED(null), + BZIP2(new BZip2Codec()), + DEFLATE(new DeflateCodec()), + GZIP(new GzipCodec()), + LZ4(new Lz4Codec()), + SNAPPY(new SnappyCodec()); + + // TODO supports ZStandardCodec + + private final CompressionCodec compressionCodec; + + HadoopCompressionCodec(CompressionCodec compressionCodec) { +this.compressionCodec = compressionCodec; + } + + public CompressionCodec getCompressionCodec() { +return this.compressionCodec; + } + + private static final Map codecNameMap = +Arrays.stream(HadoopCompressionCodec.values()).collect( + Collectors.toMap(codec -> codec.name(), codec -> codec.name().toLowerCase(Locale.ROOT))); Review Comment: ```suggestion Collectors.toMap(Enum::name, codec -> codec.name().toLowerCase(Locale.ROOT))); ``` ## sql/catalyst/src/main/java/org/apache/spark/sql/catalyst/util/HadoopCompressionCodec.java: ## @@ -0,0 +1,63 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */ + +package org.apache.spark.sql.catalyst.util; + +import java.util.Arrays; +import java.util.Locale; +import java.util.Map; +import java.util.stream.Collectors; + +import org.apache.hadoop.io.compress.BZip2Codec; +import org.apache.hadoop.io.compress.CompressionCodec; +import org.apache.hadoop.io.compress.DeflateCodec; +import org.apache.hadoop.io.compress.GzipCodec; +import org.apache.hadoop.io.compress.Lz4Codec; +import org.apache.hadoop.io.compress.SnappyCodec; + +/** + * A mapper class from Spark supported hadoop compression codecs to hadoop compression codecs. + */ +public enum HadoopCompressionCodec { + NONE(null), + UNCOMPRESSED(null), + BZIP2(new BZip2Codec()), + DEFLATE(new DeflateCodec()), + GZIP(new GzipCodec()), + LZ4(new Lz4Codec()), + SNAPPY(new SnappyCodec()); + + // TODO supports ZStandardCodec + + private final CompressionCodec compressionCodec; + + HadoopCompressionCodec(CompressionCodec compressionCodec) { +this.compressionCodec = compressionCodec; + } + + public CompressionCodec getCompressionCodec() { +return this.compressionCodec; + } + + private static final Map codecNameMap = +Arrays.stream(HadoopCompressionCodec.values()).collect( + Collectors.toMap(codec -> codec.name(), codec -> codec.name().toLowerCase(Locale.ROOT))); Review Comment: nit: we can use method reference.
Re: [PR] [SPARK-33393][SQL] Support SHOW TABLE EXTENDED in v2 [spark]
cloud-fan commented on code in PR #37588: URL: https://github.com/apache/spark/pull/37588#discussion_r1380290514

## sql/core/src/test/scala/org/apache/spark/sql/execution/command/ShowTablesSuiteBase.scala: ##
@@ -126,4 +158,258 @@ trait ShowTablesSuiteBase extends QueryTest with DDLCommandTestUtils {
       }
     }
   }
+
+  test("show table in a not existing namespace") {
+    checkError(
+      exception = intercept[AnalysisException] {
+        sql(s"SHOW TABLES IN $catalog.nonexist")
+      },
+      errorClass = "SCHEMA_NOT_FOUND",
+      parameters = Map("schemaName" -> "`nonexist`"))
+  }
+
+  test("show table extended in a not existing namespace") {
+    checkError(
+      exception = intercept[AnalysisException] {
+        sql(s"SHOW TABLE EXTENDED IN $catalog.nonexist LIKE '*tbl*'")
+      },
+      errorClass = "SCHEMA_NOT_FOUND",
+      parameters = Map("schemaName" -> "`nonexist`"))
+  }
+
+  test("show table extended in a not existing table") {
+    val namespace = "ns1"
+    val table = "nonexist"
+    withNamespaceAndTable(namespace, table, catalog) { _ =>
+      val result = sql(s"SHOW TABLE EXTENDED IN $catalog.$namespace LIKE '*$table*'")
+      assert(result.schema.fieldNames ===
+        Seq("namespace", "tableName", "isTemporary", "information"))
+      assert(result.collect().isEmpty)
+    }
+  }
+
+  test("show table extended in a not existing partition") {
+    val namespace = "ns1"
+    val table = "tbl"
+    withNamespaceAndTable(namespace, table, catalog) { tbl =>
+      sql(s"CREATE TABLE $tbl (id bigint, data string) $defaultUsing PARTITIONED BY (id)")
+      sql(s"ALTER TABLE $tbl ADD PARTITION (id = 1)")
+      checkError(
+        exception = intercept[AnalysisException] {
+          sql(s"SHOW TABLE EXTENDED IN $catalog.$namespace LIKE '$table' PARTITION(id = 2)")
+        },
+        errorClass = "PARTITIONS_NOT_FOUND",
+        parameters = Map(
+          "partitionList" -> "PARTITION (`id` = 2)",
+          "tableName" -> "`ns1`.`tbl`"
+        )
+      )
+    }
+  }
+
+  test("show table extended in non-partitioned table") {
+    val namespace = "ns1"
+    val table = "tbl"
+    withNamespaceAndTable(namespace, table, catalog) { tbl =>
+      sql(s"CREATE TABLE $tbl (id bigint, data string) $defaultUsing")
+      val e = intercept[AnalysisException] {
+        sql(s"SHOW TABLE EXTENDED IN $catalog.$namespace LIKE '$table' PARTITION(id = 1)")
+      }
+      val (errorClass, parameters) = extendedPartInNonPartedTableError(catalog, namespace, table)
+      checkError(exception = e, errorClass = errorClass, parameters = parameters)
+    }
+  }
+
+  test("show table extended in multi partition key - " +
+    "the command's partition parameters are complete") {
+    val namespace = "ns1"
+    val table = "tbl"
+    withNamespaceAndTable(namespace, table, catalog) { tbl =>
+      sql(s"CREATE TABLE $tbl (id1 bigint, id2 bigint, data string) " +
+        s"$defaultUsing PARTITIONED BY (id1, id2)")
+      sql(s"ALTER TABLE $tbl ADD PARTITION (id1 = 1, id2 = 2)")
+
+      val result = sql(s"SHOW TABLE EXTENDED FROM $catalog.$namespace " +
+        s"LIKE '$table' PARTITION(id1 = 1, id2 = 2)")
+      assert(result.schema.fieldNames ===
+        Seq("namespace", "tableName", "isTemporary", "information"))
+      val resultCollect = result.collect()
+      assert(resultCollect(0).length == 4)
+      assert(resultCollect(0)(0) === namespace)
+      assert(resultCollect(0)(1) === table)
+      assert(resultCollect(0)(2) === false)
+      val actualResult = replace(resultCollect(0)(3).toString)
+      assert(actualResult === extendedPartExpectedResult)
+    }
+  }
+
+  test("show table extended in multi partition key - " +
+    "the command's partition parameters are incomplete") {
+    val namespace = "ns1"
+    val table = "tbl"
+    withNamespaceAndTable(namespace, table, catalog) { tbl =>
+      sql(s"CREATE TABLE $tbl (id1 bigint, id2 bigint, data string) " +
+        s"$defaultUsing PARTITIONED BY (id1, id2)")
+      sql(s"ALTER TABLE $tbl ADD PARTITION (id1 = 1, id2 = 2)")
+
+      checkError(
+        exception = intercept[AnalysisException] {
+          sql(s"SHOW TABLE EXTENDED IN $catalog.$namespace " +
+            s"LIKE '$table' PARTITION(id1 = 1)")
+        },
+        errorClass = "_LEGACY_ERROR_TEMP_1232",
+        parameters = Map(
+          "specKeys" -> "id1",
+          "partitionColumnNames" -> "id1, id2",
+          "tableName" -> s"`$catalog`.`$namespace`.`$table`")
+      )
+    }
+  }
+
+  test("show table extended in multi tables") {
+    val namespace = "ns1"
+    val table = "tbl"
+    withNamespaceAndTable(namespace, table, catalog) { _ =>
+      sql(s"CREATE TABLE $catalog.$namespace.$table (id bigint, data string) " +
+        s"$defaultUsing PARTITIONED BY (id)")
+      val table1 = "tbl1"
+      val table2 = "tbl2"
+      withTable(table1, table2) {
+        sql(s"CREATE TABLE $catalog.$namespace.$table1 (id1 bigint,
Re: [PR] [SPARK-45760][SQL] Add With expression to avoid duplicating expressions [spark]
wangyum commented on code in PR #43623: URL: https://github.com/apache/spark/pull/43623#discussion_r1380292748

## connector/connect/server/src/test/scala/org/apache/spark/sql/connect/ProtoToParsedPlanTestSuite.scala: ##
@@ -181,8 +183,15 @@ class ProtoToParsedPlanTestSuite
     val planner = new SparkConnectPlanner(SessionHolder.forTesting(spark))
     val catalystPlan =
       analyzer.executeAndCheck(planner.transformRelation(relation), new QueryPlanningTracker)
-    val actual =
-      removeMemoryAddress(normalizeExprIds(ReplaceExpressions(catalystPlan)).treeString)
+    val finalAnalyzedPlan = {
+      object Helper extends RuleExecutor[LogicalPlan] {
+        val batches =
+          Batch("Finish Analysis", Once, ReplaceExpressions) ::
+          Batch("Rewrite With expression", FixedPoint(10), RewriteWithExpression) :: Nil

Review Comment:
   Please fix the Scala style.
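For reference, a minimal sketch of one layout that should satisfy the style checker, assuming the imports already present in ProtoToParsedPlanTestSuite; only the indentation differs from the quoted diff:

```scala
// Sketch only: same rules as the quoted diff, with continuation indentation
// so the :: operands line up under the first Batch.
val finalAnalyzedPlan = {
  object Helper extends RuleExecutor[LogicalPlan] {
    val batches =
      Batch("Finish Analysis", Once, ReplaceExpressions) ::
      Batch("Rewrite With expression", FixedPoint(10), RewriteWithExpression) :: Nil
  }
  // RuleExecutor.execute runs the batches over the analyzed plan.
  Helper.execute(catalystPlan)
}
```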
Re: [PR] [SPARK-33393][SQL] Support SHOW TABLE EXTENDED in v2 [spark]
cloud-fan commented on code in PR #37588: URL: https://github.com/apache/spark/pull/37588#discussion_r1380291630

## sql/core/src/test/scala/org/apache/spark/sql/execution/command/v1/ShowTablesSuite.scala: ##
@@ -165,4 +154,63 @@ class ShowTablesSuite extends ShowTablesSuiteBase with CommandSuiteBase {
     }
   }
 }
+
+  override protected def extendedPartInNonPartedTableError(
+      catalog: String,
+      namespace: String,
+      table: String): (String, Map[String, String]) = {
+    ("_LEGACY_ERROR_TEMP_1251",

Review Comment:
   How hard would it be to unify this error between v1 and v2 tables?
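One possible shape of such a unification, purely as a sketch: a single helper in QueryCompilationErrors (which mixes in QueryErrorsBase, providing toSQLId) that both the v1 and v2 command paths would call. The error-class name below is hypothetical, not an existing Spark error class:

```scala
// Hypothetical sketch: if both code paths threw this, the per-catalog
// extendedPartInNonPartedTableError override above would become unnecessary.
def showTableExtendedOnNonPartitionedTableError(tableName: Seq[String]): Throwable = {
  new AnalysisException(
    errorClass = "SHOW_TABLE_EXTENDED_ON_NON_PARTITIONED_TABLE", // hypothetical name
    messageParameters = Map("tableName" -> toSQLId(tableName)))
}
```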
Re: [PR] [SPARK-45767][CORE] Delete `TimeStampedHashMap` and its UT [spark]
LuciferYang closed pull request #43633: [SPARK-45767][CORE] Delete `TimeStampedHashMap` and its UT URL: https://github.com/apache/spark/pull/43633
Re: [PR] [SPARK-33393][SQL] Support SHOW TABLE EXTENDED in v2 [spark]
cloud-fan commented on code in PR #37588: URL: https://github.com/apache/spark/pull/37588#discussion_r1380268144

## sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/ShowTablesExtendedExec.scala: ##
@@ -0,0 +1,194 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.execution.datasources.v2
+
+import scala.collection.mutable
+import scala.collection.mutable.ArrayBuffer
+import scala.jdk.CollectionConverters._
+
+import org.apache.spark.sql.catalyst.InternalRow
+import org.apache.spark.sql.catalyst.analysis.ResolvedPartitionSpec
+import org.apache.spark.sql.catalyst.catalog.CatalogTableType
+import org.apache.spark.sql.catalyst.catalog.ExternalCatalogUtils.escapePathName
+import org.apache.spark.sql.catalyst.expressions.{Attribute, Cast, Literal}
+import org.apache.spark.sql.catalyst.util.{quoteIdentifier, StringUtils}
+import org.apache.spark.sql.connector.catalog.{CatalogV2Util, Identifier, SupportsPartitionManagement, Table, TableCatalog}
+import org.apache.spark.sql.connector.catalog.CatalogV2Implicits._
+import org.apache.spark.sql.errors.QueryCompilationErrors
+import org.apache.spark.sql.execution.LeafExecNode
+import org.apache.spark.sql.execution.datasources.v2.DataSourceV2Implicits.TableHelper
+import org.apache.spark.sql.types.{StringType, StructType}
+
+/**
+ * Physical plan node for showing tables without partition, Show the information of tables.
+ */
+case class ShowTablesExtendedExec(
+    output: Seq[Attribute],
+    catalog: TableCatalog,
+    namespace: Seq[String],
+    pattern: String) extends V2CommandExec with LeafExecNode {
+  override protected def run(): Seq[InternalRow] = {
+    val rows = new ArrayBuffer[InternalRow]()
+
+    // fetch tables
+    // TODO We need a new listTable overload that takes a pattern string.
+    val tables = catalog.listTables(namespace.toArray)
+    tables.map { tableIdent =>
+      if (StringUtils.filterPattern(Seq(tableIdent.name()), pattern).nonEmpty) {
+        val table = catalog.loadTable(tableIdent)
+        val information = getTableDetails(catalog.name, tableIdent, table)
+        rows += toCatalystRow(tableIdent.namespace().quoted, tableIdent.name(), false,
+          s"$information\n")
+      }
+    }
+
+    // fetch temp views, includes: global temp view, local temp view
+    val sessionCatalog = session.sessionState.catalog
+    val db = namespace match {
+      case Seq(db) => Some(db)
+      case _ => None
+    }
+    val views = sessionCatalog.listTempViews(db.get, pattern)
+    views.map { viewIdent =>
+      val database = viewIdent.database.getOrElse("")
+      val tableName = viewIdent.table
+      val isTemp = sessionCatalog.isTempView(viewIdent)
+      val view = sessionCatalog.getTempViewOrPermanentTableMetadata(viewIdent)
+      val information = view.simpleString
+      rows += toCatalystRow(database, tableName, isTemp, s"$information\n")
+    }
+
+    rows.toSeq
+  }
+
+  private def getTableDetails(
+      catalogName: String,
+      identifier: Identifier,
+      table: Table): String = {
+    val results = new mutable.LinkedHashMap[String, String]()
+
+    results.put("Catalog", catalogName)
+    results.put("Namespace", identifier.namespace().quoted)
+    results.put("Table", identifier.name())
+    val tableType = if (table.properties().containsKey(TableCatalog.PROP_EXTERNAL)) {
+      CatalogTableType.EXTERNAL
+    } else {
+      CatalogTableType.MANAGED
+    }
+    results.put("Type", tableType.name)
+
+    CatalogV2Util.TABLE_RESERVED_PROPERTIES
+      .filterNot(_ == TableCatalog.PROP_EXTERNAL)
+      .foreach(propKey => {
+        if (table.properties.containsKey(propKey)) {
+          results.put(propKey.capitalize, table.properties.get(propKey))
+        }
+      })
+
+    val properties =
+      conf.redactOptions(table.properties.asScala.toMap).toList
+        .filter(kv => !CatalogV2Util.TABLE_RESERVED_PROPERTIES.contains(kv._1))
+        .sortBy(_._1).map {
+          case (key, value) => key + "=" + value
+        }.mkString("[", ",", "]")
+    if (table.properties().isEmpty) {
+      results.put("Table Properties", properties.mkString("[", ", ", "]"))
+    }
+
+    // Partition
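The quoted TODO asks for a pattern-aware listing API. A hypothetical sketch of what such an overload could look like; TableCatalog has no such method today, and the trait name and signature below are invented for illustration only:

```scala
// Hypothetical only: would let SHOW TABLE EXTENDED push the LIKE pattern
// down to the catalog instead of filtering client-side with
// StringUtils.filterPattern after a full listTables scan.
trait TableCatalogWithPatternListing extends TableCatalog {
  def listTables(namespace: Array[String], pattern: String): Array[Identifier]
}
```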
Re: [PR] [SPARK-33393][SQL] Support SHOW TABLE EXTENDED in v2 [spark]
cloud-fan commented on code in PR #37588: URL: https://github.com/apache/spark/pull/37588#discussion_r1380283326

## sql/core/src/test/scala/org/apache/spark/sql/execution/command/ShowTablesSuiteBase.scala: ##

(Quoted context: the same ShowTablesSuiteBase.scala hunk as above; the comment attaches to these lines.)

+      sql(s"CREATE TABLE $tbl (id1 bigint, id2 bigint, data string) " +
+        s"$defaultUsing PARTITIONED BY (id1, id2)")

Review Comment:
   nit: I think tests in the base suite should always partition by the trailing columns. We can add a new simple test in the v1 suite to prove that partition columns always end up at the end of the schema, and one in the v2 suite to prove that we respect the original table schema. I think the v1 behavior is probably a bug, but we may never fix it now that it has become a feature :(
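A sketch of the v1-side test the comment asks for; the test name and assertion are assumptions, while withNamespaceAndTable and defaultUsing come from the suite:

```scala
// Illustrative sketch: v1 (Hive-style) catalogs relocate partition columns
// to the end of the table schema, regardless of their declared position.
test("partition columns are moved to the end of the schema") {
  withNamespaceAndTable("ns1", "tbl", catalog) { tbl =>
    sql(s"CREATE TABLE $tbl (id1 bigint, data string, id2 bigint) " +
      s"$defaultUsing PARTITIONED BY (id1)")
    // v1 expectation: id1 is relocated after the data columns.
    assert(spark.table(tbl).schema.fieldNames === Seq("data", "id2", "id1"))
  }
}
```

The matching v2 test would assert the declared order, `Seq("id1", "data", "id2")`, since v2 catalogs keep the original table schema.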
Re: [PR] [SPARK-33393][SQL] Support SHOW TABLE EXTENDED in v2 [spark]
cloud-fan commented on code in PR #37588: URL: https://github.com/apache/spark/pull/37588#discussion_r1380247392

## sql/catalyst/src/main/scala/org/apache/spark/sql/errors/QueryCompilationErrors.scala: ##
@@ -2483,7 +2477,7 @@ private[sql] object QueryCompilationErrors extends QueryErrorsBase with Compilat
       messageParameters = Map(
         "specKeys" -> specKeys,
         "partitionColumnNames" -> partitionColumnNames.mkString(", "),
-        "tableName" -> tableName))
+        "tableName" -> toSQLId(tableName)))

Review Comment:
   This change is necessary: it helps unify the v1/v2 command behavior, which is an important goal of adding new v2 commands.
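For context, a minimal sketch of what the quoting accomplishes; quoteIdentifier is the existing helper in org.apache.spark.sql.catalyst.util (it is also imported in the ShowTablesExtendedExec draft above), and the example values are made up:

```scala
import org.apache.spark.sql.catalyst.util.quoteIdentifier

// Backtick-quote each name part: the form v2 command errors already emit.
val parts = Seq("spark_catalog", "ns1", "tbl")
val quoted = parts.map(quoteIdentifier).mkString(".")
// quoted == "`spark_catalog`.`ns1`.`tbl`"
```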
Re: [PR] [SPARK-33393][SQL] Support SHOW TABLE EXTENDED in v2 [spark]
cloud-fan commented on code in PR #37588: URL: https://github.com/apache/spark/pull/37588#discussion_r1380278877

## sql/core/src/test/scala/org/apache/spark/sql/execution/command/ShowTablesSuiteBase.scala: ##

(Quoted context: the same ShowTablesSuiteBase.scala hunk as above; the comment attaches to this line.)

+  test("show table extended in a not existing partition") {

Review Comment:
   ```suggestion
     test("show table extended with a not existing partition") {
   ```
Re: [PR] [SPARK-33393][SQL] Support SHOW TABLE EXTENDED in v2 [spark]
cloud-fan commented on code in PR #37588: URL: https://github.com/apache/spark/pull/37588#discussion_r1380278496

## sql/core/src/test/scala/org/apache/spark/sql/execution/command/ShowTablesSuiteBase.scala: ##

(Quoted context: the same ShowTablesSuiteBase.scala hunk as above; the comment attaches to this line.)

+  test("show table extended in a not existing table") {

Review Comment:
   ```suggestion
     test("show table extended with no matching table") {
   ```
Re: [PR] [SPARK-33393][SQL] Support SHOW TABLE EXTENDED in v2 [spark]
cloud-fan commented on code in PR #37588: URL: https://github.com/apache/spark/pull/37588#discussion_r1380276558

## sql/core/src/test/scala/org/apache/spark/sql/execution/command/ShowTablesSuiteBase.scala: ##
@@ -40,6 +40,38 @@ trait ShowTablesSuiteBase extends QueryTest with DDLCommandTestUtils {
     checkAnswer(df, expected)
   }

+  // the error class & error parameters of
+  // `SHOW TABLE EXTENDED ... PARTITION ... in non-partitioned table`
+  protected def extendedPartInNonPartedTableError(
+      catalog: String,
+      namespace: String,
+      table: String): (String, Map[String, String])
+
+  protected def extendedPartExpectedResult: String =
+    "Partition Values: [id1=1, id2=2]"
+
+  protected def namespaceKey: String = "Database"
+
+  protected def extendedTableExpectedResultDiff: String
+
+  private def extendedTableExpectedResult(
+      catalog: String,
+      namespaceName: String,
+      namespace: String,
+      table: String,
+      partColName: String,
+      dataColName: String): String = {
+    s"""Catalog: $catalog
+       |$namespaceName: $namespace
+       |Table: $table
+       |$extendedTableExpectedResultDiff

Review Comment:
   This name is a bit confusing. Do you mean `extendedTableInfo`?
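For clarity, a sketch of the rename being floated; only the name changes, and the member stays abstract:

```scala
// Sketch of the suggested rename: the member supplies the catalog-specific
// slice of the expected "information" output, so the name should say so.
protected def extendedTableInfo: String
```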
Re: [PR] [SPARK-33393][SQL] Support SHOW TABLE EXTENDED in v2 [spark]
cloud-fan commented on code in PR #37588: URL: https://github.com/apache/spark/pull/37588#discussion_r1380266348

## sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/ShowTablesExtendedExec.scala: ##

(Quoted context: the same ShowTablesExtendedExec.scala hunk as above.)
Re: [PR] [SPARK-33393][SQL] Support SHOW TABLE EXTENDED in v2 [spark]
cloud-fan commented on code in PR #37588: URL: https://github.com/apache/spark/pull/37588#discussion_r1380265564

## sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/ShowTablesExtendedExec.scala: ##

(Quoted context: the same ShowTablesExtendedExec.scala hunk as above.)
Re: [PR] [SPARK-33393][SQL] Support SHOW TABLE EXTENDED in v2 [spark]
cloud-fan commented on code in PR #37588: URL: https://github.com/apache/spark/pull/37588#discussion_r1380269913

## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/catalog/SessionCatalog.scala: ##
@@ -1090,6 +1090,22 @@ class SessionCatalog(
     dbViews ++ listLocalTempViews(pattern)
   }

+  /**
+   * List all matching temp views in the specified database, including global/local temporary views.
+   */
+  def listTempViews(db: String, pattern: String): Seq[TableIdentifier] = {

Review Comment:
   Shall we just return `Seq[CatalogTable]`?
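A sketch of what the proposed return type would enable on the caller side; the surrounding names (rows, toCatalystRow, sessionCatalog, db, pattern) come from the quoted ShowTablesExtendedExec draft above, and the exact body is illustrative:

```scala
// Sketch of the proposed signature: returning the full CatalogTable removes
// the second lookup via getTempViewOrPermanentTableMetadata.
def listTempViews(db: String, pattern: String): Seq[CatalogTable]

// The temp-view loop in ShowTablesExtendedExec would then reduce to roughly:
sessionCatalog.listTempViews(db, pattern).foreach { view =>
  rows += toCatalystRow(
    view.identifier.database.getOrElse(""),
    view.identifier.table,
    true, // every entry returned here is a temporary view
    s"${view.simpleString}\n")
}
```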