[GitHub] [spark] dongjoon-hyun commented on pull request #36863: [SPARK-39459][CORE] `local*HostName*` methods should support `IPv6`
dongjoon-hyun commented on PR #36863: URL: https://github.com/apache/spark/pull/36863#issuecomment-1154793223 Just FYI, `scala.io.Source.fromURL` seems to be unable to support IPv6 for now. I'm digging that part. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
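For context, a minimal sketch of the IPv6 form being probed here, assuming the standard RFC 2732 bracket notation for IPv6 literals in URLs; the URLs and port are placeholders, and whether `Source.fromURL` accepts the bracketed form is exactly what is being investigated above:

```scala
import scala.io.Source

// Placeholder endpoints; IPv6 literals in URLs must be wrapped in brackets.
val ipv4Url = "http://127.0.0.1:18080/api/v1/applications"
val ipv6Url = "http://[::1]:18080/api/v1/applications"

// Reads the whole response body as a string; this is the call under suspicion for IPv6 hosts.
def fetch(url: String): String = Source.fromURL(url).mkString
```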
[GitHub] [spark] cxzl25 commented on a diff in pull request #36740: [SPARK-39355][SQL] Avoid UnresolvedAttribute.apply throwing ParseException
cxzl25 commented on code in PR #36740: URL: https://github.com/apache/spark/pull/36740#discussion_r896444180 ## sql/core/src/test/scala/org/apache/spark/sql/SubquerySuite.scala: ## @@ -2176,4 +2176,32 @@ class SubquerySuite extends QueryTest with SharedSparkSession with AdaptiveSpark } } } + + test("SPARK-39355: Avoid UnresolvedAttribute.apply throwing ParseException") { +checkAnswer( + sql(""" +|SELECT * Review Comment: It seems that there may be a problem here, although I didn't reproduce it using SQL. Do I need to fix it? https://github.com/apache/spark/blob/2349175e1b81b0a61e1ed90c2d051c01cf78de9b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/interfaces.scala#L120 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] dongjoon-hyun commented on pull request #36863: [SPARK-39459][CORE] `local*HostName*` methods should support `IPv6`
dongjoon-hyun commented on PR #36863: URL: https://github.com/apache/spark/pull/36863#issuecomment-1154791746 These are totally independent from this patch. I'm working on them, too, @LuciferYang . :) -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] LuciferYang commented on pull request #36863: [SPARK-39459][CORE] `local*HostName*` methods should support `IPv6`
LuciferYang commented on PR #36863: URL: https://github.com/apache/spark/pull/36863#issuecomment-1154784120 - RocksDBBackendHistoryServerSuite ``` - application list json *** FAILED *** java.net.ConnectException: Connection refused (Connection refused) at java.net.PlainSocketImpl.socketConnect(Native Method) at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:350) at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:206) at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:188) at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392) at java.net.Socket.connect(Socket.java:613) at java.net.Socket.connect(Socket.java:561) at sun.net.NetworkClient.doConnect(NetworkClient.java:180) at sun.net.www.http.HttpClient.openServer(HttpClient.java:463) at sun.net.www.http.HttpClient.openServer(HttpClient.java:558) ``` It seems that HTTP related cases will fail? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] LuciferYang commented on pull request #36863: [SPARK-39459][CORE] `local*HostName*` methods should support `IPv6`
LuciferYang commented on PR #36863: URL: https://github.com/apache/spark/pull/36863#issuecomment-1154782012 I found other failed cases, like - UISeleniumSuite ``` - effects of unpersist() / persist() should be reflected *** FAILED *** java.net.ConnectException: Connection refused (Connection refused) at java.net.PlainSocketImpl.socketConnect(Native Method) at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:350) at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:206) at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:188) at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392) at java.net.Socket.connect(Socket.java:613) at java.net.Socket.connect(Socket.java:561) at sun.net.NetworkClient.doConnect(NetworkClient.java:180) at sun.net.www.http.HttpClient.openServer(HttpClient.java:463) at sun.net.www.http.HttpClient.openServer(HttpClient.java:558) ... - failed stages should not appear to be active *** FAILED *** java.net.ConnectException: Connection refused (Connection refused) at java.net.PlainSocketImpl.socketConnect(Native Method) at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:350) at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:206) at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:188) at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392) at java.net.Socket.connect(Socket.java:613) at java.net.Socket.connect(Socket.java:561) at sun.net.NetworkClient.doConnect(NetworkClient.java:180) at sun.net.www.http.HttpClient.openServer(HttpClient.java:463) at sun.net.www.http.HttpClient.openServer(HttpClient.java:558) ``` - UISuite ``` - SPARK-36237: Attach and start handler after application started in UI *** FAILED *** java.net.ConnectException: Connection refused (Connection refused) at java.net.PlainSocketImpl.socketConnect(Native Method) at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:350) at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:206) at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:188) at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392) at java.net.Socket.connect(Socket.java:613) at java.net.Socket.connect(Socket.java:561) at sun.net.NetworkClient.doConnect(NetworkClient.java:180) at sun.net.www.http.HttpClient.openServer(HttpClient.java:463) at sun.net.www.http.HttpClient.openServer(HttpClient.java:558) ``` -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] MaxGekk commented on a diff in pull request #36855: [SPARK-39432][SQL] element_at(*, 0) does not return INVALID_ARRAY_INDEX_IN_ELEMENT_AT
MaxGekk commented on code in PR #36855: URL: https://github.com/apache/spark/pull/36855#discussion_r896426977 ## sql/core/src/test/resources/sql-tests/results/ansi/array.sql.out: ## @@ -191,8 +191,8 @@ select element_at(array(1, 2, 3), 0) -- !query schema struct<> -- !query output -java.lang.ArrayIndexOutOfBoundsException -SQL array indices start at 1 +org.apache.spark.SparkArrayIndexOutOfBoundsException +[INVALID_ARRAY_INDEX_IN_ELEMENT_AT] The index 0 is out of bounds. The array has 3 elements. Use `try_element_at` to tolerate accessing element at invalid index and return NULL instead. If necessary set "spark.sql.ansi.enabled" to "false" to bypass this error. Review Comment: The error might confuse users, I guess. Why is 0 out of bounds? We should say that indexes can be <0 (counting from the end) and >0 (the first element has index 1). I would introduce a separate error class for this case. WDYT @cloud-fan @srielau ? ## sql/core/src/test/resources/sql-tests/results/ansi/try_element_at.sql.out: ## @@ -7,8 +7,8 @@ SELECT try_element_at(array(1, 2, 3), 0) -- !query schema struct<> -- !query output -java.lang.ArrayIndexOutOfBoundsException -SQL array indices start at 1 +org.apache.spark.SparkArrayIndexOutOfBoundsException +[INVALID_ARRAY_INDEX_IN_ELEMENT_AT] The index 0 is out of bounds. The array has 3 elements. Use `try_element_at` to tolerate accessing element at invalid index and return NULL instead. If necessary set "spark.sql.ansi.enabled" to "false" to bypass this error. Review Comment: > Use `try_element_at` to tolerate The query uses `try_element_at` already, so the recommendation is not useful here. ## sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/CollectionExpressionsSuite.scala: ## @@ -2348,7 +2348,7 @@ class CollectionExpressionsSuite extends SparkFunSuite with ExpressionEvalHelper // SQL array indices start at 1 exception throws for both mode. expr = ElementAt(array, Literal(0)) -val errMsg = "SQL array indices start at 1" +val errMsg = "The index 0 is out of bounds. The array has 3 elements." Review Comment: The test `SPARK-33386: element_at ArrayIndexOutOfBoundsException` checks errors only. Should we move it to `QueryExecutionAnsiErrorsSuite` and/or `QueryExecutionErrorsSuite`? ## sql/core/src/test/resources/sql-tests/results/ansi/string-functions.sql.out: ## @@ -177,8 +177,8 @@ SELECT split_part('11.12.13', '.', 0) -- !query schema struct<> -- !query output -java.lang.ArrayIndexOutOfBoundsException -SQL array indices start at 1 +org.apache.spark.SparkArrayIndexOutOfBoundsException +[INVALID_ARRAY_INDEX_IN_ELEMENT_AT] The index 0 is out of bounds. The array has 3 elements. Use `try_element_at` to tolerate accessing element at invalid index and return NULL instead. If necessary set "spark.sql.ansi.enabled" to "false" to bypass this error. Review Comment: Use `try_element_at` to tolerate accessing element ... this might confuse users; the recommendation is not useful here. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
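For readers following along, a small sketch of the indexing convention the comment refers to; the outputs noted in the comments reflect the documented `element_at` semantics, while the exact error class and message for index 0 are what this PR is changing:

```scala
// Assumes a running SparkSession `spark` with ANSI mode enabled.
spark.conf.set("spark.sql.ansi.enabled", "true")

spark.sql("SELECT element_at(array(1, 2, 3), 1)").show()   // 1: indices start at 1
spark.sql("SELECT element_at(array(1, 2, 3), -1)").show()  // 3: negative indices count from the end
spark.sql("SELECT element_at(array(1, 2, 3), 0)").show()   // index 0 is never valid and raises an error
```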
[GitHub] [spark] dongjoon-hyun commented on pull request #36864: [SPARK-39463][CORE][TESTS] Use `UUID` for test database location in `JavaJdbcRDDSuite`
dongjoon-hyun commented on PR #36864: URL: https://github.com/apache/spark/pull/36864#issuecomment-1154773786 Thank you, @huaxingao . -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] beliefer commented on pull request #36726: [SPARK-39339][SQL] Support TimestampNTZ type in JDBC data source
beliefer commented on PR #36726: URL: https://github.com/apache/spark/pull/36726#issuecomment-1154758012 @sadikovi You can run the test case I added above. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] Ngone51 commented on a diff in pull request #36162: [SPARK-32170][CORE] Improve the speculation through the stage task metrics.
Ngone51 commented on code in PR #36162: URL: https://github.com/apache/spark/pull/36162#discussion_r896408996 ## core/src/main/scala/org/apache/spark/scheduler/TaskSetManager.scala: ## @@ -1069,25 +1084,56 @@ private[spark] class TaskSetManager( * Check if the task associated with the given tid has past the time threshold and should be * speculative run. */ - private def checkAndSubmitSpeculatableTask( - tid: Long, + private def checkAndSubmitSpeculatableTasks( currentTimeMillis: Long, - threshold: Double): Boolean = { -val info = taskInfos(tid) -val index = info.index -if (!successful(index) && copiesRunning(index) == 1 && -info.timeRunning(currentTimeMillis) > threshold && !speculatableTasks.contains(index)) { - addPendingTask(index, speculatable = true) - logInfo( -("Marking task %d in stage %s (on %s) as speculatable because it ran more" + - " than %.0f ms(%d speculatable tasks in this taskset now)") - .format(index, taskSet.id, info.host, threshold, speculatableTasks.size + 1)) - speculatableTasks += index - sched.dagScheduler.speculativeTaskSubmitted(tasks(index)) - true -} else { - false + threshold: Double, + numSuccessfulTasks: Int, + customizedThreshold: Boolean = false): Boolean = { +var foundTasksResult = false +for (tid <- runningTasksSet) { + val info = taskInfos(tid) + val index = info.index + if (!successful(index) && copiesRunning(index) == 1 && !speculatableTasks.contains(index)) { +val runtimeMs = info.timeRunning(currentTimeMillis) + +def checkMaySpeculate(): Boolean = { + if (customizedThreshold || taskProcessRateCalculator.isEmpty) { +true + } else { +val longTimeTask = runtimeMs > efficientTaskDurationFactor * threshold +longTimeTask || taskProcessRateCalculator.exists(_.isInefficient(tid, runtimeMs, info)) + } +} + +def shouldSpeculateForExecutorDecomissioning(): Boolean = { Review Comment: This should be only called when `customizedThreshold=false`, right? Given that it depends on `successfulTaskDurations.median`. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] Ngone51 commented on a diff in pull request #36162: [SPARK-32170][CORE] Improve the speculation through the stage task metrics.
Ngone51 commented on code in PR #36162: URL: https://github.com/apache/spark/pull/36162#discussion_r896407476 ## core/src/main/scala/org/apache/spark/scheduler/TaskSetManager.scala: ## @@ -1069,25 +1084,56 @@ private[spark] class TaskSetManager( * Check if the task associated with the given tid has past the time threshold and should be * speculative run. */ - private def checkAndSubmitSpeculatableTask( - tid: Long, + private def checkAndSubmitSpeculatableTasks( currentTimeMillis: Long, - threshold: Double): Boolean = { -val info = taskInfos(tid) -val index = info.index -if (!successful(index) && copiesRunning(index) == 1 && -info.timeRunning(currentTimeMillis) > threshold && !speculatableTasks.contains(index)) { - addPendingTask(index, speculatable = true) - logInfo( -("Marking task %d in stage %s (on %s) as speculatable because it ran more" + - " than %.0f ms(%d speculatable tasks in this taskset now)") - .format(index, taskSet.id, info.host, threshold, speculatableTasks.size + 1)) - speculatableTasks += index - sched.dagScheduler.speculativeTaskSubmitted(tasks(index)) - true -} else { - false + threshold: Double, + numSuccessfulTasks: Int, + customizedThreshold: Boolean = false): Boolean = { +var foundTasksResult = false +for (tid <- runningTasksSet) { + val info = taskInfos(tid) + val index = info.index + if (!successful(index) && copiesRunning(index) == 1 && !speculatableTasks.contains(index)) { +val runtimeMs = info.timeRunning(currentTimeMillis) + +def checkMaySpeculate(): Boolean = { + if (customizedThreshold || taskProcessRateCalculator.isEmpty) { +true + } else { +val longTimeTask = runtimeMs > efficientTaskDurationFactor * threshold Review Comment: This seems to tighten the speculatable condition of a task. Previously, a task could be speculated as long as `runtimeMs > speculationMultiplier * medianDuration`, but now it has to be `runtimeMs > efficientTaskDurationFactor * speculationMultiplier * medianDuration`. And you can't fall back to the original behaviour even if `spark.speculation.efficiency.enabled` is disabled, while it's supposed to be a global control flag for the whole feature. Besides, in my understanding, should the condition be like `(runtimeMs > speculationMultiplier * medianDuration) && taskProcessRateCalculator.exists(_.isInefficient(tid, runtimeMs, info))`? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
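To make the two conditions easier to compare, a self-contained sketch with all inputs passed as plain parameters (a hypothetical helper, not Spark code; `isInefficient` stands in for `taskProcessRateCalculator.exists(_.isInefficient(...))`):

```scala
def speculatable(
    runtimeMs: Long,
    medianDurationMs: Double,
    speculationMultiplier: Double,
    efficientTaskDurationFactor: Double,
    isInefficient: Boolean): (Boolean, Boolean) = {
  // Condition as written in the diff: a much longer duration threshold OR an inefficient process rate.
  val asWritten =
    runtimeMs > efficientTaskDurationFactor * speculationMultiplier * medianDurationMs || isInefficient
  // Condition suggested in the comment: past the original threshold AND inefficient.
  val asSuggested =
    runtimeMs > speculationMultiplier * medianDurationMs && isInefficient
  (asWritten, asSuggested)
}
```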
[GitHub] [spark] sadikovi commented on pull request #36726: [SPARK-39339][SQL] Support TimestampNTZ type in JDBC data source
sadikovi commented on PR #36726: URL: https://github.com/apache/spark/pull/36726#issuecomment-1154752252 @beliefer Maybe we can address your concerns in the follow-up work, what do you think? We can open a follow-up ticket and try to polish the implementation - it is not perfect by any means! -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] Ngone51 commented on a diff in pull request #36162: [SPARK-32170][CORE] Improve the speculation through the stage task metrics.
Ngone51 commented on code in PR #36162: URL: https://github.com/apache/spark/pull/36162#discussion_r896402450 ## core/src/main/scala/org/apache/spark/scheduler/TaskSetManager.scala: ## @@ -1069,25 +1084,56 @@ private[spark] class TaskSetManager( * Check if the task associated with the given tid has past the time threshold and should be * speculative run. */ - private def checkAndSubmitSpeculatableTask( - tid: Long, + private def checkAndSubmitSpeculatableTasks( currentTimeMillis: Long, - threshold: Double): Boolean = { -val info = taskInfos(tid) -val index = info.index -if (!successful(index) && copiesRunning(index) == 1 && -info.timeRunning(currentTimeMillis) > threshold && !speculatableTasks.contains(index)) { - addPendingTask(index, speculatable = true) - logInfo( -("Marking task %d in stage %s (on %s) as speculatable because it ran more" + - " than %.0f ms(%d speculatable tasks in this taskset now)") - .format(index, taskSet.id, info.host, threshold, speculatableTasks.size + 1)) - speculatableTasks += index - sched.dagScheduler.speculativeTaskSubmitted(tasks(index)) - true -} else { - false + threshold: Double, + numSuccessfulTasks: Int, Review Comment: not used? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] sadikovi commented on pull request #36726: [SPARK-39339][SQL] Support TimestampNTZ type in JDBC data source
sadikovi commented on PR #36726: URL: https://github.com/apache/spark/pull/36726#issuecomment-1154747718 I think we can. JDBC dialects can configure how they map TimestampNTZ type. In the case you mentioned, both timestamps will be read as timestamp_ntz in MySQL and Postgres. In fact, the current timestamp type is stored as timestamp_ntz in those database systems. Even with dialects managing timestamp_ntz writes and reads, this would be the same problem unless you store them as different types. Also, the test passes in master: ``` [ivan.sadikov@C02DV1TGMD6R spark-oss (master)]$ git log -n1 commit 2349175e1b81b0a61e1ed90c2d051c01cf78de9b (HEAD -> master, upstream/master) Author: Ivan Sadikov Date: Mon Jun 13 21:22:15 2022 -0700 [SPARK-39339][SQL] Support TimestampNTZ type in JDBC data source [info] JDBCSuite: 05:53:04.233 WARN org.apache.hadoop.util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable 05:53:07.936 ERROR org.apache.spark.sql.execution.datasources.jdbc.connection.ConnectionProvider: Failed to load built-in provider. [info] - SPARK-39339: Handle TimestampNTZType null values (1 second, 555 milliseconds) [info] - SPARK-39339: TimestampNTZType with different local time zones (4 seconds, 48 milliseconds) 05:53:14.022 WARN org.apache.spark.sql.jdbc.JDBCSuite: = POSSIBLE THREAD LEAK IN SUITE o.a.s.sql.jdbc.JDBCSuite, threads: Timer-2 (daemon=true), rpc-boss-3-1 (daemon=true), shuffle-boss-6-1 (daemon=true) = [info] Run completed in 11 seconds, 724 milliseconds. [info] Total number of tests run: 2 [info] Suites: completed 1, aborted 0 [info] Tests: succeeded 2, failed 0, canceled 0, ignored 0, pending 0 [info] All tests passed. [success] Total time: 81 s (01:21), completed Jun 14, 2022 5:53:14 AM ``` -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] Ngone51 commented on a diff in pull request #36162: [SPARK-32170][CORE] Improve the speculation through the stage task metrics.
Ngone51 commented on code in PR #36162: URL: https://github.com/apache/spark/pull/36162#discussion_r896400772 ## core/src/main/scala/org/apache/spark/scheduler/TaskSetManager.scala: ## @@ -1069,25 +1084,56 @@ private[spark] class TaskSetManager( * Check if the task associated with the given tid has past the time threshold and should be * speculative run. */ - private def checkAndSubmitSpeculatableTask( - tid: Long, + private def checkAndSubmitSpeculatableTasks( currentTimeMillis: Long, - threshold: Double): Boolean = { -val info = taskInfos(tid) -val index = info.index -if (!successful(index) && copiesRunning(index) == 1 && -info.timeRunning(currentTimeMillis) > threshold && !speculatableTasks.contains(index)) { - addPendingTask(index, speculatable = true) - logInfo( -("Marking task %d in stage %s (on %s) as speculatable because it ran more" + - " than %.0f ms(%d speculatable tasks in this taskset now)") - .format(index, taskSet.id, info.host, threshold, speculatableTasks.size + 1)) - speculatableTasks += index - sched.dagScheduler.speculativeTaskSubmitted(tasks(index)) - true -} else { - false + threshold: Double, + numSuccessfulTasks: Int, + customizedThreshold: Boolean = false): Boolean = { +var foundTasksResult = false +for (tid <- runningTasksSet) { + val info = taskInfos(tid) + val index = info.index + if (!successful(index) && copiesRunning(index) == 1 && !speculatableTasks.contains(index)) { +val runtimeMs = info.timeRunning(currentTimeMillis) + +def checkMaySpeculate(): Boolean = { + if (customizedThreshold || taskProcessRateCalculator.isEmpty) { +true + } else { +val longTimeTask = runtimeMs > efficientTaskDurationFactor * threshold +longTimeTask || taskProcessRateCalculator.exists(_.isInefficient(tid, runtimeMs, info)) + } +} + +def shouldSpeculateForExecutorDecomissioning(): Boolean = { Review Comment: ```suggestion def shouldSpeculateForExecutorDecommissioning(): Boolean = { ``` -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] ivoson commented on pull request #36716: [SPARK-39062][CORE] Add stage level resource scheduling support for standalone cluster
ivoson commented on PR #36716: URL: https://github.com/apache/spark/pull/36716#issuecomment-1154747061 > I kind of disagree because it doesn't work as expected compared to other resource manager. This to me is very confusing. I kind of hate to add more features on what I would consider a broken feature and then people think its ok. If people are using this successfully and finding it useful, the very least we need to do if document how it works and its limitations. I somewhat remember hitting other issues with dynamic allocation in standalone but then giving up on it as I talked to a few people that said it wasn't used and figured someone would go through and test it. ie on that same issue I mention "Note that there are other places in the code that uses executor cores which could also be wrong in standalone mode. for instance PythonRunner is using it to split memory." > > @ivoson I'm assuming you are using this successfully in production, how much testing have you done with dynamic allocation? Hi @tgravescs , thanks for pointing out the issue. Actually, we are not using this in production right now. Just find this feature very useful and we can leverage it, but it can't work with standalone cluster right now, so I want to support it. For your concern, I'd like to create another ticket to update the doc first to make people be aware of it. Thanks @tgravescs and @Ngone51 for your guidance. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] beliefer commented on pull request #36726: [SPARK-39339][SQL] Support TimestampNTZ type in JDBC data source
beliefer commented on PR #36726: URL: https://github.com/apache/spark/pull/36726#issuecomment-1154742731 > I updated the test case as you suggested and it passes on my machine. Can you share the error message? It also passed the build. ``` == Results == !== Correct Answer - 1 == == Spark Answer - 1 == struct struct ![1500-01-20T00:00:00.123456] [1500-01-20T00:16:08.123456] ``` -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] beliefer commented on pull request #36726: [SPARK-39339][SQL] Support TimestampNTZ type in JDBC data source
beliefer commented on PR #36726: URL: https://github.com/apache/spark/pull/36726#issuecomment-1154742475 I think we can't support timestamp ntz with the option. We should let the JDBC dialect decide how to support timestamp ntz. If one table has ts1 as timestamp and ts2 as timestamp ntz, what is the output when we specify the `inferTimestampNTZType` option? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
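To illustrate the concern, a hedged sketch of a JDBC read using the `inferTimestampNTZType` option discussed in this PR; the URL, table, and column names are hypothetical, and the point of the question above is that the option applies to the whole read rather than per column:

```scala
// Hypothetical table `events` with two columns: ts1 TIMESTAMP and ts2 TIMESTAMP with NTZ semantics.
val df = spark.read
  .format("jdbc")
  .option("url", "jdbc:postgresql://localhost:5432/testdb")
  .option("dbtable", "events")
  .option("inferTimestampNTZType", "true")  // one flag for the whole read, not per column
  .load()

df.printSchema()
```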
[GitHub] [spark] MaxGekk closed pull request #36852: [SPARK-38700][SQL][3.3] Use error classes in the execution errors of save mode
MaxGekk closed pull request #36852: [SPARK-38700][SQL][3.3] Use error classes in the execution errors of save mode URL: https://github.com/apache/spark/pull/36852 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] sadikovi commented on pull request #36726: [SPARK-39339][SQL] Support TimestampNTZ type in JDBC data source
sadikovi commented on PR #36726: URL: https://github.com/apache/spark/pull/36726#issuecomment-1154740681 I updated the test case as you suggested and it passes on my machine. Can you share the error message? It also passed the build. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] MaxGekk commented on pull request #36852: [SPARK-38700][SQL][3.3] Use error classes in the execution errors of save mode
MaxGekk commented on PR #36852: URL: https://github.com/apache/spark/pull/36852#issuecomment-1154740365 +1, LGTM. Merging to 3.3. Thank you, @panbingkun and @dongjoon-hyun for review. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] gengliangwang commented on pull request #36726: [SPARK-39339][SQL] Support TimestampNTZ type in JDBC data source
gengliangwang commented on PR #36726: URL: https://github.com/apache/spark/pull/36726#issuecomment-1154737984 @beliefer Your test case is testing ORC, while this PR is about JDBC... -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] dongjoon-hyun commented on a diff in pull request #36864: [SPARK-39463][CORE][TESTS] Use `UUID` for test database location in `JavaJdbcRDDSuite`
dongjoon-hyun commented on code in PR #36864: URL: https://github.com/apache/spark/pull/36864#discussion_r896391807 ## core/src/test/java/org/apache/spark/JavaJdbcRDDSuite.java: ## @@ -32,6 +33,7 @@ import org.junit.Test; public class JavaJdbcRDDSuite implements Serializable { + private String dbName = "db_" + UUID.randomUUID().toString().replace('-', '_'); Review Comment: I borrowed the logic. Thanks, @HyukjinKwon . https://github.com/apache/spark/blob/2349175e1b81b0a61e1ed90c2d051c01cf78de9b/sql/core/src/test/scala/org/apache/spark/sql/test/SQLTestUtils.scala#L354 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] beliefer commented on pull request #36726: [SPARK-39339][SQL] Support TimestampNTZ type in JDBC data source
beliefer commented on PR #36726: URL: https://github.com/apache/spark/pull/36726#issuecomment-1154735759 > Thanks, merging to master

I updated this test case and it will fail!

```
test("SPARK-37463: read/write Timestamp ntz to Orc with different time zone") {
  DateTimeTestUtils.withDefaultTimeZone(DateTimeTestUtils.LA) {
    val sqlText =
      """
        |select
        |  timestamp_ntz '2021-06-01 00:00:00' ts_ntz1,
        |  timestamp_ntz '1883-11-16 00:00:00.0' as ts_ntz2,
        |  timestamp_ntz '2021-03-14 02:15:00.0' as ts_ntz3
        |""".stripMargin
    withTempPath { dir =>
      val path = dir.getCanonicalPath
      val df = sql(sqlText)
      df.write.mode("overwrite").orc(path)
      val query = s"select * from `orc`.`$path`"
      DateTimeTestUtils.outstandingZoneIds.foreach { zoneId =>
        DateTimeTestUtils.withDefaultTimeZone(zoneId) {
          withAllNativeOrcReaders {
            checkAnswer(sql(query), df)
          }
        }
      }
    }
  }
}
```
-- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] dongjoon-hyun commented on a diff in pull request #36864: [SPARK-39463][CORE][TESTS] Use `UUID` for test database location in `JavaJdbcRDDSuite`
dongjoon-hyun commented on code in PR #36864: URL: https://github.com/apache/spark/pull/36864#discussion_r896389409 ## core/src/test/java/org/apache/spark/JavaJdbcRDDSuite.java: ## @@ -32,6 +33,7 @@ import org.junit.Test; public class JavaJdbcRDDSuite implements Serializable { + private String dbName = UUID.randomUUID().toString(); Review Comment: Oh, thank you. Sure. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] HyukjinKwon commented on a diff in pull request #36864: [SPARK-39463][CORE][TESTS] Use `UUID` for test database location in `JavaJdbcRDDSuite`
HyukjinKwon commented on code in PR #36864: URL: https://github.com/apache/spark/pull/36864#discussion_r896378643 ## core/src/test/java/org/apache/spark/JavaJdbcRDDSuite.java: ## @@ -32,6 +33,7 @@ import org.junit.Test; public class JavaJdbcRDDSuite implements Serializable { + private String dbName = UUID.randomUUID().toString(); Review Comment: I remember it becomes flaky when the UUID starts with numbers (in Spark SQL), e.g., `SQLTestUtils.withTempDatabase`. Would it be better to copy and paste that here? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
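A small sketch of the pattern being referenced, assuming the usual restriction that SQL identifiers may not start with a digit or contain `-`; the helper name is made up:

```scala
import java.util.UUID

// Prefix with a letter and replace dashes so the generated name is always a valid identifier.
def tempDbName(): String = "db_" + UUID.randomUUID().toString.replace('-', '_')
```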
[GitHub] [spark] Ngone51 commented on pull request #36716: [SPARK-39062][CORE] Add stage level resource scheduling support for standalone cluster
Ngone51 commented on PR #36716: URL: https://github.com/apache/spark/pull/36716#issuecomment-1154714325 @tgravescs Thanks for your feedback. I agree it's kind of a dirty way to build new features on top of features with known issues, since it buries those issues even deeper. I'm also +1 on the proposal to update the docs and warn users in the case of None cores. For the long term, we could think of a better solution. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] AmplabJenkins commented on pull request #36855: [SPARK-39432][SQL] element_at(*, 0) does not return INVALID_ARRAY_INDEX_IN_ELEMENT_AT
AmplabJenkins commented on PR #36855: URL: https://github.com/apache/spark/pull/36855#issuecomment-1154710152 Can one of the admins verify this patch? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] ulysses-you commented on a diff in pull request #36856: [SPARK-39455][SQL] Improve expression non-codegen code path performance by cache data type matching
ulysses-you commented on code in PR #36856: URL: https://github.com/apache/spark/pull/36856#discussion_r896371376 ## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/arithmetic.scala: ## @@ -53,6 +53,17 @@ case class UnaryMinus( override def toString: String = s"-$child" private lazy val numeric = TypeUtils.getNumeric(dataType, failOnError) + private lazy val unaryMinusFunc: Any => Any = dataType match { Review Comment: @srowen @LuciferYang This issue mostly happens in arithmetic-like expressions, and I have also tried to collect the other related expressions. Hope I'm not missing any. We can still keep this in mind and watch for newly added expressions or newly supported data types that may introduce this issue. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] gengliangwang closed pull request #36726: [SPARK-39339][SQL] Support TimestampNTZ type in JDBC data source
gengliangwang closed pull request #36726: [SPARK-39339][SQL] Support TimestampNTZ type in JDBC data source URL: https://github.com/apache/spark/pull/36726 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] gengliangwang commented on pull request #36726: [SPARK-39339][SQL] Support TimestampNTZ type in JDBC data source
gengliangwang commented on PR #36726: URL: https://github.com/apache/spark/pull/36726#issuecomment-1154692745 Thanks, merging to master -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] dongjoon-hyun opened a new pull request, #36864: [SPARK-39463][CORE][TESTS] Use `UUID` for test database location in `JavaJdbcRDDSuite`
dongjoon-hyun opened a new pull request, #36864: URL: https://github.com/apache/spark/pull/36864 ### What changes were proposed in this pull request? This PR aims to use UUID instead of a fixed test database location in `JavaJdbcRDDSuite`. ### Why are the changes needed? Although there exists a clean-up logic in `JavaJdbcRDDSuite`, the location is not removed cleanly when the tests are interrupted. After this PR, we can avoid the conflicts due to the leftover. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Pass the CIs. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] LuciferYang commented on pull request #36856: [SPARK-39455][SQL] Improve expression non-codegen code path performance by cache data type matching
LuciferYang commented on PR #36856: URL: https://github.com/apache/spark/pull/36856#issuecomment-1154677592 Are there any other similar cases? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] dongjoon-hyun commented on pull request #36863: [SPARK-39459][CORE] `localHostName*` methods should support `IPv6`
dongjoon-hyun commented on PR #36863: URL: https://github.com/apache/spark/pull/36863#issuecomment-1154672837 cc @dbtsai -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] srowen commented on a diff in pull request #36856: [SPARK-39455][SQL] Improve expression non-codegen code path performance by cache data type matching
srowen commented on code in PR #36856: URL: https://github.com/apache/spark/pull/36856#discussion_r896341536 ## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/arithmetic.scala: ## @@ -53,6 +53,17 @@ case class UnaryMinus( override def toString: String = s"-$child" private lazy val numeric = TypeUtils.getNumeric(dataType, failOnError) + private lazy val unaryMinusFunc: Any => Any = dataType match { Review Comment: Oh I see now it depends on dataType. I understand what this does now. How many other places would you have to change though to cover all functions? or are these the important ones? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] dongjoon-hyun opened a new pull request, #36863: [SPARK-39459][CORE] localHostName* methods should support IPv6
dongjoon-hyun opened a new pull request, #36863: URL: https://github.com/apache/spark/pull/36863 ### What changes were proposed in this pull request? ### Why are the changes needed? ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] ulysses-you commented on a diff in pull request #36856: [SPARK-39455][SQL] Improve expression non-codegen code path performance by cache data type matching
ulysses-you commented on code in PR #36856: URL: https://github.com/apache/spark/pull/36856#discussion_r89692 ## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/arithmetic.scala: ## @@ -53,6 +53,17 @@ case class UnaryMinus( override def toString: String = s"-$child" private lazy val numeric = TypeUtils.getNumeric(dataType, failOnError) + private lazy val unaryMinusFunc: Any => Any = dataType match { Review Comment: Do you mean rewriting it to this?

```scala
private def unaryMinusFunc: Any => Any = dataType match {
  ..
```

The reason I pulled this out and made it a lazy val is that the data type is known before `eval` runs on an expression. Let's say the data type is integer:

- then the function can be eliminated to `input => numeric.negate(input)` during execution
- if you declare it as a function, then the following would run during execution:

```scala
dataType match {
  case _ => input => numeric.negate(input)
```

The overhead is not about creating a function, but about eliminating the data type matching. So I think the lazy val is more efficient? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
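A minimal, self-contained sketch of the caching pattern described above (illustrative only, not the actual `UnaryMinus` code): the `DataType` match is resolved once when the lazy val is first forced, so per-row evaluation is a single function call rather than a match per row.

```scala
import org.apache.spark.sql.types._

class UnaryMinusLike(dataType: DataType) {
  // Matched once, on first use, instead of once per input row.
  private lazy val func: Any => Any = dataType match {
    case IntegerType => v => -v.asInstanceOf[Int]
    case LongType    => v => -v.asInstanceOf[Long]
    case DoubleType  => v => -v.asInstanceOf[Double]
    case other       => throw new IllegalArgumentException(s"Unsupported type: $other")
  }

  def eval(input: Any): Any = func(input)
}
```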
[GitHub] [spark] dongjoon-hyun closed pull request #36862: [SPARK-39461][INFRA] Print `SPARK_LOCAL_(HOSTNAME|IP)` in `build/(mvn|sbt)`
dongjoon-hyun closed pull request #36862: [SPARK-39461][INFRA] Print `SPARK_LOCAL_(HOSTNAME|IP)` in `build/(mvn|sbt)` URL: https://github.com/apache/spark/pull/36862 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] dongjoon-hyun commented on pull request #36862: [SPARK-39461][INFRA] Print `SPARK_LOCAL_(HOSTNAME|IP)` in `build/(mvn|sbt)`
dongjoon-hyun commented on PR #36862: URL: https://github.com/apache/spark/pull/36862#issuecomment-1154655021 Thank you so much, @sunchao ! Merged to master. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] mridulm commented on pull request #35683: [SPARK-30835][SPARK-39018][CORE][YARN] Add support for YARN decommissioning when ESS is disabled
mridulm commented on PR #35683: URL: https://github.com/apache/spark/pull/35683#issuecomment-1154652812 Merged to master. Thanks for working on this @abhishekd0907 ! Thanks for the review @attilapiros :-) -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] mridulm closed pull request #35683: [SPARK-30835][SPARK-39018][CORE][YARN] Add support for YARN decommissioning when ESS is disabled
mridulm closed pull request #35683: [SPARK-30835][SPARK-39018][CORE][YARN] Add support for YARN decommissioning when ESS is disabled URL: https://github.com/apache/spark/pull/35683 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] srowen commented on a diff in pull request #36856: [SPARK-39455][SQL] Improve expression non-codegen code path performance by cache data type matching
srowen commented on code in PR #36856: URL: https://github.com/apache/spark/pull/36856#discussion_r896325472 ## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/arithmetic.scala: ## @@ -53,6 +53,17 @@ case class UnaryMinus( override def toString: String = s"-$child" private lazy val numeric = TypeUtils.getNumeric(dataType, failOnError) + private lazy val unaryMinusFunc: Any => Any = dataType match { Review Comment: I mean, why declare it this way instead of a function? I don't see what difference it makes. There is nothing expensive about creating it, so why lazy? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] beliefer commented on pull request #36830: [SPARK-39453][SQL] DS V2 supports push down misc non-aggregate functions(non ANSI)
beliefer commented on PR #36830: URL: https://github.com/apache/spark/pull/36830#issuecomment-1154641204 > So the statement here [#36039 (comment)](https://github.com/apache/spark/pull/36039#issuecomment-1089567136) is not true any more, we will push down both ANSI functions and commonly used non-ANSI functions? Yes. The ANSI functions are too few, and the commonly used non-ANSI functions are also needed. This came out of a discussion between @cloud-fan and me. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] huaxingao commented on pull request #36830: [SPARK-39453][SQL] DS V2 supports push down misc non-aggregate functions(non ANSI)
huaxingao commented on PR #36830: URL: https://github.com/apache/spark/pull/36830#issuecomment-1154634939 So the statement here https://github.com/apache/spark/pull/36039#issuecomment-1089567136 is not true any more, we will push down both ANSI functions and commonly used non-ANSI functions? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] AngersZhuuuu commented on pull request #36564: [SPARK-39195][SQL] Spark OutputCommitCoordinator should abort stage when committed file not consistent with task status
AngersZh commented on PR #36564: URL: https://github.com/apache/spark/pull/36564#issuecomment-1154633468 > LGTM if tests pass GA failed not related to this PR ``` __w/spark/spark/docs/_plugins/copy_api_dirs.rb:130:in `': Python doc generation failed (RuntimeError) 2022-06-11T07:16:53.8252035Z from /__w/spark/spark/docs/.local_ruby_bundle/ruby/2.7.0/gems/jekyll-4.2.1/lib/jekyll/external.rb:60:in `require' 2022-06-11T07:16:53.8252686Z from /__w/spark/spark/docs/.local_ruby_bundle/ruby/2.7.0/gems/jekyll-4.2.1/lib/jekyll/external.rb:60:in `block in require_with_graceful_fail' 2022-06-11T07:16:53.8253306Z from /__w/spark/spark/docs/.local_ruby_bundle/ruby/2.7.0/gems/jekyll-4.2.1/lib/jekyll/external.rb:57:in `each' 2022-06-11T07:16:53.8253947Z from /__w/spark/spark/docs/.local_ruby_bundle/ruby/2.7.0/gems/jekyll-4.2.1/lib/jekyll/external.rb:57:in `require_with_graceful_fail' 2022-06-11T07:16:53.8254613Z from /__w/spark/spark/docs/.local_ruby_bundle/ruby/2.7.0/gems/jekyll-4.2.1/lib/jekyll/plugin_manager.rb:89:in `block in require_plugin_files' 2022-06-11T07:16:53.8255235Z from /__w/spark/spark/docs/.local_ruby_bundle/ruby/2.7.0/gems/jekyll-4.2.1/lib/jekyll/plugin_manager.rb:87:in `each' 2022-06-11T07:16:53.8255870Z from /__w/spark/spark/docs/.local_ruby_bundle/ruby/2.7.0/gems/jekyll-4.2.1/lib/jekyll/plugin_manager.rb:87:in `require_plugin_files' 2022-06-11T07:16:53.8256520Z from /__w/spark/spark/docs/.local_ruby_bundle/ruby/2.7.0/gems/jekyll-4.2.1/lib/jekyll/plugin_manager.rb:21:in `conscientious_require' 2022-06-11T07:16:53.8257116Z from /__w/spark/spark/docs/.local_ruby_bundle/ruby/2.7.0/gems/jekyll-4.2.1/lib/jekyll/site.rb:131:in `setup' 2022-06-11T07:16:53.8257695Z from /__w/spark/spark/docs/.local_ruby_bundle/ruby/2.7.0/gems/jekyll-4.2.1/lib/jekyll/site.rb:36:in `initialize' 2022-06-11T07:16:53.8258278Z from /__w/spark/spark/docs/.local_ruby_bundle/ruby/2.7.0/gems/jekyll-4.2.1/lib/jekyll/commands/build.rb:30:in `new' 2022-06-11T07:16:53.825Z from /__w/spark/spark/docs/.local_ruby_bundle/ruby/2.7.0/gems/jekyll-4.2.1/lib/jekyll/commands/build.rb:30:in `process' 2022-06-11T07:16:53.8259731Z from /__w/spark/spark/docs/.local_ruby_bundle/ruby/2.7.0/gems/jekyll-4.2.1/lib/jekyll/command.rb:91:in `block in process_with_graceful_fail' 2022-06-11T07:16:53.8260335Z from /__w/spark/spark/docs/.local_ruby_bundle/ruby/2.7.0/gems/jekyll-4.2.1/lib/jekyll/command.rb:91:in `each' 2022-06-11T07:16:53.8260952Z from /__w/spark/spark/docs/.local_ruby_bundle/ruby/2.7.0/gems/jekyll-4.2.1/lib/jekyll/command.rb:91:in `process_with_graceful_fail' 2022-06-11T07:16:53.8261625Z from /__w/spark/spark/docs/.local_ruby_bundle/ruby/2.7.0/gems/jekyll-4.2.1/lib/jekyll/commands/build.rb:18:in `block (2 levels) in init_with_program' 2022-06-11T07:16:53.8262274Z from /__w/spark/spark/docs/.local_ruby_bundle/ruby/2.7.0/gems/mercenary-0.4.0/lib/mercenary/command.rb:221:in `block in execute' 2022-06-11T07:16:53.8262921Z from /__w/spark/spark/docs/.local_ruby_bundle/ruby/2.7.0/gems/mercenary-0.4.0/lib/mercenary/command.rb:221:in `each' 2022-06-11T07:16:53.8263524Z from /__w/spark/spark/docs/.local_ruby_bundle/ruby/2.7.0/gems/mercenary-0.4.0/lib/mercenary/command.rb:221:in `execute' 2022-06-11T07:16:53.8264134Z from /__w/spark/spark/docs/.local_ruby_bundle/ruby/2.7.0/gems/mercenary-0.4.0/lib/mercenary/program.rb:44:in `go' 2022-06-11T07:16:53.8264718Z from /__w/spark/spark/docs/.local_ruby_bundle/ruby/2.7.0/gems/mercenary-0.4.0/lib/mercenary.rb:21:in `program' 2022-06-11T07:16:53.8265284Z from 
/__w/spark/spark/docs/.local_ruby_bundle/ruby/2.7.0/gems/jekyll-4.2.1/exe/jekyll:15:in `' 2022-06-11T07:16:53.8265698Z from /__w/spark/spark/docs/.local_ruby_bundle/ruby/2.7.0/bin/jekyll:23:in `load' 2022-06-11T07:16:53.8266159Z from /__w/spark/spark/docs/.local_ruby_bundle/ruby/2.7.0/bin/jekyll:23:in `' ``` -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] AngersZhuuuu commented on a diff in pull request #36564: [SPARK-39195][SQL] Spark OutputCommitCoordinator should abort stage when committed file not consistent with task status
AngersZh commented on code in PR #36564: URL: https://github.com/apache/spark/pull/36564#discussion_r896311302 ## core/src/test/scala/org/apache/spark/scheduler/OutputCommitCoordinatorSuite.scala: ## @@ -270,6 +263,16 @@ class OutputCommitCoordinatorSuite extends SparkFunSuite with BeforeAndAfter { .stageStart(meq(retriedStage.head), any()) verify(sc.env.outputCommitCoordinator).stageEnd(meq(retriedStage.head)) } + + test("SPARK-39195: Spark OutputCommitCoordinator should abort stage " + Review Comment: Removed -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] ulysses-you commented on a diff in pull request #36785: [SPARK-39397][SQL] Relax AliasAwareOutputExpression to support alias with expression
ulysses-you commented on code in PR #36785: URL: https://github.com/apache/spark/pull/36785#discussion_r896303277 ## sql/core/src/main/scala/org/apache/spark/sql/execution/AliasAwareOutputExpression.scala: ## @@ -25,15 +25,15 @@ import org.apache.spark.sql.catalyst.plans.physical.{HashPartitioning, Partition trait AliasAwareOutputExpression extends UnaryExecNode { protected def outputExpressions: Seq[NamedExpression] - private lazy val aliasMap = AttributeMap(outputExpressions.collect { -case a @ Alias(child: AttributeReference, _) => (child, a.toAttribute) - }) + private lazy val aliasMap = outputExpressions.collect { +case a @ Alias(child, _) => child.canonicalized -> a.toAttribute + }.toMap protected def hasAlias: Boolean = aliasMap.nonEmpty protected def normalizeExpression(exp: Expression): Expression = { exp.transform { Review Comment: changed -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
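For readers following this thread, a standalone sketch of the normalization idea in the diff above, in plain Scala rather than the Catalyst types (the `Expr` class and its string-based canonical form are illustrative stand-ins, not the PR's implementation):
```scala
// Toy stand-in for Catalyst expressions: "canonicalized" here is just a
// whitespace/case-insensitive string form, purely for illustration.
case class Expr(repr: String) {
  def canonicalized: String = repr.replaceAll("\\s+", "").toLowerCase
}

object AliasNormalizationSketch extends App {
  // Output expressions that alias an arbitrary child expression, as in the diff.
  val outputAliases: Seq[(Expr, String)] =
    Seq(Expr("a + b") -> "sum_ab", Expr("c") -> "c_out")

  // Keyed by the canonical form of the child, mirroring `child.canonicalized -> a.toAttribute`.
  val aliasMap: Map[String, String] =
    outputAliases.map { case (child, name) => child.canonicalized -> name }.toMap

  // Later expressions are rewritten to the alias when a canonical match exists.
  def normalize(e: Expr): String = aliasMap.getOrElse(e.canonicalized, e.repr)

  println(normalize(Expr("A+B"))) // sum_ab: matches "a + b" after canonicalization
  println(normalize(Expr("d")))   // d: no alias, left as is
}
```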
[GitHub] [spark] ulysses-you commented on a diff in pull request #36856: [SPARK-39455][SQL] Improve expression non-codegen code path performance by cache data type matching
ulysses-you commented on code in PR #36856: URL: https://github.com/apache/spark/pull/36856#discussion_r896301945 ## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/arithmetic.scala: ## @@ -53,6 +53,17 @@ case class UnaryMinus( override def toString: String = s"-$child" private lazy val numeric = TypeUtils.getNumeric(dataType, failOnError) + private lazy val unaryMinusFunc: Any => Any = dataType match { Review Comment: hmm, this would be called inside `eval`, so a cached function is expected, to avoid re-doing the data type match for every row -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
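To illustrate the caching point, here is a minimal standalone sketch of the pattern (the types and names are illustrative, not the actual `UnaryMinus` internals): the data type is matched once in a lazy val, and `eval` only applies the cached function per row.
```scala
sealed trait DataType
case object IntType extends DataType
case object DoubleType extends DataType

class UnaryMinusLike(dataType: DataType) {
  // Resolved once per expression instance, not on every call to eval().
  private lazy val negateFunc: Any => Any = dataType match {
    case IntType    => v => -(v.asInstanceOf[Int])
    case DoubleType => v => -(v.asInstanceOf[Double])
  }

  // Per-row path: no type matching here, just the cached function.
  def eval(input: Any): Any = negateFunc(input)
}

object CachedFuncSketch extends App {
  val intMinus = new UnaryMinusLike(IntType)
  println(intMinus.eval(5))  // -5
  println(intMinus.eval(42)) // -42, reuses the cached function
}
```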
[GitHub] [spark] xiuzhu9527 commented on pull request #36784: [SPARK-39396][SQL] Fix LDAP login exception 'error code 49 - invalid credentials'
xiuzhu9527 commented on PR #36784: URL: https://github.com/apache/spark/pull/36784#issuecomment-1154621166 @wangyum @pan3793 Hi, what should we do next? I am still unsure whether this problem is something we are allowed to fix -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] AmplabJenkins commented on pull request #36859: DTW: new distance measure for clustering
AmplabJenkins commented on PR #36859: URL: https://github.com/apache/spark/pull/36859#issuecomment-1154620125 Can one of the admins verify this patch? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] AmplabJenkins commented on pull request #36861: [SPARK-38796][SQL] Update to_number and try_to_number functions to allow PR with positive numbers
AmplabJenkins commented on PR #36861: URL: https://github.com/apache/spark/pull/36861#issuecomment-1154620110 Can one of the admins verify this patch? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] cloud-fan commented on a diff in pull request #36740: [SPARK-39355][SQL] Avoid UnresolvedAttribute.apply throwing ParseException
cloud-fan commented on code in PR #36740: URL: https://github.com/apache/spark/pull/36740#discussion_r896290278 ## sql/core/src/test/scala/org/apache/spark/sql/SubquerySuite.scala: ## @@ -2176,4 +2176,32 @@ class SubquerySuite extends QueryTest with SharedSparkSession with AdaptiveSpark } } } + + test("SPARK-39355: Avoid UnresolvedAttribute.apply throwing ParseException") { +checkAnswer( + sql(""" +|SELECT * Review Comment: I think we should fix `Alias.toAttribute`. It should create an attribute to reference the output of this alias, which is a single column, and we need to use `UnresolvedAttribute.quoted`. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
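For context on the `UnresolvedAttribute.quoted` suggestion, a small sketch of the difference between the two constructors (assumes a `spark-catalyst` dependency on the classpath; illustrative only, not the actual fix in the PR):
```scala
import org.apache.spark.sql.catalyst.analysis.UnresolvedAttribute

object QuotedVsApplySketch extends App {
  // apply() parses the name, so a dot splits it into multiple name parts
  // (and unusual names can make the parsing throw, which is what SPARK-39355 hit).
  println(UnresolvedAttribute("a.b").nameParts)        // List(a, b)

  // quoted() treats the whole string as a single name part, which is what
  // referencing the single-column output of an alias needs.
  println(UnresolvedAttribute.quoted("a.b").nameParts) // List(a.b)
}
```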
[GitHub] [spark] cloud-fan commented on a diff in pull request #36641: [SPARK-39263][SQL] Make GetTable, TableExists and DatabaseExists be compatible with 3 layer namespace
cloud-fan commented on code in PR #36641: URL: https://github.com/apache/spark/pull/36641#discussion_r896288305 ## sql/core/src/main/scala/org/apache/spark/sql/internal/CatalogImpl.scala: ## @@ -287,16 +294,44 @@ class CatalogImpl(sparkSession: SparkSession) extends Catalog { * Checks if the database with the specified name exists. */ override def databaseExists(dbName: String): Boolean = { -sessionCatalog.databaseExists(dbName) +// To maintain backwards compatibility, we first treat the input is a simple dbName and check +// if sessionCatalog contains it. If no, we try to parse it as 3 part name. If the parased +// identifier contains both catalog name and database name, we then search the database in the +// catalog. +if (!sessionCatalog.databaseExists(dbName)) { + val ident = sparkSession.sessionState.sqlParser.parseMultipartIdentifier(dbName) + if (ident.length == 2) { Review Comment: We can follow what we did in `makeTable`: Create a `UnresolvedNamespace` logical plan and resolve it. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] panbingkun commented on pull request #36852: [SPARK-38700][SQL][3.3] Use error classes in the execution errors of save mode
panbingkun commented on PR #36852: URL: https://github.com/apache/spark/pull/36852#issuecomment-1154604668 > @panbingkun Could you fix the test failure, please: > > ``` > QueryExecutionErrorsSuite.UNSUPPORTED_SAVE_MODE: unsupported null saveMode whether the path exists or not > org.scalatest.exceptions.TestFailedException: "... supported for: a no[n-]existent path." did not equal "... supported for: a no[t ]existent > ``` Done @MaxGekk -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] zhouyejoe commented on a diff in pull request #35906: [SPARK-33236][shuffle] Enable Push-based shuffle service to store state in NM level DB for work preserving restart
zhouyejoe commented on code in PR #35906: URL: https://github.com/apache/spark/pull/35906#discussion_r896287031 ## common/network-shuffle/src/main/java/org/apache/spark/network/shuffle/RemoteBlockPushResolver.java: ## @@ -317,22 +353,24 @@ public void applicationRemoved(String appId, boolean cleanupLocalDirs) { logger.info("Application {} removed, cleanupLocalDirs = {}", appId, cleanupLocalDirs); AppShuffleInfo appShuffleInfo = appsShuffleInfo.remove(appId); if (null != appShuffleInfo) { - mergedShuffleCleaner.execute( -() -> closeAndDeletePartitionFilesIfNeeded(appShuffleInfo, cleanupLocalDirs)); + submitCleanupTask( +() -> closeAndDeletePartitions(appShuffleInfo, cleanupLocalDirs, true)); } +removeAppAttemptPathInfoFromDB( +new AppAttemptId(appShuffleInfo.appId, appShuffleInfo.attemptId)); Review Comment: Updated as suggested -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] zhouyejoe commented on a diff in pull request #35906: [SPARK-33236][shuffle] Enable Push-based shuffle service to store state in NM level DB for work preserving restart
zhouyejoe commented on code in PR #35906: URL: https://github.com/apache/spark/pull/35906#discussion_r896286905 ## common/network-shuffle/src/main/java/org/apache/spark/network/shuffle/RemoteBlockPushResolver.java: ## @@ -655,6 +742,206 @@ public void registerExecutor(String appId, ExecutorShuffleInfo executorInfo) { } } + + @Override + public void close() { +if (db != null) { + try { +db.close(); + } catch (IOException e) { +logger.error("Exception closing leveldb with registered app paths info and " ++ "shuffle partition info", e); + } +} + } + + private void writeAppPathsInfoToDb(String appId, int attemptId, AppPathsInfo appPathsInfo) { +if (db != null) { + try { +byte[] key = getDbAppAttemptPathsKey(new AppAttemptId(appId, attemptId)); +String valueStr = mapper.writeValueAsString(appPathsInfo); +byte[] value = valueStr.getBytes(StandardCharsets.UTF_8); +db.put(key, value); + } catch (Exception e) { +logger.error("Error saving registered app paths info", e); + } +} + } + + private void writeAppAttemptShuffleMergeInfoToDB( + String appId, + int appAttemptId, + int shuffleId, + int shuffleMergeId) { +if (db != null) { + // Write AppAttemptShuffleMergeId into LevelDB for finalized shuffles + try{ +byte[] dbKey = getDbAppAttemptShufflePartitionKey( +new AppAttemptShuffleMergeId(appId, appAttemptId, shuffleId, shuffleMergeId)); +db.put(dbKey, new byte[0]); + } catch (Exception e) { +logger.error("Error saving active app shuffle partition", e); + } +} + + } + + private T parseDbKey(String key, String prefix, Class valueType) { +try { + String json = key.substring(prefix.length() + 1); + return mapper.readValue(json, valueType); +} catch (Exception exception) { + logger.error("Exception while parsing the DB key {}", key); + return null; +} + } + + private AppPathsInfo parseDBAppAttemptPathsValue(byte[] value, AppAttemptId appAttemptId) { +try { + return mapper.readValue(value, AppPathsInfo.class); +} catch (Exception exception) { + logger.error("Exception while parsing the DB value for {}", appAttemptId); + return null; +} + } + + private AppAttemptId parseDbAppAttemptPathsKey(String key) { +return parseDbKey(key, APP_ATTEMPT_PATH_KEY_PREFIX, AppAttemptId.class); + } + + private AppAttemptShuffleMergeId parseDbAppAttemptShufflePartitionKey( + String key, + String prefix) { +return parseDbKey(key, prefix, AppAttemptShuffleMergeId.class); + } + + private byte[] getDbKey(Object key, String prefix) { +// We add a common prefix on all the keys so we can find them in the DB +try { + String keyJsonString = prefix + DB_KEY_DELIMITER + mapper.writeValueAsString(key); + return keyJsonString.getBytes(StandardCharsets.UTF_8); +} catch (Exception exception) { + logger.error("Exception while generating the DB key {}", key); + return null; +} + } + + private byte[] getDbAppAttemptShufflePartitionKey( + AppAttemptShuffleMergeId appAttemptShuffleMergeId) { +return getDbKey(appAttemptShuffleMergeId, APP_ATTEMPT_SHUFFLE_FINALIZE_STATUS_KEY_PREFIX); + } + + private byte[] getDbAppAttemptPathsKey(AppAttemptId appAttemptId){ +return getDbKey(appAttemptId, APP_ATTEMPT_PATH_KEY_PREFIX); + } + + @VisibleForTesting + void reloadAndCleanUpAppShuffleInfo(DB db) { +logger.info("Reload applications merged shuffle information from DB"); +List dbKeysToBeRemoved = reloadActiveAppAttemptsPathInfo(db); +dbKeysToBeRemoved.addAll(reloadFinalizedAppAttemptsShuffleMergeInfo(db)); +// Clean up invalid data stored in DB +submitCleanupTask(() -> +dbKeysToBeRemoved.forEach( +(key) -> { + try { +db.delete(key); + } catch (Exception e) { 
+logger.error("Error deleting data in DB", e); + } +} +) +); + } + + private List reloadActiveAppAttemptsPathInfo(DB db) { +List dbKeysToBeRemoved = new ArrayList<>(); +if (db != null) { + DBIterator itr = db.iterator(); + itr.seek(APP_ATTEMPT_PATH_KEY_PREFIX.getBytes(StandardCharsets.UTF_8)); + while (itr.hasNext()) { +Map.Entry e = itr.next(); +String key = new String(e.getKey(), StandardCharsets.UTF_8); +if (!key.startsWith(APP_ATTEMPT_PATH_KEY_PREFIX)) { + break; +} +AppAttemptId appAttemptId = parseDbAppAttemptPathsKey(key); +if (appAttemptId == null) { + dbKeysToBeRemoved.add(e.getKey()); + break; +} +AppPathsInfo appPathsInfo = parseDBAppAttemptPathsValue(e.getValue(), appAttemptId); +if (appPathsInfo == null) { Review Comment: Removed. ## common/network-shuffle/sr
[GitHub] [spark] cloud-fan commented on a diff in pull request #36641: [SPARK-39263][SQL] Make GetTable, TableExists and DatabaseExists be compatible with 3 layer namespace
cloud-fan commented on code in PR #36641: URL: https://github.com/apache/spark/pull/36641#discussion_r896286734 ## sql/core/src/main/scala/org/apache/spark/sql/internal/CatalogImpl.scala: ## @@ -250,8 +251,18 @@ class CatalogImpl(sparkSession: SparkSession) extends Catalog { * table/view. This throws an `AnalysisException` when no `Table` can be found. */ override def getTable(tableName: String): Table = { -val tableIdent = sparkSession.sessionState.sqlParser.parseTableIdentifier(tableName) -getTable(tableIdent.database.orNull, tableIdent.table) +// calling `sqlParser.parseTableIdentifier` to parse tableName. If it contains only table name +// and optionally contains a database name(thus a TableIdentifier), then that is used to get +// the table. Otherwise we try `sqlParser.parseMultipartIdentifier` to have a sequence of string Review Comment: And then one idea is, if we can't find this table in HMS, shall we fall back to looking up the table via the v2 code path? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] zhouyejoe commented on a diff in pull request #35906: [SPARK-33236][shuffle] Enable Push-based shuffle service to store state in NM level DB for work preserving restart
zhouyejoe commented on code in PR #35906: URL: https://github.com/apache/spark/pull/35906#discussion_r896286777 ## common/network-shuffle/src/main/java/org/apache/spark/network/shuffle/RemoteBlockPushResolver.java: ## @@ -643,8 +725,13 @@ public void registerExecutor(String appId, ExecutorShuffleInfo executorInfo) { AppShuffleInfo appShuffleInfo = originalAppShuffleInfo.get(); logger.warn("Cleanup shuffle info and merged shuffle files for {}_{} as new " + "application attempt registered", appId, appShuffleInfo.attemptId); -mergedShuffleCleaner.execute( - () -> closeAndDeletePartitionFilesIfNeeded(appShuffleInfo, true)); +// Clean up all the merge shuffle related information in the DB for the former attempt +submitCleanupTask( + () -> { +removeAppAttemptPathInfoFromDB(new AppAttemptId(appId, appShuffleInfo.attemptId)); Review Comment: Updated to keep the DB and the in-memory hashmap strongly consistent. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] zhouyejoe commented on a diff in pull request #35906: [SPARK-33236][shuffle] Enable Push-based shuffle service to store state in NM level DB for work preserving restart
zhouyejoe commented on code in PR #35906: URL: https://github.com/apache/spark/pull/35906#discussion_r896286599 ## common/network-shuffle/src/main/java/org/apache/spark/network/shuffle/RemoteBlockPushResolver.java: ## @@ -342,6 +380,33 @@ void closeAndDeletePartitionFilesIfNeeded( if (cleanupLocalDirs) { deleteExecutorDirs(appShuffleInfo); } +if (removeFromDb){ + removeAppShuffleInfoFromDB(appShuffleInfo); +} + } + + private void removeAppAttemptPathInfoFromDB(AppAttemptId appAttemptId) { +if (db != null) { + try { +db.delete(getDbAppAttemptPathsKey(appAttemptId)); + } catch (Exception e) { +logger.error("Error deleting {} from application paths info in DB", appAttemptId, e); + } +} + } + + private void removeAppShuffleInfoFromDB(AppShuffleInfo appShuffleInfo) { +if (db != null) { + appShuffleInfo.shuffles +.forEach((shuffleId, shuffleInfo) -> shuffleInfo.shuffleMergePartitions + .forEach((shuffleMergeId, partitionInfo) -> { +synchronized (partitionInfo) { + removeAppShufflePartitionInfoFromDB( +new AppAttemptShuffleMergeId( + appShuffleInfo.appId, appShuffleInfo.attemptId, shuffleId, shuffleMergeId)); +} Review Comment: Refactored the code -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] cloud-fan commented on a diff in pull request #36641: [SPARK-39263][SQL] Make GetTable, TableExists and DatabaseExists be compatible with 3 layer namespace
cloud-fan commented on code in PR #36641: URL: https://github.com/apache/spark/pull/36641#discussion_r896286460 ## sql/core/src/main/scala/org/apache/spark/sql/internal/CatalogImpl.scala: ## @@ -250,8 +251,18 @@ class CatalogImpl(sparkSession: SparkSession) extends Catalog { * table/view. This throws an `AnalysisException` when no `Table` can be found. */ override def getTable(tableName: String): Table = { -val tableIdent = sparkSession.sessionState.sqlParser.parseTableIdentifier(tableName) -getTable(tableIdent.database.orNull, tableIdent.table) +// calling `sqlParser.parseTableIdentifier` to parse tableName. If it contains only table name +// and optionally contains a database name(thus a TableIdentifier), then that is used to get +// the table. Otherwise we try `sqlParser.parseMultipartIdentifier` to have a sequence of string Review Comment: let's be more specific about the semantics. If the table name has less than 3 parts, we always get the table from HMS first. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
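Putting cloud-fan's two comments together, the proposed semantics could be sketched as follows (the helpers `lookupInHiveMetastore` and `lookupViaV2Catalog` are hypothetical placeholders, not real `CatalogImpl` methods; this only illustrates the lookup order, not the PR's code):
```scala
object GetTableSemanticsSketch {
  // Hypothetical placeholders for the two lookup paths.
  def lookupInHiveMetastore(parts: Seq[String]): Option[String] = None
  def lookupViaV2Catalog(parts: Seq[String]): Option[String] = None

  def getTableWithFallback(nameParts: Seq[String]): Option[String] =
    if (nameParts.length < 3) {
      // Fewer than 3 parts: always try HMS first, then fall back to the v2 code path.
      lookupInHiveMetastore(nameParts).orElse(lookupViaV2Catalog(nameParts))
    } else {
      // A full catalog.db.table name goes straight to the v2 code path.
      lookupViaV2Catalog(nameParts)
    }
}
```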
[GitHub] [spark] zhouyejoe commented on a diff in pull request #35906: [SPARK-33236][shuffle] Enable Push-based shuffle service to store state in NM level DB for work preserving restart
zhouyejoe commented on code in PR #35906: URL: https://github.com/apache/spark/pull/35906#discussion_r896286443 ## common/network-shuffle/src/main/java/org/apache/spark/network/shuffle/RemoteBlockPushResolver.java: ## @@ -655,6 +742,206 @@ public void registerExecutor(String appId, ExecutorShuffleInfo executorInfo) { } } + + @Override + public void close() { +if (db != null) { + try { +db.close(); + } catch (IOException e) { +logger.error("Exception closing leveldb with registered app paths info and " ++ "shuffle partition info", e); + } +} + } + + private void writeAppPathsInfoToDb(String appId, int attemptId, AppPathsInfo appPathsInfo) { +if (db != null) { + try { +byte[] key = getDbAppAttemptPathsKey(new AppAttemptId(appId, attemptId)); +String valueStr = mapper.writeValueAsString(appPathsInfo); +byte[] value = valueStr.getBytes(StandardCharsets.UTF_8); +db.put(key, value); + } catch (Exception e) { +logger.error("Error saving registered app paths info", e); + } +} + } + + private void writeAppAttemptShuffleMergeInfoToDB( + String appId, + int appAttemptId, + int shuffleId, + int shuffleMergeId) { +if (db != null) { + // Write AppAttemptShuffleMergeId into LevelDB for finalized shuffles + try{ +byte[] dbKey = getDbAppAttemptShufflePartitionKey( +new AppAttemptShuffleMergeId(appId, appAttemptId, shuffleId, shuffleMergeId)); +db.put(dbKey, new byte[0]); + } catch (Exception e) { +logger.error("Error saving active app shuffle partition", e); + } +} + + } + + private T parseDbKey(String key, String prefix, Class valueType) { +try { + String json = key.substring(prefix.length() + 1); + return mapper.readValue(json, valueType); +} catch (Exception exception) { + logger.error("Exception while parsing the DB key {}", key); + return null; +} + } + + private AppPathsInfo parseDBAppAttemptPathsValue(byte[] value, AppAttemptId appAttemptId) { +try { + return mapper.readValue(value, AppPathsInfo.class); +} catch (Exception exception) { + logger.error("Exception while parsing the DB value for {}", appAttemptId); + return null; +} + } + + private AppAttemptId parseDbAppAttemptPathsKey(String key) { +return parseDbKey(key, APP_ATTEMPT_PATH_KEY_PREFIX, AppAttemptId.class); + } + + private AppAttemptShuffleMergeId parseDbAppAttemptShufflePartitionKey( + String key, + String prefix) { +return parseDbKey(key, prefix, AppAttemptShuffleMergeId.class); + } + + private byte[] getDbKey(Object key, String prefix) { +// We add a common prefix on all the keys so we can find them in the DB +try { + String keyJsonString = prefix + DB_KEY_DELIMITER + mapper.writeValueAsString(key); + return keyJsonString.getBytes(StandardCharsets.UTF_8); +} catch (Exception exception) { + logger.error("Exception while generating the DB key {}", key); + return null; +} + } + + private byte[] getDbAppAttemptShufflePartitionKey( + AppAttemptShuffleMergeId appAttemptShuffleMergeId) { +return getDbKey(appAttemptShuffleMergeId, APP_ATTEMPT_SHUFFLE_FINALIZE_STATUS_KEY_PREFIX); + } + + private byte[] getDbAppAttemptPathsKey(AppAttemptId appAttemptId){ +return getDbKey(appAttemptId, APP_ATTEMPT_PATH_KEY_PREFIX); + } + + @VisibleForTesting + void reloadAndCleanUpAppShuffleInfo(DB db) { +logger.info("Reload applications merged shuffle information from DB"); +List dbKeysToBeRemoved = reloadActiveAppAttemptsPathInfo(db); +dbKeysToBeRemoved.addAll(reloadFinalizedAppAttemptsShuffleMergeInfo(db)); +// Clean up invalid data stored in DB +submitCleanupTask(() -> +dbKeysToBeRemoved.forEach( +(key) -> { + try { +db.delete(key); + } catch (Exception e) { 
+logger.error("Error deleting data in DB", e); + } +} +) +); + } + + private List reloadActiveAppAttemptsPathInfo(DB db) { +List dbKeysToBeRemoved = new ArrayList<>(); +if (db != null) { + DBIterator itr = db.iterator(); + itr.seek(APP_ATTEMPT_PATH_KEY_PREFIX.getBytes(StandardCharsets.UTF_8)); + while (itr.hasNext()) { +Map.Entry e = itr.next(); +String key = new String(e.getKey(), StandardCharsets.UTF_8); +if (!key.startsWith(APP_ATTEMPT_PATH_KEY_PREFIX)) { + break; +} +AppAttemptId appAttemptId = parseDbAppAttemptPathsKey(key); +if (appAttemptId == null) { + dbKeysToBeRemoved.add(e.getKey()); + break; +} +AppPathsInfo appPathsInfo = parseDBAppAttemptPathsValue(e.getValue(), appAttemptId); +if (appPathsInfo == null) { + dbKeysToBeRemoved.add(e.getKey()); + break; +
[GitHub] [spark] zhouyejoe commented on a diff in pull request #35906: [SPARK-33236][shuffle] Enable Push-based shuffle service to store state in NM level DB for work preserving restart
zhouyejoe commented on code in PR #35906: URL: https://github.com/apache/spark/pull/35906#discussion_r896286390 ## common/network-shuffle/src/main/java/org/apache/spark/network/shuffle/RemoteBlockPushResolver.java: ## @@ -655,6 +742,206 @@ public void registerExecutor(String appId, ExecutorShuffleInfo executorInfo) { } } + + @Override + public void close() { +if (db != null) { + try { +db.close(); + } catch (IOException e) { +logger.error("Exception closing leveldb with registered app paths info and " ++ "shuffle partition info", e); + } +} + } + + private void writeAppPathsInfoToDb(String appId, int attemptId, AppPathsInfo appPathsInfo) { +if (db != null) { + try { +byte[] key = getDbAppAttemptPathsKey(new AppAttemptId(appId, attemptId)); +String valueStr = mapper.writeValueAsString(appPathsInfo); +byte[] value = valueStr.getBytes(StandardCharsets.UTF_8); +db.put(key, value); + } catch (Exception e) { +logger.error("Error saving registered app paths info", e); + } +} + } + + private void writeAppAttemptShuffleMergeInfoToDB( + String appId, + int appAttemptId, + int shuffleId, + int shuffleMergeId) { +if (db != null) { + // Write AppAttemptShuffleMergeId into LevelDB for finalized shuffles + try{ +byte[] dbKey = getDbAppAttemptShufflePartitionKey( +new AppAttemptShuffleMergeId(appId, appAttemptId, shuffleId, shuffleMergeId)); +db.put(dbKey, new byte[0]); + } catch (Exception e) { +logger.error("Error saving active app shuffle partition", e); + } +} + + } + + private T parseDbKey(String key, String prefix, Class valueType) { +try { + String json = key.substring(prefix.length() + 1); + return mapper.readValue(json, valueType); +} catch (Exception exception) { + logger.error("Exception while parsing the DB key {}", key); + return null; +} + } + + private AppPathsInfo parseDBAppAttemptPathsValue(byte[] value, AppAttemptId appAttemptId) { +try { + return mapper.readValue(value, AppPathsInfo.class); +} catch (Exception exception) { + logger.error("Exception while parsing the DB value for {}", appAttemptId); + return null; +} + } + + private AppAttemptId parseDbAppAttemptPathsKey(String key) { +return parseDbKey(key, APP_ATTEMPT_PATH_KEY_PREFIX, AppAttemptId.class); + } + + private AppAttemptShuffleMergeId parseDbAppAttemptShufflePartitionKey( + String key, + String prefix) { +return parseDbKey(key, prefix, AppAttemptShuffleMergeId.class); + } + + private byte[] getDbKey(Object key, String prefix) { +// We add a common prefix on all the keys so we can find them in the DB +try { + String keyJsonString = prefix + DB_KEY_DELIMITER + mapper.writeValueAsString(key); + return keyJsonString.getBytes(StandardCharsets.UTF_8); +} catch (Exception exception) { + logger.error("Exception while generating the DB key {}", key); + return null; +} + } + + private byte[] getDbAppAttemptShufflePartitionKey( + AppAttemptShuffleMergeId appAttemptShuffleMergeId) { +return getDbKey(appAttemptShuffleMergeId, APP_ATTEMPT_SHUFFLE_FINALIZE_STATUS_KEY_PREFIX); + } + + private byte[] getDbAppAttemptPathsKey(AppAttemptId appAttemptId){ +return getDbKey(appAttemptId, APP_ATTEMPT_PATH_KEY_PREFIX); + } + + @VisibleForTesting + void reloadAndCleanUpAppShuffleInfo(DB db) { +logger.info("Reload applications merged shuffle information from DB"); +List dbKeysToBeRemoved = reloadActiveAppAttemptsPathInfo(db); +dbKeysToBeRemoved.addAll(reloadFinalizedAppAttemptsShuffleMergeInfo(db)); +// Clean up invalid data stored in DB +submitCleanupTask(() -> +dbKeysToBeRemoved.forEach( +(key) -> { + try { +db.delete(key); + } catch (Exception e) { 
+logger.error("Error deleting data in DB", e); + } +} +) +); + } + + private List reloadActiveAppAttemptsPathInfo(DB db) { +List dbKeysToBeRemoved = new ArrayList<>(); +if (db != null) { + DBIterator itr = db.iterator(); + itr.seek(APP_ATTEMPT_PATH_KEY_PREFIX.getBytes(StandardCharsets.UTF_8)); + while (itr.hasNext()) { +Map.Entry e = itr.next(); +String key = new String(e.getKey(), StandardCharsets.UTF_8); +if (!key.startsWith(APP_ATTEMPT_PATH_KEY_PREFIX)) { + break; +} +AppAttemptId appAttemptId = parseDbAppAttemptPathsKey(key); +if (appAttemptId == null) { Review Comment: Removed -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries abo
[GitHub] [spark] cloud-fan commented on pull request #36861: [SPARK-38796][SQL] Update to_number and try_to_number functions to allow PR with positive numbers
cloud-fan commented on PR #36861: URL: https://github.com/apache/spark/pull/36861#issuecomment-1154596720 @dtenedor for a bug fix, we should create a new JIRA ticket instead of reusing the original one... -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] cloud-fan closed pull request #36861: [SPARK-38796][SQL] Update to_number and try_to_number functions to allow PR with positive numbers
cloud-fan closed pull request #36861: [SPARK-38796][SQL] Update to_number and try_to_number functions to allow PR with positive numbers URL: https://github.com/apache/spark/pull/36861 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] cloud-fan commented on pull request #36861: [SPARK-38796][SQL] Update to_number and try_to_number functions to allow PR with positive numbers
cloud-fan commented on PR #36861: URL: https://github.com/apache/spark/pull/36861#issuecomment-1154596074 thanks, merging to master/3.3! -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] zhouyejoe commented on a diff in pull request #35906: [SPARK-33236][shuffle] Enable Push-based shuffle service to store state in NM level DB for work preserving restart
zhouyejoe commented on code in PR #35906: URL: https://github.com/apache/spark/pull/35906#discussion_r896281336 ## common/network-yarn/src/main/java/org/apache/spark/network/yarn/YarnShuffleService.java: ## @@ -230,11 +241,14 @@ protected void serviceInit(Configuration externalConf) throws Exception { // when it comes back if (_recoveryPath != null) { registeredExecutorFile = initRecoveryDb(RECOVERY_FILE_NAME); +mergeManagerFile = initRecoveryDb(SPARK_SHUFFLE_MERGE_RECOVERY_FILE_NAME); } - TransportConf transportConf = new TransportConf("shuffle", new HadoopConfigProvider(_conf)); - MergedShuffleFileManager shuffleMergeManager = newMergedShuffleFileManagerInstance( -transportConf); + TransportConf transportConf = new TransportConf("shuffle",new HadoopConfigProvider(_conf)); + if (shuffleMergeManager == null) { Review Comment: Added comments. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] zhouyejoe commented on a diff in pull request #35906: [SPARK-33236][shuffle] Enable Push-based shuffle service to store state in NM level DB for work preserving restart
zhouyejoe commented on code in PR #35906: URL: https://github.com/apache/spark/pull/35906#discussion_r896279364 ## common/network-shuffle/src/main/java/org/apache/spark/network/shuffle/RemoteBlockPushResolver.java: ## @@ -350,15 +415,27 @@ void closeAndDeletePartitionFilesIfNeeded( * up older shuffleMergeId partitions. The cleanup will be executed in a separate thread. */ @VisibleForTesting - void closeAndDeletePartitionFiles(Map partitions) { + void closeAndDeleteOutdatedPartitions(Map partitions) { partitions .forEach((partitionId, partitionInfo) -> { synchronized (partitionInfo) { partitionInfo.closeAllFilesAndDeleteIfNeeded(true); + removeAppShufflePartitionInfoFromDB(partitionInfo.appAttemptShuffleMergeId); Review Comment: Refactored the signature of closeAndDeleteOutdatedPartitions by adding the AppAttemptShuffleMergeId, since each time it is called there will be only one call to remove the finalized partitionInfo from the DB, keyed by AppAttemptShuffleMergeId. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] zhouyejoe commented on a diff in pull request #35906: [SPARK-33236][shuffle] Enable Push-based shuffle service to store state in NM level DB for work preserving restart
zhouyejoe commented on code in PR #35906: URL: https://github.com/apache/spark/pull/35906#discussion_r896274975 ## common/network-shuffle/src/main/java/org/apache/spark/network/shuffle/RemoteBlockPushResolver.java: ## @@ -317,22 +353,24 @@ public void applicationRemoved(String appId, boolean cleanupLocalDirs) { logger.info("Application {} removed, cleanupLocalDirs = {}", appId, cleanupLocalDirs); AppShuffleInfo appShuffleInfo = appsShuffleInfo.remove(appId); if (null != appShuffleInfo) { - mergedShuffleCleaner.execute( -() -> closeAndDeletePartitionFilesIfNeeded(appShuffleInfo, cleanupLocalDirs)); + submitCleanupTask( +() -> closeAndDeletePartitions(appShuffleInfo, cleanupLocalDirs, true)); } +removeAppAttemptPathInfoFromDB( +new AppAttemptId(appShuffleInfo.appId, appShuffleInfo.attemptId)); } - /** * Clean up the AppShufflePartitionInfo for a specific AppShuffleInfo. * If cleanupLocalDirs is true, the merged shuffle files will also be deleted. * The cleanup will be executed in a separate thread. */ @SuppressWarnings("SynchronizationOnLocalVariableOrMethodParameter") @VisibleForTesting - void closeAndDeletePartitionFilesIfNeeded( + void closeAndDeletePartitions( AppShuffleInfo appShuffleInfo, - boolean cleanupLocalDirs) { + boolean cleanupLocalDirs, + boolean removeFromDb) { Review Comment: This is to handle the case you mentioned earlier, that the merged shuffle data has been removed from the disk through some API(TBD in another ticket for cleaning up merged shuffle during job runtime), but the information in the DB should be kept. Right now, we don't have that API in place, so all the callers will set this flag to true. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] HyukjinKwon commented on a diff in pull request #36841: [SPARK-39444][SQL] Add OptimizeSubqueries into nonExcludableRules list
HyukjinKwon commented on code in PR #36841: URL: https://github.com/apache/spark/pull/36841#discussion_r896274952 ## sql/core/src/test/scala/org/apache/spark/sql/SQLQuerySuite.scala: ## @@ -4456,6 +4456,24 @@ class SQLQuerySuite extends QueryTest with SharedSparkSession with AdaptiveSpark """.stripMargin), Seq(Row(2), Row(1))) } + + test("SPARK-39444: Add OptimizeSubqueries into nonExcludableRules list") { +withSQLConf( + SQLConf.OPTIMIZER_EXCLUDED_RULES.key -> + "org.apache.spark.sql.catalyst.optimizer.Optimizer$OptimizeSubqueries") { Review Comment: and let's match the test place with https://github.com/apache/spark/pull/36847 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] srowen closed pull request #36273: [SPARK-38960][CORE]Spark should fail fast if initial memory too large…
srowen closed pull request #36273: [SPARK-38960][CORE]Spark should fail fast if initial memory too large… URL: https://github.com/apache/spark/pull/36273 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] HyukjinKwon commented on a diff in pull request #36841: [SPARK-39444][SQL] Add OptimizeSubqueries into nonExcludableRules list
HyukjinKwon commented on code in PR #36841: URL: https://github.com/apache/spark/pull/36841#discussion_r896274142 ## sql/core/src/test/scala/org/apache/spark/sql/SQLQuerySuite.scala: ## @@ -4456,6 +4456,24 @@ class SQLQuerySuite extends QueryTest with SharedSparkSession with AdaptiveSpark """.stripMargin), Seq(Row(2), Row(1))) } + + test("SPARK-39444: Add OptimizeSubqueries into nonExcludableRules list") { +withSQLConf( + SQLConf.OPTIMIZER_EXCLUDED_RULES.key -> + "org.apache.spark.sql.catalyst.optimizer.Optimizer$OptimizeSubqueries") { Review Comment: nit but can we use a class reference instead of a string? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] github-actions[bot] closed pull request #35256: [SPARK-37933][SQL] Limit push down for parquet vectorized reader
github-actions[bot] closed pull request #35256: [SPARK-37933][SQL] Limit push down for parquet vectorized reader URL: https://github.com/apache/spark/pull/35256 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] github-actions[bot] closed pull request #35719: [SPARK-38401][SQL][CORE] Unify get preferred locations for shuffle in AQE
github-actions[bot] closed pull request #35719: [SPARK-38401][SQL][CORE] Unify get preferred locations for shuffle in AQE URL: https://github.com/apache/spark/pull/35719 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] HyukjinKwon commented on pull request #36729: [SPARK-39295][PYTHON][DOCS] Improve documentation of pandas API support list
HyukjinKwon commented on PR #36729: URL: https://github.com/apache/spark/pull/36729#issuecomment-1154570503 Sure, thanks for tracking this. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] cloud-fan commented on pull request #36837: [SPARK-39441][SQL] Speed up DeduplicateRelations
cloud-fan commented on PR #36837: URL: https://github.com/apache/spark/pull/36837#issuecomment-1154563272 thanks, merging to master! -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] cloud-fan closed pull request #36837: [SPARK-39441][SQL] Speed up DeduplicateRelations
cloud-fan closed pull request #36837: [SPARK-39441][SQL] Speed up DeduplicateRelations URL: https://github.com/apache/spark/pull/36837 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] dongjoon-hyun commented on pull request #36862: [SPARK-39461][INFRA] Print `SPARK_LOCAL_(HOSTNAME|IP)` in `build/(mvn|sbt)`
dongjoon-hyun commented on PR #36862: URL: https://github.com/apache/spark/pull/36862#issuecomment-1154557430 Hi, @sunchao . Could you review this when you have some time? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] dongjoon-hyun closed pull request #36860: [SPARK-39460][CORE][TESTS] Fix `CoarseGrainedSchedulerBackendSuite` to handle fast allocations
dongjoon-hyun closed pull request #36860: [SPARK-39460][CORE][TESTS] Fix `CoarseGrainedSchedulerBackendSuite` to handle fast allocations URL: https://github.com/apache/spark/pull/36860 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] dongjoon-hyun commented on pull request #36860: [SPARK-39460][CORE][TESTS] Fix `CoarseGrainedSchedulerBackendSuite` to handle fast allocations
dongjoon-hyun commented on PR #36860: URL: https://github.com/apache/spark/pull/36860#issuecomment-1154551586 Thank you so much, @huaxingao ! -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] huaxingao commented on pull request #36860: [SPARK-39460][CORE][TESTS] Fix `CoarseGrainedSchedulerBackendSuite` to handle fast allocations
huaxingao commented on PR #36860: URL: https://github.com/apache/spark/pull/36860#issuecomment-1154541511 LGTM. Thanks for pinging me! -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] srowen commented on a diff in pull request #36843: [SPARK-39446][MLLIB] Add relevance score for nDCG evaluation
srowen commented on code in PR #36843: URL: https://github.com/apache/spark/pull/36843#discussion_r896243825 ## mllib/src/main/scala/org/apache/spark/mllib/evaluation/RankingMetrics.scala: ## @@ -35,8 +35,16 @@ import org.apache.spark.rdd.RDD * @param predictionAndLabels an RDD of (predicted ranking, ground truth set) pairs. */ @Since("1.2.0") -class RankingMetrics[T: ClassTag](predictionAndLabels: RDD[(Array[T], Array[T])]) - extends Logging with Serializable { +class RankingMetrics[T: ClassTag]( +predictionAndLabels: RDD[(Array[T], Array[T], Array[(T, Double)])]) Review Comment: Hm, why does the last need to be (T, Double) pairs? Wouldn't this be an attribute of the ground truth? Looking this up via Map seems clunky later. The problem is not changing the binary signature, but, at least, how about a third array of Double only, that is parallel to the second array? ## mllib/src/main/scala/org/apache/spark/mllib/evaluation/RankingMetrics.scala: ## @@ -58,9 +66,12 @@ class RankingMetrics[T: ClassTag](predictionAndLabels: RDD[(Array[T], Array[T])] @Since("1.2.0") def precisionAt(k: Int): Double = { Review Comment: Is there any similar notion of precision@k when the ground truth has a relevance 'weight'? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
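To make the parallel-array alternative concrete, here is a hedged sketch of nDCG@k computed from a third `Array[Double]` of relevance scores aligned with the ground-truth labels (a self-contained linear-gain formulation for illustration; the names and signature are not the actual `RankingMetrics` API):
```scala
object NdcgSketch extends App {
  // `relevance` is parallel to `labels`, as suggested, so the (item -> score)
  // lookup stays an internal detail of the metric.
  def ndcgAt(pred: Array[String], labels: Array[String],
             relevance: Array[Double], k: Int): Double = {
    val relOf = labels.zip(relevance).toMap
    // Linear-gain DCG term: rel / log2(position + 2), positions starting at 0.
    def gain(rel: Double, pos: Int): Double = rel * math.log(2.0) / math.log(pos + 2.0)

    val dcg = pred.take(k).zipWithIndex.map { case (item, i) =>
      gain(relOf.getOrElse(item, 0.0), i)
    }.sum
    val idcg = relevance.sortBy(-_).take(k).zipWithIndex.map { case (rel, i) =>
      gain(rel, i)
    }.sum
    if (idcg == 0.0) 0.0 else dcg / idcg
  }

  println(ndcgAt(Array("a", "c", "b"), Array("a", "b"), Array(3.0, 1.0), 3))
}
```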
[GitHub] [spark] allisonwang-db commented on pull request #36837: [SPARK-39441][SQL] Speed up DeduplicateRelations
allisonwang-db commented on PR #36837: URL: https://github.com/apache/spark/pull/36837#issuecomment-1154536909 cc @cloud-fan -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] allisonwang-db commented on pull request #36837: [SPARK-39441][SQL] Speed up DeduplicateRelations
allisonwang-db commented on PR #36837: URL: https://github.com/apache/spark/pull/36837#issuecomment-1154536858 @LuciferYang I am running on M1 as well. Indeed the runtime for the TPCDSQuerySuite can vary over the runs. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] srowen closed pull request #35290: [SPARK-37865][SQL][3.0]Fix union bug when the first child of union has duplicate columns
srowen closed pull request #35290: [SPARK-37865][SQL][3.0]Fix union bug when the first child of union has duplicate columns URL: https://github.com/apache/spark/pull/35290 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] srowen closed pull request #36334: Refactor tests
srowen closed pull request #36334: Refactor tests URL: https://github.com/apache/spark/pull/36334 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] dongjoon-hyun opened a new pull request, #36862: [SPARK-39461][INFRA] Print `SPARK_LOCAL_(HOSTNAME|IP)` in `build/{mvn|sbt}`
dongjoon-hyun opened a new pull request, #36862: URL: https://github.com/apache/spark/pull/36862 ### What changes were proposed in this pull request? This PR aims to print `SPARK_LOCAL_(HOSTNAME|IP)` during building and testing at `build/{mvn|sbt}`. ### Why are the changes needed? `SPARK_LOCAL_HOSTNAME` and `SPARK_LOCAL_IP` are used during testing and the test result depends on them. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Pass the CIs and check GitHub Action logs. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] srowen closed pull request #35586: [SPARK-38265][DOCS][CORE] Update comments of ExecutorAllocationClient
srowen closed pull request #35586: [SPARK-38265][DOCS][CORE] Update comments of ExecutorAllocationClient URL: https://github.com/apache/spark/pull/35586 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] dongjoon-hyun commented on pull request #36860: [SPARK-39460][CORE][TESTS] Fix `CoarseGrainedSchedulerBackendSuite` to handle fast allocations
dongjoon-hyun commented on PR #36860: URL: https://github.com/apache/spark/pull/36860#issuecomment-1154505057 Hi, @huaxingao . Could you review this test PR? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] dtenedor commented on pull request #36861: [SPARK-38796][SQL] Update to_number and try_to_number functions to allow PR with positive numbers
dtenedor commented on PR #36861: URL: https://github.com/apache/spark/pull/36861#issuecomment-1154503075 Hi @cloud-fan, can you take a look at this when you have time? It is a bug fix for the `to_number` and `try_to_number` functions. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] dtenedor opened a new pull request, #36861: [SPARK-38796][SQL] Update to_number and try_to_number functions to allow PR with positive numbers
dtenedor opened a new pull request, #36861: URL: https://github.com/apache/spark/pull/36861 ### What changes were proposed in this pull request? Update `to_number` and `try_to_number` functions to allow the `PR` format token with input strings comprising positive numbers. Before this bug fix, function calls like `to_number(' 123 ', '999PR')` would fail. Now they succeed, which is helpful since `PR` should allow both positive and negative numbers. This satisfies the following specification: ``` to_number(expr, fmt) fmt { ' [ MI | S ] [ L | $ ] [ 0 | 9 | G | , ] [...] [ . | D ] [ 0 | 9 ] [...] [ L | $ ] [ PR | MI | S ] ' } ``` ### Why are the changes needed? After reviewing the specification, this behavior makes the most sense. ### Does this PR introduce _any_ user-facing change? Yes, a slight change in the behavior of the format string. ### How was this patch tested? Existing and updated unit test coverage. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
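To make the change concrete, here is a hedged sketch of the before/after behaviour described above, written against a SparkSession `spark`; the results in the comments are what the quoted specification implies, not output captured from this patch.

```scala
// Hedged sketch of the described behaviour; assumes a build that includes this fix.
spark.sql("SELECT to_number(' 123 ', '999PR')").show()     // 123   -- positive input with PR now succeeds
spark.sql("SELECT to_number('<456>', '999PR')").show()     // -456  -- PR still reads <...> as a negative value
spark.sql("SELECT try_to_number('oops', '999PR')").show()  // NULL  -- the try_ variant returns NULL instead of failing
```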
[GitHub] [spark] dongjoon-hyun opened a new pull request, #36860: [SPARK-39460][CORE][TESTS] Fix CoarseGrainedSchedulerBackendSuite to handle fast allocations
dongjoon-hyun opened a new pull request, #36860: URL: https://github.com/apache/spark/pull/36860 … ### What changes were proposed in this pull request? ### Why are the changes needed? ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] amaliujia commented on pull request #36641: [SPARK-39263][SQL] Make GetTable, TableExists and DatabaseExists be compatible with 3 layer namespace
amaliujia commented on PR #36641: URL: https://github.com/apache/spark/pull/36641#issuecomment-1154465576 @cloud-fan comments addressed and there is one that we can discuss. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] amaliujia commented on a diff in pull request #36641: [SPARK-39263][SQL] Make GetTable, TableExists and DatabaseExists be compatible with 3 layer namespace
amaliujia commented on code in PR #36641: URL: https://github.com/apache/spark/pull/36641#discussion_r896166516 ## sql/core/src/test/scala/org/apache/spark/sql/internal/CatalogSuite.scala: ## @@ -681,4 +681,60 @@ class CatalogSuite extends SharedSparkSession with AnalysisTest with BeforeAndAf assert(spark.catalog.listTables("default").collect().map(_.name).toSet == Set("my_table1", "my_table2", "my_temp_table")) } + + test("three layer namespace compatibility - get table") { +val catalogName = "testcat" +val dbName = "my_db" +val tableName = "my_table" +val tableSchema = new StructType().add("i", "int") +val description = "this is a test table" + +spark.catalog.createTable( + tableName = Array(catalogName, dbName, tableName).mkString("."), + source = classOf[FakeV2Provider].getName, + schema = tableSchema, + description = description, + options = Map.empty[String, String]) + +val t = spark.catalog.getTable(Array(catalogName, dbName, tableName).mkString(".")) +val expectedTable = + new Table( +tableName, +catalogName, +Array(dbName), +description, +CatalogTableType.MANAGED.name, +false) +assert(expectedTable.toString == t.toString) + } + + test("three layer namespace compatibility - table exists") { +val catalogName = "testcat" +val dbName = "my_db" +val tableName = "my_table" +val tableSchema = new StructType().add("i", "int") + +assert(!spark.catalog.tableExists(Array(catalogName, dbName, tableName).mkString("."))) + +spark.catalog.createTable( + tableName = Array(catalogName, dbName, tableName).mkString("."), + source = classOf[FakeV2Provider].getName, + schema = tableSchema, + description = "", + options = Map.empty[String, String]) + +assert(spark.catalog.tableExists(Array(catalogName, dbName, tableName).mkString("."))) + } + + test("three layer namespace compatibility - database exists") { +val catalogName = "testcat" +val dbName = "my_db" +assert(!spark.catalog.databaseExists(Array(catalogName, dbName).mkString("."))) + +val e = intercept[CatalogNotFoundException] { + val catalogName2 = "catalog_not_exists" + spark.catalog.databaseExists(Array(catalogName2, dbName).mkString(".")) +} +assert(e.getMessage.contains("catalog_not_exists is not defined")) Review Comment: done ## sql/core/src/main/scala/org/apache/spark/sql/internal/CatalogImpl.scala: ## @@ -250,8 +251,14 @@ class CatalogImpl(sparkSession: SparkSession) extends Catalog { * table/view. This throws an `AnalysisException` when no `Table` can be found. */ override def getTable(tableName: String): Table = { -val tableIdent = sparkSession.sessionState.sqlParser.parseTableIdentifier(tableName) -getTable(tableIdent.database.orNull, tableIdent.table) +try { + val ident = sparkSession.sessionState.sqlParser.parseTableIdentifier(tableName) Review Comment: done -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
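As a quick orientation for readers, a hedged usage sketch of the three-layer-namespace calls these tests exercise; the names `testcat`, `my_db`, and `my_table` are the illustrative ones from the test code, and `testcat` must be registered as a v2 catalog for the calls to resolve.

```scala
// Hedged sketch; requires a SparkSession with a v2 catalog registered under "testcat".
spark.catalog.databaseExists("testcat.my_db")          // catalog.database form
spark.catalog.tableExists("testcat.my_db.my_table")    // catalog.database.table form
val t = spark.catalog.getTable("testcat.my_db.my_table")
println(t.catalog)                                     // "testcat", per the Table constructor used in the test
```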
[GitHub] [spark] amaliujia commented on a diff in pull request #36641: [SPARK-39263][SQL] Make GetTable, TableExists and DatabaseExists be compatible with 3 layer namespace
amaliujia commented on code in PR #36641: URL: https://github.com/apache/spark/pull/36641#discussion_r896163957 ## sql/core/src/main/scala/org/apache/spark/sql/internal/CatalogImpl.scala: ## @@ -287,16 +294,44 @@ class CatalogImpl(sparkSession: SparkSession) extends Catalog { * Checks if the database with the specified name exists. */ override def databaseExists(dbName: String): Boolean = { -sessionCatalog.databaseExists(dbName) +// To maintain backwards compatibility, we first treat the input as a simple dbName and check +// if sessionCatalog contains it. If not, we try to parse it as a 3-part name. If the parsed +// identifier contains both a catalog name and a database name, we then search for the database in that +// catalog. +if (!sessionCatalog.databaseExists(dbName)) { + val ident = sparkSession.sessionState.sqlParser.parseMultipartIdentifier(dbName) + if (ident.length == 2) { Review Comment: Hmm, I don't know how to deal with nested namespaces without the help of the SQL analyzer. I am using `sparkSession.sessionState.catalogManager.catalog(catalog_name)` to check if a database exists. Maybe we should add a V2 command to handle nested namespaces? If I handle this by calling `catalogManager` directly, it does not seem extensible. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] amaliujia commented on a diff in pull request #36641: [SPARK-39263][SQL] Make GetTable, TableExists and DatabaseExists be compatible with 3 layer namespace
amaliujia commented on code in PR #36641: URL: https://github.com/apache/spark/pull/36641#discussion_r896161200 ## sql/core/src/main/scala/org/apache/spark/sql/internal/CatalogImpl.scala: ## @@ -250,8 +251,14 @@ class CatalogImpl(sparkSession: SparkSession) extends Catalog { * table/view. This throws an `AnalysisException` when no `Table` can be found. */ override def getTable(tableName: String): Table = { -val tableIdent = sparkSession.sessionState.sqlParser.parseTableIdentifier(tableName) -getTable(tableIdent.database.orNull, tableIdent.table) +try { + val ident = sparkSession.sessionState.sqlParser.parseTableIdentifier(tableName) + getTable(ident.database.orNull, ident.table) +} catch { + case e: org.apache.spark.sql.catalyst.parser.ParseException => +val ident = sparkSession.sessionState.sqlParser.parseMultipartIdentifier(tableName) Review Comment: That arbitrary sequence of identifiers will be passed to the SQL analyzer to resolve the table. If a user gives, for example, `a.b.c.d.e.f`, the analyzer will simply report that the table is not found. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] amaliujia commented on a diff in pull request #36641: [SPARK-39263][SQL] Make GetTable, TableExists and DatabaseExists be compatible with 3 layer namespace
amaliujia commented on code in PR #36641: URL: https://github.com/apache/spark/pull/36641#discussion_r896156559 ## sql/core/src/main/scala/org/apache/spark/sql/internal/CatalogImpl.scala: ## @@ -250,8 +251,14 @@ class CatalogImpl(sparkSession: SparkSession) extends Catalog { * table/view. This throws an `AnalysisException` when no `Table` can be found. */ override def getTable(tableName: String): Table = { -val tableIdent = sparkSession.sessionState.sqlParser.parseTableIdentifier(tableName) -getTable(tableIdent.database.orNull, tableIdent.table) +try { + val ident = sparkSession.sessionState.sqlParser.parseTableIdentifier(tableName) + getTable(ident.database.orNull, ident.table) +} catch { + case e: org.apache.spark.sql.catalyst.parser.ParseException => +val ident = sparkSession.sessionState.sqlParser.parseMultipartIdentifier(tableName) Review Comment: No. For an arbitrary dotted sequence it simply breaks the input on the dots into a Seq of strings. `sparkSession.sessionState.sqlParser.parseTableIdentifier` will throw an exception when the input is not of the form `b` or `a.b`, i.e. anything beyond a table identifier, which in OSS only covers a database and a table name. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
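A short hedged sketch of the parser behaviour this comment relies on (`sessionState` is internal API, so this is for illustration only):

```scala
// Hedged sketch; parseTableIdentifier accepts at most "db.table",
// while parseMultipartIdentifier splits any dotted name into its parts.
val parser = spark.sessionState.sqlParser
parser.parseTableIdentifier("a.b")               // TableIdentifier("b", Some("a"))
// parser.parseTableIdentifier("a.b.c")          // would throw ParseException: too many name parts
parser.parseMultipartIdentifier("a.b.c.d.e.f")   // Seq("a", "b", "c", "d", "e", "f")
```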
[GitHub] [spark] dongjoon-hyun closed pull request #36858: [SPARK-39458][CORE][TESTS] Fix `UISuite` for IPv6
dongjoon-hyun closed pull request #36858: [SPARK-39458][CORE][TESTS] Fix `UISuite` for IPv6 URL: https://github.com/apache/spark/pull/36858 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] dongjoon-hyun commented on pull request #36858: [SPARK-39458][CORE][TESTS] Fix `UISuite` for IPv6
dongjoon-hyun commented on PR #36858: URL: https://github.com/apache/spark/pull/36858#issuecomment-1154394229 Merged to master. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] EnricoMi commented on a diff in pull request #36150: [SPARK-38864][SQL] Add melt / unpivot to Dataset
EnricoMi commented on code in PR #36150: URL: https://github.com/apache/spark/pull/36150#discussion_r893842370 ## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala: ## @@ -1382,6 +1417,12 @@ class Analyzer(override val catalogManager: CatalogManager) case g: Generate if containsStar(g.generator.children) => throw QueryCompilationErrors.invalidStarUsageError("explode/json_tuple/UDTF", extractStar(g.generator.children)) + // If the Melt ids contain Stars, expand them. Review Comment: Should this be merged into a single `case`? ## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala: ## @@ -524,6 +525,10 @@ class Analyzer(override val catalogManager: CatalogManager) if child.resolved && groupByOpt.isDefined && hasUnresolvedAlias(groupByOpt.get) => Pivot(Some(assignAliases(groupByOpt.get)), pivotColumn, pivotValues, aggregates, child) + case m: Melt if m.child.resolved && +(hasUnresolvedAlias(m.ids) || hasUnresolvedAlias(m.values)) => +m.copy(ids = assignAliases(m.ids), values = assignAliases(m.values)) Review Comment: Any particular reason why the other `case`s do not copy here? ## sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala: ## @@ -2012,7 +2012,97 @@ class Dataset[T] private[sql]( @scala.annotation.varargs def agg(expr: Column, exprs: Column*): DataFrame = groupBy().agg(expr, exprs : _*) - /** + /** + * Unpivot a DataFrame from wide format to long format, optionally + * leaving identifier columns set. + * + * This function is useful to massage a DataFrame into a format where some + * columns are identifier columns ("ids"), while all other columns ("values") + * are "unpivoted" to the rows, leaving just two non-id columns, named as given + * by `variableColumnName` and `valueColumnName`. + * + * {{{ + * val df = Seq((1, 11, 12L), (2, 21, 22L)).toDF("id", "int", "long") + * df.show() + * // output: + * // +---+---++ + * // | id|int|long| + * // +---+---++ + * // | 1| 11| 12| + * // | 2| 21| 22| + * // +---+---++ + * + * df.melt(Array($"id"), Array($"int", $"long"), "variable", "value").show() + * // output: + * // +---++-+ + * // | id|variable|value| + * // +---++-+ + * // | 1| int| 11| + * // | 1|long| 12| + * // | 2| int| 21| + * // | 2|long| 22| + * // +---++-+ + * // schema: + * //root + * // |-- id: integer (nullable = false) + * // |-- variable: string (nullable = false) + * // |-- value: long (nullable = true) + * }}} + * + * When no "id" columns are given, the unpivoted DataFrame consists of only the + * "variable" and "value" columns. + * + * All "value" columns must be of compatible data type. If they are not the same data type, + * all "value" columns are cast to the nearest common data type. For instance, + * types `IntegerType` and `LongType` are compatible and cast to `LongType`, + * while `IntegerType` and `StringType` are not compatible and `melt` fails. + * + * @param ids Id columns + * @param values Value columns to melt + * @param variableColumnName Name of the variable column + * @param valueColumnName Name of the value column + * + * @group untypedrel + * @since 3.4.0 + */ + def melt( Review Comment: should there also be a `unpivot` alias? ## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/TypeCoercion.scala: ## @@ -736,6 +736,21 @@ abstract class TypeCoercionBase { } } + /** + * Determines the value type of a [[Melt]]. 
+ */ + object MeltCoercion extends Rule[LogicalPlan] { +override def apply(plan: LogicalPlan): LogicalPlan = + plan resolveOperators { +case m: Melt if m.values.nonEmpty && m.values.forall(_.resolved) && m.valueType.isEmpty => + val valueDataType = findWiderTypeWithoutStringPromotion(m.values.map(_.dataType)) Review Comment: What if users prefer the wider `findWiderCommonType` method? I personally don't. Would require to wire in an extra argument to `melt`. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
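For context on the coercion question above, a hedged sketch of how the documented `melt` call would behave under `findWiderTypeWithoutStringPromotion`, assuming the API lands as proposed in this PR:

```scala
// Hedged sketch; `melt` is the API proposed in this PR, not part of a released Spark version.
import spark.implicits._

val df = Seq((1, 11, 12L), (2, 21, 22L)).toDF("id", "int", "long")

// IntegerType and LongType share a wider type, so the unpivoted "value" column becomes LongType.
df.melt(Array($"id"), Array($"int", $"long"), "variable", "value").printSchema()

// An int column and a string column have no common type without string promotion, so melting
// them together would fail analysis under the rule discussed above.
```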
[GitHub] [spark] polkadot21 opened a new pull request, #36859: DTW: new distance measure for clustering
polkadot21 opened a new pull request, #36859: URL: https://github.com/apache/spark/pull/36859 ### What changes were proposed in this pull request? In this pull request, I propose a new distance measure for clustering, namely, dynamic time warping. ### Why are the changes needed? This distance measure is a more accurate choice for time-series clustering. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? The tests are also added. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
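For readers unfamiliar with the measure, a minimal sketch of the dynamic-time-warping recurrence itself (this is the textbook dynamic program, not the code from this PR):

```scala
// Hedged sketch of plain DTW between two univariate series; O(n*m) time and space.
def dtwDistance(a: Array[Double], b: Array[Double]): Double = {
  val n = a.length
  val m = b.length
  // cost(i)(j) = minimal accumulated cost of aligning a(0 until i) with b(0 until j)
  val cost = Array.fill(n + 1, m + 1)(Double.PositiveInfinity)
  cost(0)(0) = 0.0
  for (i <- 1 to n; j <- 1 to m) {
    val d = math.abs(a(i - 1) - b(j - 1))
    cost(i)(j) = d + math.min(cost(i - 1)(j), math.min(cost(i)(j - 1), cost(i - 1)(j - 1)))
  }
  cost(n)(m)
}

// Two series with the same shape but a small time shift align with zero cost,
// whereas a plain Euclidean distance would penalise the shift.
val x = Array(0.0, 1.0, 2.0, 3.0, 2.0, 1.0)
val y = Array(0.0, 0.0, 1.0, 2.0, 3.0, 2.0, 1.0)
println(dtwDistance(x, y))  // 0.0
```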