[GitHub] [spark] Ngone51 commented on a diff in pull request #36162: [SPARK-32170][CORE] Improve the speculation through the stage task metrics.

2022-06-13 Thread GitBox


Ngone51 commented on code in PR #36162:
URL: https://github.com/apache/spark/pull/36162#discussion_r896402450


##
core/src/main/scala/org/apache/spark/scheduler/TaskSetManager.scala:
##
@@ -1069,25 +1084,56 @@ private[spark] class TaskSetManager(
 * Check if the task associated with the given tid has past the time threshold and should be
 * speculative run.
 */
-  private def checkAndSubmitSpeculatableTask(
-  tid: Long,
+  private def checkAndSubmitSpeculatableTasks(
   currentTimeMillis: Long,
-  threshold: Double): Boolean = {
-val info = taskInfos(tid)
-val index = info.index
-if (!successful(index) && copiesRunning(index) == 1 &&
-info.timeRunning(currentTimeMillis) > threshold && !speculatableTasks.contains(index)) {
-  addPendingTask(index, speculatable = true)
-  logInfo(
-("Marking task %d in stage %s (on %s) as speculatable because it ran more" +
-  " than %.0f ms(%d speculatable tasks in this taskset now)")
-  .format(index, taskSet.id, info.host, threshold, speculatableTasks.size + 1))
-  speculatableTasks += index
-  sched.dagScheduler.speculativeTaskSubmitted(tasks(index))
-  true
-} else {
-  false
+  threshold: Double,
+  numSuccessfulTasks: Int,

Review Comment:
   not used?






[GitHub] [spark] sadikovi commented on pull request #36726: [SPARK-39339][SQL] Support TimestampNTZ type in JDBC data source

2022-06-13 Thread GitBox


sadikovi commented on PR #36726:
URL: https://github.com/apache/spark/pull/36726#issuecomment-1154747718

   I think we can. JDBC dialects can configure how they map the TimestampNTZ type. In the case you mentioned, both timestamps will be read as timestamp_ntz in MySQL and Postgres. In fact, the current timestamp type is stored as timestamp_ntz in those database systems.
   
   Even with dialects managing timestamp_ntz writes and reads, the same problem would remain unless you store them as different types.
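   
   As an illustrative sketch (not code from this PR), a dialect can steer that mapping by overriding `getCatalystType`; `ExampleDialect` and its URL prefix below are hypothetical, while `getCatalystType` is the existing `JdbcDialect` hook:
   ```scala
   import java.sql.Types
   import org.apache.spark.sql.jdbc.JdbcDialect
   import org.apache.spark.sql.types.{DataType, MetadataBuilder, TimestampNTZType}
   
   // Hypothetical dialect that surfaces the database's plain TIMESTAMP
   // columns as timestamp_ntz on the Spark side.
   object ExampleDialect extends JdbcDialect {
     override def canHandle(url: String): Boolean = url.startsWith("jdbc:example")
   
     override def getCatalystType(
         sqlType: Int, typeName: String, size: Int, md: MetadataBuilder): Option[DataType] = {
       sqlType match {
         case Types.TIMESTAMP => Some(TimestampNTZType) // read as timestamp_ntz
         case _ => None // fall back to the default mapping
       }
     }
   }
   ```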
   
   Also, the test passes in master:
   ```
   [ivan.sadikov@C02DV1TGMD6R spark-oss (master)]$ git log -n1
   commit 2349175e1b81b0a61e1ed90c2d051c01cf78de9b (HEAD -> master, upstream/master)
   Author: Ivan Sadikov 
   Date:   Mon Jun 13 21:22:15 2022 -0700
   
   [SPARK-39339][SQL] Support TimestampNTZ type in JDBC data source
   
   [info] JDBCSuite:
   05:53:04.233 WARN org.apache.hadoop.util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
   05:53:07.936 ERROR org.apache.spark.sql.execution.datasources.jdbc.connection.ConnectionProvider: Failed to load built-in provider.
   [info] - SPARK-39339: Handle TimestampNTZType null values (1 second, 555 milliseconds)
   [info] - SPARK-39339: TimestampNTZType with different local time zones (4 seconds, 48 milliseconds)
   05:53:14.022 WARN org.apache.spark.sql.jdbc.JDBCSuite:
   
   = POSSIBLE THREAD LEAK IN SUITE o.a.s.sql.jdbc.JDBCSuite, threads: Timer-2 (daemon=true), rpc-boss-3-1 (daemon=true), shuffle-boss-6-1 (daemon=true) =
   [info] Run completed in 11 seconds, 724 milliseconds.
   [info] Total number of tests run: 2
   [info] Suites: completed 1, aborted 0
   [info] Tests: succeeded 2, failed 0, canceled 0, ignored 0, pending 0
   [info] All tests passed.
   [success] Total time: 81 s (01:21), completed Jun 14, 2022 5:53:14 AM
   ```
   





[GitHub] [spark] Ngone51 commented on a diff in pull request #36162: [SPARK-32170][CORE] Improve the speculation through the stage task metrics.

2022-06-13 Thread GitBox


Ngone51 commented on code in PR #36162:
URL: https://github.com/apache/spark/pull/36162#discussion_r896400772


##
core/src/main/scala/org/apache/spark/scheduler/TaskSetManager.scala:
##
@@ -1069,25 +1084,56 @@ private[spark] class TaskSetManager(
 * Check if the task associated with the given tid has past the time threshold and should be
 * speculative run.
 */
-  private def checkAndSubmitSpeculatableTask(
-  tid: Long,
+  private def checkAndSubmitSpeculatableTasks(
   currentTimeMillis: Long,
-  threshold: Double): Boolean = {
-val info = taskInfos(tid)
-val index = info.index
-if (!successful(index) && copiesRunning(index) == 1 &&
-info.timeRunning(currentTimeMillis) > threshold && !speculatableTasks.contains(index)) {
-  addPendingTask(index, speculatable = true)
-  logInfo(
-("Marking task %d in stage %s (on %s) as speculatable because it ran more" +
-  " than %.0f ms(%d speculatable tasks in this taskset now)")
-  .format(index, taskSet.id, info.host, threshold, speculatableTasks.size + 1))
-  speculatableTasks += index
-  sched.dagScheduler.speculativeTaskSubmitted(tasks(index))
-  true
-} else {
-  false
+  threshold: Double,
+  numSuccessfulTasks: Int,
+  customizedThreshold: Boolean = false): Boolean = {
+var foundTasksResult = false
+for (tid <- runningTasksSet) {
+  val info = taskInfos(tid)
+  val index = info.index
+  if (!successful(index) && copiesRunning(index) == 1 && !speculatableTasks.contains(index)) {
+val runtimeMs = info.timeRunning(currentTimeMillis)
+
+def checkMaySpeculate(): Boolean = {
+  if (customizedThreshold || taskProcessRateCalculator.isEmpty) {
+true
+  } else {
+val longTimeTask = runtimeMs > efficientTaskDurationFactor * threshold
+longTimeTask || taskProcessRateCalculator.exists(_.isInefficient(tid, runtimeMs, info))
+  }
+}
+
+def shouldSpeculateForExecutorDecomissioning(): Boolean = {

Review Comment:
   ```suggestion
   def shouldSpeculateForExecutorDecommissioning(): Boolean = {
   ```






[GitHub] [spark] ivoson commented on pull request #36716: [SPARK-39062][CORE] Add stage level resource scheduling support for standalone cluster

2022-06-13 Thread GitBox


ivoson commented on PR #36716:
URL: https://github.com/apache/spark/pull/36716#issuecomment-1154747061

   > I kind of disagree because it doesn't work as expected compared to other resource managers. This to me is very confusing. I kind of hate to add more features on what I would consider a broken feature and then people think it's ok. If people are using this successfully and finding it useful, the very least we need to do is document how it works and its limitations. I somewhat remember hitting other issues with dynamic allocation in standalone but then giving up on it as I talked to a few people that said it wasn't used and figured someone would go through and test it. ie on that same issue I mention "Note that there are other places in the code that uses executor cores which could also be wrong in standalone mode. for instance PythonRunner is using it to split memory."
   > 
   > @ivoson I'm assuming you are using this successfully in production, how much testing have you done with dynamic allocation?
   
   Hi @tgravescs, thanks for pointing out the issue. Actually, we are not using this in production right now. We just find this feature very useful and want to leverage it, but it doesn't work with standalone clusters right now, so I want to support it.
   To address your concern, I'd like to create another ticket to update the docs first, so that people are aware of the limitations.
   
   Thanks @tgravescs and @Ngone51 for your guidance.





[GitHub] [spark] beliefer commented on pull request #36726: [SPARK-39339][SQL] Support TimestampNTZ type in JDBC data source

2022-06-13 Thread GitBox


beliefer commented on PR #36726:
URL: https://github.com/apache/spark/pull/36726#issuecomment-1154742731

   > I updated the test case as you suggested and it passes on my machine. Can you share the error message? It also passed the build.
   
   ```
   == Results ==
   !== Correct Answer - 1 ==       == Spark Answer - 1 ==
    struct                         struct
   ![1500-01-20T00:00:00.123456]   [1500-01-20T00:16:08.123456]
   ```





[GitHub] [spark] beliefer commented on pull request #36726: [SPARK-39339][SQL] Support TimestampNTZ type in JDBC data source

2022-06-13 Thread GitBox


beliefer commented on PR #36726:
URL: https://github.com/apache/spark/pull/36726#issuecomment-1154742475

   I think we can't support timestamp ntz with the option.
   We should let the JDBC dialect decide how to support timestamp ntz.
   If one table has ts1 as timestamp and ts2 as timestamp ntz, what is the output when we specify the `inferTimestampNTZType` option?





[GitHub] [spark] MaxGekk closed pull request #36852: [SPARK-38700][SQL][3.3] Use error classes in the execution errors of save mode

2022-06-13 Thread GitBox


MaxGekk closed pull request #36852: [SPARK-38700][SQL][3.3] Use error classes in the execution errors of save mode
URL: https://github.com/apache/spark/pull/36852





[GitHub] [spark] sadikovi commented on pull request #36726: [SPARK-39339][SQL] Support TimestampNTZ type in JDBC data source

2022-06-13 Thread GitBox


sadikovi commented on PR #36726:
URL: https://github.com/apache/spark/pull/36726#issuecomment-1154740681

   I updated the test case as you suggested and it passes on my machine. Can you share the error message? It also passed the build.





[GitHub] [spark] MaxGekk commented on pull request #36852: [SPARK-38700][SQL][3.3] Use error classes in the execution errors of save mode

2022-06-13 Thread GitBox


MaxGekk commented on PR #36852:
URL: https://github.com/apache/spark/pull/36852#issuecomment-1154740365

   +1, LGTM. Merging to 3.3.
   Thank you, @panbingkun and @dongjoon-hyun, for the review.





[GitHub] [spark] gengliangwang commented on pull request #36726: [SPARK-39339][SQL] Support TimestampNTZ type in JDBC data source

2022-06-13 Thread GitBox


gengliangwang commented on PR #36726:
URL: https://github.com/apache/spark/pull/36726#issuecomment-1154737984

   @beliefer Your test case is testing ORC, while this PR is about JDBC...





[GitHub] [spark] dongjoon-hyun commented on a diff in pull request #36864: [SPARK-39463][CORE][TESTS] Use `UUID` for test database location in `JavaJdbcRDDSuite`

2022-06-13 Thread GitBox


dongjoon-hyun commented on code in PR #36864:
URL: https://github.com/apache/spark/pull/36864#discussion_r896391807


##
core/src/test/java/org/apache/spark/JavaJdbcRDDSuite.java:
##
@@ -32,6 +33,7 @@
 import org.junit.Test;
 
 public class JavaJdbcRDDSuite implements Serializable {
+  private String dbName = "db_" + UUID.randomUUID().toString().replace('-', '_');

Review Comment:
   I borrowed the logic. Thanks, @HyukjinKwon .
   
   
https://github.com/apache/spark/blob/2349175e1b81b0a61e1ed90c2d051c01cf78de9b/sql/core/src/test/scala/org/apache/spark/sql/test/SQLTestUtils.scala#L354






[GitHub] [spark] beliefer commented on pull request #36726: [SPARK-39339][SQL] Support TimestampNTZ type in JDBC data source

2022-06-13 Thread GitBox


beliefer commented on PR #36726:
URL: https://github.com/apache/spark/pull/36726#issuecomment-1154735759

   > Thanks, merging to master
   
   I updated this test case and it fails!
   ```
 test("SPARK-37463: read/write Timestamp ntz to Orc with different time zone") {
   DateTimeTestUtils.withDefaultTimeZone(DateTimeTestUtils.LA) {
     val sqlText = """
       |select
       | timestamp_ntz '2021-06-01 00:00:00' ts_ntz1,
       | timestamp_ntz '1883-11-16 00:00:00.0' as ts_ntz2,
       | timestamp_ntz '2021-03-14 02:15:00.0' as ts_ntz3
       |""".stripMargin

     withTempPath { dir =>
       val path = dir.getCanonicalPath
       val df = sql(sqlText)

       df.write.mode("overwrite").orc(path)

       val query = s"select * from `orc`.`$path`"

       DateTimeTestUtils.outstandingZoneIds.foreach { zoneId =>
         DateTimeTestUtils.withDefaultTimeZone(zoneId) {
           withAllNativeOrcReaders {
             checkAnswer(sql(query), df)
           }
         }
       }
     }
   }
 }
   ```





[GitHub] [spark] dongjoon-hyun commented on a diff in pull request #36864: [SPARK-39463][CORE][TESTS] Use `UUID` for test database location in `JavaJdbcRDDSuite`

2022-06-13 Thread GitBox


dongjoon-hyun commented on code in PR #36864:
URL: https://github.com/apache/spark/pull/36864#discussion_r896389409


##
core/src/test/java/org/apache/spark/JavaJdbcRDDSuite.java:
##
@@ -32,6 +33,7 @@
 import org.junit.Test;
 
 public class JavaJdbcRDDSuite implements Serializable {
+  private String dbName = UUID.randomUUID().toString();

Review Comment:
   Oh, thank you. Sure.






[GitHub] [spark] HyukjinKwon commented on a diff in pull request #36864: [SPARK-39463][CORE][TESTS] Use `UUID` for test database location in `JavaJdbcRDDSuite`

2022-06-13 Thread GitBox


HyukjinKwon commented on code in PR #36864:
URL: https://github.com/apache/spark/pull/36864#discussion_r896378643


##
core/src/test/java/org/apache/spark/JavaJdbcRDDSuite.java:
##
@@ -32,6 +33,7 @@
 import org.junit.Test;
 
 public class JavaJdbcRDDSuite implements Serializable {
+  private String dbName = UUID.randomUUID().toString();

Review Comment:
   I remember it becomes flaky when the UUID starts with a number (in Spark SQL), e.g. `SQLTestUtils.withTempDatabase`. Should we copy and paste that logic here?
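   
   For reference, a minimal sketch of that helper's naming trick (`tempDbName` is an illustrative name, not the actual helper): the prefix keeps the identifier from starting with a digit, which databases can reject in unquoted names.
   ```scala
   import java.util.UUID
   
   // Mirrors the idea in SQLTestUtils.withTempDatabase: prefix the UUID and
   // replace hyphens so the result is a valid SQL identifier.
   def tempDbName(): String = "db_" + UUID.randomUUID().toString.replace('-', '_')
   ```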






[GitHub] [spark] Ngone51 commented on pull request #36716: [SPARK-39062][CORE] Add stage level resource scheduling support for standalone cluster

2022-06-13 Thread GitBox


Ngone51 commented on PR #36716:
URL: https://github.com/apache/spark/pull/36716#issuecomment-1154714325

   @tgravescs Thanks for your feedback. I agree that building new features on top of features with known issues is a dirty approach, since it buries those issues even deeper. I'm also +1 on the proposal to update the docs and warn users in the case of None cores. In the long term, we could think of a better solution for it.





[GitHub] [spark] AmplabJenkins commented on pull request #36855: [SPARK-39432][SQL] element_at(*, 0) does not return INVALID_ARRAY_INDEX_IN_ELEMENT_AT

2022-06-13 Thread GitBox


AmplabJenkins commented on PR #36855:
URL: https://github.com/apache/spark/pull/36855#issuecomment-1154710152

   Can one of the admins verify this patch?





[GitHub] [spark] ulysses-you commented on a diff in pull request #36856: [SPARK-39455][SQL] Improve expression non-codegen code path performance by cache data type matching

2022-06-13 Thread GitBox


ulysses-you commented on code in PR #36856:
URL: https://github.com/apache/spark/pull/36856#discussion_r896371376


##
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/arithmetic.scala:
##
@@ -53,6 +53,17 @@ case class UnaryMinus(
   override def toString: String = s"-$child"
 
   private lazy val numeric = TypeUtils.getNumeric(dataType, failOnError)
+  private lazy val unaryMinusFunc: Any => Any = dataType match {

Review Comment:
   @srowen @LuciferYang
   This issue mostly happens in arithmetic-like expressions, and I also tried to collect the other related expressions. Hopefully I'm not missing any. We can still keep this in mind when reviewing newly added expressions or newly supported data types that may introduce this issue.






[GitHub] [spark] gengliangwang closed pull request #36726: [SPARK-39339][SQL] Support TimestampNTZ type in JDBC data source

2022-06-13 Thread GitBox


gengliangwang closed pull request #36726: [SPARK-39339][SQL] Support TimestampNTZ type in JDBC data source
URL: https://github.com/apache/spark/pull/36726





[GitHub] [spark] gengliangwang commented on pull request #36726: [SPARK-39339][SQL] Support TimestampNTZ type in JDBC data source

2022-06-13 Thread GitBox


gengliangwang commented on PR #36726:
URL: https://github.com/apache/spark/pull/36726#issuecomment-1154692745

   Thanks, merging to master





[GitHub] [spark] dongjoon-hyun opened a new pull request, #36864: [SPARK-39463][CORE][TESTS] Use `UUID` for test database location in `JavaJdbcRDDSuite`

2022-06-13 Thread GitBox


dongjoon-hyun opened a new pull request, #36864:
URL: https://github.com/apache/spark/pull/36864

   ### What changes were proposed in this pull request?
   
   This PR aims to use UUID instead of a fixed test database location in `JavaJdbcRDDSuite`.
   
   ### Why are the changes needed?
   
   Although `JavaJdbcRDDSuite` has clean-up logic, the location is not removed cleanly when the tests are interrupted. After this PR, we can avoid conflicts caused by the leftover files.
   
   ### Does this PR introduce _any_ user-facing change?
   
   No.
   
   ### How was this patch tested?
   
   Pass the CIs.





[GitHub] [spark] LuciferYang commented on pull request #36856: [SPARK-39455][SQL] Improve expression non-codegen code path performance by cache data type matching

2022-06-13 Thread GitBox


LuciferYang commented on PR #36856:
URL: https://github.com/apache/spark/pull/36856#issuecomment-1154677592

   Are there any other similar cases?
   
   





[GitHub] [spark] dongjoon-hyun commented on pull request #36863: [SPARK-39459][CORE] `localHostName*` methods should support `IPv6`

2022-06-13 Thread GitBox


dongjoon-hyun commented on PR #36863:
URL: https://github.com/apache/spark/pull/36863#issuecomment-1154672837

   cc @dbtsai 





[GitHub] [spark] srowen commented on a diff in pull request #36856: [SPARK-39455][SQL] Improve expression non-codegen code path performance by cache data type matching

2022-06-13 Thread GitBox


srowen commented on code in PR #36856:
URL: https://github.com/apache/spark/pull/36856#discussion_r896341536


##
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/arithmetic.scala:
##
@@ -53,6 +53,17 @@ case class UnaryMinus(
   override def toString: String = s"-$child"
 
   private lazy val numeric = TypeUtils.getNumeric(dataType, failOnError)
+  private lazy val unaryMinusFunc: Any => Any = dataType match {

Review Comment:
   Oh, I see now that it depends on dataType; I understand what this does now.
   How many other places would you have to change, though, to cover all functions? Or are these the important ones?






[GitHub] [spark] dongjoon-hyun opened a new pull request, #36863: [SPARK-39459][CORE] localHostName* methods should support IPv6

2022-06-13 Thread GitBox


dongjoon-hyun opened a new pull request, #36863:
URL: https://github.com/apache/spark/pull/36863

   
   
   ### What changes were proposed in this pull request?
   
   
   
   ### Why are the changes needed?
   
   
   
   ### Does this PR introduce _any_ user-facing change?
   
   
   
   ### How was this patch tested?
   
   





[GitHub] [spark] ulysses-you commented on a diff in pull request #36856: [SPARK-39455][SQL] Improve expression non-codegen code path performance by cache data type matching

2022-06-13 Thread GitBox


ulysses-you commented on code in PR #36856:
URL: https://github.com/apache/spark/pull/36856#discussion_r89692


##
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/arithmetic.scala:
##
@@ -53,6 +53,17 @@ case class UnaryMinus(
   override def toString: String = s"-$child"
 
   private lazy val numeric = TypeUtils.getNumeric(dataType, failOnError)
+  private lazy val unaryMinusFunc: Any => Any = dataType match {

Review Comment:
   Do you mean rewriting it to this?
   ```scala
   private def unaryMinusFunc: Any => Any = dataType match {
     ..
   ```
   
   The reason I pulled it out and made it a lazy val is: the data type is known before eval runs in an expression. Let's say the data type is integer:
   - then the function can be eliminated to `input => numeric.negate(input)` during execution
   - if you declare it as a def, then during execution the function would be:
     ```scala
     dataType match {
       
       case _ => input => numeric.negate(input)
     ```
   
   The overhead is not about creating the function, but about eliminating the data type matching. So I think the lazy val is more efficient?
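   
   A self-contained sketch of that difference (the types below are hypothetical stand-ins, not the actual Catalyst classes): with a `lazy val` the `dataType` match runs once when the val is first forced, while a `def` would re-run it on every `eval` call.
   ```scala
   sealed trait DataType
   case object IntType extends DataType
   case object LongType extends DataType
   
   class Negate(dataType: DataType) {
     // Resolved once; eval then only pays for a function invocation per row.
     private lazy val negateFunc: Any => Any = dataType match {
       case IntType  => v => -v.asInstanceOf[Int]
       case LongType => v => -v.asInstanceOf[Long]
     }
   
     def eval(input: Any): Any = negateFunc(input)
   }
   ```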









[GitHub] [spark] dongjoon-hyun closed pull request #36862: [SPARK-39461][INFRA] Print `SPARK_LOCAL_(HOSTNAME|IP)` in `build/(mvn|sbt)`

2022-06-13 Thread GitBox


dongjoon-hyun closed pull request #36862: [SPARK-39461][INFRA] Print `SPARK_LOCAL_(HOSTNAME|IP)` in `build/(mvn|sbt)`
URL: https://github.com/apache/spark/pull/36862





[GitHub] [spark] dongjoon-hyun commented on pull request #36862: [SPARK-39461][INFRA] Print `SPARK_LOCAL_(HOSTNAME|IP)` in `build/(mvn|sbt)`

2022-06-13 Thread GitBox


dongjoon-hyun commented on PR #36862:
URL: https://github.com/apache/spark/pull/36862#issuecomment-1154655021

   Thank you so much, @sunchao ! Merged to master.





[GitHub] [spark] mridulm commented on pull request #35683: [SPARK-30835][SPARK-39018][CORE][YARN] Add support for YARN decommissioning when ESS is disabled

2022-06-13 Thread GitBox


mridulm commented on PR #35683:
URL: https://github.com/apache/spark/pull/35683#issuecomment-1154652812

   Merged to master.
   Thanks for working on this @abhishekd0907 !
   Thanks for the review @attilapiros :-)





[GitHub] [spark] mridulm closed pull request #35683: [SPARK-30835][SPARK-39018][CORE][YARN] Add support for YARN decommissioning when ESS is disabled

2022-06-13 Thread GitBox


mridulm closed pull request #35683: [SPARK-30835][SPARK-39018][CORE][YARN] Add support for YARN decommissioning when ESS is disabled
URL: https://github.com/apache/spark/pull/35683





[GitHub] [spark] srowen commented on a diff in pull request #36856: [SPARK-39455][SQL] Improve expression non-codegen code path performance by cache data type matching

2022-06-13 Thread GitBox


srowen commented on code in PR #36856:
URL: https://github.com/apache/spark/pull/36856#discussion_r896325472


##
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/arithmetic.scala:
##
@@ -53,6 +53,17 @@ case class UnaryMinus(
   override def toString: String = s"-$child"
 
   private lazy val numeric = TypeUtils.getNumeric(dataType, failOnError)
+  private lazy val unaryMinusFunc: Any => Any = dataType match {

Review Comment:
   I mean, why declare it this way instead of as a plain function? I don't see what difference it makes. There is nothing expensive about creating it, so why lazy?






[GitHub] [spark] beliefer commented on pull request #36830: [SPARK-39453][SQL] DS V2 supports push down misc non-aggregate functions(non ANSI)

2022-06-13 Thread GitBox


beliefer commented on PR #36830:
URL: https://github.com/apache/spark/pull/36830#issuecomment-1154641204

   > So the statement here [#36039 (comment)](https://github.com/apache/spark/pull/36039#issuecomment-1089567136) is not true any more, we will push down both ANSI functions and commonly used non-ANSI functions?
   
   Yes. The ANSI functions are too few, and the commonly used non-ANSI functions are also needed.
   This came out of a discussion between @cloud-fan and me.





[GitHub] [spark] huaxingao commented on pull request #36830: [SPARK-39453][SQL] DS V2 supports push down misc non-aggregate functions(non ANSI)

2022-06-13 Thread GitBox


huaxingao commented on PR #36830:
URL: https://github.com/apache/spark/pull/36830#issuecomment-1154634939

   So the statement here https://github.com/apache/spark/pull/36039#issuecomment-1089567136 is not true any more; we will push down both ANSI functions and commonly used non-ANSI functions?





[GitHub] [spark] AngersZhuuuu commented on pull request #36564: [SPARK-39195][SQL] Spark OutputCommitCoordinator should abort stage when committed file not consistent with task status

2022-06-13 Thread GitBox


AngersZhuuuu commented on PR #36564:
URL: https://github.com/apache/spark/pull/36564#issuecomment-1154633468

   > LGTM if tests pass
   
   The GA failure is not related to this PR:
   
   ```
   __w/spark/spark/docs/_plugins/copy_api_dirs.rb:130:in `': Python doc generation failed (RuntimeError)
   2022-06-11T07:16:53.8252035Z from /__w/spark/spark/docs/.local_ruby_bundle/ruby/2.7.0/gems/jekyll-4.2.1/lib/jekyll/external.rb:60:in `require'
   2022-06-11T07:16:53.8252686Z from /__w/spark/spark/docs/.local_ruby_bundle/ruby/2.7.0/gems/jekyll-4.2.1/lib/jekyll/external.rb:60:in `block in require_with_graceful_fail'
   2022-06-11T07:16:53.8253306Z from /__w/spark/spark/docs/.local_ruby_bundle/ruby/2.7.0/gems/jekyll-4.2.1/lib/jekyll/external.rb:57:in `each'
   2022-06-11T07:16:53.8253947Z from /__w/spark/spark/docs/.local_ruby_bundle/ruby/2.7.0/gems/jekyll-4.2.1/lib/jekyll/external.rb:57:in `require_with_graceful_fail'
   2022-06-11T07:16:53.8254613Z from /__w/spark/spark/docs/.local_ruby_bundle/ruby/2.7.0/gems/jekyll-4.2.1/lib/jekyll/plugin_manager.rb:89:in `block in require_plugin_files'
   2022-06-11T07:16:53.8255235Z from /__w/spark/spark/docs/.local_ruby_bundle/ruby/2.7.0/gems/jekyll-4.2.1/lib/jekyll/plugin_manager.rb:87:in `each'
   2022-06-11T07:16:53.8255870Z from /__w/spark/spark/docs/.local_ruby_bundle/ruby/2.7.0/gems/jekyll-4.2.1/lib/jekyll/plugin_manager.rb:87:in `require_plugin_files'
   2022-06-11T07:16:53.8256520Z from /__w/spark/spark/docs/.local_ruby_bundle/ruby/2.7.0/gems/jekyll-4.2.1/lib/jekyll/plugin_manager.rb:21:in `conscientious_require'
   2022-06-11T07:16:53.8257116Z from /__w/spark/spark/docs/.local_ruby_bundle/ruby/2.7.0/gems/jekyll-4.2.1/lib/jekyll/site.rb:131:in `setup'
   2022-06-11T07:16:53.8257695Z from /__w/spark/spark/docs/.local_ruby_bundle/ruby/2.7.0/gems/jekyll-4.2.1/lib/jekyll/site.rb:36:in `initialize'
   2022-06-11T07:16:53.8258278Z from /__w/spark/spark/docs/.local_ruby_bundle/ruby/2.7.0/gems/jekyll-4.2.1/lib/jekyll/commands/build.rb:30:in `new'
   2022-06-11T07:16:53.825Z from /__w/spark/spark/docs/.local_ruby_bundle/ruby/2.7.0/gems/jekyll-4.2.1/lib/jekyll/commands/build.rb:30:in `process'
   2022-06-11T07:16:53.8259731Z from /__w/spark/spark/docs/.local_ruby_bundle/ruby/2.7.0/gems/jekyll-4.2.1/lib/jekyll/command.rb:91:in `block in process_with_graceful_fail'
   2022-06-11T07:16:53.8260335Z from /__w/spark/spark/docs/.local_ruby_bundle/ruby/2.7.0/gems/jekyll-4.2.1/lib/jekyll/command.rb:91:in `each'
   2022-06-11T07:16:53.8260952Z from /__w/spark/spark/docs/.local_ruby_bundle/ruby/2.7.0/gems/jekyll-4.2.1/lib/jekyll/command.rb:91:in `process_with_graceful_fail'
   2022-06-11T07:16:53.8261625Z from /__w/spark/spark/docs/.local_ruby_bundle/ruby/2.7.0/gems/jekyll-4.2.1/lib/jekyll/commands/build.rb:18:in `block (2 levels) in init_with_program'
   2022-06-11T07:16:53.8262274Z from /__w/spark/spark/docs/.local_ruby_bundle/ruby/2.7.0/gems/mercenary-0.4.0/lib/mercenary/command.rb:221:in `block in execute'
   2022-06-11T07:16:53.8262921Z from /__w/spark/spark/docs/.local_ruby_bundle/ruby/2.7.0/gems/mercenary-0.4.0/lib/mercenary/command.rb:221:in `each'
   2022-06-11T07:16:53.8263524Z from /__w/spark/spark/docs/.local_ruby_bundle/ruby/2.7.0/gems/mercenary-0.4.0/lib/mercenary/command.rb:221:in `execute'
   2022-06-11T07:16:53.8264134Z from /__w/spark/spark/docs/.local_ruby_bundle/ruby/2.7.0/gems/mercenary-0.4.0/lib/mercenary/program.rb:44:in `go'
   2022-06-11T07:16:53.8264718Z from /__w/spark/spark/docs/.local_ruby_bundle/ruby/2.7.0/gems/mercenary-0.4.0/lib/mercenary.rb:21:in `program'
   2022-06-11T07:16:53.8265284Z from /__w/spark/spark/docs/.local_ruby_bundle/ruby/2.7.0/gems/jekyll-4.2.1/exe/jekyll:15:in `'
   2022-06-11T07:16:53.8265698Z from /__w/spark/spark/docs/.local_ruby_bundle/ruby/2.7.0/bin/jekyll:23:in `load'
   2022-06-11T07:16:53.8266159Z from /__w/spark/spark/docs/.local_ruby_bundle/ruby/2.7.0/bin/jekyll:23:in `'
   ```





[GitHub] [spark] AngersZhuuuu commented on a diff in pull request #36564: [SPARK-39195][SQL] Spark OutputCommitCoordinator should abort stage when committed file not consistent with task status

2022-06-13 Thread GitBox


AngersZhuuuu commented on code in PR #36564:
URL: https://github.com/apache/spark/pull/36564#discussion_r896311302


##
core/src/test/scala/org/apache/spark/scheduler/OutputCommitCoordinatorSuite.scala:
##
@@ -270,6 +263,16 @@ class OutputCommitCoordinatorSuite extends SparkFunSuite with BeforeAndAfter {
   .stageStart(meq(retriedStage.head), any())
 verify(sc.env.outputCommitCoordinator).stageEnd(meq(retriedStage.head))
   }
+
+  test("SPARK-39195: Spark OutputCommitCoordinator should abort stage " +

Review Comment:
   Removed






[GitHub] [spark] ulysses-you commented on a diff in pull request #36785: [SPARK-39397][SQL] Relax AliasAwareOutputExpression to support alias with expression

2022-06-13 Thread GitBox


ulysses-you commented on code in PR #36785:
URL: https://github.com/apache/spark/pull/36785#discussion_r896303277


##
sql/core/src/main/scala/org/apache/spark/sql/execution/AliasAwareOutputExpression.scala:
##
@@ -25,15 +25,15 @@ import org.apache.spark.sql.catalyst.plans.physical.{HashPartitioning, Partition
 trait AliasAwareOutputExpression extends UnaryExecNode {
   protected def outputExpressions: Seq[NamedExpression]
 
-  private lazy val aliasMap = AttributeMap(outputExpressions.collect {
-case a @ Alias(child: AttributeReference, _) => (child, a.toAttribute)
-  })
+  private lazy val aliasMap = outputExpressions.collect {
+case a @ Alias(child, _) => child.canonicalized -> a.toAttribute
+  }.toMap
 
   protected def hasAlias: Boolean = aliasMap.nonEmpty
 
   protected def normalizeExpression(exp: Expression): Expression = {
 exp.transform {

Review Comment:
   changed






[GitHub] [spark] ulysses-you commented on a diff in pull request #36856: [SPARK-39455][SQL] Improve expression non-codegen code path performance by cache data type matching

2022-06-13 Thread GitBox


ulysses-you commented on code in PR #36856:
URL: https://github.com/apache/spark/pull/36856#discussion_r896301945


##
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/arithmetic.scala:
##
@@ -53,6 +53,17 @@ case class UnaryMinus(
   override def toString: String = s"-$child"
 
   private lazy val numeric = TypeUtils.getNumeric(dataType, failOnError)
+  private lazy val unaryMinusFunc: Any => Any = dataType match {

Review Comment:
   hmm, this is called inside `eval`, so a cached function is expected, to avoid re-running the match for every row






[GitHub] [spark] xiuzhu9527 commented on pull request #36784: [SPARK-39396][SQL] Fix LDAP login exception 'error code 49 - invalid credentials'

2022-06-13 Thread GitBox


xiuzhu9527 commented on PR #36784:
URL: https://github.com/apache/spark/pull/36784#issuecomment-1154621166

   @wangyum @pan3793 Hi, what should we do next? Now I am very confused about whether this problem is allowed to be fixed.





[GitHub] [spark] AmplabJenkins commented on pull request #36859: DTW: new distance measure for clustering

2022-06-13 Thread GitBox


AmplabJenkins commented on PR #36859:
URL: https://github.com/apache/spark/pull/36859#issuecomment-1154620125

   Can one of the admins verify this patch?





[GitHub] [spark] AmplabJenkins commented on pull request #36861: [SPARK-38796][SQL] Update to_number and try_to_number functions to allow PR with positive numbers

2022-06-13 Thread GitBox


AmplabJenkins commented on PR #36861:
URL: https://github.com/apache/spark/pull/36861#issuecomment-1154620110

   Can one of the admins verify this patch?





[GitHub] [spark] cloud-fan commented on a diff in pull request #36740: [SPARK-39355][SQL] Avoid UnresolvedAttribute.apply throwing ParseException

2022-06-13 Thread GitBox


cloud-fan commented on code in PR #36740:
URL: https://github.com/apache/spark/pull/36740#discussion_r896290278


##
sql/core/src/test/scala/org/apache/spark/sql/SubquerySuite.scala:
##
@@ -2176,4 +2176,32 @@ class SubquerySuite extends QueryTest with SharedSparkSession with AdaptiveSpark
   }
 }
   }
+
+  test("SPARK-39355: Avoid UnresolvedAttribute.apply throwing ParseException") {
+checkAnswer(
+  sql("""
+|SELECT *

Review Comment:
   I think we should fix `Alias.toAttribute`. It should create an attribute that references the output of this alias, which is a single column, and we need to use `UnresolvedAttribute.quoted`.
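   
   For context, a simplified illustration of the difference between the two constructors:
   ```scala
   import org.apache.spark.sql.catalyst.analysis.UnresolvedAttribute
   
   // apply() parses the name, so dots split it into nested name parts.
   UnresolvedAttribute("a.b").nameParts        // Seq("a", "b")
   // quoted() keeps the whole string as a single column name.
   UnresolvedAttribute.quoted("a.b").nameParts // Seq("a.b")
   ```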






[GitHub] [spark] cloud-fan commented on a diff in pull request #36641: [SPARK-39263][SQL] Make GetTable, TableExists and DatabaseExists be compatible with 3 layer namespace

2022-06-13 Thread GitBox


cloud-fan commented on code in PR #36641:
URL: https://github.com/apache/spark/pull/36641#discussion_r896288305


##
sql/core/src/main/scala/org/apache/spark/sql/internal/CatalogImpl.scala:
##
@@ -287,16 +294,44 @@ class CatalogImpl(sparkSession: SparkSession) extends Catalog {
 * Checks if the database with the specified name exists.
 */
   override def databaseExists(dbName: String): Boolean = {
-sessionCatalog.databaseExists(dbName)
+// To maintain backwards compatibility, we first treat the input is a simple dbName and check
+// if sessionCatalog contains it. If no, we try to parse it as 3 part name. If the parased
+// identifier contains both catalog name and database name, we then search the database in the
+// catalog.
+if (!sessionCatalog.databaseExists(dbName)) {
+  val ident = sparkSession.sessionState.sqlParser.parseMultipartIdentifier(dbName)
+  if (ident.length == 2) {

Review Comment:
   We can follow what we did in `makeTable`: create an `UnresolvedNamespace` logical plan and resolve it.
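   
   A rough sketch of that direction, assuming the same analyzer path `makeTable` uses (error handling and the single-part case omitted):
   ```scala
   import org.apache.spark.sql.catalyst.analysis.UnresolvedNamespace
   
   // Let the analyzer resolve the multi-part namespace against the right
   // catalog instead of hand-matching on ident.length.
   val ident = sparkSession.sessionState.sqlParser.parseMultipartIdentifier(dbName)
   val resolved = sparkSession.sessionState.executePlan(UnresolvedNamespace(ident)).analyzed
   ```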






[GitHub] [spark] panbingkun commented on pull request #36852: [SPARK-38700][SQL][3.3] Use error classes in the execution errors of save mode

2022-06-13 Thread GitBox


panbingkun commented on PR #36852:
URL: https://github.com/apache/spark/pull/36852#issuecomment-1154604668

   > @panbingkun Could you fix the test failure, please:
   > 
   > ```
   > QueryExecutionErrorsSuite.UNSUPPORTED_SAVE_MODE: unsupported null saveMode whether the path exists or not
   > org.scalatest.exceptions.TestFailedException: "... supported for: a no[n-]existent path." did not equal "... supported for: a no[t ]existent
   > ```
   
   Done @MaxGekk 





[GitHub] [spark] zhouyejoe commented on a diff in pull request #35906: [SPARK-33236][shuffle] Enable Push-based shuffle service to store state in NM level DB for work preserving restart

2022-06-13 Thread GitBox


zhouyejoe commented on code in PR #35906:
URL: https://github.com/apache/spark/pull/35906#discussion_r896287031


##
common/network-shuffle/src/main/java/org/apache/spark/network/shuffle/RemoteBlockPushResolver.java:
##
@@ -317,22 +353,24 @@ public void applicationRemoved(String appId, boolean cleanupLocalDirs) {
 logger.info("Application {} removed, cleanupLocalDirs = {}", appId, cleanupLocalDirs);
 AppShuffleInfo appShuffleInfo = appsShuffleInfo.remove(appId);
 if (null != appShuffleInfo) {
-  mergedShuffleCleaner.execute(
-() -> closeAndDeletePartitionFilesIfNeeded(appShuffleInfo, cleanupLocalDirs));
+  submitCleanupTask(
+() -> closeAndDeletePartitions(appShuffleInfo, cleanupLocalDirs, true));
 }
+removeAppAttemptPathInfoFromDB(
+new AppAttemptId(appShuffleInfo.appId, appShuffleInfo.attemptId));

Review Comment:
   Updated as suggested






[GitHub] [spark] zhouyejoe commented on a diff in pull request #35906: [SPARK-33236][shuffle] Enable Push-based shuffle service to store state in NM level DB for work preserving restart

2022-06-13 Thread GitBox


zhouyejoe commented on code in PR #35906:
URL: https://github.com/apache/spark/pull/35906#discussion_r896286905


##
common/network-shuffle/src/main/java/org/apache/spark/network/shuffle/RemoteBlockPushResolver.java:
##
@@ -655,6 +742,206 @@ public void registerExecutor(String appId, ExecutorShuffleInfo executorInfo) {
 }
   }
 
+
+  @Override
+  public void close() {
+if (db != null) {
+  try {
+db.close();
+  } catch (IOException e) {
+logger.error("Exception closing leveldb with registered app paths info and "
++ "shuffle partition info", e);
+  }
+}
+  }
+
+  private void writeAppPathsInfoToDb(String appId, int attemptId, AppPathsInfo appPathsInfo) {
+if (db != null) {
+  try {
+byte[] key = getDbAppAttemptPathsKey(new AppAttemptId(appId, attemptId));
+String valueStr = mapper.writeValueAsString(appPathsInfo);
+byte[] value = valueStr.getBytes(StandardCharsets.UTF_8);
+db.put(key, value);
+  } catch (Exception e) {
+logger.error("Error saving registered app paths info", e);
+  }
+}
+  }
+
+  private void writeAppAttemptShuffleMergeInfoToDB(
+  String appId,
+  int appAttemptId,
+  int shuffleId,
+  int shuffleMergeId) {
+if (db != null) {
+  // Write AppAttemptShuffleMergeId into LevelDB for finalized shuffles
+  try{
+byte[] dbKey = getDbAppAttemptShufflePartitionKey(
+new AppAttemptShuffleMergeId(appId, appAttemptId, shuffleId, 
shuffleMergeId));
+db.put(dbKey, new byte[0]);
+  } catch (Exception e) {
+logger.error("Error saving active app shuffle partition", e);
+  }
+}
+
+  }
+
+  private <T> T parseDbKey(String key, String prefix, Class<T> valueType) {
+try {
+  String json = key.substring(prefix.length() + 1);
+  return mapper.readValue(json, valueType);
+} catch (Exception exception) {
+  logger.error("Exception while parsing the DB key {}", key);
+  return null;
+}
+  }
+
+  private AppPathsInfo parseDBAppAttemptPathsValue(byte[] value, AppAttemptId 
appAttemptId) {
+try {
+  return mapper.readValue(value, AppPathsInfo.class);
+} catch (Exception exception) {
+  logger.error("Exception while parsing the DB value for {}", 
appAttemptId);
+  return null;
+}
+  }
+
+  private AppAttemptId parseDbAppAttemptPathsKey(String key) {
+return parseDbKey(key, APP_ATTEMPT_PATH_KEY_PREFIX, AppAttemptId.class);
+  }
+
+  private AppAttemptShuffleMergeId parseDbAppAttemptShufflePartitionKey(
+  String key,
+  String prefix) {
+return parseDbKey(key, prefix, AppAttemptShuffleMergeId.class);
+  }
+
+  private byte[] getDbKey(Object key, String prefix) {
+// We add a common prefix on all the keys so we can find them in the DB
+try {
+  String keyJsonString = prefix + DB_KEY_DELIMITER + 
mapper.writeValueAsString(key);
+  return keyJsonString.getBytes(StandardCharsets.UTF_8);
+} catch (Exception exception) {
+  logger.error("Exception while generating the DB key {}", key);
+  return null;
+}
+  }
+
+  private byte[] getDbAppAttemptShufflePartitionKey(
+  AppAttemptShuffleMergeId appAttemptShuffleMergeId) {
+return getDbKey(appAttemptShuffleMergeId, 
APP_ATTEMPT_SHUFFLE_FINALIZE_STATUS_KEY_PREFIX);
+  }
+
+  private byte[] getDbAppAttemptPathsKey(AppAttemptId appAttemptId){
+return getDbKey(appAttemptId, APP_ATTEMPT_PATH_KEY_PREFIX);
+  }
+
+  @VisibleForTesting
+  void reloadAndCleanUpAppShuffleInfo(DB db) {
+logger.info("Reload applications merged shuffle information from DB");
+List<byte[]> dbKeysToBeRemoved = reloadActiveAppAttemptsPathInfo(db);
+dbKeysToBeRemoved.addAll(reloadFinalizedAppAttemptsShuffleMergeInfo(db));
+// Clean up invalid data stored in DB
+submitCleanupTask(() ->
+dbKeysToBeRemoved.forEach(
+(key) -> {
+  try {
+db.delete(key);
+  } catch (Exception e) {
+logger.error("Error deleting data in DB", e);
+  }
+}
+)
+);
+  }
+
+  private List<byte[]> reloadActiveAppAttemptsPathInfo(DB db) {
+List<byte[]> dbKeysToBeRemoved = new ArrayList<>();
+if (db != null) {
+  DBIterator itr = db.iterator();
+  itr.seek(APP_ATTEMPT_PATH_KEY_PREFIX.getBytes(StandardCharsets.UTF_8));
+  while (itr.hasNext()) {
+Map.Entry<byte[], byte[]> e = itr.next();
+String key = new String(e.getKey(), StandardCharsets.UTF_8);
+if (!key.startsWith(APP_ATTEMPT_PATH_KEY_PREFIX)) {
+  break;
+}
+AppAttemptId appAttemptId = parseDbAppAttemptPathsKey(key);
+if (appAttemptId == null) {
+  dbKeysToBeRemoved.add(e.getKey());
+  break;
+}
+AppPathsInfo appPathsInfo = parseDBAppAttemptPathsValue(e.getValue(), 
appAttemptId);
+if (appPathsInfo == null) {

Review Comment:
   Removed.




[GitHub] [spark] cloud-fan commented on a diff in pull request #36641: [SPARK-39263][SQL] Make GetTable, TableExists and DatabaseExists be compatible with 3 layer namespace

2022-06-13 Thread GitBox


cloud-fan commented on code in PR #36641:
URL: https://github.com/apache/spark/pull/36641#discussion_r896286734


##
sql/core/src/main/scala/org/apache/spark/sql/internal/CatalogImpl.scala:
##
@@ -250,8 +251,18 @@ class CatalogImpl(sparkSession: SparkSession) extends 
Catalog {
* table/view. This throws an `AnalysisException` when no `Table` can be 
found.
*/
   override def getTable(tableName: String): Table = {
-val tableIdent = 
sparkSession.sessionState.sqlParser.parseTableIdentifier(tableName)
-getTable(tableIdent.database.orNull, tableIdent.table)
+// calling `sqlParser.parseTableIdentifier` to parse tableName. If it contains only a table
+// name and optionally a database name (thus a TableIdentifier), then that is used to get
+// the table. Otherwise we try `sqlParser.parseMultipartIdentifier` to get a sequence of strings

Review Comment:
   And a follow-up idea: if we can't find this table in HMS, shall we fall back to looking up the table via the v2 code path?

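   One way that fallback could look, as a rough sketch against `CatalogImpl` (the `getTableFromV2Catalog` helper is hypothetical, not code from this PR; `ParseException` extends `AnalysisException`, so one catch covers both parse and lookup failures):

   ```
   // Try the HMS (v1) lookup first; if the name cannot be parsed or resolved
   // there, fall back to resolving the multi-part name via the v2 code path.
   override def getTable(tableName: String): Table = {
     try {
       val ident = sparkSession.sessionState.sqlParser.parseTableIdentifier(tableName)
       getTable(ident.database.orNull, ident.table) // v1 / HMS lookup
     } catch {
       case _: AnalysisException =>
         val parts = sparkSession.sessionState.sqlParser.parseMultipartIdentifier(tableName)
         getTableFromV2Catalog(parts) // hypothetical helper for the v2 lookup
     }
   }
   ```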





[GitHub] [spark] zhouyejoe commented on a diff in pull request #35906: [SPARK-33236][shuffle] Enable Push-based shuffle service to store state in NM level DB for work preserving restart

2022-06-13 Thread GitBox


zhouyejoe commented on code in PR #35906:
URL: https://github.com/apache/spark/pull/35906#discussion_r896286777


##
common/network-shuffle/src/main/java/org/apache/spark/network/shuffle/RemoteBlockPushResolver.java:
##
@@ -643,8 +725,13 @@ public void registerExecutor(String appId, 
ExecutorShuffleInfo executorInfo) {
 AppShuffleInfo appShuffleInfo = originalAppShuffleInfo.get();
 logger.warn("Cleanup shuffle info and merged shuffle files for 
{}_{} as new " +
 "application attempt registered", appId, 
appShuffleInfo.attemptId);
-mergedShuffleCleaner.execute(
-  () -> closeAndDeletePartitionFilesIfNeeded(appShuffleInfo, 
true));
+// Clean up all the merge shuffle related information in the DB 
for the former attempt
+submitCleanupTask(
+  () -> {
+removeAppAttemptPathInfoFromDB(new AppAttemptId(appId, 
appShuffleInfo.attemptId));

Review Comment:
   Updated to keep the DB and the in-memory hashmap strongly consistent.






[GitHub] [spark] zhouyejoe commented on a diff in pull request #35906: [SPARK-33236][shuffle] Enable Push-based shuffle service to store state in NM level DB for work preserving restart

2022-06-13 Thread GitBox


zhouyejoe commented on code in PR #35906:
URL: https://github.com/apache/spark/pull/35906#discussion_r896286599


##
common/network-shuffle/src/main/java/org/apache/spark/network/shuffle/RemoteBlockPushResolver.java:
##
@@ -342,6 +380,33 @@ void closeAndDeletePartitionFilesIfNeeded(
 if (cleanupLocalDirs) {
   deleteExecutorDirs(appShuffleInfo);
 }
+if (removeFromDb){
+  removeAppShuffleInfoFromDB(appShuffleInfo);
+}
+  }
+
+  private void removeAppAttemptPathInfoFromDB(AppAttemptId appAttemptId) {
+if (db != null) {
+  try {
+db.delete(getDbAppAttemptPathsKey(appAttemptId));
+  } catch (Exception e) {
+logger.error("Error deleting {} from application paths info in DB", 
appAttemptId, e);
+  }
+}
+  }
+
+  private void removeAppShuffleInfoFromDB(AppShuffleInfo appShuffleInfo) {
+if (db != null) {
+  appShuffleInfo.shuffles
+.forEach((shuffleId, shuffleInfo) -> shuffleInfo.shuffleMergePartitions
+  .forEach((shuffleMergeId, partitionInfo) -> {
+synchronized (partitionInfo) {
+  removeAppShufflePartitionInfoFromDB(
+new AppAttemptShuffleMergeId(
+  appShuffleInfo.appId, appShuffleInfo.attemptId, shuffleId, 
shuffleMergeId));
+}

Review Comment:
   Refactored the code






[GitHub] [spark] cloud-fan commented on a diff in pull request #36641: [SPARK-39263][SQL] Make GetTable, TableExists and DatabaseExists be compatible with 3 layer namespace

2022-06-13 Thread GitBox


cloud-fan commented on code in PR #36641:
URL: https://github.com/apache/spark/pull/36641#discussion_r896286460


##
sql/core/src/main/scala/org/apache/spark/sql/internal/CatalogImpl.scala:
##
@@ -250,8 +251,18 @@ class CatalogImpl(sparkSession: SparkSession) extends 
Catalog {
* table/view. This throws an `AnalysisException` when no `Table` can be 
found.
*/
   override def getTable(tableName: String): Table = {
-val tableIdent = 
sparkSession.sessionState.sqlParser.parseTableIdentifier(tableName)
-getTable(tableIdent.database.orNull, tableIdent.table)
+// calling `sqlParser.parseTableIdentifier` to parse tableName. If it contains only a table
+// name and optionally a database name (thus a TableIdentifier), then that is used to get
+// the table. Otherwise we try `sqlParser.parseMultipartIdentifier` to get a sequence of strings

Review Comment:
   let's be more specific about the semantics. If the table name has fewer than 3 parts, we always get the table from HMS first.






[GitHub] [spark] zhouyejoe commented on a diff in pull request #35906: [SPARK-33236][shuffle] Enable Push-based shuffle service to store state in NM level DB for work preserving restart

2022-06-13 Thread GitBox


zhouyejoe commented on code in PR #35906:
URL: https://github.com/apache/spark/pull/35906#discussion_r896286443


##
common/network-shuffle/src/main/java/org/apache/spark/network/shuffle/RemoteBlockPushResolver.java:
##
@@ -655,6 +742,206 @@ public void registerExecutor(String appId, 
ExecutorShuffleInfo executorInfo) {
 }
   }
 
+
+  @Override
+  public void close() {
+if (db != null) {
+  try {
+db.close();
+  } catch (IOException e) {
+logger.error("Exception closing leveldb with registered app paths info 
and "
++ "shuffle partition info", e);
+  }
+}
+  }
+
+  private void writeAppPathsInfoToDb(String appId, int attemptId, AppPathsInfo 
appPathsInfo) {
+if (db != null) {
+  try {
+byte[] key = getDbAppAttemptPathsKey(new AppAttemptId(appId, 
attemptId));
+String valueStr = mapper.writeValueAsString(appPathsInfo);
+byte[] value = valueStr.getBytes(StandardCharsets.UTF_8);
+db.put(key, value);
+  } catch (Exception e) {
+logger.error("Error saving registered app paths info", e);
+  }
+}
+  }
+
+  private void writeAppAttemptShuffleMergeInfoToDB(
+  String appId,
+  int appAttemptId,
+  int shuffleId,
+  int shuffleMergeId) {
+if (db != null) {
+  // Write AppAttemptShuffleMergeId into LevelDB for finalized shuffles
+  try{
+byte[] dbKey = getDbAppAttemptShufflePartitionKey(
+new AppAttemptShuffleMergeId(appId, appAttemptId, shuffleId, 
shuffleMergeId));
+db.put(dbKey, new byte[0]);
+  } catch (Exception e) {
+logger.error("Error saving active app shuffle partition", e);
+  }
+}
+
+  }
+
+  private <T> T parseDbKey(String key, String prefix, Class<T> valueType) {
+try {
+  String json = key.substring(prefix.length() + 1);
+  return mapper.readValue(json, valueType);
+} catch (Exception exception) {
+  logger.error("Exception while parsing the DB key {}", key);
+  return null;
+}
+  }
+
+  private AppPathsInfo parseDBAppAttemptPathsValue(byte[] value, AppAttemptId 
appAttemptId) {
+try {
+  return mapper.readValue(value, AppPathsInfo.class);
+} catch (Exception exception) {
+  logger.error("Exception while parsing the DB value for {}", 
appAttemptId);
+  return null;
+}
+  }
+
+  private AppAttemptId parseDbAppAttemptPathsKey(String key) {
+return parseDbKey(key, APP_ATTEMPT_PATH_KEY_PREFIX, AppAttemptId.class);
+  }
+
+  private AppAttemptShuffleMergeId parseDbAppAttemptShufflePartitionKey(
+  String key,
+  String prefix) {
+return parseDbKey(key, prefix, AppAttemptShuffleMergeId.class);
+  }
+
+  private byte[] getDbKey(Object key, String prefix) {
+// We add a common prefix on all the keys so we can find them in the DB
+try {
+  String keyJsonString = prefix + DB_KEY_DELIMITER + 
mapper.writeValueAsString(key);
+  return keyJsonString.getBytes(StandardCharsets.UTF_8);
+} catch (Exception exception) {
+  logger.error("Exception while generating the DB key {}", key);
+  return null;
+}
+  }
+
+  private byte[] getDbAppAttemptShufflePartitionKey(
+  AppAttemptShuffleMergeId appAttemptShuffleMergeId) {
+return getDbKey(appAttemptShuffleMergeId, 
APP_ATTEMPT_SHUFFLE_FINALIZE_STATUS_KEY_PREFIX);
+  }
+
+  private byte[] getDbAppAttemptPathsKey(AppAttemptId appAttemptId){
+return getDbKey(appAttemptId, APP_ATTEMPT_PATH_KEY_PREFIX);
+  }
+
+  @VisibleForTesting
+  void reloadAndCleanUpAppShuffleInfo(DB db) {
+logger.info("Reload applications merged shuffle information from DB");
+List<byte[]> dbKeysToBeRemoved = reloadActiveAppAttemptsPathInfo(db);
+dbKeysToBeRemoved.addAll(reloadFinalizedAppAttemptsShuffleMergeInfo(db));
+// Clean up invalid data stored in DB
+submitCleanupTask(() ->
+dbKeysToBeRemoved.forEach(
+(key) -> {
+  try {
+db.delete(key);
+  } catch (Exception e) {
+logger.error("Error deleting data in DB", e);
+  }
+}
+)
+);
+  }
+
+  private List<byte[]> reloadActiveAppAttemptsPathInfo(DB db) {
+List<byte[]> dbKeysToBeRemoved = new ArrayList<>();
+if (db != null) {
+  DBIterator itr = db.iterator();
+  itr.seek(APP_ATTEMPT_PATH_KEY_PREFIX.getBytes(StandardCharsets.UTF_8));
+  while (itr.hasNext()) {
+Map.Entry<byte[], byte[]> e = itr.next();
+String key = new String(e.getKey(), StandardCharsets.UTF_8);
+if (!key.startsWith(APP_ATTEMPT_PATH_KEY_PREFIX)) {
+  break;
+}
+AppAttemptId appAttemptId = parseDbAppAttemptPathsKey(key);
+if (appAttemptId == null) {
+  dbKeysToBeRemoved.add(e.getKey());
+  break;
+}
+AppPathsInfo appPathsInfo = parseDBAppAttemptPathsValue(e.getValue(), 
appAttemptId);
+if (appPathsInfo == null) {
+  dbKeysToBeRemoved.add(e.getKey());
+  break;
+   

[GitHub] [spark] zhouyejoe commented on a diff in pull request #35906: [SPARK-33236][shuffle] Enable Push-based shuffle service to store state in NM level DB for work preserving restart

2022-06-13 Thread GitBox


zhouyejoe commented on code in PR #35906:
URL: https://github.com/apache/spark/pull/35906#discussion_r896286390


##
common/network-shuffle/src/main/java/org/apache/spark/network/shuffle/RemoteBlockPushResolver.java:
##
@@ -655,6 +742,206 @@ public void registerExecutor(String appId, 
ExecutorShuffleInfo executorInfo) {
 }
   }
 
+
+  @Override
+  public void close() {
+if (db != null) {
+  try {
+db.close();
+  } catch (IOException e) {
+logger.error("Exception closing leveldb with registered app paths info 
and "
++ "shuffle partition info", e);
+  }
+}
+  }
+
+  private void writeAppPathsInfoToDb(String appId, int attemptId, AppPathsInfo 
appPathsInfo) {
+if (db != null) {
+  try {
+byte[] key = getDbAppAttemptPathsKey(new AppAttemptId(appId, 
attemptId));
+String valueStr = mapper.writeValueAsString(appPathsInfo);
+byte[] value = valueStr.getBytes(StandardCharsets.UTF_8);
+db.put(key, value);
+  } catch (Exception e) {
+logger.error("Error saving registered app paths info", e);
+  }
+}
+  }
+
+  private void writeAppAttemptShuffleMergeInfoToDB(
+  String appId,
+  int appAttemptId,
+  int shuffleId,
+  int shuffleMergeId) {
+if (db != null) {
+  // Write AppAttemptShuffleMergeId into LevelDB for finalized shuffles
+  try{
+byte[] dbKey = getDbAppAttemptShufflePartitionKey(
+new AppAttemptShuffleMergeId(appId, appAttemptId, shuffleId, 
shuffleMergeId));
+db.put(dbKey, new byte[0]);
+  } catch (Exception e) {
+logger.error("Error saving active app shuffle partition", e);
+  }
+}
+
+  }
+
+  private <T> T parseDbKey(String key, String prefix, Class<T> valueType) {
+try {
+  String json = key.substring(prefix.length() + 1);
+  return mapper.readValue(json, valueType);
+} catch (Exception exception) {
+  logger.error("Exception while parsing the DB key {}", key);
+  return null;
+}
+  }
+
+  private AppPathsInfo parseDBAppAttemptPathsValue(byte[] value, AppAttemptId 
appAttemptId) {
+try {
+  return mapper.readValue(value, AppPathsInfo.class);
+} catch (Exception exception) {
+  logger.error("Exception while parsing the DB value for {}", 
appAttemptId);
+  return null;
+}
+  }
+
+  private AppAttemptId parseDbAppAttemptPathsKey(String key) {
+return parseDbKey(key, APP_ATTEMPT_PATH_KEY_PREFIX, AppAttemptId.class);
+  }
+
+  private AppAttemptShuffleMergeId parseDbAppAttemptShufflePartitionKey(
+  String key,
+  String prefix) {
+return parseDbKey(key, prefix, AppAttemptShuffleMergeId.class);
+  }
+
+  private byte[] getDbKey(Object key, String prefix) {
+// We add a common prefix on all the keys so we can find them in the DB
+try {
+  String keyJsonString = prefix + DB_KEY_DELIMITER + 
mapper.writeValueAsString(key);
+  return keyJsonString.getBytes(StandardCharsets.UTF_8);
+} catch (Exception exception) {
+  logger.error("Exception while generating the DB key {}", key);
+  return null;
+}
+  }
+
+  private byte[] getDbAppAttemptShufflePartitionKey(
+  AppAttemptShuffleMergeId appAttemptShuffleMergeId) {
+return getDbKey(appAttemptShuffleMergeId, 
APP_ATTEMPT_SHUFFLE_FINALIZE_STATUS_KEY_PREFIX);
+  }
+
+  private byte[] getDbAppAttemptPathsKey(AppAttemptId appAttemptId){
+return getDbKey(appAttemptId, APP_ATTEMPT_PATH_KEY_PREFIX);
+  }
+
+  @VisibleForTesting
+  void reloadAndCleanUpAppShuffleInfo(DB db) {
+logger.info("Reload applications merged shuffle information from DB");
+List<byte[]> dbKeysToBeRemoved = reloadActiveAppAttemptsPathInfo(db);
+dbKeysToBeRemoved.addAll(reloadFinalizedAppAttemptsShuffleMergeInfo(db));
+// Clean up invalid data stored in DB
+submitCleanupTask(() ->
+dbKeysToBeRemoved.forEach(
+(key) -> {
+  try {
+db.delete(key);
+  } catch (Exception e) {
+logger.error("Error deleting data in DB", e);
+  }
+}
+)
+);
+  }
+
+  private List<byte[]> reloadActiveAppAttemptsPathInfo(DB db) {
+List<byte[]> dbKeysToBeRemoved = new ArrayList<>();
+if (db != null) {
+  DBIterator itr = db.iterator();
+  itr.seek(APP_ATTEMPT_PATH_KEY_PREFIX.getBytes(StandardCharsets.UTF_8));
+  while (itr.hasNext()) {
+Map.Entry<byte[], byte[]> e = itr.next();
+String key = new String(e.getKey(), StandardCharsets.UTF_8);
+if (!key.startsWith(APP_ATTEMPT_PATH_KEY_PREFIX)) {
+  break;
+}
+AppAttemptId appAttemptId = parseDbAppAttemptPathsKey(key);
+if (appAttemptId == null) {

Review Comment:
   Removed




[GitHub] [spark] cloud-fan commented on pull request #36861: [SPARK-38796][SQL] Update to_number and try_to_number functions to allow PR with positive numbers

2022-06-13 Thread GitBox


cloud-fan commented on PR #36861:
URL: https://github.com/apache/spark/pull/36861#issuecomment-1154596720

   @dtenedor for a bug fix, we should create a new JIRA ticket instead of reusing the original one...





[GitHub] [spark] cloud-fan closed pull request #36861: [SPARK-38796][SQL] Update to_number and try_to_number functions to allow PR with positive numbers

2022-06-13 Thread GitBox


cloud-fan closed pull request #36861: [SPARK-38796][SQL] Update to_number and 
try_to_number functions to allow PR with positive numbers
URL: https://github.com/apache/spark/pull/36861





[GitHub] [spark] cloud-fan commented on pull request #36861: [SPARK-38796][SQL] Update to_number and try_to_number functions to allow PR with positive numbers

2022-06-13 Thread GitBox


cloud-fan commented on PR #36861:
URL: https://github.com/apache/spark/pull/36861#issuecomment-1154596074

   thanks, merging to master/3.3!





[GitHub] [spark] zhouyejoe commented on a diff in pull request #35906: [SPARK-33236][shuffle] Enable Push-based shuffle service to store state in NM level DB for work preserving restart

2022-06-13 Thread GitBox


zhouyejoe commented on code in PR #35906:
URL: https://github.com/apache/spark/pull/35906#discussion_r896281336


##
common/network-yarn/src/main/java/org/apache/spark/network/yarn/YarnShuffleService.java:
##
@@ -230,11 +241,14 @@ protected void serviceInit(Configuration externalConf) 
throws Exception {
   // when it comes back
   if (_recoveryPath != null) {
 registeredExecutorFile = initRecoveryDb(RECOVERY_FILE_NAME);
+mergeManagerFile = 
initRecoveryDb(SPARK_SHUFFLE_MERGE_RECOVERY_FILE_NAME);
   }
 
-  TransportConf transportConf = new TransportConf("shuffle", new 
HadoopConfigProvider(_conf));
-  MergedShuffleFileManager shuffleMergeManager = 
newMergedShuffleFileManagerInstance(
-transportConf);
+  TransportConf transportConf = new TransportConf("shuffle",new 
HadoopConfigProvider(_conf));
+  if (shuffleMergeManager == null) {

Review Comment:
   Added comments.






[GitHub] [spark] zhouyejoe commented on a diff in pull request #35906: [SPARK-33236][shuffle] Enable Push-based shuffle service to store state in NM level DB for work preserving restart

2022-06-13 Thread GitBox


zhouyejoe commented on code in PR #35906:
URL: https://github.com/apache/spark/pull/35906#discussion_r896279364


##
common/network-shuffle/src/main/java/org/apache/spark/network/shuffle/RemoteBlockPushResolver.java:
##
@@ -350,15 +415,27 @@ void closeAndDeletePartitionFilesIfNeeded(
* up older shuffleMergeId partitions. The cleanup will be executed in a 
separate thread.
*/
   @VisibleForTesting
-  void closeAndDeletePartitionFiles(Map<Integer, AppShufflePartitionInfo> partitions) {
+  void closeAndDeleteOutdatedPartitions(Map<Integer, AppShufflePartitionInfo> partitions) {
 partitions
   .forEach((partitionId, partitionInfo) -> {
 synchronized (partitionInfo) {
   partitionInfo.closeAllFilesAndDeleteIfNeeded(true);
+  
removeAppShufflePartitionInfoFromDB(partitionInfo.appAttemptShuffleMergeId);

Review Comment:
   Refactored the signature of closeAndDeleteOutdatedPartitions to take the AppAttemptShuffleMergeId, since each time it is called there is only one call to remove the finalized partitionInfo from the DB, keyed by AppAttemptShuffleMergeId.






[GitHub] [spark] zhouyejoe commented on a diff in pull request #35906: [SPARK-33236][shuffle] Enable Push-based shuffle service to store state in NM level DB for work preserving restart

2022-06-13 Thread GitBox


zhouyejoe commented on code in PR #35906:
URL: https://github.com/apache/spark/pull/35906#discussion_r896274975


##
common/network-shuffle/src/main/java/org/apache/spark/network/shuffle/RemoteBlockPushResolver.java:
##
@@ -317,22 +353,24 @@ public void applicationRemoved(String appId, boolean 
cleanupLocalDirs) {
 logger.info("Application {} removed, cleanupLocalDirs = {}", appId, 
cleanupLocalDirs);
 AppShuffleInfo appShuffleInfo = appsShuffleInfo.remove(appId);
 if (null != appShuffleInfo) {
-  mergedShuffleCleaner.execute(
-() -> closeAndDeletePartitionFilesIfNeeded(appShuffleInfo, 
cleanupLocalDirs));
+  submitCleanupTask(
+() -> closeAndDeletePartitions(appShuffleInfo, cleanupLocalDirs, 
true));
 }
+removeAppAttemptPathInfoFromDB(
+new AppAttemptId(appShuffleInfo.appId, appShuffleInfo.attemptId));
   }
 
-
   /**
* Clean up the AppShufflePartitionInfo for a specific AppShuffleInfo.
* If cleanupLocalDirs is true, the merged shuffle files will also be 
deleted.
* The cleanup will be executed in a separate thread.
*/
   @SuppressWarnings("SynchronizationOnLocalVariableOrMethodParameter")
   @VisibleForTesting
-  void closeAndDeletePartitionFilesIfNeeded(
+  void closeAndDeletePartitions(
   AppShuffleInfo appShuffleInfo,
-  boolean cleanupLocalDirs) {
+  boolean cleanupLocalDirs,
+  boolean removeFromDb) {

Review Comment:
   This is to handle the case you mentioned earlier, where the merged shuffle data has been removed from disk through some API (TBD in another ticket for cleaning up merged shuffle data during job runtime) but the information in the DB should be kept. Right now we don't have that API in place, so all callers set this flag to true.






[GitHub] [spark] HyukjinKwon commented on a diff in pull request #36841: [SPARK-39444][SQL] Add OptimizeSubqueries into nonExcludableRules list

2022-06-13 Thread GitBox


HyukjinKwon commented on code in PR #36841:
URL: https://github.com/apache/spark/pull/36841#discussion_r896274952


##
sql/core/src/test/scala/org/apache/spark/sql/SQLQuerySuite.scala:
##
@@ -4456,6 +4456,24 @@ class SQLQuerySuite extends QueryTest with 
SharedSparkSession with AdaptiveSpark
 """.stripMargin),
   Seq(Row(2), Row(1)))
   }
+
+  test("SPARK-39444: Add OptimizeSubqueries into nonExcludableRules list") {
+withSQLConf(
+  SQLConf.OPTIMIZER_EXCLUDED_RULES.key ->
+
"org.apache.spark.sql.catalyst.optimizer.Optimizer$OptimizeSubqueries") {

Review Comment:
   and let's put the test in the same place as https://github.com/apache/spark/pull/36847






[GitHub] [spark] srowen closed pull request #36273: [SPARK-38960][CORE]Spark should fail fast if initial memory too large…

2022-06-13 Thread GitBox


srowen closed pull request #36273: [SPARK-38960][CORE]Spark should fail fast if 
initial memory too large…
URL: https://github.com/apache/spark/pull/36273





[GitHub] [spark] HyukjinKwon commented on a diff in pull request #36841: [SPARK-39444][SQL] Add OptimizeSubqueries into nonExcludableRules list

2022-06-13 Thread GitBox


HyukjinKwon commented on code in PR #36841:
URL: https://github.com/apache/spark/pull/36841#discussion_r896274142


##
sql/core/src/test/scala/org/apache/spark/sql/SQLQuerySuite.scala:
##
@@ -4456,6 +4456,24 @@ class SQLQuerySuite extends QueryTest with 
SharedSparkSession with AdaptiveSpark
 """.stripMargin),
   Seq(Row(2), Row(1)))
   }
+
+  test("SPARK-39444: Add OptimizeSubqueries into nonExcludableRules list") {
+withSQLConf(
+  SQLConf.OPTIMIZER_EXCLUDED_RULES.key ->
+
"org.apache.spark.sql.catalyst.optimizer.Optimizer$OptimizeSubqueries") {

Review Comment:
   nit but can we use a class reference instead of a string?






[GitHub] [spark] github-actions[bot] closed pull request #35256: [SPARK-37933][SQL] Limit push down for parquet vectorized reader

2022-06-13 Thread GitBox


github-actions[bot] closed pull request #35256: [SPARK-37933][SQL] Limit push 
down for parquet vectorized reader
URL: https://github.com/apache/spark/pull/35256





[GitHub] [spark] github-actions[bot] closed pull request #35719: [SPARK-38401][SQL][CORE] Unify get preferred locations for shuffle in AQE

2022-06-13 Thread GitBox


github-actions[bot] closed pull request #35719: [SPARK-38401][SQL][CORE] Unify 
get preferred locations for shuffle in AQE
URL: https://github.com/apache/spark/pull/35719





[GitHub] [spark] HyukjinKwon commented on pull request #36729: [SPARK-39295][PYTHON][DOCS] Improve documentation of pandas API support list

2022-06-13 Thread GitBox


HyukjinKwon commented on PR #36729:
URL: https://github.com/apache/spark/pull/36729#issuecomment-1154570503

   Sure, thanks for tracking this.





[GitHub] [spark] cloud-fan commented on pull request #36837: [SPARK-39441][SQL] Speed up DeduplicateRelations

2022-06-13 Thread GitBox


cloud-fan commented on PR #36837:
URL: https://github.com/apache/spark/pull/36837#issuecomment-1154563272

   thanks, merging to master!





[GitHub] [spark] cloud-fan closed pull request #36837: [SPARK-39441][SQL] Speed up DeduplicateRelations

2022-06-13 Thread GitBox


cloud-fan closed pull request #36837: [SPARK-39441][SQL] Speed up 
DeduplicateRelations
URL: https://github.com/apache/spark/pull/36837





[GitHub] [spark] dongjoon-hyun commented on pull request #36862: [SPARK-39461][INFRA] Print `SPARK_LOCAL_(HOSTNAME|IP)` in `build/(mvn|sbt)`

2022-06-13 Thread GitBox


dongjoon-hyun commented on PR #36862:
URL: https://github.com/apache/spark/pull/36862#issuecomment-1154557430

   Hi, @sunchao . Could you review this when you have some time?





[GitHub] [spark] dongjoon-hyun closed pull request #36860: [SPARK-39460][CORE][TESTS] Fix `CoarseGrainedSchedulerBackendSuite` to handle fast allocations

2022-06-13 Thread GitBox


dongjoon-hyun closed pull request #36860: [SPARK-39460][CORE][TESTS] Fix 
`CoarseGrainedSchedulerBackendSuite` to handle fast allocations
URL: https://github.com/apache/spark/pull/36860





[GitHub] [spark] dongjoon-hyun commented on pull request #36860: [SPARK-39460][CORE][TESTS] Fix `CoarseGrainedSchedulerBackendSuite` to handle fast allocations

2022-06-13 Thread GitBox


dongjoon-hyun commented on PR #36860:
URL: https://github.com/apache/spark/pull/36860#issuecomment-1154551586

   Thank you so much, @huaxingao !





[GitHub] [spark] huaxingao commented on pull request #36860: [SPARK-39460][CORE][TESTS] Fix `CoarseGrainedSchedulerBackendSuite` to handle fast allocations

2022-06-13 Thread GitBox


huaxingao commented on PR #36860:
URL: https://github.com/apache/spark/pull/36860#issuecomment-1154541511

   LGTM. Thanks for pinging me!





[GitHub] [spark] srowen commented on a diff in pull request #36843: [SPARK-39446][MLLIB] Add relevance score for nDCG evaluation

2022-06-13 Thread GitBox


srowen commented on code in PR #36843:
URL: https://github.com/apache/spark/pull/36843#discussion_r896243825


##
mllib/src/main/scala/org/apache/spark/mllib/evaluation/RankingMetrics.scala:
##
@@ -35,8 +35,16 @@ import org.apache.spark.rdd.RDD
  * @param predictionAndLabels an RDD of (predicted ranking, ground truth set) 
pairs.
  */
 @Since("1.2.0")
-class RankingMetrics[T: ClassTag](predictionAndLabels: RDD[(Array[T], 
Array[T])])
-  extends Logging with Serializable {
+class RankingMetrics[T: ClassTag](
+predictionAndLabels: RDD[(Array[T], Array[T], Array[(T, Double)])])

Review Comment:
   Hm, why does the last need to be (T, Double) pairs? Wouldn't this be an 
attribute of the ground truth? Looking this up via Map seems clunky later. The 
problem is not changing the binary signature, but, at least, how about a third 
array of Double only, that is parallel to the second array?

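   For illustration, a tiny self-contained sketch of the parallel-array shape suggested here (the method and names are illustrative only, not the PR's API):

   ```
   // Ground-truth labels plus a third array of relevance scores parallel to
   // them, instead of (T, Double) pairs: rel(i) is the relevance of labels(i).
   def dcgAt(pred: Array[String], labels: Array[String], rel: Array[Double], k: Int): Double = {
     val relOf = labels.zip(rel).toMap // label -> relevance
     pred.take(k).zipWithIndex.map { case (item, i) =>
       relOf.getOrElse(item, 0.0) / (math.log(i + 2) / math.log(2)) // log2(i + 2) discount
     }.sum
   }

   // dcgAt(Array("a", "b", "c"), Array("a", "c"), Array(3.0, 1.0), k = 3)
   ```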


##
mllib/src/main/scala/org/apache/spark/mllib/evaluation/RankingMetrics.scala:
##
@@ -58,9 +66,12 @@ class RankingMetrics[T: ClassTag](predictionAndLabels: 
RDD[(Array[T], Array[T])]
   @Since("1.2.0")
   def precisionAt(k: Int): Double = {

Review Comment:
   Is there any similar notion of precision@k when the ground truth has a 
relevance 'weight'?






[GitHub] [spark] allisonwang-db commented on pull request #36837: [SPARK-39441][SQL] Speed up DeduplicateRelations

2022-06-13 Thread GitBox


allisonwang-db commented on PR #36837:
URL: https://github.com/apache/spark/pull/36837#issuecomment-1154536909

   cc @cloud-fan 





[GitHub] [spark] allisonwang-db commented on pull request #36837: [SPARK-39441][SQL] Speed up DeduplicateRelations

2022-06-13 Thread GitBox


allisonwang-db commented on PR #36837:
URL: https://github.com/apache/spark/pull/36837#issuecomment-1154536858

   @LuciferYang I am running on M1 as well. Indeed the runtime for the 
TPCDSQuerySuite can vary over the runs. 





[GitHub] [spark] srowen closed pull request #35290: [SPARK-37865][SQL][3.0]Fix union bug when the first child of union has duplicate columns

2022-06-13 Thread GitBox


srowen closed pull request #35290: [SPARK-37865][SQL][3.0]Fix union bug when 
the first child of union has duplicate columns
URL: https://github.com/apache/spark/pull/35290





[GitHub] [spark] srowen closed pull request #36334: Refactor tests

2022-06-13 Thread GitBox


srowen closed pull request #36334: Refactor tests
URL: https://github.com/apache/spark/pull/36334





[GitHub] [spark] dongjoon-hyun opened a new pull request, #36862: [SPARK-39461][INFRA] Print `SPARK_LOCAL_(HOSTNAME|IP)` in `build/{mvn|sbt}`

2022-06-13 Thread GitBox


dongjoon-hyun opened a new pull request, #36862:
URL: https://github.com/apache/spark/pull/36862

   ### What changes were proposed in this pull request?
   
   This PR aims to print `SPARK_LOCAL_(HOSTNAME|IP)` during building and 
testing at `build/{mvn|sbt}`.
   
   ### Why are the changes needed?
   
   `SPARK_LOCAL_HOSTNAME` and `SPARK_LOCAL_IP` are used during testing and the 
test result depends on them.
   
   ### Does this PR introduce _any_ user-facing change?
   
   No.
   
   ### How was this patch tested?
   
   Pass the CIs and check GitHub Action logs.





[GitHub] [spark] srowen closed pull request #35586: [SPARK-38265][DOCS][CORE] Update comments of ExecutorAllocationClient

2022-06-13 Thread GitBox


srowen closed pull request #35586: [SPARK-38265][DOCS][CORE] Update comments of 
ExecutorAllocationClient
URL: https://github.com/apache/spark/pull/35586





[GitHub] [spark] dongjoon-hyun commented on pull request #36860: [SPARK-39460][CORE][TESTS] Fix `CoarseGrainedSchedulerBackendSuite` to handle fast allocations

2022-06-13 Thread GitBox


dongjoon-hyun commented on PR #36860:
URL: https://github.com/apache/spark/pull/36860#issuecomment-1154505057

   Hi, @huaxingao . Could you review this test PR?





[GitHub] [spark] dtenedor commented on pull request #36861: [SPARK-38796][SQL] Update to_number and try_to_number functions to allow PR with positive numbers

2022-06-13 Thread GitBox


dtenedor commented on PR #36861:
URL: https://github.com/apache/spark/pull/36861#issuecomment-1154503075

   Hi @cloud-fan, can you take a look at this when you have time? It is a bug fix for the `to_number` and `try_to_number` functions.





[GitHub] [spark] dtenedor opened a new pull request, #36861: [SPARK-38796][SQL] Update to_number and try_to_number functions to allow PR with positive numbers

2022-06-13 Thread GitBox


dtenedor opened a new pull request, #36861:
URL: https://github.com/apache/spark/pull/36861

   ### What changes were proposed in this pull request?
   
   Update `to_number` and `try_to_number` functions to allow the `PR` format 
token with input strings comprising positive numbers.
   
   Before this bug fix, function calls like `to_number(' 123 ', '999PR')` would 
fail. Now they succeed, which is helpful since `PR` should allow both positive 
and negative numbers.
   
   This satisfies the following specification:
   
   ```
   to_number(expr, fmt)
   fmt
 { ' [ MI | S ] [ L | $ ]
 [ 0 | 9 | G | , ] [...]
 [ . | D ]
 [ 0 | 9 ] [...]
 [ L | $ ] [ PR | MI | S ] ' }
   ```
   
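   For example, in spark-shell (illustrative only; the result values assume this fix, with angle brackets marking negative input per the `PR` token):

   ```
   spark.sql("SELECT to_number(' 123 ', '999PR')").show() // 123, failed before this fix
   spark.sql("SELECT to_number('<123>', '999PR')").show() // -123, unchanged behavior
   ```
   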
   ### Why are the changes needed?
   
   After reviewing the specification, this behavior makes the most sense.
   
   ### Does this PR introduce _any_ user-facing change?
   
   Yes, a slight change in the behavior of the format string.
   
   ### How was this patch tested?
   
   Existing and updated unit test coverage.





[GitHub] [spark] dongjoon-hyun opened a new pull request, #36860: [SPARK-39460][CORE][TESTS] Fix CoarseGrainedSchedulerBackendSuite to handle fast allocations

2022-06-13 Thread GitBox


dongjoon-hyun opened a new pull request, #36860:
URL: https://github.com/apache/spark/pull/36860

   …
   
   
   
   ### What changes were proposed in this pull request?
   
   
   
   ### Why are the changes needed?
   
   
   
   ### Does this PR introduce _any_ user-facing change?
   
   
   
   ### How was this patch tested?
   
   





[GitHub] [spark] amaliujia commented on pull request #36641: [SPARK-39263][SQL] Make GetTable, TableExists and DatabaseExists be compatible with 3 layer namespace

2022-06-13 Thread GitBox


amaliujia commented on PR #36641:
URL: https://github.com/apache/spark/pull/36641#issuecomment-1154465576

   @cloud-fan comments are addressed, and there is one that we can discuss.





[GitHub] [spark] amaliujia commented on a diff in pull request #36641: [SPARK-39263][SQL] Make GetTable, TableExists and DatabaseExists be compatible with 3 layer namespace

2022-06-13 Thread GitBox


amaliujia commented on code in PR #36641:
URL: https://github.com/apache/spark/pull/36641#discussion_r896166516


##
sql/core/src/test/scala/org/apache/spark/sql/internal/CatalogSuite.scala:
##
@@ -681,4 +681,60 @@ class CatalogSuite extends SharedSparkSession with 
AnalysisTest with BeforeAndAf
 assert(spark.catalog.listTables("default").collect().map(_.name).toSet ==
   Set("my_table1", "my_table2", "my_temp_table"))
   }
+
+  test("three layer namespace compatibility - get table") {
+val catalogName = "testcat"
+val dbName = "my_db"
+val tableName = "my_table"
+val tableSchema = new StructType().add("i", "int")
+val description = "this is a test table"
+
+spark.catalog.createTable(
+  tableName = Array(catalogName, dbName, tableName).mkString("."),
+  source = classOf[FakeV2Provider].getName,
+  schema = tableSchema,
+  description = description,
+  options = Map.empty[String, String])
+
+val t = spark.catalog.getTable(Array(catalogName, dbName, 
tableName).mkString("."))
+val expectedTable =
+  new Table(
+tableName,
+catalogName,
+Array(dbName),
+description,
+CatalogTableType.MANAGED.name,
+false)
+assert(expectedTable.toString == t.toString)
+  }
+
+  test("three layer namespace compatibility - table exists") {
+val catalogName = "testcat"
+val dbName = "my_db"
+val tableName = "my_table"
+val tableSchema = new StructType().add("i", "int")
+
+assert(!spark.catalog.tableExists(Array(catalogName, dbName, 
tableName).mkString(".")))
+
+spark.catalog.createTable(
+  tableName = Array(catalogName, dbName, tableName).mkString("."),
+  source = classOf[FakeV2Provider].getName,
+  schema = tableSchema,
+  description = "",
+  options = Map.empty[String, String])
+
+assert(spark.catalog.tableExists(Array(catalogName, dbName, 
tableName).mkString(".")))
+  }
+
+  test("three layer namespace compatibility - database exists") {
+val catalogName = "testcat"
+val dbName = "my_db"
+assert(!spark.catalog.databaseExists(Array(catalogName, 
dbName).mkString(".")))
+
+val e = intercept[CatalogNotFoundException] {
+  val catalogName2 = "catalog_not_exists"
+  spark.catalog.databaseExists(Array(catalogName2, dbName).mkString("."))
+}
+assert(e.getMessage.contains("catalog_not_exists is not defined"))

Review Comment:
   done



##
sql/core/src/main/scala/org/apache/spark/sql/internal/CatalogImpl.scala:
##
@@ -250,8 +251,14 @@ class CatalogImpl(sparkSession: SparkSession) extends 
Catalog {
* table/view. This throws an `AnalysisException` when no `Table` can be 
found.
*/
   override def getTable(tableName: String): Table = {
-val tableIdent = 
sparkSession.sessionState.sqlParser.parseTableIdentifier(tableName)
-getTable(tableIdent.database.orNull, tableIdent.table)
+try {
+  val ident = 
sparkSession.sessionState.sqlParser.parseTableIdentifier(tableName)

Review Comment:
   done






[GitHub] [spark] amaliujia commented on a diff in pull request #36641: [SPARK-39263][SQL] Make GetTable, TableExists and DatabaseExists be compatible with 3 layer namespace

2022-06-13 Thread GitBox


amaliujia commented on code in PR #36641:
URL: https://github.com/apache/spark/pull/36641#discussion_r896163957


##
sql/core/src/main/scala/org/apache/spark/sql/internal/CatalogImpl.scala:
##
@@ -287,16 +294,44 @@ class CatalogImpl(sparkSession: SparkSession) extends 
Catalog {
* Checks if the database with the specified name exists.
*/
   override def databaseExists(dbName: String): Boolean = {
-sessionCatalog.databaseExists(dbName)
+// To maintain backwards compatibility, we first treat the input as a simple dbName and
+// check if sessionCatalog contains it. If not, we try to parse it as a 3-part name. If the
+// parsed identifier contains both a catalog name and a database name, we then search for
+// the database in that catalog.
+if (!sessionCatalog.databaseExists(dbName)) {
+  val ident = 
sparkSession.sessionState.sqlParser.parseMultipartIdentifier(dbName)
+  if (ident.length == 2) {

Review Comment:
   hmm, I don't know how to deal with nested namespaces without the help of the SQL analyzer. I am using `sparkSession.sessionState.catalogManager.catalog(catalog_name)` to check if a database exists.
   
   
   Maybe we should add a V2 command to handle nested namespaces? If I handle it by calling `catalogManager`, it doesn't seem extensible.

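   A usage sketch of the behavior under discussion (catalog and database names borrowed from the test above):

   ```
   // Single-part name: resolved against the session catalog, as before.
   spark.catalog.databaseExists("my_db")
   // Two-part name: parsed as "catalog.db" and checked against that catalog.
   spark.catalog.databaseExists("testcat.my_db")
   // Deeper nesting such as "testcat.ns1.ns2" is the open question here: it
   // would need a V2 command (or direct catalogManager support) to resolve.
   ```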





[GitHub] [spark] amaliujia commented on a diff in pull request #36641: [SPARK-39263][SQL] Make GetTable, TableExists and DatabaseExists be compatible with 3 layer namespace

2022-06-13 Thread GitBox


amaliujia commented on code in PR #36641:
URL: https://github.com/apache/spark/pull/36641#discussion_r896161200


##
sql/core/src/main/scala/org/apache/spark/sql/internal/CatalogImpl.scala:
##
@@ -250,8 +251,14 @@ class CatalogImpl(sparkSession: SparkSession) extends 
Catalog {
* table/view. This throws an `AnalysisException` when no `Table` can be 
found.
*/
   override def getTable(tableName: String): Table = {
-val tableIdent = 
sparkSession.sessionState.sqlParser.parseTableIdentifier(tableName)
-getTable(tableIdent.database.orNull, tableIdent.table)
+try {
+  val ident = 
sparkSession.sessionState.sqlParser.parseTableIdentifier(tableName)
+  getTable(ident.database.orNull, ident.table)
+} catch {
+  case e: org.apache.spark.sql.catalyst.parser.ParseException =>
+val ident = 
sparkSession.sessionState.sqlParser.parseMultipartIdentifier(tableName)

Review Comment:
   That arbitrary sequence of identifiers will be passed to the SQL analyzer to resolve the table. If the user gives, for example, `a.b.c.d.e.f`, the analyzer will simply report that the table is not found.






[GitHub] [spark] amaliujia commented on a diff in pull request #36641: [SPARK-39263][SQL] Make GetTable, TableExists and DatabaseExists be compatible with 3 layer namespace

2022-06-13 Thread GitBox


amaliujia commented on code in PR #36641:
URL: https://github.com/apache/spark/pull/36641#discussion_r896156559


##
sql/core/src/main/scala/org/apache/spark/sql/internal/CatalogImpl.scala:
##
@@ -250,8 +251,14 @@ class CatalogImpl(sparkSession: SparkSession) extends Catalog {
* table/view. This throws an `AnalysisException` when no `Table` can be found.
*/
   override def getTable(tableName: String): Table = {
-val tableIdent = sparkSession.sessionState.sqlParser.parseTableIdentifier(tableName)
-getTable(tableIdent.database.orNull, tableIdent.table)
+try {
+  val ident = sparkSession.sessionState.sqlParser.parseTableIdentifier(tableName)
+  getTable(ident.database.orNull, ident.table)
+} catch {
+  case e: org.apache.spark.sql.catalyst.parser.ParseException =>
+val ident = sparkSession.sessionState.sqlParser.parseMultipartIdentifier(tableName)

Review Comment:
   No. For an arbitrary sequence, it can break it on the dots into a Seq of strings.
   
   `sparkSession.sessionState.sqlParser.parseTableIdentifier` will throw an exception when the input is not `b` or `a.b`, i.e., anything beyond a table identifier, which in OSS only considers a database and a table name. The parser behavior is illustrated below.
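   
   For illustration, the behavior of the two parser entry points (assumed REPL results shown as comments; a sketch, not code from the PR):
   
   ```
   val parser = spark.sessionState.sqlParser
   
   parser.parseTableIdentifier("a.b")         // TableIdentifier("b", Some("a"))
   // parser.parseTableIdentifier("a.b.c")    // throws ParseException: too many name parts
   parser.parseMultipartIdentifier("a.b.c")   // Seq("a", "b", "c"), resolved later by the analyzer
   ```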






[GitHub] [spark] dongjoon-hyun closed pull request #36858: [SPARK-39458][CORE][TESTS] Fix `UISuite` for IPv6

2022-06-13 Thread GitBox


dongjoon-hyun closed pull request #36858: [SPARK-39458][CORE][TESTS] Fix 
`UISuite` for IPv6
URL: https://github.com/apache/spark/pull/36858





[GitHub] [spark] dongjoon-hyun commented on pull request #36858: [SPARK-39458][CORE][TESTS] Fix `UISuite` for IPv6

2022-06-13 Thread GitBox


dongjoon-hyun commented on PR #36858:
URL: https://github.com/apache/spark/pull/36858#issuecomment-1154394229

   Merged to master.





[GitHub] [spark] EnricoMi commented on a diff in pull request #36150: [SPARK-38864][SQL] Add melt / unpivot to Dataset

2022-06-13 Thread GitBox


EnricoMi commented on code in PR #36150:
URL: https://github.com/apache/spark/pull/36150#discussion_r893842370


##
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala:
##
@@ -1382,6 +1417,12 @@ class Analyzer(override val catalogManager: CatalogManager)
   case g: Generate if containsStar(g.generator.children) =>
 throw QueryCompilationErrors.invalidStarUsageError("explode/json_tuple/UDTF",
   extractStar(g.generator.children))
+  // If the Melt ids contain Stars, expand them.

Review Comment:
   Should this be merged into a single `case`?



##
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala:
##
@@ -524,6 +525,10 @@ class Analyzer(override val catalogManager: CatalogManager)
 if child.resolved && groupByOpt.isDefined && hasUnresolvedAlias(groupByOpt.get) =>
 Pivot(Some(assignAliases(groupByOpt.get)), pivotColumn, pivotValues, aggregates, child)
 
+  case m: Melt if m.child.resolved &&
+(hasUnresolvedAlias(m.ids) || hasUnresolvedAlias(m.values)) =>
+m.copy(ids = assignAliases(m.ids), values = assignAliases(m.values))

Review Comment:
   Any particular reason why the other `case`s do not copy here?



##
sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala:
##
@@ -2012,7 +2012,97 @@ class Dataset[T] private[sql](
   @scala.annotation.varargs
   def agg(expr: Column, exprs: Column*): DataFrame = groupBy().agg(expr, exprs : _*)
 
- /**
+  /**
+   * Unpivot a DataFrame from wide format to long format, optionally
+   * leaving identifier columns set.
+   *
+   * This function is useful to massage a DataFrame into a format where some
+   * columns are identifier columns ("ids"), while all other columns ("values")
+   * are "unpivoted" to the rows, leaving just two non-id columns, named as 
given
+   * by `variableColumnName` and `valueColumnName`.
+   *
+   * {{{
+   *   val df = Seq((1, 11, 12L), (2, 21, 22L)).toDF("id", "int", "long")
+   *   df.show()
+   *   // output:
+   *   // +---+---++
+   *   // | id|int|long|
+   *   // +---+---++
+   *   // |  1| 11|  12|
+   *   // |  2| 21|  22|
+   *   // +---+---++
+   *
+   *   df.melt(Array($"id"), Array($"int", $"long"), "variable", "value").show()
+   *   // output:
+   *   // +---++-+
+   *   // | id|variable|value|
+   *   // +---++-+
+   *   // |  1| int|   11|
+   *   // |  1|long|   12|
+   *   // |  2| int|   21|
+   *   // |  2|long|   22|
+   *   // +---++-+
+   *   // schema:
+   *   //root
+   *   // |-- id: integer (nullable = false)
+   *   // |-- variable: string (nullable = false)
+   *   // |-- value: long (nullable = true)
+   * }}}
+   *
+   * When no "id" columns are given, the unpivoted DataFrame consists of only 
the
+   * "variable" and "value" columns.
+   *
+   * All "value" columns must be of compatible data type. If they are not the 
same data type,
+   * all "value" columns are cast to the nearest common data type. For 
instance,
+   * types `IntegerType` and `LongType` are compatible and cast to `LongType`,
+   * while `IntegerType` and `StringType` are not compatible and `melt` fails.
+   *
+   * @param ids Id columns
+   * @param values Value columns to melt
+   * @param variableColumnName Name of the variable column
+   * @param valueColumnName Name of the value column
+   *
+   * @group untypedrel
+   * @since 3.4.0
+   */
+  def melt(

Review Comment:
   Should there also be an `unpivot` alias?
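   
   If such an alias were added, a minimal sketch could simply delegate to `melt` (hypothetical; the PR may pick a different signature):
   
   ```
   /** Alias for `melt`, matching the "unpivot" name used by SQL engines and pandas. */
   def unpivot(
       ids: Array[Column],
       values: Array[Column],
       variableColumnName: String,
       valueColumnName: String): DataFrame =
     melt(ids, values, variableColumnName, valueColumnName)
   ```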



##
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/TypeCoercion.scala:
##
@@ -736,6 +736,21 @@ abstract class TypeCoercionBase {
 }
   }
 
+  /**
+   * Determines the value type of a [[Melt]].
+   */
+  object MeltCoercion extends Rule[LogicalPlan] {
+override def apply(plan: LogicalPlan): LogicalPlan =
+  plan resolveOperators {
+case m: Melt if m.values.nonEmpty && m.values.forall(_.resolved) && m.valueType.isEmpty =>
+  val valueDataType = findWiderTypeWithoutStringPromotion(m.values.map(_.dataType))

Review Comment:
   What if users prefer the wider `findWiderCommonType` behavior? I personally don't, and it would require wiring an extra argument into `melt`. The difference between the two helpers is illustrated below.
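   
   For context, a sketch of how the two coercion helpers differ on mixed value types (expected results shown as comments, assuming Spark's usual `TypeCoercion` rules):
   
   ```
   import org.apache.spark.sql.types.{IntegerType, LongType, StringType}
   
   // Without string promotion: numeric widening works, but mixing with strings yields
   // None, so Melt fails analysis instead of silently casting everything to string.
   // findWiderTypeWithoutStringPromotion(Seq(IntegerType, LongType))   == Some(LongType)
   // findWiderTypeWithoutStringPromotion(Seq(IntegerType, StringType)) == None
   //
   // The wider variant promotes to string instead:
   // findWiderCommonType(Seq(IntegerType, StringType))                 == Some(StringType)
   ```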






[GitHub] [spark] polkadot21 opened a new pull request, #36859: DTW: new distance measure for clustering

2022-06-13 Thread GitBox


polkadot21 opened a new pull request, #36859:
URL: https://github.com/apache/spark/pull/36859

   
   
   ### What changes were proposed in this pull request?
   In this pull request, I propose a new distance measure for clustering, 
namely, dynamic time warping. 
   
   
   ### Why are the changes needed?
   This distance measure is a more accurate choice for time-series clustering.
   
   
   ### Does this PR introduce _any_ user-facing change?
   No
   
   
   ### How was this patch tested?
   Tests are also added.
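   
   For reference, a minimal DTW sketch (the classic O(n*m) dynamic program over two univariate series; illustrative only, not the PR's implementation):
   
   ```
   // Dynamic time warping distance with an absolute-difference local cost.
   def dtw(a: Array[Double], b: Array[Double]): Double = {
     val n = a.length
     val m = b.length
     // d(i)(j) = cost of the best alignment of a.take(i) with b.take(j).
     val d = Array.fill(n + 1, m + 1)(Double.PositiveInfinity)
     d(0)(0) = 0.0
     for (i <- 1 to n; j <- 1 to m) {
       val cost = math.abs(a(i - 1) - b(j - 1))
       d(i)(j) = cost + math.min(d(i - 1)(j), math.min(d(i)(j - 1), d(i - 1)(j - 1)))
     }
     d(n)(m)
   }
   ```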
   





[GitHub] [spark] dongjoon-hyun commented on pull request #36858: [SPARK-39458][CORE][TESTS] Fix `UISuite` for IPv6

2022-06-13 Thread GitBox


dongjoon-hyun commented on PR #36858:
URL: https://github.com/apache/spark/pull/36858#issuecomment-1154340620

   Thank you, @huaxingao !





[GitHub] [spark] huaxingao commented on pull request #36858: [SPARK-39458][CORE][TESTS] Fix `UISuite` for IPv6

2022-06-13 Thread GitBox


huaxingao commented on PR #36858:
URL: https://github.com/apache/spark/pull/36858#issuecomment-1154339963

   @dongjoon-hyun Thanks for pinging me. The change looks good to me.





[GitHub] [spark] dongjoon-hyun commented on pull request #36832: [SPARK-39439][CORE] Check final file if in-progress event log file does not exist

2022-06-13 Thread GitBox


dongjoon-hyun commented on PR #36832:
URL: https://github.com/apache/spark/pull/36832#issuecomment-1154339726

   Merged to master.





[GitHub] [spark] dongjoon-hyun closed pull request #36832: [SPARK-39439][CORE] Check final file if in-progress event log file does not exist

2022-06-13 Thread GitBox


dongjoon-hyun closed pull request #36832: [SPARK-39439][CORE] Check final file 
if in-progress event log file does not exist
URL: https://github.com/apache/spark/pull/36832





[GitHub] [spark] dongjoon-hyun commented on pull request #36858: [SPARK-39458][CORE][TESTS] Fix UISuite for IPv6

2022-06-13 Thread GitBox


dongjoon-hyun commented on PR #36858:
URL: https://github.com/apache/spark/pull/36858#issuecomment-1154275690

   Could you review this test PR, @huaxingao ?





[GitHub] [spark] dongjoon-hyun opened a new pull request, #36858: [SPARK-39458][CORE][TESTS] Fix UISuite for IPv6

2022-06-13 Thread GitBox


dongjoon-hyun opened a new pull request, #36858:
URL: https://github.com/apache/spark/pull/36858

   
   
   ### What changes were proposed in this pull request?
   
   
   
   ### Why are the changes needed?
   
   
   
   ### Does this PR introduce _any_ user-facing change?
   
   
   
   ### How was this patch tested?
   
   





[GitHub] [spark] cloud-fan closed pull request #36854: [SPARK-39437][SQL][TEST][3.3] Normalize plan id separately in PlanStabilitySuite

2022-06-13 Thread GitBox


cloud-fan closed pull request #36854: [SPARK-39437][SQL][TEST][3.3] Normalize 
plan id separately in PlanStabilitySuite
URL: https://github.com/apache/spark/pull/36854





[GitHub] [spark] dongjoon-hyun commented on pull request #36854: [SPARK-39437][SQL][TEST][3.3] Normalize plan id separately in PlanStabilitySuite

2022-06-13 Thread GitBox


dongjoon-hyun commented on PR #36854:
URL: https://github.com/apache/spark/pull/36854#issuecomment-1154193491

   Merged to branch-3.3.





[GitHub] [spark] MaxGekk opened a new pull request, #36857: [WIP][SQL] Support cast of ANSI intervals to decimals

2022-06-13 Thread GitBox


MaxGekk opened a new pull request, #36857:
URL: https://github.com/apache/spark/pull/36857

   
   
   ### What changes were proposed in this pull request?
   
   
   
   ### Why are the changes needed?
   
   
   
   ### Does this PR introduce _any_ user-facing change?
   
   
   
   ### How was this patch tested?
   By running new tests:
   ```
   $ build/sbt "sql/testOnly org.apache.spark.sql.SQLQueryTestSuite -- -z 
cast.sql"
   ```





[GitHub] [spark] abhishekd0907 commented on pull request #35683: [SPARK-30835][SPARK-39018][CORE][YARN] Add support for YARN decommissioning when ESS is disabled

2022-06-13 Thread GitBox


abhishekd0907 commented on PR #35683:
URL: https://github.com/apache/spark/pull/35683#issuecomment-1154149964

   > Can you update the description to reflect the changes made to the PR @abhishekd0907? Specifically, this is only related to decommissioning and we do not handle shuffle?
   > 
   > The changes look good to me, will merge once we update.
   
   @mridulm done.





[GitHub] [spark] maryannxue commented on pull request #36845: [SPARK-39447][SQL] Only non-broadcast query stage can propagate empty relation

2022-06-13 Thread GitBox


maryannxue commented on PR #36845:
URL: https://github.com/apache/spark/pull/36845#issuecomment-1154125105

   @wangyum I don't think this fix will solve all the problems. Empty relation propagation can produce invalid plans in other ways, e.g., a `BroadcastExchangeExec` ending up as the child of some plan node other than a BHJ or BNLJ. So you could approach this problem from a different angle: add an invalid-plan checker that detects those patterns and throws a specific exception, which `AdaptiveSparkPlanExec`'s replanning method can catch to discard the latest plan and keep the old valid plan in use. A sketch of this pattern follows below.
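   
   A hedged sketch of the suggested checker-and-fallback pattern (`InvalidAQEPlanException` and the replanning snippet use hypothetical names, not Spark's actual API):
   
   ```
   import org.apache.spark.sql.execution.SparkPlan
   import org.apache.spark.sql.execution.exchange.BroadcastExchangeExec
   import org.apache.spark.sql.execution.joins.{BroadcastHashJoinExec, BroadcastNestedLoopJoinExec}
   
   case class InvalidAQEPlanException(msg: String) extends Exception(msg)
   
   // Reject plans where a broadcast exchange feeds anything but a broadcast join.
   def checkPlan(plan: SparkPlan): Unit = plan.foreach {
     case _: BroadcastHashJoinExec | _: BroadcastNestedLoopJoinExec => // valid parents
     case p if p.children.exists(_.isInstanceOf[BroadcastExchangeExec]) =>
       throw InvalidAQEPlanException(s"BroadcastExchangeExec under ${p.nodeName}")
     case _ =>
   }
   
   // In the replanning step: only adopt the new plan if it passes validation.
   // try { checkPlan(newPlan); currentPlan = newPlan }
   // catch { case _: InvalidAQEPlanException => /* keep the previous valid plan */ }
   ```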




