[GitHub] [spark] viirya commented on a change in pull request #33989: [SPARK-36676][SQL][BUILD] Create shaded Hive module and upgrade Guava version to 30.1.1-jre

2021-09-23 Thread GitBox


viirya commented on a change in pull request #33989:
URL: https://github.com/apache/spark/pull/33989#discussion_r715335263



##
File path: assembly/pom.xml
##
@@ -165,6 +169,13 @@
 
   hive
   
+

Review comment:
   okay thanks for the update!




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] mridulm commented on pull request #34092: [WIP][SPARK-36827][CORE] Improve the perf and memory usage of cleaning up stage UI data

2021-09-23 Thread GitBox


mridulm commented on pull request #34092:
URL: https://github.com/apache/spark/pull/34092#issuecomment-926363391


   +CC @zhouyejoe, @thejdeep 





[GitHub] [spark] mridulm commented on a change in pull request #34092: [WIP][SPARK-36827][CORE] Improve the perf and memory usage of cleaning up stage UI data

2021-09-23 Thread GitBox


mridulm commented on a change in pull request #34092:
URL: https://github.com/apache/spark/pull/34092#discussion_r715332652



##
File path: core/src/main/scala/org/apache/spark/status/AppStatusListener.scala
##
@@ -1253,44 +1254,46 @@ private[spark] class AppStatusListener(
     toDelete.foreach { j => kvstore.delete(j.getClass(), j.info.jobId) }
   }
 
+  private case class StageCompletionTime(
+      stageId: Int,
+      attemptId: Int,
+      completionTime: Long)
+
   private def cleanupStages(count: Long): Unit = {
     val countToDelete = calculateNumberToRemove(count, conf.get(MAX_RETAINED_STAGES))
     if (countToDelete <= 0L) {
       return
     }
 
+    val stageArray = new ArrayBuffer[StageCompletionTime]()
+    val stageDataCount = new mutable.HashMap[Int, Int]()
+    kvstore.view(classOf[StageDataWrapper]).forEach { s =>
+      // Here we keep track of the total number of StageDataWrapper entries for each stage id.
+      // This will be used in cleaning up the RDDOperationGraphWrapper data.
+      if (stageDataCount.contains(s.info.stageId)) {
+        stageDataCount(s.info.stageId) += 1
+      } else {
+        stageDataCount(s.info.stageId) = 1
+      }
+      if (s.info.status != v1.StageStatus.ACTIVE && s.info.status != v1.StageStatus.PENDING) {
+        val candidate =
+          StageCompletionTime(s.info.stageId, s.info.attemptId, s.completionTime)
+        stageArray.append(candidate)
+      }
+    }
+
     // As the completion time of a skipped stage is always -1, we will remove skipped stages first.
     // This is safe since the job itself contains enough information to render skipped stages in the
     // UI.
-    val view = kvstore.view(classOf[StageDataWrapper]).index("completionTime")
-    val stages = KVUtils.viewToSeq(view, countToDelete.toInt) { s =>
-      s.info.status != v1.StageStatus.ACTIVE && s.info.status != v1.StageStatus.PENDING
-    }
-
-    val stageIds = stages.map { s =>

Review comment:
   Scratch that - did not see the attempt iteration - makes sense.







[GitHub] [spark] mridulm commented on a change in pull request #34092: [WIP][SPARK-36827][CORE] Improve the perf and memory usage of cleaning up stage UI data

2021-09-23 Thread GitBox


mridulm commented on a change in pull request #34092:
URL: https://github.com/apache/spark/pull/34092#discussion_r715332397



##
File path: core/src/main/scala/org/apache/spark/status/AppStatusListener.scala
##
@@ -1253,44 +1254,46 @@ private[spark] class AppStatusListener(
     toDelete.foreach { j => kvstore.delete(j.getClass(), j.info.jobId) }
   }
 
+  private case class StageCompletionTime(
+      stageId: Int,
+      attemptId: Int,
+      completionTime: Long)
+
   private def cleanupStages(count: Long): Unit = {
     val countToDelete = calculateNumberToRemove(count, conf.get(MAX_RETAINED_STAGES))
     if (countToDelete <= 0L) {
       return
     }
 
+    val stageArray = new ArrayBuffer[StageCompletionTime]()
+    val stageDataCount = new mutable.HashMap[Int, Int]()
+    kvstore.view(classOf[StageDataWrapper]).forEach { s =>
+      // Here we keep track of the total number of StageDataWrapper entries for each stage id.
+      // This will be used in cleaning up the RDDOperationGraphWrapper data.
+      if (stageDataCount.contains(s.info.stageId)) {
+        stageDataCount(s.info.stageId) += 1
+      } else {
+        stageDataCount(s.info.stageId) = 1
+      }
+      if (s.info.status != v1.StageStatus.ACTIVE && s.info.status != v1.StageStatus.PENDING) {
+        val candidate =
+          StageCompletionTime(s.info.stageId, s.info.attemptId, s.completionTime)
+        stageArray.append(candidate)
+      }
+    }
+
     // As the completion time of a skipped stage is always -1, we will remove skipped stages first.
     // This is safe since the job itself contains enough information to render skipped stages in the
     // UI.
-    val view = kvstore.view(classOf[StageDataWrapper]).index("completionTime")
-    val stages = KVUtils.viewToSeq(view, countToDelete.toInt) { s =>
-      s.info.status != v1.StageStatus.ACTIVE && s.info.status != v1.StageStatus.PENDING
-    }
-
-    val stageIds = stages.map { s =>

Review comment:
   I am trying to understand the last part - what is the difference from the
   new code for finding the stage id?







[GitHub] [spark] SparkQA commented on pull request #34009: [SPARK-34378][SQL][AVRO] Enhance AvroSerializer validation to allow extra nullable Avro fields

2021-09-23 Thread GitBox


SparkQA commented on pull request #34009:
URL: https://github.com/apache/spark/pull/34009#issuecomment-926360172


   **[Test build #143590 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/143590/testReport)** for PR 34009 at commit [`9b58975`](https://github.com/apache/spark/commit/9b58975f88eaad623febea4524b3e7a63dd99272).





[GitHub] [spark] mridulm commented on pull request #34009: [SPARK-34378][SQL][AVRO] Enhance AvroSerializer validation to allow extra nullable Avro fields

2021-09-23 Thread GitBox


mridulm commented on pull request #34009:
URL: https://github.com/apache/spark/pull/34009#issuecomment-926358471


   Add to whitelist





[GitHub] [spark] mridulm commented on pull request #34083: Add docs about using Shiv for packaging (similar to PEX)

2021-09-23 Thread GitBox


mridulm commented on pull request #34083:
URL: https://github.com/apache/spark/pull/34083#issuecomment-926358173


   Agree with @HyukjinKwon - it would be good to start a discussion on the Spark
   mailing lists as well.





[GitHub] [spark] attilapiros commented on a change in pull request #33936: [SPARK-36693][REPL] Implement spark-shell idle timeouts

2021-09-23 Thread GitBox


attilapiros commented on a change in pull request #33936:
URL: https://github.com/apache/spark/pull/33936#discussion_r715326783



##
File path: repl/src/main/scala-2.12/org/apache/spark/repl/SparkILoop.scala
##
@@ -105,6 +108,13 @@ class SparkILoop(in0: Option[BufferedReader], out: JPrintWriter)
     echo("Type :help for more information.")
   }
 
+  override def processLine(line: String): Boolean = {
+    inactivityTimeout.stopInactivityTimer()
+    val result = super.processLine(line)

Review comment:
   What happens when the line is evaluated and it throws an exception?
   Does the underlying `processLine` guarantee that exceptions are not
   propagated to the caller? If there is no such guarantee, then we
   should wrap this in a `try {..} finally {...}` and stop the timer in the
   `finally`.
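
   A minimal sketch of the `try`/`finally` shape suggested above, not the PR's actual code. `InactivityTimer`, `startInactivityTimer`, `isRunning`, and `ProcessLineSketch` are hypothetical names; only `stopInactivityTimer` appears in the diff. Which timer call belongs in the `finally` depends on the rest of the PR, which is not shown here; this sketch assumes the timer is re-armed once the line has been handled.

   ```scala
   // Hypothetical stand-in for the PR's inactivity timer. Only stopInactivityTimer
   // appears in the diff; startInactivityTimer and isRunning are assumed here.
   final class InactivityTimer {
     private var running = false
     def startInactivityTimer(): Unit = { running = true }
     def stopInactivityTimer(): Unit = { running = false }
     def isRunning: Boolean = running
   }

   object ProcessLineSketch {
     val timer = new InactivityTimer

     // `evaluate` stands in for super.processLine(line) and may throw.
     def processLine(line: String)(evaluate: String => Boolean): Boolean = {
       timer.stopInactivityTimer() // no idle timeout while a line is being evaluated
       try {
         evaluate(line)
       } finally {
         timer.startInactivityTimer() // runs on normal return and when evaluate throws
       }
     }
   }
   ```

   With this shape the timer state after `processLine` is the same whether evaluation returns or throws, so the caller does not need to know whether the underlying method swallows exceptions.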







[GitHub] [spark] mridulm commented on a change in pull request #34079: [SPARK-36834][SHUFFLE] Add support for namespacing log lines emitted by external shuffle service

2021-09-23 Thread GitBox


mridulm commented on a change in pull request #34079:
URL: https://github.com/apache/spark/pull/34079#discussion_r715325832



##
File path: common/network-yarn/src/main/java/org/apache/spark/network/yarn/YarnShuffleService.java
##
@@ -284,7 +293,6 @@ static MergedShuffleFileManager newMergedShuffleFileManagerInstance(TransportCon
       // will also need the transport configuration.
       return mergeManagerSubClazz.getConstructor(TransportConf.class).newInstance(conf);
     } catch (Exception e) {
-      logger.error("Unable to create an instance of {}", mergeManagerImplClassName);

Review comment:
   I agree with @tgravescs: dropping the log message (particularly an
   error) would lose very useful debugging information.
   One option would be to move it out of this class into some other util?
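
   One possible reading of that option, as a rough sketch only: keep the error log, but place the reflective instantiation and its logging in a small helper. `InstantiationUtil`, `newInstanceOrElse`, the `fallback` parameter, and the use of `java.util.logging` are all assumptions made for illustration; none of them come from the PR, and the shuffle service has its own logger.

   ```scala
   import java.util.logging.{Level, Logger}

   // Hypothetical helper: instantiates a class by name through a one-argument
   // constructor and logs an error (with the cause) before falling back.
   object InstantiationUtil {
     private val logger = Logger.getLogger("InstantiationUtil")

     def newInstanceOrElse[T](className: String, arg: AnyRef, argClass: Class[_])(
         fallback: => T): T = {
       try {
         Class.forName(className).getConstructor(argClass).newInstance(arg).asInstanceOf[T]
       } catch {
         case e: Exception =>
           // The message mirrors the removed log line; attaching `e` keeps the cause.
           logger.log(Level.SEVERE, s"Unable to create an instance of $className", e)
           fallback
       }
     }
   }
   ```

   A caller would pass the configured class name and a default instance, so a misconfigured class name still produces an error in the logs instead of failing silently.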
   







[GitHub] [spark] SparkQA commented on pull request #34069: [SPARK-36823][SQL] Support broadcast nested loop join hint for equi-join

2021-09-23 Thread GitBox


SparkQA commented on pull request #34069:
URL: https://github.com/apache/spark/pull/34069#issuecomment-926356585


   Kubernetes integration test starting
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/48096/
   





[GitHub] [spark] allisonwang-db commented on pull request #34081: [SPARK-36747][SQL][3.2] Do not collapse Project with Aggregate when correlated subqueries are present in the project list

2021-09-23 Thread GitBox


allisonwang-db commented on pull request #34081:
URL: https://github.com/apache/spark/pull/34081#issuecomment-926356169


   cc @cloud-fan 





[GitHub] [spark] SparkQA commented on pull request #34077: [SPARK-36829][SQL] Refactor NULL check for collectionOperators

2021-09-23 Thread GitBox


SparkQA commented on pull request #34077:
URL: https://github.com/apache/spark/pull/34077#issuecomment-926354788


   Kubernetes integration test starting
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/48097/
   





[GitHub] [spark] SparkQA commented on pull request #34069: [SPARK-36823][SQL] Support broadcast nested loop join hint for equi-join

2021-09-23 Thread GitBox


SparkQA commented on pull request #34069:
URL: https://github.com/apache/spark/pull/34069#issuecomment-926354328


   Kubernetes integration test starting
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/48095/
   





[GitHub] [spark] AmplabJenkins removed a comment on pull request #34053: [SPARK-36813][SQL][PYTHON] Propose an infrastructure of as-of join and implement ps.merge_asof

2021-09-23 Thread GitBox


AmplabJenkins removed a comment on pull request #34053:
URL: https://github.com/apache/spark/pull/34053#issuecomment-926353009


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/143575/
   





[GitHub] [spark] AmplabJenkins commented on pull request #34053: [SPARK-36813][SQL][PYTHON] Propose an infrastructure of as-of join and implement ps.merge_asof

2021-09-23 Thread GitBox


AmplabJenkins commented on pull request #34053:
URL: https://github.com/apache/spark/pull/34053#issuecomment-926353009


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/143575/
   





[GitHub] [spark] SparkQA removed a comment on pull request #34053: [SPARK-36813][SQL][PYTHON] Propose an infrastructure of as-of join and implement ps.merge_asof

2021-09-23 Thread GitBox


SparkQA removed a comment on pull request #34053:
URL: https://github.com/apache/spark/pull/34053#issuecomment-926252459


   **[Test build #143575 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/143575/testReport)** for PR 34053 at commit [`cd0f707`](https://github.com/apache/spark/commit/cd0f7070b4a504d2aba57d7e4b71fcc225731603).





[GitHub] [spark] gengliangwang commented on a change in pull request #34092: [SPARK-36827][CORE] Improve the perf and memory usage of cleaning up stage UI data

2021-09-23 Thread GitBox


gengliangwang commented on a change in pull request #34092:
URL: https://github.com/apache/spark/pull/34092#discussion_r715321250



##
File path: core/src/main/scala/org/apache/spark/status/AppStatusListener.scala
##
@@ -1253,44 +1254,46 @@ private[spark] class AppStatusListener(
     toDelete.foreach { j => kvstore.delete(j.getClass(), j.info.jobId) }
   }
 
+  private case class StageCompletionTime(
+      stageId: Int,
+      attemptId: Int,
+      completionTime: Long)
+
   private def cleanupStages(count: Long): Unit = {
     val countToDelete = calculateNumberToRemove(count, conf.get(MAX_RETAINED_STAGES))
     if (countToDelete <= 0L) {
       return
     }
 
+    val stageArray = new ArrayBuffer[StageCompletionTime]()
+    val stageDataCount = new mutable.HashMap[Int, Int]()
+    kvstore.view(classOf[StageDataWrapper]).forEach { s =>
+      // Here we keep track of the total number of StageDataWrapper entries for each stage id.
+      // This will be used in cleaning up the RDDOperationGraphWrapper data.
+      if (stageDataCount.contains(s.info.stageId)) {
+        stageDataCount(s.info.stageId) += 1
+      } else {
+        stageDataCount(s.info.stageId) = 1
+      }
+      if (s.info.status != v1.StageStatus.ACTIVE && s.info.status != v1.StageStatus.PENDING) {
+        val candidate =
+          StageCompletionTime(s.info.stageId, s.info.attemptId, s.completionTime)
+        stageArray.append(candidate)
+      }
+    }
+
     // As the completion time of a skipped stage is always -1, we will remove skipped stages first.
     // This is safe since the job itself contains enough information to render skipped stages in the
     // UI.
-    val view = kvstore.view(classOf[StageDataWrapper]).index("completionTime")
-    val stages = KVUtils.viewToSeq(view, countToDelete.toInt) { s =>
-      s.info.status != v1.StageStatus.ACTIVE && s.info.status != v1.StageStatus.PENDING
-    }
-
-    val stageIds = stages.map { s =>

Review comment:
   I thought about keeping the original code for LevelDB here. But after
   investigation, I found that:
   The default retained stages size is 1000, so as per
   ```
     private def calculateNumberToRemove(dataSize: Long, retainedSize: Long): Long = {
       if (dataSize > retainedSize) {
         math.max(retainedSize / 10L, dataSize - retainedSize)
       } else {
         0L
       }
     }
   ```
   The `stages` here normally has a length of 100. Finding the stage id inside
   LevelDB 100 times is not efficient compared to the new code.
   So I decided to keep it simple and use the same code for both InMemoryStore
   and LevelDB.
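
   To make the arithmetic above concrete, here is a small self-contained sketch. The function body is copied from the helper quoted in this comment; the wrapper object `RetentionSketch` and the sample sizes are illustrative assumptions, not part of the PR.

   ```scala
   object RetentionSketch {
     // Same arithmetic as the helper quoted above: once the limit is exceeded,
     // remove at least 10% of the retained size, or the whole excess if larger.
     def calculateNumberToRemove(dataSize: Long, retainedSize: Long): Long = {
       if (dataSize > retainedSize) {
         math.max(retainedSize / 10L, dataSize - retainedSize)
       } else {
         0L
       }
     }

     def main(args: Array[String]): Unit = {
       // One stage over a limit of 1000: max(1000 / 10, 1) = 100 stages to remove.
       println(calculateNumberToRemove(1001L, 1000L)) // 100
       // Far over the limit: the excess dominates, max(100, 500) = 500.
       println(calculateNumberToRemove(1500L, 1000L)) // 500
       // At or under the limit nothing is removed.
       println(calculateNumberToRemove(1000L, 1000L)) // 0
     }
   }
   ```

   With the default limit of 1000, cleanup that triggers as soon as the limit is crossed removes a batch of 100, which is where the "length of 100" figure comes from.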







[GitHub] [spark] cloud-fan closed pull request #34080: [SPARK-33832][SQL] Force skew join code simplification and improvement

2021-09-23 Thread GitBox


cloud-fan closed pull request #34080:
URL: https://github.com/apache/spark/pull/34080


   





[GitHub] [spark] SparkQA commented on pull request #34053: [SPARK-36813][SQL][PYTHON] Propose an infrastructure of as-of join and implement ps.merge_asof

2021-09-23 Thread GitBox


SparkQA commented on pull request #34053:
URL: https://github.com/apache/spark/pull/34053#issuecomment-926352150


   **[Test build #143575 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/143575/testReport)** for PR 34053 at commit [`cd0f707`](https://github.com/apache/spark/commit/cd0f7070b4a504d2aba57d7e4b71fcc225731603).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.





[GitHub] [spark] cloud-fan commented on pull request #34080: [SPARK-33832][SQL] Force skew join code simplification and improvement

2021-09-23 Thread GitBox


cloud-fan commented on pull request #34080:
URL: https://github.com/apache/spark/pull/34080#issuecomment-926352184


   thanks for review, merging to master!





[GitHub] [spark] AmplabJenkins removed a comment on pull request #34038: [SPARK-36797][SQL] Union should resolve nested columns as top-level columns

2021-09-23 Thread GitBox


AmplabJenkins removed a comment on pull request #34038:
URL: https://github.com/apache/spark/pull/34038#issuecomment-926349536


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/143580/
   





[GitHub] [spark] AmplabJenkins removed a comment on pull request #34051: [SPARK-36809][SQL] Remove broadcast for InSubqueryExec used in DPP

2021-09-23 Thread GitBox


AmplabJenkins removed a comment on pull request #34051:
URL: https://github.com/apache/spark/pull/34051#issuecomment-926349534


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/48092/
   





[GitHub] [spark] AmplabJenkins removed a comment on pull request #34088: Revert "[SPARK-35672][CORE][YARN] Pass user classpath entries to exec…

2021-09-23 Thread GitBox


AmplabJenkins removed a comment on pull request #34088:
URL: https://github.com/apache/spark/pull/34088#issuecomment-926349538


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/143579/
   





[GitHub] [spark] AmplabJenkins removed a comment on pull request #34033: [SPARK-36792][SQL] InSet should handle NaN

2021-09-23 Thread GitBox


AmplabJenkins removed a comment on pull request #34033:
URL: https://github.com/apache/spark/pull/34033#issuecomment-926349537


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/48093/
   





[GitHub] [spark] SparkQA commented on pull request #34091: [SPARK-36839][INFRA] Add daily build with Hadoop 2 profile in GitHub Actions build

2021-09-23 Thread GitBox


SparkQA commented on pull request #34091:
URL: https://github.com/apache/spark/pull/34091#issuecomment-926350029


   Kubernetes integration test starting
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/48094/
   





[GitHub] [spark] SparkQA commented on pull request #34038: [SPARK-36797][SQL] Union should resolve nested columns as top-level columns

2021-09-23 Thread GitBox


SparkQA commented on pull request #34038:
URL: https://github.com/apache/spark/pull/34038#issuecomment-926349944


   **[Test build #143589 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/143589/testReport)** for PR 34038 at commit [`be31929`](https://github.com/apache/spark/commit/be31929417fb240c098eb12ff79cb3a8e364e973).





[GitHub] [spark] taroplus commented on pull request #34092: [SPARK-36827][CORE] Improve the perf and memory usage of cleaning up stage UI data

2021-09-23 Thread GitBox


taroplus commented on pull request #34092:
URL: https://github.com/apache/spark/pull/34092#issuecomment-926349872


   @gengliangwang sounds good, thanks !





[GitHub] [spark] SparkQA commented on pull request #34089: [SPARK-36837][BUILD] Upgrade Kafka to 3.0.0

2021-09-23 Thread GitBox


SparkQA commented on pull request #34089:
URL: https://github.com/apache/spark/pull/34089#issuecomment-926349874


   **[Test build #143588 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/143588/testReport)** for PR 34089 at commit [`7a581fc`](https://github.com/apache/spark/commit/7a581fc89fa38f921def1de4924bdae9df9d647e).





[GitHub] [spark] SparkQA commented on pull request #34092: [SPARK-36827][CORE] Improve the perf and memory usage of cleaning up stage UI data

2021-09-23 Thread GitBox


SparkQA commented on pull request #34092:
URL: https://github.com/apache/spark/pull/34092#issuecomment-926349734


   **[Test build #143587 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/143587/testReport)** for PR 34092 at commit [`b270d2b`](https://github.com/apache/spark/commit/b270d2b6334b5373265735f3a86faf51df015ccc).





[GitHub] [spark] AmplabJenkins commented on pull request #34088: Revert "[SPARK-35672][CORE][YARN] Pass user classpath entries to exec…

2021-09-23 Thread GitBox


AmplabJenkins commented on pull request #34088:
URL: https://github.com/apache/spark/pull/34088#issuecomment-926349538


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/143579/
   





[GitHub] [spark] AmplabJenkins commented on pull request #34051: [SPARK-36809][SQL] Remove broadcast for InSubqueryExec used in DPP

2021-09-23 Thread GitBox


AmplabJenkins commented on pull request #34051:
URL: https://github.com/apache/spark/pull/34051#issuecomment-926349534


   
   Refer to this link for build results (access rights to CI server needed): 
   
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/48092/
   





[GitHub] [spark] AmplabJenkins commented on pull request #34038: [SPARK-36797][SQL] Union should resolve nested columns as top-level columns

2021-09-23 Thread GitBox


AmplabJenkins commented on pull request #34038:
URL: https://github.com/apache/spark/pull/34038#issuecomment-926349536


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/143580/
   





[GitHub] [spark] AmplabJenkins commented on pull request #34033: [SPARK-36792][SQL] InSet should handle NaN

2021-09-23 Thread GitBox


AmplabJenkins commented on pull request #34033:
URL: https://github.com/apache/spark/pull/34033#issuecomment-926349537


   
   Refer to this link for build results (access rights to CI server needed): 
   
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/48093/
   





[GitHub] [spark] taroplus closed pull request #34090: [SPARK-36827][CORE] Fix perf issue in AppStatusListener.cleanupStages

2021-09-23 Thread GitBox


taroplus closed pull request #34090:
URL: https://github.com/apache/spark/pull/34090


   





[GitHub] [spark] taroplus commented on pull request #34090: [SPARK-36827][CORE] Fix perf issue in AppStatusListener.cleanupStages

2021-09-23 Thread GitBox


taroplus commented on pull request #34090:
URL: https://github.com/apache/spark/pull/34090#issuecomment-926348877


   in favor of 
   https://github.com/apache/spark/pull/34092





[GitHub] [spark] gengliangwang commented on pull request #34090: [SPARK-36827][CORE] Fix perf issue in AppStatusListener.cleanupStages

2021-09-23 Thread GitBox


gengliangwang commented on pull request #34090:
URL: https://github.com/apache/spark/pull/34090#issuecomment-926347932


   @taroplus I was working on this as well yesterday.  
https://github.com/apache/spark/pull/34092
   If we have to pull all the stage data out from KVStore, we should avoid 
calling `KVUtils.viewToSeq(view, countToDelete.toInt)` which will copy the 
stage data and perform sorting.





[GitHub] [spark] gengliangwang commented on pull request #34092: [SPARK-36827][CORE] Improve the perf and memory usage of cleaning up stage UI data

2021-09-23 Thread GitBox


gengliangwang commented on pull request #34092:
URL: https://github.com/apache/spark/pull/34092#issuecomment-926347002


   @taroplus I was working on this yesterday. I didn't send it out because I 
think we can do better if we build a live priority queue in `AppStatusListener` 
so that Spark doesn't need to pull all the stage data out on every cleanup.
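
The live priority queue idea could look roughly like the sketch below. This is only an illustration under assumptions: `CompletedStage` and `LiveStageQueue` are hypothetical names, not classes in Spark, and the sketch assumes that the stages to evict are the ones with the oldest completion time.

```scala
import scala.collection.mutable

// Hypothetical key type for illustration; not an actual Spark class.
final case class CompletedStage(stageId: Int, attemptId: Int, completionTime: Long)

// Keeps completed stages ordered by completion time so that the oldest ones
// can be evicted without pulling every stage out of a store on each cleanup.
final class LiveStageQueue(maxRetained: Int) {
  // mutable.PriorityQueue is a max-heap, so the ordering is reversed
  // to dequeue the oldest completion time first.
  private val queue = mutable.PriorityQueue.empty[CompletedStage](
    Ordering.by[CompletedStage, Long](_.completionTime).reverse)

  // Records a completed stage and returns the stages that should now be evicted.
  def add(stage: CompletedStage): Seq[CompletedStage] = {
    queue.enqueue(stage)
    val evicted = mutable.ArrayBuffer.empty[CompletedStage]
    while (queue.size > maxRetained) evicted += queue.dequeue()
    evicted.toSeq
  }

  def size: Int = queue.size
}
```

Each `add` costs one heap insertion plus one removal per evicted stage, so the cost of a cleanup depends on the number of evictions rather than on the total number of retained stages.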





[GitHub] [spark] gengliangwang opened a new pull request #34092: [SPARK-36827][CORE] Improve the perf and memory usage of cleaning up stage UI data

2021-09-23 Thread GitBox


gengliangwang opened a new pull request #34092:
URL: https://github.com/apache/spark/pull/34092


   
   
   ### What changes were proposed in this pull request?
   
   Improve the perf and memory usage of cleaning up stage UI data. The new code 
copies the essential fields (stage id, attempt id, completion time) to an 
array and determines which stage data and `RDDOperationGraphWrapper` need to be 
cleaned based on it.
   
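
As a rough illustration of the approach described above (not the code in the PR), the selection step can be sketched as follows. `StageKey` and `StageCleanupSketch` are hypothetical names, and the sketch assumes that the stages with the oldest completion time are the ones to drop once the retention limit is exceeded.

```scala
// Hypothetical stand-in for the essential per-stage fields; not a Spark class.
final case class StageKey(stageId: Int, attemptId: Int, completionTime: Long)

object StageCleanupSketch {
  // Returns the keys of the oldest completed stages beyond the retention limit.
  // Only the small keys are sorted; the full stage data is never copied.
  def stagesToDelete(keys: Array[StageKey], maxRetained: Int): Seq[StageKey] = {
    val countToDelete = keys.length - maxRetained
    if (countToDelete <= 0) Seq.empty
    else keys.sortBy(_.completionTime).take(countToDelete).toSeq
  }
}
```

The returned keys would then be used to look up and delete the matching stage data and operation graphs, which is the part that depends on the actual store API and is left out here.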
   ### Why are the changes needed?
   
   Fix the memory usage issue described in 
https://issues.apache.org/jira/browse/SPARK-36827
   
   ### Does this PR introduce _any_ user-facing change?
   
   No
   
   ### How was this patch tested?
   
   Add new unit test for the InMemoryStore.
   Also, run a simple benchmark with 
   ```
   // Retain at most 1000 stages so that submitting 3000 stages triggers cleanup.
   val testConf = conf.clone()
     .set(MAX_RETAINED_STAGES, 1000)
   
   val listener = new AppStatusListener(store, testConf, true)
   val stages = (1 to 3000).map { i =>
     new StageInfo(i, 0, s"stage$i", 4, Nil, Nil, "details1",
       resourceProfileId = ResourceProfile.DEFAULT_RESOURCE_PROFILE_ID)
   }
   listener.onJobStart(SparkListenerJobStart(4, time, Nil, null))
   // Submit and complete each stage in turn.
   stages.foreach { s =>
     time += 1
     s.submissionTime = Some(time)
     listener.onStageSubmitted(SparkListenerStageSubmitted(s, new Properties()))
     s.completionTime = Some(time)
     listener.onStageCompleted(SparkListenerStageCompleted(s))
   }
   ```
   
   Before changes:
   InMemoryStore: 2.8s
   LevelDB: 68.9s
   
   After changes:
   InMemoryStore: 0.95s
   LevelDB: 60.3s





[GitHub] [spark] SparkQA commented on pull request #34051: [SPARK-36809][SQL] Remove broadcast for InSubqueryExec used in DPP

2021-09-23 Thread GitBox


SparkQA commented on pull request #34051:
URL: https://github.com/apache/spark/pull/34051#issuecomment-926345957


   Kubernetes integration test status failure
   URL: 
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/48092/
   





[GitHub] [spark] SparkQA commented on pull request #34033: [SPARK-36792][SQL] InSet should handle NaN

2021-09-23 Thread GitBox


SparkQA commented on pull request #34033:
URL: https://github.com/apache/spark/pull/34033#issuecomment-926345014


   Kubernetes integration test status failure
   URL: 
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/48093/
   





[GitHub] [spark] SparkQA removed a comment on pull request #34088: Revert "[SPARK-35672][CORE][YARN] Pass user classpath entries to exec…

2021-09-23 Thread GitBox


SparkQA removed a comment on pull request #34088:
URL: https://github.com/apache/spark/pull/34088#issuecomment-926289733


   **[Test build #143579 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/143579/testReport)**
 for PR 34088 at commit 
[`7816574`](https://github.com/apache/spark/commit/781657486e1952b9446201ae713d6fb8288bb9f8).





[GitHub] [spark] SparkQA commented on pull request #34088: Revert "[SPARK-35672][CORE][YARN] Pass user classpath entries to exec…

2021-09-23 Thread GitBox


SparkQA commented on pull request #34088:
URL: https://github.com/apache/spark/pull/34088#issuecomment-926339514


   **[Test build #143579 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/143579/testReport)**
 for PR 34088 at commit 
[`7816574`](https://github.com/apache/spark/commit/781657486e1952b9446201ae713d6fb8288bb9f8).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.





[GitHub] [spark] SparkQA removed a comment on pull request #34038: [SPARK-36797][SQL] Union should resolve nested columns as top-level columns

2021-09-23 Thread GitBox


SparkQA removed a comment on pull request #34038:
URL: https://github.com/apache/spark/pull/34038#issuecomment-926289915


   **[Test build #143580 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/143580/testReport)**
 for PR 34038 at commit 
[`a0af93c`](https://github.com/apache/spark/commit/a0af93cb5b8e319ef8dd87d1131fb91979e51e30).





[GitHub] [spark] SparkQA commented on pull request #34038: [SPARK-36797][SQL] Union should resolve nested columns as top-level columns

2021-09-23 Thread GitBox


SparkQA commented on pull request #34038:
URL: https://github.com/apache/spark/pull/34038#issuecomment-926336277


   **[Test build #143580 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/143580/testReport)**
 for PR 34038 at commit 
[`a0af93c`](https://github.com/apache/spark/commit/a0af93cb5b8e319ef8dd87d1131fb91979e51e30).
* This patch **fails Spark unit tests**.
* This patch merges cleanly.
* This patch adds no public classes.





[GitHub] [spark] AmplabJenkins removed a comment on pull request #34089: [SPARK-36837][BUILD] Upgrade Kafka to 3.0.0

2021-09-23 Thread GitBox


AmplabJenkins removed a comment on pull request #34089:
URL: https://github.com/apache/spark/pull/34089#issuecomment-926317237


   
   Refer to this link for build results (access rights to CI server needed): 
   
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/48091/
   





[GitHub] [spark] SparkQA commented on pull request #34077: [SPARK-36829][SQL] Refactor NULL check for collectionOperators

2021-09-23 Thread GitBox


SparkQA commented on pull request #34077:
URL: https://github.com/apache/spark/pull/34077#issuecomment-926334207


   **[Test build #143586 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/143586/testReport)**
 for PR 34077 at commit 
[`c571e5f`](https://github.com/apache/spark/commit/c571e5fe550d6ab22b114f1edd82caddc4244729).





[GitHub] [spark] AngersZhuuuu commented on a change in pull request #34077: [SPARK-36829][SQL] Refactor NULL check for collectionOperators

2021-09-23 Thread GitBox


AngersZhuuuu commented on a change in pull request #34077:
URL: https://github.com/apache/spark/pull/34077#discussion_r715309252



##
File path: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala
##
@@ -3532,22 +3509,29 @@ case class ArrayDistinct(child: Expression)
  |}
""".stripMargin
 
-val processArray = withArrayNullAssignment(
-  s"$jt $value = ${genGetValue(array, i)};" +
-SQLOpenHashSet.withNaNCheckCode(elementType, value, hashSet, body,
-  (valueNaN: String) =>
-s"""
-   |$size++;
-   |$builder.$$plus$$eq($valueNaN);
-   |""".stripMargin))
+val processArray = SQLOpenHashSet.withNullCheckCode(dataType, dataType, hashSet,

Review comment:
   Done







[GitHub] [spark] SparkQA commented on pull request #34069: [SPARK-36823][SQL] Support broadcast nested loop join hint for equi-join

2021-09-23 Thread GitBox


SparkQA commented on pull request #34069:
URL: https://github.com/apache/spark/pull/34069#issuecomment-926332383


   **[Test build #143585 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/143585/testReport)**
 for PR 34069 at commit 
[`4ed226f`](https://github.com/apache/spark/commit/4ed226f66643523f3661b53a28517383cf1f0eb5).





[GitHub] [spark] AmplabJenkins removed a comment on pull request #34053: [SPARK-36813][SQL][PYTHON] Propose an infrastructure of as-of join and imlement ps.merge_asof

2021-09-23 Thread GitBox


AmplabJenkins removed a comment on pull request #34053:
URL: https://github.com/apache/spark/pull/34053#issuecomment-926325603


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/143571/
   





[GitHub] [spark] AmplabJenkins removed a comment on pull request #34038: [SPARK-36797][SQL] Union should resolve nested columns as top-level columns

2021-09-23 Thread GitBox


AmplabJenkins removed a comment on pull request #34038:
URL: https://github.com/apache/spark/pull/34038#issuecomment-926325590


   
   Refer to this link for build results (access rights to CI server needed): 
   
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/48089/
   





[GitHub] [spark] AmplabJenkins removed a comment on pull request #34009: [SPARK-34378][SQL][AVRO] Enhance AvroSerializer validation to allow extra nullable Avro fields

2021-09-23 Thread GitBox


AmplabJenkins removed a comment on pull request #34009:
URL: https://github.com/apache/spark/pull/34009#issuecomment-926325588


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/143577/
   





[GitHub] [spark] SparkQA commented on pull request #34091: [SPARK-36839][INFRA] Add daily build with Hadoop 2 profile in GitHub Actions build

2021-09-23 Thread GitBox


SparkQA commented on pull request #34091:
URL: https://github.com/apache/spark/pull/34091#issuecomment-926326364


   **[Test build #143584 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/143584/testReport)**
 for PR 34091 at commit 
[`27a202a`](https://github.com/apache/spark/commit/27a202a95b3e31c9ece4fa9dabd04d78f597).





[GitHub] [spark] SparkQA commented on pull request #34051: [SPARK-36809][SQL] Remove broadcast for InSubqueryExec used in DPP

2021-09-23 Thread GitBox


SparkQA commented on pull request #34051:
URL: https://github.com/apache/spark/pull/34051#issuecomment-926325914


   Kubernetes integration test starting
   URL: 
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/48092/
   





[GitHub] [spark] AmplabJenkins commented on pull request #34053: [SPARK-36813][SQL][PYTHON] Propose an infrastructure of as-of join and imlement ps.merge_asof

2021-09-23 Thread GitBox


AmplabJenkins commented on pull request #34053:
URL: https://github.com/apache/spark/pull/34053#issuecomment-926325603


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/143571/
   





[GitHub] [spark] AmplabJenkins commented on pull request #34009: [SPARK-34378][SQL][AVRO] Enhance AvroSerializer validation to allow extra nullable Avro fields

2021-09-23 Thread GitBox


AmplabJenkins commented on pull request #34009:
URL: https://github.com/apache/spark/pull/34009#issuecomment-926325588


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/143577/
   





[GitHub] [spark] AmplabJenkins commented on pull request #34038: [SPARK-36797][SQL] Union should resolve nested columns as top-level columns

2021-09-23 Thread GitBox


AmplabJenkins commented on pull request #34038:
URL: https://github.com/apache/spark/pull/34038#issuecomment-926325590


   
   Refer to this link for build results (access rights to CI server needed): 
   
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/48089/
   





[GitHub] [spark] SparkQA commented on pull request #34033: [SPARK-36792][SQL] InSet should handle NaN

2021-09-23 Thread GitBox


SparkQA commented on pull request #34033:
URL: https://github.com/apache/spark/pull/34033#issuecomment-926324867


   Kubernetes integration test starting
   URL: 
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/48093/
   





[GitHub] [spark] SparkQA removed a comment on pull request #34053: [SPARK-36813][SQL][PYTHON] Propose an infrastructure of as-of join and imlement ps.merge_asof

2021-09-23 Thread GitBox


SparkQA removed a comment on pull request #34053:
URL: https://github.com/apache/spark/pull/34053#issuecomment-926224948


   **[Test build #143571 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/143571/testReport)**
 for PR 34053 at commit 
[`973b09d`](https://github.com/apache/spark/commit/973b09d5e6d9c2360bfdef5ad4c5e69d6bb929f8).





[GitHub] [spark] dongjoon-hyun commented on pull request #34085: [SPARK-36835][BUILD] Enable createDependencyReducedPom for Maven shaded plugin

2021-09-23 Thread GitBox


dongjoon-hyun commented on pull request #34085:
URL: https://github.com/apache/spark/pull/34085#issuecomment-926320990


   +1, LGTM.





[GitHub] [spark] SparkQA commented on pull request #34053: [SPARK-36813][SQL][PYTHON] Propose an infrastructure of as-of join and imlement ps.merge_asof

2021-09-23 Thread GitBox


SparkQA commented on pull request #34053:
URL: https://github.com/apache/spark/pull/34053#issuecomment-926320996


   **[Test build #143571 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/143571/testReport)**
 for PR 34053 at commit 
[`973b09d`](https://github.com/apache/spark/commit/973b09d5e6d9c2360bfdef5ad4c5e69d6bb929f8).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.





[GitHub] [spark] HyukjinKwon commented on pull request #34088: Revert "[SPARK-35672][CORE][YARN] Pass user classpath entries to exec…

2021-09-23 Thread GitBox


HyukjinKwon commented on pull request #34088:
URL: https://github.com/apache/spark/pull/34088#issuecomment-926320732


   for 3.1, I will directly revert 
https://github.com/apache/spark/commit/b4916d4a410820ba00125c00b55ca724b27cc853





[GitHub] [spark] HyukjinKwon closed pull request #34088: Revert "[SPARK-35672][CORE][YARN] Pass user classpath entries to exec…

2021-09-23 Thread GitBox


HyukjinKwon closed pull request #34088:
URL: https://github.com/apache/spark/pull/34088


   





[GitHub] [spark] SparkQA removed a comment on pull request #34009: [SPARK-34378][SQL][AVRO] Enhance AvroSerializer validation to allow extra nullable Avro fields

2021-09-23 Thread GitBox


SparkQA removed a comment on pull request #34009:
URL: https://github.com/apache/spark/pull/34009#issuecomment-926271814


   **[Test build #143577 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/143577/testReport)**
 for PR 34009 at commit 
[`9b58975`](https://github.com/apache/spark/commit/9b58975f88eaad623febea4524b3e7a63dd99272).





[GitHub] [spark] SparkQA commented on pull request #34009: [SPARK-34378][SQL][AVRO] Enhance AvroSerializer validation to allow extra nullable Avro fields

2021-09-23 Thread GitBox


SparkQA commented on pull request #34009:
URL: https://github.com/apache/spark/pull/34009#issuecomment-926320328


   **[Test build #143577 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/143577/testReport)**
 for PR 34009 at commit 
[`9b58975`](https://github.com/apache/spark/commit/9b58975f88eaad623febea4524b3e7a63dd99272).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.





[GitHub] [spark] SparkQA commented on pull request #34038: [SPARK-36797][SQL] Union should resolve nested columns as top-level columns

2021-09-23 Thread GitBox


SparkQA commented on pull request #34038:
URL: https://github.com/apache/spark/pull/34038#issuecomment-926319931


   Kubernetes integration test status failure
   URL: 
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/48089/
   





[GitHub] [spark] HyukjinKwon commented on pull request #34088: Revert "[SPARK-35672][CORE][YARN] Pass user classpath entries to exec…

2021-09-23 Thread GitBox


HyukjinKwon commented on pull request #34088:
URL: https://github.com/apache/spark/pull/34088#issuecomment-926319897


   The `UISeleniumSuite` test failure looks very unlikely to be related. I am merging it in.





[GitHub] [spark] HyukjinKwon opened a new pull request #34091: [SPARK-36839][INFRA] Add daily build with Hadoop 2 profile in GitHub Actions build

2021-09-23 Thread GitBox


HyukjinKwon opened a new pull request #34091:
URL: https://github.com/apache/spark/pull/34091


   ### What changes were proposed in this pull request?
   
   This PR proposes running a daily build with the Hadoop 2 profile in GitHub Actions.
   
   ### Why are the changes needed?
   
   To improve test coverage and catch bugs, e.g. 
https://github.com/apache/spark/pull/34064
   
   ### Does this PR introduce _any_ user-facing change?
   
   No, dev-only.
   
   ### How was this patch tested?
   
   Being tested in my own fork.





[GitHub] [spark] AmplabJenkins commented on pull request #34089: [SPARK-36837][BUILD] Upgrade Kafka to 3.0.0

2021-09-23 Thread GitBox


AmplabJenkins commented on pull request #34089:
URL: https://github.com/apache/spark/pull/34089#issuecomment-926317237


   
   Refer to this link for build results (access rights to CI server needed): 
   
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/48091/
   





[GitHub] [spark] SparkQA commented on pull request #34089: [SPARK-36837][BUILD] Upgrade Kafka to 3.0.0

2021-09-23 Thread GitBox


SparkQA commented on pull request #34089:
URL: https://github.com/apache/spark/pull/34089#issuecomment-926317214


   Kubernetes integration test unable to build dist.
   
   exiting with code: 1
   URL: 
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/48091/
   





[GitHub] [spark] AmplabJenkins commented on pull request #34090: [SPARK-36827][CORE] Fix perf issue in AppStatusListener.cleanupStages

2021-09-23 Thread GitBox


AmplabJenkins commented on pull request #34090:
URL: https://github.com/apache/spark/pull/34090#issuecomment-926316219


   Can one of the admins verify this patch?





[GitHub] [spark] taroplus opened a new pull request #34090: [SPARK-36827][CORE] Fix perf issue in AppStatusListener.cleanupStages

2021-09-23 Thread GitBox


taroplus opened a new pull request #34090:
URL: https://github.com/apache/spark/pull/34090


   ### What changes were proposed in this pull request?
   This PR fixes a performance issue in `AppStatusListener.cleanupStages`. When 
there is a large number of stages in the store, the logic below runs in the order of N*M.
   
   ```
   val stageIds = stages.map { s =>
     val key = Array(s.info.stageId, s.info.attemptId)
     kvstore.delete(s.getClass(), key)
   
     // Check whether there are remaining attempts for the same stage. If there aren't, then
     // also delete the RDD graph data.
     val remainingAttempts = kvstore.view(classOf[StageDataWrapper])
       .index("stageId")
       .first(s.info.stageId)
       .last(s.info.stageId)
       .closeableIterator()
     ...
   ```
   Instead of accessing the view to check the remaining attempts for each stage, this 
change moves that check to after the stages are removed. The view 
(`kvstore.view(classOf[StageDataWrapper])`) then only needs to be accessed once.
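
The difference between the two access patterns can be modeled outside Spark. The following is a minimal sketch and not the code in this PR: a plain Python set of `(stage_id, attempt_id)` keys stands in for the kvstore, and the names `cleanup_per_stage` and `cleanup_single_pass` are hypothetical. It only illustrates why scanning once after all deletes gives the same result as rescanning after every delete.

```python
from typing import Iterable, List, Set, Tuple

StageKey = Tuple[int, int]  # (stage_id, attempt_id); a stand-in for the real key


def cleanup_per_stage(store: Set[StageKey], to_delete: Iterable[StageKey]) -> List[int]:
    """Delete each key and rescan the whole store after every delete."""
    orphaned = []
    for key in to_delete:
        store.discard(key)
        stage_id = key[0]
        # One full scan per deleted key: N deletes times M stored entries.
        if not any(sid == stage_id for sid, _ in store):
            orphaned.append(stage_id)
    return orphaned


def cleanup_single_pass(store: Set[StageKey], to_delete: Iterable[StageKey]) -> List[int]:
    """Delete all keys first, then scan the store a single time."""
    keys = list(to_delete)
    for key in keys:
        store.discard(key)
    # The only scan of the store: collect stage ids that still have attempts.
    remaining = {sid for sid, _ in store}
    # Unique deleted stage ids, in first-seen order.
    deleted_ids = dict.fromkeys(sid for sid, _ in keys)
    return [sid for sid in deleted_ids if sid not in remaining]


if __name__ == "__main__":
    keys = [(1, 0), (1, 1), (2, 0)]
    store_a = {(1, 0), (1, 1), (2, 0), (2, 1), (3, 0)}
    store_b = set(store_a)
    print(cleanup_per_stage(store_a, keys))
    print(cleanup_single_pass(store_b, keys))
```

In this model, the stage ids returned are the ones whose RDD graph data could also be deleted, because no attempt for that stage remains in the store.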
   
   ### Why are the changes needed?
   When more stages than ideal are kept in memory, the cleanup process cannot 
keep up with the rate of incoming stages because of this perf issue. That leads 
to behavior that looks like a memory leak and eventually causes an 
OutOfMemoryError.
   
   ### Does this PR introduce _any_ user-facing change?
   No
   
   ### How was this patch tested?
   The behavior should be identical before and after the change, and the 
existing tests should verify that. This change has been applied to the 
environment where a constant memory leak was observed. Under the same load, the 
services are now running healthily.





[GitHub] [spark] AmplabJenkins removed a comment on pull request #34088: Revert "[SPARK-35672][CORE][YARN] Pass user classpath entries to exec…

2021-09-23 Thread GitBox


AmplabJenkins removed a comment on pull request #34088:
URL: https://github.com/apache/spark/pull/34088#issuecomment-926314083


   
   Refer to this link for build results (access rights to CI server needed): 
   
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/48088/
   





[GitHub] [spark] SparkQA commented on pull request #34088: Revert "[SPARK-35672][CORE][YARN] Pass user classpath entries to exec…

2021-09-23 Thread GitBox


SparkQA commented on pull request #34088:
URL: https://github.com/apache/spark/pull/34088#issuecomment-926314068


   Kubernetes integration test status failure
   URL: 
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/48088/
   





[GitHub] [spark] AmplabJenkins commented on pull request #34088: Revert "[SPARK-35672][CORE][YARN] Pass user classpath entries to exec…

2021-09-23 Thread GitBox


AmplabJenkins commented on pull request #34088:
URL: https://github.com/apache/spark/pull/34088#issuecomment-926314083


   
   Refer to this link for build results (access rights to CI server needed): 
   
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/48088/
   





[GitHub] [spark] AmplabJenkins removed a comment on pull request #34083: Add docs about using Shiv for packaging (similar to PEX)

2021-09-23 Thread GitBox


AmplabJenkins removed a comment on pull request #34083:
URL: https://github.com/apache/spark/pull/34083#issuecomment-926313819


   
   Refer to this link for build results (access rights to CI server needed): 
   
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/48087/
   





[GitHub] [spark] AmplabJenkins commented on pull request #34083: Add docs about using Shiv for packaging (similar to PEX)

2021-09-23 Thread GitBox


AmplabJenkins commented on pull request #34083:
URL: https://github.com/apache/spark/pull/34083#issuecomment-926313819


   
   Refer to this link for build results (access rights to CI server needed): 
   
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/48087/
   





[GitHub] [spark] SparkQA commented on pull request #34083: Add docs about using Shiv for packaging (similar to PEX)

2021-09-23 Thread GitBox


SparkQA commented on pull request #34083:
URL: https://github.com/apache/spark/pull/34083#issuecomment-926313795


   Kubernetes integration test status failure
   URL: 
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/48087/
   





[GitHub] [spark] HyukjinKwon commented on a change in pull request #34058: [SPARK-36711][PYTHON] Support multi-index in new syntax

2021-09-23 Thread GitBox


HyukjinKwon commented on a change in pull request #34058:
URL: https://github.com/apache/spark/pull/34058#discussion_r715290197



##
File path: python/pyspark/pandas/typedef/typehints.py
##
@@ -690,98 +696,145 @@ def create_tuple_for_frame_type(params: Any) -> object:
 Typing data columns with an index:
 
 >>> ps.DataFrame[int, [int, int]]  # doctest: +ELLIPSIS
-typing.Tuple[...IndexNameType, int, int]
+typing.Tuple[...IndexNameType, ...NameType, ...NameType]
 >>> ps.DataFrame[pdf.index.dtype, pdf.dtypes]  # doctest: +ELLIPSIS
-typing.Tuple[...IndexNameType, numpy.int64]
+typing.Tuple[...IndexNameType, ...NameType]
 >>> ps.DataFrame[("index", int), [("id", int), ("A", int)]]  # doctest: +ELLIPSIS
 typing.Tuple[...IndexNameType, ...NameType, ...NameType]
 >>> ps.DataFrame[(pdf.index.name, pdf.index.dtype), zip(pdf.columns, pdf.dtypes)]
 ... # doctest: +ELLIPSIS
 typing.Tuple[...IndexNameType, ...NameType]
+
+Typing data columns with an Multi-index:
+>>> arrays = [[1, 1, 2], ['red', 'blue', 'red']]
+>>> idx = pd.MultiIndex.from_arrays(arrays, names=('number', 'color'))
+>>> pdf = pd.DataFrame({'a': range(3)}, index=idx)
+>>> ps.DataFrame[[int, int], [int, int]]  # doctest: +ELLIPSIS
+typing.Tuple[...IndexNameType, ...IndexNameType, ...NameType, ...NameType]
+>>> ps.DataFrame[pdf.index.dtypes, pdf.dtypes]  # doctest: +ELLIPSIS
+typing.Tuple[...IndexNameType, ...NameType]
+>>> ps.DataFrame[[("index-1", int), ("index-2", int)], [("id", int), ("A", int)]]
+... # doctest: +ELLIPSIS
+typing.Tuple[...IndexNameType, ...IndexNameType, ...NameType, ...NameType]
+>>> ps.DataFrame[zip(pdf.index.names, pdf.index.dtypes), zip(pdf.columns, pdf.dtypes)]
+... # doctest: +ELLIPSIS
+typing.Tuple[...IndexNameType, ...NameType]
 """
-return Tuple[extract_types(params)]
+return Tuple[_extract_types(params)]
 
 
-# TODO(SPARK-36708): numpy.typing (numpy 1.21+) support for nested types.
-def extract_types(params: Any) -> Tuple:
+def _extract_types(params: Any) -> Tuple:
 origin = params
-if isinstance(params, zip):  # type: ignore
-# Example:
-#   DataFrame[zip(pdf.columns, pdf.dtypes)]
-params = tuple(slice(name, tpe) for name, tpe in params)  # type: ignore
 
-if isinstance(params, Iterable):
-params = tuple(params)
-else:
-params = (params,)
+params = _prepare_a_tuple(params)
 
-if all(
-isinstance(param, slice)
-and param.start is not None
-and param.step is None
-and param.stop is not None
-for param in params
-):
+if _is_valid_slices(params):
 # Example:
 #   DataFrame["id": int, "A": int]
-new_params = []
-for param in params:
-new_param = type("NameType", (NameTypeHolder,), {})  # type: Type[NameTypeHolder]
-new_param.name = param.start
-# When the given argument is a numpy's dtype instance.
-new_param.tpe = param.stop.type if isinstance(param.stop, np.dtype) else param.stop
-new_params.append(new_param)
-
+new_params = _convert_slices_to_holders(params, is_index=False)
 return tuple(new_params)
 elif len(params) == 2 and isinstance(params[1], (zip, list, pd.Series)):
 # Example:
 #   DataFrame[int, [int, int]]
 #   DataFrame[pdf.index.dtype, pdf.dtypes]
 #   DataFrame[("index", int), [("id", int), ("A", int)]]
 #   DataFrame[(pdf.index.name, pdf.index.dtype), zip(pdf.columns, pdf.dtypes)]
+#
+#   DataFrame[[int, int], [int, int]]
+#   DataFrame[pdf.index.dtypes, pdf.dtypes]
+#   DataFrame[[("index", int), ("index-2", int)], [("id", int), ("A", int)]]
+#   DataFrame[zip(pdf.index.names, pdf.index.dtypes), zip(pdf.columns, pdf.dtypes)]
 
-index_param = params[0]
-index_type = type(
-"IndexNameType", (IndexNameTypeHolder,), {}
-)  # type: Type[IndexNameTypeHolder]
-if isinstance(index_param, tuple):
-if len(index_param) != 2:
-raise TypeError(
-"Type hints for index should be specified as "
-"DataFrame[('name', type), ...]; however, got %s" % index_param
-)
-name, tpe = index_param
-else:
-name, tpe = None, index_param
+index_params = params[0]
+
+if isinstance(index_params, tuple) and len(index_params) == 2:
+index_params = tuple([slice(*index_params)])
+
+index_params = (
+_convert_tuples_to_zip(index_params)
+if _is_valid_type_tuples(index_params)
+else index_params
+)
+index_params = _prepare_a_tuple(index_params)
 
-index_type.name 
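
The `DataFrame["id": int, "A": int]` form in the diff above depends on how Python passes subscript arguments: each `name: type` item arrives as a `slice`. The sketch below illustrates that mechanism only. `Frame`, `is_valid_slices`, and `names_and_types` are hypothetical names, and the check is modeled on the inline `all(isinstance(param, slice) ...)` condition in the removed lines, not on the actual `_is_valid_slices` helper, whose body is not shown here.

```python
from typing import Any, List, Tuple


def is_valid_slices(params: Tuple[Any, ...]) -> bool:
    """True when every item is a `name: type` slice without a step."""
    return all(
        isinstance(p, slice)
        and p.start is not None
        and p.step is None
        and p.stop is not None
        for p in params
    )


class Frame:
    """Hypothetical stand-in for a subscriptable DataFrame class."""

    def __class_getitem__(cls, params: Any) -> Tuple[Any, ...]:
        # Python passes a single item as-is and several items as a tuple.
        return params if isinstance(params, tuple) else (params,)


def names_and_types(params: Tuple[Any, ...]) -> List[Tuple[Any, Any]]:
    """Split each slice into its (name, type) pair."""
    return [(p.start, p.stop) for p in params]


named = Frame["id": int, "A": int]  # two slices: name in .start, type in .stop
unnamed = Frame[int]                # a bare type, not a slice

if __name__ == "__main__":
    print(is_valid_slices(named), names_and_types(named))
    print(is_valid_slices(unnamed))
```

A bare type such as `Frame[int]` fails the slice check, which is why the code under review needs a separate branch for the other accepted forms.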

[GitHub] [spark] SparkQA removed a comment on pull request #34089: [SPARK-36837][BUILD] Upgrade Kafka to 3.0.0

2021-09-23 Thread GitBox


SparkQA removed a comment on pull request #34089:
URL: https://github.com/apache/spark/pull/34089#issuecomment-926308162


   **[Test build #143581 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/143581/testReport)**
 for PR 34089 at commit 
[`d0c9ed4`](https://github.com/apache/spark/commit/d0c9ed4069a8d7b1006fc8dc8c1422bd25893136).





[GitHub] [spark] HyukjinKwon commented on a change in pull request #34058: [SPARK-36711][PYTHON] Support multi-index in new syntax

2021-09-23 Thread GitBox


HyukjinKwon commented on a change in pull request #34058:
URL: https://github.com/apache/spark/pull/34058#discussion_r715290116



##
File path: python/pyspark/pandas/typedef/typehints.py
##
@@ -690,98 +696,145 @@ def create_tuple_for_frame_type(params: Any) -> object:
 Typing data columns with an index:
 
 >>> ps.DataFrame[int, [int, int]]  # doctest: +ELLIPSIS
-typing.Tuple[...IndexNameType, int, int]
+typing.Tuple[...IndexNameType, ...NameType, ...NameType]
 >>> ps.DataFrame[pdf.index.dtype, pdf.dtypes]  # doctest: +ELLIPSIS
-typing.Tuple[...IndexNameType, numpy.int64]
+typing.Tuple[...IndexNameType, ...NameType]
 >>> ps.DataFrame[("index", int), [("id", int), ("A", int)]]  # doctest: +ELLIPSIS
 typing.Tuple[...IndexNameType, ...NameType, ...NameType]
 >>> ps.DataFrame[(pdf.index.name, pdf.index.dtype), zip(pdf.columns, pdf.dtypes)]
 ... # doctest: +ELLIPSIS
 typing.Tuple[...IndexNameType, ...NameType]
+
+Typing data columns with an Multi-index:
+>>> arrays = [[1, 1, 2], ['red', 'blue', 'red']]
+>>> idx = pd.MultiIndex.from_arrays(arrays, names=('number', 'color'))
+>>> pdf = pd.DataFrame({'a': range(3)}, index=idx)
+>>> ps.DataFrame[[int, int], [int, int]]  # doctest: +ELLIPSIS
+typing.Tuple[...IndexNameType, ...IndexNameType, ...NameType, ...NameType]
+>>> ps.DataFrame[pdf.index.dtypes, pdf.dtypes]  # doctest: +ELLIPSIS
+typing.Tuple[...IndexNameType, ...NameType]
+>>> ps.DataFrame[[("index-1", int), ("index-2", int)], [("id", int), ("A", int)]]
+... # doctest: +ELLIPSIS
+typing.Tuple[...IndexNameType, ...IndexNameType, ...NameType, ...NameType]
+>>> ps.DataFrame[zip(pdf.index.names, pdf.index.dtypes), zip(pdf.columns, pdf.dtypes)]
+... # doctest: +ELLIPSIS
+typing.Tuple[...IndexNameType, ...NameType]
 """
-return Tuple[extract_types(params)]
+return Tuple[_extract_types(params)]
 
 
-# TODO(SPARK-36708): numpy.typing (numpy 1.21+) support for nested types.
-def extract_types(params: Any) -> Tuple:
+def _extract_types(params: Any) -> Tuple:
 origin = params
-if isinstance(params, zip):  # type: ignore
-# Example:
-#   DataFrame[zip(pdf.columns, pdf.dtypes)]
-params = tuple(slice(name, tpe) for name, tpe in params)  # type: ignore
 
-if isinstance(params, Iterable):
-params = tuple(params)
-else:
-params = (params,)
+params = _prepare_a_tuple(params)
 
-if all(
-isinstance(param, slice)
-and param.start is not None
-and param.step is None
-and param.stop is not None
-for param in params
-):
+if _is_valid_slices(params):
 # Example:
 #   DataFrame["id": int, "A": int]
-new_params = []
-for param in params:
-new_param = type("NameType", (NameTypeHolder,), {})  # type: Type[NameTypeHolder]
-new_param.name = param.start
-# When the given argument is a numpy's dtype instance.
-new_param.tpe = param.stop.type if isinstance(param.stop, np.dtype) else param.stop
-new_params.append(new_param)
-
+new_params = _convert_slices_to_holders(params, is_index=False)
 return tuple(new_params)
 elif len(params) == 2 and isinstance(params[1], (zip, list, pd.Series)):
 # Example:
 #   DataFrame[int, [int, int]]
 #   DataFrame[pdf.index.dtype, pdf.dtypes]
 #   DataFrame[("index", int), [("id", int), ("A", int)]]
 #   DataFrame[(pdf.index.name, pdf.index.dtype), zip(pdf.columns, pdf.dtypes)]
+#
+#   DataFrame[[int, int], [int, int]]
+#   DataFrame[pdf.index.dtypes, pdf.dtypes]
+#   DataFrame[[("index", int), ("index-2", int)], [("id", int), ("A", int)]]
+#   DataFrame[zip(pdf.index.names, pdf.index.dtypes), zip(pdf.columns, pdf.dtypes)]
 
-index_param = params[0]
-index_type = type(
-"IndexNameType", (IndexNameTypeHolder,), {}
-)  # type: Type[IndexNameTypeHolder]
-if isinstance(index_param, tuple):
-if len(index_param) != 2:
-raise TypeError(
-"Type hints for index should be specified as "
-"DataFrame[('name', type), ...]; however, got %s" % index_param
-)
-name, tpe = index_param
-else:
-name, tpe = None, index_param
+index_params = params[0]
+
+if isinstance(index_params, tuple) and len(index_params) == 2:
+index_params = tuple([slice(*index_params)])
+
+index_params = (
+_convert_tuples_to_zip(index_params)
+if _is_valid_type_tuples(index_params)
+else index_params
+)
+index_params = _prepare_a_tuple(index_params)
 
-index_type.name 

[GitHub] [spark] AmplabJenkins removed a comment on pull request #34089: [SPARK-36837][BUILD] Upgrade Kafka to 3.0.0

2021-09-23 Thread GitBox


AmplabJenkins removed a comment on pull request #34089:
URL: https://github.com/apache/spark/pull/34089#issuecomment-926311898


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/143581/
   





[GitHub] [spark] AmplabJenkins commented on pull request #34089: [SPARK-36837][BUILD] Upgrade Kafka to 3.0.0

2021-09-23 Thread GitBox


AmplabJenkins commented on pull request #34089:
URL: https://github.com/apache/spark/pull/34089#issuecomment-926311898


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/143581/
   





[GitHub] [spark] SparkQA commented on pull request #34089: [SPARK-36837][BUILD] Upgrade Kafka to 3.0.0

2021-09-23 Thread GitBox


SparkQA commented on pull request #34089:
URL: https://github.com/apache/spark/pull/34089#issuecomment-926311866


   **[Test build #143581 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/143581/testReport)**
 for PR 34089 at commit 
[`d0c9ed4`](https://github.com/apache/spark/commit/d0c9ed4069a8d7b1006fc8dc8c1422bd25893136).
* This patch **fails to build**.
* This patch merges cleanly.
* This patch adds no public classes.





[GitHub] [spark] HyukjinKwon commented on a change in pull request #34051: [SPARK-36809][SQL] Remove broadcast for InSubqueryExec used in DPP

2021-09-23 Thread GitBox


HyukjinKwon commented on a change in pull request #34051:
URL: https://github.com/apache/spark/pull/34051#discussion_r715289655



##
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/subquery.scala
##
@@ -157,7 +161,8 @@ case class InSubqueryExec(
   child = child.canonicalized,
   plan = plan.canonicalized.asInstanceOf[BaseSubqueryExec],
   exprId = ExprId(0),
-  resultBroadcast = null)
+  resultBroadcast = null,
+  result = null)

Review comment:
   I see, okay. That's fine.







[GitHub] [spark] SparkQA commented on pull request #34033: [SPARK-36792][SQL] InSet should handle NaN

2021-09-23 Thread GitBox


SparkQA commented on pull request #34033:
URL: https://github.com/apache/spark/pull/34033#issuecomment-926308397


   **[Test build #143583 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/143583/testReport)**
 for PR 34033 at commit 
[`ef0e81f`](https://github.com/apache/spark/commit/ef0e81f8e8e5872c4402aee1525a27febefd7292).





[GitHub] [spark] SparkQA commented on pull request #34051: [SPARK-36809][SQL] Remove broadcast for InSubqueryExec used in DPP

2021-09-23 Thread GitBox


SparkQA commented on pull request #34051:
URL: https://github.com/apache/spark/pull/34051#issuecomment-926308274


   **[Test build #143582 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/143582/testReport)**
 for PR 34051 at commit 
[`47dce1a`](https://github.com/apache/spark/commit/47dce1a9bb14f2e4eb3b9fe669d6bf6d7ef7042a).





[GitHub] [spark] SparkQA commented on pull request #34089: [SPARK-36837][BUILD] Upgrade Kafka to 3.0.0

2021-09-23 Thread GitBox


SparkQA commented on pull request #34089:
URL: https://github.com/apache/spark/pull/34089#issuecomment-926308162


   **[Test build #143581 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/143581/testReport)**
 for PR 34089 at commit 
[`d0c9ed4`](https://github.com/apache/spark/commit/d0c9ed4069a8d7b1006fc8dc8c1422bd25893136).





[GitHub] [spark] AmplabJenkins removed a comment on pull request #34009: [SPARK-34378][SQL][AVRO] Enhance AvroSerializer validation to allow extra nullable Avro fields

2021-09-23 Thread GitBox


AmplabJenkins removed a comment on pull request #34009:
URL: https://github.com/apache/spark/pull/34009#issuecomment-926306970


   
   Refer to this link for build results (access rights to CI server needed): 
   
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/48086/
   





[GitHub] [spark] AmplabJenkins removed a comment on pull request #34085: [SPARK-36835][BUILD] Enable createDependencyReducedPom for Maven shaded plugin

2021-09-23 Thread GitBox


AmplabJenkins removed a comment on pull request #34085:
URL: https://github.com/apache/spark/pull/34085#issuecomment-926306969


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/143574/
   





[GitHub] [spark] AmplabJenkins removed a comment on pull request #34089: [SPARK-36837][BUILD] Upgrade Kafka to 3.0.0

2021-09-23 Thread GitBox


AmplabJenkins removed a comment on pull request #34089:
URL: https://github.com/apache/spark/pull/34089#issuecomment-926306971


   
   Refer to this link for build results (access rights to CI server needed): 
   
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/48090/
   





[GitHub] [spark] AmplabJenkins commented on pull request #34085: [SPARK-36835][BUILD] Enable createDependencyReducedPom for Maven shaded plugin

2021-09-23 Thread GitBox


AmplabJenkins commented on pull request #34085:
URL: https://github.com/apache/spark/pull/34085#issuecomment-926306969


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/143574/
   





[GitHub] [spark] AmplabJenkins commented on pull request #34089: [SPARK-36837][BUILD] Upgrade Kafka to 3.0.0

2021-09-23 Thread GitBox


AmplabJenkins commented on pull request #34089:
URL: https://github.com/apache/spark/pull/34089#issuecomment-926306971


   
   Refer to this link for build results (access rights to CI server needed): 
   
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/48090/
   





[GitHub] [spark] AmplabJenkins commented on pull request #34009: [SPARK-34378][SQL][AVRO] Enhance AvroSerializer validation to allow extra nullable Avro fields

2021-09-23 Thread GitBox


AmplabJenkins commented on pull request #34009:
URL: https://github.com/apache/spark/pull/34009#issuecomment-926306970


   
   Refer to this link for build results (access rights to CI server needed): 
   
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/48086/
   





[GitHub] [spark] SparkQA removed a comment on pull request #34085: [SPARK-36835][BUILD] Enable createDependencyReducedPom for Maven shaded plugin

2021-09-23 Thread GitBox


SparkQA removed a comment on pull request #34085:
URL: https://github.com/apache/spark/pull/34085#issuecomment-926252427


   **[Test build #143574 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/143574/testReport)**
 for PR 34085 at commit 
[`08b1f31`](https://github.com/apache/spark/commit/08b1f31a7587cc8536b8a672b0a390ab6618bb97).





[GitHub] [spark] viirya commented on a change in pull request #34051: [SPARK-36809][SQL] Remove broadcast for InSubqueryExec used in DPP

2021-09-23 Thread GitBox


viirya commented on a change in pull request #34051:
URL: https://github.com/apache/spark/pull/34051#discussion_r715285748



##
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/subquery.scala
##
@@ -157,7 +161,8 @@ case class InSubqueryExec(
   child = child.canonicalized,
   plan = plan.canonicalized.asInstanceOf[BaseSubqueryExec],
   exprId = ExprId(0),
-  resultBroadcast = null)
+  resultBroadcast = null,
+  result = null)

Review comment:
   I tried to move it out of the constructor, but there were some errors about 
`result` being null at the moment of preparing the result. There might be somewhere 
that we `copy` it. In that case, we would lose the `result` value.







[GitHub] [spark] SparkQA commented on pull request #34085: [SPARK-36835][BUILD] Enable createDependencyReducedPom for Maven shaded plugin

2021-09-23 Thread GitBox


SparkQA commented on pull request #34085:
URL: https://github.com/apache/spark/pull/34085#issuecomment-926305859


   **[Test build #143574 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/143574/testReport)**
 for PR 34085 at commit 
[`08b1f31`](https://github.com/apache/spark/commit/08b1f31a7587cc8536b8a672b0a390ab6618bb97).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.





[GitHub] [spark] SparkQA commented on pull request #34038: [SPARK-36797][SQL] Union should resolve nested columns as top-level columns

2021-09-23 Thread GitBox


SparkQA commented on pull request #34038:
URL: https://github.com/apache/spark/pull/34038#issuecomment-926305432


   Kubernetes integration test starting
   URL: 
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/48089/
   





[GitHub] [spark] dgd-contributor commented on a change in pull request #34058: [SPARK-36711][PYTHON] Support multi-index in new syntax

2021-09-23 Thread GitBox


dgd-contributor commented on a change in pull request #34058:
URL: https://github.com/apache/spark/pull/34058#discussion_r715284917



##
File path: python/pyspark/pandas/typedef/typehints.py
##
@@ -690,98 +696,145 @@ def create_tuple_for_frame_type(params: Any) -> object:
 Typing data columns with an index:
 
 >>> ps.DataFrame[int, [int, int]]  # doctest: +ELLIPSIS
-typing.Tuple[...IndexNameType, int, int]
+typing.Tuple[...IndexNameType, ...NameType, ...NameType]
 >>> ps.DataFrame[pdf.index.dtype, pdf.dtypes]  # doctest: +ELLIPSIS
-typing.Tuple[...IndexNameType, numpy.int64]
+typing.Tuple[...IndexNameType, ...NameType]
 >>> ps.DataFrame[("index", int), [("id", int), ("A", int)]]  # 
doctest: +ELLIPSIS
 typing.Tuple[...IndexNameType, ...NameType, ...NameType]
 >>> ps.DataFrame[(pdf.index.name, pdf.index.dtype), zip(pdf.columns, 
pdf.dtypes)]
 ... # doctest: +ELLIPSIS
 typing.Tuple[...IndexNameType, ...NameType]
+
+Typing data columns with a multi-index:
+>>> arrays = [[1, 1, 2], ['red', 'blue', 'red']]
+>>> idx = pd.MultiIndex.from_arrays(arrays, names=('number', 'color'))
+>>> pdf = pd.DataFrame({'a': range(3)}, index=idx)
+>>> ps.DataFrame[[int, int], [int, int]]  # doctest: +ELLIPSIS
+typing.Tuple[...IndexNameType, ...IndexNameType, ...NameType, 
...NameType]
+>>> ps.DataFrame[pdf.index.dtypes, pdf.dtypes]  # doctest: +ELLIPSIS
+typing.Tuple[...IndexNameType, ...NameType]
+>>> ps.DataFrame[[("index-1", int), ("index-2", int)], [("id", int), 
("A", int)]]
+... # doctest: +ELLIPSIS
+typing.Tuple[...IndexNameType, ...IndexNameType, ...NameType, 
...NameType]
+>>> ps.DataFrame[zip(pdf.index.names, pdf.index.dtypes), 
zip(pdf.columns, pdf.dtypes)]
+... # doctest: +ELLIPSIS
+typing.Tuple[...IndexNameType, ...NameType]
 """
-return Tuple[extract_types(params)]
+return Tuple[_extract_types(params)]
 
 
-# TODO(SPARK-36708): numpy.typing (numpy 1.21+) support for nested types.
-def extract_types(params: Any) -> Tuple:
+def _extract_types(params: Any) -> Tuple:
 origin = params
-if isinstance(params, zip):  # type: ignore
-# Example:
-#   DataFrame[zip(pdf.columns, pdf.dtypes)]
-params = tuple(slice(name, tpe) for name, tpe in params)  # type: 
ignore
 
-if isinstance(params, Iterable):
-params = tuple(params)
-else:
-params = (params,)
+params = _prepare_a_tuple(params)
 
-if all(
-isinstance(param, slice)
-and param.start is not None
-and param.step is None
-and param.stop is not None
-for param in params
-):
+if _is_valid_slices(params):
 # Example:
 #   DataFrame["id": int, "A": int]
-new_params = []
-for param in params:
-new_param = type("NameType", (NameTypeHolder,), {})  # type: 
Type[NameTypeHolder]
-new_param.name = param.start
-# When the given argument is a numpy's dtype instance.
-new_param.tpe = param.stop.type if isinstance(param.stop, 
np.dtype) else param.stop
-new_params.append(new_param)
-
+new_params = _convert_slices_to_holders(params, is_index=False)
 return tuple(new_params)
 elif len(params) == 2 and isinstance(params[1], (zip, list, pd.Series)):
 # Example:
 #   DataFrame[int, [int, int]]
 #   DataFrame[pdf.index.dtype, pdf.dtypes]
 #   DataFrame[("index", int), [("id", int), ("A", int)]]
 #   DataFrame[(pdf.index.name, pdf.index.dtype), zip(pdf.columns, 
pdf.dtypes)]
+#
+#   DataFrame[[int, int], [int, int]]
+#   DataFrame[pdf.index.dtypes, pdf.dtypes]
+#   DataFrame[[("index", int), ("index-2", int)], [("id", int), ("A", 
int)]]
+#   DataFrame[zip(pdf.index.names, pdf.index.dtypes), zip(pdf.columns, 
pdf.dtypes)]
 
-index_param = params[0]
-index_type = type(
-"IndexNameType", (IndexNameTypeHolder,), {}
-)  # type: Type[IndexNameTypeHolder]
-if isinstance(index_param, tuple):
-if len(index_param) != 2:
-raise TypeError(
-"Type hints for index should be specified as "
-"DataFrame[('name', type), ...]; however, got %s" % 
index_param
-)
-name, tpe = index_param
-else:
-name, tpe = None, index_param
+index_params = params[0]
+
+if isinstance(index_params, tuple) and len(index_params) == 2:
+index_params = tuple([slice(*index_params)])
+
+index_params = (
+_convert_tuples_to_zip(index_params)
+if _is_valid_type_tuples(index_params)
+else index_params
+)
+index_params = _prepare_a_tuple(index_params)
 
-
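
The slice-based syntax handled in the diff above can be sketched on its own. 
This is a minimal sketch that mirrors the removed inline logic, not the PR's 
implementation: the bodies of `_prepare_a_tuple` and `_is_valid_slices` are not 
shown in the diff, so `prepare_tuple`, `is_valid_slices`, and the `Frame` class 
below are hypothetical stand-ins.

```python
from collections.abc import Iterable
from typing import Any, Tuple


class Frame:
    """Hypothetical stand-in that returns whatever is put in the brackets."""

    def __class_getitem__(cls, params: Any) -> Any:
        return params


def prepare_tuple(params: Any) -> Tuple:
    """Normalize the subscript argument into a tuple (assumed behavior)."""
    if isinstance(params, zip):
        # e.g. Frame[zip(names, types)]: turn (name, type) pairs into slices.
        return tuple(slice(name, tpe) for name, tpe in params)
    if isinstance(params, Iterable):
        return tuple(params)
    # A single slice or a single type is not iterable, so wrap it.
    return (params,)


def is_valid_slices(params: Tuple) -> bool:
    """True when every item looks like `"name": type` with no step."""
    return all(
        isinstance(p, slice)
        and p.start is not None
        and p.step is None
        and p.stop is not None
        for p in params
    )


# Frame["id": int, "A": int] arrives as a tuple of two slice objects.
named = prepare_tuple(Frame["id": int, "A": int])
print(is_valid_slices(named))

# Frame[int] arrives as the bare type, which is not a named slice.
bare = prepare_tuple(Frame[int])
print(is_valid_slices(bare))
```

Each slice carries the column name in `start` and the type in `stop`, which is 
what the removed loop read when it filled in `new_param.name` and 
`new_param.tpe`.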

[GitHub] [spark] SparkQA commented on pull request #34089: [SPARK-36837][BUILD] Upgrade Kafka to 3.0.0

2021-09-23 Thread GitBox


SparkQA commented on pull request #34089:
URL: https://github.com/apache/spark/pull/34089#issuecomment-926304717


   Kubernetes integration test unable to build dist.
   
   exiting with code: 1
   URL: 
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/48090/
   




