[GitHub] [spark] viirya commented on a change in pull request #33989: [SPARK-36676][SQL][BUILD] Create shaded Hive module and upgrade Guava version to 30.1.1-jre
viirya commented on a change in pull request #33989: URL: https://github.com/apache/spark/pull/33989#discussion_r715335263 ## File path: assembly/pom.xml ## @@ -165,6 +169,13 @@ hive + Review comment: okay thanks for the update! -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] mridulm commented on pull request #34092: [WIP][SPARK-36827][CORE] Improve the perf and memory usage of cleaning up stage UI data
mridulm commented on pull request #34092: URL: https://github.com/apache/spark/pull/34092#issuecomment-926363391 +CC @zhouyejoe, @thejdeep
[GitHub] [spark] mridulm commented on a change in pull request #34092: [WIP][SPARK-36827][CORE] Improve the perf and memory usage of cleaning up stage UI data
mridulm commented on a change in pull request #34092:
URL: https://github.com/apache/spark/pull/34092#discussion_r715332652

## File path: core/src/main/scala/org/apache/spark/status/AppStatusListener.scala ##
@@ -1253,44 +1254,46 @@ private[spark] class AppStatusListener(
     toDelete.foreach { j => kvstore.delete(j.getClass(), j.info.jobId) }
   }

+  private case class StageCompletionTime(
+    stageId: Int,
+    attemptId: Int,
+    completionTime: Long)
+
   private def cleanupStages(count: Long): Unit = {
     val countToDelete = calculateNumberToRemove(count, conf.get(MAX_RETAINED_STAGES))
     if (countToDelete <= 0L) {
       return
     }

+    val stageArray = new ArrayBuffer[StageCompletionTime]()
+    val stageDataCount = new mutable.HashMap[Int, Int]()
+    kvstore.view(classOf[StageDataWrapper]).forEach { s =>
+      // Here we keep track of the total number of StageDataWrapper entries for each stage id.
+      // This will be used in cleaning up the RDDOperationGraphWrapper data.
+      if (stageDataCount.contains(s.info.stageId)) {
+        stageDataCount(s.info.stageId) += 1
+      } else {
+        stageDataCount(s.info.stageId) = 1
+      }
+      if (s.info.status != v1.StageStatus.ACTIVE && s.info.status != v1.StageStatus.PENDING) {
+        val candidate =
+          StageCompletionTime(s.info.stageId, s.info.attemptId, s.completionTime)
+        stageArray.append(candidate)
+      }
+    }
+
     // As the completion time of a skipped stage is always -1, we will remove skipped stages first.
     // This is safe since the job itself contains enough information to render skipped stages in the
     // UI.
-    val view = kvstore.view(classOf[StageDataWrapper]).index("completionTime")
-    val stages = KVUtils.viewToSeq(view, countToDelete.toInt) { s =>
-      s.info.status != v1.StageStatus.ACTIVE && s.info.status != v1.StageStatus.PENDING
-    }
-
-    val stageIds = stages.map { s =>

Review comment:
Scratch that - did not see the attempt iteration - makes sense.
[GitHub] [spark] mridulm commented on a change in pull request #34092: [WIP][SPARK-36827][CORE] Improve the perf and memory usage of cleaning up stage UI data
mridulm commented on a change in pull request #34092:
URL: https://github.com/apache/spark/pull/34092#discussion_r715332397

## File path: core/src/main/scala/org/apache/spark/status/AppStatusListener.scala ##
@@ -1253,44 +1254,46 @@ private[spark] class AppStatusListener(
     toDelete.foreach { j => kvstore.delete(j.getClass(), j.info.jobId) }
   }

+  private case class StageCompletionTime(
+    stageId: Int,
+    attemptId: Int,
+    completionTime: Long)
+
   private def cleanupStages(count: Long): Unit = {
     val countToDelete = calculateNumberToRemove(count, conf.get(MAX_RETAINED_STAGES))
     if (countToDelete <= 0L) {
       return
     }

+    val stageArray = new ArrayBuffer[StageCompletionTime]()
+    val stageDataCount = new mutable.HashMap[Int, Int]()
+    kvstore.view(classOf[StageDataWrapper]).forEach { s =>
+      // Here we keep track of the total number of StageDataWrapper entries for each stage id.
+      // This will be used in cleaning up the RDDOperationGraphWrapper data.
+      if (stageDataCount.contains(s.info.stageId)) {
+        stageDataCount(s.info.stageId) += 1
+      } else {
+        stageDataCount(s.info.stageId) = 1
+      }
+      if (s.info.status != v1.StageStatus.ACTIVE && s.info.status != v1.StageStatus.PENDING) {
+        val candidate =
+          StageCompletionTime(s.info.stageId, s.info.attemptId, s.completionTime)
+        stageArray.append(candidate)
+      }
+    }
+
     // As the completion time of a skipped stage is always -1, we will remove skipped stages first.
     // This is safe since the job itself contains enough information to render skipped stages in the
     // UI.
-    val view = kvstore.view(classOf[StageDataWrapper]).index("completionTime")
-    val stages = KVUtils.viewToSeq(view, countToDelete.toInt) { s =>
-      s.info.status != v1.StageStatus.ACTIVE && s.info.status != v1.StageStatus.PENDING
-    }
-
-    val stageIds = stages.map { s =>

Review comment:
I am trying to understand the last part - what is the difference w.r.t. the new code for finding the stage id?
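The single pass in the quoted diff can be sketched over plain Scala collections. This is only an illustration under stated assumptions: `Stage` is a hypothetical, simplified stand-in for `StageDataWrapper`, the `active` flag stands in for the `ACTIVE`/`PENDING` status check, and the final sort-and-take step is an assumption about how the candidates are used afterwards (the quoted hunk does not show it).

```scala
import scala.collection.mutable
import scala.collection.mutable.ArrayBuffer

object CleanupSketch {
  // Hypothetical, simplified stand-in for StageDataWrapper (not the Spark class).
  final case class Stage(stageId: Int, attemptId: Int, active: Boolean, completionTime: Long)

  // Same shape as the case class added in the quoted diff.
  final case class StageCompletionTime(stageId: Int, attemptId: Int, completionTime: Long)

  def selectForDeletion(stages: Seq[Stage], countToDelete: Int): (Seq[StageCompletionTime], Map[Int, Int]) = {
    val candidates = new ArrayBuffer[StageCompletionTime]()
    val attemptsPerStage = new mutable.HashMap[Int, Int]()
    stages.foreach { s =>
      // Count every entry per stage id, whether or not it is a deletion candidate.
      attemptsPerStage(s.stageId) = attemptsPerStage.getOrElse(s.stageId, 0) + 1
      // Only stages that are neither active nor pending may be removed.
      if (!s.active) {
        candidates += StageCompletionTime(s.stageId, s.attemptId, s.completionTime)
      }
    }
    // Assumption: oldest completion times go first, so skipped stages (-1) are removed first.
    val toDelete = candidates.sortBy(_.completionTime).take(countToDelete).toSeq
    (toDelete, attemptsPerStage.toMap)
  }
}
```

The per-stage count is collected in the same loop as the candidates, so the store is scanned once instead of being queried again for each stage selected for deletion.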
[GitHub] [spark] SparkQA commented on pull request #34009: [SPARK-34378][SQL][AVRO] Enhance AvroSerializer validation to allow extra nullable Avro fields
SparkQA commented on pull request #34009: URL: https://github.com/apache/spark/pull/34009#issuecomment-926360172 **[Test build #143590 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/143590/testReport)** for PR 34009 at commit [`9b58975`](https://github.com/apache/spark/commit/9b58975f88eaad623febea4524b3e7a63dd99272).
[GitHub] [spark] mridulm commented on pull request #34009: [SPARK-34378][SQL][AVRO] Enhance AvroSerializer validation to allow extra nullable Avro fields
mridulm commented on pull request #34009: URL: https://github.com/apache/spark/pull/34009#issuecomment-926358471 Add to whitelist
[GitHub] [spark] mridulm commented on pull request #34083: Add docs about using Shiv for packaging (similar to PEX)
mridulm commented on pull request #34083: URL: https://github.com/apache/spark/pull/34083#issuecomment-926358173 Agree with @HyukjinKwon - it would be good to start a discussion in spark mailing lists as well.
[GitHub] [spark] attilapiros commented on a change in pull request #33936: [SPARK-36693][REPL] Implement spark-shell idle timeouts
attilapiros commented on a change in pull request #33936:
URL: https://github.com/apache/spark/pull/33936#discussion_r715326783

## File path: repl/src/main/scala-2.12/org/apache/spark/repl/SparkILoop.scala ##
@@ -105,6 +108,13 @@ class SparkILoop(in0: Option[BufferedReader], out: JPrintWriter)
     echo("Type :help for more information.")
   }

+  override def processLine(line: String): Boolean = {
+    inactivityTimeout.stopInactivityTimer()
+    val result = super.processLine(line)

Review comment:
What happens when the line is evaluated and it throws an exception? Does the underlying `processLine` guarantee that exceptions are not propagated to the caller? If there is no such guarantee, then we should wrap this in a `try {..} finally {...}` and stop the timer in the `finally`.
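The `try`/`finally` shape suggested in this comment can be sketched as follows. `InactivityTimer` and `processWithTimer` are hypothetical names for illustration, not the PR's classes, and the sketch assumes that the step which must always run after evaluation is restarting the timer; the quoted hunk only shows the `stopInactivityTimer()` call made before evaluation.

```scala
object TimerSketch {
  // Hypothetical stand-in for the PR's inactivity timer; it only tracks a flag.
  final class InactivityTimer {
    private var running = true
    def stop(): Unit = { running = false }
    def start(): Unit = { running = true }
    def isRunning: Boolean = running
  }

  // Pauses the timer while `eval` runs and always resumes it afterwards,
  // even when `eval` throws. The exception still propagates to the caller.
  def processWithTimer(timer: InactivityTimer)(eval: => Boolean): Boolean = {
    timer.stop()
    try {
      eval
    } finally {
      timer.start()
    }
  }
}
```

With this shape the timer state after `processLine` does not depend on whether the evaluated line failed.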
[GitHub] [spark] mridulm commented on a change in pull request #34079: [SPARK-36834][SHUFFLE] Add support for namespacing log lines emitted by external shuffle service
mridulm commented on a change in pull request #34079:
URL: https://github.com/apache/spark/pull/34079#discussion_r715325832

## File path: common/network-yarn/src/main/java/org/apache/spark/network/yarn/YarnShuffleService.java ##
@@ -284,7 +293,6 @@ static MergedShuffleFileManager newMergedShuffleFileManagerInstance(TransportCon
       // will also need the transport configuration.
       return mergeManagerSubClazz.getConstructor(TransportConf.class).newInstance(conf);
     } catch (Exception e) {
-      logger.error("Unable to create an instance of {}", mergeManagerImplClassName);

Review comment:
I agree with @tgravescs, dropping the log message (particularly an error) would be missing out on very useful debugging information. One option would be to move it out of this class into some other util?
[GitHub] [spark] SparkQA commented on pull request #34069: [SPARK-36823][SQL] Support broadcast nested loop join hint for equi-join
SparkQA commented on pull request #34069: URL: https://github.com/apache/spark/pull/34069#issuecomment-926356585 Kubernetes integration test starting URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/48096/
[GitHub] [spark] allisonwang-db commented on pull request #34081: [SPARK-36747][SQL][3.2] Do not collapse Project with Aggregate when correlated subqueries are present in the project list
allisonwang-db commented on pull request #34081: URL: https://github.com/apache/spark/pull/34081#issuecomment-926356169 cc @cloud-fan
[GitHub] [spark] SparkQA commented on pull request #34077: [SPARK-36829][SQL] Refactor NULL check for collectionOperators
SparkQA commented on pull request #34077: URL: https://github.com/apache/spark/pull/34077#issuecomment-926354788 Kubernetes integration test starting URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/48097/
[GitHub] [spark] SparkQA commented on pull request #34069: [SPARK-36823][SQL] Support broadcast nested loop join hint for equi-join
SparkQA commented on pull request #34069: URL: https://github.com/apache/spark/pull/34069#issuecomment-926354328 Kubernetes integration test starting URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/48095/
[GitHub] [spark] AmplabJenkins removed a comment on pull request #34053: [SPARK-36813][SQL][PYTHON] Propose an infrastructure of as-of join and imlement ps.merge_asof
AmplabJenkins removed a comment on pull request #34053: URL: https://github.com/apache/spark/pull/34053#issuecomment-926353009 Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/143575/
[GitHub] [spark] AmplabJenkins commented on pull request #34053: [SPARK-36813][SQL][PYTHON] Propose an infrastructure of as-of join and imlement ps.merge_asof
AmplabJenkins commented on pull request #34053: URL: https://github.com/apache/spark/pull/34053#issuecomment-926353009 Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/143575/
[GitHub] [spark] SparkQA removed a comment on pull request #34053: [SPARK-36813][SQL][PYTHON] Propose an infrastructure of as-of join and imlement ps.merge_asof
SparkQA removed a comment on pull request #34053: URL: https://github.com/apache/spark/pull/34053#issuecomment-926252459 **[Test build #143575 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/143575/testReport)** for PR 34053 at commit [`cd0f707`](https://github.com/apache/spark/commit/cd0f7070b4a504d2aba57d7e4b71fcc225731603).
[GitHub] [spark] gengliangwang commented on a change in pull request #34092: [SPARK-36827][CORE] Improve the perf and memory usage of cleaning up stage UI data
gengliangwang commented on a change in pull request #34092:
URL: https://github.com/apache/spark/pull/34092#discussion_r715321250

## File path: core/src/main/scala/org/apache/spark/status/AppStatusListener.scala ##
@@ -1253,44 +1254,46 @@ private[spark] class AppStatusListener(
     toDelete.foreach { j => kvstore.delete(j.getClass(), j.info.jobId) }
   }

+  private case class StageCompletionTime(
+    stageId: Int,
+    attemptId: Int,
+    completionTime: Long)
+
   private def cleanupStages(count: Long): Unit = {
     val countToDelete = calculateNumberToRemove(count, conf.get(MAX_RETAINED_STAGES))
     if (countToDelete <= 0L) {
       return
     }

+    val stageArray = new ArrayBuffer[StageCompletionTime]()
+    val stageDataCount = new mutable.HashMap[Int, Int]()
+    kvstore.view(classOf[StageDataWrapper]).forEach { s =>
+      // Here we keep track of the total number of StageDataWrapper entries for each stage id.
+      // This will be used in cleaning up the RDDOperationGraphWrapper data.
+      if (stageDataCount.contains(s.info.stageId)) {
+        stageDataCount(s.info.stageId) += 1
+      } else {
+        stageDataCount(s.info.stageId) = 1
+      }
+      if (s.info.status != v1.StageStatus.ACTIVE && s.info.status != v1.StageStatus.PENDING) {
+        val candidate =
+          StageCompletionTime(s.info.stageId, s.info.attemptId, s.completionTime)
+        stageArray.append(candidate)
+      }
+    }
+
     // As the completion time of a skipped stage is always -1, we will remove skipped stages first.
     // This is safe since the job itself contains enough information to render skipped stages in the
     // UI.
-    val view = kvstore.view(classOf[StageDataWrapper]).index("completionTime")
-    val stages = KVUtils.viewToSeq(view, countToDelete.toInt) { s =>
-      s.info.status != v1.StageStatus.ACTIVE && s.info.status != v1.StageStatus.PENDING
-    }
-
-    val stageIds = stages.map { s =>

Review comment:
I thought about keeping the original code for LevelDB here. But after investigation, I found that the default retained stages size is 1000, so as per
```
private def calculateNumberToRemove(dataSize: Long, retainedSize: Long): Long = {
  if (dataSize > retainedSize) {
    math.max(retainedSize / 10L, dataSize - retainedSize)
  } else {
    0L
  }
}
```
the `stages` here normally has a length of 100. Finding the stage id inside LevelDB 100 times is not efficient compared to the new code. So I decided to keep it simple and use the same code for both InMemoryStore and LevelDB.
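The arithmetic behind the "length of 100" remark can be checked with a small standalone sketch. The function body is copied from the snippet quoted in the comment; the wrapping object `RetentionSketch` and the sample values are illustrative assumptions only.

```scala
object RetentionSketch {
  // Body copied from the quoted snippet; made public here so it can be called directly.
  def calculateNumberToRemove(dataSize: Long, retainedSize: Long): Long = {
    if (dataSize > retainedSize) {
      // Remove at least 10% of the retained size, or the whole excess if that is larger.
      math.max(retainedSize / 10L, dataSize - retainedSize)
    } else {
      0L
    }
  }
}
```

With a retained size of 1000, going one stage over the limit gives `math.max(1000 / 10, 1001 - 1000) = 100`, so a cleanup normally removes a batch of 100 stages rather than one at a time; at or below the limit nothing is removed.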
[GitHub] [spark] cloud-fan closed pull request #34080: [SPARK-33832][SQL] Force skew join code simplification and improvement
cloud-fan closed pull request #34080: URL: https://github.com/apache/spark/pull/34080
[GitHub] [spark] SparkQA commented on pull request #34053: [SPARK-36813][SQL][PYTHON] Propose an infrastructure of as-of join and imlement ps.merge_asof
SparkQA commented on pull request #34053: URL: https://github.com/apache/spark/pull/34053#issuecomment-926352150 **[Test build #143575 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/143575/testReport)** for PR 34053 at commit [`cd0f707`](https://github.com/apache/spark/commit/cd0f7070b4a504d2aba57d7e4b71fcc225731603).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.
[GitHub] [spark] cloud-fan commented on pull request #34080: [SPARK-33832][SQL] Force skew join code simplification and improvement
cloud-fan commented on pull request #34080: URL: https://github.com/apache/spark/pull/34080#issuecomment-926352184 thanks for review, merging to master!
[GitHub] [spark] AmplabJenkins removed a comment on pull request #34038: [SPARK-36797][SQL] Union should resolve nested columns as top-level columns
AmplabJenkins removed a comment on pull request #34038: URL: https://github.com/apache/spark/pull/34038#issuecomment-926349536 Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/143580/
[GitHub] [spark] AmplabJenkins removed a comment on pull request #34051: [SPARK-36809][SQL] Remove broadcast for InSubqueryExec used in DPP
AmplabJenkins removed a comment on pull request #34051: URL: https://github.com/apache/spark/pull/34051#issuecomment-926349534 Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/48092/
[GitHub] [spark] AmplabJenkins removed a comment on pull request #34088: Revert "[SPARK-35672][CORE][YARN] Pass user classpath entries to exec…
AmplabJenkins removed a comment on pull request #34088: URL: https://github.com/apache/spark/pull/34088#issuecomment-926349538 Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/143579/
[GitHub] [spark] AmplabJenkins removed a comment on pull request #34033: [SPARK-36792][SQL] InSet should handle NaN
AmplabJenkins removed a comment on pull request #34033: URL: https://github.com/apache/spark/pull/34033#issuecomment-926349537 Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/48093/
[GitHub] [spark] SparkQA commented on pull request #34091: [SPARK-36839][INFRA] Add daily build with Hadoop 2 profile in GitHub Actions build
SparkQA commented on pull request #34091: URL: https://github.com/apache/spark/pull/34091#issuecomment-926350029 Kubernetes integration test starting URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/48094/
[GitHub] [spark] SparkQA commented on pull request #34038: [SPARK-36797][SQL] Union should resolve nested columns as top-level columns
SparkQA commented on pull request #34038: URL: https://github.com/apache/spark/pull/34038#issuecomment-926349944 **[Test build #143589 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/143589/testReport)** for PR 34038 at commit [`be31929`](https://github.com/apache/spark/commit/be31929417fb240c098eb12ff79cb3a8e364e973).
[GitHub] [spark] taroplus commented on pull request #34092: [SPARK-36827][CORE] Improve the perf and memory usage of cleaning up stage UI data
taroplus commented on pull request #34092: URL: https://github.com/apache/spark/pull/34092#issuecomment-926349872 @gengliangwang sounds good, thanks!
[GitHub] [spark] SparkQA commented on pull request #34089: [SPARK-36837][BUILD] Upgrade Kafka to 3.0.0
SparkQA commented on pull request #34089: URL: https://github.com/apache/spark/pull/34089#issuecomment-926349874 **[Test build #143588 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/143588/testReport)** for PR 34089 at commit [`7a581fc`](https://github.com/apache/spark/commit/7a581fc89fa38f921def1de4924bdae9df9d647e).
[GitHub] [spark] SparkQA commented on pull request #34092: [SPARK-36827][CORE] Improve the perf and memory usage of cleaning up stage UI data
SparkQA commented on pull request #34092: URL: https://github.com/apache/spark/pull/34092#issuecomment-926349734 **[Test build #143587 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/143587/testReport)** for PR 34092 at commit [`b270d2b`](https://github.com/apache/spark/commit/b270d2b6334b5373265735f3a86faf51df015ccc).
[GitHub] [spark] AmplabJenkins commented on pull request #34088: Revert "[SPARK-35672][CORE][YARN] Pass user classpath entries to exec…
AmplabJenkins commented on pull request #34088: URL: https://github.com/apache/spark/pull/34088#issuecomment-926349538 Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/143579/
[GitHub] [spark] AmplabJenkins commented on pull request #34051: [SPARK-36809][SQL] Remove broadcast for InSubqueryExec used in DPP
AmplabJenkins commented on pull request #34051: URL: https://github.com/apache/spark/pull/34051#issuecomment-926349534 Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/48092/
[GitHub] [spark] AmplabJenkins commented on pull request #34038: [SPARK-36797][SQL] Union should resolve nested columns as top-level columns
AmplabJenkins commented on pull request #34038: URL: https://github.com/apache/spark/pull/34038#issuecomment-926349536 Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/143580/
[GitHub] [spark] AmplabJenkins commented on pull request #34033: [SPARK-36792][SQL] InSet should handle NaN
AmplabJenkins commented on pull request #34033: URL: https://github.com/apache/spark/pull/34033#issuecomment-926349537 Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/48093/
[GitHub] [spark] taroplus closed pull request #34090: [SPARK-36827][CORE] Fix perf issue in AppStatusListener.cleanupStages
taroplus closed pull request #34090: URL: https://github.com/apache/spark/pull/34090
[GitHub] [spark] taroplus commented on pull request #34090: [SPARK-36827][CORE] Fix perf issue in AppStatusListener.cleanupStages
taroplus commented on pull request #34090: URL: https://github.com/apache/spark/pull/34090#issuecomment-926348877 in favor of https://github.com/apache/spark/pull/34092
[GitHub] [spark] gengliangwang commented on pull request #34090: [SPARK-36827][CORE] Fix perf issue in AppStatusListener.cleanupStages
gengliangwang commented on pull request #34090: URL: https://github.com/apache/spark/pull/34090#issuecomment-926347932 @taroplus I was working on this as well yesterday; see https://github.com/apache/spark/pull/34092. If we have to pull all the stage data out of the KVStore, we should avoid calling `KVUtils.viewToSeq(view, countToDelete.toInt)`, which copies the stage data and sorts it.
[GitHub] [spark] gengliangwang commented on pull request #34092: [SPARK-36827][CORE] Improve the perf and memory usage of cleaning up stage UI data
gengliangwang commented on pull request #34092: URL: https://github.com/apache/spark/pull/34092#issuecomment-926347002 @taroplus I was working on this yesterday. I didn't send it out because I think we can do better if we build a live priority queue in `AppStatusListener`, so that Spark doesn't need to pull all the stage data out on every cleanup.
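The "live priority queue" idea in the comment above can be illustrated roughly as follows. This is only a sketch under stated assumptions, not code from either pull request: `CompletedStage` and `RetainedStageTracker` are hypothetical names, and the real `AppStatusListener` would also have to delete the evicted entries from the KVStore and handle active or pending stages. The only library type used is the standard `scala.collection.mutable.PriorityQueue`.

```scala
import scala.collection.mutable

// Hypothetical lightweight record; not a Spark class.
final case class CompletedStage(stageId: Int, attemptId: Int, completionTime: Long)

// Hypothetical tracker that keeps completed stages ordered as they arrive,
// so a cleanup never has to read every stage back out of the store.
final class RetainedStageTracker(maxRetained: Int) {
  // PriorityQueue is a max-heap, so the ordering is reversed to dequeue
  // the stage with the smallest completion time first.
  private val queue = mutable.PriorityQueue.empty[CompletedStage](
    Ordering.by[CompletedStage, Long](_.completionTime).reverse)

  /** Records a completed stage and returns the entries that exceed the retention limit. */
  def onStageCompleted(stage: CompletedStage): Seq[CompletedStage] = {
    queue.enqueue(stage)
    val evicted = mutable.ArrayBuffer.empty[CompletedStage]
    while (queue.size > maxRetained) {
      evicted += queue.dequeue()
    }
    evicted.toSeq
  }

  /** Number of completed stages currently retained. */
  def retained: Int = queue.size
}
```

With a limit of 2, completing stages with completion times 30, 10 and 20 (in that order) would evict the entry with time 10 on the third call. Each completion costs O(log n) instead of a full scan per cleanup, which is the trade-off the comment seems to be pointing at.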
[GitHub] [spark] gengliangwang opened a new pull request #34092: [SPARK-36827][CORE] Improve the perf and memory usage of cleaning up stage UI data
gengliangwang opened a new pull request #34092: URL: https://github.com/apache/spark/pull/34092

### What changes were proposed in this pull request?

Improve the perf and memory usage of cleaning up stage UI data. The new code copies only the essential fields (stage id, attempt id, completion time) into an array and uses that array to determine which stage data and `RDDOperationGraphWrapper` entries need to be cleaned up.

### Why are the changes needed?

Fix the memory usage issue described in https://issues.apache.org/jira/browse/SPARK-36827

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Add a new unit test for the InMemoryStore. Also, run a simple benchmark with

```
val testConf = conf.clone()
  .set(MAX_RETAINED_STAGES, 1000)
val listener = new AppStatusListener(store, testConf, true)
val stages = (1 to 3000).map { i =>
  new StageInfo(i, 0, s"stage$i", 4, Nil, Nil, "details1",
    resourceProfileId = ResourceProfile.DEFAULT_RESOURCE_PROFILE_ID)
}
listener.onJobStart(SparkListenerJobStart(4, time, Nil, null))
stages.foreach { s =>
  time += 1
  s.submissionTime = Some(time)
  listener.onStageSubmitted(SparkListenerStageSubmitted(s, new Properties()))
  s.completionTime = Some(time)
  listener.onStageCompleted(SparkListenerStageCompleted(s))
}
```

Before changes:
InMemoryStore: 2.8s
LevelDB: 68.9s

After changes:
InMemoryStore: 0.95s
LevelDB: 60.3s
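The selection step described in this PR (copy only stage id, attempt id and completion time, then decide what to delete) might look roughly like the sketch below. It is a simplified illustration, not the PR's code: `StageEntry` and `StageCleanupSketch` are hypothetical stand-ins for `StageDataWrapper` and the listener logic, and the `active` flag is an assumption that collapses the ACTIVE and PENDING status checks into one boolean. `StageCompletionTime` follows the case class that appears in the review diff.

```scala
import scala.collection.mutable

// Lightweight record with only the fields needed to pick deletion candidates.
final case class StageCompletionTime(stageId: Int, attemptId: Int, completionTime: Long)

// Hypothetical stand-in for a stored stage entry; the real wrapper holds far more state.
final case class StageEntry(stageId: Int, attemptId: Int, completionTime: Long, active: Boolean)

object StageCleanupSketch {
  /**
   * Returns the attempts to delete (oldest completion time first) and the stage ids
   * for which every stored attempt is being deleted, so their per-stage data
   * (for example an operation graph) can be removed as well.
   */
  def selectForCleanup(
      entries: Iterable[StageEntry],
      countToDelete: Int): (Seq[StageCompletionTime], Set[Int]) = {
    val attemptsPerStage = mutable.HashMap.empty[Int, Int]
    val candidates = mutable.ArrayBuffer.empty[StageCompletionTime]

    // Single pass over the store: count attempts per stage id and copy
    // the small record for every entry that is not still running.
    entries.foreach { e =>
      attemptsPerStage(e.stageId) = attemptsPerStage.getOrElse(e.stageId, 0) + 1
      if (!e.active) {
        candidates += StageCompletionTime(e.stageId, e.attemptId, e.completionTime)
      }
    }

    // Sort only the lightweight records, never the full stage data.
    val toDelete = candidates.sortBy(_.completionTime).take(countToDelete).toSeq

    // A stage id is fully removed once all of its attempts are scheduled for deletion.
    toDelete.foreach { c => attemptsPerStage(c.stageId) -= 1 }
    val fullyRemoved = toDelete.map(_.stageId).filter(attemptsPerStage(_) == 0).toSet

    (toDelete, fullyRemoved)
  }
}
```

The point of the design is that the sort touches three-field records rather than whole stage wrappers, which is where the memory saving reported in the PR description would plausibly come from.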
[GitHub] [spark] SparkQA commented on pull request #34051: [SPARK-36809][SQL] Remove broadcast for InSubqueryExec used in DPP
SparkQA commented on pull request #34051: URL: https://github.com/apache/spark/pull/34051#issuecomment-926345957 Kubernetes integration test status failure URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/48092/
[GitHub] [spark] SparkQA commented on pull request #34033: [SPARK-36792][SQL] InSet should handle NaN
SparkQA commented on pull request #34033: URL: https://github.com/apache/spark/pull/34033#issuecomment-926345014 Kubernetes integration test status failure URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/48093/
[GitHub] [spark] SparkQA removed a comment on pull request #34088: Revert "[SPARK-35672][CORE][YARN] Pass user classpath entries to exec…
SparkQA removed a comment on pull request #34088: URL: https://github.com/apache/spark/pull/34088#issuecomment-926289733 **[Test build #143579 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/143579/testReport)** for PR 34088 at commit [`7816574`](https://github.com/apache/spark/commit/781657486e1952b9446201ae713d6fb8288bb9f8).
[GitHub] [spark] SparkQA commented on pull request #34088: Revert "[SPARK-35672][CORE][YARN] Pass user classpath entries to exec…
SparkQA commented on pull request #34088: URL: https://github.com/apache/spark/pull/34088#issuecomment-926339514 **[Test build #143579 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/143579/testReport)** for PR 34088 at commit [`7816574`](https://github.com/apache/spark/commit/781657486e1952b9446201ae713d6fb8288bb9f8).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] [spark] SparkQA removed a comment on pull request #34038: [SPARK-36797][SQL] Union should resolve nested columns as top-level columns
SparkQA removed a comment on pull request #34038: URL: https://github.com/apache/spark/pull/34038#issuecomment-926289915 **[Test build #143580 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/143580/testReport)** for PR 34038 at commit [`a0af93c`](https://github.com/apache/spark/commit/a0af93cb5b8e319ef8dd87d1131fb91979e51e30).
[GitHub] [spark] SparkQA commented on pull request #34038: [SPARK-36797][SQL] Union should resolve nested columns as top-level columns
SparkQA commented on pull request #34038: URL: https://github.com/apache/spark/pull/34038#issuecomment-926336277 **[Test build #143580 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/143580/testReport)** for PR 34038 at commit [`a0af93c`](https://github.com/apache/spark/commit/a0af93cb5b8e319ef8dd87d1131fb91979e51e30).
* This patch **fails Spark unit tests**.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] [spark] AmplabJenkins removed a comment on pull request #34089: [SPARK-36837][BUILD] Upgrade Kafka to 3.0.0
AmplabJenkins removed a comment on pull request #34089: URL: https://github.com/apache/spark/pull/34089#issuecomment-926317237 Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/48091/
[GitHub] [spark] SparkQA commented on pull request #34077: [SPARK-36829][SQL] Refactor NULL check for collectionOperators
SparkQA commented on pull request #34077: URL: https://github.com/apache/spark/pull/34077#issuecomment-926334207 **[Test build #143586 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/143586/testReport)** for PR 34077 at commit [`c571e5f`](https://github.com/apache/spark/commit/c571e5fe550d6ab22b114f1edd82caddc4244729).
[GitHub] [spark] AngersZhuuuu commented on a change in pull request #34077: [SPARK-36829][SQL] Refactor NULL check for collectionOperators
AngersZhuuuu commented on a change in pull request #34077: URL: https://github.com/apache/spark/pull/34077#discussion_r715309252

## File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala ##
@@ -3532,22 +3509,29 @@ case class ArrayDistinct(child: Expression)
 |}
 """.stripMargin
-val processArray = withArrayNullAssignment(
-  s"$jt $value = ${genGetValue(array, i)};" +
-SQLOpenHashSet.withNaNCheckCode(elementType, value, hashSet, body,
-  (valueNaN: String) =>
-s"""
-  |$size++;
-  |$builder.$$plus$$eq($valueNaN);
-  |""".stripMargin))
+val processArray = SQLOpenHashSet.withNullCheckCode(dataType, dataType, hashSet,

Review comment: Done
[GitHub] [spark] SparkQA commented on pull request #34069: [SPARK-36823][SQL] Support broadcast nested loop join hint for equi-join
SparkQA commented on pull request #34069: URL: https://github.com/apache/spark/pull/34069#issuecomment-926332383 **[Test build #143585 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/143585/testReport)** for PR 34069 at commit [`4ed226f`](https://github.com/apache/spark/commit/4ed226f66643523f3661b53a28517383cf1f0eb5).
[GitHub] [spark] AmplabJenkins removed a comment on pull request #34053: [SPARK-36813][SQL][PYTHON] Propose an infrastructure of as-of join and imlement ps.merge_asof
AmplabJenkins removed a comment on pull request #34053: URL: https://github.com/apache/spark/pull/34053#issuecomment-926325603 Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/143571/
[GitHub] [spark] AmplabJenkins removed a comment on pull request #34038: [SPARK-36797][SQL] Union should resolve nested columns as top-level columns
AmplabJenkins removed a comment on pull request #34038: URL: https://github.com/apache/spark/pull/34038#issuecomment-926325590 Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/48089/
[GitHub] [spark] AmplabJenkins removed a comment on pull request #34009: [SPARK-34378][SQL][AVRO] Enhance AvroSerializer validation to allow extra nullable Avro fields
AmplabJenkins removed a comment on pull request #34009: URL: https://github.com/apache/spark/pull/34009#issuecomment-926325588 Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/143577/
[GitHub] [spark] SparkQA commented on pull request #34091: [SPARK-36839][INFRA] Add daily build with Hadoop 2 profile in GitHub Actions build
SparkQA commented on pull request #34091: URL: https://github.com/apache/spark/pull/34091#issuecomment-926326364 **[Test build #143584 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/143584/testReport)** for PR 34091 at commit [`27a202a`](https://github.com/apache/spark/commit/27a202a95b3e31c9ece4fa9dabd04d78f597).
[GitHub] [spark] SparkQA commented on pull request #34051: [SPARK-36809][SQL] Remove broadcast for InSubqueryExec used in DPP
SparkQA commented on pull request #34051: URL: https://github.com/apache/spark/pull/34051#issuecomment-926325914 Kubernetes integration test starting URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/48092/
[GitHub] [spark] AmplabJenkins commented on pull request #34053: [SPARK-36813][SQL][PYTHON] Propose an infrastructure of as-of join and imlement ps.merge_asof
AmplabJenkins commented on pull request #34053: URL: https://github.com/apache/spark/pull/34053#issuecomment-926325603 Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/143571/
[GitHub] [spark] AmplabJenkins commented on pull request #34009: [SPARK-34378][SQL][AVRO] Enhance AvroSerializer validation to allow extra nullable Avro fields
AmplabJenkins commented on pull request #34009: URL: https://github.com/apache/spark/pull/34009#issuecomment-926325588 Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/143577/
[GitHub] [spark] AmplabJenkins commented on pull request #34038: [SPARK-36797][SQL] Union should resolve nested columns as top-level columns
AmplabJenkins commented on pull request #34038: URL: https://github.com/apache/spark/pull/34038#issuecomment-926325590 Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/48089/
[GitHub] [spark] SparkQA commented on pull request #34033: [SPARK-36792][SQL] InSet should handle NaN
SparkQA commented on pull request #34033: URL: https://github.com/apache/spark/pull/34033#issuecomment-926324867 Kubernetes integration test starting URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/48093/
[GitHub] [spark] SparkQA removed a comment on pull request #34053: [SPARK-36813][SQL][PYTHON] Propose an infrastructure of as-of join and imlement ps.merge_asof
SparkQA removed a comment on pull request #34053: URL: https://github.com/apache/spark/pull/34053#issuecomment-926224948 **[Test build #143571 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/143571/testReport)** for PR 34053 at commit [`973b09d`](https://github.com/apache/spark/commit/973b09d5e6d9c2360bfdef5ad4c5e69d6bb929f8).
[GitHub] [spark] dongjoon-hyun commented on pull request #34085: [SPARK-36835][BUILD] Enable createDependencyReducedPom for Maven shaded plugin
dongjoon-hyun commented on pull request #34085: URL: https://github.com/apache/spark/pull/34085#issuecomment-926320990 +1, LGTM.
[GitHub] [spark] SparkQA commented on pull request #34053: [SPARK-36813][SQL][PYTHON] Propose an infrastructure of as-of join and imlement ps.merge_asof
SparkQA commented on pull request #34053: URL: https://github.com/apache/spark/pull/34053#issuecomment-926320996 **[Test build #143571 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/143571/testReport)** for PR 34053 at commit [`973b09d`](https://github.com/apache/spark/commit/973b09d5e6d9c2360bfdef5ad4c5e69d6bb929f8).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] [spark] HyukjinKwon commented on pull request #34088: Revert "[SPARK-35672][CORE][YARN] Pass user classpath entries to exec…
HyukjinKwon commented on pull request #34088: URL: https://github.com/apache/spark/pull/34088#issuecomment-926320732 For 3.1, I will directly revert https://github.com/apache/spark/commit/b4916d4a410820ba00125c00b55ca724b27cc853
[GitHub] [spark] HyukjinKwon closed pull request #34088: Revert "[SPARK-35672][CORE][YARN] Pass user classpath entries to exec…
HyukjinKwon closed pull request #34088: URL: https://github.com/apache/spark/pull/34088
[GitHub] [spark] SparkQA removed a comment on pull request #34009: [SPARK-34378][SQL][AVRO] Enhance AvroSerializer validation to allow extra nullable Avro fields
SparkQA removed a comment on pull request #34009: URL: https://github.com/apache/spark/pull/34009#issuecomment-926271814 **[Test build #143577 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/143577/testReport)** for PR 34009 at commit [`9b58975`](https://github.com/apache/spark/commit/9b58975f88eaad623febea4524b3e7a63dd99272).
[GitHub] [spark] SparkQA commented on pull request #34009: [SPARK-34378][SQL][AVRO] Enhance AvroSerializer validation to allow extra nullable Avro fields
SparkQA commented on pull request #34009: URL: https://github.com/apache/spark/pull/34009#issuecomment-926320328 **[Test build #143577 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/143577/testReport)** for PR 34009 at commit [`9b58975`](https://github.com/apache/spark/commit/9b58975f88eaad623febea4524b3e7a63dd99272).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] [spark] SparkQA commented on pull request #34038: [SPARK-36797][SQL] Union should resolve nested columns as top-level columns
SparkQA commented on pull request #34038: URL: https://github.com/apache/spark/pull/34038#issuecomment-926319931 Kubernetes integration test status failure URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/48089/
[GitHub] [spark] HyukjinKwon commented on pull request #34088: Revert "[SPARK-35672][CORE][YARN] Pass user classpath entries to exec…
HyukjinKwon commented on pull request #34088: URL: https://github.com/apache/spark/pull/34088#issuecomment-926319897 The `UISeleniumSuite` test failure looks very unlikely to be related. I am merging it in.
[GitHub] [spark] HyukjinKwon opened a new pull request #34091: [SPARK-36839][INFRA] Add daily build with Hadoop 2 profile in GitHub Actions build
HyukjinKwon opened a new pull request #34091: URL: https://github.com/apache/spark/pull/34091

### What changes were proposed in this pull request?

This PR proposes to run a daily build for the Hadoop 2 profile in GitHub Actions.

### Why are the changes needed?

To improve test coverage and catch bugs, e.g. https://github.com/apache/spark/pull/34064

### Does this PR introduce _any_ user-facing change?

No, dev-only.

### How was this patch tested?

Being tested in my own fork.
[GitHub] [spark] AmplabJenkins commented on pull request #34089: [SPARK-36837][BUILD] Upgrade Kafka to 3.0.0
AmplabJenkins commented on pull request #34089: URL: https://github.com/apache/spark/pull/34089#issuecomment-926317237 Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/48091/
[GitHub] [spark] SparkQA commented on pull request #34089: [SPARK-36837][BUILD] Upgrade Kafka to 3.0.0
SparkQA commented on pull request #34089: URL: https://github.com/apache/spark/pull/34089#issuecomment-926317214 Kubernetes integration test unable to build dist. exiting with code: 1 URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/48091/
[GitHub] [spark] AmplabJenkins commented on pull request #34090: [SPARK-36827][CORE] Fix perf issue in AppStatusListener.cleanupStages
AmplabJenkins commented on pull request #34090: URL: https://github.com/apache/spark/pull/34090#issuecomment-926316219 Can one of the admins verify this patch?
[GitHub] [spark] taroplus opened a new pull request #34090: [SPARK-36827][CORE] Fix perf issue in AppStatusListener.cleanupStages
taroplus opened a new pull request #34090: URL: https://github.com/apache/spark/pull/34090

### What changes were proposed in this pull request?

This PR fixes a performance issue in `AppStatusListener.cleanupStages`. When there is a large number of stages in the store, the logic below runs in roughly N*M order, because it accesses the view once for every stage it deletes.

```
val stageIds = stages.map { s =>
  val key = Array(s.info.stageId, s.info.attemptId)
  kvstore.delete(s.getClass(), key)

  // Check whether there are remaining attempts for the same stage. If there aren't, then
  // also delete the RDD graph data.
  val remainingAttempts = kvstore.view(classOf[StageDataWrapper])
    .index("stageId")
    .first(s.info.stageId)
    .last(s.info.stageId)
    .closeableIterator()
  ...
```

Instead of accessing the view to check the remaining attempts for each stage, this change moves that check to after the stages have been removed. The view (`kvstore.view(classOf[StageDataWrapper])`) then only needs to be accessed once.

### Why are the changes needed?

When more than the ideal number of stages is kept in memory, the cleanup process cannot keep up with the rate of incoming stages because of this performance issue. The resulting behavior looks like a memory leak and eventually causes an OutOfMemoryError.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

The behavior should be identical before and after the change, and the existing tests should verify that. This change has been applied to the environment where a constant memory leak was observed. Under the same load, the services are now running perfectly healthy.
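The idea described in this PR can be sketched in a language-neutral way. The Python snippet below is only an illustration under stated assumptions, not Spark's code: the dictionary `store` stands in for the KVStore, its keys are hypothetical `(stage_id, attempt_id)` pairs, and both function names are invented for this example. It contrasts rescanning the store after every deletion with deleting first and counting the remaining attempts in a single pass.

```python
from collections import Counter
from typing import Dict, List, Set, Tuple

# Hypothetical stand-in for the store key: (stage_id, attempt_id).
StageKey = Tuple[int, int]


def orphaned_stage_ids_naive(store: Dict[StageKey, str], to_delete: List[StageKey]) -> Set[int]:
    """Delete each key, then rescan the whole store for remaining attempts."""
    orphaned: Set[int] = set()
    for key in to_delete:
        store.pop(key, None)
        stage_id = key[0]
        # One full scan per deleted entry: deletions * stored entries in total.
        if not any(sid == stage_id for sid, _ in store):
            orphaned.add(stage_id)
    return orphaned


def orphaned_stage_ids_single_pass(store: Dict[StageKey, str], to_delete: List[StageKey]) -> Set[int]:
    """Delete every key first, then scan the store once to count remaining attempts."""
    for key in to_delete:
        store.pop(key, None)
    # Single scan: number of remaining attempts per stage id.
    remaining = Counter(sid for sid, _ in store)
    return {sid for sid, _ in to_delete if remaining[sid] == 0}


def make_store() -> Dict[StageKey, str]:
    """Small sample store used to compare the two approaches."""
    return {(1, 0): "a", (1, 1): "b", (2, 0): "c", (3, 0): "d"}
```

Both functions return the stage ids that have no attempts left after the deletions, which in the PR's terms are the stages whose RDD graph data could also be removed. Only the second one touches the remaining entries once.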
[GitHub] [spark] AmplabJenkins removed a comment on pull request #34088: Revert "[SPARK-35672][CORE][YARN] Pass user classpath entries to exec…
AmplabJenkins removed a comment on pull request #34088: URL: https://github.com/apache/spark/pull/34088#issuecomment-926314083 Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/48088/
[GitHub] [spark] SparkQA commented on pull request #34088: Revert "[SPARK-35672][CORE][YARN] Pass user classpath entries to exec…
SparkQA commented on pull request #34088: URL: https://github.com/apache/spark/pull/34088#issuecomment-926314068 Kubernetes integration test status failure URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/48088/
[GitHub] [spark] AmplabJenkins commented on pull request #34088: Revert "[SPARK-35672][CORE][YARN] Pass user classpath entries to exec…
AmplabJenkins commented on pull request #34088: URL: https://github.com/apache/spark/pull/34088#issuecomment-926314083 Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/48088/
[GitHub] [spark] AmplabJenkins removed a comment on pull request #34083: Add docs about using Shiv for packaging (similar to PEX)
AmplabJenkins removed a comment on pull request #34083: URL: https://github.com/apache/spark/pull/34083#issuecomment-926313819 Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/48087/
[GitHub] [spark] AmplabJenkins commented on pull request #34083: Add docs about using Shiv for packaging (similar to PEX)
AmplabJenkins commented on pull request #34083: URL: https://github.com/apache/spark/pull/34083#issuecomment-926313819 Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/48087/
[GitHub] [spark] SparkQA commented on pull request #34083: Add docs about using Shiv for packaging (similar to PEX)
SparkQA commented on pull request #34083: URL: https://github.com/apache/spark/pull/34083#issuecomment-926313795 Kubernetes integration test status failure URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/48087/
[GitHub] [spark] HyukjinKwon commented on a change in pull request #34058: [SPARK-36711][PYTHON] Support multi-index in new syntax
HyukjinKwon commented on a change in pull request #34058: URL: https://github.com/apache/spark/pull/34058#discussion_r715290197 ## File path: python/pyspark/pandas/typedef/typehints.py ## @@ -690,98 +696,145 @@ def create_tuple_for_frame_type(params: Any) -> object: Typing data columns with an index: >>> ps.DataFrame[int, [int, int]] # doctest: +ELLIPSIS -typing.Tuple[...IndexNameType, int, int] +typing.Tuple[...IndexNameType, ...NameType, ...NameType] >>> ps.DataFrame[pdf.index.dtype, pdf.dtypes] # doctest: +ELLIPSIS -typing.Tuple[...IndexNameType, numpy.int64] +typing.Tuple[...IndexNameType, ...NameType] >>> ps.DataFrame[("index", int), [("id", int), ("A", int)]] # doctest: +ELLIPSIS typing.Tuple[...IndexNameType, ...NameType, ...NameType] >>> ps.DataFrame[(pdf.index.name, pdf.index.dtype), zip(pdf.columns, pdf.dtypes)] ... # doctest: +ELLIPSIS typing.Tuple[...IndexNameType, ...NameType] + +Typing data columns with an Multi-index: +>>> arrays = [[1, 1, 2], ['red', 'blue', 'red']] +>>> idx = pd.MultiIndex.from_arrays(arrays, names=('number', 'color')) +>>> pdf = pd.DataFrame({'a': range(3)}, index=idx) +>>> ps.DataFrame[[int, int], [int, int]] # doctest: +ELLIPSIS +typing.Tuple[...IndexNameType, ...IndexNameType, ...NameType, ...NameType] +>>> ps.DataFrame[pdf.index.dtypes, pdf.dtypes] # doctest: +ELLIPSIS +typing.Tuple[...IndexNameType, ...NameType] +>>> ps.DataFrame[[("index-1", int), ("index-2", int)], [("id", int), ("A", int)]] +... # doctest: +ELLIPSIS +typing.Tuple[...IndexNameType, ...IndexNameType, ...NameType, ...NameType] +>>> ps.DataFrame[zip(pdf.index.names, pdf.index.dtypes), zip(pdf.columns, pdf.dtypes)] +... # doctest: +ELLIPSIS +typing.Tuple[...IndexNameType, ...NameType] """ -return Tuple[extract_types(params)] +return Tuple[_extract_types(params)] -# TODO(SPARK-36708): numpy.typing (numpy 1.21+) support for nested types. 
-def extract_types(params: Any) -> Tuple: +def _extract_types(params: Any) -> Tuple: origin = params -if isinstance(params, zip): # type: ignore -# Example: -# DataFrame[zip(pdf.columns, pdf.dtypes)] -params = tuple(slice(name, tpe) for name, tpe in params) # type: ignore -if isinstance(params, Iterable): -params = tuple(params) -else: -params = (params,) +params = _prepare_a_tuple(params) -if all( -isinstance(param, slice) -and param.start is not None -and param.step is None -and param.stop is not None -for param in params -): +if _is_valid_slices(params): # Example: # DataFrame["id": int, "A": int] -new_params = [] -for param in params: -new_param = type("NameType", (NameTypeHolder,), {}) # type: Type[NameTypeHolder] -new_param.name = param.start -# When the given argument is a numpy's dtype instance. -new_param.tpe = param.stop.type if isinstance(param.stop, np.dtype) else param.stop -new_params.append(new_param) - +new_params = _convert_slices_to_holders(params, is_index=False) return tuple(new_params) elif len(params) == 2 and isinstance(params[1], (zip, list, pd.Series)): # Example: # DataFrame[int, [int, int]] # DataFrame[pdf.index.dtype, pdf.dtypes] # DataFrame[("index", int), [("id", int), ("A", int)]] # DataFrame[(pdf.index.name, pdf.index.dtype), zip(pdf.columns, pdf.dtypes)] +# +# DataFrame[[int, int], [int, int]] +# DataFrame[pdf.index.dtypes, pdf.dtypes] +# DataFrame[[("index", int), ("index-2", int)], [("id", int), ("A", int)]] +# DataFrame[zip(pdf.index.names, pdf.index.dtypes), zip(pdf.columns, pdf.dtypes)] -index_param = params[0] -index_type = type( -"IndexNameType", (IndexNameTypeHolder,), {} -) # type: Type[IndexNameTypeHolder] -if isinstance(index_param, tuple): -if len(index_param) != 2: -raise TypeError( -"Type hints for index should be specified as " -"DataFrame[('name', type), ...]; however, got %s" % index_param -) -name, tpe = index_param -else: -name, tpe = None, index_param +index_params = params[0] + +if isinstance(index_params, 
tuple) and len(index_params) == 2: +index_params = tuple([slice(*index_params)]) + +index_params = ( +_convert_tuples_to_zip(index_params) +if _is_valid_type_tuples(index_params) +else index_params +) +index_params = _prepare_a_tuple(index_params) -index_type.name
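The normalization and validation steps that this diff factors out can be sketched from the removed lines alone. The snippet below is a hedged reconstruction, not the PR's actual helpers: `prepare_a_tuple` and `is_valid_slices` are hypothetical stand-ins for `_prepare_a_tuple` and `_is_valid_slices`, whose real bodies are not shown in this message.

```python
from collections.abc import Iterable
from typing import Any, Tuple


def prepare_a_tuple(params: Any) -> Tuple:
    """Normalize the subscript argument into a tuple (assumed behavior)."""
    if isinstance(params, zip):
        # Example: DataFrame[zip(pdf.columns, pdf.dtypes)]
        params = tuple(slice(name, tpe) for name, tpe in params)
    if isinstance(params, Iterable):
        return tuple(params)
    # A single type such as `int` is wrapped into a one-element tuple.
    return (params,)


def is_valid_slices(params: Tuple) -> bool:
    """True when every item is a `name: type` slice with no step."""
    return all(
        isinstance(param, slice)
        and param.start is not None
        and param.step is None
        and param.stop is not None
        for param in params
    )
```

With these two helpers, the `DataFrame["id": int, "A": int]` form and the `zip(...)` form end up in the same shape, a tuple of `slice(name, type)` objects, before the name and type holders are built.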
[GitHub] [spark] SparkQA removed a comment on pull request #34089: [SPARK-36837][BUILD] Upgrade Kafka to 3.0.0
SparkQA removed a comment on pull request #34089: URL: https://github.com/apache/spark/pull/34089#issuecomment-926308162 **[Test build #143581 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/143581/testReport)** for PR 34089 at commit [`d0c9ed4`](https://github.com/apache/spark/commit/d0c9ed4069a8d7b1006fc8dc8c1422bd25893136).
[GitHub] [spark] HyukjinKwon commented on a change in pull request #34058: [SPARK-36711][PYTHON] Support multi-index in new syntax
HyukjinKwon commented on a change in pull request #34058: URL: https://github.com/apache/spark/pull/34058#discussion_r715290116 ## File path: python/pyspark/pandas/typedef/typehints.py ## @@ -690,98 +696,145 @@ def create_tuple_for_frame_type(params: Any) -> object: Typing data columns with an index: >>> ps.DataFrame[int, [int, int]] # doctest: +ELLIPSIS -typing.Tuple[...IndexNameType, int, int] +typing.Tuple[...IndexNameType, ...NameType, ...NameType] >>> ps.DataFrame[pdf.index.dtype, pdf.dtypes] # doctest: +ELLIPSIS -typing.Tuple[...IndexNameType, numpy.int64] +typing.Tuple[...IndexNameType, ...NameType] >>> ps.DataFrame[("index", int), [("id", int), ("A", int)]] # doctest: +ELLIPSIS typing.Tuple[...IndexNameType, ...NameType, ...NameType] >>> ps.DataFrame[(pdf.index.name, pdf.index.dtype), zip(pdf.columns, pdf.dtypes)] ... # doctest: +ELLIPSIS typing.Tuple[...IndexNameType, ...NameType] + +Typing data columns with an Multi-index: +>>> arrays = [[1, 1, 2], ['red', 'blue', 'red']] +>>> idx = pd.MultiIndex.from_arrays(arrays, names=('number', 'color')) +>>> pdf = pd.DataFrame({'a': range(3)}, index=idx) +>>> ps.DataFrame[[int, int], [int, int]] # doctest: +ELLIPSIS +typing.Tuple[...IndexNameType, ...IndexNameType, ...NameType, ...NameType] +>>> ps.DataFrame[pdf.index.dtypes, pdf.dtypes] # doctest: +ELLIPSIS +typing.Tuple[...IndexNameType, ...NameType] +>>> ps.DataFrame[[("index-1", int), ("index-2", int)], [("id", int), ("A", int)]] +... # doctest: +ELLIPSIS +typing.Tuple[...IndexNameType, ...IndexNameType, ...NameType, ...NameType] +>>> ps.DataFrame[zip(pdf.index.names, pdf.index.dtypes), zip(pdf.columns, pdf.dtypes)] +... # doctest: +ELLIPSIS +typing.Tuple[...IndexNameType, ...NameType] """ -return Tuple[extract_types(params)] +return Tuple[_extract_types(params)] -# TODO(SPARK-36708): numpy.typing (numpy 1.21+) support for nested types. 
-def extract_types(params: Any) -> Tuple: +def _extract_types(params: Any) -> Tuple: origin = params -if isinstance(params, zip): # type: ignore -# Example: -# DataFrame[zip(pdf.columns, pdf.dtypes)] -params = tuple(slice(name, tpe) for name, tpe in params) # type: ignore -if isinstance(params, Iterable): -params = tuple(params) -else: -params = (params,) +params = _prepare_a_tuple(params) -if all( -isinstance(param, slice) -and param.start is not None -and param.step is None -and param.stop is not None -for param in params -): +if _is_valid_slices(params): # Example: # DataFrame["id": int, "A": int] -new_params = [] -for param in params: -new_param = type("NameType", (NameTypeHolder,), {}) # type: Type[NameTypeHolder] -new_param.name = param.start -# When the given argument is a numpy's dtype instance. -new_param.tpe = param.stop.type if isinstance(param.stop, np.dtype) else param.stop -new_params.append(new_param) - +new_params = _convert_slices_to_holders(params, is_index=False) return tuple(new_params) elif len(params) == 2 and isinstance(params[1], (zip, list, pd.Series)): # Example: # DataFrame[int, [int, int]] # DataFrame[pdf.index.dtype, pdf.dtypes] # DataFrame[("index", int), [("id", int), ("A", int)]] # DataFrame[(pdf.index.name, pdf.index.dtype), zip(pdf.columns, pdf.dtypes)] +# +# DataFrame[[int, int], [int, int]] +# DataFrame[pdf.index.dtypes, pdf.dtypes] +# DataFrame[[("index", int), ("index-2", int)], [("id", int), ("A", int)]] +# DataFrame[zip(pdf.index.names, pdf.index.dtypes), zip(pdf.columns, pdf.dtypes)] -index_param = params[0] -index_type = type( -"IndexNameType", (IndexNameTypeHolder,), {} -) # type: Type[IndexNameTypeHolder] -if isinstance(index_param, tuple): -if len(index_param) != 2: -raise TypeError( -"Type hints for index should be specified as " -"DataFrame[('name', type), ...]; however, got %s" % index_param -) -name, tpe = index_param -else: -name, tpe = None, index_param +index_params = params[0] + +if isinstance(index_params, 
tuple) and len(index_params) == 2: +index_params = tuple([slice(*index_params)]) + +index_params = ( +_convert_tuples_to_zip(index_params) +if _is_valid_type_tuples(index_params) +else index_params +) +index_params = _prepare_a_tuple(index_params) -index_type.name
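The slice checks in this diff rely on how Python passes subscript arguments. The sketch below shows that mechanism with a hypothetical `Frame` class, which is not the pandas-on-Spark `DataFrame` and only echoes back what it receives: `name: type` pairs inside square brackets arrive as `slice` objects, and comma-separated items arrive as a tuple.

```python
class Frame:
    """Hypothetical class that returns its subscript argument unchanged."""

    def __class_getitem__(cls, params):
        return params


# A single `name: type` pair arrives as one slice object.
single = Frame["id": int]

# Several pairs arrive as a tuple of slices.
columns = Frame["id": int, "A": int]

# An index type followed by a list of column types arrives as a 2-tuple.
index_and_columns = Frame[int, [int, int]]
```

This is why the code in the diff can distinguish the `DataFrame["id": int, "A": int]` form from the `DataFrame[int, [int, int]]` form by inspecting the received tuple: the first contains only slices, and the second has length 2 with a list in the second position.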
[GitHub] [spark] AmplabJenkins removed a comment on pull request #34089: [SPARK-36837][BUILD] Upgrade Kafka to 3.0.0
AmplabJenkins removed a comment on pull request #34089: URL: https://github.com/apache/spark/pull/34089#issuecomment-926311898 Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/143581/
[GitHub] [spark] AmplabJenkins commented on pull request #34089: [SPARK-36837][BUILD] Upgrade Kafka to 3.0.0
AmplabJenkins commented on pull request #34089: URL: https://github.com/apache/spark/pull/34089#issuecomment-926311898 Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/143581/
[GitHub] [spark] SparkQA commented on pull request #34089: [SPARK-36837][BUILD] Upgrade Kafka to 3.0.0
SparkQA commented on pull request #34089: URL: https://github.com/apache/spark/pull/34089#issuecomment-926311866 **[Test build #143581 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/143581/testReport)** for PR 34089 at commit [`d0c9ed4`](https://github.com/apache/spark/commit/d0c9ed4069a8d7b1006fc8dc8c1422bd25893136). * This patch **fails to build**. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] [spark] HyukjinKwon commented on a change in pull request #34051: [SPARK-36809][SQL] Remove broadcast for InSubqueryExec used in DPP
HyukjinKwon commented on a change in pull request #34051: URL: https://github.com/apache/spark/pull/34051#discussion_r715289655 ## File path: sql/core/src/main/scala/org/apache/spark/sql/execution/subquery.scala ## @@ -157,7 +161,8 @@ case class InSubqueryExec( child = child.canonicalized, plan = plan.canonicalized.asInstanceOf[BaseSubqueryExec], exprId = ExprId(0), - resultBroadcast = null) + resultBroadcast = null, + result = null) Review comment: I see, okie. that's fine.
[GitHub] [spark] SparkQA commented on pull request #34033: [SPARK-36792][SQL] InSet should handle NaN
SparkQA commented on pull request #34033: URL: https://github.com/apache/spark/pull/34033#issuecomment-926308397 **[Test build #143583 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/143583/testReport)** for PR 34033 at commit [`ef0e81f`](https://github.com/apache/spark/commit/ef0e81f8e8e5872c4402aee1525a27febefd7292).
[GitHub] [spark] SparkQA commented on pull request #34051: [SPARK-36809][SQL] Remove broadcast for InSubqueryExec used in DPP
SparkQA commented on pull request #34051: URL: https://github.com/apache/spark/pull/34051#issuecomment-926308274 **[Test build #143582 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/143582/testReport)** for PR 34051 at commit [`47dce1a`](https://github.com/apache/spark/commit/47dce1a9bb14f2e4eb3b9fe669d6bf6d7ef7042a).
[GitHub] [spark] SparkQA commented on pull request #34089: [SPARK-36837][BUILD] Upgrade Kafka to 3.0.0
SparkQA commented on pull request #34089: URL: https://github.com/apache/spark/pull/34089#issuecomment-926308162 **[Test build #143581 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/143581/testReport)** for PR 34089 at commit [`d0c9ed4`](https://github.com/apache/spark/commit/d0c9ed4069a8d7b1006fc8dc8c1422bd25893136).
[GitHub] [spark] AmplabJenkins removed a comment on pull request #34009: [SPARK-34378][SQL][AVRO] Enhance AvroSerializer validation to allow extra nullable Avro fields
AmplabJenkins removed a comment on pull request #34009: URL: https://github.com/apache/spark/pull/34009#issuecomment-926306970 Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/48086/
[GitHub] [spark] AmplabJenkins removed a comment on pull request #34085: [SPARK-36835][BUILD] Enable createDependencyReducedPom for Maven shaded plugin
AmplabJenkins removed a comment on pull request #34085: URL: https://github.com/apache/spark/pull/34085#issuecomment-926306969 Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/143574/
[GitHub] [spark] AmplabJenkins removed a comment on pull request #34089: [SPARK-36837][BUILD] Upgrade Kafka to 3.0.0
AmplabJenkins removed a comment on pull request #34089: URL: https://github.com/apache/spark/pull/34089#issuecomment-926306971 Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/48090/
[GitHub] [spark] AmplabJenkins commented on pull request #34085: [SPARK-36835][BUILD] Enable createDependencyReducedPom for Maven shaded plugin
AmplabJenkins commented on pull request #34085: URL: https://github.com/apache/spark/pull/34085#issuecomment-926306969 Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/143574/
[GitHub] [spark] AmplabJenkins commented on pull request #34089: [SPARK-36837][BUILD] Upgrade Kafka to 3.0.0
AmplabJenkins commented on pull request #34089: URL: https://github.com/apache/spark/pull/34089#issuecomment-926306971 Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/48090/
[GitHub] [spark] AmplabJenkins commented on pull request #34009: [SPARK-34378][SQL][AVRO] Enhance AvroSerializer validation to allow extra nullable Avro fields
AmplabJenkins commented on pull request #34009: URL: https://github.com/apache/spark/pull/34009#issuecomment-926306970 Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/48086/
[GitHub] [spark] SparkQA removed a comment on pull request #34085: [SPARK-36835][BUILD] Enable createDependencyReducedPom for Maven shaded plugin
SparkQA removed a comment on pull request #34085: URL: https://github.com/apache/spark/pull/34085#issuecomment-926252427 **[Test build #143574 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/143574/testReport)** for PR 34085 at commit [`08b1f31`](https://github.com/apache/spark/commit/08b1f31a7587cc8536b8a672b0a390ab6618bb97).
[GitHub] [spark] viirya commented on a change in pull request #34051: [SPARK-36809][SQL] Remove broadcast for InSubqueryExec used in DPP
viirya commented on a change in pull request #34051: URL: https://github.com/apache/spark/pull/34051#discussion_r715285748 ## File path: sql/core/src/main/scala/org/apache/spark/sql/execution/subquery.scala ## @@ -157,7 +161,8 @@ case class InSubqueryExec( child = child.canonicalized, plan = plan.canonicalized.asInstanceOf[BaseSubqueryExec], exprId = ExprId(0), - resultBroadcast = null) + resultBroadcast = null, + result = null) Review comment: I tried to move it out of the constructor, but then there were errors because `result` was null at the moment the result was being prepared. There might be somewhere that we `copy` the node, and in that case we would lose the `result` value.
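The concern about `copy` can be illustrated by analogy. The sketch below is Python, not the Scala code under review: `dataclasses.replace` stands in for a case class `copy`, and both class names are hypothetical. It only shows the general pattern that state kept outside the constructor is re-initialized on copy, while state passed through the constructor is carried over.

```python
from dataclasses import dataclass, field, replace
from typing import Optional, Tuple


@dataclass
class NodeWithStateOutsideConstructor:
    # Hypothetical stand-in, not Spark's InSubqueryExec.
    plan: str
    # Cached state that is NOT a constructor argument (init=False).
    result: Optional[Tuple[int, ...]] = field(default=None, init=False)


@dataclass
class NodeWithStateInConstructor:
    # Hypothetical stand-in, not Spark's InSubqueryExec.
    plan: str
    # Cached state that IS a constructor argument, so copies carry it over.
    result: Optional[Tuple[int, ...]] = None


outside = NodeWithStateOutsideConstructor(plan="scan")
outside.result = (1, 2, 3)
# replace() rebuilds the object through __init__, so `result` falls back to None.
copied_outside = replace(outside, plan="scan2")

inside = NodeWithStateInConstructor(plan="scan", result=(1, 2, 3))
# Here `result` is passed to __init__ again, so the copy keeps it.
copied_inside = replace(inside, plan="scan2")
```

Under this analogy, keeping `result` as a constructor parameter is what lets a copied node retain an already computed value.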
[GitHub] [spark] SparkQA commented on pull request #34085: [SPARK-36835][BUILD] Enable createDependencyReducedPom for Maven shaded plugin
SparkQA commented on pull request #34085: URL: https://github.com/apache/spark/pull/34085#issuecomment-926305859 **[Test build #143574 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/143574/testReport)** for PR 34085 at commit [`08b1f31`](https://github.com/apache/spark/commit/08b1f31a7587cc8536b8a672b0a390ab6618bb97). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] [spark] SparkQA commented on pull request #34038: [SPARK-36797][SQL] Union should resolve nested columns as top-level columns
SparkQA commented on pull request #34038: URL: https://github.com/apache/spark/pull/34038#issuecomment-926305432 Kubernetes integration test starting URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/48089/
[GitHub] [spark] dgd-contributor commented on a change in pull request #34058: [SPARK-36711][PYTHON] Support multi-index in new syntax
dgd-contributor commented on a change in pull request #34058: URL: https://github.com/apache/spark/pull/34058#discussion_r715284917 ## File path: python/pyspark/pandas/typedef/typehints.py ## @@ -690,98 +696,145 @@ def create_tuple_for_frame_type(params: Any) -> object: Typing data columns with an index: >>> ps.DataFrame[int, [int, int]] # doctest: +ELLIPSIS -typing.Tuple[...IndexNameType, int, int] +typing.Tuple[...IndexNameType, ...NameType, ...NameType] >>> ps.DataFrame[pdf.index.dtype, pdf.dtypes] # doctest: +ELLIPSIS -typing.Tuple[...IndexNameType, numpy.int64] +typing.Tuple[...IndexNameType, ...NameType] >>> ps.DataFrame[("index", int), [("id", int), ("A", int)]] # doctest: +ELLIPSIS typing.Tuple[...IndexNameType, ...NameType, ...NameType] >>> ps.DataFrame[(pdf.index.name, pdf.index.dtype), zip(pdf.columns, pdf.dtypes)] ... # doctest: +ELLIPSIS typing.Tuple[...IndexNameType, ...NameType] + +Typing data columns with an Multi-index: +>>> arrays = [[1, 1, 2], ['red', 'blue', 'red']] +>>> idx = pd.MultiIndex.from_arrays(arrays, names=('number', 'color')) +>>> pdf = pd.DataFrame({'a': range(3)}, index=idx) +>>> ps.DataFrame[[int, int], [int, int]] # doctest: +ELLIPSIS +typing.Tuple[...IndexNameType, ...IndexNameType, ...NameType, ...NameType] +>>> ps.DataFrame[pdf.index.dtypes, pdf.dtypes] # doctest: +ELLIPSIS +typing.Tuple[...IndexNameType, ...NameType] +>>> ps.DataFrame[[("index-1", int), ("index-2", int)], [("id", int), ("A", int)]] +... # doctest: +ELLIPSIS +typing.Tuple[...IndexNameType, ...IndexNameType, ...NameType, ...NameType] +>>> ps.DataFrame[zip(pdf.index.names, pdf.index.dtypes), zip(pdf.columns, pdf.dtypes)] +... # doctest: +ELLIPSIS +typing.Tuple[...IndexNameType, ...NameType] """ -return Tuple[extract_types(params)] +return Tuple[_extract_types(params)] -# TODO(SPARK-36708): numpy.typing (numpy 1.21+) support for nested types. 
-def extract_types(params: Any) -> Tuple: +def _extract_types(params: Any) -> Tuple: origin = params -if isinstance(params, zip): # type: ignore -# Example: -# DataFrame[zip(pdf.columns, pdf.dtypes)] -params = tuple(slice(name, tpe) for name, tpe in params) # type: ignore -if isinstance(params, Iterable): -params = tuple(params) -else: -params = (params,) +params = _prepare_a_tuple(params) -if all( -isinstance(param, slice) -and param.start is not None -and param.step is None -and param.stop is not None -for param in params -): +if _is_valid_slices(params): # Example: # DataFrame["id": int, "A": int] -new_params = [] -for param in params: -new_param = type("NameType", (NameTypeHolder,), {}) # type: Type[NameTypeHolder] -new_param.name = param.start -# When the given argument is a numpy's dtype instance. -new_param.tpe = param.stop.type if isinstance(param.stop, np.dtype) else param.stop -new_params.append(new_param) - +new_params = _convert_slices_to_holders(params, is_index=False) return tuple(new_params) elif len(params) == 2 and isinstance(params[1], (zip, list, pd.Series)): # Example: # DataFrame[int, [int, int]] # DataFrame[pdf.index.dtype, pdf.dtypes] # DataFrame[("index", int), [("id", int), ("A", int)]] # DataFrame[(pdf.index.name, pdf.index.dtype), zip(pdf.columns, pdf.dtypes)] +# +# DataFrame[[int, int], [int, int]] +# DataFrame[pdf.index.dtypes, pdf.dtypes] +# DataFrame[[("index", int), ("index-2", int)], [("id", int), ("A", int)]] +# DataFrame[zip(pdf.index.names, pdf.index.dtypes), zip(pdf.columns, pdf.dtypes)] -index_param = params[0] -index_type = type( -"IndexNameType", (IndexNameTypeHolder,), {} -) # type: Type[IndexNameTypeHolder] -if isinstance(index_param, tuple): -if len(index_param) != 2: -raise TypeError( -"Type hints for index should be specified as " -"DataFrame[('name', type), ...]; however, got %s" % index_param -) -name, tpe = index_param -else: -name, tpe = None, index_param +index_params = params[0] + +if isinstance(index_params, 
tuple) and len(index_params) == 2: +index_params = tuple([slice(*index_params)]) + +index_params = ( +_convert_tuples_to_zip(index_params) +if _is_valid_type_tuples(index_params) +else index_params +) +index_params = _prepare_a_tuple(index_params) -
[GitHub] [spark] SparkQA commented on pull request #34089: [SPARK-36837][BUILD] Upgrade Kafka to 3.0.0
SparkQA commented on pull request #34089: URL: https://github.com/apache/spark/pull/34089#issuecomment-926304717 Kubernetes integration test unable to build dist. exiting with code: 1 URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/48090/