[GitHub] [spark] SparkQA commented on pull request #30234: [SPARK-33285][CORE][SQL] Fix deprecated compilation warnings of "Auto-application to () is deprecated" in Scala 2.13
SparkQA commented on pull request #30234: URL: https://github.com/apache/spark/pull/30234#issuecomment-721198104 Kubernetes integration test starting URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/35169/ This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] AmplabJenkins removed a comment on pull request #30234: [SPARK-33285][CORE][SQL] Fix deprecated compilation warnings of "Auto-application to () is deprecated" in Scala 2.13
AmplabJenkins removed a comment on pull request #30234: URL: https://github.com/apache/spark/pull/30234#issuecomment-721196324
[GitHub] [spark] AmplabJenkins commented on pull request #30234: [SPARK-33285][CORE][SQL] Fix deprecated compilation warnings of "Auto-application to () is deprecated" in Scala 2.13
AmplabJenkins commented on pull request #30234: URL: https://github.com/apache/spark/pull/30234#issuecomment-721196324
[GitHub] [spark] SparkQA commented on pull request #30234: [SPARK-33285][CORE][SQL] Fix deprecated compilation warnings of "Auto-application to () is deprecated" in Scala 2.13
SparkQA commented on pull request #30234: URL: https://github.com/apache/spark/pull/30234#issuecomment-721196293 Kubernetes integration test status: success URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/35168/
[GitHub] [spark] imback82 commented on a change in pull request #30229: [SPARK-33321][SQL] Migrate ANALYZE TABLE commands to use UnresolvedTableOrView to resolve the identifier
imback82 commented on a change in pull request #30229: URL: https://github.com/apache/spark/pull/30229#discussion_r516749544

## File path: sql/core/src/main/scala/org/apache/spark/sql/execution/command/AnalyzePartitionCommand.scala

## @@ -75,6 +75,9 @@ case class AnalyzePartitionCommand(
```diff
   override def run(sparkSession: SparkSession): Seq[Row] = {
     val sessionState = sparkSession.sessionState
+    if (sessionState.catalog.getTempView(tableIdent.identifier).isDefined) {
+      throw new AnalysisException("ANALYZE TABLE is not supported on a temporary view.")
+    }
```

Review comment: OK. What if we want to support the scenario where a temp view is allowed but not a permanent view (not this PR)? Do we want something like the following?

```scala
case class UnresolvedTableOrView(
    multipartIdentifier: Seq[String],
    allowTempView: Boolean = true,
    allowPermanentView: Boolean = true) extends LeafNode {
  require(allowTempView || allowPermanentView)
}
```
[GitHub] [spark] jkleckner commented on a change in pull request #29496: [SPARK-24266][K8S][2.4] Restart the watcher when we receive a version changed from k8s
jkleckner commented on a change in pull request #29496: URL: https://github.com/apache/spark/pull/29496#discussion_r516741734

## File path: resource-managers/kubernetes/core/src/main/scala/org/apache/spark/deploy/k8s/submit/LoggingPodStatusWatcher.scala

## @@ -177,4 +188,27 @@ private[k8s] class LoggingPodStatusWatcherImpl(
```diff
   private def formatTime(time: String): String = {
     if (time != null || time != "") time else "N/A"
   }
+
+  override def watchOrStop(sId: String): Boolean = if (hasCompleted()) {
```

Review comment: @shockdm The 3.0 patch #29533 has now been merged. Even though we never got an answer on whether that was a prerequisite, that blockage is now gone. Would you be willing to rebase and clean up your patch for submission?
[GitHub] [spark] HyukjinKwon edited a comment on pull request #30232: [SPARK-33156][INFRA][2.4] Upgrade GithubAction image from 18.04 to 20.04
HyukjinKwon edited a comment on pull request #30232: URL: https://github.com/apache/spark/pull/30232#issuecomment-720965482

BTW, @dongjoon-hyun, it seems like Ubuntu 20.04 changed its default Python version to Python 3 (?). I think that's why the error message looks a bit weird, like `b'..`. I saw this symptom before. We deprecated Python 2 in Spark 3 and added Python 3 support in the dev scripts too (at SPARK-27889), but `branch-2.4` does not have the fix yet IIRC. Can you try to remove this line https://github.com/apache/spark/pull/30232/files#diff-48c0ee97c53013d18d6bbae44648f7fab9af2e0bf5b0dc1ca761e18ec5c478f2R126:

```diff
       # Yarn has a Python specific test too, for example, YarnClusterSuite.
-      if: contains(matrix.modules, 'yarn') || contains(matrix.modules, 'pyspark') || (contains(matrix.modules, 'sql') && !contains(matrix.modules, 'sql-'))
       with:
```

so the builds always install Python 2? As far as I remember, the last installed Python becomes the default `python`.
[GitHub] [spark] SparkQA commented on pull request #30234: [SPARK-33285][CORE][SQL] Fix deprecated compilation warnings of "Auto-application to () is deprecated" in Scala 2.13
SparkQA commented on pull request #30234: URL: https://github.com/apache/spark/pull/30234#issuecomment-721182558 Kubernetes integration test starting URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/35168/
[GitHub] [spark] srowen commented on a change in pull request #30203: [SPARK-33303][SQL] Deduplicate deterministic PythonUDF calls
srowen commented on a change in pull request #30203: URL: https://github.com/apache/spark/pull/30203#discussion_r516730762

## File path: sql/core/src/main/scala/org/apache/spark/sql/execution/python/ExtractPythonUDFs.scala

## @@ -218,13 +218,22 @@ object ExtractPythonUDFs extends Rule[LogicalPlan] with PredicateHelper {
```diff
     }
   }
+
+  private def canonicalizeDeterministic(u: PythonUDF) = {
```

Review comment: Indeed, the example in the PR should ideally have c and d equal no matter what, as the UDF shouldn't be evaluated multiple times. But then, being deterministic doesn't matter; how would changing the default be the right fix? I don't doubt it should be 'fixed' so that even non-deterministic UDFs aren't surprisingly reevaluated. Is the point that the current handling for non-deterministic UDFs is slower, even though it evaluates them just once by design? That seems weird.
[GitHub] [spark] cloud-fan commented on a change in pull request #30229: [SPARK-33321][SQL] Migrate ANALYZE TABLE commands to use UnresolvedTableOrView to resolve the identifier
cloud-fan commented on a change in pull request #30229: URL: https://github.com/apache/spark/pull/30229#discussion_r516728706

## File path: sql/core/src/test/scala/org/apache/spark/sql/execution/SQLViewSuite.scala

## @@ -175,7 +175,10 @@ abstract class SQLViewSuite extends QueryTest with SQLTestUtils {
```diff
     }.getMessage
     assert(e2.contains("SHOW CREATE TABLE is not supported on a temporary view"))
     assertNoSuchTable(s"SHOW PARTITIONS $viewName")
-    assertNoSuchTable(s"ANALYZE TABLE $viewName COMPUTE STATISTICS")
+    val e3 = intercept[AnalysisException] {
+      sql(s"ANALYZE TABLE $viewName COMPUTE STATISTICS")
+    }.getMessage
+    assert(e3.contains("ANALYZE TABLE is not supported on a temporary view"))
     assertNoSuchTable(s"ANALYZE TABLE $viewName COMPUTE STATISTICS FOR COLUMNS id")
```

Review comment: We can create a separate PR to fix it. I think it should be an `AnalysisException` with a clear message mentioning that it's not cached.
[GitHub] [spark] cloud-fan commented on a change in pull request #30229: [SPARK-33321][SQL] Migrate ANALYZE TABLE commands to use UnresolvedTableOrView to resolve the identifier
cloud-fan commented on a change in pull request #30229: URL: https://github.com/apache/spark/pull/30229#discussion_r516727702

## File path: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/DataSourceV2Strategy.scala

## @@ -280,6 +280,9 @@ class DataSourceV2Strategy(session: SparkSession) extends Strategy with Predicat
```diff
     case r @ ShowTableProperties(rt: ResolvedTable, propertyKey) =>
       ShowTablePropertiesExec(r.output, rt.table, propertyKey) :: Nil
+    case AnalyzeTable(_: ResolvedTable, _, _) | AnalyzeColumn(_: ResolvedTable, _, _) =>
+      throw new AnalysisException("ANALYZE TABLE is not supported for v2 tables.")
```

Review comment: Shall we fail it earlier, in the analyzer rules?
[GitHub] [spark] cloud-fan commented on a change in pull request #30229: [SPARK-33321][SQL] Migrate ANALYZE TABLE commands to use UnresolvedTableOrView to resolve the identifier
cloud-fan commented on a change in pull request #30229: URL: https://github.com/apache/spark/pull/30229#discussion_r516726862

## File path: sql/core/src/main/scala/org/apache/spark/sql/execution/command/AnalyzePartitionCommand.scala

## @@ -75,6 +75,9 @@ case class AnalyzePartitionCommand(
```diff
   override def run(sparkSession: SparkSession): Seq[Row] = {
     val sessionState = sparkSession.sessionState
+    if (sessionState.catalog.getTempView(tableIdent.identifier).isDefined) {
+      throw new AnalysisException("ANALYZE TABLE is not supported on a temporary view.")
+    }
```

Review comment: It seems this isn't the first time we've hit this requirement. How about we add `allowTemp: Boolean` to `UnresolvedTableOrView`?
[GitHub] [spark] srowen commented on pull request #30185: [SPARK-33152][SQL] This PR proposes a new logic to maintain & track constraints which solves the OOM or performance issues in query compilation
srowen commented on pull request #30185: URL: https://github.com/apache/spark/pull/30185#issuecomment-721169846 This still doesn't look right. I'd squash your branch changes and rebase on master.
[GitHub] [spark] srowen commented on pull request #30234: [SPARK-33285][CORE][SQL] Fix deprecated compilation warnings of "Auto-application to () is deprecated" in Scala 2.13
srowen commented on pull request #30234: URL: https://github.com/apache/spark/pull/30234#issuecomment-721165959 Hoo boy, that is a lot of code to touch. It sounds like it eventually has to happen for Scala 3, and it avoids some deprecation noise. I'm not against it, though as usual there is a minor concern about making patches harder to back-port from 3.1 to 3.0 because of merge conflicts.
[GitHub] [spark] cloud-fan commented on a change in pull request #30229: [SPARK-33321][SQL] Migrate ANALYZE TABLE commands to use UnresolvedTableOrView to resolve the identifier
cloud-fan commented on a change in pull request #30229: URL: https://github.com/apache/spark/pull/30229#discussion_r516725745

## File path: sql/core/src/main/scala/org/apache/spark/sql/catalyst/analysis/ResolveSessionCatalog.scala

## @@ -419,17 +410,16 @@ class ResolveSessionCatalog(
```diff
       }
       ShowTablesCommand(db, Some(pattern), true, partitionsSpec)
-    case AnalyzeTableStatement(tbl, partitionSpec, noScan) =>
-      val v1TableName = parseV1Table(tbl, "ANALYZE TABLE")
+    // ANALYZE TABLE works on views if the views are cached.
```

Review comment: So it works for temp views as well?
[GitHub] [spark] srowen closed pull request #30182: [SPARK-33284][WEB-UI] In the Storage UI page, clicking any field to sort the table will cause the header content to be lost
srowen closed pull request #30182: URL: https://github.com/apache/spark/pull/30182
[GitHub] [spark] srowen commented on pull request #30182: [SPARK-33284][WEB-UI] In the Storage UI page, clicking any field to sort the table will cause the header content to be lost
srowen commented on pull request #30182: URL: https://github.com/apache/spark/pull/30182#issuecomment-721162096 Merged to master/3.0
[GitHub] [spark] cloud-fan commented on a change in pull request #30193: [SPARK-33293][SQL] Refactor WriteToDataSourceV2Exec and reduce code duplication
cloud-fan commented on a change in pull request #30193: URL: https://github.com/apache/spark/pull/30193#discussion_r516722182

## File path: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/WriteToDataSourceV2Exec.scala

## @@ -479,15 +432,16 @@
```diff
     }
   }

-private[v2] trait AtomicTableWriteExec extends V2TableWriteExec with SupportsV1Write {
+private[v2] trait TableWriteExec extends V2TableWriteExec with SupportsV1Write {
```

Review comment: Both work for me.
[GitHub] [spark] cloud-fan commented on a change in pull request #30203: [SPARK-33303][SQL] Deduplicate deterministic PythonUDF calls
cloud-fan commented on a change in pull request #30203: URL: https://github.com/apache/spark/pull/30203#discussion_r516720258

## File path: sql/core/src/main/scala/org/apache/spark/sql/execution/python/ExtractPythonUDFs.scala

## @@ -218,13 +218,22 @@ object ExtractPythonUDFs extends Rule[LogicalPlan] with PredicateHelper {
```diff
     }
   }
+
+  private def canonicalizeDeterministic(u: PythonUDF) = {
```

Review comment: > actually, shouldn't we set it as false by default?

The problem is the performance regression caused by changing the default value. What we can do now is educate users harder, like mentioning this in the Scala/Python UDF doc page.
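For context, the default being discussed is Spark's existing opt-out: UDFs are assumed deterministic unless the user says otherwise. An illustrative sketch (not code from this PR) of how a user marks a UDF non-deterministic so the optimizer won't deduplicate or re-order its calls:

```scala
import org.apache.spark.sql.functions.udf

object NondeterministicUdfSketch {
  // Deterministic by default: the optimizer may collapse repeated calls
  // with the same arguments into one evaluation.
  val plusOne = udf((x: Int) => x + 1)

  // Explicitly non-deterministic: the optimizer must keep each call,
  // which is the performance trade-off the comment refers to.
  val rand01 = udf(() => scala.util.Random.nextDouble()).asNondeterministic()
}
```

`asNondeterministic()` has been part of `UserDefinedFunction` since Spark 2.3; flipping the default would force this stricter (slower) handling on every UDF, hence the suggestion to document the behavior instead.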
[GitHub] [spark] SparkQA commented on pull request #30234: [SPARK-33285][CORE][SQL] Fix deprecated compilation warnings of "Auto-application to () is deprecated" in Scala 2.13
SparkQA commented on pull request #30234: URL: https://github.com/apache/spark/pull/30234#issuecomment-721156123 **[Test build #130568 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/130568/testReport)** for PR 30234 at commit [`dfe1a98`](https://github.com/apache/spark/commit/dfe1a9840128faa8857a3aa39bfa96b45aa9f158).
[GitHub] [spark] AmplabJenkins removed a comment on pull request #30151: [SPARK-33223][SS][UI]Structured Streaming Web UI state information
AmplabJenkins removed a comment on pull request #30151: URL: https://github.com/apache/spark/pull/30151#issuecomment-721154544
[GitHub] [spark] akiyamaneko edited a comment on pull request #30182: [SPARK-33284][WEB-UI] In the Storage UI page, clicking any field to sort the table will cause the header content to be lost
akiyamaneko edited a comment on pull request #30182: URL: https://github.com/apache/spark/pull/30182#issuecomment-721154044 Jenkins test this please
[GitHub] [spark] AmplabJenkins commented on pull request #30151: [SPARK-33223][SS][UI]Structured Streaming Web UI state information
AmplabJenkins commented on pull request #30151: URL: https://github.com/apache/spark/pull/30151#issuecomment-721154544
[GitHub] [spark] SparkQA removed a comment on pull request #30151: [SPARK-33223][SS][UI]Structured Streaming Web UI state information
SparkQA removed a comment on pull request #30151: URL: https://github.com/apache/spark/pull/30151#issuecomment-721027459 **[Test build #130563 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/130563/testReport)** for PR 30151 at commit [`291cf8a`](https://github.com/apache/spark/commit/291cf8aadb30edfe391d4a716a3abea01f309dde).
[GitHub] [spark] akiyamaneko commented on pull request #30182: [SPARK-33284][WEB-UI] In the Storage UI page, clicking any field to sort the table will cause the header content to be lost
akiyamaneko commented on pull request #30182: URL: https://github.com/apache/spark/pull/30182#issuecomment-721154044 retest this please
[GitHub] [spark] SparkQA commented on pull request #30151: [SPARK-33223][SS][UI]Structured Streaming Web UI state information
SparkQA commented on pull request #30151: URL: https://github.com/apache/spark/pull/30151#issuecomment-721153265 **[Test build #130563 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/130563/testReport)** for PR 30151 at commit [`291cf8a`](https://github.com/apache/spark/commit/291cf8aadb30edfe391d4a716a3abea01f309dde).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] [spark] HyukjinKwon closed pull request #30227: [SPARK-33257][PYTHON][SQL] Support Column inputs in PySpark ordering functions (asc*, desc*)
HyukjinKwon closed pull request #30227: URL: https://github.com/apache/spark/pull/30227
[GitHub] [spark] AmplabJenkins commented on pull request #30234: [SPARK-33285][CORE][SQL] Fix deprecated compilation warnings of "Auto-application to () is deprecated" in Scala 2.13
AmplabJenkins commented on pull request #30234: URL: https://github.com/apache/spark/pull/30234#issuecomment-721144076
[GitHub] [spark] AmplabJenkins removed a comment on pull request #30234: [SPARK-33285][CORE][SQL] Fix deprecated compilation warnings of "Auto-application to () is deprecated" in Scala 2.13
AmplabJenkins removed a comment on pull request #30234: URL: https://github.com/apache/spark/pull/30234#issuecomment-721144076
[GitHub] [spark] LuciferYang opened a new pull request #30234: [SPARK-33285][CORE][SQL] Fix deprecated compilation warnings of "Auto-application to () is deprecated" in Scala 2.13
LuciferYang opened a new pull request #30234: URL: https://github.com/apache/spark/pull/30234

### What changes were proposed in this pull request?
The purpose of this pr is to partially resolve SPARK-33285. `Auto-application` is dropped in Scala 3 and deprecated in 2.13 (https://github.com/scala/scala/pull/8833), so we should add the parens when calling an empty-paren method. For example, given the following definition:

```scala
class Foo {
  def bar(): Unit = {}
}
val foo = new Foo
```

`foo.bar()` is recommended, not `foo.bar`. So the main change of this pr is only to add the parens where needed.

### Why are the changes needed?
Eliminate compilation warnings in Scala 2.13.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Pass the Jenkins or GitHub Action
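To make the warning concrete, here is a minimal self-contained sketch (a generic illustration, not code from this PR; the `Counter` class is hypothetical) showing what triggers "Auto-application to `()` is deprecated" under `scalac 2.13 -deprecation` and how the fix looks:

```scala
class Counter {
  private var n = 0
  // Defined WITH parens: callers should also use parens.
  def increment(): Unit = { n += 1 }
  // Defined WITHOUT parens: callers use no parens; this is unaffected.
  def value: Int = n
}

object AutoApplicationDemo {
  def main(args: Array[String]): Unit = {
    val c = new Counter
    // c.increment          // 2.13: deprecation warning; Scala 3: compile error
    c.increment()           // the fix: match the empty-paren definition
    println(c.value)
  }
}
```

The rule of thumb the PR applies: a call site should mirror the definition — `()` methods (usually side-effecting) keep their parens, while parameterless `def`s stay paren-free.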
[GitHub] [spark] SparkQA commented on pull request #30234: [SPARK-33285][CORE][SQL] Fix deprecated compilation warnings of "Auto-application to () is deprecated" in Scala 2.13
SparkQA commented on pull request #30234: URL: https://github.com/apache/spark/pull/30234#issuecomment-721142854
[GitHub] [spark] LuciferYang commented on pull request #30234: [SPARK-33285][CORE][SQL] Fix deprecated compilation warnings of "Auto-application to () is deprecated" in Scala 2.13
LuciferYang commented on pull request #30234: URL: https://github.com/apache/spark/pull/30234#issuecomment-721140762 cc @srowen this is the first part of the fix for these compilation warnings; if it's too big, I can split it into multiple small PRs.
[GitHub] [spark] SparkQA removed a comment on pull request #30234: [SPARK-33285][CORE][SQL] Fix deprecated compilation warnings of "Auto-application to () is deprecated" in Scala 2.13
SparkQA removed a comment on pull request #30234: URL: https://github.com/apache/spark/pull/30234#issuecomment-721142854 **[Test build #130567 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/130567/testReport)** for PR 30234 at commit [`eb84d5e`](https://github.com/apache/spark/commit/eb84d5ea90967f6dd0fd32047ef1ec9259d12913).
[GitHub] [spark] zero323 commented on pull request #30227: [SPARK-33257][PYTHON][SQL] Support Column inputs in PySpark ordering functions (asc*, desc*)
zero323 commented on pull request #30227: URL: https://github.com/apache/spark/pull/30227#issuecomment-721137978 Thanks @HyukjinKwon!
[GitHub] [spark] SparkQA removed a comment on pull request #30038: [SPARK-33130][SQL] Support ALTER TABLE in JDBC v2 Table Catalog: add, update type and nullability of columns (MsSqlServer dialect)
SparkQA removed a comment on pull request #30038: URL: https://github.com/apache/spark/pull/30038#issuecomment-721069806 **[Test build #130566 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/130566/testReport)** for PR 30038 at commit [`449a7ad`](https://github.com/apache/spark/commit/449a7ad2f695d5422a12740b65ab67d75a397551).
[GitHub] [spark] AmplabJenkins removed a comment on pull request #30151: [SPARK-33223][SS][UI]Structured Streaming Web UI state information
AmplabJenkins removed a comment on pull request #30151: URL: https://github.com/apache/spark/pull/30151#issuecomment-721055509
[GitHub] [spark] ScrapCodes commented on pull request #27735: [SPARK-30985][k8s] Support propagating SPARK_CONF_DIR files to driver and executor pods.
ScrapCodes commented on pull request #27735: URL: https://github.com/apache/spark/pull/27735#issuecomment-721070338 Hi @holdenk and @dongjoon-hyun, hope you are doing well! Do you have a view on this, one way or the other? Thanks!
[GitHub] [spark] gengliangwang commented on pull request #30221: [SPARK-33314][SQL] Avoid dropping rows in Avro reader
gengliangwang commented on pull request #30221: URL: https://github.com/apache/spark/pull/30221#issuecomment-720397753
[GitHub] [spark] tgravescs commented on a change in pull request #30186: [SPARK-23432][UI] Add executor peak jvm memory metrics in executors page
tgravescs commented on a change in pull request #30186: URL: https://github.com/apache/spark/pull/30186#discussion_r516080247

## File path: core/src/main/resources/org/apache/spark/ui/static/executorspage-template.html
@@ -86,6 +86,22 @@ (context: the Executors page template around "Off Heap Storage Memory"; the change adds header/tooltip entries for "Peak JVM Memory OnHeap / OffHeap", "Peak Execution Memory OnHeap / OffHeap", and "Peak Storage Memory OnHeap / OffHeap")

Review comment: I think we should expand this to be "Peak onHeap/ OffHeap memory used for storage of data like RDD partitions cached in memory"

Review comment: Peak OnHeap/OffHeap memory used for execution. This refers to memory used for computation in shuffles, joins, user data structures, etc. See the Memory Management Overview documentation for more details.

Review comment: Similarly I think we should explain more or point to the Java docs. At least say these are direct byte buffers and memory-mapped buffers, or perhaps better, point to java.nio:type=BufferPool,name=direct and java.nio:type=BufferPool,name=mapped
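The "direct" and "mapped" pools mentioned in the last comment are exposed by the JVM itself, so the tooltip could point readers at the standard MXBeans. A small sketch of reading them via the standard `java.lang.management` API (illustrative, not Spark code):

```scala
import java.lang.management.{BufferPoolMXBean, ManagementFactory}
import scala.jdk.CollectionConverters._

// Query the JVM buffer pools that back the proposed metrics; on HotSpot these
// are registered as java.nio:type=BufferPool,name=direct and name=mapped.
val pools = ManagementFactory.getPlatformMXBeans(classOf[BufferPoolMXBean]).asScala
val memoryUsedByPool = pools.map(p => p.getName -> p.getMemoryUsed).toMap
```

The same data is what JMX consoles show under those ObjectNames, which is why pointing the tooltip at them is a cheap way to document the metric precisely.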
[GitHub] [spark] dongjoon-hyun closed pull request #30226: [SPARK-33299][SQL][DOCS] Don't mention schemas in JSON format in docs for `from_json`
dongjoon-hyun closed pull request #30226: URL: https://github.com/apache/spark/pull/30226
[GitHub] [spark] HyukjinKwon commented on a change in pull request #30203: [SPARK-33303][SQL] Deduplicate deterministic PythonUDF calls
HyukjinKwon commented on a change in pull request #30203: URL: https://github.com/apache/spark/pull/30203#discussion_r516423241

## File path: sql/core/src/main/scala/org/apache/spark/sql/execution/python/ExtractPythonUDFs.scala
@@ -218,13 +218,22 @@ object ExtractPythonUDFs extends Rule[LogicalPlan] with PredicateHelper { } } + private def canonicalizeDeterministic(u: PythonUDF) = {

Review comment: The fix itself is logically making sense. But my concern is that we set this deterministic flag true by default for Python UDFs (and Scala UDFs too). When users just use UDFs actually without knowing it (which I believe is arguably pretty common), they will get the same answer from UDFs users intended to work as non-deterministically after this fix. @gatorsmile and @cloud-fan, actually, shouldn't we set it as `false` by default? - I am reading https://github.com/apache/spark/commit/ebc24a9b7fde273ee4912f9bc1c5059703f7b31e. It's an arbitrary user-defined function so I think it makes sense to have the most loose condition by default.
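The deduplication idea under discussion can be illustrated with a toy model (the class and field names below are hypothetical, not Spark's): each UDF call carries a unique result id, so two textually identical deterministic calls compare unequal until that id is erased.

```scala
// Hypothetical stand-in for a UDF expression: the per-instance resultId is
// what makes two otherwise identical calls compare unequal.
case class FakePythonUDF(name: String, deterministic: Boolean, resultId: Long)

// Canonicalize deterministic calls by erasing the resultId, mirroring the
// idea behind the PR's canonicalizeDeterministic: equal canonical forms can
// then be deduplicated, while non-deterministic calls keep their identity.
def canonicalize(u: FakePythonUDF): FakePythonUDF =
  if (u.deterministic) u.copy(resultId = -1L) else u

val a = FakePythonUDF("f", deterministic = true, resultId = 1L)
val b = FakePythonUDF("f", deterministic = true, resultId = 2L)
val distinctCalls = Seq(a, b).map(canonicalize).distinct.size
```

Non-deterministic calls are never merged under this scheme, which is exactly why the default value of the deterministic flag matters to the reviewer.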
[GitHub] [spark] AmplabJenkins removed a comment on pull request #30227: [SPARK-33257][PYTHON][SQL] Support Column inputs in PySpark ordering functions (asc*, desc*)
AmplabJenkins removed a comment on pull request #30227: URL: https://github.com/apache/spark/pull/30227#issuecomment-720494098
[GitHub] [spark] ScrapCodes commented on pull request #30038: [SPARK-33130][SQL] Support ALTER TABLE in JDBC v2 Table Catalog: add, update type and nullability of columns (MsSqlServer dialect)
ScrapCodes commented on pull request #30038: URL: https://github.com/apache/spark/pull/30038#issuecomment-720278520 Hi @huaxingao and @cloud-fan, this is ready for review!
[GitHub] [spark] AmplabJenkins removed a comment on pull request #30223: [SPARK-33306][SQL][FOLLOWUP] Group DateType and TimestampType together in `needsTimeZone()`
AmplabJenkins removed a comment on pull request #30223: URL: https://github.com/apache/spark/pull/30223#issuecomment-720382265
[GitHub] [spark] HeartSaVioR commented on pull request #26935: [SPARK-30294][SS] Explicitly defines read-only StateStore and optimize for HDFSBackedStateStore
HeartSaVioR commented on pull request #26935: URL: https://github.com/apache/spark/pull/26935#issuecomment-720970588 @viirya Kind reminder. @gaborgsomogyi and @xuanyuanking gave +1 - would that be enough to persuade you?
[GitHub] [spark] dongjoon-hyun closed pull request #29533: [SPARK-24266][K8S][3.0] Restart the watcher when we receive a version changed from k8s
dongjoon-hyun closed pull request #29533: URL: https://github.com/apache/spark/pull/29533
[GitHub] [spark] AmplabJenkins commented on pull request #30230: [SPARK-33323][SQL] Add query resolved check before convert hive relation
AmplabJenkins commented on pull request #30230: URL: https://github.com/apache/spark/pull/30230#issuecomment-720890323
[GitHub] [spark] AmplabJenkins commented on pull request #30216: [SPARK-33304][R][SQL] Add from_avro and to_avro functions to SparkR
AmplabJenkins commented on pull request #30216: URL: https://github.com/apache/spark/pull/30216#issuecomment-720328716
[GitHub] [spark] ulysses-you opened a new pull request #28647: [SPARK-31828][SQL] Retain table properties at CreateTableLikeCommand
ulysses-you opened a new pull request #28647: URL: https://github.com/apache/spark/pull/28647

### What changes were proposed in this pull request?
In CreateTableLikeCommand, merge the source table's tblproperties into the new table's tblproperties.

### Why are the changes needed?
We should retain the useful tblproperties, e.g. `parquet.compression`. Hive also retains the tblproperties.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Added a UT.
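The proposed merge can be sketched as a plain map union (the names below are illustrative, not the PR's code): properties copied from the source table are retained, and properties given explicitly on `CREATE TABLE ... LIKE` win on conflict.

```scala
// Source table properties that should survive CREATE TABLE ... LIKE,
// merged so that user-specified properties take precedence.
val sourceProps = Map("parquet.compression" -> "snappy", "owner" -> "alice")
val userProps   = Map("owner" -> "bob")

// In Scala, `++` keeps the right-hand value on key collisions,
// so userProps overrides sourceProps where both define a key.
val merged = sourceProps ++ userProps
```

Whether user-specified properties should win on conflict is exactly the kind of semantic question such a PR has to settle; the sketch assumes they do.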
[GitHub] [spark] cloud-fan commented on a change in pull request #30202: [SPARK-33248][SQL][FOLLOWUP] Update migration guide to make clear what behavior changed and make variable names and configurati
cloud-fan commented on a change in pull request #30202: URL: https://github.com/apache/spark/pull/30202#discussion_r515753866

## File path: docs/sql-migration-guide.md
@@ -52,7 +52,7 @@ license: |
 - In Spark 3.1, the `schema_of_json` and `schema_of_csv` functions return the schema in the SQL format in which field names are quoted. In Spark 3.0, the function returns a catalog string without field quoting and in lower case.
- - In Spark 3.1, when `spark.sql.legacy.transformationPadNullWhenValueLessThenSchema` is true, Spark will pad NULL value when script transformation's output value size less then schema size in default-serde mode(script transformation with row format of `ROW FORMAT DELIMITED`). If false, Spark will keep original behavior to throw `ArrayIndexOutOfBoundsException`.
+ - In Spark 3.1, when script transformation output's value size is less then schema size in default-serde mode(script transformation with row format of `ROW FORMAT DELIMITED`), Spark will pad NUll value to supplement data. In Spark 3.0 or earlier, Spark will do nothing and throw `ArrayIndexOutOfBoundsException`. To restore the behavior before Spark 3.1, you can set `spark.sql.legacy.transformationNotPadNullToSupplementData.enabled` to `true`.

Review comment: Yeah, I'm just saying that we usually don't add a migration guide entry when we make something that previously failed start working, as it's not a breaking change and users have nothing to do for migration.
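The behavior change the diff describes, padding instead of throwing, can be reduced to a few lines (illustrative, not Spark's implementation):

```scala
// When a script transformation emits fewer values than the schema has
// columns, pad the tail with nulls rather than indexing past the end of
// the output (which is what raised ArrayIndexOutOfBoundsException before).
def padToSchema(values: Seq[String], schemaSize: Int): Seq[String] =
  values ++ Seq.fill(schemaSize - values.length)(null)

val padded = padToSchema(Seq("a", "b"), schemaSize = 4)
```

Since the old behavior was an exception, the reviewer's point is that fixing it is not a breaking change and arguably needs no migration-guide entry at all.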
[GitHub] [spark] AmplabJenkins removed a comment on pull request #30224: [SPARK-33316][SQL] Support user provided nullable Avro schema for non-nullable catalyst schema in Avro writing
AmplabJenkins removed a comment on pull request #30224: URL: https://github.com/apache/spark/pull/30224#issuecomment-720385125
[GitHub] [spark] dongjoon-hyun commented on pull request #30231: [SPARK-33156][INFRA][3.0] Upgrade GithubAction image from 18.04 to 20.04
dongjoon-hyun commented on pull request #30231: URL: https://github.com/apache/spark/pull/30231#issuecomment-720885674
[GitHub] [spark] SparkQA commented on pull request #30164: [SPARK-32919][SHUFFLE] Driver side changes for coordinating push based shuffle by selecting external shuffle services for merging partitions
SparkQA commented on pull request #30164: URL: https://github.com/apache/spark/pull/30164#issuecomment-720725184
[GitHub] [spark] cloud-fan commented on a change in pull request #30144: [SPARK-33229][SQL] Support GROUP BY use Separate columns and CUBE/ROLLUP
cloud-fan commented on a change in pull request #30144: URL: https://github.com/apache/spark/pull/30144#discussion_r515765822

## File path: sql/core/src/test/scala/org/apache/spark/sql/SQLQuerySuite.scala
@@ -3691,6 +3691,32 @@ class SQLQuerySuite extends QueryTest with SharedSparkSession with AdaptiveSpark
     checkAnswer(sql("SELECT id FROM t WHERE (SELECT true)"), Row(0L))
   }
 }
+
+  test("SPARK-33229: Support GROUP BY use Separate columns and CUBE/ROLLUP") {
+    withTable("t") {
+      sql("CREATE TABLE t USING PARQUET AS SELECT id AS a, id AS b, id AS c FROM range(1)")
+      checkAnswer(sql("SELECT a, b, c, count(*) FROM t GROUP BY CUBE(a, b, c)"),
+        Row(0, 0, 0, 1) :: Row(0, 0, null, 1) ::
+          Row(0, null, 0, 1) :: Row(0, null, null, 1) ::
+          Row(null, 0, 0, 1) :: Row(null, 0, null, 1) ::
+          Row(null, null, 0, 1) :: Row(null, null, null, 1) :: Nil)
+      checkAnswer(sql("SELECT a, b, c, count(*) FROM t GROUP BY a, CUBE(b, c)"),

Review comment: what's the semantic of it?
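On the semantics question: the usual SQL reading is that plain grouping columns participate in every grouping set, while the CUBE columns expand to their power set. A small sketch of that expansion (illustrative, not Spark's planner code):

```scala
// GROUP BY a, CUBE(b, c) expands to the cross product of {a} with the
// power set of {b, c}: {a,b,c}, {a,b}, {a,c}, {a}.
def cube(cols: Seq[String]): Seq[Set[String]] =
  cols.toSet.subsets().toSeq

def groupByWithCube(plain: Seq[String], cubed: Seq[String]): Seq[Set[String]] =
  cube(cubed).map(plain.toSet ++ _)

val sets = groupByWithCube(Seq("a"), Seq("b", "c"))
```

Under this reading, `a` is never null in the output, while `b` and `c` can be nulled out by the CUBE expansion.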
[GitHub] [spark] sunchao removed a comment on pull request #30176: [SQL][MINOR] Update from_unixtime doc
sunchao removed a comment on pull request #30176: URL: https://github.com/apache/spark/pull/30176#issuecomment-720671002 @HyukjinKwon @Obbay2 is there a JIRA for this?
[GitHub] [spark] HeartSaVioR commented on pull request #28841: [SPARK-31962][SQL] Provide modifiedAfter and modifiedBefore options when filtering from a batch-based file data source
HeartSaVioR commented on pull request #28841: URL: https://github.com/apache/spark/pull/28841#issuecomment-720900095 Sigh. Now it has conflicts, so I can't go ahead with giving +1 and merging. @cchighman Could you please fix the conflicts? It would be totally OK if you'd like to let me take this over - I just wanted to give you full credit and accelerate the merge process. If you can fix the conflicts and hopefully also address my minor comments, I'll give +1 once tests pass.
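The options under review filter input files by modification time before reading; the effect can be sketched as follows (illustrative names, not the PR's code):

```scala
import java.time.Instant

// Keep only files whose modification timestamp falls strictly inside the
// optional (modifiedAfter, modifiedBefore) window; an absent bound matches all.
def filterByModified(
    files: Seq[(String, Instant)],
    modifiedAfter: Option[Instant],
    modifiedBefore: Option[Instant]): Seq[String] =
  files.collect {
    case (name, ts)
        if modifiedAfter.forall(a => ts.isAfter(a)) &&
          modifiedBefore.forall(b => ts.isBefore(b)) =>
      name
  }

val files = Seq(
  "old.json" -> Instant.parse("2020-01-01T00:00:00Z"),
  "new.json" -> Instant.parse("2020-06-01T00:00:00Z"))
val kept = filterByModified(files, Some(Instant.parse("2020-03-01T00:00:00Z")), None)
```

Whether the bounds are inclusive or exclusive is a detail the PR itself has to define; the sketch assumes strict comparison.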
[GitHub] [spark] AmplabJenkins commented on pull request #26319: [SPARK-29594][SQL] Provide better error message when creating a Dataset from a Sequence of Case class where a field name started with a
AmplabJenkins commented on pull request #26319: URL: https://github.com/apache/spark/pull/26319#issuecomment-720287298
[GitHub] [spark] SparkQA removed a comment on pull request #26319: [SPARK-29594][SQL] Provide better error message when creating a Dataset from a Sequence of Case class where a field name started with
SparkQA removed a comment on pull request #26319: URL: https://github.com/apache/spark/pull/26319#issuecomment-720206210
[GitHub] [spark] xkrogen commented on pull request #29906: [SPARK-32037][CORE] Rename blacklisting feature
xkrogen commented on pull request #29906: URL: https://github.com/apache/spark/pull/29906#issuecomment-720575364 Huge thanks for pushing this through @tgravescs! It was no small effort!
[GitHub] [spark] SparkQA removed a comment on pull request #28647: [SPARK-31828][SQL] Retain table properties at CreateTableLikeCommand
SparkQA removed a comment on pull request #28647: URL: https://github.com/apache/spark/pull/28647#issuecomment-720570503
[GitHub] [spark] SparkQA removed a comment on pull request #30164: [SPARK-32919][SHUFFLE] Driver side changes for coordinating push based shuffle by selecting external shuffle services for merging par
SparkQA removed a comment on pull request #30164: URL: https://github.com/apache/spark/pull/30164#issuecomment-720725184 **[Test build #130533 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/130533/testReport)** for PR 30164 at commit [`0423970`](https://github.com/apache/spark/commit/0423970aaeb992201942ff4e82e64476283ca568).
[GitHub] [spark] AmplabJenkins removed a comment on pull request #29935: [SPARK-33055][PYTHON][SQL] Add Python CalendarIntervalType
AmplabJenkins removed a comment on pull request #29935: URL: https://github.com/apache/spark/pull/29935#issuecomment-702872843
[GitHub] [spark] HeartSaVioR commented on a change in pull request #30151: [SPARK-33223][SS][UI]Structured Streaming Web UI state information
HeartSaVioR commented on a change in pull request #30151: URL: https://github.com/apache/spark/pull/30151#discussion_r515757520

## File path: sql/core/src/main/scala/org/apache/spark/sql/streaming/ui/StreamingQueryStatisticsPage.scala
@@ -126,6 +126,123 @@
+  def generateAggregatedStateOperators(
+      query: StreamingQueryUIData,
+      minBatchTime: Long,
+      maxBatchTime: Long,
+      jsCollector: JsCollector
+  ): NodeBuffer = {
+    // This is made sure on caller side but put it here to be defensive
+    require(query.lastProgress != null)
+    if (query.lastProgress.stateOperators.nonEmpty) {
+      val numRowsTotalData = query.recentProgress.map(p => (parseProgressTimestamp(p.timestamp),
+        p.stateOperators.map(_.numRowsTotal).sum.toDouble))
+      val maxNumRowsTotal = numRowsTotalData.maxBy(_._2)._2
+
+      val numRowsUpdatedData = query.recentProgress.map(p => (parseProgressTimestamp(p.timestamp),
+        p.stateOperators.map(_.numRowsUpdated).sum.toDouble))
+      val maxNumRowsUpdated = numRowsUpdatedData.maxBy(_._2)._2
+
+      val memoryUsedBytesData = query.recentProgress.map(p => (parseProgressTimestamp(p.timestamp),
+        p.stateOperators.map(_.memoryUsedBytes).sum.toDouble))
+      val maxMemoryUsedBytes = memoryUsedBytesData.maxBy(_._2)._2
+
+      val numRowsDroppedByWatermarkData = query.recentProgress
+        .map(p => (parseProgressTimestamp(p.timestamp),
+          p.stateOperators.map(_.numRowsDroppedByWatermark).sum.toDouble))
+      val maxNumRowsDroppedByWatermark = numRowsDroppedByWatermarkData.maxBy(_._2)._2
+
+      val graphUIDataForNumberTotalRows =
+        new GraphUIData(
+          "aggregated-num-total-rows-timeline",
+          "aggregated-num-total-rows-histogram",
+          numRowsTotalData,
+          minBatchTime,
+          maxBatchTime,
+          0,
+          maxNumRowsTotal,
+          "records")
+      graphUIDataForNumberTotalRows.generateDataJs(jsCollector)
+
+      val graphUIDataForNumberUpdatedRows =
+        new GraphUIData(
+          "aggregated-num-updated-rows-timeline",

Review comment: nit: we tend to append this line to the last line of the parameter if the last line of the parameter is not long.

Review comment (on "aggregated-num-updated-rows-timeline"): ditto

Review comment (on "aggregated-num-total-rows-timeline"): Let's add `state` in the ID of the div tag explicitly, so that these can be clearly separated from other div tags.
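The aggregation in the snippet, summing a metric over all state operators per progress update and then taking the max across retained progresses for the chart's upper bound, can be reduced to a small toy version (class names below are hypothetical, not Spark's):

```scala
// Toy stand-ins for StateOperatorProgress / StreamingQueryProgress.
case class FakeStateOperator(numRowsTotal: Long)
case class FakeProgress(timestampMs: Long, stateOperators: Seq[FakeStateOperator])

// One (timestamp, summed metric) point per batch, as the timeline chart needs.
def numRowsTotalSeries(progress: Seq[FakeProgress]): Seq[(Long, Double)] =
  progress.map(p => (p.timestampMs, p.stateOperators.map(_.numRowsTotal).sum.toDouble))

val recent = Seq(
  FakeProgress(1000L, Seq(FakeStateOperator(3), FakeStateOperator(4))),
  FakeProgress(2000L, Seq(FakeStateOperator(10))))
val series = numRowsTotalSeries(recent)
// The max over the series sets the y-axis upper bound of the graph.
val maxNumRowsTotal = series.maxBy(_._2)._2
```

The real page repeats this pattern for numRowsUpdated, memoryUsedBytes, and numRowsDroppedByWatermark, each feeding its own GraphUIData.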
[GitHub] [spark] AmplabJenkins removed a comment on pull request #30009: [SPARK-32907][ML] adaptively blockify instances - LinearSVC
AmplabJenkins removed a comment on pull request #30009: URL: https://github.com/apache/spark/pull/30009#issuecomment-720876636
[GitHub] [spark] AmplabJenkins removed a comment on pull request #28841: [SPARK-31962][SQL] Provide modifiedAfter and modifiedBefore options when filtering from a batch-based file data source
AmplabJenkins removed a comment on pull request #28841: URL: https://github.com/apache/spark/pull/28841#issuecomment-720292747
[GitHub] [spark] otterc commented on pull request #30062: [SPARK-32916][SHUFFLE] Implementation of shuffle service that leverages push-based shuffle in YARN deployment mode
otterc commented on pull request #30062: URL: https://github.com/apache/spark/pull/30062#issuecomment-720702382 With the latest commit, all the review comments have been addressed. @Ngone51 @tgravescs @attilapiros @mridulm @Victsm @jiangxb1987 Please take another look at the changes.
[GitHub] [spark] AmplabJenkins removed a comment on pull request #30230: [SPARK-33323][SQL] Add query resolved check before convert hive relation
AmplabJenkins removed a comment on pull request #30230: URL: https://github.com/apache/spark/pull/30230#issuecomment-720890323
[GitHub] [spark] Ngone51 commented on a change in pull request #30062: [SPARK-32916][SHUFFLE] Implementation of shuffle service that leverages push-based shuffle in YARN deployment mode
Ngone51 commented on a change in pull request #30062: URL: https://github.com/apache/spark/pull/30062#discussion_r516467454 ## File path: common/network-shuffle/src/main/java/org/apache/spark/network/shuffle/RemoteBlockPushResolver.java ## @@ -0,0 +1,966 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */ + +package org.apache.spark.network.shuffle; + +import java.io.BufferedOutputStream; +import java.io.DataOutputStream; +import java.io.File; +import java.io.FileOutputStream; +import java.io.IOException; +import java.nio.ByteBuffer; +import java.nio.channels.FileChannel; +import java.nio.file.Files; +import java.nio.file.Path; +import java.nio.file.Paths; +import java.util.Arrays; +import java.util.Collection; +import java.util.Iterator; +import java.util.LinkedList; +import java.util.List; +import java.util.Map; +import java.util.concurrent.ConcurrentMap; +import java.util.concurrent.ExecutionException; +import java.util.concurrent.Executor; +import java.util.concurrent.Executors; + +import com.google.common.annotations.VisibleForTesting; +import com.google.common.base.Objects; +import com.google.common.base.Preconditions; +import com.google.common.cache.CacheBuilder; +import com.google.common.cache.CacheLoader; +import com.google.common.cache.LoadingCache; +import com.google.common.cache.Weigher; +import com.google.common.collect.Maps; +import io.netty.buffer.ByteBuf; +import io.netty.buffer.Unpooled; +import org.roaringbitmap.RoaringBitmap; +import org.slf4j.Logger; +import org.slf4j.LoggerFactory; + +import org.apache.spark.network.buffer.FileSegmentManagedBuffer; +import org.apache.spark.network.buffer.ManagedBuffer; +import org.apache.spark.network.client.StreamCallbackWithID; +import org.apache.spark.network.protocol.Encoders; +import org.apache.spark.network.shuffle.protocol.ExecutorShuffleInfo; +import org.apache.spark.network.shuffle.protocol.FinalizeShuffleMerge; +import org.apache.spark.network.shuffle.protocol.MergeStatuses; +import org.apache.spark.network.shuffle.protocol.PushBlockStream; +import org.apache.spark.network.util.JavaUtils; +import org.apache.spark.network.util.NettyUtils; +import org.apache.spark.network.util.TransportConf; + +/** + * An implementation of {@link MergedShuffleFileManager} that provides the most essential shuffle + * 
service processing logic to support push based shuffle. + */ +public class RemoteBlockPushResolver implements MergedShuffleFileManager { + + private static final Logger logger = LoggerFactory.getLogger(RemoteBlockPushResolver.class); + @VisibleForTesting + static final String MERGE_MANAGER_DIR = "merge_manager"; + + private final ConcurrentMap appsPathInfo; + private final ConcurrentMap> partitions; + + private final Executor directoryCleaner; + private final TransportConf conf; + private final int minChunkSize; + private final ErrorHandler.BlockPushErrorHandler errorHandler; + + @SuppressWarnings("UnstableApiUsage") + private final LoadingCache indexCache; + + @SuppressWarnings("UnstableApiUsage") + public RemoteBlockPushResolver(TransportConf conf) { +this.conf = conf; +this.partitions = Maps.newConcurrentMap(); +this.appsPathInfo = Maps.newConcurrentMap(); +this.directoryCleaner = Executors.newSingleThreadExecutor( + // Add `spark` prefix because it will run in NM in Yarn mode. + NettyUtils.createThreadFactory("spark-shuffle-merged-shuffle-directory-cleaner")); +this.minChunkSize = conf.minChunkSizeInMergedShuffleFile(); +CacheLoader indexCacheLoader = + new CacheLoader() { +public ShuffleIndexInformation load(File file) throws IOException { + return new ShuffleIndexInformation(file); +} + }; +indexCache = CacheBuilder.newBuilder() + .maximumWeight(conf.mergedIndexCacheSize()) + .weigher((Weigher) (file, indexInfo) -> indexInfo.getSize()) + .build(indexCacheLoader); +this.errorHandler = new ErrorHandler.BlockPushErrorHandler(); + } + + /** + * Given the appShuffleId and reduceId that uniquely identifies a given shuffle partition of an + * application, retrieves the associated metadata. If not present and the corresponding merged + * shuffle does not
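The `indexCache` built above with Guava's `CacheBuilder` is bounded by total weight (the summed size of the cached index files) rather than by entry count. A minimal sketch of that weight-bounded eviction pattern, in Python for illustration (this is not Spark's or Guava's code, just the idea):

```python
# Illustrative weight-bounded cache, mirroring Guava's
# maximumWeight + Weigher pattern used by the index cache above.
from collections import OrderedDict

class WeightedCache:
    def __init__(self, max_weight, weigher):
        self.max_weight = max_weight
        self.weigher = weigher          # fn(key, value) -> weight
        self.total = 0
        self.data = OrderedDict()       # insertion order approximates LRU

    def put(self, key, value):
        if key in self.data:
            self.total -= self.weigher(key, self.data.pop(key))
        self.data[key] = value
        self.total += self.weigher(key, value)
        # Evict oldest entries until the total weight fits the cap.
        while self.total > self.max_weight:
            k, v = self.data.popitem(last=False)
            self.total -= self.weigher(k, v)

    def get(self, key):
        return self.data.get(key)
```

In the resolver, the weigher returns `indexInfo.getSize()`, so a few large index files can occupy the whole budget just as many small ones would.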
[GitHub] [spark] SparkQA removed a comment on pull request #29082: [SPARK-32288][UI] Add exception summary for failed tasks in stage page
SparkQA removed a comment on pull request #29082: URL: https://github.com/apache/spark/pull/29082#issuecomment-720852047
[GitHub] [spark] HyukjinKwon commented on pull request #30177: [SPARK-33277][PYSPARK][SQL] Use ContextAwareIterator to stop consuming after the task ends.
HyukjinKwon commented on pull request #30177: URL: https://github.com/apache/spark/pull/30177#issuecomment-720243217 Let me monitor a bit more before reverting it. I just saw it once.
[GitHub] [spark] Victsm commented on a change in pull request #30062: [SPARK-32916][SHUFFLE] Implementation of shuffle service that leverages push-based shuffle in YARN deployment mode
Victsm commented on a change in pull request #30062: URL: https://github.com/apache/spark/pull/30062#discussion_r516167838 ## File path: common/network-shuffle/src/main/java/org/apache/spark/network/shuffle/protocol/PushBlockStream.java ## @@ -23,24 +23,31 @@ import org.apache.spark.network.protocol.Encoders; // Needed by ScalaDoc. See SPARK-7726 -import static org.apache.spark.network.shuffle.protocol.BlockTransferMessage.Type; /** * Request to push a block to a remote shuffle service to be merged in push based shuffle. * The remote shuffle service will also include this message when responding the push requests. */ public class PushBlockStream extends BlockTransferMessage { + public static final String SHUFFLE_PUSH_BLOCK_PREFIX = "shufflePush"; public final String appId; - public final String blockId; + public final int shuffleId; + public final int mapIndex; + public final int reduceId; // Similar to the chunkIndex in StreamChunkId, indicating the index of a block in a batch of // blocks to be pushed. public final int index; + public final String streamId; Review comment: Where's this streamId used? ShuffleId is used later as streamId in toString and decode. That seems a bit confusing. ## File path: common/network-common/src/main/java/org/apache/spark/network/util/TransportConf.java ## @@ -363,4 +363,38 @@ public boolean useOldFetchProtocol() { return conf.getBoolean("spark.shuffle.useOldFetchProtocol", false); } + /** + * Class name of the implementation of MergedShuffleFileManager that merges the blocks + * pushed to it when push-based shuffle is enabled. Default implementation for merging the blocks + * remotely is 'org.apache.spark.network.shuffle.RemoteBlockPushResolver'. + * To turn off push-based shuffle at a cluster level, set the configuration to + * 'org.apache.spark.network.shuffle.ExternalBlockHandler$NoOpMergedShuffleFileManager'. 
+ */ + public String mergeShuffleFileManagerImpl() { +return conf.get("spark.shuffle.push.based.mergedShuffleFileManagerImpl", + "org.apache.spark.network.shuffle.RemoteBlockPushResolver"); Review comment: Should the default be NoOpMergedShuffleFileManager instead?
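The discussion above is about a plain config-with-default lookup: the merge manager class name is read from configuration, falling back to a hard-coded default when unset (the review suggests the no-op implementation as the safer default). The fallback behavior can be sketched in a few lines of Python (illustrative names, not Spark's TransportConf API):

```python
# Illustrative config lookup with a built-in default, the pattern
# mergeShuffleFileManagerImpl() uses (not Spark's actual code).
DEFAULTS = {
    "spark.shuffle.push.based.mergedShuffleFileManagerImpl":
        "org.apache.spark.network.shuffle.RemoteBlockPushResolver",
}

def get_conf(conf: dict, key: str) -> str:
    # Return the configured value, or the built-in default when unset.
    return conf.get(key, DEFAULTS[key])
```

Changing the entry in `DEFAULTS` to the no-op class is what "turn off push-based shuffle at a cluster level by default" would amount to.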
[GitHub] [spark] HyukjinKwon commented on pull request #30227: [SPARK-33257][PYTHON][SQL] Support Column inputs in PySpark ordering functions (asc*, desc*)
HyukjinKwon commented on pull request #30227: URL: https://github.com/apache/spark/pull/30227#issuecomment-720813038
[GitHub] [spark] AmplabJenkins commented on pull request #30232: [SPARK-33156][INFRA][2.4] Upgrade GithubAction image from 18.04 to 20.04
AmplabJenkins commented on pull request #30232: URL: https://github.com/apache/spark/pull/30232#issuecomment-720895587
[GitHub] [spark] bozhang2820 opened a new pull request #30224: [SPARK-33316][SQL] Support user provided nullable Avro schema for non-nullable catalyst schema in Avro writing
bozhang2820 opened a new pull request #30224: URL: https://github.com/apache/spark/pull/30224 ### What changes were proposed in this pull request? This change is to support user provided nullable Avro schema for data with non-nullable catalyst schema in Avro writing. Without this change, when users try to use a nullable Avro schema to write data with a non-nullable catalyst schema, it will throw an `IncompatibleSchemaException` with a message like `Cannot convert Catalyst type StringType to Avro type ["null","string"]`. With this change it will assume that the data is non-nullable, log a warning message for the nullability difference and serialize the data to Avro format with the nullable Avro schema provided. ### Why are the changes needed? This change is needed because sometimes our users do not have full control over the nullability of the Avro schemas they use, and this change provides them with the flexibility. ### Does this PR introduce _any_ user-facing change? Yes. Users are allowed to use nullable Avro schemas for data with non-nullable catalyst schemas in Avro writing after the change. ### How was this patch tested? Added unit tests.
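The behavior described amounts to resolving the non-null branch of a nullable Avro union (`["null", "string"]`) against a non-nullable catalyst type, warning instead of failing on the mismatch. A rough Python sketch of that resolution step, with hypothetical names (this is not the actual Scala AvroSerializer logic):

```python
# Hypothetical sketch of resolving a nullable Avro union against a
# non-nullable catalyst type, as the PR describes (not Spark's code).
import warnings

def resolve_nullable_union(avro_type, catalyst_nullable: bool):
    # A nullable Avro type is a union such as ["null", "string"].
    if isinstance(avro_type, list) and "null" in avro_type:
        branches = [t for t in avro_type if t != "null"]
        if len(branches) != 1:
            raise ValueError(f"Cannot resolve complex union: {avro_type}")
        if not catalyst_nullable:
            # Previously this mismatch raised IncompatibleSchemaException;
            # with the change it only warns and uses the non-null branch.
            warnings.warn("writing non-nullable data with a nullable Avro schema")
        return branches[0]
    return avro_type
```

So `StringType` (non-nullable) paired with `["null", "string"]` now serializes against the `"string"` branch rather than throwing.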
[GitHub] [spark] SparkQA commented on pull request #30221: [SPARK-33314][SQL] Avoid dropping rows in Avro reader
SparkQA commented on pull request #30221: URL: https://github.com/apache/spark/pull/30221#issuecomment-720879566
[GitHub] [spark] otterc commented on a change in pull request #30062: [SPARK-32916][SHUFFLE] Implementation of shuffle service that leverages push-based shuffle in YARN deployment mode
otterc commented on a change in pull request #30062: URL: https://github.com/apache/spark/pull/30062#discussion_r516175694 ## File path: common/network-shuffle/src/main/java/org/apache/spark/network/shuffle/protocol/PushBlockStream.java ## @@ -23,24 +23,31 @@ import org.apache.spark.network.protocol.Encoders; // Needed by ScalaDoc. See SPARK-7726 -import static org.apache.spark.network.shuffle.protocol.BlockTransferMessage.Type; /** * Request to push a block to a remote shuffle service to be merged in push based shuffle. * The remote shuffle service will also include this message when responding the push requests. */ public class PushBlockStream extends BlockTransferMessage { + public static final String SHUFFLE_PUSH_BLOCK_PREFIX = "shufflePush"; public final String appId; - public final String blockId; + public final int shuffleId; + public final int mapIndex; + public final int reduceId; // Similar to the chunkIndex in StreamChunkId, indicating the index of a block in a batch of // blocks to be pushed. public final int index; + public final String streamId; Review comment: The `streamId` is used for `StreamCallbackWithId.getID()` . The usage of the name `streamId` in `toString` and `decode` is a typo. ## File path: common/network-common/src/main/java/org/apache/spark/network/util/TransportConf.java ## @@ -363,4 +363,38 @@ public boolean useOldFetchProtocol() { return conf.getBoolean("spark.shuffle.useOldFetchProtocol", false); } + /** + * Class name of the implementation of MergedShuffleFileManager that merges the blocks + * pushed to it when push-based shuffle is enabled. Default implementation for merging the blocks + * remotely is 'org.apache.spark.network.shuffle.RemoteBlockPushResolver'. + * To turn off push-based shuffle at a cluster level, set the configuration to + * 'org.apache.spark.network.shuffle.ExternalBlockHandler$NoOpMergedShuffleFileManager'. 
+ */ + public String mergeShuffleFileManagerImpl() { +return conf.get("spark.shuffle.push.based.mergedShuffleFileManagerImpl", + "org.apache.spark.network.shuffle.RemoteBlockPushResolver"); Review comment: Sure. I will change the default to be `NoOpMergedShuffleFileManager`. cc. @Ngone51 @tgravescs ## File path: common/network-shuffle/src/main/java/org/apache/spark/network/shuffle/protocol/PushBlockStream.java ## @@ -23,24 +23,31 @@ import org.apache.spark.network.protocol.Encoders; // Needed by ScalaDoc. See SPARK-7726 -import static org.apache.spark.network.shuffle.protocol.BlockTransferMessage.Type; /** * Request to push a block to a remote shuffle service to be merged in push based shuffle. * The remote shuffle service will also include this message when responding the push requests. */ public class PushBlockStream extends BlockTransferMessage { + public static final String SHUFFLE_PUSH_BLOCK_PREFIX = "shufflePush"; public final String appId; - public final String blockId; + public final int shuffleId; + public final int mapIndex; + public final int reduceId; // Similar to the chunkIndex in StreamChunkId, indicating the index of a block in a batch of // blocks to be pushed. public final int index; + public final String streamId; Review comment: This is fixed. ## File path: common/network-shuffle/src/main/java/org/apache/spark/network/shuffle/RemoteBlockPushResolver.java ## @@ -0,0 +1,883 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. 
You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.network.shuffle; + +import java.io.BufferedOutputStream; +import java.io.DataOutputStream; +import java.io.File; +import java.io.FileOutputStream; +import java.io.IOException; +import java.nio.ByteBuffer; +import java.nio.channels.FileChannel; +import java.nio.file.Files; +import java.nio.file.Path; +import java.nio.file.Paths; +import java.util.Arrays; +import java.util.Iterator; +import java.util.LinkedList; +import java.util.List; +import java.util.Map; +import java.util.concurrent.ConcurrentMap; +import java.util.concurrent.ExecutionException; +import java.util.concurrent.Executor; +import
[GitHub] [spark] zero323 commented on a change in pull request #30216: [SPARK-33304][R][SQL] Add from_avro and to_avro functions to SparkR
zero323 commented on a change in pull request #30216: URL: https://github.com/apache/spark/pull/30216#discussion_r515789891 ## File path: R/pkg/DESCRIPTION ## @@ -42,6 +42,7 @@ Collate: 'context.R' 'deserialize.R' 'functions.R' +'functions_avro.R' Review comment: I thought it might be better to have it in a separate script, as it is external after all. Not to mention `functions` are already huge and painful to navigate, even without adding a new `family` there. Honestly, I'd prefer to split the whole `functions.R` into families (`functions_aggregate.R`, `functions_datettime.R`, ...). ## File path: R/pkg/R/functions_avro.R ## @@ -0,0 +1,117 @@ +# +# Licensed to the Apache Software Foundation (ASF) under one or more +# contributor license agreements. See the NOTICE file distributed with +# this work for additional information regarding copyright ownership. +# The ASF licenses this file to You under the Apache License, Version 2.0 +# (the "License"); you may not use this file except in compliance with +# the License. You may obtain a copy of the License at +# +#http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +# + +#' Avro processing functions for Column operations +#' +#' Avro processing functions defined for \code{Column}. +#' +#' @param x Column to compute on. +#' @param jsonFormatSchema character Avro schema in JSON string format +#' @param ... additional argument(s) passed as parser options. +#' @name column_avro_functions +#' @rdname column_avro_functions +#' @family avro functions +#' @note Avro is built-in but external data source module since Spark 2.4. +#' Please deploy the application as per the deployment section of "Apache Avro Data Source Guide". 
+#' @examples +#' \dontrun{ +#' df <- createDataFrame(iris) +#' schema <- paste( +#' c( +#' '{"type": "record", "namespace": "example.avro", "name": "Iris", "fields": [', +#' '{"type": ["double", "null"], "name": "Sepal_Length"},', +#' '{"type": ["double", "null"], "name": "Sepal_Width"},', +#' '{"type": ["double", "null"], "name": "Petal_Length"},', +#' '{"type": ["double", "null"], "name": "Petal_Width"},', +#' '{"type": ["string", "null"], "name": "Species"}]}' +#' ), +#' collapse="\\n" +#' ) +#' +#' df_serialized <- select( +#' df, +#' alias(to_avro(alias(struct(column("*")), "fields")), "payload") +#' ) +#' +#' df_deserialized <- select( +#' df_serialized, +#' from_avro(df_serialized$payload, schema) +#' ) +#' +#' head(df_deserialized) +#' } +NULL + +#' @include generics.R column.R +NULL + +#' @details +#' \code{from_avro} Converts a binary column of Avro format into its corresponding catalyst value. +#' The specified schema must match the read data, otherwise the behavior is undefined: +#' it may fail or return arbitrary result. +#' To deserialize the data with a compatible and evolved schema, the expected Avro schema can be +#' set via the option avroSchema. Review comment: There's a note in the `family` section: https://github.com/apache/spark/blob/1e9545f69100394b038c502c643624d235fd085e/R/pkg/R/functions_avro.R#L28-L29 and I'll add an actual link to docs. Do you think we should duplicate this note for each function? ## File path: R/pkg/R/functions_avro.R ## @@ -0,0 +1,117 @@ +# +# Licensed to the Apache Software Foundation (ASF) under one or more +# contributor license agreements. See the NOTICE file distributed with +# this work for additional information regarding copyright ownership. +# The ASF licenses this file to You under the Apache License, Version 2.0 +# (the "License"); you may not use this file except in compliance with +# the License. 
You may obtain a copy of the License at +# +#http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +# + +#' Avro processing functions for Column operations +#' +#' Avro processing functions defined for \code{Column}. +#' +#' @param x Column to compute on. +#' @param jsonFormatSchema character Avro schema in JSON string format +#' @param ... additional argument(s) passed as parser options. +#' @name column_avro_functions +#' @rdname column_avro_functions +#' @family avro functions +#' @note Avro is built-in but external data source module since Spark 2.4. +#' Please deploy the application as per
[GitHub] [spark] zero323 opened a new pull request #30227: [SPARK-33257][PYTHON][SQL] Support Column inputs in PySpark ordering functions (asc*, desc*)
zero323 opened a new pull request #30227: URL: https://github.com/apache/spark/pull/30227 ### What changes were proposed in this pull request? This PR adds support for passing `Column`s as input to PySpark sorting functions. ### Why are the changes needed? According to SPARK-26979, PySpark functions should support both Column and str arguments, when possible. ### Does this PR introduce _any_ user-facing change? PySpark users can now provide both `Column` and `str` as an argument for `asc*` and `desc*` functions. ### How was this patch tested? New unit tests.
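The change boils down to normalizing each argument: `Column` objects pass through while `str` names are wrapped. A toy sketch of that Column-or-str dispatch (stand-in `Column` class and helper, not PySpark's internals):

```python
# Toy sketch of the Column-or-str dispatch SPARK-26979 asks for.
# `Column` here is a stand-in class, not pyspark.sql.Column.
class Column:
    def __init__(self, expr):
        self.expr = expr

def _to_column(col):
    # Accept either a Column or a column name.
    if isinstance(col, Column):
        return col
    if isinstance(col, str):
        return Column(col)
    raise TypeError(f"col should be Column or str, got {type(col).__name__}")

def asc(col):
    # Works for both asc("age") and asc(df.age)-style inputs.
    return Column(f"{_to_column(col).expr} ASC NULLS FIRST")
```

Before the PR, `asc` and friends accepted only `str`; centralizing the conversion in one helper lets every `asc*`/`desc*` variant gain `Column` support at once.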
[GitHub] [spark] zhengruifeng commented on pull request #30009: [SPARK-32907][ML] adaptively blockify instances - LinearSVC
zhengruifeng commented on pull request #30009: URL: https://github.com/apache/spark/pull/30009#issuecomment-720851384 retest this please
[GitHub] [spark] AmplabJenkins commented on pull request #30038: [SPARK-33130][SQL] Support ALTER TABLE in JDBC v2 Table Catalog: add, update type and nullability of columns (MsSqlServer dialect)
AmplabJenkins commented on pull request #30038: URL: https://github.com/apache/spark/pull/30038#issuecomment-720294931
[GitHub] [spark] AmplabJenkins removed a comment on pull request #26319: [SPARK-29594][SQL] Provide better error message when creating a Dataset from a Sequence of Case class where a field name starte
AmplabJenkins removed a comment on pull request #26319: URL: https://github.com/apache/spark/pull/26319#issuecomment-720287298
[GitHub] [spark] dongjoon-hyun commented on pull request #30233: [SPARK-33324][K8S][BUILD] Upgrade kubernetes-client to 4.11.1
dongjoon-hyun commented on pull request #30233: URL: https://github.com/apache/spark/pull/30233#issuecomment-720928756 Thank you, @HyukjinKwon . Merged to master for Apache Spark 3.1.
[GitHub] [spark] attilapiros commented on a change in pull request #30062: [SPARK-32916][SHUFFLE] Implementation of shuffle service that leverages push-based shuffle in YARN deployment mode
attilapiros commented on a change in pull request #30062: URL: https://github.com/apache/spark/pull/30062#discussion_r516643629 ## File path: common/network-shuffle/src/main/java/org/apache/spark/network/shuffle/RemoteBlockPushResolver.java ## @@ -0,0 +1,966 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */ + +package org.apache.spark.network.shuffle; + +import java.io.BufferedOutputStream; +import java.io.DataOutputStream; +import java.io.File; +import java.io.FileOutputStream; +import java.io.IOException; +import java.nio.ByteBuffer; +import java.nio.channels.FileChannel; +import java.nio.file.Files; +import java.nio.file.Path; +import java.nio.file.Paths; +import java.util.Arrays; +import java.util.Collection; +import java.util.Iterator; +import java.util.LinkedList; +import java.util.List; +import java.util.Map; +import java.util.concurrent.ConcurrentMap; +import java.util.concurrent.ExecutionException; +import java.util.concurrent.Executor; +import java.util.concurrent.Executors; + +import com.google.common.annotations.VisibleForTesting; +import com.google.common.base.Objects; +import com.google.common.base.Preconditions; +import com.google.common.cache.CacheBuilder; +import com.google.common.cache.CacheLoader; +import com.google.common.cache.LoadingCache; +import com.google.common.cache.Weigher; +import com.google.common.collect.Maps; +import io.netty.buffer.ByteBuf; +import io.netty.buffer.Unpooled; +import org.roaringbitmap.RoaringBitmap; +import org.slf4j.Logger; +import org.slf4j.LoggerFactory; + +import org.apache.spark.network.buffer.FileSegmentManagedBuffer; +import org.apache.spark.network.buffer.ManagedBuffer; +import org.apache.spark.network.client.StreamCallbackWithID; +import org.apache.spark.network.protocol.Encoders; +import org.apache.spark.network.shuffle.protocol.ExecutorShuffleInfo; +import org.apache.spark.network.shuffle.protocol.FinalizeShuffleMerge; +import org.apache.spark.network.shuffle.protocol.MergeStatuses; +import org.apache.spark.network.shuffle.protocol.PushBlockStream; +import org.apache.spark.network.util.JavaUtils; +import org.apache.spark.network.util.NettyUtils; +import org.apache.spark.network.util.TransportConf; + +/** + * An implementation of {@link MergedShuffleFileManager} that provides the most essential shuffle + * 
service processing logic to support push based shuffle. + */ +public class RemoteBlockPushResolver implements MergedShuffleFileManager { + + private static final Logger logger = LoggerFactory.getLogger(RemoteBlockPushResolver.class); + @VisibleForTesting + static final String MERGE_MANAGER_DIR = "merge_manager"; + + private final ConcurrentMap appsPathInfo; + private final ConcurrentMap> partitions; + + private final Executor directoryCleaner; + private final TransportConf conf; + private final int minChunkSize; + private final ErrorHandler.BlockPushErrorHandler errorHandler; + + @SuppressWarnings("UnstableApiUsage") + private final LoadingCache indexCache; + + @SuppressWarnings("UnstableApiUsage") + public RemoteBlockPushResolver(TransportConf conf) { +this.conf = conf; +this.partitions = Maps.newConcurrentMap(); +this.appsPathInfo = Maps.newConcurrentMap(); +this.directoryCleaner = Executors.newSingleThreadExecutor( + // Add `spark` prefix because it will run in NM in Yarn mode. + NettyUtils.createThreadFactory("spark-shuffle-merged-shuffle-directory-cleaner")); +this.minChunkSize = conf.minChunkSizeInMergedShuffleFile(); +CacheLoader indexCacheLoader = + new CacheLoader() { +public ShuffleIndexInformation load(File file) throws IOException { + return new ShuffleIndexInformation(file); +} + }; +indexCache = CacheBuilder.newBuilder() + .maximumWeight(conf.mergedIndexCacheSize()) + .weigher((Weigher) (file, indexInfo) -> indexInfo.getSize()) + .build(indexCacheLoader); +this.errorHandler = new ErrorHandler.BlockPushErrorHandler(); + } + + /** + * Given the appShuffleId and reduceId that uniquely identifies a given shuffle partition of an + * application, retrieves the associated metadata. If not present and the corresponding merged + * shuffle does not
[GitHub] [spark] HeartSaVioR commented on a change in pull request #30221: [SPARK-33314][SQL] Avoid dropping rows in Avro reader
HeartSaVioR commented on a change in pull request #30221: URL: https://github.com/apache/spark/pull/30221#discussion_r515822838

## File path: external/avro/src/main/scala/org/apache/spark/sql/avro/AvroUtils.scala

```diff
@@ -171,9 +171,15 @@ private[sql] object AvroUtils extends Logging {
     protected val stopPosition: Long

     private[this] var completed = false
+    private[this] var interveningNext = true
+    private[this] var prevHasNextRow = false
     private[this] var currentRow: Option[InternalRow] = None

     def hasNextRow: Boolean = {
+      if (!interveningNext) {
+        // until a row is consumed, return previous result of hasNextRow
+        return prevHasNextRow
+      }
```

Review comment: I also feel @viirya's suggestion would be simpler. In addition, it looks like the implementation doesn't respect the `Iterator` contract: calling `hasNextRow` shouldn't be a prerequisite for calling `nextRow`. The code below would fix the original issue (it passes the new test), and it would also work when only `nextRow` is called, by throwing `NoSuchElementException` on an exhausted reader:

```scala
def hasNextRow: Boolean = {
  if (currentRow.isDefined) {
    return true
  }
  while (!completed && currentRow.isEmpty) {
    val r = fileReader.hasNext && !fileReader.pastSync(stopPosition)
    if (!r) {
      fileReader.close()
      completed = true
      currentRow = None
    } else {
      val record = fileReader.next()
      currentRow = deserializer.deserialize(record).asInstanceOf[Option[InternalRow]]
    }
  }
  currentRow.isDefined
}

def nextRow: InternalRow = {
  if (currentRow.isEmpty) {
    if (!hasNextRow) {
      throw new NoSuchElementException("next on empty iterator")
    }
  }
  val row = currentRow.get
  currentRow = None
  row
}
```
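The fix suggested in the review above is the classic lookahead-iterator pattern: `hasNextRow` fetches and caches one element, repeated calls are idempotent, and `nextRow` works even without a prior `hasNextRow`. A generic Java sketch of the same pattern (names are illustrative, not Spark's):

```java
import java.util.Iterator;
import java.util.NoSuchElementException;

/**
 * Lookahead wrapper: hasNext() fetches and caches the next element so that
 * repeated hasNext() calls never advance, and next() works even when
 * hasNext() was never called. Illustrative only.
 */
final class LookaheadIterator<T> implements Iterator<T> {
  private final Iterator<T> underlying;
  private T current;  // cached lookahead element, or null if none is buffered

  LookaheadIterator(Iterator<T> underlying) {
    this.underlying = underlying;
  }

  @Override
  public boolean hasNext() {
    if (current != null) {
      return true;  // idempotent: a buffered element is reported again
    }
    if (underlying.hasNext()) {
      current = underlying.next();
      return true;
    }
    return false;
  }

  @Override
  public T next() {
    // Pull an element in if the caller skipped hasNext(), per the contract.
    if (current == null && !hasNext()) {
      throw new NoSuchElementException("next on empty iterator");
    }
    T row = current;
    current = null;  // consume the cached element
    return row;
  }
}
```

The key property — the one the original `AvroUtils` code violated — is that `next()` must not depend on a preceding `hasNext()` call, and `hasNext()` must be safe to call any number of times.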
[GitHub] [spark] AmplabJenkins commented on pull request #28841: [SPARK-31962][SQL] Provide modifiedAfter and modifiedBefore options when filtering from a batch-based file data source
AmplabJenkins commented on pull request #28841: URL: https://github.com/apache/spark/pull/28841#issuecomment-720292747 This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] dongjoon-hyun opened a new pull request #30232: [SPARK-33156][INFRA][2.4] Upgrade GithubAction image from 18.04 to 20.04
dongjoon-hyun opened a new pull request #30232: URL: https://github.com/apache/spark/pull/30232

### What changes were proposed in this pull request?
This PR aims to upgrade the `GitHub Action` runner image from `Ubuntu 18.04 (LTS)` to `Ubuntu 20.04 (LTS)`.

### Why are the changes needed?
`ubuntu-latest` in `GitHub Action` is still `Ubuntu 18.04 (LTS)`.
- https://github.com/actions/virtual-environments#available-environments

This upgrade will prepare for the AmpLab Jenkins upgrade.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Pass the `GitHub Action` in this PR.
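The change described above amounts to pinning the runner image in the workflow file instead of relying on `ubuntu-latest`. A hedged sketch of what such an edit looks like (the file path and job name are illustrative; Spark's actual workflow files may differ):

```yaml
# .github/workflows/build_and_test.yml (illustrative path)
jobs:
  build:
    # Pin the runner image explicitly; at the time of this PR,
    # ubuntu-latest still resolved to Ubuntu 18.04.
    runs-on: ubuntu-20.04
```

Pinning an explicit image version keeps CI behavior stable when GitHub later repoints `ubuntu-latest` to a newer release.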
[GitHub] [spark] HeartSaVioR edited a comment on pull request #30210: [SPARK-33259][SS] Disable streaming query with possible correctness issue by default
HeartSaVioR edited a comment on pull request #30210: URL: https://github.com/apache/spark/pull/30210#issuecomment-720876379 No, I don't have a real case for knowing and taking the risk. I could probably construct a query that evades the issue, but I agree that's more theoretical than a real case. Again, I don't object to the change: if you look back at my proposal, you'll find that blocking the query was also one of the options there. My point was that such a change warrants discussion, ideally on the dev@ mailing list instead of in a PR. We should avoid making important decisions in a closed group.
[GitHub] [spark] SparkQA commented on pull request #30151: [SPARK-33223][SS][UI]Structured Streaming Web UI state information
SparkQA commented on pull request #30151: URL: https://github.com/apache/spark/pull/30151#issuecomment-721027459
[GitHub] [spark] AmplabJenkins commented on pull request #30225: [SPARK-33187][SQL] Add a check on the number of returned metastore pa…
AmplabJenkins commented on pull request #30225: URL: https://github.com/apache/spark/pull/30225#issuecomment-720385116
[GitHub] [spark] cloud-fan closed pull request #30200: [SPARK-33027][SQL] Add DisableUnnecessaryBucketedScan rule to AQE
cloud-fan closed pull request #30200: URL: https://github.com/apache/spark/pull/30200
[GitHub] [spark] tgravescs commented on pull request #30204: [SPARK-33288][SPARK-32661][K8S] Stage level scheduling support for Kubernetes
tgravescs commented on pull request #30204: URL: https://github.com/apache/spark/pull/30204#issuecomment-720557444 Seems like random test failures are happening. I ran the recent YarnClusterSuite locally and it works fine.
[GitHub] [spark] AngersZhuuuu commented on a change in pull request #30144: [SPARK-33229][SQL] Support GROUP BY use Separate columns and CUBE/ROLLUP
AngersZhuuuu commented on a change in pull request #30144: URL: https://github.com/apache/spark/pull/30144#discussion_r515926187

## File path: sql/core/src/test/scala/org/apache/spark/sql/SQLQuerySuite.scala

```diff
@@ -3691,6 +3691,32 @@ class SQLQuerySuite extends QueryTest with SharedSparkSession with AdaptiveSpark
       checkAnswer(sql("SELECT id FROM t WHERE (SELECT true)"), Row(0L))
     }
   }
+
+  test("SPARK-33229: Support GROUP BY use Separate columns and CUBE/ROLLUP") {
+    withTable("t") {
+      sql("CREATE TABLE t USING PARQUET AS SELECT id AS a, id AS b, id AS c FROM range(1)")
+      checkAnswer(sql("SELECT a, b, c, count(*) FROM t GROUP BY CUBE(a, b, c)"),
+        Row(0, 0, 0, 1) :: Row(0, 0, null, 1) ::
+          Row(0, null, 0, 1) :: Row(0, null, null, 1) ::
+          Row(null, 0, 0, 1) :: Row(null, 0, null, 1) ::
+          Row(null, null, 0, 1) :: Row(null, null, null, 1) :: Nil)
+      checkAnswer(sql("SELECT a, b, c, count(*) FROM t GROUP BY a, CUBE(b, c)"),
```

Review comment:
> what's the semantic of it?

If we want a dimensional analysis grouped by `a` together with the different dimensional combinations of `b` and `c`, currently we need to write `GROUP BY CUBE(a, b, c)` and add `WHERE a IS NOT NULL` to remove the interfering rows; with this patch we can just write
```
GROUP BY a, CUBE(b, c)
```
This set of PRs makes Grouping Analytics more flexible, as in PostgreSQL. And we do have this need for analysis.

## File path: sql/core/src/test/resources/sql-tests/inputs/group-analytics.sql

```diff
@@ -59,4 +59,12 @@ SELECT course, year FROM courseSales GROUP BY CUBE(course, year) ORDER BY groupi
 -- Aliases in SELECT could be used in ROLLUP/CUBE/GROUPING SETS
 SELECT a + b AS k1, b AS k2, SUM(a - b) FROM testData GROUP BY CUBE(k1, k2);
 SELECT a + b AS k, b, SUM(a - b) FROM testData GROUP BY ROLLUP(k, b);
-SELECT a + b, b AS k, SUM(a - b) FROM testData GROUP BY a + b, k GROUPING SETS(k)
+SELECT a + b, b AS k, SUM(a - b) FROM testData GROUP BY a + b, k GROUPING SETS(k);
+
+-- GROUP BY use mixed Separate columns and CUBE/ROLLUP
+SELECT a, b, count(1) FROM testData GROUP BY a, b, CUBE(a, b);
+SELECT a, b, count(1) FROM testData GROUP BY a, b, ROLLUP(a, b);
+SELECT a, b, count(1) FROM testData GROUP BY CUBE(a, b), ROLLUP(a, b);
+SELECT a, b, count(1) FROM testData GROUP BY a, CUBE(a, b), ROLLUP(b);
+SELECT a, b, count(1) FROM testData GROUP BY CUBE(a, b), ROLLUP(a, b) GROUPING SETS(a, b);
```

Review comment: This is not unsupported, but it can be fixed easily after refactoring GROUPING ANALYTICS.
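To see the semantics being discussed: `CUBE(b, c)` expands to the grouping sets {(b, c), (b), (c), ()}, and a plain grouping column written before the `CUBE` is simply prepended to every set, so `GROUP BY a, CUBE(b, c)` never produces the rollup rows where `a` itself is null, unlike `GROUP BY CUBE(a, b, c)`. A small sketch of this expansion (a hypothetical helper, not Spark code):

```java
import java.util.ArrayList;
import java.util.List;

/**
 * Illustrates grouping-set expansion: CUBE(cols) is all subsets of cols,
 * and plain columns before the CUBE are prepended to every set.
 * Hypothetical helper for illustration only.
 */
final class GroupingSets {
  /** Expansion of CUBE(cols): every subset, from the full set down to (). */
  static List<List<String>> cube(List<String> cols) {
    List<List<String>> sets = new ArrayList<>();
    int n = cols.size();
    for (int mask = (1 << n) - 1; mask >= 0; mask--) {
      List<String> set = new ArrayList<>();
      for (int i = 0; i < n; i++) {
        // Bit (n-1-i) of mask decides whether cols.get(i) is in this set.
        if ((mask & (1 << (n - 1 - i))) != 0) {
          set.add(cols.get(i));
        }
      }
      sets.add(set);
    }
    return sets;
  }

  /** GROUP BY plain..., CUBE(cols): prepend the plain columns to every set. */
  static List<List<String>> plainThenCube(List<String> plain, List<String> cols) {
    List<List<String>> out = new ArrayList<>();
    for (List<String> set : cube(cols)) {
      List<String> combined = new ArrayList<>(plain);
      combined.addAll(set);
      out.add(combined);
    }
    return out;
  }
}
```

For example, `plainThenCube([a], [b, c])` yields {(a, b, c), (a, b), (a, c), (a)} — `a` appears in every grouping set, which is why no `WHERE a IS NOT NULL` post-filter is needed.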
[GitHub] [spark] kiszk commented on pull request #26319: [SPARK-29594][SQL] Provide better error message when creating a Dataset from a Sequence of Case class where a field name started with a number
kiszk commented on pull request #26319: URL: https://github.com/apache/spark/pull/26319#issuecomment-720634523
[GitHub] [spark] SparkQA removed a comment on pull request #30144: [SPARK-33229][SQL] Support GROUP BY use Separate columns and CUBE/ROLLUP
SparkQA removed a comment on pull request #30144: URL: https://github.com/apache/spark/pull/30144#issuecomment-720433951 **[Test build #130526 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/130526/testReport)** for PR 30144 at commit [`c1c551c`](https://github.com/apache/spark/commit/c1c551c6f9656ec71d6da03fb8fd4c4119d66c3a).
[GitHub] [spark] SparkQA commented on pull request #30230: [SPARK-33323][SQL] Add query resolved check before convert hive relation
SparkQA commented on pull request #30230: URL: https://github.com/apache/spark/pull/30230#issuecomment-720875095
[GitHub] [spark] AmplabJenkins removed a comment on pull request #30038: [SPARK-33130][SQL] Support ALTER TABLE in JDBC v2 Table Catalog: add, update type and nullability of columns (MsSqlServer diale
AmplabJenkins removed a comment on pull request #30038: URL: https://github.com/apache/spark/pull/30038#issuecomment-720294931
[GitHub] [spark] AmplabJenkins removed a comment on pull request #30186: [SPARK-23432][UI] Add executor peak jvm memory metrics in executors page
AmplabJenkins removed a comment on pull request #30186: URL: https://github.com/apache/spark/pull/30186#issuecomment-720874715
[GitHub] [spark] HyukjinKwon commented on a change in pull request #30216: [SPARK-33304][R][SQL] Add from_avro and to_avro functions to SparkR
HyukjinKwon commented on a change in pull request #30216: URL: https://github.com/apache/spark/pull/30216#discussion_r515852857

## File path: R/pkg/R/functions_avro.R

```diff
@@ -0,0 +1,117 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements. See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License. You may obtain a copy of the License at
+#
+#    http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+#' Avro processing functions for Column operations
+#'
+#' Avro processing functions defined for \code{Column}.
+#'
+#' @param x Column to compute on.
+#' @param jsonFormatSchema character Avro schema in JSON string format
+#' @param ... additional argument(s) passed as parser options.
+#' @name column_avro_functions
+#' @rdname column_avro_functions
+#' @family avro functions
+#' @note Avro is built-in but external data source module since Spark 2.4.
+#'   Please deploy the application as per the deployment section of "Apache Avro Data Source Guide".
+#' @examples
+#' \dontrun{
+#' df <- createDataFrame(iris)
+#' schema <- paste(
+#'   c(
+#'     '{"type": "record", "namespace": "example.avro", "name": "Iris", "fields": [',
+#'     '{"type": ["double", "null"], "name": "Sepal_Length"},',
+#'     '{"type": ["double", "null"], "name": "Sepal_Width"},',
+#'     '{"type": ["double", "null"], "name": "Petal_Length"},',
+#'     '{"type": ["double", "null"], "name": "Petal_Width"},',
+#'     '{"type": ["string", "null"], "name": "Species"}]}'
+#'   ),
+#'   collapse="\\n"
+#' )
+#'
+#' df_serialized <- select(
+#'   df,
+#'   alias(to_avro(alias(struct(column("*")), "fields")), "payload")
+#' )
+#'
+#' df_deserialized <- select(
+#'   df_serialized,
+#'   from_avro(df_serialized$payload, schema)
+#' )
+#'
+#' head(df_deserialized)
+#' }
+NULL
+
+#' @include generics.R column.R
+NULL
+
+#' @details
+#' \code{from_avro} Converts a binary column of Avro format into its corresponding catalyst value.
+#' The specified schema must match the read data, otherwise the behavior is undefined:
+#' it may fail or return arbitrary result.
+#' To deserialize the data with a compatible and evolved schema, the expected Avro schema can be
+#' set via the option avroSchema.
```

Review comment: Oh no. I think that's fine, but we'll have to add the R example in the link I pointed out.

## File path: R/pkg/DESCRIPTION

```diff
@@ -42,6 +42,7 @@ Collate:
     'context.R'
     'deserialize.R'
     'functions.R'
+    'functions_avro.R'
```

Review comment: Separating them sounds like an idea, yes. But for now I think we should just put it in the same place; otherwise, it would make more sense to place the ML ones separately to follow the structure on the PySpark side.
[GitHub] [spark] SparkQA commented on pull request #30228: [SPARK-33319][SQL][TEST] Add all built-in SerDes to HiveSerDeReadWriteSuite
SparkQA commented on pull request #30228: URL: https://github.com/apache/spark/pull/30228#issuecomment-720515948