[GitHub] [spark] daugraph commented on pull request #34046: [SPARK-36804][YARN] Using the verbose parameter in yarn mode would cause application submission failure
daugraph commented on pull request #34046: URL: https://github.com/apache/spark/pull/34046#issuecomment-925517526

### Source code repository:
```bash
https://github.com/apache/spark.git -r 4ea54e8672757c0dbe3dd57c81763afdffcbcc1b
```

### Submit script/config:
```bash
export SPARK_PRINT_LAUNCH_COMMAND="1"
export SPARK_PREPEND_CLASSES="1"
export HADOOP_CONF_DIR=/path/to/hadoop/conf
export SPARK_SUBMIT_OPTS="-Djava.security.krb5.conf=/etc/krb5.conf"

spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --verbose \
  --conf spark.kerberos.keytab=/path/to/keytab/file \
  --conf spark.kerberos.principal=user_principal \
  --conf spark.yarn.queue=root.user_queue \
  --conf spark.yarn.maxAppAttempts=1 \
  --class com.example.Main \
  target/examples-1.0-SNAPSHOT.jar
```

### Output
```bash
NOTE: SPARK_PREPEND_CLASSES is set, placing locally compiled Spark classes ahead of assembly.
Spark Command: /Library/Java/JavaVirtualMachines/jdk1.8.0_271.jdk/Contents/Home/bin/java -cp /Users/lijianmeng/github/spark/conf/:/Users/lijianmeng/github/spark/common/kvstore/target/scala-2.12/classes/:/Users/lijianmeng/github/spark/common/network-common/target/scala-2.12/classes/:/Users/lijianmeng/github/spark/common/network-shuffle/target/scala-2.12/classes/:/Users/lijianmeng/github/spark/common/network-yarn/target/scala-2.12/classes/:/Users/lijianmeng/github/spark/common/sketch/target/scala-2.12/classes/:/Users/lijianmeng/github/spark/common/tags/target/scala-2.12/classes/:/Users/lijianmeng/github/spark/common/unsafe/target/scala-2.12/classes/:/Users/lijianmeng/github/spark/core/target/scala-2.12/classes/:/Users/lijianmeng/github/spark/examples/target/scala-2.12/classes/:/Users/lijianmeng/github/spark/graphx/target/scala-2.12/classes/:/Users/lijianmeng/github/spark/launcher/target/scala-2.12/classes/:/Users/lijianmeng/github/spark/mllib/target/scala-2.12/classes/:/Users/lijianmeng/github/spark/repl/target/scala-2.12/classes/:/Users/lijianmeng/github/spark/resource-managers/mesos/target/scala-2.12/classes:/Users/lijianmeng/github/spark/resource-managers/yarn/target/scala-2.12/classes/:/Users/lijianmeng/github/spark/sql/catalyst/target/scala-2.12/classes/:/Users/lijianmeng/github/spark/sql/core/target/scala-2.12/classes/:/Users/lijianmeng/github/spark/sql/hive/target/scala-2.12/classes/:/Users/lijianmeng/github/spark/sql/hive-thriftserver/target/scala-2.12/classes/:/Users/lijianmeng/github/spark/streaming/target/scala-2.12/classes/:/Users/lijianmeng/github/spark/core/target/jars/*:/Users/lijianmeng/github/spark/mllib/target/jars/*:/Users/lijianmeng/github/spark/assembly/target/scala-2.12/jars/*:/path/to/hadoop/conf/ -Djava.security.krb5.conf=/etc/krb5.conf org.apache.spark.deploy.SparkSubmit --master yarn --deploy-mode cluster --conf spark.kerberos.keytab=/path/to/keytab/file --conf spark.yarn.maxAppAttempts=1 --conf spark.kerberos.principal=user_principal --conf spark.yarn.queue=root.user_queue --class com.example.Main --verbose target/examples-1.0-SNAPSHOT.jar
Using properties file: null
Parsed arguments:
  master                  yarn
  deployMode              cluster
  executorMemory          null
  executorCores           null
  totalExecutorCores      null
  propertiesFile          null
  driverMemory            null
  driverCores             null
  driverExtraClassPath    null
  driverExtraLibraryPath  null
  driverExtraJavaOptions  null
  supervise               false
  queue                   root.user_queue
  numExecutors            null
  files                   null
  pyFiles                 null
  archives                null
  mainClass               com.example.Main
  primaryResource         file:/Users/lijianmeng/bigdata/examples/target/examples-1.0-SNAPSHOT.jar
  name                    com.example.Main
  childArgs               []
  jars                    null
  packages                null
  packagesExclusions      null
  repositories            null
  verbose                 true

Spark properties used, including those specified through --conf and those from the properties file null:
  (spark.yarn.queue,root.user_queue)
  (spark.yarn.maxAppAttempts,1)
  (spark.kerberos.principal,user_principal)
  (spark.kerberos.keytab,/path/to/keytab/file)

21/09/23 13:17:38 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Main class:
org.apache.spark.deploy.yarn.YarnClusterApplication
Arguments:
--jar
file:/Users/lijianmeng/bigdata/examples/target/examples-1.0-SNAPSHOT.jar
--class
com.example.Main
--verbose
Spark config:
(spark.kerberos.keytab,/path/to/keytab/file)
(spark.yarn.queue,root.user_queue)
(spark.app.name,com.example.Main)
```
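For context on why the submission fails: SparkSubmit forwards `--verbose` into the child arguments shown above (`--jar ... --class com.example.Main --verbose`), and the YARN client's argument parser rejects flags it does not recognize. One possible shape of a fix, filtering launcher-only flags before building the child arguments, can be sketched as follows. This is illustrative only, not the actual patch in PR #34046, and all names are hypothetical:

```scala
// Illustrative sketch: strip flags that only the SparkSubmit launcher
// understands before handing arguments to the YARN client, whose parser
// would otherwise fail on an unknown parameter.
object ChildArgsFilter {
  // Hypothetical set of launcher-only flags.
  private val launcherOnlyFlags = Set("--verbose", "-v")

  def filter(childArgs: Seq[String]): Seq[String] =
    childArgs.filterNot(launcherOnlyFlags.contains)
}
```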
[GitHub] [spark] viirya commented on a change in pull request #34038: [SPARK-36797][SQL] Union should resolve nested columns as top-level columns
viirya commented on a change in pull request #34038: URL: https://github.com/apache/spark/pull/34038#discussion_r714479944

## File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CheckAnalysis.scala

```
@@ -401,15 +401,30 @@ trait CheckAnalysis extends PredicateHelper with LookupCatalog {
                 |the ${ordinalNumber(ti + 1)} table has ${child.output.length} columns
               """.stripMargin.replace("\n", " ").trim())
           }
+          val isUnion = operator.isInstanceOf[Union]
+          val dataTypesAreCompatibleFn = if (isUnion) {
+            // `TypeCoercion` takes care of type coercion already. If any columns or nested
+            // columns are not compatible, we detect it here and throw analysis exception.
+            val typeChecker = (dt1: DataType, dt2: DataType) => {
+              !TypeCoercion.findWiderTypeForTwo(dt1.asNullable, dt2.asNullable).isEmpty
```

Review comment: It is not always possible to cast the types between the children of a union. For incompatible types, we need to find them and throw an analysis error here. Do I misunderstand it?

-- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [spark] HyukjinKwon closed pull request #33627: [SPARK-36405][SQL][TESTS] Check that SQLSTATEs are valid
HyukjinKwon closed pull request #33627: URL: https://github.com/apache/spark/pull/33627
[GitHub] [spark] HyukjinKwon commented on pull request #33627: [SPARK-36405][SQL][TESTS] Check that SQLSTATEs are valid
HyukjinKwon commented on pull request #33627: URL: https://github.com/apache/spark/pull/33627#issuecomment-925515533

Merged to master.
[GitHub] [spark] HyukjinKwon closed pull request #33844: [SPARK-36506][PYTHON] Improve test coverage for series.py and indexes/*.py.
HyukjinKwon closed pull request #33844: URL: https://github.com/apache/spark/pull/33844
[GitHub] [spark] HyukjinKwon commented on pull request #33844: [SPARK-36506][PYTHON] Improve test coverage for series.py and indexes/*.py.
HyukjinKwon commented on pull request #33844: URL: https://github.com/apache/spark/pull/33844#issuecomment-925515119

Merged to master.
[GitHub] [spark] SparkQA commented on pull request #34033: [SPARK-36792][SQL] InSet should handle NaN
SparkQA commented on pull request #34033: URL: https://github.com/apache/spark/pull/34033#issuecomment-925513635

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/48041/
[GitHub] [spark] SparkQA commented on pull request #34033: [SPARK-36792][SQL] InSet should handle NaN
SparkQA commented on pull request #34033: URL: https://github.com/apache/spark/pull/34033#issuecomment-925512991

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/48042/
[GitHub] [spark] HyukjinKwon commented on a change in pull request #34072: [SPARK-36680][CATALYST] Supports Dynamic Table Options for Spark SQL
HyukjinKwon commented on a change in pull request #34072: URL: https://github.com/apache/spark/pull/34072#discussion_r714476788

## File path: sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/parser/PlanParserSuite.scala

```
@@ -1084,6 +1087,18 @@ class PlanParserSuite extends AnalysisTest {
       table("testcat", "db", "tab").select(star()).hint("BROADCAST", $"tab"))
   }

+  test("option hint") {
```

Review comment: let's add a JIRA prefix
```suggestion
  test("SPARK-36680: option hint") {
```
[GitHub] [spark] HyukjinKwon commented on a change in pull request #34072: [SPARK-36680][CATALYST] Supports Dynamic Table Options for Spark SQL
HyukjinKwon commented on a change in pull request #34072: URL: https://github.com/apache/spark/pull/34072#discussion_r714462840

## File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/parser/AstBuilder.scala

```
@@ -1244,15 +1245,21 @@ class AstBuilder extends SqlBaseBaseVisitor[AnyRef] with SQLConfHelper with Logg
    * }}}
    */
   override def visitTable(ctx: TableContext): LogicalPlan = withOrigin(ctx) {
-    UnresolvedRelation(visitMultipartIdentifier(ctx.multipartIdentifier))
+    val tableId = visitMultipartIdentifier(ctx.multipartIdentifier)
+    val options = Option(ctx.optionHint).map(hint =>
+      visitPropertyKeyValues(hint.options)).getOrElse(Map.empty)
+    UnresolvedRelation(tableId, new CaseInsensitiveStringMap(options.asJava))
   }

   /**
    * Create an aliased table reference. This is typically used in FROM clauses.
    */
   override def visitTableName(ctx: TableNameContext): LogicalPlan = withOrigin(ctx) {
     val tableId = visitMultipartIdentifier(ctx.multipartIdentifier)
-    val table = mayApplyAliasPlan(ctx.tableAlias, UnresolvedRelation(tableId))
+    val options = Option(ctx.optionHint).map(hint =>
+      visitPropertyKeyValues(hint.options)).getOrElse(Map.empty)
+    val table = mayApplyAliasPlan(ctx.tableAlias,
+      UnresolvedRelation(tableId, new CaseInsensitiveStringMap(options.asJava)))
```

Review comment: I don't think this works for tables already defined with options, because Spark respects the table properties defined in the table. This `options` is only for DSv2 for now. Can you add an e2e test and see if it works? e.g. create a table/view and set the option.
[GitHub] [spark] HyukjinKwon commented on a change in pull request #29535: [SPARK-32592][SQL] Make DataFrameReader.table take the specified options
HyukjinKwon commented on a change in pull request #29535: URL: https://github.com/apache/spark/pull/29535#discussion_r714476408

## File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/unresolved.scala

```
@@ -40,9 +41,12 @@ class UnresolvedException[TreeType <: TreeNode[_]](tree: TreeType, function: Str
  * Holds the name of a relation that has yet to be looked up in a catalog.
  *
  * @param multipartIdentifier table name
+ * @param options options to scan this relation. Only applicable to v2 table scan.
```

Review comment: okay, I just noticed https://github.com/apache/spark/commit/5e825482d70e13a8cb16f1fbdac8139710482d17 added the merging behaviour for V1. Okay, maybe we should fix the comments here.
[GitHub] [spark] HyukjinKwon commented on a change in pull request #29535: [SPARK-32592][SQL] Make DataFrameReader.table take the specified options
HyukjinKwon commented on a change in pull request #29535: URL: https://github.com/apache/spark/pull/29535#discussion_r714476408

## File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/unresolved.scala

```
@@ -40,9 +41,12 @@ class UnresolvedException[TreeType <: TreeNode[_]](tree: TreeType, function: Str
  * Holds the name of a relation that has yet to be looked up in a catalog.
  *
  * @param multipartIdentifier table name
+ * @param options options to scan this relation. Only applicable to v2 table scan.
```

Review comment: okay, I just noticed https://github.com/apache/spark/commit/5e825482d70e13a8cb16f1fbdac8139710482d17 added the merging behaviour. Okay, maybe we should fix the comments here.
[GitHub] [spark] cloud-fan commented on a change in pull request #34038: [SPARK-36797][SQL] Union should resolve nested columns as top-level columns
cloud-fan commented on a change in pull request #34038: URL: https://github.com/apache/spark/pull/34038#discussion_r714476343

## File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CheckAnalysis.scala

```
@@ -401,15 +401,30 @@ trait CheckAnalysis extends PredicateHelper with LookupCatalog {
                 |the ${ordinalNumber(ti + 1)} table has ${child.output.length} columns
               """.stripMargin.replace("\n", " ").trim())
           }
+          val isUnion = operator.isInstanceOf[Union]
+          val dataTypesAreCompatibleFn = if (isUnion) {
+            // `TypeCoercion` takes care of type coercion already. If any columns or nested
+            // columns are not compatible, we detect it here and throw analysis exception.
+            val typeChecker = (dt1: DataType, dt2: DataType) => {
+              !TypeCoercion.findWiderTypeForTwo(dt1.asNullable, dt2.asNullable).isEmpty
```

Review comment: I know it's from the old code. But is it necessary? The analyzer can add implicit casts to make the types the same.
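The check under discussion can be illustrated with a toy stand-in (this is NOT Spark's `TypeCoercion`, just a sketch of the idea): a union is only legal when a common wider type exists for each column pair, so the analyzer can insert implicit casts for compatible pairs, while incompatible pairs still need an explicit analysis error.

```scala
// Hypothetical data types and promotion table; Spark's real coercion rules
// are far richer. This only sketches the shape of the compatibility check.
sealed trait DataType
case object IntType extends DataType
case object LongType extends DataType
case object StringType extends DataType
case object MapType extends DataType

object UnionCompat {
  def findWiderTypeForTwo(a: DataType, b: DataType): Option[DataType] = (a, b) match {
    case (x, y) if x == y                          => Some(x)
    case (IntType, LongType) | (LongType, IntType) => Some(LongType)
    case _                                         => None // no wider type: analysis error
  }

  // Implicit casts are possible exactly when a wider type exists.
  def compatible(a: DataType, b: DataType): Boolean =
    findWiderTypeForTwo(a, b).isDefined
}
```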
[GitHub] [spark] AmplabJenkins removed a comment on pull request #33844: [SPARK-36506][PYTHON] Improve test coverage for series.py and indexes/*.py.
AmplabJenkins removed a comment on pull request #33844: URL: https://github.com/apache/spark/pull/33844#issuecomment-925512221

Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/48037/
[GitHub] [spark] AmplabJenkins commented on pull request #33844: [SPARK-36506][PYTHON] Improve test coverage for series.py and indexes/*.py.
AmplabJenkins commented on pull request #33844: URL: https://github.com/apache/spark/pull/33844#issuecomment-925512221

Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/48037/
[GitHub] [spark] SparkQA commented on pull request #33844: [SPARK-36506][PYTHON] Improve test coverage for series.py and indexes/*.py.
SparkQA commented on pull request #33844: URL: https://github.com/apache/spark/pull/33844#issuecomment-925512194

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/48037/
[GitHub] [spark] SparkQA commented on pull request #33627: [SPARK-36405] Check that SQLSTATEs are valid
SparkQA commented on pull request #33627: URL: https://github.com/apache/spark/pull/33627#issuecomment-925510613

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/48039/
[GitHub] [spark] SparkQA commented on pull request #34036: [SPARK-36795][SQL] Explain Formatted has Duplicate Node IDs
SparkQA commented on pull request #34036: URL: https://github.com/apache/spark/pull/34036#issuecomment-925510108

**[Test build #143534 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/143534/testReport)** for PR 34036 at commit [`c33b533`](https://github.com/apache/spark/commit/c33b5332262f132b3bdbd565b03436736f3e7a2f).
[GitHub] [spark] SparkQA commented on pull request #34073: [SPARK-36760][SQL][FOLLOWUP] Add interface SupportsPushDownV2Filters
SparkQA commented on pull request #34073: URL: https://github.com/apache/spark/pull/34073#issuecomment-925510025

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/48040/
[GitHub] [spark] cloud-fan commented on pull request #29535: [SPARK-32592][SQL] Make DataFrameReader.table take the specified options
cloud-fan commented on pull request #29535: URL: https://github.com/apache/spark/pull/29535#issuecomment-925509816

> this creates a myth that setting options will overwrite table properties.

This is expected. Per-scan options have higher priority than table properties.
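The precedence described above can be sketched as a simple map merge (illustrative, not Spark's implementation; key names are made up):

```scala
// Per-scan options win over table properties on key collisions.
object ScanConf {
  def effective(tableProps: Map[String, String],
                scanOptions: Map[String, String]): Map[String, String] =
    tableProps ++ scanOptions // the right-hand operand wins on duplicate keys
}
```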
[GitHub] [spark] AmplabJenkins removed a comment on pull request #34033: [SPARK-36792][SQL] InSet should handle NaN
AmplabJenkins removed a comment on pull request #34033: URL: https://github.com/apache/spark/pull/34033#issuecomment-925508882
[GitHub] [spark] SparkQA commented on pull request #34046: [SPARK-36804][YARN] Using the verbose parameter in yarn mode would cause application submission failure
SparkQA commented on pull request #34046: URL: https://github.com/apache/spark/pull/34046#issuecomment-925509177

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/48038/
[GitHub] [spark] ChenMichael commented on a change in pull request #34036: [SPARK-36795][SQL] Explain Formatted has Duplicate Node IDs
ChenMichael commented on a change in pull request #34036: URL: https://github.com/apache/spark/pull/34036#discussion_r714473203

## File path: sql/core/src/test/scala/org/apache/spark/sql/ExplainSuite.scala

```
@@ -704,6 +704,31 @@ class ExplainSuiteAE extends ExplainSuiteHelper with EnableAdaptiveExecutionSuit
         "Bucketed: false (bucket column(s) not read)")
     }
   }
+
+  test("SPARK-36795: Node IDs should not be duplicated when InMemoryRelation Present") {
+    withTempView("t1", "t2") {
+      Seq(1).toDF("k").write.saveAsTable("t1")
+      Seq(1).toDF("key").write.saveAsTable("t2")
+      spark.sql("SELECT * FROM t1").persist()
+      val query = "SELECT * FROM (SELECT * FROM t1) join t2 " +
+        "ON k = t2.key"
+      val df = sql(query).toDF()
+
+      df.collect()
+      checkKeywordsExistsInExplain(df, FormattedMode,
+        """ * BroadcastHashJoin Inner BuildLeft (12)
+          | :- BroadcastQueryStage (8)
+          | :    +- BroadcastExchange (7)
+          | :       +- * Filter (6)
+          | :          +- * ColumnarToRow (5)
```

Review comment: OK, changed the test to a regex that extracts the node IDs and asserts they are different.
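The revised test strategy described above, extracting the `(n)` node IDs with a regex and asserting they are distinct, can be sketched with a hypothetical helper (not the PR's actual code):

```scala
// Pull every "(n)" node ID out of a formatted plan string and check for
// duplicates. The pattern and helper names are illustrative.
object NodeIds {
  private val idPattern = """\((\d+)\)""".r

  def extract(plan: String): Seq[Int] =
    idPattern.findAllMatchIn(plan).map(_.group(1).toInt).toSeq

  def allDistinct(plan: String): Boolean = {
    val ids = extract(plan)
    ids.distinct.size == ids.size
  }
}
```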
[GitHub] [spark] AmplabJenkins commented on pull request #34033: [SPARK-36792][SQL] InSet should handle NaN
AmplabJenkins commented on pull request #34033: URL: https://github.com/apache/spark/pull/34033#issuecomment-925508882
[GitHub] [spark] cloud-fan commented on pull request #33903: [SPARK-36656][SQL][TEST] CollapseProject should not collapse correlated scalar subqueries
cloud-fan commented on pull request #33903: URL: https://github.com/apache/spark/pull/33903#issuecomment-925508631

@allisonwang-db can you fix the conflicts?
[GitHub] [spark] cloud-fan commented on pull request #33990: [SPARK-36747][SQL] Do not collapse Project with Aggregate when correlated subqueries are present in the project list
cloud-fan commented on pull request #33990: URL: https://github.com/apache/spark/pull/33990#issuecomment-925508054

@allisonwang-db can you open a backport PR for 3.2?
[GitHub] [spark] cloud-fan edited a comment on pull request #33990: [SPARK-36747][SQL] Do not collapse Project with Aggregate when correlated subqueries are present in the project list
cloud-fan edited a comment on pull request #33990: URL: https://github.com/apache/spark/pull/33990#issuecomment-925507907

thanks, merging to master!
[GitHub] [spark] cloud-fan commented on pull request #33990: [SPARK-36747][SQL] Do not collapse Project with Aggregate when correlated subqueries are present in the project list
cloud-fan commented on pull request #33990: URL: https://github.com/apache/spark/pull/33990#issuecomment-925507907

thanks, merging to master/3.2!
[GitHub] [spark] Ngone51 commented on a change in pull request #34043: [SPARK-36782][CORE] Avoid blocking dispatcher-BlockManagerMaster during UpdateBlockInfo
Ngone51 commented on a change in pull request #34043: URL: https://github.com/apache/spark/pull/34043#discussion_r714472130

## File path: core/src/main/scala/org/apache/spark/storage/BlockManagerMasterEndpoint.scala

```
@@ -117,12 +117,15 @@ class BlockManagerMasterEndpoint(
     case _updateBlockInfo @
         UpdateBlockInfo(blockManagerId, blockId, storageLevel, deserializedSize, size) =>
-      val isSuccess = updateBlockInfo(blockManagerId, blockId, storageLevel, deserializedSize, size)
-      context.reply(isSuccess)
-      // SPARK-30594: we should not post `SparkListenerBlockUpdated` when updateBlockInfo
-      // returns false since the block info would be updated again later.
-      if (isSuccess) {
-        listenerBus.post(SparkListenerBlockUpdated(BlockUpdatedInfo(_updateBlockInfo)))
+      val response = updateBlockInfo(blockManagerId, blockId, storageLevel, deserializedSize, size)
+
+      response.foreach { isSuccess =>
+        // SPARK-30594: we should not post `SparkListenerBlockUpdated` when updateBlockInfo
+        // returns false since the block info would be updated again later.
+        if (isSuccess) {
+          listenerBus.post(SparkListenerBlockUpdated(BlockUpdatedInfo(_updateBlockInfo)))
+        }
+        context.reply(isSuccess)
```

Review comment: Sure!
[GitHub] [spark] mridulm commented on a change in pull request #34043: [SPARK-36782][CORE] Avoid blocking dispatcher-BlockManagerMaster during UpdateBlockInfo
mridulm commented on a change in pull request #34043: URL: https://github.com/apache/spark/pull/34043#discussion_r714471870

## File path: core/src/main/scala/org/apache/spark/storage/BlockManagerMasterEndpoint.scala

```diff
@@ -117,12 +117,15 @@ class BlockManagerMasterEndpoint(
     case _updateBlockInfo @ UpdateBlockInfo(blockManagerId, blockId, storageLevel, deserializedSize, size) =>
-      val isSuccess = updateBlockInfo(blockManagerId, blockId, storageLevel, deserializedSize, size)
-      context.reply(isSuccess)
-      // SPARK-30594: we should not post `SparkListenerBlockUpdated` when updateBlockInfo
-      // returns false since the block info would be updated again later.
-      if (isSuccess) {
-        listenerBus.post(SparkListenerBlockUpdated(BlockUpdatedInfo(_updateBlockInfo)))
+      val response = updateBlockInfo(blockManagerId, blockId, storageLevel, deserializedSize, size)
+
+      response.foreach { isSuccess =>
+        // SPARK-30594: we should not post `SparkListenerBlockUpdated` when updateBlockInfo
+        // returns false since the block info would be updated again later.
+        if (isSuccess) {
+          listenerBus.post(SparkListenerBlockUpdated(BlockUpdatedInfo(_updateBlockInfo)))
+        }
+        context.reply(isSuccess)
```

Review comment: Given @gengliangwang has merged it, can you create a follow-up PR? We can merge it pretty quickly and possibly make that into the current 3.2 RC as well :)
[GitHub] [spark] gengliangwang closed pull request #34043: [SPARK-36782][CORE] Avoid blocking dispatcher-BlockManagerMaster during UpdateBlockInfo
gengliangwang closed pull request #34043: URL: https://github.com/apache/spark/pull/34043
[GitHub] [spark] gengliangwang commented on pull request #34043: [SPARK-36782][CORE] Avoid blocking dispatcher-BlockManagerMaster during UpdateBlockInfo
gengliangwang commented on pull request #34043: URL: https://github.com/apache/spark/pull/34043#issuecomment-925506975 Merging to master/3.2
[GitHub] [spark] gengliangwang commented on pull request #34043: [SPARK-36782][CORE] Avoid blocking dispatcher-BlockManagerMaster during UpdateBlockInfo
gengliangwang commented on pull request #34043: URL: https://github.com/apache/spark/pull/34043#issuecomment-925506883 @mridulm @Ngone51 I really want to start 3.2.0 RC4 today. So I am going to merge this one and ask @Ngone51 to create a follow-up PR so that we can start the new RC soon.
[GitHub] [spark] cloud-fan closed pull request #33990: [SPARK-36747][SQL] Do not collapse Project with Aggregate when correlated subqueries are present in the project list
cloud-fan closed pull request #33990: URL: https://github.com/apache/spark/pull/33990
[GitHub] [spark] HyukjinKwon commented on a change in pull request #29535: [SPARK-32592][SQL] Make DataFrameReader.table take the specified options
HyukjinKwon commented on a change in pull request #29535: URL: https://github.com/apache/spark/pull/29535#discussion_r714469347

## File path: sql/core/src/test/scala/org/apache/spark/sql/connector/DataSourceV2DataFrameSuite.scala

```diff
@@ -186,4 +187,21 @@ class DataSourceV2DataFrameSuite
       assert(e3.getMessage.contains(s"Cannot use interval type in the table schema."))
     }
   }
+
+  test("options to scan v2 table should be passed to DataSourceV2Relation") {
+    val t1 = "testcat.ns1.ns2.tbl"
+    withTable(t1) {
+      val df1 = Seq((1L, "a"), (2L, "b"), (3L, "c")).toDF("id", "data")
+      df1.write.saveAsTable(t1)
+
+      val optionName = "fakeOption"
+      val df2 = spark.read
+        .option(optionName, false)
+        .table(t1)
```

Review comment: so to be doubly sure, what happens if some options are already set on this table? e.g.:

```scala
sql("CREATE TABLE tbl(a int) USING jdbc OPTIONS(a='b')")
spark.read.option("a", "c").table("tbl")
```
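The collision being asked about boils down to a key clash between two string maps: properties fixed at `CREATE TABLE` time versus options passed per read. A minimal sketch (plain Scala, not Spark code; the map names are hypothetical) of the two possible merge semantics for the `a='b'` vs `option("a", "c")` example:

```scala
object OptionMergeSketch {
  // Hypothetical stand-ins: properties declared at CREATE TABLE time vs.
  // per-read options passed through DataFrameReader.option.
  val tableProps  = Map("a" -> "b")
  val readOptions = Map("a" -> "c")

  // If per-read options win (last-wins merge), the scan sees a=c.
  val optionsWin: Map[String, String] = tableProps ++ readOptions

  // If table properties win, the per-read option is silently ignored.
  val propsWin: Map[String, String] = readOptions ++ tableProps
}
```

Which of the two behaviors applies is exactly what the thread goes on to debate.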
[GitHub] [spark] SparkQA commented on pull request #34033: [SPARK-36792][SQL] InSet should handle NaN
SparkQA commented on pull request #34033: URL: https://github.com/apache/spark/pull/34033#issuecomment-925504422 Kubernetes integration test status failure URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/48036/
[GitHub] [spark] HyukjinKwon commented on pull request #29535: [SPARK-32592][SQL] Make DataFrameReader.table take the specified options
HyukjinKwon commented on pull request #29535: URL: https://github.com/apache/spark/pull/29535#issuecomment-925504281 this creates a myth that setting `options` will overwrite table properties. see also https://github.com/apache/spark/pull/34072
[GitHub] [spark] HyukjinKwon commented on pull request #29535: [SPARK-32592][SQL] Make DataFrameReader.table take the specified options
HyukjinKwon commented on pull request #29535: URL: https://github.com/apache/spark/pull/29535#issuecomment-925504126 So `UnresolvedRelation` is shared for both cases but conditionally uses the `UnresolvedRelation.options` only for Scan? That's very confusing.
[GitHub] [spark] cloud-fan commented on pull request #29535: [SPARK-32592][SQL] Make DataFrameReader.table take the specified options
cloud-fan commented on pull request #29535: URL: https://github.com/apache/spark/pull/29535#issuecomment-925503778 it's table properties vs scan options
[GitHub] [spark] mridulm commented on a change in pull request #34043: [SPARK-36782][CORE] Avoid blocking dispatcher-BlockManagerMaster during UpdateBlockInfo
mridulm commented on a change in pull request #34043: URL: https://github.com/apache/spark/pull/34043#discussion_r714464726

## File path: core/src/main/scala/org/apache/spark/storage/BlockManagerMasterEndpoint.scala

```diff
@@ -117,12 +117,15 @@ class BlockManagerMasterEndpoint(
     case _updateBlockInfo @ UpdateBlockInfo(blockManagerId, blockId, storageLevel, deserializedSize, size) =>
-      val isSuccess = updateBlockInfo(blockManagerId, blockId, storageLevel, deserializedSize, size)
-      context.reply(isSuccess)
-      // SPARK-30594: we should not post `SparkListenerBlockUpdated` when updateBlockInfo
-      // returns false since the block info would be updated again later.
-      if (isSuccess) {
-        listenerBus.post(SparkListenerBlockUpdated(BlockUpdatedInfo(_updateBlockInfo)))
+      val response = updateBlockInfo(blockManagerId, blockId, storageLevel, deserializedSize, size)
+
+      response.foreach { isSuccess =>
+        // SPARK-30594: we should not post `SparkListenerBlockUpdated` when updateBlockInfo
+        // returns false since the block info would be updated again later.
+        if (isSuccess) {
+          listenerBus.post(SparkListenerBlockUpdated(BlockUpdatedInfo(_updateBlockInfo)))
+        }
+        context.reply(isSuccess)
```

Review comment: Did not realize this - thanks for pointing it out! So if I understood it right, the proposal is:

```scala
def handleResult(success: Boolean): Unit = {
  if (success) {
    // post
  }
  context.reply(success)
}

if (blockId.isShuffle) {
  updateShuffleBlockInfo( ... ).foreach(handleResult(_))
} else {
  handleResult(updateBlockInfo( ... ))
}
```

?
[GitHub] [spark] SparkQA commented on pull request #34033: [SPARK-36792][SQL] InSet should handle NaN
SparkQA commented on pull request #34033: URL: https://github.com/apache/spark/pull/34033#issuecomment-925502753 Kubernetes integration test status failure URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/48035/
[GitHub] [spark] Ngone51 commented on a change in pull request #34043: [SPARK-36782][CORE] Avoid blocking dispatcher-BlockManagerMaster during UpdateBlockInfo
Ngone51 commented on a change in pull request #34043: URL: https://github.com/apache/spark/pull/34043#discussion_r714465842

## File path: core/src/main/scala/org/apache/spark/storage/BlockManagerMasterEndpoint.scala

```diff
@@ -117,12 +117,15 @@ class BlockManagerMasterEndpoint(
     case _updateBlockInfo @ UpdateBlockInfo(blockManagerId, blockId, storageLevel, deserializedSize, size) =>
-      val isSuccess = updateBlockInfo(blockManagerId, blockId, storageLevel, deserializedSize, size)
-      context.reply(isSuccess)
-      // SPARK-30594: we should not post `SparkListenerBlockUpdated` when updateBlockInfo
-      // returns false since the block info would be updated again later.
-      if (isSuccess) {
-        listenerBus.post(SparkListenerBlockUpdated(BlockUpdatedInfo(_updateBlockInfo)))
+      val response = updateBlockInfo(blockManagerId, blockId, storageLevel, deserializedSize, size)
+
+      response.foreach { isSuccess =>
+        // SPARK-30594: we should not post `SparkListenerBlockUpdated` when updateBlockInfo
+        // returns false since the block info would be updated again later.
+        if (isSuccess) {
+          listenerBus.post(SparkListenerBlockUpdated(BlockUpdatedInfo(_updateBlockInfo)))
+        }
+        context.reply(isSuccess)
```

Review comment: Yes!
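The pattern this thread converges on can be sketched in isolation: one shared callback serves both the synchronous and the Future-based paths, so the dispatcher thread replies from the callback instead of blocking on the async result. This is a standalone sketch, not Spark source; `Context` stands in for the real `RpcCallContext`, and the method bodies only mimic `updateBlockInfo`/`updateShuffleBlockInfo`.

```scala
import scala.concurrent.{Await, ExecutionContext, Future}
import scala.concurrent.duration._

object NonBlockingReplySketch {
  implicit val ec: ExecutionContext = ExecutionContext.global

  // Stand-in for RpcCallContext; records what was replied.
  final class Context { @volatile var replied: Option[Boolean] = None }

  // Shared result handler: post the listener event on success, then reply.
  def handleResult(context: Context)(success: Boolean): Unit = {
    if (success) {
      () // listenerBus.post(SparkListenerBlockUpdated(...)) in the real code
    }
    context.replied = Some(success)
  }

  // Async path (shuffle blocks): the reply happens inside the Future
  // callback, keeping the dispatcher free while the update is in flight.
  def asyncPath(context: Context): Future[Unit] =
    Future { true /* updateShuffleBlockInfo(...) */ }.map(handleResult(context))

  // Sync path (regular blocks): reply immediately on the calling thread.
  def syncPath(context: Context): Unit =
    handleResult(context)(true /* updateBlockInfo(...) */)
}
```

The key design point is that neither path ever calls `Await` inside the endpoint; the caller is answered whenever the result materializes.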
[GitHub] [spark] HyukjinKwon commented on pull request #29535: [SPARK-32592][SQL] Make DataFrameReader.table take the specified options
HyukjinKwon commented on pull request #29535: URL: https://github.com/apache/spark/pull/29535#issuecomment-925500304 wait, I get confused here. We already defined a table with options. How does it work with the newly set options? are they merged?
[GitHub] [spark] HyukjinKwon commented on a change in pull request #34072: [SPARK-36680][CATALYST] Supports Dynamic Table Options for Spark SQL
HyukjinKwon commented on a change in pull request #34072: URL: https://github.com/apache/spark/pull/34072#discussion_r714462840

## File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/parser/AstBuilder.scala

```diff
@@ -1244,15 +1245,21 @@ class AstBuilder extends SqlBaseBaseVisitor[AnyRef] with SQLConfHelper with Logg
    * }}}
    */
   override def visitTable(ctx: TableContext): LogicalPlan = withOrigin(ctx) {
-    UnresolvedRelation(visitMultipartIdentifier(ctx.multipartIdentifier))
+    val tableId = visitMultipartIdentifier(ctx.multipartIdentifier)
+    val options = Option(ctx.optionHint).map(hint =>
+      visitPropertyKeyValues(hint.options)).getOrElse(Map.empty)
+    UnresolvedRelation(tableId, new CaseInsensitiveStringMap(options.asJava))
   }

   /**
    * Create an aliased table reference. This is typically used in FROM clauses.
    */
   override def visitTableName(ctx: TableNameContext): LogicalPlan = withOrigin(ctx) {
     val tableId = visitMultipartIdentifier(ctx.multipartIdentifier)
-    val table = mayApplyAliasPlan(ctx.tableAlias, UnresolvedRelation(tableId))
+    val options = Option(ctx.optionHint).map(hint =>
+      visitPropertyKeyValues(hint.options)).getOrElse(Map.empty)
+    val table = mayApplyAliasPlan(ctx.tableAlias,
+      UnresolvedRelation(tableId, new CaseInsensitiveStringMap(options.asJava)))
```

Review comment: I don't think this works for tables already defined with options, because Spark respects the table properties defined in the table. This `options` is only for DSv2 for now. Can you add an e2e test and see if it works? e.g. create a table view and set the option.
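Both hunks in the review above read a possibly-absent parser rule with the same null-safe idiom: wrap the nullable ANTLR rule context in `Option` and fall back to an empty map. A self-contained sketch of the idiom (the `OptionHintCtx` type is made up for illustration; it is not the real parser class):

```scala
object OptionHintSketch {
  // Hypothetical stand-in for a nullable ANTLR rule context.
  final case class OptionHintCtx(options: Map[String, String])

  // Option(null) is None, so a missing OPTIONS hint resolves to an
  // empty map instead of throwing a NullPointerException.
  def parseOptions(ctx: OptionHintCtx): Map[String, String] =
    Option(ctx).map(_.options).getOrElse(Map.empty)
}
```

This is why the diff can safely call `visitPropertyKeyValues` only when the hint is present.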
[GitHub] [spark] AmplabJenkins removed a comment on pull request #34046: [SPARK-36804][YARN] Using the verbose parameter in yarn mode would cause application submission failure
AmplabJenkins removed a comment on pull request #34046: URL: https://github.com/apache/spark/pull/34046#issuecomment-925497981 Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/143530/
[GitHub] [spark] SparkQA removed a comment on pull request #34046: [SPARK-36804][YARN] Using the verbose parameter in yarn mode would cause application submission failure
SparkQA removed a comment on pull request #34046: URL: https://github.com/apache/spark/pull/34046#issuecomment-925493213 **[Test build #143530 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/143530/testReport)** for PR 34046 at commit [`80b24bd`](https://github.com/apache/spark/commit/80b24bdb8a4dd7cf2b46563d4708f9abdff0e540).
[GitHub] [spark] AmplabJenkins commented on pull request #34046: [SPARK-36804][YARN] Using the verbose parameter in yarn mode would cause application submission failure
AmplabJenkins commented on pull request #34046: URL: https://github.com/apache/spark/pull/34046#issuecomment-925497981 Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/143530/
[GitHub] [spark] SparkQA commented on pull request #34046: [SPARK-36804][YARN] Using the verbose parameter in yarn mode would cause application submission failure
SparkQA commented on pull request #34046: URL: https://github.com/apache/spark/pull/34046#issuecomment-925497956 **[Test build #143530 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/143530/testReport)** for PR 34046 at commit [`80b24bd`](https://github.com/apache/spark/commit/80b24bdb8a4dd7cf2b46563d4708f9abdff0e540).
 * This patch **fails to build**.
 * This patch merges cleanly.
 * This patch adds no public classes.
[GitHub] [spark] SparkQA commented on pull request #34033: [SPARK-36792][SQL] InSet should handle NaN
SparkQA commented on pull request #34033: URL: https://github.com/apache/spark/pull/34033#issuecomment-925497597 **[Test build #143533 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/143533/testReport)** for PR 34033 at commit [`293daea`](https://github.com/apache/spark/commit/293daea9674bb06606dbdd188b6730797de2f617).
[GitHub] [spark] huaxingao commented on pull request #34030: [SPARK-36790][SQL] Update user-facing catalog to adapt CatalogPlugin
huaxingao commented on pull request #34030: URL: https://github.com/apache/spark/pull/34030#issuecomment-925495648

> Another question is, do we need to add more function overloads with an extra catalog parameter?

Agree not to add more function overloading.
[GitHub] [spark] AngersZhuuuu commented on a change in pull request #34033: [SPARK-36792][SQL] InSet should handle NaN
AngersZh commented on a change in pull request #34033: URL: https://github.com/apache/spark/pull/34033#discussion_r714460173

## File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/predicates.scala

```diff
@@ -562,6 +567,8 @@ case class InSet(child: Expression, hset: Set[Any]) extends UnaryExpression with
   protected override def nullSafeEval(value: Any): Any = {
     if (set.contains(value)) {
       true
+    } else if (isNaN(value)) {
+      set.exists(isNaN(_))
```

Review comment: How about the current version?
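The extra branch in the hunk above exists because of how IEEE-754 NaN compares: primitive `==` never matches NaN against itself, while boxed `java.lang.Double.equals` does, so a membership check written in terms of primitive equality needs an explicit `isNaN` fallback. A standalone sketch of the underlying behavior (plain Scala, not the Catalyst code):

```scala
object NaNSketch {
  // Primitive comparison: NaN is not == to anything, itself included.
  val primitiveEq: Boolean = Double.NaN == Double.NaN // false

  // Boxed equality: java.lang.Double.equals treats two NaNs as equal.
  val boxedEq: Boolean =
    java.lang.Double.valueOf(Double.NaN).equals(java.lang.Double.valueOf(Double.NaN)) // true

  // A membership check built on primitive == would miss NaN, so it needs
  // an explicit isNaN branch, mirroring `else if (isNaN(value))` above.
  def inSet(value: Double, set: Seq[Double]): Boolean =
    set.exists(x => x == value || (x.isNaN && value.isNaN))
}
```

Without the `isNaN` branch, `inSet(Double.NaN, ...)` would always return false, which is exactly the bug SPARK-36792 addresses for `InSet`.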
[GitHub] [spark] SparkQA commented on pull request #34073: [SPARK-36760][SQL][FOLLOWUP] Add interface SupportsPushDownV2Filters
SparkQA commented on pull request #34073: URL: https://github.com/apache/spark/pull/34073#issuecomment-925495331 **[Test build #143532 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/143532/testReport)** for PR 34073 at commit [`3a0052f`](https://github.com/apache/spark/commit/3a0052f2830ae1f31b92f0e2847937a359145477).
[GitHub] [spark] SparkQA commented on pull request #33844: [SPARK-36506][PYTHON] Improve test coverage for series.py and indexes/*.py.
SparkQA commented on pull request #33844: URL: https://github.com/apache/spark/pull/33844#issuecomment-925494603 Kubernetes integration test starting URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/48037/
[GitHub] [spark] SparkQA commented on pull request #33627: [SPARK-36405] Check that SQLSTATEs are valid
SparkQA commented on pull request #33627: URL: https://github.com/apache/spark/pull/33627#issuecomment-925493660 **[Test build #143531 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/143531/testReport)** for PR 33627 at commit [`1877bc4`](https://github.com/apache/spark/commit/1877bc48de3087134edbed6d3e45f20d4be3ba7d).
[GitHub] [spark] AmplabJenkins removed a comment on pull request #34046: [SPARK-36804][YARN] Using the verbose parameter in yarn mode would cause application submission failure
AmplabJenkins removed a comment on pull request #34046: URL: https://github.com/apache/spark/pull/34046#issuecomment-922637790 Can one of the admins verify this patch?
[GitHub] [spark] SparkQA commented on pull request #34046: [SPARK-36804][YARN] Using the verbose parameter in yarn mode would cause application submission failure
SparkQA commented on pull request #34046: URL: https://github.com/apache/spark/pull/34046#issuecomment-925493213 **[Test build #143530 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/143530/testReport)** for PR 34046 at commit [`80b24bd`](https://github.com/apache/spark/commit/80b24bdb8a4dd7cf2b46563d4708f9abdff0e540).
[GitHub] [spark] AmplabJenkins removed a comment on pull request #34038: [SPARK-36797][SQL] Union should resolve nested columns as top-level columns
AmplabJenkins removed a comment on pull request #34038: URL: https://github.com/apache/spark/pull/34038#issuecomment-925490991 Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/48034/
[GitHub] [spark] AmplabJenkins removed a comment on pull request #33844: [SPARK-36506][PYTHON] Improve test coverage for series.py and indexes/*.py.
AmplabJenkins removed a comment on pull request #33844: URL: https://github.com/apache/spark/pull/33844#issuecomment-925490993 Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/143529/
[GitHub] [spark] AmplabJenkins removed a comment on pull request #34069: [SPARK-36823][SQL] Support broadcast nested loop join hint for equi-join
AmplabJenkins removed a comment on pull request #34069: URL: https://github.com/apache/spark/pull/34069#issuecomment-925490994 Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/48032/
[GitHub] [spark] AmplabJenkins removed a comment on pull request #34058: [SPARK-36711][PYTHON] Support multi-index in new syntax
AmplabJenkins removed a comment on pull request #34058: URL: https://github.com/apache/spark/pull/34058#issuecomment-925490992 Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/48033/
[GitHub] [spark] AmplabJenkins commented on pull request #33844: [SPARK-36506][PYTHON] Improve test coverage for series.py and indexes/*.py.
AmplabJenkins commented on pull request #33844: URL: https://github.com/apache/spark/pull/33844#issuecomment-925490993 Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/143529/
[GitHub] [spark] AmplabJenkins commented on pull request #34058: [SPARK-36711][PYTHON] Support multi-index in new syntax
AmplabJenkins commented on pull request #34058: URL: https://github.com/apache/spark/pull/34058#issuecomment-925490992

Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/48033/
[GitHub] [spark] AmplabJenkins commented on pull request #34069: [SPARK-36823][SQL] Support broadcast nested loop join hint for equi-join
AmplabJenkins commented on pull request #34069: URL: https://github.com/apache/spark/pull/34069#issuecomment-925490994

Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/48032/
[GitHub] [spark] AmplabJenkins commented on pull request #34038: [SPARK-36797][SQL] Union should resolve nested columns as top-level columns
AmplabJenkins commented on pull request #34038: URL: https://github.com/apache/spark/pull/34038#issuecomment-925490991

Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/48034/
[GitHub] [spark] SparkQA commented on pull request #34033: [SPARK-36792][SQL] InSet should handle NaN
SparkQA commented on pull request #34033: URL: https://github.com/apache/spark/pull/34033#issuecomment-925489997

Kubernetes integration test starting URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/48036/
[GitHub] [spark] SparkQA commented on pull request #34058: [SPARK-36711][PYTHON] Support multi-index in new syntax
SparkQA commented on pull request #34058: URL: https://github.com/apache/spark/pull/34058#issuecomment-925489894

Kubernetes integration test status failure URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/48033/
[GitHub] [spark] SparkQA removed a comment on pull request #33844: [SPARK-36506][PYTHON] Improve test coverage for series.py and indexes/*.py.
SparkQA removed a comment on pull request #33844: URL: https://github.com/apache/spark/pull/33844#issuecomment-925480993

**[Test build #143529 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/143529/testReport)** for PR 33844 at commit [`90e7ae9`](https://github.com/apache/spark/commit/90e7ae9510345f8be6aa08d2e28eacf024cb1264).
[GitHub] [spark] SparkQA commented on pull request #33844: [SPARK-36506][PYTHON] Improve test coverage for series.py and indexes/*.py.
SparkQA commented on pull request #33844: URL: https://github.com/apache/spark/pull/33844#issuecomment-925488482

**[Test build #143529 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/143529/testReport)** for PR 33844 at commit [`90e7ae9`](https://github.com/apache/spark/commit/90e7ae9510345f8be6aa08d2e28eacf024cb1264).

* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] [spark] SparkQA commented on pull request #34069: [SPARK-36823][SQL] Support broadcast nested loop join hint for equi-join
SparkQA commented on pull request #34069: URL: https://github.com/apache/spark/pull/34069#issuecomment-925488226

Kubernetes integration test status failure URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/48032/
[GitHub] [spark] SparkQA commented on pull request #34033: [SPARK-36792][SQL] InSet should handle NaN
SparkQA commented on pull request #34033: URL: https://github.com/apache/spark/pull/34033#issuecomment-925488150

Kubernetes integration test starting URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/48035/
[GitHub] [spark] HyukjinKwon commented on pull request #34046: [SPARK-36804][YARN] Using the verbose parameter in yarn mode would cause application submission failure
HyukjinKwon commented on pull request #34046: URL: https://github.com/apache/spark/pull/34046#issuecomment-925487408

ok to test
[GitHub] [spark] HyukjinKwon closed pull request #34031: [SPARK-36791][DOCS] Fix spelling mistakes in running-on-yarn.md file where JHS_POST should be JHS_HOST
HyukjinKwon closed pull request #34031: URL: https://github.com/apache/spark/pull/34031
[GitHub] [spark] HyukjinKwon commented on pull request #34031: [SPARK-36791][DOCS] Fix spelling mistakes in running-on-yarn.md file where JHS_POST should be JHS_HOST
HyukjinKwon commented on pull request #34031: URL: https://github.com/apache/spark/pull/34031#issuecomment-925486202

Merged to master, branch-3.2, branch-3.1, and branch-3.0.
[GitHub] [spark] HyukjinKwon commented on a change in pull request #34058: [SPARK-36711][PYTHON] Support multi-index in new syntax
HyukjinKwon commented on a change in pull request #34058: URL: https://github.com/apache/spark/pull/34058#discussion_r714452391

File path: python/pyspark/pandas/typedef/typehints.py

```diff
@@ -673,98 +673,146 @@ def create_tuple_for_frame_type(params: Any) -> object:
     Typing data columns with an index:

     >>> ps.DataFrame[int, [int, int]]  # doctest: +ELLIPSIS
-    typing.Tuple[...IndexNameType, int, int]
+    typing.Tuple[...IndexNameType, ...NameType, ...NameType]
     >>> ps.DataFrame[pdf.index.dtype, pdf.dtypes]  # doctest: +ELLIPSIS
-    typing.Tuple[...IndexNameType, numpy.int64]
+    typing.Tuple[...IndexNameType, ...NameType]
     >>> ps.DataFrame[("index", int), [("id", int), ("A", int)]]  # doctest: +ELLIPSIS
     typing.Tuple[...IndexNameType, ...NameType, ...NameType]
     >>> ps.DataFrame[(pdf.index.name, pdf.index.dtype), zip(pdf.columns, pdf.dtypes)]
     ...  # doctest: +ELLIPSIS
     typing.Tuple[...IndexNameType, ...NameType]
+
+    Typing data columns with an Multi-index:
+
+    >>> arrays = [[1, 1, 2], ['red', 'blue', 'red']]
+    >>> idx = pd.MultiIndex.from_arrays(arrays, names=('number', 'color'))
+    >>> pdf = pd.DataFrame({'a': range(3)}, index=idx)
+    >>> ps.DataFrame[[int, int], [int, int]]  # doctest: +ELLIPSIS
+    typing.Tuple[...IndexNameType, ...IndexNameType, ...NameType, ...NameType]
+    >>> ps.DataFrame[pdf.index.dtypes, pdf.dtypes]  # doctest: +ELLIPSIS
+    typing.Tuple[...IndexNameType, ...NameType]
+    >>> ps.DataFrame[[("index-1", int), ("index-2", int)], [("id", int), ("A", int)]]
+    ...  # doctest: +ELLIPSIS
+    typing.Tuple[...IndexNameType, ...IndexNameType, ...NameType, ...NameType]
+    >>> ps.DataFrame[zip(pdf.index.names, pdf.index.dtypes), zip(pdf.columns, pdf.dtypes)]
+    ...  # doctest: +ELLIPSIS
+    typing.Tuple[...IndexNameType, ...NameType]
     """
     return Tuple[extract_types(params)]


 # TODO(SPARK-36708): numpy.typing (numpy 1.21+) support for nested types.
 def extract_types(params: Any) -> Tuple:
     origin = params
-    if isinstance(params, zip):  # type: ignore
-        # Example:
-        #   DataFrame[zip(pdf.columns, pdf.dtypes)]
-        params = tuple(slice(name, tpe) for name, tpe in params)  # type: ignore
-    if isinstance(params, Iterable):
-        params = tuple(params)
-    else:
-        params = (params,)
+    params = _prepare_a_tuple(params)

-    if all(
-        isinstance(param, slice)
-        and param.start is not None
-        and param.step is None
-        and param.stop is not None
-        for param in params
-    ):
+    if _is_valid_slices(params):
         # Example:
         #   DataFrame["id": int, "A": int]
-        new_params = []
-        for param in params:
-            new_param = type("NameType", (NameTypeHolder,), {})  # type: Type[NameTypeHolder]
-            new_param.name = param.start
-            # When the given argument is a numpy's dtype instance.
-            new_param.tpe = param.stop.type if isinstance(param.stop, np.dtype) else param.stop
-            new_params.append(new_param)
-
+        new_params = _convert_slices_to_holders(params, is_index=False)
         return tuple(new_params)
     elif len(params) == 2 and isinstance(params[1], (zip, list, pd.Series)):
         # Example:
         #   DataFrame[int, [int, int]]
         #   DataFrame[pdf.index.dtype, pdf.dtypes]
         #   DataFrame[("index", int), [("id", int), ("A", int)]]
         #   DataFrame[(pdf.index.name, pdf.index.dtype), zip(pdf.columns, pdf.dtypes)]
+        #
+        #   DataFrame[[int, int], [int, int]]
+        #   DataFrame[pdf.index.dtypes, pdf.dtypes]
+        #   DataFrame[[("index", int), ("index-2", int)], [("id", int), ("A", int)]]
+        #   DataFrame[zip(pdf.index.names, pdf.index.dtypes), zip(pdf.columns, pdf.dtypes)]
```

Review comment: okie, sounds good to me.
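The refactor under review funnels `DataFrame["id": int, "A": int]`-style slice parameters into holder classes. A hypothetical, minimal re-creation of that slice-parsing trick for illustration (names like `Frame` and `extract_name_types` are stand-ins, not pyspark.pandas's real API):

```python
# Illustrative sketch of slice-based type parameters, as used by
# pyspark.pandas's DataFrame[...] typing syntax. All names here are
# hypothetical stand-ins, not the real implementation.
class NameTypeHolder:
    name = None
    tpe = None


def extract_name_types(params):
    """Turn Frame["id": int, "A": int]-style slice params into fresh
    holder classes carrying (name, type) pairs."""
    if not isinstance(params, tuple):  # single slice: Frame["id": int]
        params = (params,)
    holders = []
    for param in params:
        # Each "name": type pair arrives as slice(name, type, None),
        # which is why the original code validates start/stop and
        # rejects a step.
        assert isinstance(param, slice) and param.step is None
        holder = type("NameType", (NameTypeHolder,), {})
        holder.name = param.start
        holder.tpe = param.stop
        holders.append(holder)
    return tuple(holders)


class Frame:
    # __class_getitem__ is what makes Frame["id": int, "A": int] legal
    # at class-subscription time.
    def __class_getitem__(cls, params):
        return extract_name_types(params)
```

Python turns the `"id": int` subscript into `slice("id", int)`, and multiple comma-separated slices arrive as a tuple, which is the shape `extract_types` dispatches on.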
[GitHub] [spark] SparkQA commented on pull request #34038: [SPARK-36797][SQL] Union should resolve nested columns as top-level columns
SparkQA commented on pull request #34038: URL: https://github.com/apache/spark/pull/34038#issuecomment-925485874

Kubernetes integration test status failure URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/48034/
[GitHub] [spark] HyukjinKwon commented on a change in pull request #34058: [SPARK-36711][PYTHON] Support multi-index in new syntax
HyukjinKwon commented on a change in pull request #34058: URL: https://github.com/apache/spark/pull/34058#discussion_r714452107

File path: python/pyspark/pandas/typedef/typehints.py (same `@@ -673,98 +673,146 @@` hunk as in the previous comment, anchored at these added example lines:)

```diff
+        #   DataFrame[[int, int], [int, int]]
+        #   DataFrame[pdf.index.dtypes, pdf.dtypes]
```

Review comment: ohh okay, its for dtype*s*
[GitHub] [spark] HyukjinKwon commented on a change in pull request #34058: [SPARK-36711][PYTHON] Support multi-index in new syntax
HyukjinKwon commented on a change in pull request #34058: URL: https://github.com/apache/spark/pull/34058#discussion_r714451936

File path: python/pyspark/pandas/typedef/typehints.py (same `@@ -673,98 +673,146 @@` hunk, anchored at this added example line:)

```diff
+        #   DataFrame[zip(pdf.index.names, pdf.index.dtypes), zip(pdf.columns, pdf.dtypes)]
```

Review comment: I meant:

```
ps.DataFrame[(pdf.index.names, pdf.index.dtypes), zip(pdf.columns, pdf.dtypes)]
```

:-)
[GitHub] [spark] HyukjinKwon commented on pull request #33844: [SPARK-36506][PYTHON] Improve test coverage for series.py and indexes/*.py.
HyukjinKwon commented on pull request #33844: URL: https://github.com/apache/spark/pull/33844#issuecomment-925485282

@itholic mind updating the PR description too?
[GitHub] [spark] HyukjinKwon commented on a change in pull request #33989: [SPARK-36676][SQL][BUILD] Create shaded Hive module and upgrade Guava version to 30.1.1-jre
HyukjinKwon commented on a change in pull request #33989: URL: https://github.com/apache/spark/pull/33989#discussion_r714448159

File path: assembly/pom.xml (the XML element tags were stripped in extraction; only the values survive:)

```diff
@@ -165,6 +169,13 @@
       hive
+
+
+      org.apache.spark
+      spark-hive-shaded_${scala.binary.version}
+      ${project.version}
+      ${hive.deps.scope}
```

Review comment: @sunchao sorry if I missed something, but why should we redeclare this here?
[GitHub] [spark] SparkQA commented on pull request #33844: [SPARK-36506][PYTHON] Improve test coverage for series.py and indexes/*.py.
SparkQA commented on pull request #33844: URL: https://github.com/apache/spark/pull/33844#issuecomment-925480993

**[Test build #143529 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/143529/testReport)** for PR 33844 at commit [`90e7ae9`](https://github.com/apache/spark/commit/90e7ae9510345f8be6aa08d2e28eacf024cb1264).
[GitHub] [spark] daugraph commented on a change in pull request #34046: [SPARK-36804][YARN] Using the verbose parameter in yarn mode would cause application submission failure
daugraph commented on a change in pull request #34046: URL: https://github.com/apache/spark/pull/34046#discussion_r714446941

File path: resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/ClientArguments.scala

```diff
@@ -66,6 +74,15 @@ private[spark] class ClientArguments(args: Array[String]) {
       throw new IllegalArgumentException("Cannot have primary-py-file and primary-r-file" +
         " at the same time")
     }
+
+    if (verbose) {
+      logInfo("Client arguments for YARN application:")
```

Review comment: We can also avoid throwing the exception by removing the --verbose option before passing the arguments to org.apache.spark.deploy.yarn.ClientArguments.
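The alternative daugraph describes — filtering out a flag the stricter downstream parser does not understand, instead of letting it throw — can be sketched generically. The function and flag names below are illustrative, not Spark's actual fix:

```python
def strip_unsupported_flags(args, unsupported=frozenset({"--verbose", "-v"})):
    """Drop flags the downstream argument parser does not recognize,
    so delegation does not fail with an unknown-parameter error.
    A generic sketch; the real patch may handle this differently."""
    return [a for a in args if a not in unsupported]
```

The idea is that the outer launcher already consumed `--verbose`, so the inner YARN-side parser never needs to see it.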
[GitHub] [spark] cloud-fan commented on a change in pull request #34033: [SPARK-36792][SQL] InSet should handle NaN
cloud-fan commented on a change in pull request #34033: URL: https://github.com/apache/spark/pull/34033#discussion_r714446914

File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/predicates.scala

```diff
@@ -562,6 +567,8 @@ case class InSet(child: Expression, hset: Set[Any]) extends UnaryExpression with
   protected override def nullSafeEval(value: Any): Any = {
     if (set.contains(value)) {
       true
+    } else if (isNaN(value)) {
+      set.exists(isNaN(_))
```

Review comment: can we have a `hasNaN` variable to avoid repeated computation?
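cloud-fan's suggestion is to compute "does the literal set contain NaN" once, instead of re-scanning the set (`set.exists(isNaN)`) on every evaluated row. A Python sketch of that idea — Spark's real code is the Scala `InSet` expression, so this is only illustrative:

```python
import math


class InSet:
    """Sketch of the review suggestion: precompute whether the literal
    set contains NaN at construction time, not per lookup."""

    def __init__(self, hset):
        self.hset = hset
        # Computed once here, instead of scanning on every contains().
        self.has_nan = any(isinstance(v, float) and math.isnan(v) for v in hset)

    def contains(self, value):
        if value in self.hset:
            return True
        # NaN != NaN, so ordinary membership never matches a fresh NaN;
        # fall back to the precomputed flag.
        if isinstance(value, float) and math.isnan(value):
            return self.has_nan
        return False
```

This turns an O(n) scan per row into an O(1) flag check, which is exactly the repeated-computation concern raised in the review.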
[GitHub] [spark] sigmod commented on a change in pull request #34053: [SPARK-36813][SQL][PYTHON] Propose an infrastructure of as-of join and implement ps.merge_asof
sigmod commented on a change in pull request #34053: URL: https://github.com/apache/spark/pull/34053#discussion_r714446686

File path: sql/core/src/test/scala/org/apache/spark/sql/DataFrameAsOfJoinSuite.scala (new file; all quoted lines are additions)

```scala
/*
 * Licensed to the Apache Software Foundation (ASF) under one or more
 * contributor license agreements.  See the NOTICE file distributed with
 * this work for additional information regarding copyright ownership.
 * The ASF licenses this file to You under the Apache License, Version 2.0
 * (the "License"); you may not use this file except in compliance with
 * the License.  You may obtain a copy of the License at
 *
 *    http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 */

package org.apache.spark.sql

import scala.collection.JavaConverters._

import org.apache.spark.sql.execution.adaptive.AdaptiveSparkPlanHelper
import org.apache.spark.sql.functions._
import org.apache.spark.sql.test.SharedSparkSession
import org.apache.spark.sql.types._

class DataFrameAsOfJoinSuite extends QueryTest
  with SharedSparkSession
  with AdaptiveSparkPlanHelper {

  def prepareForAsOfJoin(): (DataFrame, DataFrame) = {
    val schema1 = StructType(
      StructField("a", IntegerType, false) ::
        StructField("b", StringType, false) ::
        StructField("left_val", StringType, false) :: Nil)
    val rowSeq1: List[Row] = List(Row(1, "x", "a"), Row(5, "y", "b"), Row(10, "z", "c"))
    val df1 = spark.createDataFrame(rowSeq1.asJava, schema1)

    val schema2 = StructType(
      StructField("a", IntegerType) ::
        StructField("b", StringType) ::
        StructField("right_val", IntegerType) :: Nil)
    val rowSeq2: List[Row] = List(Row(1, "v", 1), Row(2, "w", 2), Row(3, "x", 3),
      Row(6, "y", 6), Row(7, "z", 7))
    val df2 = spark.createDataFrame(rowSeq2.asJava, schema2)

    (df1, df2)
  }

  test("as-of join - simple") {
    val (df1, df2) = prepareForAsOfJoin()
    checkAnswer(
      df1.joinAsOf(
        df2, df1.col("a"), df2.col("a"), usingColumns = Seq.empty,
        joinType = "left", tolerance = null, allowExactMatches = true, direction = "backward"),
      Seq(
        Row(1, "x", "a", 1, "v", 1),
        Row(5, "y", "b", 3, "x", 3),
        Row(10, "z", "c", 7, "z", 7)
      )
    )
  }

  test("as-of join - usingColumns") {
    val (df1, df2) = prepareForAsOfJoin()
    checkAnswer(
      df1.joinAsOf(df2, df1.col("a"), df2.col("a"), usingColumns = Seq("b"),
        joinType = "left", tolerance = null, allowExactMatches = true, direction = "backward"),
      Seq(
        Row(1, "x", "a", null, null, null),
        Row(5, "y", "b", null, null, null),
        Row(10, "z", "c", 7, "z", 7)
      )
    )
  }

  test("as-of join - usingColumns, inner") {
    val (df1, df2) = prepareForAsOfJoin()
    checkAnswer(
      df1.joinAsOf(df2, df1.col("a"), df2.col("a"), usingColumns = Seq("b"),
        joinType = "inner", tolerance = null, allowExactMatches = true, direction = "backward"),
      Seq(
        Row(10, "z", "c", 7, "z", 7)
      )
    )
  }

  test("as-of join - tolerance = 1") {
    val (df1, df2) = prepareForAsOfJoin()
    checkAnswer(
      df1.joinAsOf(df2, df1.col("a"), df2.col("a"), usingColumns = Seq.empty,
        joinType = "left", tolerance = lit(1), allowExactMatches = true, direction = "backward"),
      Seq(
        Row(1, "x", "a", 1, "v", 1),
        Row(5, "y", "b", null, null, null),
        Row(10, "z", "c", null, null, null)
      )
    )
  }

  test("as-of join - allowExactMatches = false") {
    val (df1, df2) = prepareForAsOfJoin()
    checkAnswer(
      df1.joinAsOf(df2, df1.col("a"), df2.col("a"), usingColumns = Seq.empty,
        joinType = "left", tolerance = null, allowExactMatches = false, direction = "backward"),
      Seq(
        Row(1, "x", "a", null, null, null),   // <- comment anchored here
```

Review comment: In the examples in comments, non-matches' numeric columns are NaN?
[GitHub] [spark] sigmod commented on a change in pull request #34053: [SPARK-36813][SQL][PYTHON] Propose an infrastructure of as-of join and implement ps.merge_asof
sigmod commented on a change in pull request #34053: URL: https://github.com/apache/spark/pull/34053#discussion_r714443723

## File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala

```diff
@@ -2122,6 +2125,68 @@ object RewriteIntersectAll extends Rule[LogicalPlan] {
   }
 }

+/**
+ * Replaces logical [[AsOfJoin]] operator using a combination of Join and Aggregate operator.
+ *
+ * Input Pseudo-Query:
+ * {{{
+ *   SELECT * FROM left ASOF JOIN right ON (condition, as_of on(left.t, right.t), tolerance)
+ * }}}
+ *
+ * Rewritten Query:
+ * {{{
+ *   SELECT left.*, __right__.*
+ *   FROM (
+ *     SELECT
+ *       left.*,
+ *       (
+ *         SELECT MIN_BY(STRUCT(right.*), left.t - right.t)
+ *         FROM right
+ *         WHERE condition AND left.t >= right.t AND right.t >= left.t - tolerance
+ *       ) as __right__
+ *     FROM left
+ *   )
+ * }}}
+ */
+object RewriteAsOfJoin extends Rule[LogicalPlan] {
+  def apply(plan: LogicalPlan): LogicalPlan = plan.transformWithPruning(
+    _.containsPattern(AS_OF_JOIN), ruleId) {
+    case AsOfJoin(left, right, asOfCondition, condition, orderExpression, joinType) =>
+      val conditionWithOuterReference =
+        condition.map(And(_, asOfCondition)).getOrElse(asOfCondition).transformUp {
+          case a: AttributeReference if left.outputSet.contains(a) =>
+            OuterReference(a)
+        }
+      val filtered = Filter(conditionWithOuterReference, right)
+
+      val orderExpressionWithOuterReference = orderExpression.transformUp {
+        case a: AttributeReference if left.outputSet.contains(a) =>
+          OuterReference(a)
+      }
+      val rightStruct = CreateStruct(right.output)
+      val nearestRight = MinBy(rightStruct, orderExpressionWithOuterReference)
+        .toAggregateExpression()
+      val aggExpr = Alias(nearestRight, "__nearest_right__")()
+      val aggregate = Aggregate(Seq.empty, Seq(aggExpr), filtered)
+
+      val scalarSubquery = Project(
```

Review comment: Nit: projectWithScalarSubquery ?
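The rewritten query in the classdoc above can be read operationally: for each left row, the correlated scalar subquery filters the right side with the WHERE clause (which references the outer `left.t`) and keeps the right row minimizing `left.t - right.t` via MIN_BY. A naive plain-Scala rendering of that semantics, with hypothetical row shapes for illustration only:

```scala
object RewriteAsOfJoinSketch {
  // Hypothetical row shapes; `t` stands in for the as-of column.
  case class LeftRow(t: Int, leftVal: String)
  case class RightRow(t: Int, rightVal: Int)

  // Naive evaluation of the rewritten plan for one left row:
  // Filter = the WHERE clause, minBy = MIN_BY(struct(right.*), left.t - right.t).
  def nearestRight(left: LeftRow, right: Seq[RightRow], tolerance: Int): Option[RightRow] = {
    val filtered = right.filter(r => left.t >= r.t && r.t >= left.t - tolerance)
    if (filtered.isEmpty) None else Some(filtered.minBy(r => left.t - r.t))
  }

  def main(args: Array[String]): Unit = {
    val right = Seq(RightRow(1, 1), RightRow(2, 2), RightRow(3, 3), RightRow(6, 6), RightRow(7, 7))
    // left.t = 5: the WHERE clause keeps right.t in {1, 2, 3}; MIN_BY picks right.t = 3
    assert(nearestRight(LeftRow(5, "b"), right, tolerance = 10).contains(RightRow(3, 3)))
    println("ok")
  }
}
```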
## File path: sql/core/src/test/scala/org/apache/spark/sql/DataFrameAsOfJoinSuite.scala

```diff
@@ -0,0 +1,153 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql
+
+import scala.collection.JavaConverters._
+
+import org.apache.spark.sql.execution.adaptive.AdaptiveSparkPlanHelper
+import org.apache.spark.sql.functions._
+import org.apache.spark.sql.test.SharedSparkSession
+import org.apache.spark.sql.types._
+
+class DataFrameAsOfJoinSuite extends QueryTest
+  with SharedSparkSession
+  with AdaptiveSparkPlanHelper {
+
+  def prepareForAsOfJoin(): (DataFrame, DataFrame) = {
+    val schema1 = StructType(
+      StructField("a", IntegerType, false) ::
+      StructField("b", StringType, false) ::
+      StructField("left_val", StringType, false) :: Nil)
+    val rowSeq1: List[Row] = List(Row(1, "x", "a"), Row(5, "y", "b"), Row(10, "z", "c"))
+    val df1 = spark.createDataFrame(rowSeq1.asJava, schema1)
+
+    val schema2 = StructType(
+      StructField("a", IntegerType) ::
+      StructField("b", StringType) ::
+      StructField("right_val", IntegerType) :: Nil)
+    val rowSeq2: List[Row] = List(Row(1, "v", 1), Row(2, "w", 2), Row(3, "x", 3),
+      Row(6, "y", 6), Row(7, "z", 7))
+    val df2 = spark.createDataFrame(rowSeq2.asJava, schema2)
+
+    (df1, df2)
+  }
+
+  test("as-of join - simple") {
+    val (df1, df2) = prepareForAsOfJoin()
+    checkAnswer(
+      df1.joinAsOf(
+        df2, df1.col("a"), df2.col("a"), usingColumns = Seq.empty,
+        joinType = "left", tolerance = null, allowExactMatches = true, direction = "backward"),
+      Seq(
+        Row(1, "x", "a", 1, "v", 1),
+        Row(5, "y", "b", 3, "x", 3),
+        Row(10, "z", "c", 7, "z", 7)
+      )
+    )
+  }
+
+  test("as-of join - usingColumns") {
+    val (df1, df2) = prepareForAsOfJoin()
+    checkAnswer(
+      df1.joinAsOf(df2, df1.col("a"), df2.col("a"), usingColumns = Seq("b"),
+        joinType = "left", tolerance = null, allowExactMatches = true, direction = "backward"),
+      Seq(
+        Row(1, "x", "a",
```
[GitHub] [spark] SparkQA commented on pull request #34033: [SPARK-36792][SQL] InSet should handle NaN
SparkQA commented on pull request #34033: URL: https://github.com/apache/spark/pull/34033#issuecomment-925479253

**[Test build #143528 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/143528/testReport)** for PR 34033 at commit [`174ac71`](https://github.com/apache/spark/commit/174ac717066e5fce2dcb5c0cd50c8d9149fe5580).
[GitHub] [spark] daugraph commented on a change in pull request #34046: [SPARK-36804][YARN] Using the verbose parameter in yarn mode would cause application submission failure
daugraph commented on a change in pull request #34046: URL: https://github.com/apache/spark/pull/34046#discussion_r714445692

## File path: resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/ClientArguments.scala

```diff
@@ -66,6 +74,15 @@ private[spark] class ClientArguments(args: Array[String]) {
       throw new IllegalArgumentException("Cannot have primary-py-file and primary-r-file" +
         " at the same time")
     }
+
+    if (verbose) {
+      logInfo("Client arguments for YARN application:")
```

Review comment: Thanks for your review. You are right, but these are two different --verbose options: SparkSubmit passes the --verbose option through to org.apache.spark.deploy.yarn.ClientArguments.scala, which currently cannot handle --verbose. I will add a more detailed description with screenshots later.
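The point about the pass-through flag can be sketched as a tiny recursive argument loop: a parser that does not know `--verbose` throws on it, while one that consumes the flag accepts it. This is an illustrative sketch only, not Spark's actual ClientArguments code, and the `--jar` option here is hypothetical:

```scala
object VerboseArgsSketch {
  // Recursive parse loop: consume known flags, fail on anything else.
  def parse(args: List[String], verbose: Boolean = false): Boolean = args match {
    case "--verbose" :: tail      => parse(tail, verbose = true) // flag is now handled
    case "--jar" :: _ :: tail     => parse(tail, verbose)        // hypothetical option
    case Nil                      => verbose
    case unknown :: _ =>
      // Without the "--verbose" case above, a forwarded --verbose lands here
      // and the submission fails with an error like this.
      throw new IllegalArgumentException(s"Unknown/unsupported param $unknown")
  }

  def main(args: Array[String]): Unit = {
    assert(parse(List("--verbose", "--jar", "app.jar")))
    assert(!parse(List("--jar", "app.jar")))
    println("ok")
  }
}
```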
[GitHub] [spark] HyukjinKwon commented on pull request #34051: [SPARK-36809][SQL] Remove broadcast for InSubqueryExec used in DPP
HyukjinKwon commented on pull request #34051: URL: https://github.com/apache/spark/pull/34051#issuecomment-925478747

Otherwise, the change looks like it makes sense to me, too.
[GitHub] [spark] cloud-fan commented on a change in pull request #34073: [SPARK-36760][SQL][FOLLOWUP] Add interface SupportsPushDownV2Filters
cloud-fan commented on a change in pull request #34073: URL: https://github.com/apache/spark/pull/34073#discussion_r714445044

## File path: sql/catalyst/src/main/java/org/apache/spark/sql/connector/read/SupportsPushDownV2Filters.java

```diff
@@ -22,23 +22,26 @@
 /**
  * A mix-in interface for {@link ScanBuilder}. Data sources can implement this interface to
- * push down filters to the data source and reduce the size of the data to be read.
```

Review comment: Let's only change the classdoc
```
push down V2 {@link Filter}s to ...

Note that, this interface is preferred over {@link SupportsPushDownFilters},
which uses V1 Filter and is less efficient due to the internal -> external data conversion.
```
[GitHub] [spark] AmplabJenkins removed a comment on pull request #34073: [SPARK-36760][SQL][FOLLOWUP] Add interface SupportsPushDownV2Filters
AmplabJenkins removed a comment on pull request #34073: URL: https://github.com/apache/spark/pull/34073#issuecomment-925477621

Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/143523/
[GitHub] [spark] AmplabJenkins commented on pull request #34073: [SPARK-36760][SQL][FOLLOWUP] Add interface SupportsPushDownV2Filters
AmplabJenkins commented on pull request #34073: URL: https://github.com/apache/spark/pull/34073#issuecomment-925477621

Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/143523/
[GitHub] [spark] AngersZhuuuu commented on pull request #34033: [SPARK-36792][SQL] InSet should handle NaN
AngersZhuuuu commented on pull request #34033: URL: https://github.com/apache/spark/pull/34033#issuecomment-925477420

ping @cloud-fan
[GitHub] [spark] AmplabJenkins removed a comment on pull request #34033: [SPARK-36792][SQL] InSet should handle NaN
AmplabJenkins removed a comment on pull request #34033: URL: https://github.com/apache/spark/pull/34033#issuecomment-925477202

Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/143527/
[GitHub] [spark] SparkQA removed a comment on pull request #34033: [SPARK-36792][SQL] InSet should handle NaN
SparkQA removed a comment on pull request #34033: URL: https://github.com/apache/spark/pull/34033#issuecomment-925475024

**[Test build #143527 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/143527/testReport)** for PR 34033 at commit [`87df7b0`](https://github.com/apache/spark/commit/87df7b0af3b6e3e7d8b55d9c30891bca4202a862).
[GitHub] [spark] AmplabJenkins commented on pull request #34033: [SPARK-36792][SQL] InSet should handle NaN
AmplabJenkins commented on pull request #34033: URL: https://github.com/apache/spark/pull/34033#issuecomment-925477202

Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/143527/
[GitHub] [spark] SparkQA commented on pull request #34033: [SPARK-36792][SQL] InSet should handle NaN
SparkQA commented on pull request #34033: URL: https://github.com/apache/spark/pull/34033#issuecomment-925477170

**[Test build #143527 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/143527/testReport)** for PR 34033 at commit [`87df7b0`](https://github.com/apache/spark/commit/87df7b0af3b6e3e7d8b55d9c30891bca4202a862).

* This patch **fails to build**.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] [spark] SparkQA removed a comment on pull request #34073: [SPARK-36760][SQL][FOLLOWUP] Add interface SupportsPushDownV2Filters
SparkQA removed a comment on pull request #34073: URL: https://github.com/apache/spark/pull/34073#issuecomment-925377369

**[Test build #143523 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/143523/testReport)** for PR 34073 at commit [`1014995`](https://github.com/apache/spark/commit/1014995820aa9871ed9ac823775dda41d5024299).
[GitHub] [spark] SparkQA commented on pull request #34073: [SPARK-36760][SQL][FOLLOWUP] Add interface SupportsPushDownV2Filters
SparkQA commented on pull request #34073: URL: https://github.com/apache/spark/pull/34073#issuecomment-925476776

**[Test build #143523 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/143523/testReport)** for PR 34073 at commit [`1014995`](https://github.com/apache/spark/commit/1014995820aa9871ed9ac823775dda41d5024299).

* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] [spark] HyukjinKwon commented on a change in pull request #34051: [SPARK-36809][SQL] Remove broadcast for InSubqueryExec used in DPP
HyukjinKwon commented on a change in pull request #34051: URL: https://github.com/apache/spark/pull/34051#discussion_r714443615

## File path: sql/core/src/main/scala/org/apache/spark/sql/execution/subquery.scala

```diff
@@ -157,7 +161,8 @@ case class InSubqueryExec(
       child = child.canonicalized,
       plan = plan.canonicalized.asInstanceOf[BaseSubqueryExec],
       exprId = ExprId(0),
-      resultBroadcast = null)
+      resultBroadcast = null,
+      result = null)
```

Review comment: hm, IIRC when it copies, it won't copy `@transient private var result: Array[Any] = _` .. I think we won't have to move `result` into the constructor (?).
[GitHub] [spark] SparkQA commented on pull request #34033: [SPARK-36792][SQL] InSet should handle NaN
SparkQA commented on pull request #34033: URL: https://github.com/apache/spark/pull/34033#issuecomment-925475024

**[Test build #143527 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/143527/testReport)** for PR 34033 at commit [`87df7b0`](https://github.com/apache/spark/commit/87df7b0af3b6e3e7d8b55d9c30891bca4202a862).
[GitHub] [spark] HyukjinKwon commented on a change in pull request #34051: [SPARK-36809][SQL] Remove broadcast for InSubqueryExec used in DPP
HyukjinKwon commented on a change in pull request #34051: URL: https://github.com/apache/spark/pull/34051#discussion_r714442193

## File path: sql/core/src/main/scala/org/apache/spark/sql/execution/subquery.scala

```diff
@@ -104,17 +104,18 @@ case class ScalarSubquery(
 }

 /**
- * The physical node of in-subquery. This is for Dynamic Partition Pruning only, as in-subquery
- * coming from the original query will always be converted to joins.
+ * The physical node of in-subquery. When this is used for Dynamic Partition Pruning, as the pruning
+ * happens at the driver side, we don't broadcast subquery result.
+ */
 case class InSubqueryExec(
     child: Expression,
     plan: BaseSubqueryExec,
     exprId: ExprId,
-    private var resultBroadcast: Broadcast[Array[Any]] = null)
+    needBroadcast: Boolean = false,
+    private var resultBroadcast: Broadcast[Array[Any]] = null,
+    @transient private var result: Array[Any] = null)
```

Review comment: qq: why should we move this to constructor?
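The distinction being debated here (a `@transient private var` in the class body versus a constructor parameter) matters because a case class's `copy()` only carries constructor fields; body members are re-initialized in the new instance, and `@transient` additionally excludes the field from Java serialization. A small standalone illustration, unrelated to Spark's actual classes:

```scala
object TransientVarSketch {
  case class Node(id: Int) {
    // Body member, not a constructor field: copy() does not carry it over,
    // and @transient keeps it out of Java serialization as well.
    @transient private var cached: Array[Any] = _
    def prepare(): Unit = { cached = Array(1, 2, 3) }
    def cachedResult: Option[Array[Any]] = Option(cached)
  }

  def main(args: Array[String]): Unit = {
    val n = Node(0)
    n.prepare()
    val copied = n.copy()
    assert(n.cachedResult.isDefined)    // original keeps its cached state
    assert(copied.cachedResult.isEmpty) // the copy starts with cached == null
    println("ok")
  }
}
```

This is the behavior HyukjinKwon's first comment relies on: a body-level `@transient var` is simply dropped on `copy()`, so canonicalization would not need to null it out explicitly.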