[GitHub] [spark] SparkQA commented on pull request #31680: [SPARK-34568][SQL] We should respect enableHiveSupport when initialize SparkSession
SparkQA commented on pull request #31680: URL: https://github.com/apache/spark/pull/31680#issuecomment-809088540 **[Test build #136638 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/136638/testReport)** for PR 31680 at commit [`2afad3b`](https://github.com/apache/spark/commit/2afad3b82cef36abecd4d32d14cb8736d878d49d). -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] SparkQA commented on pull request #31984: [SPARK-34884][SQL] Improve dynamic partition pruning evaluation
SparkQA commented on pull request #31984: URL: https://github.com/apache/spark/pull/31984#issuecomment-809088302 **[Test build #136637 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/136637/testReport)** for PR 31984 at commit [`80f00a0`](https://github.com/apache/spark/commit/80f00a0a0d2ec68766f4a8fdbeb09378ecd02a10).
[GitHub] [spark] AmplabJenkins removed a comment on pull request #31987: [WIP][SPARK-34889][SS] Introduce MergingSessionsIterator merging elements directly which belong to the same session
AmplabJenkins removed a comment on pull request #31987: URL: https://github.com/apache/spark/pull/31987#issuecomment-809087002 Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/136627/
[GitHub] [spark] AmplabJenkins removed a comment on pull request #31989: [WIP][SPARK-34891][SS] Introduce state store manager for session window in streaming query
AmplabJenkins removed a comment on pull request #31989: URL: https://github.com/apache/spark/pull/31989#issuecomment-809086999 Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/136632/
[GitHub] [spark] AmplabJenkins removed a comment on pull request #31985: [SPARK-34860][ML] Multinomial Logistic Regression with intercept support centering
AmplabJenkins removed a comment on pull request #31985: URL: https://github.com/apache/spark/pull/31985#issuecomment-809087006 Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/41212/
[GitHub] [spark] AmplabJenkins removed a comment on pull request #31984: [SPARK-34884][SQL] Improve dynamic partition pruning evaluation
AmplabJenkins removed a comment on pull request #31984: URL: https://github.com/apache/spark/pull/31984#issuecomment-809087005 Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/136623/
[GitHub] [spark] AmplabJenkins commented on pull request #31984: [SPARK-34884][SQL] Improve dynamic partition pruning evaluation
AmplabJenkins commented on pull request #31984: URL: https://github.com/apache/spark/pull/31984#issuecomment-809087005 Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/136623/
[GitHub] [spark] AmplabJenkins commented on pull request #31989: [WIP][SPARK-34891][SS] Introduce state store manager for session window in streaming query
AmplabJenkins commented on pull request #31989: URL: https://github.com/apache/spark/pull/31989#issuecomment-809086999 Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/136632/
[GitHub] [spark] SparkQA commented on pull request #31680: [SPARK-34568][SQL] We should respect enableHiveSupport when initialize SparkSession
SparkQA commented on pull request #31680: URL: https://github.com/apache/spark/pull/31680#issuecomment-809087041 **[Test build #136636 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/136636/testReport)** for PR 31680 at commit [`f9024fe`](https://github.com/apache/spark/commit/f9024fecda1c75d631b1b8bd5b478c8ceae9de2f).
[GitHub] [spark] AmplabJenkins commented on pull request #31987: [WIP][SPARK-34889][SS] Introduce MergingSessionsIterator merging elements directly which belong to the same session
AmplabJenkins commented on pull request #31987: URL: https://github.com/apache/spark/pull/31987#issuecomment-809087002 Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/136627/
[GitHub] [spark] AmplabJenkins commented on pull request #31985: [SPARK-34860][ML] Multinomial Logistic Regression with intercept support centering
AmplabJenkins commented on pull request #31985: URL: https://github.com/apache/spark/pull/31985#issuecomment-809087006 Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/41212/
[GitHub] [spark] AmplabJenkins removed a comment on pull request #31984: [SPARK-34884][SQL] Improve dynamic partition pruning evaluation
AmplabJenkins removed a comment on pull request #31984: URL: https://github.com/apache/spark/pull/31984#issuecomment-809083608 Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/41206/
[GitHub] [spark] AmplabJenkins commented on pull request #31984: [SPARK-34884][SQL] Improve dynamic partition pruning evaluation
AmplabJenkins commented on pull request #31984: URL: https://github.com/apache/spark/pull/31984#issuecomment-809083608 Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/41206/
[GitHub] [spark] SparkQA commented on pull request #31984: [SPARK-34884][SQL] Improve dynamic partition pruning evaluation
SparkQA commented on pull request #31984: URL: https://github.com/apache/spark/pull/31984#issuecomment-809083569 Kubernetes integration test status failure URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/41206/
[GitHub] [spark] MaxGekk closed pull request #31979: [SPARK-34879][SQL] HiveInspector supports DayTimeIntervalType and YearMonthIntervalType
MaxGekk closed pull request #31979: URL: https://github.com/apache/spark/pull/31979
[GitHub] [spark] AmplabJenkins removed a comment on pull request #31901: [SPARK-34802][SQL] Move simplify expression rules before operator push down
AmplabJenkins removed a comment on pull request #31901: URL: https://github.com/apache/spark/pull/31901#issuecomment-809079486 Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/41208/
[GitHub] [spark] AmplabJenkins removed a comment on pull request #31987: [WIP][SPARK-34889][SS] Introduce MergingSessionsIterator merging elements directly which belong to the same session
AmplabJenkins removed a comment on pull request #31987: URL: https://github.com/apache/spark/pull/31987#issuecomment-809079495 Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/41210/
[GitHub] [spark] AmplabJenkins removed a comment on pull request #31979: [SPARK-34879][SQL] HiveInspector support DayTimeIntervalType and YearMonthIntervalType
AmplabJenkins removed a comment on pull request #31979: URL: https://github.com/apache/spark/pull/31979#issuecomment-809079487
[GitHub] [spark] AmplabJenkins removed a comment on pull request #31985: [SPARK-34860][ML] Multinomial Logistic Regression with intercept support centering
AmplabJenkins removed a comment on pull request #31985: URL: https://github.com/apache/spark/pull/31985#issuecomment-809079491 Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/136625/
[GitHub] [spark] AmplabJenkins removed a comment on pull request #31983: [SPARK-34882][SQL] Replace if with filter clause in RewriteDistinctAggregates
AmplabJenkins removed a comment on pull request #31983: URL: https://github.com/apache/spark/pull/31983#issuecomment-809079490 Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/41209/
[GitHub] [spark] MaxGekk commented on pull request #31979: [SPARK-34879][SQL] HiveInspector support DayTimeIntervalType and YearMonthIntervalType
MaxGekk commented on pull request #31979: URL: https://github.com/apache/spark/pull/31979#issuecomment-809080557 +1, LGTM. Merging to master. Thank you @AngersZh .
[GitHub] [spark] SparkQA commented on pull request #31958: [SPARK-34862][SQL] Support nested column in ORC vectorized reader
SparkQA commented on pull request #31958: URL: https://github.com/apache/spark/pull/31958#issuecomment-809080274 **[Test build #136635 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/136635/testReport)** for PR 31958 at commit [`9cd3bc5`](https://github.com/apache/spark/commit/9cd3bc573514a9e25f1e7364aacfb4c86c661552).
[GitHub] [spark] SparkQA commented on pull request #31983: [SPARK-34882][SQL] Replace if with filter clause in RewriteDistinctAggregates
SparkQA commented on pull request #31983: URL: https://github.com/apache/spark/pull/31983#issuecomment-809080227 **[Test build #136634 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/136634/testReport)** for PR 31983 at commit [`c727a0c`](https://github.com/apache/spark/commit/c727a0c1a14afaec6190c2950de0059a1930d749).
[GitHub] [spark] AmplabJenkins commented on pull request #31983: [SPARK-34882][SQL] Replace if with filter clause in RewriteDistinctAggregates
AmplabJenkins commented on pull request #31983: URL: https://github.com/apache/spark/pull/31983#issuecomment-809079490 Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/41209/
[GitHub] [spark] AmplabJenkins commented on pull request #31901: [SPARK-34802][SQL] Move simplify expression rules before operator push down
AmplabJenkins commented on pull request #31901: URL: https://github.com/apache/spark/pull/31901#issuecomment-809079486 Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/41208/
[GitHub] [spark] AmplabJenkins commented on pull request #31979: [SPARK-34879][SQL] HiveInspector support DayTimeIntervalType and YearMonthIntervalType
AmplabJenkins commented on pull request #31979: URL: https://github.com/apache/spark/pull/31979#issuecomment-809079487
[GitHub] [spark] AmplabJenkins commented on pull request #31987: [WIP][SPARK-34889][SS] Introduce MergingSessionsIterator merging elements directly which belong to the same session
AmplabJenkins commented on pull request #31987: URL: https://github.com/apache/spark/pull/31987#issuecomment-809079495 Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/41210/
[GitHub] [spark] AmplabJenkins commented on pull request #31985: [SPARK-34860][ML] Multinomial Logistic Regression with intercept support centering
AmplabJenkins commented on pull request #31985: URL: https://github.com/apache/spark/pull/31985#issuecomment-809079491 Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/136625/
[GitHub] [spark] MaxGekk commented on pull request #31979: [SPARK-34879][SQL] HiveInspector support DayTimeIntervalType and YearMonthIntervalType
MaxGekk commented on pull request #31979: URL: https://github.com/apache/spark/pull/31979#issuecomment-80907

> Apache Spark master branch doesn't have Hive 1.2

@dongjoon-hyun Thank you for the information. @AngersZh Sorry, I wasn't aware that it was removed.
[GitHub] [spark] maropu commented on a change in pull request #31982: [SPARK-34881][SQL] New SQL Function: TRY_CAST
maropu commented on a change in pull request #31982: URL: https://github.com/apache/spark/pull/31982#discussion_r603013688

## File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/TryCast.scala

```scala
/*
 * Licensed to the Apache Software Foundation (ASF) under one or more
 * contributor license agreements. See the NOTICE file distributed with
 * this work for additional information regarding copyright ownership.
 * The ASF licenses this file to You under the Apache License, Version 2.0
 * (the "License"); you may not use this file except in compliance with
 * the License. You may obtain a copy of the License at
 *
 *    http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 */

package org.apache.spark.sql.catalyst.expressions

import org.apache.spark.sql.catalyst.expressions.codegen._
import org.apache.spark.sql.catalyst.expressions.codegen.Block._
import org.apache.spark.sql.internal.SQLConf
import org.apache.spark.sql.types.DataType

/**
 * A special version of [[AnsiCast]]. It performs the same operation (i.e. converts a value of
 * one data type into another data type), but returns a NULL value instead of raising an error
 * when the conversion can not be performed.
 *
 * When cast from/to timezone related types, we need timeZoneId, which will be resolved with
 * session local timezone by an analyzer [[ResolveTimeZone]].
 */
@ExpressionDescription(
  usage = "_FUNC_(expr AS type) - Casts the value `expr` to the target data type `type`. " +
    "This expression is identical to CAST with `spark.sql.ansi.enabled` as true, " +
    "except it returns NULL instead of raising an error. " +
    "This expression has one major difference from `cast` with `spark.sql.ansi.enabled` as true: " +
    "when the source value can't be stored in the target integral(Byte/Short/Int/Long) type, " +
    "`try_cast` returns null instead of returning the low order bytes of the source value.",
```

Review comment: nit: `try_cast` => `_FUNC_`?

## File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/TryCast.scala

(same hunk as above, at the line `"This expression has one major difference from \`cast\` with \`spark.sql.ansi.enabled\` as true: " +`)

Review comment: ditto

## File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/parser/AstBuilder.scala

```scala
@@ -1610,6 +1610,17 @@ class AstBuilder extends SqlBaseBaseVisitor[AnyRef] with SQLConfHelper with Logging
     cast
   }

+  /**
+   * Create a [[TryCast]] expression.
+   */
+  override def visitTryCast(ctx: TryCastContext): Expression = withOrigin(ctx) {
```

Review comment: `visitCast` and `visitTryCast` are similar to each other, so how about merging their definitions in `SqlBase.sql`?

```
| cast=(CAST | TRY_CAST) '(' expression
```
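The NULL-on-failure contract described in the usage string above can be illustrated outside Spark. A minimal, hypothetical Python analogue (this is not Spark's implementation, just the semantics under review: attempt the conversion and return NULL/None instead of raising):

```python
def try_cast_to_int(value):
    """Mimic the try_cast semantics discussed above: attempt the
    conversion and return None (SQL NULL) instead of raising an error."""
    try:
        return int(value)
    except (TypeError, ValueError):
        return None

print(try_cast_to_int("123"))  # a valid cast succeeds → 123
print(try_cast_to_int("abc"))  # an invalid cast yields None, not an error
```

Note that the Spark discussion also covers overflow behavior for integral targets (returning NULL rather than the low-order bytes), which a plain Python `int` cannot demonstrate since Python integers do not overflow.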
[GitHub] [spark] sarutak commented on a change in pull request #31964: [SPARK-34872][SQL] quoteIfNeeded should quote a name which contains non-word characters
sarutak commented on a change in pull request #31964: URL: https://github.com/apache/spark/pull/31964#discussion_r603019469

## File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/package.scala

```scala
@@ -148,10 +148,10 @@ package object util extends Logging {
   }

   def quoteIfNeeded(part: String): String = {
```

Review comment: I looked into the following classes, which use `quoteIfNeeded`:

* ResolveSessionCatalog.apply
* IdentifierHelper.quoted
* MultipartIdentifierHelper.quoted
* DatabaseInSessionCatalog$.unapply
* NamespaceHelper.quoted
* Alias.sql
* AttributeReference.sql
* UnresolvedAttribute.sql
* IdentifierImpl.toString
* IdentifierHelper.quoted

Finally, I think this change doesn't break existing behavior.
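The rule under review (quote a name when it contains non-word characters) can be sketched as follows. This is a hypothetical Python analogue for illustration, not Spark's actual Scala implementation; the function name and exact character class are assumptions:

```python
import re

def quote_if_needed(part: str) -> str:
    """Backquote an identifier when it is empty or contains any character
    outside [A-Za-z0-9_]; escape embedded backticks by doubling them."""
    if part and re.fullmatch(r"[A-Za-z0-9_]+", part):
        return part
    return "`" + part.replace("`", "``") + "`"

print(quote_if_needed("col_1"))   # plain identifier, left unquoted
print(quote_if_needed("a.b"))     # contains '.', so it gets backquoted
print(quote_if_needed("we`ird"))  # embedded backtick is doubled inside quotes
```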
[GitHub] [spark] SparkQA removed a comment on pull request #31979: [SPARK-34879][SQL] HiveInspector support DayTimeIntervalType and YearMonthIntervalType
SparkQA removed a comment on pull request #31979: URL: https://github.com/apache/spark/pull/31979#issuecomment-809025371 **[Test build #136622 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/136622/testReport)** for PR 31979 at commit [`796f1f4`](https://github.com/apache/spark/commit/796f1f4177a6f6f852c220b8a9aa42d16e7518e8).
[GitHub] [spark] SparkQA commented on pull request #31979: [SPARK-34879][SQL] HiveInspector support DayTimeIntervalType and YearMonthIntervalType
SparkQA commented on pull request #31979: URL: https://github.com/apache/spark/pull/31979#issuecomment-809070189 **[Test build #136622 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/136622/testReport)** for PR 31979 at commit [`796f1f4`](https://github.com/apache/spark/commit/796f1f4177a6f6f852c220b8a9aa42d16e7518e8).

* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] [spark] HeartSaVioR edited a comment on pull request #31989: [WIP][SPARK-34891][SS] Introduce state store manager for session window in streaming query
HeartSaVioR edited a comment on pull request #31989: URL: https://github.com/apache/spark/pull/31989#issuecomment-809067233 Besides the test suite, one more thing worth addressing here is write amplification: we "blindly" replace all start times and all sessions, which can cause unnecessary writes for "unmodified" existing sessions. In most cases we expect the new inputs to be bounded and to expand the existing sessions, but with a very long watermark gap and old inputs with widely varying timestamps, the case could still happen. EDIT: I realized the logic is bound to the physical plan. Still, it seems OK to move the logic here so that the logic for storing new session windows efficiently can be bound to the state format.
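For context, the gap-based session-window semantics this state store manager deals with can be sketched as follows. This is a hypothetical Python illustration of merging events into sessions, not the proposed implementation:

```python
def merge_sessions(events, gap):
    """Group sorted event timestamps into sessions: an event within `gap`
    of the current session's end extends it; otherwise a new session starts."""
    sessions = []
    for ts in sorted(events):
        if sessions and ts - sessions[-1][1] <= gap:
            sessions[-1] = (sessions[-1][0], ts)  # extend current session
        else:
            sessions.append((ts, ts))             # start a new session
    return sessions

print(merge_sessions([1, 2, 10, 11, 30], gap=5))  # → [(1, 2), (10, 11), (30, 30)]
```

In the write-amplification discussion above, an input with timestamp 12 would touch only the (10, 11) session; ideally only that session would be rewritten to the state store, not all three.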
[GitHub] [spark] SparkQA commented on pull request #31901: [SPARK-34802][SQL] Move simplify expression rules before operator push down
SparkQA commented on pull request #31901: URL: https://github.com/apache/spark/pull/31901#issuecomment-809069105 Kubernetes integration test status failure URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/41208/
[GitHub] [spark] HeartSaVioR commented on pull request #31989: [WIP][SPARK-34891][SS] Introduce state store manager for session window in streaming query
HeartSaVioR commented on pull request #31989: URL: https://github.com/apache/spark/pull/31989#issuecomment-809067233 Besides the test suite, one more thing worth addressing here is write amplification: we "blindly" replace all start times and all sessions, which can bring unnecessary writes for "unmodified" existing sessions. In many cases we expect new inputs to be bounded and to expand existing sessions, but with a very long watermark gap and old inputs carrying various timestamps, the case can still happen.
[GitHub] [spark] SparkQA removed a comment on pull request #31985: [SPARK-34860][ML] Multinomial Logistic Regression with intercept support centering
SparkQA removed a comment on pull request #31985: URL: https://github.com/apache/spark/pull/31985#issuecomment-809042739 **[Test build #136625 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/136625/testReport)** for PR 31985 at commit [`cdaafc2`](https://github.com/apache/spark/commit/cdaafc28f458d45a6f1a257b2cea381db7a09637).
[GitHub] [spark] SparkQA commented on pull request #31985: [SPARK-34860][ML] Multinomial Logistic Regression with intercept support centering
SparkQA commented on pull request #31985: URL: https://github.com/apache/spark/pull/31985#issuecomment-809066739 **[Test build #136625 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/136625/testReport)** for PR 31985 at commit [`cdaafc2`](https://github.com/apache/spark/commit/cdaafc28f458d45a6f1a257b2cea381db7a09637).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] [spark] c21 commented on a change in pull request #31958: [SPARK-34862][SQL] Support nested column in ORC vectorized reader
c21 commented on a change in pull request #31958: URL: https://github.com/apache/spark/pull/31958#discussion_r603014404

## File path: sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala ##

@@ -838,6 +838,13 @@ object SQLConf {
      .intConf
      .createWithDefault(4096)

+  val ORC_VECTORIZED_READER_NESTED_COLUMN_ENABLED =
+    buildConf("spark.sql.orc.enableNestedColumnVectorizedReader")
+      .doc("Enables vectorized orc decoding for nested column.")
+      .version("3.2.0")
+      .booleanConf
+      .createWithDefault(true)

Review comment: @dongjoon-hyun - makes sense to me. Updated. For all reviewers, https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/136587/testReport shows the unit tests passing with the nested column vectorized reader enabled by default.

## File path: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/orc/OrcFileFormat.scala ##

@@ -131,11 +131,27 @@ class OrcFileFormat
    }
  }

+  private def supportBatchForNestedColumn(
+      sparkSession: SparkSession,
+      schema: StructType): Boolean = {
+    val hasNestedColumn = schema.map(_.dataType).exists {
+      case _: ArrayType | _: MapType | _: StructType => true
+      case _ => false
+    }
+    if (hasNestedColumn) {
+      sparkSession.sessionState.conf.orcVectorizedReaderNestedColumnEnabled
+    } else {
+      true
+    }
+  }
+
   override def supportBatch(sparkSession: SparkSession, schema: StructType): Boolean = {
     val conf = sparkSession.sessionState.conf
     conf.orcVectorizedReaderEnabled && conf.wholeStageEnabled &&
       schema.length <= conf.wholeStageMaxNumFields &&
-      schema.forall(_.dataType.isInstanceOf[AtomicType])
+      schema.forall(s => supportDataType(s.dataType) &&
+        !s.dataType.isInstanceOf[UserDefinedType[_]]) &&
+      supportBatchForNestedColumn(sparkSession, schema)

Review comment: @dongjoon-hyun - do you mean implementing the Parquet vectorized reader for nested column? I created https://issues.apache.org/jira/browse/SPARK-34863 and plan to do it after this one, thanks.

## File path: project/MimaExcludes.scala ##

@@ -417,6 +417,21 @@ object MimaExcludes {
      case _ => true
    },

+    // [SPARK-34862][SQL] Support nested column in ORC vectorized reader
+    ProblemFilters.exclude[DirectAbstractMethodProblem]("org.apache.spark.sql.vectorized.ColumnVector.getBoolean"),
+    ProblemFilters.exclude[DirectAbstractMethodProblem]("org.apache.spark.sql.vectorized.ColumnVector.getByte"),
+    ProblemFilters.exclude[DirectAbstractMethodProblem]("org.apache.spark.sql.vectorized.ColumnVector.getShort"),
+    ProblemFilters.exclude[DirectAbstractMethodProblem]("org.apache.spark.sql.vectorized.ColumnVector.getInt"),
+    ProblemFilters.exclude[DirectAbstractMethodProblem]("org.apache.spark.sql.vectorized.ColumnVector.getLong"),
+    ProblemFilters.exclude[DirectAbstractMethodProblem]("org.apache.spark.sql.vectorized.ColumnVector.getFloat"),
+    ProblemFilters.exclude[DirectAbstractMethodProblem]("org.apache.spark.sql.vectorized.ColumnVector.getDouble"),
+    ProblemFilters.exclude[DirectAbstractMethodProblem]("org.apache.spark.sql.vectorized.ColumnVector.getDecimal"),
+    ProblemFilters.exclude[DirectAbstractMethodProblem]("org.apache.spark.sql.vectorized.ColumnVector.getUTF8String"),
+    ProblemFilters.exclude[DirectAbstractMethodProblem]("org.apache.spark.sql.vectorized.ColumnVector.getBinary"),
+    ProblemFilters.exclude[DirectAbstractMethodProblem]("org.apache.spark.sql.vectorized.ColumnVector.getArray"),
+    ProblemFilters.exclude[DirectAbstractMethodProblem]("org.apache.spark.sql.vectorized.ColumnVector.getMap"),
+    ProblemFilters.exclude[DirectAbstractMethodProblem]("org.apache.spark.sql.vectorized.ColumnVector.getChild"),

Review comment: @dongjoon-hyun - updated, thanks. Sorry, I was not looking at this file very closely.
[GitHub] [spark] SparkQA commented on pull request #31984: [SPARK-34884][SQL] Improve dynamic partition pruning evaluation
SparkQA commented on pull request #31984: URL: https://github.com/apache/spark/pull/31984#issuecomment-809064991 Kubernetes integration test starting URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/41206/
[GitHub] [spark] SparkQA commented on pull request #31983: [SPARK-34882][SQL] Replace if with filter clause in RewriteDistinctAggregates
SparkQA commented on pull request #31983: URL: https://github.com/apache/spark/pull/31983#issuecomment-809064604 Kubernetes integration test unable to build dist. exiting with code: 1 URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/41209/
[GitHub] [spark] SparkQA commented on pull request #31901: [SPARK-34802][SQL] Move simplify expression rules before operator push down
SparkQA commented on pull request #31901: URL: https://github.com/apache/spark/pull/31901#issuecomment-809064539 Kubernetes integration test starting URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/41208/
[GitHub] [spark] SparkQA removed a comment on pull request #31979: [SPARK-34879][SQL] HiveInspector support DayTimeIntervalType and YearMonthIntervalType
SparkQA removed a comment on pull request #31979: URL: https://github.com/apache/spark/pull/31979#issuecomment-809024472 **[Test build #136621 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/136621/testReport)** for PR 31979 at commit [`4e88bdf`](https://github.com/apache/spark/commit/4e88bdf72919dd3c65f6ddb03e5424de4b689160).
[GitHub] [spark] SparkQA commented on pull request #31979: [SPARK-34879][SQL] HiveInspector support DayTimeIntervalType and YearMonthIntervalType
SparkQA commented on pull request #31979: URL: https://github.com/apache/spark/pull/31979#issuecomment-809063741 **[Test build #136621 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/136621/testReport)** for PR 31979 at commit [`4e88bdf`](https://github.com/apache/spark/commit/4e88bdf72919dd3c65f6ddb03e5424de4b689160).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] [spark] AmplabJenkins removed a comment on pull request #31983: [SPARK-34882][SQL] Replace if with filter clause in RewriteDistinctAggregates
AmplabJenkins removed a comment on pull request #31983: URL: https://github.com/apache/spark/pull/31983#issuecomment-809062638 Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/136626/
[GitHub] [spark] AmplabJenkins removed a comment on pull request #31979: [SPARK-34879][SQL] HiveInspector support DayTimeIntervalType and YearMonthIntervalType
AmplabJenkins removed a comment on pull request #31979: URL: https://github.com/apache/spark/pull/31979#issuecomment-809062636 Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/41207/
[GitHub] [spark] AmplabJenkins removed a comment on pull request #31680: [SPARK-34568][SQL] We should respect enableHiveSupport when initialize SparkSession
AmplabJenkins removed a comment on pull request #31680: URL: https://github.com/apache/spark/pull/31680#issuecomment-809062635 Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/136633/
[GitHub] [spark] AmplabJenkins commented on pull request #31983: [SPARK-34882][SQL] Replace if with filter clause in RewriteDistinctAggregates
AmplabJenkins commented on pull request #31983: URL: https://github.com/apache/spark/pull/31983#issuecomment-809062638 Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/136626/
[GitHub] [spark] AmplabJenkins commented on pull request #31979: [SPARK-34879][SQL] HiveInspector support DayTimeIntervalType and YearMonthIntervalType
AmplabJenkins commented on pull request #31979: URL: https://github.com/apache/spark/pull/31979#issuecomment-809062636 Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/41207/
[GitHub] [spark] AmplabJenkins commented on pull request #31680: [SPARK-34568][SQL] We should respect enableHiveSupport when initialize SparkSession
AmplabJenkins commented on pull request #31680: URL: https://github.com/apache/spark/pull/31680#issuecomment-809062635 Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/136633/
[GitHub] [spark] SparkQA commented on pull request #31989: [WIP][SPARK-34891][SS] Introduce state store manager for session window in streaming query
SparkQA commented on pull request #31989: URL: https://github.com/apache/spark/pull/31989#issuecomment-809062214 **[Test build #136632 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/136632/testReport)** for PR 31989 at commit [`a7bd8a9`](https://github.com/apache/spark/commit/a7bd8a91c52970165e668b6a8d07ade2899c915e).
[GitHub] [spark] HeartSaVioR commented on pull request #31937: [SPARK-10816][SS] Support session window natively
HeartSaVioR commented on pull request #31937: URL: https://github.com/apache/spark/pull/31937#issuecomment-809062076 I filed 5 JIRA issues for all parts, and submitted 3 PRs which do not depend on the others. The remaining 2 parts depend on the others, and I'll deal with them once their dependencies are merged.
[GitHub] [spark] SparkQA commented on pull request #31985: [SPARK-34860][ML] Multinomial Logistic Regression with intercept support centering
SparkQA commented on pull request #31985: URL: https://github.com/apache/spark/pull/31985#issuecomment-809061510 Kubernetes integration test starting URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/41205/
[GitHub] [spark] HeartSaVioR opened a new pull request #31989: [WIP][SPARK-34891][SS] Introduce state store manager for session window in streaming query
HeartSaVioR opened a new pull request #31989: URL: https://github.com/apache/spark/pull/31989 Introduction: this PR is a part of SPARK-10816 (`EventTime based sessionization (session window)`). Please refer to #31937 to see the overall view of the code change. (Note that the code diff could diverge a bit.)

### What changes were proposed in this pull request?

This PR introduces a state store manager for session window in streaming query. Session window in batch query wouldn't need to leverage a state store manager. This PR ensures versioning of the state format for the state store manager, so that we can apply further optimizations after releasing a Spark version.

StreamingSessionWindowStateManager is a trait defining the methods available in the session window state store manager. StreamingSessionWindowStateManagerBaseImpl and its subclasses are classes implementing the trait with versioning. The format of version 1 leverages two state stores to represent the session windows:

* key -> list of start times (in session window spec)
* key + start time in session window -> value

This structure is simpler than what we tried to implement previously, and also less suboptimal, as it doesn't require all values to be rewritten when any session window is added/modified/removed.

### Why are the changes needed?

This part is one of those required for implementing SPARK-10816.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

WIP (a new test suite is expected to be added, or can be skipped if we agree it can be skipped)
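The version-1 layout above can be sketched as a toy model (a language-agnostic Python sketch with invented names; the real manager is Scala code operating on Spark state stores). The point of the two-store split is that adding, modifying, or removing a single session touches only its own value entry plus the key's start-time list, instead of rewriting every session value for the key:

```python
from bisect import insort

class SessionStateV1:
    """Toy model of the version-1 two-store layout:
    one store maps a group key to the sorted start times of its sessions,
    the other maps (key, start) to that session's value."""

    def __init__(self):
        self.starts = {}   # key -> sorted list of session start times
        self.values = {}   # (key, start) -> session value

    def put(self, key, start, value):
        # Only a new (key, start) pair changes the start-time list;
        # updating an existing session rewrites just its own value.
        if (key, start) not in self.values:
            insort(self.starts.setdefault(key, []), start)
        self.values[(key, start)] = value

    def get_sessions(self, key):
        # Sessions come back ordered by start time.
        return [(s, self.values[(key, s)]) for s in self.starts.get(key, [])]

    def remove(self, key, start):
        self.starts[key].remove(start)
        del self.values[(key, start)]
```

For example, putting sessions for one key out of order still yields them sorted by start time from `get_sessions`.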
[GitHub] [spark] SparkQA commented on pull request #31988: [SPARK-34855][CORE] Avoid local lazy variable in SparkContext.getCallSite
SparkQA commented on pull request #31988: URL: https://github.com/apache/spark/pull/31988#issuecomment-809059717 **[Test build #136631 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/136631/testReport)** for PR 31988 at commit [`d2641d9`](https://github.com/apache/spark/commit/d2641d90e5a49a91312748fb655e2a7cd3790d3f).
[GitHub] [spark] viirya commented on pull request #31988: [SPARK-34855][CORE] Avoid local lazy variable in SparkContext.getCallSite
viirya commented on pull request #31988: URL: https://github.com/apache/spark/pull/31988#issuecomment-809059452 cc @HyukjinKwon @srowen @lxian
[GitHub] [spark] viirya opened a new pull request #31988: [SPARK-34855][CORE] Avoid local lazy variable in SparkContext.getCallSite
viirya opened a new pull request #31988: URL: https://github.com/apache/spark/pull/31988

### What changes were proposed in this pull request?

`SparkContext.getCallSite` uses a local lazy variable. In Scala 2.11, a local lazy val requires synchronization, so for a large number of job submissions in the same context it becomes a bottleneck. This is only for branch-2.4, as we dropped Scala 2.11 support in SPARK-26132.

### Why are the changes needed?

To avoid a possible bottleneck for a large number of job submissions in the same context.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Existing tests.
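The shape of the fix, independent of Scala's lazy-val semantics, is to compute the expensive default only when at least one override is actually missing, rather than defining it up front behind a synchronized lazy initializer. A minimal sketch (illustrative names only, not Spark's actual code):

```python
def get_call_site(local_props, compute_call_site):
    """Return (short_form, long_form), computing the expensive default lazily.

    local_props: dict that may carry user-set "shortForm"/"longForm" overrides.
    compute_call_site: zero-arg callable producing (short, long); it is called
    at most once, and only if at least one override is missing.
    """
    short = local_props.get("shortForm")
    long_ = local_props.get("longForm")
    if short is None or long_ is None:
        # Only pay for the expensive stack walk when an override is absent.
        default_short, default_long = compute_call_site()
        short = short if short is not None else default_short
        long_ = long_ if long_ is not None else default_long
    return short, long_
```

When both overrides are set, the hot path never invokes the expensive computation at all, which is what removes the contention for repeated job submissions.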
[GitHub] [spark] SparkQA commented on pull request #31680: [SPARK-34568][SQL] We should respect enableHiveSupport when initialize SparkSession
SparkQA commented on pull request #31680: URL: https://github.com/apache/spark/pull/31680#issuecomment-809059073 **[Test build #136630 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/136630/testReport)** for PR 31680 at commit [`f73421a`](https://github.com/apache/spark/commit/f73421ae2df131faeb2509083099ecbbd645a7d0).
[GitHub] [spark] SparkQA commented on pull request #31986: [SPARK-34888][SS] Introduce UpdatingSessionIterator adjusting session window on elements
SparkQA commented on pull request #31986: URL: https://github.com/apache/spark/pull/31986#issuecomment-809058962 **[Test build #136628 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/136628/testReport)** for PR 31986 at commit [`3e8dd5c`](https://github.com/apache/spark/commit/3e8dd5ccd8c2c136f5c3a4ff64269267edbdf81e).
[GitHub] [spark] SparkQA commented on pull request #31985: [SPARK-34860][ML] Multinomial Logistic Regression with intercept support centering
SparkQA commented on pull request #31985: URL: https://github.com/apache/spark/pull/31985#issuecomment-809058974 **[Test build #136629 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/136629/testReport)** for PR 31985 at commit [`461d111`](https://github.com/apache/spark/commit/461d1110da71d504780f5a1f7db07fceaf597938).
[GitHub] [spark] SparkQA commented on pull request #31987: [WIP][SPARK-34889][SS] Introduce MergingSessionsIterator merging elements directly which belong to the same session
SparkQA commented on pull request #31987: URL: https://github.com/apache/spark/pull/31987#issuecomment-809058939 **[Test build #136627 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/136627/testReport)** for PR 31987 at commit [`5020827`](https://github.com/apache/spark/commit/5020827a74b7bbe67951057ef64a09061c099d90).
[GitHub] [spark] AmplabJenkins removed a comment on pull request #31984: [SPARK-34884][SQL] Improve dynamic partition pruning evaluation
AmplabJenkins removed a comment on pull request #31984: URL: https://github.com/apache/spark/pull/31984#issuecomment-809058108 Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/136620/
[GitHub] [spark] tanelk commented on pull request #31973: [SPARK-34876][SQL] Fill defaultResult of non-nullable aggregates
tanelk commented on pull request #31973: URL: https://github.com/apache/spark/pull/31973#issuecomment-809058487 @HyukjinKwon, there is a failure on branch-2.4. I believe it is because `CountIf` has only existed since 3.0.
[GitHub] [spark] AmplabJenkins commented on pull request #31984: [SPARK-34884][SQL] Improve dynamic partition pruning evaluation
AmplabJenkins commented on pull request #31984: URL: https://github.com/apache/spark/pull/31984#issuecomment-809058108 Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/136620/
[GitHub] [spark] HeartSaVioR opened a new pull request #31987: [WIP][SPARK-34889][SS] Introduce MergingSessionsIterator merging elements directly which belong to the same session
HeartSaVioR opened a new pull request #31987: URL: https://github.com/apache/spark/pull/31987 Introduction: this PR is a part of SPARK-10816 (`EventTime based sessionization (session window)`). Please refer to #31937 to see the overall view of the code change. (Note that the code diff could diverge a bit.)

### What changes were proposed in this pull request?

This PR introduces MergingSessionsIterator, which merges elements belonging to the same session directly. MergingSessionsIterator is a variant of SortAggregateIterator which merges session windows based on the fact that input rows are sorted by "group keys + the start time of session window". When merging windows, MergingSessionsIterator also applies aggregations on the merged window, which eliminates the need to buffer inputs (which requires copying rows) and to update the session spec for each input.

MergingSessionsIterator is quite performant compared to the UpdatingSessionsIterator brought by SPARK-34888. Note that MergingSessionsIterator only applies to cases where the aggregations can be applied altogether, so there is still room for UpdatingSessionsIterator to be used.

This issue also introduces MergingSessionsExec, the physical node leveraging MergingSessionsIterator to sort the input rows and aggregate rows according to the session windows.

### Why are the changes needed?

This part is one of those required for implementing SPARK-10816.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

WIP (a new test suite is expected to be added, or can be skipped if we agree it can be skipped)
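The merge-while-aggregating idea above can be illustrated with a small sketch (a language-agnostic Python toy with invented names and a simple count aggregate; Spark's implementation is Scala operating on sorted InternalRows): because rows arrive sorted by (key, event time), each key's sessions can be produced in a single pass with one open window at a time, no buffering of input rows required.

```python
def merge_sessions(rows, gap):
    """Merge sorted (key, event_time) rows into session windows.

    rows must be sorted by (key, event_time). Each output session is
    (key, start, end, count), where end = last merged event_time + gap,
    and an event merges into the open session iff it falls before the
    session's current end.
    """
    sessions = []
    current = None  # mutable [key, start, end, count] for the open session
    for key, ts in rows:
        if current is not None and key == current[0] and ts < current[2]:
            # Event falls inside the open window: extend it and aggregate
            # in place instead of buffering the row.
            current[2] = max(current[2], ts + gap)
            current[3] += 1
        else:
            # Key changed or the gap elapsed: the open session is final.
            if current is not None:
                sessions.append(tuple(current))
            current = [key, ts, ts + gap, 1]
    if current is not None:
        sessions.append(tuple(current))
    return sessions
```

For example, with a gap of 10, events at times 0 and 5 for key "a" merge into one session ending at 15, while an event at 20 opens a new session.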
[GitHub] [spark] viirya commented on a change in pull request #31953: [SPARK-34855][CORE]spark context - avoid using local lazy val for callSite
viirya commented on a change in pull request #31953: URL: https://github.com/apache/spark/pull/31953#discussion_r603005164

## File path: core/src/main/scala/org/apache/spark/SparkContext.scala
## @@ -2186,13 +2186,22 @@ class SparkContext(config: SparkConf) extends Logging {
    * has overridden the call site using `setCallSite()`, this will return the user's version.
    */
   private[spark] def getCallSite(): CallSite = {
-    lazy val callSite = Utils.getCallSite()
-    CallSite(
-      Option(getLocalProperty(CallSite.SHORT_FORM)).getOrElse(callSite.shortForm),
-      Option(getLocalProperty(CallSite.LONG_FORM)).getOrElse(callSite.longForm)
-    )
+    if (getLocalProperty(CallSite.SHORT_FORM) == null

Review comment: This is the last issue for 2.4, I think. Okay, let me create a PR first. I can close it if the author opens his afterwards.

-- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
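The shape of this diff can be illustrated outside of Spark. In the sketch below (hypothetical, simplified types; `prop` and `computeDefault` are stand-ins for `getLocalProperty` and `Utils.getCallSite`), a local `lazy val` carries hidden locking in Scala, so the rewrite checks the properties first and computes the expensive default only when one of them is actually missing:

```scala
// Simplified sketch of the rewritten getCallSite: compute the default call
// site eagerly inside the null branch instead of via a local lazy val.
case class CallSite(shortForm: String, longForm: String)

def getCallSite(prop: String => String, computeDefault: () => CallSite): CallSite = {
  val short = prop("callSite.short")
  val long = prop("callSite.long")
  if (short == null || long == null) {
    val default = computeDefault() // evaluated at most once, with no lazy-val lock
    CallSite(Option(short).getOrElse(default.shortForm),
             Option(long).getOrElse(default.longForm))
  } else {
    CallSite(short, long)
  }
}
```

The behavior is identical to the lazy-val version: when both properties are set, `computeDefault` is never invoked.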
[GitHub] [spark] SparkQA removed a comment on pull request #31984: [SPARK-34884][SQL] Improve dynamic partition pruning evaluation
SparkQA removed a comment on pull request #31984: URL: https://github.com/apache/spark/pull/31984#issuecomment-809007618 **[Test build #136620 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/136620/testReport)** for PR 31984 at commit [`b510d7d`](https://github.com/apache/spark/commit/b510d7da21f8a92af69b1485b72aef6ad5901448). -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] SparkQA commented on pull request #31984: [SPARK-34884][SQL] Improve dynamic partition pruning evaluation
SparkQA commented on pull request #31984: URL: https://github.com/apache/spark/pull/31984#issuecomment-809053124 **[Test build #136620 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/136620/testReport)** for PR 31984 at commit [`b510d7d`](https://github.com/apache/spark/commit/b510d7da21f8a92af69b1485b72aef6ad5901448). * This patch **fails Spark unit tests**. * This patch merges cleanly. * This patch adds no public classes. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] HeartSaVioR opened a new pull request #31986: [SPARK-34888][SS] Introduce UpdatingSessionIterator adjusting session window on elements
HeartSaVioR opened a new pull request #31986: URL: https://github.com/apache/spark/pull/31986

Introduction: this PR is a part of SPARK-10816 (`EventTime based sessionization (session window)`). Please refer to #31937 to see the overall view of the code change. (Note that the code diff could diverge a bit.)

### What changes were proposed in this pull request?

This PR introduces UpdatingSessionsIterator, which analyzes neighboring elements and adjusts session information on them. UpdatingSessionsIterator calculates and updates the session window for each element in the given iterator, so that elements in the same session window carry the same session spec. Downstream operators can then apply aggregation to finally merge the elements bound to the same session window.

UpdatingSessionsIterator works on the precondition that the given iterator is sorted by "group keys + start time of session window", and its output retains that sort order.

UpdatingSessionsIterator copies the elements so it can safely update each one, and it also buffers the elements bound to the same session window. Due to these overheads, MergingSessionsIterator, which will be introduced via SPARK-34889, should be used whenever possible.

This PR also introduces UpdatingSessionsExec, the physical node that leverages UpdatingSessionsIterator to sort the input rows and update session information on them.

### Why are the changes needed?

This is one of the parts required to implement SPARK-10816.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

New test suite added.

-- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
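The "updating" approach described above can be sketched as follows. This is a hypothetical illustration (`Row`, `updateSessions`, and `gap` are made-up names): rows of one session are buffered (copied), then each is stamped with the final session window before being emitted, leaving the actual aggregation to downstream operators.

```scala
// Sketch of updating session specs on pre-sorted rows: buffer the current
// session, stamp its final (start, end) on every buffered row, then emit.
case class Row(key: String, time: Long, sessStart: Long = 0L, sessEnd: Long = 0L)

def updateSessions(sorted: Seq[Row], gap: Long): Seq[Row] = {
  val out = scala.collection.mutable.ArrayBuffer.empty[Row]
  var buf = List.empty[Row] // rows of the current session, newest first

  def flush(): Unit = if (buf.nonEmpty) {
    val times = buf.map(_.time)
    val (start, end) = (times.min, times.max + gap)
    // Copying each row is the overhead the PR description mentions.
    out ++= buf.reverse.map(_.copy(sessStart = start, sessEnd = end))
    buf = Nil
  }

  sorted.foreach { r =>
    if (buf.nonEmpty && (r.key != buf.head.key || r.time > buf.head.time + gap)) flush()
    buf = r :: buf
  }
  flush()
  out.toSeq
}
```

Note the contrast with the merging approach of SPARK-34889: here every row survives with an updated session spec, which is why buffering and copying are unavoidable.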
[GitHub] [spark] AngersZhuuuu commented on pull request #31680: [SPARK-34568][SQL] We should respect enableHiveSupport when initialize SparkSession
AngersZhuuuu commented on pull request #31680: URL: https://github.com/apache/spark/pull/31680#issuecomment-809051861 retest this please -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] HyukjinKwon commented on a change in pull request #31953: [SPARK-34855][CORE]spark context - avoid using local lazy val for callSite
HyukjinKwon commented on a change in pull request #31953: URL: https://github.com/apache/spark/pull/31953#discussion_r603003296

## File path: core/src/main/scala/org/apache/spark/SparkContext.scala
## @@ -2186,13 +2186,22 @@ class SparkContext(config: SparkConf) extends Logging {
    * has overridden the call site using `setCallSite()`, this will return the user's version.
    */
   private[spark] def getCallSite(): CallSite = {
-    lazy val callSite = Utils.getCallSite()
-    CallSite(
-      Option(getLocalProperty(CallSite.SHORT_FORM)).getOrElse(callSite.shortForm),
-      Option(getLocalProperty(CallSite.LONG_FORM)).getOrElse(callSite.longForm)
-    )
+    if (getLocalProperty(CallSite.SHORT_FORM) == null

Review comment: @viirya I believe it's fine for you to just go ahead, IMO; the author has been inactive for 4 days and this is the blocker for 2.4 (I guess?).

-- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] viirya commented on a change in pull request #31953: [SPARK-34855][CORE]spark context - avoid using local lazy val for callSite
viirya commented on a change in pull request #31953: URL: https://github.com/apache/spark/pull/31953#discussion_r603000582

## File path: core/src/main/scala/org/apache/spark/SparkContext.scala
## @@ -2186,13 +2186,22 @@ class SparkContext(config: SparkConf) extends Logging {
    * has overridden the call site using `setCallSite()`, this will return the user's version.
    */
   private[spark] def getCallSite(): CallSite = {
-    lazy val callSite = Utils.getCallSite()
-    CallSite(
-      Option(getLocalProperty(CallSite.SHORT_FORM)).getOrElse(callSite.shortForm),
-      Option(getLocalProperty(CallSite.LONG_FORM)).getOrElse(callSite.longForm)
-    )
+    if (getLocalProperty(CallSite.SHORT_FORM) == null

Review comment: @lxian Can you create a PR for branch-2.4? If you are busy, would you mind if I create a PR for branch-2.4? Thanks.

-- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] viirya commented on a change in pull request #31966: [SPARK-34638][SQL] Single field nested column prune on generator output
viirya commented on a change in pull request #31966: URL: https://github.com/apache/spark/pull/31966#discussion_r602999807

## File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/NestedColumnAliasing.scala
## @@ -231,6 +231,27 @@ object NestedColumnAliasing {
    * of it.
    */
 object GeneratorNestedColumnAliasing {
+  // Partitions `attrToAliases` based on whether the attribute is in Generator's output.
+  private def aliasesOnGeneratorOutput(
+      attrToAliases: Map[ExprId, Seq[Alias]],
+      generatorOutput: Seq[Attribute]) = {
+    val generatorOutputExprId = generatorOutput.map(_.exprId)
+    attrToAliases.partition { k =>
+      generatorOutputExprId.contains(k._1)
+    }
+  }
+
+  // Partitions `nestedFieldToAlias` based on whether the attribute of nested field extractor
+  // is in Generator's output.
+  private def nestedFieldOnGeneratorOutput(
+      nestedFieldToAlias: Map[ExtractValue, Alias],
+      generatorOutput: Seq[Attribute]) = {
+    val generatorOutputSet = AttributeSet(generatorOutput)
+    nestedFieldToAlias.partition { pair =>
+      pair._1.references.subsetOf(generatorOutputSet)
+    }
+  }

Review comment: Okay for me. I put these as functions not for reuse but to make the code look simpler.

-- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] viirya commented on a change in pull request #31966: [SPARK-34638][SQL] Single field nested column prune on generator output
viirya commented on a change in pull request #31966: URL: https://github.com/apache/spark/pull/31966#discussion_r602999635

## File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/NestedColumnAliasing.scala
## @@ -241,12 +262,69 @@ object GeneratorNestedColumnAliasing {
       // On top on `Generate`, a `Project` that might have nested column accessors.
       // We try to get alias maps for both project list and generator's children expressions.
       val exprsToPrune = projectList ++ g.generator.children
-      NestedColumnAliasing.getAliasSubMap(exprsToPrune, g.qualifiedGeneratorOutput).map {
+      NestedColumnAliasing.getAliasSubMap(exprsToPrune).map {
         case (nestedFieldToAlias, attrToAliases) =>
           // Defer updating `Generate.unrequiredChildIndex` to next round of `ColumnPruning`.
-          val newChild =
-            NestedColumnAliasing.replaceWithAliases(g, nestedFieldToAlias, attrToAliases)
-          Project(NestedColumnAliasing.getNewProjectList(projectList, nestedFieldToAlias), newChild)
+
+          val (nestedFieldsOnGenerator, nestedFieldsNotOnGenerator) =
+            nestedFieldOnGeneratorOutput(nestedFieldToAlias, g.qualifiedGeneratorOutput)
+          val (attrToAliasesOnGenerator, attrToAliasesNotOnGenerator) =
+            aliasesOnGeneratorOutput(attrToAliases, g.qualifiedGeneratorOutput)
+
+          // Push nested column accessors through `Generator`. We cannot prune on `Generator`'s
+          // output.
+          val newChild = NestedColumnAliasing.replaceWithAliases(g,
+            nestedFieldsNotOnGenerator, attrToAliasesNotOnGenerator)
+          val pushedThrough = Project(NestedColumnAliasing
+            .getNewProjectList(projectList, nestedFieldsNotOnGenerator), newChild)
+
+          // Pruning on `Generator`'s output. We only process single field case.
+          // For multiple field case, we cannot directly move field extractor into
+          // the generator expression. A workaround is to re-construct array of struct
+          // from multiple fields. But it will be more complicated and may not worth.
+          if (nestedFieldsOnGenerator.size == 1) {

Review comment: Sure.

-- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] viirya commented on a change in pull request #31966: [SPARK-34638][SQL] Single field nested column prune on generator output
viirya commented on a change in pull request #31966: URL: https://github.com/apache/spark/pull/31966#discussion_r602999476

## File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/NestedColumnAliasing.scala
## @@ -241,12 +262,69 @@ object GeneratorNestedColumnAliasing {
       // On top on `Generate`, a `Project` that might have nested column accessors.
       // We try to get alias maps for both project list and generator's children expressions.
       val exprsToPrune = projectList ++ g.generator.children
-      NestedColumnAliasing.getAliasSubMap(exprsToPrune, g.qualifiedGeneratorOutput).map {
+      NestedColumnAliasing.getAliasSubMap(exprsToPrune).map {
         case (nestedFieldToAlias, attrToAliases) =>
           // Defer updating `Generate.unrequiredChildIndex` to next round of `ColumnPruning`.
-          val newChild =
-            NestedColumnAliasing.replaceWithAliases(g, nestedFieldToAlias, attrToAliases)
-          Project(NestedColumnAliasing.getNewProjectList(projectList, nestedFieldToAlias), newChild)
+
+          val (nestedFieldsOnGenerator, nestedFieldsNotOnGenerator) =
+            nestedFieldOnGeneratorOutput(nestedFieldToAlias, g.qualifiedGeneratorOutput)
+          val (attrToAliasesOnGenerator, attrToAliasesNotOnGenerator) =
+            aliasesOnGeneratorOutput(attrToAliases, g.qualifiedGeneratorOutput)
+
+          // Push nested column accessors through `Generator`. We cannot prune on `Generator`'s
+          // output.
+          val newChild = NestedColumnAliasing.replaceWithAliases(g,
+            nestedFieldsNotOnGenerator, attrToAliasesNotOnGenerator)
+          val pushedThrough = Project(NestedColumnAliasing
+            .getNewProjectList(projectList, nestedFieldsNotOnGenerator), newChild)
+
+          // Pruning on `Generator`'s output. We only process single field case.
+          // For multiple field case, we cannot directly move field extractor into
+          // the generator expression. A workaround is to re-construct array of struct
+          // from multiple fields. But it will be more complicated and may not worth.
+          if (nestedFieldsOnGenerator.size == 1) {
+            // Only one nested column accessor.
+            // E.g., df.select(explode($"items").as("item")).select($"item.a")
+            pushedThrough match {
+              case p @ Project(_, newG: Generate) =>
+                // Replace the child expression of `ExplodeBase` generator with
+                // nested column accessor.
+                // E.g., df.select(explode($"items").as("item")) =>
+                //       df.select(explode($"items.a").as("item"))
+                val rewrittenG = newG.transformExpressions {
+                  case e: ExplodeBase =>
+                    val extractor = nestedFieldsOnGenerator.head._1.transformUp {
+                      case _: Attribute =>
+                        e.child
+                      case g: GetStructField =>
+                        ExtractValue(g.child, Literal(g.extractFieldName), SQLConf.get.resolver)
+                    }
+                    e.withNewChildren(Seq(extractor))
+                }
+
+                // As we change the child of the generator, its output data type must be updated.
+                val updatedGeneratorOutput = rewrittenG.generatorOutput
+                  .zip(rewrittenG.generator.elementSchema.toAttributes)
+                  .map { case (oldAttr, newAttr) =>
+                    newAttr.withExprId(oldAttr.exprId).withName(oldAttr.name)
+                  }
+                assert(updatedGeneratorOutput.length == rewrittenG.generatorOutput.length,

Review comment: Yeah, I think this is the same.

## File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/NestedColumnAliasing.scala
## @@ -241,12 +262,69 @@ object GeneratorNestedColumnAliasing {
       // On top on `Generate`, a `Project` that might have nested column accessors.
       // We try to get alias maps for both project list and generator's children expressions.
       val exprsToPrune = projectList ++ g.generator.children
-      NestedColumnAliasing.getAliasSubMap(exprsToPrune, g.qualifiedGeneratorOutput).map {
+      NestedColumnAliasing.getAliasSubMap(exprsToPrune).map {
         case (nestedFieldToAlias, attrToAliases) =>
           // Defer updating `Generate.unrequiredChildIndex` to next round of `ColumnPruning`.
-          val newChild =
-            NestedColumnAliasing.replaceWithAliases(g, nestedFieldToAlias, attrToAliases)
-          Project(NestedColumnAliasing.getNewProjectList(projectList, nestedFieldToAlias), newChild)
+
+          val (nestedFieldsOnGenerator, nestedFieldsNotOnGenerator) =
+            nestedFieldOnGeneratorOutput(nestedFieldToAlias, g.qualifiedGeneratorOutput)
+          val (attrToAliasesOnGenerator, attrToAliasesNotOnGenerator) =
+            aliasesOnGeneratorOutput(attrToAliases,
[GitHub] [spark] viirya commented on a change in pull request #31966: [SPARK-34638][SQL] Single field nested column prune on generator output
viirya commented on a change in pull request #31966: URL: https://github.com/apache/spark/pull/31966#discussion_r602998834

## File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/NestedColumnAliasing.scala
## @@ -241,12 +262,69 @@ object GeneratorNestedColumnAliasing {
       // On top on `Generate`, a `Project` that might have nested column accessors.
       // We try to get alias maps for both project list and generator's children expressions.
       val exprsToPrune = projectList ++ g.generator.children
-      NestedColumnAliasing.getAliasSubMap(exprsToPrune, g.qualifiedGeneratorOutput).map {
+      NestedColumnAliasing.getAliasSubMap(exprsToPrune).map {
         case (nestedFieldToAlias, attrToAliases) =>
           // Defer updating `Generate.unrequiredChildIndex` to next round of `ColumnPruning`.
-          val newChild =
-            NestedColumnAliasing.replaceWithAliases(g, nestedFieldToAlias, attrToAliases)
-          Project(NestedColumnAliasing.getNewProjectList(projectList, nestedFieldToAlias), newChild)
+
+          val (nestedFieldsOnGenerator, nestedFieldsNotOnGenerator) =
+            nestedFieldOnGeneratorOutput(nestedFieldToAlias, g.qualifiedGeneratorOutput)
+          val (attrToAliasesOnGenerator, attrToAliasesNotOnGenerator) =
+            aliasesOnGeneratorOutput(attrToAliases, g.qualifiedGeneratorOutput)
+
+          // Push nested column accessors through `Generator`. We cannot prune on `Generator`'s
+          // output.
+          val newChild = NestedColumnAliasing.replaceWithAliases(g,
+            nestedFieldsNotOnGenerator, attrToAliasesNotOnGenerator)
+          val pushedThrough = Project(NestedColumnAliasing
+            .getNewProjectList(projectList, nestedFieldsNotOnGenerator), newChild)
+
+          // Pruning on `Generator`'s output. We only process single field case.
+          // For multiple field case, we cannot directly move field extractor into
+          // the generator expression. A workaround is to re-construct array of struct
+          // from multiple fields. But it will be more complicated and may not worth.
+          if (nestedFieldsOnGenerator.size == 1) {
+            // Only one nested column accessor.
+            // E.g., df.select(explode($"items").as("item")).select($"item.a")
+            pushedThrough match {
+              case p @ Project(_, newG: Generate) =>
+                // Replace the child expression of `ExplodeBase` generator with
+                // nested column accessor.
+                // E.g., df.select(explode($"items").as("item")) =>
+                //       df.select(explode($"items.a").as("item"))

Review comment: Oh, I missed the nested column accessor on top of it, so it doesn't look correct. I will update the comment.

-- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] viirya commented on a change in pull request #31966: [SPARK-34638][SQL] Single field nested column prune on generator output
viirya commented on a change in pull request #31966: URL: https://github.com/apache/spark/pull/31966#discussion_r602997773

## File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/NestedColumnAliasing.scala
## @@ -241,12 +262,69 @@ object GeneratorNestedColumnAliasing {
       // On top on `Generate`, a `Project` that might have nested column accessors.
       // We try to get alias maps for both project list and generator's children expressions.
       val exprsToPrune = projectList ++ g.generator.children
-      NestedColumnAliasing.getAliasSubMap(exprsToPrune, g.qualifiedGeneratorOutput).map {
+      NestedColumnAliasing.getAliasSubMap(exprsToPrune).map {
         case (nestedFieldToAlias, attrToAliases) =>
           // Defer updating `Generate.unrequiredChildIndex` to next round of `ColumnPruning`.
-          val newChild =
-            NestedColumnAliasing.replaceWithAliases(g, nestedFieldToAlias, attrToAliases)
-          Project(NestedColumnAliasing.getNewProjectList(projectList, nestedFieldToAlias), newChild)
+
+          val (nestedFieldsOnGenerator, nestedFieldsNotOnGenerator) =
+            nestedFieldOnGeneratorOutput(nestedFieldToAlias, g.qualifiedGeneratorOutput)
+          val (attrToAliasesOnGenerator, attrToAliasesNotOnGenerator) =
+            aliasesOnGeneratorOutput(attrToAliases, g.qualifiedGeneratorOutput)
+
+          // Push nested column accessors through `Generator`. We cannot prune on `Generator`'s
+          // output.
+          val newChild = NestedColumnAliasing.replaceWithAliases(g,
+            nestedFieldsNotOnGenerator, attrToAliasesNotOnGenerator)
+          val pushedThrough = Project(NestedColumnAliasing
+            .getNewProjectList(projectList, nestedFieldsNotOnGenerator), newChild)
+
+          // Pruning on `Generator`'s output. We only process single field case.
+          // For multiple field case, we cannot directly move field extractor into
+          // the generator expression. A workaround is to re-construct array of struct
+          // from multiple fields. But it will be more complicated and may not worth.
+          if (nestedFieldsOnGenerator.size == 1) {
+            // Only one nested column accessor.
+            // E.g., df.select(explode($"items").as("item")).select($"item.a")
+            pushedThrough match {
+              case p @ Project(_, newG: Generate) =>
+                // Replace the child expression of `ExplodeBase` generator with
+                // nested column accessor.
+                // E.g., df.select(explode($"items").as("item")) =>
+                //       df.select(explode($"items.a").as("item"))
+                val rewrittenG = newG.transformExpressions {
+                  case e: ExplodeBase =>
+                    val extractor = nestedFieldsOnGenerator.head._1.transformUp {
+                      case _: Attribute =>
+                        e.child
+                      case g: GetStructField =>
+                        ExtractValue(g.child, Literal(g.extractFieldName), SQLConf.get.resolver)

Review comment: Let me add one.

-- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
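The single-field rewrite discussed in this thread can be reduced to a toy expression tree. The sketch below does not use the Catalyst classes (`Attr`, `GetField`, `Explode`, and `pushExtractorIntoExplode` are made-up stand-ins); it only illustrates the transform: a struct-field access over the generator's output is pushed below the explode, so only that field is ever produced.

```scala
// item.a over explode(items)  =>  explode(items.a): replace the attribute at
// the bottom of the extractor chain with the explode's own child expression.
sealed trait Expr
case class Attr(name: String) extends Expr
case class GetField(child: Expr, field: String) extends Expr
case class Explode(child: Expr) extends Expr

def pushExtractorIntoExplode(extractor: Expr, e: Explode): Explode = {
  def rewrite(x: Expr): Expr = x match {
    case _: Attr        => e.child          // the generator-output attribute
    case GetField(c, f) => GetField(rewrite(c), f)
    case other          => other
  }
  Explode(rewrite(extractor))
}
```

This mirrors the `transformUp` in the quoted diff: the leaf attribute is swapped for the explode's child, and each `GetStructField` on the way up is rebuilt over the rewritten child.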
[GitHub] [spark] zhengruifeng commented on pull request #31985: [SPARK-34860][ML] Multinomial Logistic Regression with intercept support centering
zhengruifeng commented on pull request #31985: URL: https://github.com/apache/spark/pull/31985#issuecomment-809044289 @srowen @WeichenXu123 This is the last PR for LR supporting centering -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] zhengruifeng commented on a change in pull request #31985: [SPARK-34860][ML] Multinomial Logistic Regression with intercept support centering
zhengruifeng commented on a change in pull request #31985: URL: https://github.com/apache/spark/pull/31985#discussion_r602997003 ## File path: mllib/src/test/scala/org/apache/spark/ml/classification/LogisticRegressionSuite.scala ## @@ -1863,21 +1899,125 @@ class LogisticRegressionSuite extends MLTest with DefaultReadWriteTest { 0.0, 0.0, 0.0, 0.09064661, -0.1144333, 0.3204703, -0.1621061, -0.2308192, 0.0, -0.4832131, 0.0, 0.0), isTransposed = true) -val interceptsRStd = Vectors.dense(-0.72638218, -0.01737265, 0.74375484) +val interceptsRStd = Vectors.dense(-0.69265374, -0.2260274, 0.9186811) val coefficientsR = new DenseMatrix(3, 4, Array( 0.0, 0.0, 0.01641412, 0.03570376, -0.05110822, 0.0, -0.21595670, -0.16162836, 0.0, 0.0, 0.0, 0.0), isTransposed = true) val interceptsR = Vectors.dense(-0.44707756, 0.75180900, -0.3047314) -assert(model1.coefficientMatrix ~== coefficientsRStd absTol 0.05) -assert(model1.interceptVector ~== interceptsRStd relTol 0.1) +assert(model1.coefficientMatrix ~== coefficientsRStd absTol 1e-3) +assert(model1.interceptVector ~== interceptsRStd relTol 1e-3) assert(model1.interceptVector.toArray.sum ~== 0.0 absTol eps) -assert(model2.coefficientMatrix ~== coefficientsR absTol 0.02) -assert(model2.interceptVector ~== interceptsR relTol 0.1) +assert(model2.coefficientMatrix ~== coefficientsR absTol 1e-3) +assert(model2.interceptVector ~== interceptsR relTol 1e-3) assert(model2.interceptVector.toArray.sum ~== 0.0 absTol eps) } + test("SPARK-34860: multinomial logistic regression with intercept, with small var") { Review comment: master does not pass this newly add testsuite: ``` // scalastyle:off println println("R") println(interceptsR) println(coefficientsR) println() println("model1") println(model1.interceptVector) println(model1.coefficientMatrix) println() println("model2") println(model2.interceptVector) println(model2.coefficientMatrix) println() println("R2") println(interceptsR2) println(coefficientsR2) println() println("model3") 
println(model3.interceptVector) println(model3.coefficientMatrix) // scalastyle:on println ``` this PR: ``` R [2.91748298,-17.510746,14.59326301] 0.21755977 0.01647541 0.16507778 -0.1401668 -0.244360.7564655-0.2955698 1.3262009 0.02680026 -0.77294095 0.13049206 -1.18603411 model1 [2.933958199942738,-17.543164024163175,14.609205824220437] 0.21812136899052606 0.015486127035160564 0.16560717317181253 -0.14189621394905397 -0.2454895541210769 0.7584152697648037-0.2966285999752721 1.3296192946128171 0.027368185130550855 -0.7739013967999642 0.13102142680345957 -1.187723080663763 model2 [2.933958199942738,-17.543164024163175,14.609205824220437] 0.21812136899052606 0.015486127035160564 0.16560717317181253 -0.14189621394905397 -0.2454895541210769 0.7584152697648037-0.2966285999752721 1.3296192946128171 0.027368185130550855 -0.7739013967999642 0.13102142680345957 -1.187723080663763 R2 [1.751626027,-3.9297124987,2.178086472] 0.019970169 0.079611293 0.003959452 0.110024399 -4.788494E-4 0.0010097453 -5.832701E-4 0.0 -0.01936999 -0.080851149 -0.003319687 -0.112435972 model3 [1.7516587309368687,-3.9297178332916585,2.1780591023547897] 0.0199685439000646050.079604564245496850.0039592584764418055 0.11002491382872195 -4.7805989516075794E-4 0.0010124410611496804 -5.830912612961964E-4 0.0 -0.01936890596857533-0.08084716280475213 -0.0033195486718121834 -0.1124344396230352 ``` master: ``` R [2.91748298,-17.510746,14.59326301] 0.21755977 0.01647541 0.16507778 -0.1401668 -0.244360.7564655-0.2955698 1.3262009 0.02680026 -0.77294095 0.13049206 -1.18603411 model1 [3.2289115796175536,-3.8874667667006286,0.6585551870830749] 0.21614280080869921 0.010853354751576538 0.16526956599746928 -0.16826299113708829 -0.24226138413980347 0.766137782321547 -0.2961105375461299 -0.01353727702893284 0.02611858333110428 -0.7769911370731234 0.13084097154866067 0.18180026816602116 model2 [3.2289115795385817,-3.8874667667014213,0.65855518716284] 0.216142800347921 0.01085335149421333 0.1652695665789533 
-0.16826299025797364 -0.24226138429694594 0.7661377826486023 -0.2961105377075671 -0.013537276769415511 0.026118583949024932 -0.7769911341428156 0.13084097112861381 0.18180026702738916 R2 [1.751626027,-3.9297124987,2.178086472] 0.019970169 0.079611293 0.003959452 0.110024399
[GitHub] [spark] zhengruifeng commented on a change in pull request #31985: [SPARK-34860][ML] Multinomial Logistic Regression with intercept support centering
zhengruifeng commented on a change in pull request #31985: URL: https://github.com/apache/spark/pull/31985#discussion_r602996562

## File path: mllib/src/test/scala/org/apache/spark/ml/classification/LogisticRegressionSuite.scala
## @@ -1863,21 +1899,125 @@ class LogisticRegressionSuite extends MLTest with DefaultReadWriteTest {
       0.0, 0.0, 0.0, 0.09064661,
       -0.1144333, 0.3204703, -0.1621061, -0.2308192,
       0.0, -0.4832131, 0.0, 0.0), isTransposed = true)
-    val interceptsRStd = Vectors.dense(-0.72638218, -0.01737265, 0.74375484)
+    val interceptsRStd = Vectors.dense(-0.69265374, -0.2260274, 0.9186811)

Review comment: The old `interceptsRStd` did not equal GLMNET's result [-0.69265374, -0.2260274, 0.9186811], so I think this should be a good change.

-- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] SparkQA commented on pull request #31985: [SPARK-34860][ML] Multinomial Logistic Regression with intercept support centering
SparkQA commented on pull request #31985: URL: https://github.com/apache/spark/pull/31985#issuecomment-809042739 **[Test build #136625 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/136625/testReport)** for PR 31985 at commit [`cdaafc2`](https://github.com/apache/spark/commit/cdaafc28f458d45a6f1a257b2cea381db7a09637). -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] SparkQA commented on pull request #31983: [SPARK-34882][SQL] Replace if with filter clause in RewriteDistinctAggregates
SparkQA commented on pull request #31983: URL: https://github.com/apache/spark/pull/31983#issuecomment-809042768 **[Test build #136626 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/136626/testReport)** for PR 31983 at commit [`1b94589`](https://github.com/apache/spark/commit/1b94589fa35d39ccae7e5e16aee3fd7fe8cc81dd). -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] zhengruifeng opened a new pull request #31985: [SPARK-34860][ML] Multinomial Logistic Regression with intercept support centering
zhengruifeng opened a new pull request #31985: URL: https://github.com/apache/spark/pull/31985

### What changes were proposed in this pull request?

1. Use the new `MultinomialLogisticBlockAggregator`, which supports virtual centering.
2. Remove the now-unused `BlockLogisticAggregator`.

### Why are the changes needed?

1. For better convergence.
2. Its solution is much closer to GLMNET's.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Updated and new test suites.

-- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
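A brief note on what "virtual centering" is assumed to mean here (the PR itself does not spell it out): the margin on centered features satisfies w . (x - mu) = w . x - w . mu, so the feature mean mu can be folded into a precomputed per-coefficient-vector offset instead of materializing x - mu, which would destroy sparsity. A minimal sketch with made-up function names:

```scala
// Centering the features explicitly and "virtually" give the same margin;
// the virtual form never touches x, so sparse inputs stay sparse.
def dot(a: Array[Double], b: Array[Double]): Double =
  a.zip(b).map { case (x, y) => x * y }.sum

def centeredMargin(w: Array[Double], x: Array[Double], mu: Array[Double]): Double =
  dot(w, x.zip(mu).map { case (xi, mi) => xi - mi }) // explicit centering

def virtualCenteredMargin(w: Array[Double], x: Array[Double], mu: Array[Double]): Double =
  dot(w, x) - dot(w, mu) // dot(w, mu) can be precomputed once per iteration
```

In the multinomial case this identity is applied per class-coefficient vector; the precomputed dot(w, mu) terms are what make centering essentially free per row.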
[GitHub] [spark] SparkQA commented on pull request #31979: [SPARK-34879][SQL] HiveInspector support DayTimeIntervalType and YearMonthIntervalType
SparkQA commented on pull request #31979: URL: https://github.com/apache/spark/pull/31979#issuecomment-809041822 **[Test build #136624 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/136624/testReport)** for PR 31979 at commit [`b368584`](https://github.com/apache/spark/commit/b368584c123dbbaf1fc3a1d6ca6902c097728192).
[GitHub] [spark] SparkQA commented on pull request #31984: [SPARK-34884][SQL] Improve dynamic partition pruning evaluation
SparkQA commented on pull request #31984: URL: https://github.com/apache/spark/pull/31984#issuecomment-809041800 **[Test build #136623 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/136623/testReport)** for PR 31984 at commit [`68ddc7a`](https://github.com/apache/spark/commit/68ddc7a3a328705ff266a301966db4efef3d7528).
[GitHub] [spark] AmplabJenkins removed a comment on pull request #31979: [SPARK-34879][SQL] HiveInspector support DayTimeIntervalType and YearMonthIntervalType
AmplabJenkins removed a comment on pull request #31979: URL: https://github.com/apache/spark/pull/31979#issuecomment-809041420
[GitHub] [spark] AmplabJenkins commented on pull request #31979: [SPARK-34879][SQL] HiveInspector support DayTimeIntervalType and YearMonthIntervalType
AmplabJenkins commented on pull request #31979: URL: https://github.com/apache/spark/pull/31979#issuecomment-809041420
[GitHub] [spark] HyukjinKwon commented on pull request #31984: [SPARK-34884][SQL] Improve dynamic partition pruning evaluation
HyukjinKwon commented on pull request #31984: URL: https://github.com/apache/spark/pull/31984#issuecomment-809040667 cc @maryannxue FYI
[GitHub] [spark] SparkQA commented on pull request #31979: [SPARK-34879][SQL] HiveInspector support DayTimeIntervalType and YearMonthIntervalType
SparkQA commented on pull request #31979: URL: https://github.com/apache/spark/pull/31979#issuecomment-809039750 Kubernetes integration test status failure URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/41203/
[GitHub] [spark] HyukjinKwon commented on pull request #31859: [SPARK-34769][SQL]AnsiTypeCoercion: return closest convertible type among TypeCollection
HyukjinKwon commented on pull request #31859: URL: https://github.com/apache/spark/pull/31859#issuecomment-809038253 I just found out that I mistakenly assigned it to myself, so I've removed the assignment now.
[GitHub] [spark] SparkQA commented on pull request #31979: [SPARK-34879][SQL] HiveInspector support DayTimeIntervalType and YearMonthIntervalType
SparkQA commented on pull request #31979: URL: https://github.com/apache/spark/pull/31979#issuecomment-809037801 Kubernetes integration test unable to build dist. Exiting with code: 1 URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/41204/
[GitHub] [spark] SparkQA commented on pull request #31979: [SPARK-34879][SQL] HiveInspector support DayTimeIntervalType and YearMonthIntervalType
SparkQA commented on pull request #31979: URL: https://github.com/apache/spark/pull/31979#issuecomment-809037424 Kubernetes integration test starting URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/41203/
[GitHub] [spark] Ngone51 commented on pull request #31942: [SPARK-34834][NETWORK] Fix a potential Netty memory leak in TransportResponseHandler.
Ngone51 commented on pull request #31942: URL: https://github.com/apache/spark/pull/31942#issuecomment-809035511
I'm also confused by this part. I don't see anywhere that `resp.body()` (a.k.a. the `ManagedBuffer`) is referenced before `TransportResponseHandler` handles the `ResponseMessage`.
And in the case of `ChunkFetchSuccess`, I wonder whether we release the buffer too early here, since `listener.onSuccess(...)` is executed asynchronously: https://github.com/apache/spark/blob/4b9e94c44412f399ba19e0ea90525d346942bf71/common/network-common/src/main/java/org/apache/spark/network/client/TransportResponseHandler.java#L162-L173
Another possible issue: the buffer returned by `ChunkFetchSuccess` is supposed to be released after its data has been consumed: https://github.com/apache/spark/blob/2356cdd420f600f38d0e786dc50c15f2603b7ff2/core/src/main/scala/org/apache/spark/storage/ShuffleBlockFetcherIterator.scala#L257-L259 but it seems we now only release the buffer when an exception is thrown during buffer reading, and for a normally consumed buffer we seem to forget to release it.
cc @mridulm @tgravescs @attilapiros Do you have any idea?
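The hazard being discussed can be sketched in isolation. This is a hypothetical, self-contained illustration (the class and method names are mine, not Spark's or Netty's API): a reference-counted buffer that a handler releases before an asynchronous listener gets to read it, versus the usual fix of retaining an extra reference for the consumer and releasing it only after consumption.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Minimal stand-in for a reference-counted buffer (e.g. Netty's ByteBuf).
class RefCountedBuffer {
    private final AtomicInteger refCnt = new AtomicInteger(1);
    RefCountedBuffer retain() { refCnt.incrementAndGet(); return this; }
    void release() { refCnt.decrementAndGet(); }
    boolean accessible() { return refCnt.get() > 0; }
}

public class BufferLifecycle {
    // Handler releases its reference before the async listener runs:
    // by the time the listener reads, the buffer may already be reclaimed.
    static boolean unsafeHandoff() {
        RefCountedBuffer buf = new RefCountedBuffer();
        Runnable listener = () -> { /* reads buf "later" */ };
        buf.release();   // released before the listener executes
        listener.run();  // simulates the deferred async callback
        return buf.accessible();
    }

    // Fix: retain an extra reference for the consumer before handing off;
    // the consumer releases it after the data is fully consumed.
    static boolean safeHandoff() {
        RefCountedBuffer buf = new RefCountedBuffer();
        buf.retain();    // extra ref owned by the async consumer
        buf.release();   // handler drops its own reference
        boolean aliveWhenConsumed = buf.accessible();
        buf.release();   // consumer releases after consuming
        return aliveWhenConsumed;
    }

    public static void main(String[] args) {
        System.out.println(unsafeHandoff()); // false: use-after-release hazard
        System.out.println(safeHandoff());   // true: buffer survives handoff
    }
}
```

The same retain-before-handoff / release-in-`finally` discipline is what the comment is asking about for the normally-consumed-buffer path.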
[GitHub] [spark] zhengruifeng commented on pull request #31693: [SPARK-34858][SPARK-34448][ML] Binary Logistic Regression with intercept support centering
zhengruifeng commented on pull request #31693: URL: https://github.com/apache/spark/pull/31693#issuecomment-809034717 @srowen Thanks for reviewing and merging! I will send another PR for multinomial LR.
[GitHub] [spark] HyukjinKwon commented on pull request #31976: [SPARK-34814][SQL] LikeSimplification should handle NULL
HyukjinKwon commented on pull request #31976: URL: https://github.com/apache/spark/pull/31976#issuecomment-809033091 cc @beliefer too FYI
[GitHub] [spark] HyukjinKwon closed pull request #31976: [SPARK-34814][SQL] LikeSimplification should handle NULL
HyukjinKwon closed pull request #31976: URL: https://github.com/apache/spark/pull/31976
[GitHub] [spark] HyukjinKwon commented on pull request #31976: [SPARK-34814][SQL] LikeSimplification should handle NULL
HyukjinKwon commented on pull request #31976: URL: https://github.com/apache/spark/pull/31976#issuecomment-809032964 Merged to master and branch-3.1.
[GitHub] [spark] HyukjinKwon closed pull request #31973: [SPARK-34876][SQL] Fill defaultResult of non-nullable aggregates
HyukjinKwon closed pull request #31973: URL: https://github.com/apache/spark/pull/31973
[GitHub] [spark] AngersZhuuuu commented on pull request #30212: [SPARK-33308][SQL] Refactor current grouping analytics
AngersZhuuuu commented on pull request #30212: URL: https://github.com/apache/spark/pull/30212#issuecomment-809028142 Gentle ping @cloud-fan
[GitHub] [spark] HyukjinKwon commented on pull request #31973: [SPARK-34876][SQL] Fill defaultResult of non-nullable aggregates
HyukjinKwon commented on pull request #31973: URL: https://github.com/apache/spark/pull/31973#issuecomment-809028090 Merged to master, branch-3.1, branch-3.0 and branch-2.4. cc @cloud-fan, @maryannxue, @viirya FYI