[GitHub] [spark] wayneguow commented on pull request #39819: [SPARK-42252][CORE] Deprecate spark.shuffle.unsafe.file.output.buffer and add a new config
wayneguow commented on PR #39819: URL: https://github.com/apache/spark/pull/39819#issuecomment-1409895352

cc @mccheah and @HyukjinKwon
[GitHub] [spark] peter-toth commented on a diff in pull request #37525: [SPARK-40086][SPARK-42049][SQL] Improve AliasAwareOutputPartitioning and AliasAwareQueryOutputOrdering to take all aliases into account
peter-toth commented on code in PR #37525: URL: https://github.com/apache/spark/pull/37525#discussion_r1091544557

## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/AliasAwareOutputExpression.scala (new file, @@ -0,0 +1,116 @@):

```scala
/*
 * Licensed to the Apache Software Foundation (ASF) under one or more
 * contributor license agreements. See the NOTICE file distributed with
 * this work for additional information regarding copyright ownership.
 * The ASF licenses this file to You under the Apache License, Version 2.0
 * (the "License"); you may not use this file except in compliance with
 * the License. You may obtain a copy of the License at
 *
 *    http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 */

package org.apache.spark.sql.catalyst.plans

import scala.collection.mutable

import org.apache.spark.sql.catalyst.SQLConfHelper
import org.apache.spark.sql.catalyst.expressions.{Alias, Attribute, AttributeSet, Empty2Null, Expression, NamedExpression, SortOrder}
import org.apache.spark.sql.internal.SQLConf

/**
 * A trait that provides functionality to handle aliases in the `outputExpressions`.
 */
trait AliasAwareOutputExpression extends SQLConfHelper {
  protected val aliasCandidateLimit = conf.getConf(SQLConf.EXPRESSION_PROJECTION_CANDIDATE_LIMIT)
  protected def outputExpressions: Seq[NamedExpression]

  /**
   * This method can be used to strip expression which does not affect the result, for example:
   * strip the expression which is ordering agnostic for output ordering.
   */
  protected def strip(expr: Expression): Expression = expr

  // Build an `Expression` -> `Attribute` alias map.
  // There can be multiple alias defined for the same expressions but it doesn't make sense to
  // store more than `aliasCandidateLimit` attributes for an expression. In those cases the old
  // logic handled only the last alias so we need to make sure that we give precedence to that.
  // If the `outputExpressions` contain simple attributes we need to add those too to the map.
  private lazy val aliasMap = {
    val aliases = mutable.Map[Expression, mutable.ArrayBuffer[Attribute]]()
    outputExpressions.reverse.foreach {
      case a @ Alias(child, _) =>
        val buffer = aliases.getOrElseUpdate(strip(child.canonicalized), mutable.ArrayBuffer.empty)
        if (buffer.size < aliasCandidateLimit) {
          buffer += a.toAttribute
        }
      case _ =>
    }
    outputExpressions.foreach {
      case a: Attribute if aliases.contains(a.canonicalized) =>
        val buffer = aliases(a.canonicalized)
        if (buffer.size < aliasCandidateLimit) {
          buffer += a
        }
      case _ =>
    }
    aliases
  }

  protected def hasAlias: Boolean = aliasMap.nonEmpty

  /**
   * Return a stream of expressions in which the original expression is projected with `aliasMap`.
   */
  protected def projectExpression(expr: Expression): Stream[Expression] = {
    val outputSet = AttributeSet(outputExpressions.map(_.toAttribute))
    expr.multiTransformDown {
      // Mapping with aliases
      case e: Expression if aliasMap.contains(e.canonicalized) =>
        aliasMap(e.canonicalized).toSeq ++ (if (e.containsChild.nonEmpty) Seq(e) else Seq.empty)

      // Prune if we encounter an attribute that we can't map and it is not in output set.
      // This prune will go up to the closest `multiTransformDown()` call and returns
      // `Stream.empty` there.
      case a: Attribute if !outputSet.contains(a) => Seq.empty
    }
  }
}

/**
 * A trait that handles aliases in the `orderingExpressions` to produce `outputOrdering` that
 * satisfies ordering requirements.
 */
trait AliasAwareQueryOutputOrdering[T <: QueryPlan[T]]
  extends AliasAwareOutputExpression { self: QueryPlan[T] =>
  protected def orderingExpressions: Seq[SortOrder]

  override protected def strip(expr: Expression): Expression = expr match {
    case e: Empty2Null => strip(e.child)
    case _ => expr
  }

  override final def outputOrdering: Seq[SortOrder] = {
    if (hasAlias) {
      orderingExpressions.flatMap { sortOrder =>
        val orderingSet = mutable.Set.empty[Expression]
        val sameOrderings = sortOrder.children.toStream
          .flatMap(projectExpression)
          .filter(e => orderingSet.add(e.canonicalized))
          .take(aliasCandidateLimit)
        if (sameOrderings.nonEmpty) {
          Some(sortOrder.copy(child = sameOrderings.head,
            sameOrderExpressions = sameOrderings.tail))
        } else {
          None
        }
```

Review Comment: Hmm, I might see
[GitHub] [spark] itholic opened a new pull request, #39821: [SPARK-42253][PYTHON] Add test for detecting duplicated error class
itholic opened a new pull request, #39821: URL: https://github.com/apache/spark/pull/39821

### What changes were proposed in this pull request?

This PR proposes to add a test for detecting duplicated names of error classes, to keep each error class unique.

### Why are the changes needed?

The name of an error class should be unique, so we should check whether it is duplicated or not.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Manually tested the case where `COLUMN_IN_LIST` is duplicated, as below:

```shell
==
FAIL [0.006s]: test_error_classes_duplicated (pyspark.errors.tests.test_errors.ErrorsTest)
--
Traceback (most recent call last):
...
AssertionError: False is not true : Duplicate error class: COLUMN_IN_LIST
--
Ran 2 tests in 0.007s

FAILED (failures=1)
```
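One detail worth noting about this kind of check: standard JSON parsers silently keep only the last occurrence of a duplicated key, so the duplicate has to be detected on the raw text before (or instead of) parsing. A minimal self-contained sketch of that idea in Scala (the real test lives in `pyspark.errors.tests.test_errors`; the regex and sample payload here are illustrative assumptions):

```scala
// Scan the raw JSON text for repeated top-level error-class names,
// since a JSON parser would silently drop the duplicate key.
val rawJson =
  """{ "COLUMN_IN_LIST": {}, "DIVIDE_BY_ZERO": {}, "COLUMN_IN_LIST": {} }"""
val classNamePattern = """"([A-Z_]+)"\s*:""".r

val names = classNamePattern.findAllMatchIn(rawJson).map(_.group(1)).toSeq
val duplicates = names.groupBy(identity).collect { case (n, g) if g.size > 1 => n }

if (duplicates.nonEmpty) {
  println(s"Duplicate error class: ${duplicates.mkString(", ")}")
  // prints: Duplicate error class: COLUMN_IN_LIST
}
```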
[GitHub] [spark] ulysses-you commented on a diff in pull request #37525: [SPARK-40086][SPARK-42049][SQL] Improve AliasAwareOutputPartitioning and AliasAwareQueryOutputOrdering to take all aliases into account
ulysses-you commented on code in PR #37525: URL: https://github.com/apache/spark/pull/37525#discussion_r1091535346

## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/AliasAwareOutputExpression.scala:

```scala
outputExpressions.foreach {
  case a: Attribute if aliases.contains(a.canonicalized) =>
```

Review Comment: what's the behavior of `c as a`? This code seems to return both `c` and `a`. I think the right way should be `if AttributeSet(outputExpressions).contains(a) => // add a to buffer`
[GitHub] [spark] yaooqinn commented on pull request #39555: [SPARK-42051][SQL] Codegen Support for HiveGenericUDF
yaooqinn commented on PR #39555: URL: https://github.com/apache/spark/pull/39555#issuecomment-1409881348

Belated Happy new year! Comments addressed, thank you all.
[GitHub] [spark] peter-toth commented on a diff in pull request #37525: [SPARK-40086][SPARK-42049][SQL] Improve AliasAwareOutputPartitioning and AliasAwareQueryOutputOrdering to take all aliases into account
peter-toth commented on code in PR #37525: URL: https://github.com/apache/spark/pull/37525#discussion_r1091531169

## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/AliasAwareOutputExpression.scala:

```scala
if (sameOrderings.nonEmpty) {
  Some(sortOrder.copy(child = sameOrderings.head,
    sameOrderExpressions = sameOrderings.tail))
} else {
  None
}
```

Review Comment: I'm not sure I g
[GitHub] [spark] peter-toth commented on a diff in pull request #37525: [SPARK-40086][SPARK-42049][SQL] Improve AliasAwareOutputPartitioning and AliasAwareQueryOutputOrdering to take all aliases into account
peter-toth commented on code in PR #37525: URL: https://github.com/apache/spark/pull/37525#discussion_r1091527242

## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/AliasAwareOutputExpression.scala:

```scala
outputExpressions.foreach {
  case a: Attribute if aliases.contains(a.canonicalized) =>
```

Review Comment: If we have a `c, c AS a` projection then we need to add both the original `c` attribute and `a` to the alternatives of `c`. But we don't need to add an attribute if it isn't aliased.
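To make the two-pass construction concrete, here is a self-contained toy model (plain strings instead of Catalyst expressions, so it runs without Spark) of what the map ends up containing for the `c, c AS a` projection discussed above:

```scala
import scala.collection.mutable

// (child expression, optional alias) pairs modeling the projection `c, c AS a`.
val outputExpressions = Seq(("c", None), ("c", Some("a")))
val aliasCandidateLimit = 10

val aliases = mutable.Map[String, mutable.ArrayBuffer[String]]()
// Pass 1: register aliases (reversed so the last alias gets precedence).
outputExpressions.reverse.foreach {
  case (child, Some(alias)) =>
    val buffer = aliases.getOrElseUpdate(child, mutable.ArrayBuffer.empty)
    if (buffer.size < aliasCandidateLimit) buffer += alias
  case _ =>
}
// Pass 2: append a bare attribute only if its expression was aliased somewhere.
outputExpressions.foreach {
  case (attr, None) if aliases.contains(attr) =>
    val buffer = aliases(attr)
    if (buffer.size < aliasCandidateLimit) buffer += attr
  case _ =>
}

println(aliases) // Map(c -> ArrayBuffer(a, c)): either `a` or `c` can stand in for `c`
```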
[GitHub] [spark] peter-toth commented on a diff in pull request #37525: [SPARK-40086][SPARK-42049][SQL] Improve AliasAwareOutputPartitioning and AliasAwareQueryOutputOrdering to take all aliases into account
peter-toth commented on code in PR #37525: URL: https://github.com/apache/spark/pull/37525#discussion_r1091525727

## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/AliasAwareOutputExpression.scala:

```scala
val buffer = aliases.getOrElseUpdate(strip(child.canonicalized), mutable.ArrayBuffer.empty)
```

Review Comment: Ok, thanks. I will change it today.
[GitHub] [spark] HyukjinKwon closed pull request #39695: [SPARK-42156][CONNECT] SparkConnectClient supports RetryPolicies now
HyukjinKwon closed pull request #39695: [SPARK-42156][CONNECT] SparkConnectClient supports RetryPolicies now URL: https://github.com/apache/spark/pull/39695
[GitHub] [spark] HyukjinKwon commented on pull request #39695: [SPARK-42156][CONNECT] SparkConnectClient supports RetryPolicies now
HyukjinKwon commented on PR #39695: URL: https://github.com/apache/spark/pull/39695#issuecomment-1409865640

Merged to master and branch-3.4.
[GitHub] [spark] peter-toth commented on a diff in pull request #37525: [SPARK-40086][SPARK-42049][SQL] Improve AliasAwareOutputPartitioning and AliasAwareQueryOutputOrdering to take all aliases into account
peter-toth commented on code in PR #37525: URL: https://github.com/apache/spark/pull/37525#discussion_r1091525383

## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/AliasAwareOutputExpression.scala:

```scala
val aliases = mutable.Map[Expression, mutable.ArrayBuffer[Attribute]]()
```

Review Comment: I'm ok to add a new config, but `aliasMap` will never be bigger than the projection (`outputExpressions`), so is this a real concern?
[GitHub] [spark] zhengruifeng commented on pull request #39818: [SPARK-42023][SPARK-42024][CONNECT][PYTHON] Make `createDataFrame` support `AtomicType -> StringType` coercion
zhengruifeng commented on PR #39818: URL: https://github.com/apache/spark/pull/39818#issuecomment-1409861573

@HyukjinKwon thank you for the review
[GitHub] [spark] gengliangwang closed pull request #39812: [SPARK-42243][SQL] Use `spark.sql.inferTimestampNTZInDataSources.enabled` to infer timestamp type on partition columns
gengliangwang closed pull request #39812: [SPARK-42243][SQL] Use `spark.sql.inferTimestampNTZInDataSources.enabled` to infer timestamp type on partition columns URL: https://github.com/apache/spark/pull/39812
[GitHub] [spark] gengliangwang commented on pull request #39812: [SPARK-42243][SQL] Use `spark.sql.inferTimestampNTZInDataSources.enabled` to infer timestamp type on partition columns
gengliangwang commented on PR #39812: URL: https://github.com/apache/spark/pull/39812#issuecomment-1409859608

Merging to master/3.4. cc @xinrong-meng
[GitHub] [spark] HyukjinKwon closed pull request #39818: [SPARK-42023][SPARK-42024][CONNECT][PYTHON] Make `createDataFrame` support `AtomicType -> StringType` coercion
HyukjinKwon closed pull request #39818: [SPARK-42023][SPARK-42024][CONNECT][PYTHON] Make `createDataFrame` support `AtomicType -> StringType` coercion URL: https://github.com/apache/spark/pull/39818
[GitHub] [spark] HyukjinKwon commented on pull request #39818: [SPARK-42023][SPARK-42024][CONNECT][PYTHON] Make `createDataFrame` support `AtomicType -> StringType` coercion
HyukjinKwon commented on PR #39818: URL: https://github.com/apache/spark/pull/39818#issuecomment-1409835073

Merged to master and branch-3.4.
[GitHub] [spark] gengliangwang commented on a diff in pull request #39773: [SPARK-42217][SQL] Support implicit lateral column alias in queries with Window
gengliangwang commented on code in PR #39773: URL: https://github.com/apache/spark/pull/39773#discussion_r1091498423

## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/ResolveLateralColumnAliasReference.scala:

```diff
@@ -177,8 +195,18 @@
       val newAggExprs = collection.mutable.Set.empty[NamedExpression]
       val expressionMap = collection.mutable.LinkedHashMap.empty[Expression, NamedExpression]
-      val projectExprs = aggregateExpressions.map { exp =>
-        exp.transformDown {
+      // Extract the expressions to keep in the Aggregate. Return the transformed expression
+      // fully substituted with the attribute reference to the extracted expressions.
+      def extractExpressions(expr: Expression): Expression = {
+        expr match {
+          case w @ WindowExpression(function, spec) =>
+            // Manually skip the handling on the function itself but iterate on its children
+            // instead. This is to avoid extracting a window expression's aggregate functions.
+            // For example, for `sum(sum(col1)) over (..)`, we should extract sum(col1) but not
```

Review Comment: TBH I find this comment confusing... Can you provide more details?
[GitHub] [spark] dongjoon-hyun commented on pull request #39811: [SPARK-42242][BUILD] Upgrade `snappy-java` to 1.1.9.0
dongjoon-hyun commented on PR #39811: URL: https://github.com/apache/spark/pull/39811#issuecomment-1409795852

It's insufficient, @LuciferYang. The runtime class library (`rt.jar`) is required during building; otherwise, a Java 11 build assumes the Java 11 `rt.jar`.
[GitHub] [spark] gengliangwang commented on a diff in pull request #39812: [SPARK-42243][SQL] Use `spark.sql.inferTimestampNTZInDataSources.enabled` to infer timestamp type on partition columns
gengliangwang commented on code in PR #39812: URL: https://github.com/apache/spark/pull/39812#discussion_r1091472715

## sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala:

```diff
@@ -3511,8 +3511,9 @@
     buildConf("spark.sql.inferTimestampNTZInDataSources.enabled")
       .doc("When true, the TimestampNTZ type is the prior choice of the schema inference " +
         "over built-in data sources. Otherwise, the inference result will be TimestampLTZ for " +
-        "backward compatibility. As a result, for JSON/CSV files written with TimestampNTZ " +
-        "columns, the inference results will still be of TimestampLTZ types.")
+        "backward compatibility. As a result, for JSON/CSV files and partition directories " +
+        "written with TimestampNTZ columns, the inference results will still be of TimestampLTZ " +
+        "types.")
       .version("3.4.0")
       .booleanConf
       .createWithDefault(false)
```

Review Comment: Yup. Take partition directory naming as an example: the outputs from Timestamp NTZ and LTZ are exactly the same.
[GitHub] [spark] cloud-fan commented on a diff in pull request #39812: [SPARK-42243][SQL] Use `spark.sql.inferTimestampNTZInDataSources.enabled` to infer timestamp type on partition columns
cloud-fan commented on code in PR #39812: URL: https://github.com/apache/spark/pull/39812#discussion_r1091468483

## sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala:

```scala
buildConf("spark.sql.inferTimestampNTZInDataSources.enabled")
  ...
  .booleanConf
  .createWithDefault(false)
```

Review Comment: does it mean users can't do an NTZ roundtrip (write and read) in 3.4 by default?
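The roundtrip question is easy to check interactively. A hedged sketch (spark-shell style, assuming Spark 3.4 behavior as described in the diff above; the path is arbitrary): with the flag at its default of `false`, a TimestampNTZ partition column is read back as TimestampLTZ.

```scala
import java.time.LocalDateTime
import spark.implicits._

// LocalDateTime maps to TimestampNTZType on write.
val df = Seq((1, LocalDateTime.parse("2023-01-01T00:00:00"))).toDF("id", "ts")
df.write.partitionBy("ts").parquet("/tmp/ntz_roundtrip")

// Default (false): the partition column is inferred back as TimestampType (LTZ).
spark.read.parquet("/tmp/ntz_roundtrip").schema("ts").dataType

// Opt in: the same directory is now inferred as TimestampNTZType.
spark.conf.set("spark.sql.inferTimestampNTZInDataSources.enabled", "true")
spark.read.parquet("/tmp/ntz_roundtrip").schema("ts").dataType
```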
[GitHub] [spark] cloud-fan commented on a diff in pull request #39747: [SPARK-42191][SQL] Support udf 'luhn_check'
cloud-fan commented on code in PR #39747: URL: https://github.com/apache/spark/pull/39747#discussion_r1091465324

## sql/catalyst/src/main/java/org/apache/spark/sql/catalyst/expressions/ExpressionImplUtils.java:

```java
  /**
   * Function to check if a given number string is a valid Luhn number
   * @param numberString
   *        the number string to check
   * @return
   *        true if the number string is a valid Luhn number, false otherwise.
   */
  public static boolean isLuhnNumber(UTF8String numberString) {
    String digits = numberString.toString();
    // Check if all characters in the input string are digits.
    if (digits.isEmpty()) return false;
      int checkSum = 0;
```

Review Comment: nit: the indentation is off.
[GitHub] [spark] cloud-fan commented on a diff in pull request #39747: [SPARK-42191][SQL] Support udf 'luhn_check'
cloud-fan commented on code in PR #39747: URL: https://github.com/apache/spark/pull/39747#discussion_r1091465224

## sql/catalyst/src/main/java/org/apache/spark/sql/catalyst/expressions/ExpressionImplUtils.java:

```java
String digits = numberString.toString();
// Check if all characters in the input string are digits.
```

Review Comment:
```suggestion
// Empty string is not a valid Luhn number
```
[GitHub] [spark] itholic commented on pull request #39820: [SPARK-42249][SQL] Refining html link for documentation in error messages.
itholic commented on PR #39820: URL: https://github.com/apache/spark/pull/39820#issuecomment-1409760795

cc @srielau @MaxGekk This basically refines the documentation HTML link strings in the error classes.
[GitHub] [spark] itholic opened a new pull request, #39820: [SPARK-42249][SQL] Refining html link for documentation in error messages.
itholic opened a new pull request, #39820: URL: https://github.com/apache/spark/pull/39820

### What changes were proposed in this pull request?

This PR proposes to refine the HTML links for documentation in error messages by introducing `Utils.DOC_ROOT_DIR`, which contains the global documentation root link: `https://spark.apache.org/docs/latest`

### Why are the changes needed?

To improve error class readability and to make sure the right documentation root directory is used across the whole codebase.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Updated all the existing tests related to these changes. Basically, run `./build/sbt "sql/testOnly org.apache.spark.sql.SQLQueryTestSuite*"`
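The shape of the change is simple; a hypothetical sketch (the constant name follows the PR description, but the call site below is an illustrative assumption, not the actual patch):

```scala
object Utils {
  // Single source of truth for the documentation root used in error messages.
  val DOC_ROOT_DIR = "https://spark.apache.org/docs/latest"
}

// An error message then interpolates the constant instead of hard-coding the URL:
val msg =
  s"To fix this, consider enabling ANSI mode: ${Utils.DOC_ROOT_DIR}/sql-ref-ansi-compliance.html"
```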
[GitHub] [spark] cloud-fan commented on a diff in pull request #39747: [SPARK-42191][SQL] Support udf 'luhn_check'
cloud-fan commented on code in PR #39747: URL: https://github.com/apache/spark/pull/39747#discussion_r1091450516

## sql/catalyst/src/main/java/org/apache/spark/sql/catalyst/expressions/ExpressionImplUtils.java:

```java
if (!digits.isEmpty() && digits.chars().allMatch(Character::isDigit)) {
```

Review Comment: As @srowen mentioned, how about returning false in the for loop if a space is found? The code can be

```java
if (digits.isEmpty()) return false;
for ...
  if (!Character.isDigit(digits.charAt(i))) return false;
...
```
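Filling in the ellipses, a complete single-pass version of the check being suggested, written here as a standalone Scala sketch (not the actual patch): empty input is rejected up front, the scan bails out on the first non-digit (which also covers spaces), and the standard Luhn rule doubles every second digit from the right.

```scala
def isLuhnNumber(digits: String): Boolean = {
  if (digits.isEmpty) return false
  var checkSum = 0
  var doubleDigit = false
  var i = digits.length - 1
  while (i >= 0) {
    val c = digits.charAt(i)
    if (!Character.isDigit(c)) return false // short-circuit: also rejects spaces
    var d = c - '0'
    if (doubleDigit) {   // double every second digit from the right
      d *= 2
      if (d > 9) d -= 9  // same as summing the two digits of d
    }
    checkSum += d
    doubleDigit = !doubleDigit
    i -= 1
  }
  checkSum % 10 == 0
}

// isLuhnNumber("6011000990139424") == true; isLuhnNumber("411 1") == false
```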
[GitHub] [spark] LuciferYang commented on pull request #39811: [SPARK-42242][BUILD] Upgrade `snappy-java` to 1.1.9.0
LuciferYang commented on PR #39811: URL: https://github.com/apache/spark/pull/39811#issuecomment-1409758415

https://github.com/xerial/snappy-java/blob/master/build.sbt#L30

Interesting. It seems that the compilation target is still Java 8...
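The `rt.jar` point above and the "still Java 8" observation are two sides of the same javac behavior. A hypothetical `build.sbt` fragment illustrating the difference (not the actual snappy-java settings):

```scala
// Sets only the bytecode level; the code is still compiled against the
// running JDK's class library, so newer-API usages slip through.
javacOptions ++= Seq("-source", "1.8", "-target", "1.8")

// Sets the bytecode level AND checks calls against the Java 8 API
// signatures, without needing a Java 8 rt.jar on the build machine.
javacOptions ++= Seq("--release", "8")
```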
[GitHub] [spark] vinodkc commented on a diff in pull request #39747: [SPARK-42191][SQL] Support udf 'luhn_check'
vinodkc commented on code in PR #39747: URL: https://github.com/apache/spark/pull/39747#discussion_r1091447034

## sql/core/src/test/resources/sql-tests/inputs/string-functions.sql:

```sql
SELECT luhn_check(c1) from values ("4111"), ("5105105105105106") as tab(c1);
```

Review Comment: Removed `from values` tests

## sql/core/src/test/scala/org/apache/spark/sql/ColumnExpressionSuite.scala:

```scala
test("luhn check") {
```

Review Comment: Done
[GitHub] [spark] vinodkc commented on a diff in pull request #39747: [SPARK-42191][SQL] Support udf 'luhn_check'
vinodkc commented on code in PR #39747: URL: https://github.com/apache/spark/pull/39747#discussion_r1091446740

## sql/core/src/test/resources/sql-tests/inputs/string-functions.sql:

```sql
-- luhn_check
-- basic cases
select luhn_check('4111');
select luhn_check('5504');
select luhn_check('349');
select luhn_check('60110004');
select luhn_check('378282246310005');
select luhn_check('6011000990139424');
```

Review Comment: Updated input values to cover all code branches.
[GitHub] [spark] cxzl25 commented on pull request #39798: [MINOR] Fix typo `Exlude` to `Exclude` in `HealthTracker`
cxzl25 commented on PR #39798: URL: https://github.com/apache/spark/pull/39798#issuecomment-1409749792

> Could you try rerunning tests? The failure isn't related, I'm sure

Thanks, now GA passed.
[GitHub] [spark] vinodkc commented on a diff in pull request #39747: [SPARK-42191][SQL] Support udf 'luhn_check'
vinodkc commented on code in PR #39747: URL: https://github.com/apache/spark/pull/39747#discussion_r1091446189

## sql/catalyst/src/main/java/org/apache/spark/sql/catalyst/expressions/ExpressionImplUtils.java:

```java
if (!digits.trim().isEmpty() && digits.chars().allMatch(Character::isDigit)) {
```

Review Comment: trim is not required, removed it now

## sql/core/src/test/resources/sql-tests/inputs/string-functions.sql:

```sql
select luhn_check('0');
```

Review Comment: Done
[GitHub] [spark] vinodkc commented on a diff in pull request #39747: [SPARK-42191][SQL] Support udf 'luhn_check'
vinodkc commented on code in PR #39747: URL: https://github.com/apache/spark/pull/39747#discussion_r1091445199

## sql/catalyst/src/main/java/org/apache/spark/sql/catalyst/expressions/ExpressionImplUtils.java:

```java
String digits = numberString.toString();
// Check if all characters in the input string are digits.
java.util.function.Predicate<Character> isDigit = c -> Character.isDigit(c);
java.util.function.Predicate<Character> isNotWhitespace = c -> !Character.isWhitespace(c);
if (!digits.isEmpty() &&
    digits.chars().allMatch(
        c -> Character.isDigit(c) && !Character.isWhitespace((char) c))) {
```

Review Comment: By mistake that change got added, fixed the condition now
[GitHub] [spark] wayneguow opened a new pull request, #39819: [SPARK-42252][Core] Deprecate spark.shuffle.unsafe.file.output.buffer and add a new config
wayneguow opened a new pull request, #39819: URL: https://github.com/apache/spark/pull/39819 ### What changes were proposed in this pull request? Deprecate spark.shuffle.unsafe.file.output.buffer and add a new config spark.shuffle.localDisk.file.output.buffer instead. ### Why are the changes needed? The old config is designed to be used in UnsafeShuffleWriter, but now it is used by all local shuffle writers through LocalDiskShuffleMapOutputWriter. ### Does this PR introduce _any_ user-facing change? The old config still works, but users are advised to switch to the new one. ### How was this patch tested? Passed existing tests. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
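For illustration only, switching an application to the new config might look like the sketch below (the buffer size shown is an arbitrary example, not the PR's default):

```scala
import org.apache.spark.SparkConf

// Prefer the new config name; the deprecated
// spark.shuffle.unsafe.file.output.buffer still works for now.
val conf = new SparkConf()
  .setAppName("shuffle-buffer-example")
  .set("spark.shuffle.localDisk.file.output.buffer", "64k")
```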
[GitHub] [spark] srowen commented on a diff in pull request #39747: [SPARK-42191][SQL] Support udf 'luhn_check'
srowen commented on code in PR #39747: URL: https://github.com/apache/spark/pull/39747#discussion_r1091441853 ## sql/catalyst/src/main/java/org/apache/spark/sql/catalyst/expressions/ExpressionImplUtils.java: ## @@ -35,6 +35,39 @@ public class ExpressionImplUtils { private static final int GCM_IV_LEN = 12; private static final int GCM_TAG_LEN = 128; + /** + * Function to check if a given number string is a valid Luhn number + * @param numberString + * the number string to check + * @return + * true if the number string is a valid Luhn number, false otherwise. + */ + public static boolean isLuhnNumber(UTF8String numberString) { +String digits = numberString.toString(); +// Check if all characters in the input string are digits. +java.util.function.Predicate<Character> isDigit = c -> Character.isDigit(c); +java.util.function.Predicate<Character> isNotWhitespace = c -> !Character.isWhitespace(c); +if (!digits.isEmpty() && +digits.chars().allMatch( +c -> Character.isDigit(c) && !Character.isWhitespace((char) c))) { Review Comment: (It does not.) BTW, I think using allMatch is going to add non-trivial overhead. You're already iterating over the chars; just check in the loop below and short-circuit if something isn't a digit -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
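To make the suggestion concrete, here is a minimal Scala sketch (my illustration, not the PR's Java code) of a single pass that validates characters and computes the Luhn sum at the same time, short-circuiting on the first non-digit:

```scala
// Single-pass Luhn check: no separate allMatch scan over the string.
def isLuhnNumber(s: String): Boolean = {
  if (s.isEmpty) return false
  var sum = 0
  var doubleDigit = false // every second digit from the right is doubled
  var i = s.length - 1
  while (i >= 0) {
    val c = s.charAt(i)
    if (!c.isDigit) return false // short-circuit instead of a second pass
    var d = c - '0'
    if (doubleDigit) {
      d *= 2
      if (d > 9) d -= 9
    }
    sum += d
    doubleDigit = !doubleDigit
    i -= 1
  }
  sum % 10 == 0
}
```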
[GitHub] [spark] anchovYu commented on a diff in pull request #39508: [SPARK-41985][SQL] Centralize more column resolution rules
anchovYu commented on code in PR #39508: URL: https://github.com/apache/spark/pull/39508#discussion_r1091428757 ## sql/core/src/test/resources/sql-tests/inputs/column-resolution-aggregate.sql: ## @@ -0,0 +1,30 @@ +-- Tests covering column resolution priority in Aggregate. + +CREATE TEMPORARY VIEW v1 AS VALUES (1, 1, 1), (2, 2, 1) AS t(a, b, k); +CREATE TEMPORARY VIEW v2 AS VALUES (1, 1, 1), (2, 2, 1) AS t(x, y, all); + +-- Relation output columns have higher priority than lateral column alias. This query +-- should fail as `b` is not in GROUP BY. +SELECT max(a) AS b, b FROM v1 GROUP BY k; + +-- Lateral column alias has higher priority than outer reference. +SELECT a FROM v1 WHERE (12, 13) IN (SELECT max(x + 10) AS a, a + 1 FROM v2); + +-- Relation output columns have higher priority than GROUP BY alias. This query should +-- fail as `a` is not in GROUP BY. +SELECT a AS k FROM v1 GROUP BY k; + +-- Relation output columns have higher priority than GROUP BY ALL. This query should +-- fail as `x` is not in GROUP BY. +SELECT x FROM v2 GROUP BY all; + +-- GROUP BY alias has higher priority than GROUP BY ALL, this query fails as `b` is not in GROUP BY. +SELECT a AS all, b FROM v1 GROUP BY all; + +-- GROUP BY alias/ALL does not support lateral column alias. +SELECT k AS lca, lca + 1 AS col FROM v1 GROUP BY k, col; +SELECT k AS lca, lca + 1 AS col FROM v1 GROUP BY all; Review Comment: Could you add one more test? It covers the case where I misunderstood the error message: ``` SELECT k AS lca, lca + 1 AS col FROM v1 GROUP BY lca; ``` Having an LCA in the SELECT list while the GROUP BY alias does not contain the LCA should still resolve. ## sql/core/src/test/resources/sql-tests/inputs/column-resolution-sort.sql: ## @@ -0,0 +1,20 @@ +--SET spark.sql.leafNodeDefaultParallelism=1 -- Tests covering column resolution priority in Sort. + +CREATE TEMPORARY VIEW v1 AS VALUES (1, 2, 2), (2, 1, 1) AS t(a, b, k); +CREATE TEMPORARY VIEW v2 AS VALUES (1, 2, 2), (2, 1, 1) AS t(a, b, all); + +-- Relation output columns have higher priority than missing reference. +-- Results will be [2, 1] if we order by the column `v1.b`. Review Comment: The test is good; just the comment needs correction. ## sql/core/src/test/resources/sql-tests/inputs/column-resolution-sort.sql: ## @@ -0,0 +1,20 @@ +--SET spark.sql.leafNodeDefaultParallelism=1 -- Tests covering column resolution priority in Sort. + +CREATE TEMPORARY VIEW v1 AS VALUES (1, 2, 2), (2, 1, 1) AS t(a, b, k); +CREATE TEMPORARY VIEW v2 AS VALUES (1, 2, 2), (2, 1, 1) AS t(a, b, all); + +-- Relation output columns have higher priority than missing reference. +-- Results will be [2, 1] if we order by the column `v1.b`. Review Comment: Missing-reference resolution won't cover this case; it should only add grouping expressions to the Aggregate. `v1.b` is not a grouping expression, so ordering by `v1.b` should throw an exception. ## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/ResolveReferencesInSort.scala: ## @@ -0,0 +1,81 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. 
You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package org.apache.spark.sql.catalyst.analysis + +import org.apache.spark.sql.catalyst.SQLConfHelper +import org.apache.spark.sql.catalyst.expressions.SortOrder +import org.apache.spark.sql.catalyst.plans.logical.{LogicalPlan, Project, Sort} + +/** + * A virtual rule to resolve [[UnresolvedAttribute]] in [[Sort]]. It's only used by the real + * rule `ResolveReferences`. The column resolution order for [[Sort]] is: + * 1. Resolves the column to [[AttributeReference]] with the output of the child plan. This + *includes metadata columns as well. + * 2. Resolves the column to a literal function which is allowed to be invoked without braces, e.g. + *`SELECT col, current_date FROM t`. + * 3. If the child plan is Aggregate, resolves the column to [[TempResolvedColumn]] with the output + *of Aggregate's child plan. This is to allow Sort to host grouping expressions and aggregate + *functions, which can be pushed down to the Aggregate later. For example, + *`SELECT max(a) FROM t GROUP BY b ORDER BY min(a)`. + * 4. Resolves the column to [[AttributeReference]
[GitHub] [spark] zhengruifeng opened a new pull request, #39818: [SPARK-42023][SPARK-42024][CONNECT][PYTHON] Make `createDataFrame` support `AtomicType -> StringType` coercion
zhengruifeng opened a new pull request, #39818: URL: https://github.com/apache/spark/pull/39818 ### What changes were proposed in this pull request? Make `createDataFrame` support `AtomicType -> StringType` coercion ### Why are the changes needed? To be consistent with PySpark; this feature was added in https://github.com/apache/spark/commit/51b04406028e14fbe1986f6a3ffc67853bd82935 ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Added a unit test and enabled an existing test -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] cloud-fan commented on a diff in pull request #39747: [SPARK-42191][SQL] Support udf 'luhn_check'
cloud-fan commented on code in PR #39747: URL: https://github.com/apache/spark/pull/39747#discussion_r1091427571 ## sql/catalyst/src/main/java/org/apache/spark/sql/catalyst/expressions/ExpressionImplUtils.java: ## @@ -35,6 +35,39 @@ public class ExpressionImplUtils { private static final int GCM_IV_LEN = 12; private static final int GCM_TAG_LEN = 128; + /** + * Function to check if a given number string is a valid Luhn number + * @param numberString + * the number string to check + * @return + * true if the number string is a valid Luhn number, false otherwise. + */ + public static boolean isLuhnNumber(UTF8String numberString) { +String digits = numberString.toString(); +// Check if all characters in the input string are digits. +java.util.function.Predicate<Character> isDigit = c -> Character.isDigit(c); +java.util.function.Predicate<Character> isNotWhitespace = c -> !Character.isWhitespace(c); +if (!digits.isEmpty() && +digits.chars().allMatch( +c -> Character.isDigit(c) && !Character.isWhitespace((char) c))) { Review Comment: I'm confused. `Character.isDigit(c)` returns true for spaces? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
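For the record, `Character.isDigit` never matches whitespace, which is easy to verify directly, so the extra `isWhitespace` guard in the quoted snippet is redundant:

```scala
// Space is whitespace, not a digit, so the isDigit check alone rejects it.
println(Character.isDigit(' '))      // false
println(Character.isDigit('5'))      // true
println(Character.isWhitespace(' ')) // true
```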
[GitHub] [spark] cloud-fan closed pull request #38760: [SPARK-41219][SQL] IntegralDivide use decimal(1, 0) to represent 0
cloud-fan closed pull request #38760: [SPARK-41219][SQL] IntegralDivide use decimal(1, 0) to represent 0 URL: https://github.com/apache/spark/pull/38760 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] cloud-fan commented on pull request #38760: [SPARK-41219][SQL] IntegralDivide use decimal(1, 0) to represent 0
cloud-fan commented on PR #38760: URL: https://github.com/apache/spark/pull/38760#issuecomment-1409716946 thanks, merging to master/3.4! -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] HyukjinKwon commented on pull request #39817: [SPARK-42250][PYTHON][ML] `predict_batch_udf` with float fails when the batch size consists of single value
HyukjinKwon commented on PR #39817: URL: https://github.com/apache/spark/pull/39817#issuecomment-1409712256 cc @leewyang, @mengxr and @WeichenXu123. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] LuciferYang commented on pull request #39810: [SPARK-42241][CONNECT][TESTS] Fix the find connect jar condition in `SparkConnectServerUtils#findSparkConnectJar` for maven
LuciferYang commented on PR #39810: URL: https://github.com/apache/spark/pull/39810#issuecomment-1409712130 Thanks @HyukjinKwon. I hope https://github.com/apache/spark/pull/39789 can be merged first; it seems to be blocking the version release -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] HyukjinKwon opened a new pull request, #39817: [SPARK-42250][PYTHON][ML] `predict_batch_udf` with float fails when the batch size consists of single value
HyukjinKwon opened a new pull request, #39817: URL: https://github.com/apache/spark/pull/39817 ### What changes were proposed in this pull request? This PR is sort of a followup of https://github.com/apache/spark/pull/37734 which handles the case when the batch is a single scalar value. Essentially it proposes to work around the pandas behaviour by explicitly casting back to the original data type of the NumPy array: ```python >>> import numpy as np >>> import pandas as pd >>> np.squeeze(np.array([1.])).dtype dtype('float64') >>> pd.Series(np.squeeze(np.array([1.]))).dtype dtype('O') >>> pd.Series(np.squeeze(np.array([1., 1.]))).dtype dtype('float64') ``` ### Why are the changes needed? Using `predict_batch_udf` fails when a batch happens to contain a single value. For example, even when the batch size is set to 10, if the data size is 21, it fails because the last batch consists of a single value. ```python import numpy as np import pandas as pd from pyspark.ml.functions import predict_batch_udf from pyspark.sql.types import ArrayType, FloatType, StructType, StructField from typing import Mapping df = spark.createDataFrame([[[0.0, 1.0, 2.0, 3.0], [0.0, 1.0, 2.0]]], schema=["t1", "t2"]) def make_multi_sum_fn(): def predict(x1: np.ndarray, x2: np.ndarray) -> np.ndarray: return np.sum(x1, axis=1) + np.sum(x2, axis=1) return predict multi_sum_udf = predict_batch_udf( make_multi_sum_fn, return_type=FloatType(), batch_size=1, input_tensor_shapes=[[4], [3]], ) df.select(multi_sum_udf("t1", "t2")).collect() ``` **Before** ``` File "/.../spark/python/lib/pyspark.zip/pyspark/worker.py", line 829, in main process() File "/.../spark/python/lib/pyspark.zip/pyspark/worker.py", line 821, in process serializer.dump_stream(out_iter, outfile) File "/.../spark/python/lib/pyspark.zip/pyspark/sql/pandas/serializers.py", line 345, in dump_stream return ArrowStreamSerializer.dump_stream(self, init_stream_yield_batches(), stream) File "/.../spark/python/lib/pyspark.zip/pyspark/sql/pandas/serializers.py", line 86, in dump_stream for batch in iterator: File "/.../spark/python/lib/pyspark.zip/pyspark/sql/pandas/serializers.py", line 339, in init_stream_yield_batches batch = self._create_batch(series) File "/.../spark/python/lib/pyspark.zip/pyspark/sql/pandas/serializers.py", line 275, in _create_batch arrs.append(create_array(s, t)) File "/.../spark/python/lib/pyspark.zip/pyspark/sql/pandas/serializers.py", line 245, in create_array raise e File "/.../spark/python/lib/pyspark.zip/pyspark/sql/pandas/serializers.py", line 233, in create_array array = pa.Array.from_pandas(s, mask=mask, type=t, safe=self._safecheck) File "pyarrow/array.pxi", line 1044, in pyarrow.lib.Array.from_pandas File "pyarrow/array.pxi", line 316, in pyarrow.lib.array File "pyarrow/array.pxi", line 83, in pyarrow.lib._ndarray_to_array File "pyarrow/error.pxi", line 100, in pyarrow.lib.check_status pyarrow.lib.ArrowInvalid: Could not convert array(569.) with type numpy.ndarray: tried to convert to float32 ``` **After** ``` [Row(predict(t1, t2)=9.0)] ``` ### Does this PR introduce _any_ user-facing change? This feature has not been released yet, so there is no user-facing change for end users. It fixes a bug in the unreleased feature. ### How was this patch tested? A unit test was added. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. 
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] LuciferYang commented on pull request #39789: [SPARK-42228][BUILD][CONNECT] Add shade and relocation rule of grpc to connect-client-jvm module
LuciferYang commented on PR #39789: URL: https://github.com/apache/spark/pull/39789#issuecomment-1409706471 > The reason to use the server shaded jar is because that's the only way to start spark with spark connect. Why is this the only way? Because of the conflicting package names between the `sql` module and the `connect-client-jvm` module? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] viirya commented on pull request #38760: [SPARK-41219][SQL] IntegralDivide use decimal(1, 0) to represent 0
viirya commented on PR #38760: URL: https://github.com/apache/spark/pull/38760#issuecomment-1409692951 > I think we need but it's a breaking change so how about creating a new pr for master Yea, let's do it in master branch. Thanks. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] ulysses-you commented on pull request #38760: [SPARK-41219][SQL] IntegralDivide use decimal(1, 0) to represent 0
ulysses-you commented on PR #38760: URL: https://github.com/apache/spark/pull/38760#issuecomment-1409665435 @viirya I think we need it, but it's a breaking change, so how about creating a new PR for master? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] cloud-fan commented on pull request #38760: [SPARK-41219][SQL] IntegralDivide use decimal(1, 0) to represent 0
cloud-fan commented on PR #38760: URL: https://github.com/apache/spark/pull/38760#issuecomment-1409665363 > Do we need to add some assert in DecimalType to prevent zero precision? I think we should, but let's do it in the master branch only to officially ban 0 decimal precision. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
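To illustrate the kind of guard being discussed, a constructor-time check could look like the hypothetical sketch below (not Spark's actual `DecimalType` code; the upper bound of 38 mirrors Spark's maximum decimal precision):

```scala
// Hypothetical sketch: reject zero (or negative) precision at construction time.
case class DecimalType(precision: Int, scale: Int) {
  require(precision >= 1 && precision <= 38,
    s"Decimal precision ($precision) must be between 1 and 38")
  require(scale <= precision,
    s"Decimal scale ($scale) cannot be larger than precision ($precision)")
}
```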
[GitHub] [spark] beliefer commented on a diff in pull request #39667: [SPARK-42131][SQL] Extract the function that construct the select statement for JDBC dialect.
beliefer commented on code in PR #39667: URL: https://github.com/apache/spark/pull/39667#discussion_r1091369388 ## sql/core/src/main/scala/org/apache/spark/sql/jdbc/JdbcDialects.scala: ## @@ -551,6 +551,34 @@ abstract class JdbcDialect extends Serializable with Logging { if (offset > 0 ) s"OFFSET $offset" else "" } + /** + * returns the SQL text for the SELECT statement. + * + * @param options - JDBC options that contains url, table and other information. + * @param columnList - The names of the columns or aggregate columns to SELECT. + * @param tableSampleClause - The table sample clause for the SELECT statement. + * @param whereClause - The WHERE clause for the SELECT statement. + * @param groupByClause - The group by clause for the SELECT statement. + * @param orderByClause - The order by clause for the SELECT statement. + * @param limitClause - The LIMIT clause for the SELECT statement. + * @param offsetClause - The OFFSET clause for the SELECT statement. Review Comment: `JdbcSQLBuilder` has been created. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
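For intuition, here is a rough sketch of how such a builder could assemble the final SELECT text from those pre-rendered clauses (the name `buildSelect` and the exact signature are my assumptions, not the actual `JdbcSQLBuilder` API):

```scala
// Each clause arrives as a pre-rendered string; empty clauses simply drop out.
def buildSelect(
    tableName: String,
    columnList: String,
    tableSampleClause: String,
    whereClause: String,
    groupByClause: String,
    orderByClause: String,
    limitClause: String,
    offsetClause: String): String = {
  Seq(s"SELECT $columnList FROM $tableName",
      tableSampleClause, whereClause, groupByClause,
      orderByClause, limitClause, offsetClause)
    .filter(_.nonEmpty)
    .mkString(" ")
}
```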
[GitHub] [spark] cloud-fan closed pull request #39797: [SPARK-42231][SQL] Turn `MISSING_STATIC_PARTITION_COLUMN` into `internalError`
cloud-fan closed pull request #39797: [SPARK-42231][SQL] Turn `MISSING_STATIC_PARTITION_COLUMN` into `internalError` URL: https://github.com/apache/spark/pull/39797 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] cloud-fan commented on pull request #39797: [SPARK-42231][SQL] Turn `MISSING_STATIC_PARTITION_COLUMN` into `internalError`
cloud-fan commented on PR #39797: URL: https://github.com/apache/spark/pull/39797#issuecomment-1409651159 thanks, merging to master/3.4! -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] ulysses-you commented on pull request #38760: [SPARK-41219][SQL] IntegralDivide use decimal(1, 0) to represent 0
ulysses-you commented on PR #38760: URL: https://github.com/apache/spark/pull/38760#issuecomment-1409650160 @cloud-fan updated -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] cloud-fan commented on a diff in pull request #37525: [SPARK-40086][SPARK-42049][SQL] Improve AliasAwareOutputPartitioning and AliasAwareQueryOutputOrdering to take all aliases into a
cloud-fan commented on code in PR #37525: URL: https://github.com/apache/spark/pull/37525#discussion_r1091361589 ## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/AliasAwareOutputExpression.scala: ## @@ -0,0 +1,116 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.sql.catalyst.plans + +import scala.collection.mutable + +import org.apache.spark.sql.catalyst.SQLConfHelper +import org.apache.spark.sql.catalyst.expressions.{Alias, Attribute, AttributeSet, Empty2Null, Expression, NamedExpression, SortOrder} +import org.apache.spark.sql.internal.SQLConf + +/** + * A trait that provides functionality to handle aliases in the `outputExpressions`. + */ +trait AliasAwareOutputExpression extends SQLConfHelper { + protected val aliasCandidateLimit = conf.getConf(SQLConf.EXPRESSION_PROJECTION_CANDIDATE_LIMIT) + protected def outputExpressions: Seq[NamedExpression] + /** + * This method can be used to strip expression which does not affect the result, for example: + * strip the expression which is ordering agnostic for output ordering. + */ + protected def strip(expr: Expression): Expression = expr + + // Build an `Expression` -> `Attribute` alias map. + // There can be multiple alias defined for the same expressions but it doesn't make sense to store + // more than `aliasCandidateLimit` attributes for an expression. In those cases the old logic + // handled only the last alias so we need to make sure that we give precedence to that. + // If the `outputExpressions` contain simple attributes we need to add those too to the map. + private lazy val aliasMap = { +val aliases = mutable.Map[Expression, mutable.ArrayBuffer[Attribute]]() +outputExpressions.reverse.foreach { + case a @ Alias(child, _) => +val buffer = aliases.getOrElseUpdate(strip(child.canonicalized), mutable.ArrayBuffer.empty) +if (buffer.size < aliasCandidateLimit) { + buffer += a.toAttribute +} + case _ => +} +outputExpressions.foreach { + case a: Attribute if aliases.contains(a.canonicalized) => +val buffer = aliases(a.canonicalized) +if (buffer.size < aliasCandidateLimit) { + buffer += a +} + case _ => +} +aliases + } + + protected def hasAlias: Boolean = aliasMap.nonEmpty + + /** + * Return a stream of expressions in which the original expression is projected with `aliasMap`. + */ + protected def projectExpression(expr: Expression): Stream[Expression] = { +val outputSet = AttributeSet(outputExpressions.map(_.toAttribute)) +expr.multiTransformDown { + // Mapping with aliases + case e: Expression if aliasMap.contains(e.canonicalized) => +aliasMap(e.canonicalized).toSeq ++ (if (e.containsChild.nonEmpty) Seq(e) else Seq.empty) + + // Prune if we encounter an attribute that we can't map and it is not in output set. 
+ // This prune will go up to the closest `multiTransformDown()` call and returns `Stream.empty` + // there. + case a: Attribute if !outputSet.contains(a) => Seq.empty +} + } +} + +/** + * A trait that handles aliases in the `orderingExpressions` to produce `outputOrdering` that + * satisfies ordering requirements. + */ +trait AliasAwareQueryOutputOrdering[T <: QueryPlan[T]] + extends AliasAwareOutputExpression { self: QueryPlan[T] => + protected def orderingExpressions: Seq[SortOrder] + + override protected def strip(expr: Expression): Expression = expr match { +case e: Empty2Null => strip(e.child) +case _ => expr + } + + override final def outputOrdering: Seq[SortOrder] = { +if (hasAlias) { + orderingExpressions.flatMap { sortOrder => +val orderingSet = mutable.Set.empty[Expression] +val sameOrderings = sortOrder.children.toStream + .flatMap(projectExpression) + .filter(e => orderingSet.add(e.canonicalized)) + .take(aliasCandidateLimit) +if (sameOrderings.nonEmpty) { + Some(sortOrder.copy(child = sameOrderings.head, +sameOrderExpressions = sameOrderings.tail)) +} else { + None +} Review Comment: let's also add te
[GitHub] [spark] cloud-fan commented on a diff in pull request #37525: [SPARK-40086][SPARK-42049][SQL] Improve AliasAwareOutputPartitioning and AliasAwareQueryOutputOrdering to take all aliases into a
cloud-fan commented on code in PR #37525: URL: https://github.com/apache/spark/pull/37525#discussion_r1091361465 ## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/AliasAwareOutputExpression.scala: ## @@ -0,0 +1,116 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.sql.catalyst.plans + +import scala.collection.mutable + +import org.apache.spark.sql.catalyst.SQLConfHelper +import org.apache.spark.sql.catalyst.expressions.{Alias, Attribute, AttributeSet, Empty2Null, Expression, NamedExpression, SortOrder} +import org.apache.spark.sql.internal.SQLConf + +/** + * A trait that provides functionality to handle aliases in the `outputExpressions`. + */ +trait AliasAwareOutputExpression extends SQLConfHelper { + protected val aliasCandidateLimit = conf.getConf(SQLConf.EXPRESSION_PROJECTION_CANDIDATE_LIMIT) + protected def outputExpressions: Seq[NamedExpression] + /** + * This method can be used to strip expression which does not affect the result, for example: + * strip the expression which is ordering agnostic for output ordering. + */ + protected def strip(expr: Expression): Expression = expr + + // Build an `Expression` -> `Attribute` alias map. + // There can be multiple alias defined for the same expressions but it doesn't make sense to store + // more than `aliasCandidateLimit` attributes for an expression. In those cases the old logic + // handled only the last alias so we need to make sure that we give precedence to that. + // If the `outputExpressions` contain simple attributes we need to add those too to the map. + private lazy val aliasMap = { +val aliases = mutable.Map[Expression, mutable.ArrayBuffer[Attribute]]() +outputExpressions.reverse.foreach { + case a @ Alias(child, _) => +val buffer = aliases.getOrElseUpdate(strip(child.canonicalized), mutable.ArrayBuffer.empty) +if (buffer.size < aliasCandidateLimit) { + buffer += a.toAttribute +} + case _ => +} +outputExpressions.foreach { + case a: Attribute if aliases.contains(a.canonicalized) => +val buffer = aliases(a.canonicalized) +if (buffer.size < aliasCandidateLimit) { + buffer += a +} + case _ => +} +aliases + } + + protected def hasAlias: Boolean = aliasMap.nonEmpty + + /** + * Return a stream of expressions in which the original expression is projected with `aliasMap`. + */ + protected def projectExpression(expr: Expression): Stream[Expression] = { +val outputSet = AttributeSet(outputExpressions.map(_.toAttribute)) +expr.multiTransformDown { + // Mapping with aliases + case e: Expression if aliasMap.contains(e.canonicalized) => +aliasMap(e.canonicalized).toSeq ++ (if (e.containsChild.nonEmpty) Seq(e) else Seq.empty) + + // Prune if we encounter an attribute that we can't map and it is not in output set. 
+ // This prune will go up to the closest `multiTransformDown()` call and returns `Stream.empty` + // there. + case a: Attribute if !outputSet.contains(a) => Seq.empty +} + } +} + +/** + * A trait that handles aliases in the `orderingExpressions` to produce `outputOrdering` that + * satisfies ordering requirements. + */ +trait AliasAwareQueryOutputOrdering[T <: QueryPlan[T]] + extends AliasAwareOutputExpression { self: QueryPlan[T] => + protected def orderingExpressions: Seq[SortOrder] + + override protected def strip(expr: Expression): Expression = expr match { +case e: Empty2Null => strip(e.child) +case _ => expr + } + + override final def outputOrdering: Seq[SortOrder] = { +if (hasAlias) { + orderingExpressions.flatMap { sortOrder => +val orderingSet = mutable.Set.empty[Expression] +val sameOrderings = sortOrder.children.toStream + .flatMap(projectExpression) + .filter(e => orderingSet.add(e.canonicalized)) + .take(aliasCandidateLimit) +if (sameOrderings.nonEmpty) { + Some(sortOrder.copy(child = sameOrderings.head, +sameOrderExpressions = sameOrderings.tail)) +} else { + None +} Review Comment: Good point! Parti
[GitHub] [spark] cloud-fan commented on a diff in pull request #37525: [SPARK-40086][SPARK-42049][SQL] Improve AliasAwareOutputPartitioning and AliasAwareQueryOutputOrdering to take all aliases into a
cloud-fan commented on code in PR #37525: URL: https://github.com/apache/spark/pull/37525#discussion_r1091360793 ## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/AliasAwareOutputExpression.scala: ## @@ -0,0 +1,116 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.sql.catalyst.plans + +import scala.collection.mutable + +import org.apache.spark.sql.catalyst.SQLConfHelper +import org.apache.spark.sql.catalyst.expressions.{Alias, Attribute, AttributeSet, Empty2Null, Expression, NamedExpression, SortOrder} +import org.apache.spark.sql.internal.SQLConf + +/** + * A trait that provides functionality to handle aliases in the `outputExpressions`. + */ +trait AliasAwareOutputExpression extends SQLConfHelper { + protected val aliasCandidateLimit = conf.getConf(SQLConf.EXPRESSION_PROJECTION_CANDIDATE_LIMIT) + protected def outputExpressions: Seq[NamedExpression] + /** + * This method can be used to strip expression which does not affect the result, for example: + * strip the expression which is ordering agnostic for output ordering. + */ + protected def strip(expr: Expression): Expression = expr + + // Build an `Expression` -> `Attribute` alias map. + // There can be multiple alias defined for the same expressions but it doesn't make sense to store + // more than `aliasCandidateLimit` attributes for an expression. In those cases the old logic + // handled only the last alias so we need to make sure that we give precedence to that. + // If the `outputExpressions` contain simple attributes we need to add those too to the map. + private lazy val aliasMap = { +val aliases = mutable.Map[Expression, mutable.ArrayBuffer[Attribute]]() +outputExpressions.reverse.foreach { + case a @ Alias(child, _) => +val buffer = aliases.getOrElseUpdate(strip(child.canonicalized), mutable.ArrayBuffer.empty) Review Comment: +1 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] ulysses-you commented on a diff in pull request #37525: [SPARK-40086][SPARK-42049][SQL] Improve AliasAwareOutputPartitioning and AliasAwareQueryOutputOrdering to take all aliases into
ulysses-you commented on code in PR #37525: URL: https://github.com/apache/spark/pull/37525#discussion_r1091358506 ## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/AliasAwareOutputExpression.scala: ## @@ -0,0 +1,116 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.sql.catalyst.plans + +import scala.collection.mutable + +import org.apache.spark.sql.catalyst.SQLConfHelper +import org.apache.spark.sql.catalyst.expressions.{Alias, Attribute, AttributeSet, Empty2Null, Expression, NamedExpression, SortOrder} +import org.apache.spark.sql.internal.SQLConf + +/** + * A trait that provides functionality to handle aliases in the `outputExpressions`. + */ +trait AliasAwareOutputExpression extends SQLConfHelper { + protected val aliasCandidateLimit = conf.getConf(SQLConf.EXPRESSION_PROJECTION_CANDIDATE_LIMIT) + protected def outputExpressions: Seq[NamedExpression] + /** + * This method can be used to strip expression which does not affect the result, for example: + * strip the expression which is ordering agnostic for output ordering. + */ + protected def strip(expr: Expression): Expression = expr + + // Build an `Expression` -> `Attribute` alias map. + // There can be multiple alias defined for the same expressions but it doesn't make sense to store + // more than `aliasCandidateLimit` attributes for an expression. In those cases the old logic + // handled only the last alias so we need to make sure that we give precedence to that. + // If the `outputExpressions` contain simple attributes we need to add those too to the map. + private lazy val aliasMap = { +val aliases = mutable.Map[Expression, mutable.ArrayBuffer[Attribute]]() +outputExpressions.reverse.foreach { + case a @ Alias(child, _) => +val buffer = aliases.getOrElseUpdate(strip(child.canonicalized), mutable.ArrayBuffer.empty) +if (buffer.size < aliasCandidateLimit) { + buffer += a.toAttribute +} + case _ => +} +outputExpressions.foreach { + case a: Attribute if aliases.contains(a.canonicalized) => +val buffer = aliases(a.canonicalized) +if (buffer.size < aliasCandidateLimit) { + buffer += a +} + case _ => +} +aliases + } + + protected def hasAlias: Boolean = aliasMap.nonEmpty + + /** + * Return a stream of expressions in which the original expression is projected with `aliasMap`. + */ + protected def projectExpression(expr: Expression): Stream[Expression] = { +val outputSet = AttributeSet(outputExpressions.map(_.toAttribute)) +expr.multiTransformDown { + // Mapping with aliases + case e: Expression if aliasMap.contains(e.canonicalized) => +aliasMap(e.canonicalized).toSeq ++ (if (e.containsChild.nonEmpty) Seq(e) else Seq.empty) + + // Prune if we encounter an attribute that we can't map and it is not in output set. 
+ // This prune will go up to the closest `multiTransformDown()` call and returns `Stream.empty` + // there. + case a: Attribute if !outputSet.contains(a) => Seq.empty +} + } +} + +/** + * A trait that handles aliases in the `orderingExpressions` to produce `outputOrdering` that + * satisfies ordering requirements. + */ +trait AliasAwareQueryOutputOrdering[T <: QueryPlan[T]] + extends AliasAwareOutputExpression { self: QueryPlan[T] => + protected def orderingExpressions: Seq[SortOrder] + + override protected def strip(expr: Expression): Expression = expr match { +case e: Empty2Null => strip(e.child) +case _ => expr + } + + override final def outputOrdering: Seq[SortOrder] = { +if (hasAlias) { + orderingExpressions.flatMap { sortOrder => +val orderingSet = mutable.Set.empty[Expression] +val sameOrderings = sortOrder.children.toStream + .flatMap(projectExpression) + .filter(e => orderingSet.add(e.canonicalized)) + .take(aliasCandidateLimit) +if (sameOrderings.nonEmpty) { + Some(sortOrder.copy(child = sameOrderings.head, +sameOrderExpressions = sameOrderings.tail)) +} else { + None +} Review Comment: is it correct ?
[GitHub] [spark] cloud-fan commented on pull request #39718: [SPARK-42163][SQL] Fix schema pruning for non-foldable array index or map key
cloud-fan commented on PR #39718: URL: https://github.com/apache/spark/pull/39718#issuecomment-1409644259 thanks, merging to master/3.4! -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] cloud-fan closed pull request #39718: [SPARK-42163][SQL] Fix schema pruning for non-foldable array index or map key
cloud-fan closed pull request #39718: [SPARK-42163][SQL] Fix schema pruning for non-foldable array index or map key URL: https://github.com/apache/spark/pull/39718 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] ulysses-you commented on a diff in pull request #37525: [SPARK-40086][SPARK-42049][SQL] Improve AliasAwareOutputPartitioning and AliasAwareQueryOutputOrdering to take all aliases into
ulysses-you commented on code in PR #37525: URL: https://github.com/apache/spark/pull/37525#discussion_r1091353269 ## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/AliasAwareOutputExpression.scala: ## @@ -0,0 +1,116 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.sql.catalyst.plans + +import scala.collection.mutable + +import org.apache.spark.sql.catalyst.SQLConfHelper +import org.apache.spark.sql.catalyst.expressions.{Alias, Attribute, AttributeSet, Empty2Null, Expression, NamedExpression, SortOrder} +import org.apache.spark.sql.internal.SQLConf + +/** + * A trait that provides functionality to handle aliases in the `outputExpressions`. + */ +trait AliasAwareOutputExpression extends SQLConfHelper { + protected val aliasCandidateLimit = conf.getConf(SQLConf.EXPRESSION_PROJECTION_CANDIDATE_LIMIT) + protected def outputExpressions: Seq[NamedExpression] + /** + * This method can be used to strip expression which does not affect the result, for example: + * strip the expression which is ordering agnostic for output ordering. + */ + protected def strip(expr: Expression): Expression = expr + + // Build an `Expression` -> `Attribute` alias map. + // There can be multiple alias defined for the same expressions but it doesn't make sense to store + // more than `aliasCandidateLimit` attributes for an expression. In those cases the old logic + // handled only the last alias so we need to make sure that we give precedence to that. + // If the `outputExpressions` contain simple attributes we need to add those too to the map. + private lazy val aliasMap = { +val aliases = mutable.Map[Expression, mutable.ArrayBuffer[Attribute]]() +outputExpressions.reverse.foreach { + case a @ Alias(child, _) => +val buffer = aliases.getOrElseUpdate(strip(child.canonicalized), mutable.ArrayBuffer.empty) Review Comment: I prefer `strip(child).canonicalized`. I have not seen other code places that match a canonicalized expression. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] cloud-fan commented on pull request #39808: [SPARK-41970][SQL][FOLLOWUP] Revert SparkPath changes to FileIndex and FileRelation
cloud-fan commented on PR #39808: URL: https://github.com/apache/spark/pull/39808#issuecomment-1409640068 late LGTM -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] ulysses-you commented on a diff in pull request #37525: [SPARK-40086][SPARK-42049][SQL] Improve AliasAwareOutputPartitioning and AliasAwareQueryOutputOrdering to take all aliases into
ulysses-you commented on code in PR #37525: URL: https://github.com/apache/spark/pull/37525#discussion_r1091351690 ## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/AliasAwareOutputExpression.scala: ## @@ -0,0 +1,116 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.sql.catalyst.plans + +import scala.collection.mutable + +import org.apache.spark.sql.catalyst.SQLConfHelper +import org.apache.spark.sql.catalyst.expressions.{Alias, Attribute, AttributeSet, Empty2Null, Expression, NamedExpression, SortOrder} +import org.apache.spark.sql.internal.SQLConf + +/** + * A trait that provides functionality to handle aliases in the `outputExpressions`. + */ +trait AliasAwareOutputExpression extends SQLConfHelper { + protected val aliasCandidateLimit = conf.getConf(SQLConf.EXPRESSION_PROJECTION_CANDIDATE_LIMIT) + protected def outputExpressions: Seq[NamedExpression] + /** + * This method can be used to strip expression which does not affect the result, for example: + * strip the expression which is ordering agnostic for output ordering. + */ + protected def strip(expr: Expression): Expression = expr + + // Build an `Expression` -> `Attribute` alias map. + // There can be multiple alias defined for the same expressions but it doesn't make sense to store + // more than `aliasCandidateLimit` attributes for an expression. In those cases the old logic + // handled only the last alias so we need to make sure that we give precedence to that. + // If the `outputExpressions` contain simple attributes we need to add those too to the map. + private lazy val aliasMap = { +val aliases = mutable.Map[Expression, mutable.ArrayBuffer[Attribute]]() +outputExpressions.reverse.foreach { + case a @ Alias(child, _) => +val buffer = aliases.getOrElseUpdate(strip(child.canonicalized), mutable.ArrayBuffer.empty) +if (buffer.size < aliasCandidateLimit) { + buffer += a.toAttribute +} + case _ => +} +outputExpressions.foreach { + case a: Attribute if aliases.contains(a.canonicalized) => Review Comment: What does this do ? do you mean `!aliases.contains(a.canonicalized)` ? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] ulysses-you commented on a diff in pull request #37525: [SPARK-40086][SPARK-42049][SQL] Improve AliasAwareOutputPartitioning and AliasAwareQueryOutputOrdering to take all aliases into
ulysses-you commented on code in PR #37525: URL: https://github.com/apache/spark/pull/37525#discussion_r1091351231 ## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/AliasAwareOutputExpression.scala: ## @@ -0,0 +1,116 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.sql.catalyst.plans + +import scala.collection.mutable + +import org.apache.spark.sql.catalyst.SQLConfHelper +import org.apache.spark.sql.catalyst.expressions.{Alias, Attribute, AttributeSet, Empty2Null, Expression, NamedExpression, SortOrder} +import org.apache.spark.sql.internal.SQLConf + +/** + * A trait that provides functionality to handle aliases in the `outputExpressions`. + */ +trait AliasAwareOutputExpression extends SQLConfHelper { + protected val aliasCandidateLimit = conf.getConf(SQLConf.EXPRESSION_PROJECTION_CANDIDATE_LIMIT) + protected def outputExpressions: Seq[NamedExpression] + /** + * This method can be used to strip expression which does not affect the result, for example: + * strip the expression which is ordering agnostic for output ordering. + */ + protected def strip(expr: Expression): Expression = expr + + // Build an `Expression` -> `Attribute` alias map. + // There can be multiple alias defined for the same expressions but it doesn't make sense to store + // more than `aliasCandidateLimit` attributes for an expression. In those cases the old logic + // handled only the last alias so we need to make sure that we give precedence to that. + // If the `outputExpressions` contain simple attributes we need to add those too to the map. + private lazy val aliasMap = { +val aliases = mutable.Map[Expression, mutable.ArrayBuffer[Attribute]]() Review Comment: The current config actually limits the size of the final preserved exprs. I think we should add one more config to limit the candidate size in `aliasMap`, and it can be bigger by default. This `aliasMap` may harm driver memory for wide tables. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
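As a toy model of the memory concern (my simplification, not the PR's code): per canonicalized expression the map keeps at most a bounded number of alias candidates, so a very wide projection such as `SELECT a AS a1, a AS a2, ..., a AS a1000` cannot grow the buffers without limit. The proposal above would make that bound a separate, larger-by-default config:

```scala
import scala.collection.mutable

// `candidateLimit` stands in for the proposed config; candidates beyond
// the cap are silently dropped, as aliasMap does today.
def boundedAliasMap[K, V](
    pairs: Seq[(K, V)],
    candidateLimit: Int): mutable.Map[K, mutable.ArrayBuffer[V]] = {
  val m = mutable.Map.empty[K, mutable.ArrayBuffer[V]]
  for ((k, v) <- pairs) {
    val buf = m.getOrElseUpdate(k, mutable.ArrayBuffer.empty[V])
    if (buf.size < candidateLimit) buf += v
  }
  m
}
```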
[GitHub] [spark] cloud-fan commented on pull request #39660: [SPARK-42128][SQL] Support TOP (N) for MS SQL Server dialect as an alternative to Limit pushdown
cloud-fan commented on PR #39660: URL: https://github.com/apache/spark/pull/39660#issuecomment-1409637668 This is a new feature and we don't need to rush (can't catch 3.4 anyway). I'd like to have an API refactor first. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] cloud-fan commented on pull request #38760: [SPARK-41219][SQL] IntegralDivide use decimal(1, 0) to represent 0
cloud-fan commented on PR #38760: URL: https://github.com/apache/spark/pull/38760#issuecomment-1409636538 Please link to the decimal refactor PR. Then I think this is good to go. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] cloud-fan commented on a diff in pull request #39747: [SPARK-42191][SQL] Support udf 'luhn_check'
cloud-fan commented on code in PR #39747: URL: https://github.com/apache/spark/pull/39747#discussion_r1091350090 ## sql/core/src/test/scala/org/apache/spark/sql/ColumnExpressionSuite.scala: ## @@ -3081,4 +3081,19 @@ class ColumnExpressionSuite extends QueryTest with SharedSparkSession { checkAnswer(df.select($"dd" / ($"num" + 3)), Seq((Duration.ofDays(2))).toDF) } + + test("luhn check") { Review Comment: I think the golden file tests fully cover it. If we add a Scala API for `luhn_check`, we need this test. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
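If a Scala API were added later, one lightweight option is a thin wrapper that routes through the SQL parser (hypothetical sketch; `luhnCheck` is not an existing function in `org.apache.spark.sql.functions`):

```scala
import org.apache.spark.sql.Column
import org.apache.spark.sql.functions.expr

// Hypothetical helper: delegate to the SQL function by name until (and
// unless) a dedicated Column-based API lands in functions.scala.
def luhnCheck(columnName: String): Column = expr(s"luhn_check($columnName)")

// Usage: df.select(luhnCheck("c1"))
```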
[GitHub] [spark] cloud-fan commented on a diff in pull request #39747: [SPARK-42191][SQL] Support udf 'luhn_check'
cloud-fan commented on code in PR #39747: URL: https://github.com/apache/spark/pull/39747#discussion_r1091349674 ## sql/core/src/test/resources/sql-tests/inputs/string-functions.sql: ## @@ -231,3 +231,39 @@ CREATE TEMPORARY VIEW fmtTable(fmtField) AS SELECT * FROM VALUES ('invalidFormat SELECT to_binary('abc', fmtField) FROM fmtTable; -- Clean up DROP VIEW IF EXISTS fmtTable; +-- luhn_check +-- basic cases +select luhn_check('4111'); +select luhn_check('5504'); +select luhn_check('349'); +select luhn_check('60110004'); +select luhn_check('378282246310005'); +select luhn_check('6011000990139424'); + +-- spaces in the beginning/middle/end +select luhn_check('4111'); +select luhn_check('411 1'); +select luhn_check(' 4111'); +-- space +select luhn_check(''); +select luhn_check(' '); +-- non-digits +select luhn_check('510B105105105106'); +select luhn_check('ABCDED'); +-- null +select luhn_check(null); +-- non string (test implicit cast) +select luhn_check(6017); +select luhn_check(6018); +select luhn_check(123.456); + +select luhn_check('0'); + +SELECT luhn_check(c1) from values ("4111"), ("5105105105105106") as tab(c1); Review Comment: Why do we need extra tests for `from values`?
[GitHub] [spark] cloud-fan commented on a diff in pull request #39747: [SPARK-42191][SQL] Support udf 'luhn_check'
cloud-fan commented on code in PR #39747: URL: https://github.com/apache/spark/pull/39747#discussion_r1091349494 ## sql/core/src/test/resources/sql-tests/inputs/string-functions.sql: ## @@ -231,3 +231,39 @@ CREATE TEMPORARY VIEW fmtTable(fmtField) AS SELECT * FROM VALUES ('invalidFormat SELECT to_binary('abc', fmtField) FROM fmtTable; -- Clean up DROP VIEW IF EXISTS fmtTable; +-- luhn_check +-- basic cases +select luhn_check('4111'); +select luhn_check('5504'); +select luhn_check('349'); +select luhn_check('60110004'); +select luhn_check('378282246310005'); +select luhn_check('6011000990139424'); Review Comment: Do these 6 tests cover different code branches?
[GitHub] [spark] cloud-fan commented on a diff in pull request #39747: [SPARK-42191][SQL] Support udf 'luhn_check'
cloud-fan commented on code in PR #39747: URL: https://github.com/apache/spark/pull/39747#discussion_r1091349245 ## sql/core/src/test/resources/sql-tests/inputs/string-functions.sql: ## @@ -231,3 +231,39 @@ CREATE TEMPORARY VIEW fmtTable(fmtField) AS SELECT * FROM VALUES ('invalidFormat SELECT to_binary('abc', fmtField) FROM fmtTable; -- Clean up DROP VIEW IF EXISTS fmtTable; +-- luhn_check +-- basic cases +select luhn_check('4111'); +select luhn_check('5504'); +select luhn_check('349'); +select luhn_check('60110004'); +select luhn_check('378282246310005'); +select luhn_check('6011000990139424'); + +-- spaces in the beginning/middle/end +select luhn_check('4111'); +select luhn_check('411 1'); +select luhn_check(' 4111'); +-- space +select luhn_check(''); +select luhn_check(' '); +-- non-digits +select luhn_check('510B105105105106'); +select luhn_check('ABCDED'); +-- null +select luhn_check(null); +-- non string (test implicit cast) +select luhn_check(6017); +select luhn_check(6018); +select luhn_check(123.456); + +select luhn_check('0'); Review Comment: Shall we move it to the `-- basic cases` section? It's also a number-string case.
[GitHub] [spark] cloud-fan commented on a diff in pull request #39747: [SPARK-42191][SQL] Support udf 'luhn_check'
cloud-fan commented on code in PR #39747: URL: https://github.com/apache/spark/pull/39747#discussion_r1091348733 ## sql/catalyst/src/main/java/org/apache/spark/sql/catalyst/expressions/ExpressionImplUtils.java: ## @@ -35,6 +35,35 @@ public class ExpressionImplUtils { private static final int GCM_IV_LEN = 12; private static final int GCM_TAG_LEN = 128; + /** + * Function to check if a given number string is a valid Luhn number + * @param numberString + * the number string to check + * @return + * true if the number string is a valid Luhn number, false otherwise. + */ + public static boolean isLuhnNumber(UTF8String numberString) { +String digits = numberString.toString(); +// Check if all characters in the input string are digits. +if (!digits.trim().isEmpty() && digits.chars().allMatch(Character::isDigit)) { Review Comment: Do we need the `trim()` call here? Luhn doesn't support spaces.
[GitHub] [spark] panbingkun opened a new pull request, #39816: [SPARK-42245][BUILD] Upgrade scalafmt from 3.6.1 to 3.7.1
panbingkun opened a new pull request, #39816: URL: https://github.com/apache/spark/pull/39816 ### What changes were proposed in this pull request? This PR aims to upgrade scalafmt from 3.6.1 to 3.7.1. ### Why are the changes needed? A. Release notes: > https://github.com/scalameta/scalafmt/releases B. v3.6.1 vs. v3.7.1: > https://github.com/scalameta/scalafmt/compare/v3.6.1...v3.7.1 C. Brings bug fixes & some improvements: https://user-images.githubusercontent.com/15246973/215639186-47ad2abc-5827-4b0b-a401-10737bd05743.png https://user-images.githubusercontent.com/15246973/215639316-0df69d85-cb6b-40f8-acbf-d792193d1ba1.png ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Manually run `sh ./dev/scalafmt`. Pass GA.
[GitHub] [spark] cloud-fan commented on pull request #37525: [SPARK-40086][SPARK-42049][SQL] Improve AliasAwareOutputPartitioning and AliasAwareQueryOutputOrdering to take all aliases into account
cloud-fan commented on PR #37525: URL: https://github.com/apache/spark/pull/37525#issuecomment-1409625294 @peter-toth can you retrigger the tests? The PySpark failures may be flaky.
[GitHub] [spark] hvanhovell closed pull request #39760: [SPARK-42202][Connect][Test] Improve the E2E test server stop logic
hvanhovell closed pull request #39760: [SPARK-42202][Connect][Test] Improve the E2E test server stop logic URL: https://github.com/apache/spark/pull/39760
[GitHub] [spark] hvanhovell commented on pull request #39760: [SPARK-42202][Connect][Test] Improve the E2E test server stop logic
hvanhovell commented on PR #39760: URL: https://github.com/apache/spark/pull/39760#issuecomment-1409623518 Merging this.
[GitHub] [spark] vinodkc commented on a diff in pull request #39747: [SPARK-42191][SQL] Support udf 'luhn_check'
vinodkc commented on code in PR #39747: URL: https://github.com/apache/spark/pull/39747#discussion_r1091338386 ## sql/core/src/test/resources/sql-tests/inputs/string-functions.sql: ## @@ -231,3 +231,42 @@ CREATE TEMPORARY VIEW fmtTable(fmtField) AS SELECT * FROM VALUES ('invalidFormat SELECT to_binary('abc', fmtField) FROM fmtTable; -- Clean up DROP VIEW IF EXISTS fmtTable; +-- luhn_check +select luhn_check('4111'); +select luhn_check('4111'); +select luhn_check('5504'); +select luhn_check('349'); +select luhn_check('60110004'); Review Comment: I added the suggested test cases.
[GitHub] [spark] ulysses-you commented on pull request #38760: [SPARK-41219][SQL] IntegralDivide use decimal(1, 0) to represent 0
ulysses-you commented on PR #38760: URL: https://github.com/apache/spark/pull/38760#issuecomment-1409618212 @cloud-fan sure, I have updated the description!
[GitHub] [spark] itholic commented on pull request #39799: [SPARK-42232][SQL] Rename error class: `UNSUPPORTED_FEATURE.JDBC_TRANSACTION`
itholic commented on PR #39799: URL: https://github.com/apache/spark/pull/39799#issuecomment-1409618136 cc @MaxGekk @srielau @cloud-fan FYI
[GitHub] [spark] beliefer commented on a diff in pull request #39660: [SPARK-42128][SQL] Support TOP (N) for MS SQL Server dialect as an alternative to Limit pushdown
beliefer commented on code in PR #39660: URL: https://github.com/apache/spark/pull/39660#discussion_r1091335134 ## sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JDBCRDD.scala: ## @@ -307,11 +307,12 @@ private[jdbc] class JDBCRDD( "" } +val myTopExpression: String = dialect.getTopExpression(limit) // SQL Server Limit alternative val myLimitClause: String = dialect.getLimitClause(limit) val myOffsetClause: String = dialect.getOffsetClause(offset) val sqlText = options.prepareQuery + - s"SELECT $columnList FROM ${options.tableOrQuery} $myTableSampleClause" + + s"SELECT $myTopExpression $columnList FROM ${options.tableOrQuery} $myTableSampleClause" + Review Comment: I will do the work.
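For context, a rough sketch of the dialect side of this hook; `getTopExpression` is the name used in the diff above, but the body is illustrative, not the PR's implementation:

```scala
// Illustrative MsSqlServerDialect override: render TOP (N) for a pushed-down
// limit. A non-positive limit means nothing was pushed down, so emit nothing
// and the generated statement stays a plain SELECT.
override def getTopExpression(limit: Integer): String = {
  if (limit > 0) s"TOP ($limit)" else ""
}
```

With that in place, the `sqlText` above would render something like `SELECT TOP (10) "a","b" FROM mytable` for SQL Server, instead of a trailing LIMIT clause that SQL Server does not support.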
[GitHub] [spark] itholic opened a new pull request, #39815: [SPARK-42244][PYTHON] Refine error classes and messages
itholic opened a new pull request, #39815: URL: https://github.com/apache/spark/pull/39815 ### What changes were proposed in this pull request? This PR proposes to refine error classes and messages by using the Python type object name in error class names and their messages. nit - Also adds missing dots at the end of error messages. ### Why are the changes needed? The type names were inconsistent in the previous error messages, e.g. string vs. str, int vs. integer, and sometimes the error class name was too long to read. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? The existing CI should pass.
[GitHub] [spark] xinrong-meng opened a new pull request, #39814: [WIP][SPARK-42208][CONNECT][PYTHON] Reuse UDF test cases under `pyspark.sql.tests`
xinrong-meng opened a new pull request, #39814: URL: https://github.com/apache/spark/pull/39814 ### What changes were proposed in this pull request? Reuse UDF test cases under `pyspark.sql.tests`. ### Why are the changes needed? Reach parity with vanilla PySpark. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Unit tests.
[GitHub] [spark] xinrong-meng commented on pull request #39753: [SPARK-42125][CONNECT][PYTHON] Pandas UDF in Spark Connect
xinrong-meng commented on PR #39753: URL: https://github.com/apache/spark/pull/39753#issuecomment-1409604165 Merged it to unblock follow-up PRs. Feel free to leave comments, I will address them in the follow-ups.
[GitHub] [spark] srowen commented on pull request #39381: [SPARK-41554] fix changing of Decimal scale when scale decreased by m…
srowen commented on PR #39381: URL: https://github.com/apache/spark/pull/39381#issuecomment-1409602431 Roger that - any chance you can retrigger these tests? They're hung for some reason.
[GitHub] [spark] xinrong-meng commented on pull request #39753: [SPARK-42125][CONNECT][PYTHON] Pandas UDF in Spark Connect
xinrong-meng commented on PR #39753: URL: https://github.com/apache/spark/pull/39753#issuecomment-1409601795 Merged to master and branch-3.4. Thanks all!
[GitHub] [spark] xinrong-meng closed pull request #39753: [SPARK-42125][CONNECT][PYTHON] Pandas UDF in Spark Connect
xinrong-meng closed pull request #39753: [SPARK-42125][CONNECT][PYTHON] Pandas UDF in Spark Connect URL: https://github.com/apache/spark/pull/39753
[GitHub] [spark] vinodkc commented on a diff in pull request #39747: [SPARK-42191][SQL] Support udf 'luhn_check'
vinodkc commented on code in PR #39747: URL: https://github.com/apache/spark/pull/39747#discussion_r1091322757 ## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/stringExpressions.scala: ## @@ -3039,3 +3039,43 @@ case class SplitPart ( partNum = newChildren.apply(2)) } } + +/** + * Function to check if a given number string is a valid Luhn number. Returns true, if the number + * string is a valid Luhn number, false otherwise. + */ +@ExpressionDescription( + usage = """ +_FUNC_(str ) - Checks that a string of digits is valid according to the Luhn algorithm. +This checksum function is widely applied on credit card numbers and government identification +numbers to distinguish valid numbers from mistyped, incorrect numbers. Review Comment: To check newlines, I tried the following tests: `spark-sql> desc function luhn_check;` ``` Function: luhn_check Class: org.apache.spark.sql.catalyst.expressions.Luhncheck Usage: luhn_check(str ) - Checks that a string of digits is valid according to the Luhn algorithm. This checksum function is widely applied on credit card numbers and government identification numbers to distinguish valid numbers from mistyped, incorrect numbers. ``` I did the same test on other builtin functions like `substring_index` and `rpad`: ``` spark-sql> desc function substring_index; Function: substring_index Class: org.apache.spark.sql.catalyst.expressions.SubstringIndex Usage: substring_index(str, delim, count) - Returns the substring from `str` before `count` occurrences of the delimiter `delim`. If `count` is positive, everything to the left of the final delimiter (counting from the left) is returned. If `count` is negative, everything to the right of the final delimiter (counting from the right) is returned. The function substring_index performs a case-sensitive match when searching for `delim`. ``` Please advise; if adding newlines is discouraged, I can change the function doc.
[GitHub] [spark] zhenlineo commented on pull request #39760: [SPARK-42202][Connect][Test] Improve the E2E test server stop logic
zhenlineo commented on PR #39760: URL: https://github.com/apache/spark/pull/39760#issuecomment-1409594157 @LuciferYang could you have another look? Thanks!
[GitHub] [spark] fe2s commented on pull request #39381: [SPARK-41554] fix changing of Decimal scale when scale decreased by m…
fe2s commented on PR #39381: URL: https://github.com/apache/spark/pull/39381#issuecomment-1409594093 > Is there a backport for 3.3. too? Created https://github.com/apache/spark/pull/39813. Sorry for the delay.
[GitHub] [spark] gengliangwang commented on a diff in pull request #39773: [SPARK-42217][SQL] Support implicit lateral column alias in queries with Window
gengliangwang commented on code in PR #39773: URL: https://github.com/apache/spark/pull/39773#discussion_r1091320427 ## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/ResolveLateralColumnAliasReference.scala: ## @@ -17,7 +17,8 @@ package org.apache.spark.sql.catalyst.analysis -import org.apache.spark.sql.catalyst.expressions.{Alias, Attribute, AttributeMap, Expression, LateralColumnAliasReference, NamedExpression, ScalarSubquery} +import org.apache.spark.sql.catalyst.expressions.{Alias, Attribute, AttributeMap, Expression, LateralColumnAliasReference, NamedExpression, ScalarSubquery, WindowExpression, WindowSpecDefinition} Review Comment: nit: let's just use `import org.apache.spark.sql.catalyst.expressions._`
[GitHub] [spark] fe2s opened a new pull request, #39813: [SPARK-41554] fix changing of Decimal scale when scale decreased by m…
fe2s opened a new pull request, #39813: URL: https://github.com/apache/spark/pull/39813 …ore than 18. This is a backport PR for https://github.com/apache/spark/pull/39099.
[GitHub] [spark] zhenlineo commented on pull request #39789: [SPARK-42228][BUILD][CONNECT] Add shade and relocation rule of grpc to connect-client-jvm module
zhenlineo commented on PR #39789: URL: https://github.com/apache/spark/pull/39789#issuecomment-1409593261 Hi @LuciferYang, thanks for fixing this error. Regarding the E2E test, I want to clarify its purpose: it is used by the Scala client to verify that client methods return the correct results. Due to the nature of client development, we can hardly verify many behaviors without a real server, so the E2E suite contains a lot of very basic tests and needs to run daily. It was never meant to test server or client shading. The reason we use the server's shaded jar is that it's the only way to start Spark with Spark Connect. I am not much in favor of the idea of using E2E for client/server shading tests as well. Can I suggest we add a separate test for client/server shading, following the idea that each test tests only one thing? Many thanks for helping to look into better E2E tests. I also tried putting the E2E tests in a separate module, but it did not bring much benefit; it only created a module with no production code, just tests. It would be really great if you could work out a way to run the tests as if they were unit tests. Thanks again.
[GitHub] [spark] techaddict commented on pull request #39614: [SPARK-42002][CONNECT][PYTHON] Implement DataFrameWriterV2
techaddict commented on PR #39614: URL: https://github.com/apache/spark/pull/39614#issuecomment-1409591664 @HyukjinKwon thanks, updated the PR
[GitHub] [spark] gengliangwang commented on a diff in pull request #39773: [SPARK-42217][SQL] Support implicit lateral column alias in queries with Window
gengliangwang commented on code in PR #39773: URL: https://github.com/apache/spark/pull/39773#discussion_r1091316642 ## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/ResolveLateralColumnAliasReference.scala: ## @@ -104,7 +112,10 @@ object ResolveLateralColumnAliasReference extends Rule[LogicalPlan] { // phase 2: unwrap plan.resolveOperatorsUpWithPruning( _.containsPattern(LATERAL_COLUMN_ALIAS_REFERENCE), ruleId) { -case p @ Project(projectList, child) if p.resolved +case p @ Project(projectList, child) if + (p.resolved +// can also work on nodes that contains Window but have all others resolved +|| (projectList.exists(hasWindowExpression) && exprsAndChildResolved(p))) Review Comment: `p.resolved || (projectList.exists(hasWindowExpression) && exprsAndChildResolved(p))` shows up multiple times. Shall we abstract it into a method?
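A sketch of the suggested extraction; the helper name is an assumption, while `hasWindowExpression` and `exprsAndChildResolved` are the helpers already defined in this rule (see the diff above):

```scala
// Hypothetical helper: the Project is either fully resolved, or it contains a
// window expression while its own expressions and child are already resolved.
private def resolvedOrHasResolvableWindow(p: Project): Boolean = {
  p.resolved ||
    (p.projectList.exists(hasWindowExpression) && exprsAndChildResolved(p))
}
```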
[GitHub] [spark] gengliangwang commented on a diff in pull request #39773: [SPARK-42217][SQL] Support implicit lateral column alias in queries with Window
gengliangwang commented on code in PR #39773: URL: https://github.com/apache/spark/pull/39773#discussion_r1091315214 ## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/ResolveLateralColumnAliasReference.scala: ## @@ -81,6 +82,9 @@ import org.apache.spark.sql.internal.SQLConf *+- Aggregate [dept#14] [avg(salary#16) AS avg(salary)#26, avg(bonus#17) AS avg(bonus)#27, *dept#14] * +- Child [dept#14,name#15,salary#16,bonus#17] + * + * + * TODO: example with Window Review Comment: @anchovYu could you fix the TODO here?
[GitHub] [spark] zhengruifeng commented on pull request #38947: [SPARK-41233][SQL] Add `array_prepend` function
zhengruifeng commented on PR #38947: URL: https://github.com/apache/spark/pull/38947#issuecomment-1409581323 @navinvishy mind fixing the Scala linter failure?
[GitHub] [spark] vinodkc commented on a diff in pull request #39747: [SPARK-42191][SQL] Support udf 'luhn_check'
vinodkc commented on code in PR #39747: URL: https://github.com/apache/spark/pull/39747#discussion_r1091310570 ## sql/catalyst/src/main/java/org/apache/spark/sql/catalyst/expressions/ExpressionImplUtils.java: ## @@ -35,6 +35,35 @@ public class ExpressionImplUtils { private static final int GCM_IV_LEN = 12; private static final int GCM_TAG_LEN = 128; + /** + * Function to check if a given number string is a valid Luhn number + * @param numberString + * the number string to check + * @return + * true if the number string is a valid Luhn number, false otherwise. + */ + public static boolean isLuhnNumber(UTF8String numberString) { +String digits = numberString.toString(); +// Check if all characters in the input string are digits. +if (!digits.trim().isEmpty() && digits.chars().allMatch(Character::isDigit)) { + int checkSum = 0; + boolean isSecond = false; + for (int i = digits.length() - 1; i >= 0; i--) { +int digit = Character.getNumericValue(digits.charAt(i)); Review Comment: Spaces are not supported in a Luhn number check, so it will return false.
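For readers following the thread, a compact reimplementation of the Luhn checksum under discussion (illustrative Scala, not the PR's Java code): every second digit from the right is doubled, doubled values above 9 drop 9, and the total must be divisible by 10; any non-digit character, including a space, fails the check.

```scala
// Minimal Luhn validity check; returns false for empty input or any non-digit.
def isLuhnNumber(s: String): Boolean =
  s.nonEmpty && s.forall(_.isDigit) && {
    val checkSum = s.reverse.zipWithIndex.map { case (c, i) =>
      val d = c.asDigit
      if (i % 2 == 1) {
        val doubled = d * 2
        if (doubled > 9) doubled - 9 else doubled
      } else d
    }.sum
    checkSum % 10 == 0
  }

isLuhnNumber("378282246310005")  // true: a well-known valid test number
isLuhnNumber("510B105105105106") // false: contains a non-digit
isLuhnNumber(" 4111")            // false: a space is not a digit
```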
[GitHub] [spark] itholic commented on a diff in pull request #39795: [SPARK-42234][SQL] Rename error class: `UNSUPPORTED_FEATURE.REPEATED_PIVOT`
itholic commented on code in PR #39795: URL: https://github.com/apache/spark/pull/39795#discussion_r1091304905 ## core/src/main/resources/error/error-classes.json: ## @@ -1207,6 +1207,12 @@ ], "sqlState" : "42K03" }, + "REPEATED_CLAUSE" : { +"message" : [ + "The clause may be used at most once per operation." +], +"sqlState" : "2202G" Review Comment: Thanks for the correction! Just updated to 42614.
[GitHub] [spark] itholic commented on a diff in pull request #39705: [SPARK-41488][SQL] Assign name to _LEGACY_ERROR_TEMP_1176 (and 1177)
itholic commented on code in PR #39705: URL: https://github.com/apache/spark/pull/39705#discussion_r1091295305 ## core/src/main/resources/error/error-classes.json: ## @@ -598,6 +598,29 @@ "Please try to re-create the view by running: ." ] }, + "INCOMPLETE_TYPE_DEFINITION" : { +"message" : [ + "Incomplete complex type:" +], +"subClass" : { + "ARRAY" : { +"message" : [ + "The definition of \"ARRAY\" type is incomplete. You must provide an element type. For example: \"ARRAY\"." Review Comment: Oh, actually `<elementType>` here is a parameter (unlike the other sub-error classes), so we already show a message like you suggested, as below: ```scala case "array" => new ParseException( errorClass = "INCOMPLETE_TYPE_DEFINITION.ARRAY", messageParameters = Map("elementType" -> ""), ctx) case "struct" => new ParseException( errorClass = "INCOMPLETE_TYPE_DEFINITION.STRUCT", messageParameters = Map.empty, ctx) case "map" => new ParseException( errorClass = "INCOMPLETE_TYPE_DEFINITION.MAP", messageParameters = Map.empty, ctx) ``` It looks confusing tho... 😅
[GitHub] [spark] srielau commented on a diff in pull request #39795: [SPARK-42234][SQL] Rename error class: `UNSUPPORTED_FEATURE.REPEATED_PIVOT`
srielau commented on code in PR #39795: URL: https://github.com/apache/spark/pull/39795#discussion_r1091299476 ## core/src/main/resources/error/error-classes.json: ## @@ -1207,6 +1207,12 @@ ], "sqlState" : "42K03" }, + "REPEATED_CLAUSE" : { +"message" : [ + "The clause may be used at most once per operation." +], +"sqlState" : "2202G" Review Comment: I mistyped. It should have been 42614, "Duplicate clause". 42616 is Db2 LUW specific (I drew on DB2 z/OS to fill the table).
[GitHub] [spark] srielau commented on a diff in pull request #39795: [SPARK-42234][SQL] Rename error class: `UNSUPPORTED_FEATURE.REPEATED_PIVOT`
srielau commented on code in PR #39795: URL: https://github.com/apache/spark/pull/39795#discussion_r1090896713 ## core/src/main/resources/error/error-classes.json: ## @@ -1207,6 +1207,12 @@ ], "sqlState" : "42K03" }, + "REPEATED_CLAUSE" : { +"message" : [ + "The clause may be used at most once per operation." +], +"sqlState" : "2202G" Review Comment: 2202G ("invalid repeat argument in a sample clause") is quite specific. ```suggestion "sqlState" : "42614" ```
[GitHub] [spark] HyukjinKwon commented on pull request #39614: [SPARK-42002][CONNECT][PYTHON] Implement DataFrameWriterV2
HyukjinKwon commented on PR #39614: URL: https://github.com/apache/spark/pull/39614#issuecomment-1409559076 @techaddict mind rebasing this so we can get the latest test results? Sorry, it slipped through my fingers.
[GitHub] [spark] itholic commented on pull request #39797: [SPARK-42231][SQL] Turn `MISSING_STATIC_PARTITION_COLUMN` into `internalError`
itholic commented on PR #39797: URL: https://github.com/apache/spark/pull/39797#issuecomment-1409558942 Tests passed. cc @MaxGekk @srielau @cloud-fan FYI
[GitHub] [spark] HyukjinKwon commented on a diff in pull request #39614: [SPARK-42002][CONNECT][PYTHON] Implement DataFrameWriterV2
HyukjinKwon commented on code in PR #39614: URL: https://github.com/apache/spark/pull/39614#discussion_r1091297030 ## python/pyspark/sql/connect/plan.py: ## @@ -1413,6 +1413,95 @@ def _repr_html_(self) -> str: pass +class WriteOperationV2(LogicalPlan): +def __init__(self, child: "LogicalPlan", table_name: str) -> None: +super(WriteOperationV2, self).__init__(child) +self.table_name: Optional[str] = table_name +self.provider: Optional[str] = None +self.partitioning_columns: List["ColumnOrName"] = [] +self.options: dict[str, Optional[str]] = {} +self.table_properties: dict[str, Optional[str]] = {} +self.mode: Optional[str] = None +self.overwrite_condition: Optional["ColumnOrName"] = None + +def col_to_expr(self, col: "ColumnOrName", session: "SparkConnectClient") -> proto.Expression: +if isinstance(col, Column): +return col.to_plan(session) +else: +return self.unresolved_attr(col) + +def command(self, session: "SparkConnectClient") -> proto.Command: +assert self._child is not None +plan = proto.Command() +plan.write_operation_v2.input.CopyFrom(self._child.plan(session)) +if self.table_name is not None: +plan.write_operation_v2.table_name = self.table_name +if self.provider is not None: +plan.write_operation_v2.provider = self.provider + +plan.write_operation_v2.partitioning_columns.extend( +[self.col_to_expr(x, session) for x in self.partitioning_columns] ) + +for k in self.options: +if self.options[k] is None: +plan.write_operation_v2.options.pop(k, None) +else: +plan.write_operation_v2.options[k] = cast(str, self.options[k]) + +for k in self.table_properties: +if self.table_properties[k] is None: +plan.write_operation_v2.table_properties.pop(k, None) +else: +plan.write_operation_v2.table_properties[k] = cast(str, self.table_properties[k]) + +if self.mode is not None: +wm = self.mode.lower() +if wm == "create": +plan.write_operation_v2.mode = proto.WriteOperationV2.Mode.MODE_CREATE +elif wm == "overwrite": +plan.write_operation_v2.mode = proto.WriteOperationV2.Mode.MODE_OVERWRITE +elif wm == "overwrite_partition": +plan.write_operation_v2.mode = proto.WriteOperationV2.Mode.MODE_OVERWRITE_PARTITIONS +elif wm == "append": +plan.write_operation_v2.mode = proto.WriteOperationV2.Mode.MODE_APPEND +elif wm == "replace": +plan.write_operation_v2.mode = proto.WriteOperationV2.Mode.MODE_REPLACE +if self.overwrite_condition is not None: +plan.write_operation_v2.overwrite_condition.CopyFrom( +self.col_to_expr(self.overwrite_condition, session) ) +elif wm == "create_or_replace": +plan.write_operation_v2.mode = proto.WriteOperationV2.Mode.MODE_CREATE_OR_REPLACE +else: +raise ValueError(f"Unknown Mode value for DataFrame: {self.mode}") +return plan + +def print(self, indent: int = 0) -> str: +i = " " * indent +return ( +f"{i}" +f"" +) + +def _repr_html_(self) -> str: Review Comment: I actually think you can remove the `print` and `_repr_html_` implementations (and let the default implementation handle them).
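For orientation, the DataFrame API this plan node serves looks like this on the Scala side (shown purely for context, with `df` being any DataFrame; the Python Connect client mirrors the same calls):

```scala
// Typical DataFrameWriterV2 usage; each call maps onto a field of the
// WriteOperationV2 message assembled in command() above.
df.writeTo("catalog.db.events")        // -> table_name
  .using("parquet")                    // -> provider
  .tableProperty("owner", "data-eng")  // -> table_properties
  .partitionedBy($"date")              // -> partitioning_columns
  .createOrReplace()                   // -> MODE_CREATE_OR_REPLACE
```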