[GitHub] [spark] wayneguow commented on pull request #39819: [SPARK-42252][CORE] Deprecate spark.shuffle.unsafe.file.output.buffer and add a new config

2023-01-30 Thread via GitHub


wayneguow commented on PR #39819:
URL: https://github.com/apache/spark/pull/39819#issuecomment-1409895352

   cc @mccheah and @HyukjinKwon 





[GitHub] [spark] peter-toth commented on a diff in pull request #37525: [SPARK-40086][SPARK-42049][SQL] Improve AliasAwareOutputPartitioning and AliasAwareQueryOutputOrdering to take all aliases into

2023-01-30 Thread via GitHub


peter-toth commented on code in PR #37525:
URL: https://github.com/apache/spark/pull/37525#discussion_r1091544557


##
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/AliasAwareOutputExpression.scala:
##
@@ -0,0 +1,116 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.catalyst.plans
+
+import scala.collection.mutable
+
+import org.apache.spark.sql.catalyst.SQLConfHelper
+import org.apache.spark.sql.catalyst.expressions.{Alias, Attribute, AttributeSet, Empty2Null, Expression, NamedExpression, SortOrder}
+import org.apache.spark.sql.internal.SQLConf
+
+/**
+ * A trait that provides functionality to handle aliases in the `outputExpressions`.
+ */
+trait AliasAwareOutputExpression extends SQLConfHelper {
+  protected val aliasCandidateLimit = conf.getConf(SQLConf.EXPRESSION_PROJECTION_CANDIDATE_LIMIT)
+  protected def outputExpressions: Seq[NamedExpression]
+  /**
+   * This method can be used to strip expression which does not affect the result, for example:
+   * strip the expression which is ordering agnostic for output ordering.
+   */
+  protected def strip(expr: Expression): Expression = expr
+
+  // Build an `Expression` -> `Attribute` alias map.
+  // There can be multiple alias defined for the same expressions but it doesn't make sense to store
+  // more than `aliasCandidateLimit` attributes for an expression. In those cases the old logic
+  // handled only the last alias so we need to make sure that we give precedence to that.
+  // If the `outputExpressions` contain simple attributes we need to add those too to the map.
+  private lazy val aliasMap = {
+    val aliases = mutable.Map[Expression, mutable.ArrayBuffer[Attribute]]()
+    outputExpressions.reverse.foreach {
+      case a @ Alias(child, _) =>
+        val buffer = aliases.getOrElseUpdate(strip(child.canonicalized), mutable.ArrayBuffer.empty)
+        if (buffer.size < aliasCandidateLimit) {
+          buffer += a.toAttribute
+        }
+      case _ =>
+    }
+    outputExpressions.foreach {
+      case a: Attribute if aliases.contains(a.canonicalized) =>
+        val buffer = aliases(a.canonicalized)
+        if (buffer.size < aliasCandidateLimit) {
+          buffer += a
+        }
+      case _ =>
+    }
+    aliases
+  }
+
+  protected def hasAlias: Boolean = aliasMap.nonEmpty
+
+  /**
+   * Return a stream of expressions in which the original expression is projected with `aliasMap`.
+   */
+  protected def projectExpression(expr: Expression): Stream[Expression] = {
+    val outputSet = AttributeSet(outputExpressions.map(_.toAttribute))
+    expr.multiTransformDown {
+      // Mapping with aliases
+      case e: Expression if aliasMap.contains(e.canonicalized) =>
+        aliasMap(e.canonicalized).toSeq ++ (if (e.containsChild.nonEmpty) Seq(e) else Seq.empty)
+
+      // Prune if we encounter an attribute that we can't map and it is not in output set.
+      // This prune will go up to the closest `multiTransformDown()` call and returns `Stream.empty`
+      // there.
+      case a: Attribute if !outputSet.contains(a) => Seq.empty
+    }
+  }
+}
+
+/**
+ * A trait that handles aliases in the `orderingExpressions` to produce `outputOrdering` that
+ * satisfies ordering requirements.
+ */
+trait AliasAwareQueryOutputOrdering[T <: QueryPlan[T]]
+  extends AliasAwareOutputExpression { self: QueryPlan[T] =>
+  protected def orderingExpressions: Seq[SortOrder]
+
+  override protected def strip(expr: Expression): Expression = expr match {
+    case e: Empty2Null => strip(e.child)
+    case _ => expr
+  }
+
+  override final def outputOrdering: Seq[SortOrder] = {
+    if (hasAlias) {
+      orderingExpressions.flatMap { sortOrder =>
+        val orderingSet = mutable.Set.empty[Expression]
+        val sameOrderings = sortOrder.children.toStream
+          .flatMap(projectExpression)
+          .filter(e => orderingSet.add(e.canonicalized))
+          .take(aliasCandidateLimit)
+        if (sameOrderings.nonEmpty) {
+          Some(sortOrder.copy(child = sameOrderings.head,
+            sameOrderExpressions = sameOrderings.tail))
+        } else {
+          None
+        }

Review Comment:
   Hmm, I might see

[GitHub] [spark] itholic opened a new pull request, #39821: [SPARK-42253][PYTHON] Add test for detecting duplicated error class

2023-01-30 Thread via GitHub


itholic opened a new pull request, #39821:
URL: https://github.com/apache/spark/pull/39821

   ### What changes were proposed in this pull request?
   
   This PR proposes to add a test for detecting duplicated error class names, to keep error classes unique.
   
   
   ### Why are the changes needed?
   
   The name of an error class should be unique, so we should check whether any name is duplicated.
   
   
   ### Does this PR introduce _any_ user-facing change?
   
   
   No.
   
   
   ### How was this patch tested?
   
   
   Manually tested the case where `COLUMN_IN_LIST` is duplicated, as below:
   ```shell
   ==
   FAIL [0.006s]: test_error_classes_duplicated (pyspark.errors.tests.test_errors.ErrorsTest)
   --
   Traceback (most recent call last):
   ...
   AssertionError: False is not true : Duplicate error class: COLUMN_IN_LIST
   
   --
   Ran 2 tests in 0.007s
   
   FAILED (failures=1)
   ```
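   As a rough illustration of the kind of check (the PR's actual test is in PySpark; the file name and regex below are assumptions for this sketch), a JSON parser silently keeps only one entry for a duplicated key, so the duplicate has to be found on the raw text:
   ```scala
   // Sketch only: scan the raw JSON for error class names and assert uniqueness.
   import scala.io.Source

   val raw = Source.fromFile("error_classes.json").mkString  // hypothetical file name
   val classNamePattern = """"([A-Z_]+)"\s*:\s*\{""".r
   val names = classNamePattern.findAllMatchIn(raw).map(_.group(1)).toSeq
   val duplicated = names.diff(names.distinct).distinct
   assert(duplicated.isEmpty, s"Duplicate error class: ${duplicated.mkString(", ")}")
   ```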
   





[GitHub] [spark] ulysses-you commented on a diff in pull request #37525: [SPARK-40086][SPARK-42049][SQL] Improve AliasAwareOutputPartitioning and AliasAwareQueryOutputOrdering to take all aliases into

2023-01-30 Thread via GitHub


ulysses-you commented on code in PR #37525:
URL: https://github.com/apache/spark/pull/37525#discussion_r1091535346


##
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/AliasAwareOutputExpression.scala:
##
@@ -0,0 +1,116 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.catalyst.plans
+
+import scala.collection.mutable
+
+import org.apache.spark.sql.catalyst.SQLConfHelper
+import org.apache.spark.sql.catalyst.expressions.{Alias, Attribute, AttributeSet, Empty2Null, Expression, NamedExpression, SortOrder}
+import org.apache.spark.sql.internal.SQLConf
+
+/**
+ * A trait that provides functionality to handle aliases in the `outputExpressions`.
+ */
+trait AliasAwareOutputExpression extends SQLConfHelper {
+  protected val aliasCandidateLimit = conf.getConf(SQLConf.EXPRESSION_PROJECTION_CANDIDATE_LIMIT)
+  protected def outputExpressions: Seq[NamedExpression]
+  /**
+   * This method can be used to strip expression which does not affect the result, for example:
+   * strip the expression which is ordering agnostic for output ordering.
+   */
+  protected def strip(expr: Expression): Expression = expr
+
+  // Build an `Expression` -> `Attribute` alias map.
+  // There can be multiple alias defined for the same expressions but it doesn't make sense to store
+  // more than `aliasCandidateLimit` attributes for an expression. In those cases the old logic
+  // handled only the last alias so we need to make sure that we give precedence to that.
+  // If the `outputExpressions` contain simple attributes we need to add those too to the map.
+  private lazy val aliasMap = {
+    val aliases = mutable.Map[Expression, mutable.ArrayBuffer[Attribute]]()
+    outputExpressions.reverse.foreach {
+      case a @ Alias(child, _) =>
+        val buffer = aliases.getOrElseUpdate(strip(child.canonicalized), mutable.ArrayBuffer.empty)
+        if (buffer.size < aliasCandidateLimit) {
+          buffer += a.toAttribute
+        }
+      case _ =>
+    }
+    outputExpressions.foreach {
+      case a: Attribute if aliases.contains(a.canonicalized) =>

Review Comment:
   What's the behavior of `c as a`? This code seems to return both `c` and `a`. I think the right way should be
   `if AttributeSet(outputExpressions).contains(a) => // add a to buffer`






[GitHub] [spark] yaooqinn commented on pull request #39555: [SPARK-42051][SQL] Codegen Support for HiveGenericUDF

2023-01-30 Thread via GitHub


yaooqinn commented on PR #39555:
URL: https://github.com/apache/spark/pull/39555#issuecomment-1409881348

   Belated Happy new year! 
   Comments addressed, thank you all.





[GitHub] [spark] peter-toth commented on a diff in pull request #37525: [SPARK-40086][SPARK-42049][SQL] Improve AliasAwareOutputPartitioning and AliasAwareQueryOutputOrdering to take all aliases into

2023-01-30 Thread via GitHub


peter-toth commented on code in PR #37525:
URL: https://github.com/apache/spark/pull/37525#discussion_r1091531169


##
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/AliasAwareOutputExpression.scala:
##
@@ -0,0 +1,116 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.catalyst.plans
+
+import scala.collection.mutable
+
+import org.apache.spark.sql.catalyst.SQLConfHelper
+import org.apache.spark.sql.catalyst.expressions.{Alias, Attribute, AttributeSet, Empty2Null, Expression, NamedExpression, SortOrder}
+import org.apache.spark.sql.internal.SQLConf
+
+/**
+ * A trait that provides functionality to handle aliases in the `outputExpressions`.
+ */
+trait AliasAwareOutputExpression extends SQLConfHelper {
+  protected val aliasCandidateLimit = conf.getConf(SQLConf.EXPRESSION_PROJECTION_CANDIDATE_LIMIT)
+  protected def outputExpressions: Seq[NamedExpression]
+  /**
+   * This method can be used to strip expression which does not affect the result, for example:
+   * strip the expression which is ordering agnostic for output ordering.
+   */
+  protected def strip(expr: Expression): Expression = expr
+
+  // Build an `Expression` -> `Attribute` alias map.
+  // There can be multiple alias defined for the same expressions but it doesn't make sense to store
+  // more than `aliasCandidateLimit` attributes for an expression. In those cases the old logic
+  // handled only the last alias so we need to make sure that we give precedence to that.
+  // If the `outputExpressions` contain simple attributes we need to add those too to the map.
+  private lazy val aliasMap = {
+    val aliases = mutable.Map[Expression, mutable.ArrayBuffer[Attribute]]()
+    outputExpressions.reverse.foreach {
+      case a @ Alias(child, _) =>
+        val buffer = aliases.getOrElseUpdate(strip(child.canonicalized), mutable.ArrayBuffer.empty)
+        if (buffer.size < aliasCandidateLimit) {
+          buffer += a.toAttribute
+        }
+      case _ =>
+    }
+    outputExpressions.foreach {
+      case a: Attribute if aliases.contains(a.canonicalized) =>
+        val buffer = aliases(a.canonicalized)
+        if (buffer.size < aliasCandidateLimit) {
+          buffer += a
+        }
+      case _ =>
+    }
+    aliases
+  }
+
+  protected def hasAlias: Boolean = aliasMap.nonEmpty
+
+  /**
+   * Return a stream of expressions in which the original expression is projected with `aliasMap`.
+   */
+  protected def projectExpression(expr: Expression): Stream[Expression] = {
+    val outputSet = AttributeSet(outputExpressions.map(_.toAttribute))
+    expr.multiTransformDown {
+      // Mapping with aliases
+      case e: Expression if aliasMap.contains(e.canonicalized) =>
+        aliasMap(e.canonicalized).toSeq ++ (if (e.containsChild.nonEmpty) Seq(e) else Seq.empty)
+
+      // Prune if we encounter an attribute that we can't map and it is not in output set.
+      // This prune will go up to the closest `multiTransformDown()` call and returns `Stream.empty`
+      // there.
+      case a: Attribute if !outputSet.contains(a) => Seq.empty
+    }
+  }
+}
+
+/**
+ * A trait that handles aliases in the `orderingExpressions` to produce `outputOrdering` that
+ * satisfies ordering requirements.
+ */
+trait AliasAwareQueryOutputOrdering[T <: QueryPlan[T]]
+  extends AliasAwareOutputExpression { self: QueryPlan[T] =>
+  protected def orderingExpressions: Seq[SortOrder]
+
+  override protected def strip(expr: Expression): Expression = expr match {
+    case e: Empty2Null => strip(e.child)
+    case _ => expr
+  }
+
+  override final def outputOrdering: Seq[SortOrder] = {
+    if (hasAlias) {
+      orderingExpressions.flatMap { sortOrder =>
+        val orderingSet = mutable.Set.empty[Expression]
+        val sameOrderings = sortOrder.children.toStream
+          .flatMap(projectExpression)
+          .filter(e => orderingSet.add(e.canonicalized))
+          .take(aliasCandidateLimit)
+        if (sameOrderings.nonEmpty) {
+          Some(sortOrder.copy(child = sameOrderings.head,
+            sameOrderExpressions = sameOrderings.tail))
+        } else {
+          None
+        }

Review Comment:
   I'm not sure I g

[GitHub] [spark] peter-toth commented on a diff in pull request #37525: [SPARK-40086][SPARK-42049][SQL] Improve AliasAwareOutputPartitioning and AliasAwareQueryOutputOrdering to take all aliases into

2023-01-30 Thread via GitHub


peter-toth commented on code in PR #37525:
URL: https://github.com/apache/spark/pull/37525#discussion_r1091527242


##
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/AliasAwareOutputExpression.scala:
##
@@ -0,0 +1,116 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.catalyst.plans
+
+import scala.collection.mutable
+
+import org.apache.spark.sql.catalyst.SQLConfHelper
+import org.apache.spark.sql.catalyst.expressions.{Alias, Attribute, AttributeSet, Empty2Null, Expression, NamedExpression, SortOrder}
+import org.apache.spark.sql.internal.SQLConf
+
+/**
+ * A trait that provides functionality to handle aliases in the `outputExpressions`.
+ */
+trait AliasAwareOutputExpression extends SQLConfHelper {
+  protected val aliasCandidateLimit = conf.getConf(SQLConf.EXPRESSION_PROJECTION_CANDIDATE_LIMIT)
+  protected def outputExpressions: Seq[NamedExpression]
+  /**
+   * This method can be used to strip expression which does not affect the result, for example:
+   * strip the expression which is ordering agnostic for output ordering.
+   */
+  protected def strip(expr: Expression): Expression = expr
+
+  // Build an `Expression` -> `Attribute` alias map.
+  // There can be multiple alias defined for the same expressions but it doesn't make sense to store
+  // more than `aliasCandidateLimit` attributes for an expression. In those cases the old logic
+  // handled only the last alias so we need to make sure that we give precedence to that.
+  // If the `outputExpressions` contain simple attributes we need to add those too to the map.
+  private lazy val aliasMap = {
+    val aliases = mutable.Map[Expression, mutable.ArrayBuffer[Attribute]]()
+    outputExpressions.reverse.foreach {
+      case a @ Alias(child, _) =>
+        val buffer = aliases.getOrElseUpdate(strip(child.canonicalized), mutable.ArrayBuffer.empty)
+        if (buffer.size < aliasCandidateLimit) {
+          buffer += a.toAttribute
+        }
+      case _ =>
+    }
+    outputExpressions.foreach {
+      case a: Attribute if aliases.contains(a.canonicalized) =>

Review Comment:
   If we have a `c, c AS a` projection then we need to add both the original `c` attribute and `a` to the alternatives of `c`. But we don't need to add an attribute if it isn't aliased.
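   A minimal sketch, not from the PR, spelling this case out with Catalyst expressions (the attribute and alias below are made up for illustration):
   ```scala
   import org.apache.spark.sql.catalyst.expressions.{Alias, AttributeReference}
   import org.apache.spark.sql.types.IntegerType

   // Projection `c, c AS a`
   val c = AttributeReference("c", IntegerType)()
   val a = Alias(c, "a")()
   val outputExpressions = Seq(c, a)

   // First (reversed) pass over the aliases: strip(c).canonicalized -> [a.toAttribute]
   // Second pass adds the plain attribute, since `c` itself is also in the output:
   //   strip(c).canonicalized -> [a.toAttribute, c]
   // so an ordering or partitioning expressed on `c` can be satisfied by either `a` or `c`.
   ```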






[GitHub] [spark] peter-toth commented on a diff in pull request #37525: [SPARK-40086][SPARK-42049][SQL] Improve AliasAwareOutputPartitioning and AliasAwareQueryOutputOrdering to take all aliases into

2023-01-30 Thread via GitHub


peter-toth commented on code in PR #37525:
URL: https://github.com/apache/spark/pull/37525#discussion_r1091527242


##
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/AliasAwareOutputExpression.scala:
##
@@ -0,0 +1,116 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.catalyst.plans
+
+import scala.collection.mutable
+
+import org.apache.spark.sql.catalyst.SQLConfHelper
+import org.apache.spark.sql.catalyst.expressions.{Alias, Attribute, AttributeSet, Empty2Null, Expression, NamedExpression, SortOrder}
+import org.apache.spark.sql.internal.SQLConf
+
+/**
+ * A trait that provides functionality to handle aliases in the `outputExpressions`.
+ */
+trait AliasAwareOutputExpression extends SQLConfHelper {
+  protected val aliasCandidateLimit = conf.getConf(SQLConf.EXPRESSION_PROJECTION_CANDIDATE_LIMIT)
+  protected def outputExpressions: Seq[NamedExpression]
+  /**
+   * This method can be used to strip expression which does not affect the result, for example:
+   * strip the expression which is ordering agnostic for output ordering.
+   */
+  protected def strip(expr: Expression): Expression = expr
+
+  // Build an `Expression` -> `Attribute` alias map.
+  // There can be multiple alias defined for the same expressions but it doesn't make sense to store
+  // more than `aliasCandidateLimit` attributes for an expression. In those cases the old logic
+  // handled only the last alias so we need to make sure that we give precedence to that.
+  // If the `outputExpressions` contain simple attributes we need to add those too to the map.
+  private lazy val aliasMap = {
+    val aliases = mutable.Map[Expression, mutable.ArrayBuffer[Attribute]]()
+    outputExpressions.reverse.foreach {
+      case a @ Alias(child, _) =>
+        val buffer = aliases.getOrElseUpdate(strip(child.canonicalized), mutable.ArrayBuffer.empty)
+        if (buffer.size < aliasCandidateLimit) {
+          buffer += a.toAttribute
+        }
+      case _ =>
+    }
+    outputExpressions.foreach {
+      case a: Attribute if aliases.contains(a.canonicalized) =>

Review Comment:
   If we have a `c, c AS a` projection then we need to add both the original `c` attribute and `a` to the alternatives of `a`. But we don't need to add an attribute if it isn't aliased.






[GitHub] [spark] peter-toth commented on a diff in pull request #37525: [SPARK-40086][SPARK-42049][SQL] Improve AliasAwareOutputPartitioning and AliasAwareQueryOutputOrdering to take all aliases into

2023-01-30 Thread via GitHub


peter-toth commented on code in PR #37525:
URL: https://github.com/apache/spark/pull/37525#discussion_r1091525727


##
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/AliasAwareOutputExpression.scala:
##
@@ -0,0 +1,116 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.catalyst.plans
+
+import scala.collection.mutable
+
+import org.apache.spark.sql.catalyst.SQLConfHelper
+import org.apache.spark.sql.catalyst.expressions.{Alias, Attribute, AttributeSet, Empty2Null, Expression, NamedExpression, SortOrder}
+import org.apache.spark.sql.internal.SQLConf
+
+/**
+ * A trait that provides functionality to handle aliases in the `outputExpressions`.
+ */
+trait AliasAwareOutputExpression extends SQLConfHelper {
+  protected val aliasCandidateLimit = conf.getConf(SQLConf.EXPRESSION_PROJECTION_CANDIDATE_LIMIT)
+  protected def outputExpressions: Seq[NamedExpression]
+  /**
+   * This method can be used to strip expression which does not affect the result, for example:
+   * strip the expression which is ordering agnostic for output ordering.
+   */
+  protected def strip(expr: Expression): Expression = expr
+
+  // Build an `Expression` -> `Attribute` alias map.
+  // There can be multiple alias defined for the same expressions but it doesn't make sense to store
+  // more than `aliasCandidateLimit` attributes for an expression. In those cases the old logic
+  // handled only the last alias so we need to make sure that we give precedence to that.
+  // If the `outputExpressions` contain simple attributes we need to add those too to the map.
+  private lazy val aliasMap = {
+    val aliases = mutable.Map[Expression, mutable.ArrayBuffer[Attribute]]()
+    outputExpressions.reverse.foreach {
+      case a @ Alias(child, _) =>
+        val buffer = aliases.getOrElseUpdate(strip(child.canonicalized), mutable.ArrayBuffer.empty)

Review Comment:
   Ok, thanks. I will change it today.






[GitHub] [spark] HyukjinKwon closed pull request #39695: [SPARK-42156][CONNECT] SparkConnectClient supports RetryPolicies now

2023-01-30 Thread via GitHub


HyukjinKwon closed pull request #39695: [SPARK-42156][CONNECT] 
SparkConnectClient supports RetryPolicies now
URL: https://github.com/apache/spark/pull/39695





[GitHub] [spark] HyukjinKwon commented on pull request #39695: [SPARK-42156][CONNECT] SparkConnectClient supports RetryPolicies now

2023-01-30 Thread via GitHub


HyukjinKwon commented on PR #39695:
URL: https://github.com/apache/spark/pull/39695#issuecomment-1409865640

   Merged to master and branch-3.4.





[GitHub] [spark] peter-toth commented on a diff in pull request #37525: [SPARK-40086][SPARK-42049][SQL] Improve AliasAwareOutputPartitioning and AliasAwareQueryOutputOrdering to take all aliases into

2023-01-30 Thread via GitHub


peter-toth commented on code in PR #37525:
URL: https://github.com/apache/spark/pull/37525#discussion_r1091525383


##
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/AliasAwareOutputExpression.scala:
##
@@ -0,0 +1,116 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.catalyst.plans
+
+import scala.collection.mutable
+
+import org.apache.spark.sql.catalyst.SQLConfHelper
+import org.apache.spark.sql.catalyst.expressions.{Alias, Attribute, AttributeSet, Empty2Null, Expression, NamedExpression, SortOrder}
+import org.apache.spark.sql.internal.SQLConf
+
+/**
+ * A trait that provides functionality to handle aliases in the `outputExpressions`.
+ */
+trait AliasAwareOutputExpression extends SQLConfHelper {
+  protected val aliasCandidateLimit = conf.getConf(SQLConf.EXPRESSION_PROJECTION_CANDIDATE_LIMIT)
+  protected def outputExpressions: Seq[NamedExpression]
+  /**
+   * This method can be used to strip expression which does not affect the result, for example:
+   * strip the expression which is ordering agnostic for output ordering.
+   */
+  protected def strip(expr: Expression): Expression = expr
+
+  // Build an `Expression` -> `Attribute` alias map.
+  // There can be multiple alias defined for the same expressions but it doesn't make sense to store
+  // more than `aliasCandidateLimit` attributes for an expression. In those cases the old logic
+  // handled only the last alias so we need to make sure that we give precedence to that.
+  // If the `outputExpressions` contain simple attributes we need to add those too to the map.
+  private lazy val aliasMap = {
+    val aliases = mutable.Map[Expression, mutable.ArrayBuffer[Attribute]]()

Review Comment:
   I'm ok to add a new config, but `aliasMap` will never be bigger than the projection (`outputExpressions`), so is this a real concern?






[GitHub] [spark] zhengruifeng commented on pull request #39818: [SPARK-42023][SPARK-42024][CONNECT][PYTHON] Make `createDataFrame` support `AtomicType -> StringType` coercion

2023-01-30 Thread via GitHub


zhengruifeng commented on PR #39818:
URL: https://github.com/apache/spark/pull/39818#issuecomment-1409861573

   @HyukjinKwon thank you for the review





[GitHub] [spark] gengliangwang closed pull request #39812: [SPARK-42243][SQL] Use `spark.sql.inferTimestampNTZInDataSources.enabled` to infer timestamp type on partition columns

2023-01-30 Thread via GitHub


gengliangwang closed pull request #39812: [SPARK-42243][SQL] Use 
`spark.sql.inferTimestampNTZInDataSources.enabled` to infer timestamp type on 
partition columns
URL: https://github.com/apache/spark/pull/39812





[GitHub] [spark] gengliangwang commented on pull request #39812: [SPARK-42243][SQL] Use `spark.sql.inferTimestampNTZInDataSources.enabled` to infer timestamp type on partition columns

2023-01-30 Thread via GitHub


gengliangwang commented on PR #39812:
URL: https://github.com/apache/spark/pull/39812#issuecomment-1409859608

   Merging to master/3.4. cc @xinrong-meng 





[GitHub] [spark] HyukjinKwon closed pull request #39818: [SPARK-42023][SPARK-42024][CONNECT][PYTHON] Make `createDataFrame` support `AtomicType -> StringType` coercion

2023-01-30 Thread via GitHub


HyukjinKwon closed pull request #39818: 
[SPARK-42023][SPARK-42024][CONNECT][PYTHON] Make `createDataFrame` support 
`AtomicType -> StringType` coercion
URL: https://github.com/apache/spark/pull/39818





[GitHub] [spark] HyukjinKwon commented on pull request #39818: [SPARK-42023][SPARK-42024][CONNECT][PYTHON] Make `createDataFrame` support `AtomicType -> StringType` coercion

2023-01-30 Thread via GitHub


HyukjinKwon commented on PR #39818:
URL: https://github.com/apache/spark/pull/39818#issuecomment-1409835073

   Merged to master and branch-3.4.





[GitHub] [spark] gengliangwang commented on a diff in pull request #39773: [SPARK-42217][SQL] Support implicit lateral column alias in queries with Window

2023-01-30 Thread via GitHub


gengliangwang commented on code in PR #39773:
URL: https://github.com/apache/spark/pull/39773#discussion_r1091498423


##
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/ResolveLateralColumnAliasReference.scala:
##
@@ -177,8 +195,18 @@ object ResolveLateralColumnAliasReference extends Rule[LogicalPlan] {
 
       val newAggExprs = collection.mutable.Set.empty[NamedExpression]
      val expressionMap = collection.mutable.LinkedHashMap.empty[Expression, NamedExpression]
-      val projectExprs = aggregateExpressions.map { exp =>
-        exp.transformDown {
+      // Extract the expressions to keep in the Aggregate. Return the transformed expression
+      // fully substituted with the attribute reference to the extracted expressions.
+      def extractExpressions(expr: Expression): Expression = {
+        expr match {
+          case w @ WindowExpression(function, spec) =>
+            // Manually skip the handling on the function itself but iterate on its children
+            // instead. This is to avoid extracting a window expression's aggregate functions.
+            // For example, for `sum(sum(col1)) over (..)`, we should extract sum(col1) but not

Review Comment:
   TBH I find this comment confusing... Can you provide more details?






[GitHub] [spark] dongjoon-hyun commented on pull request #39811: [SPARK-42242][BUILD] Upgrade `snappy-java` to 1.1.9.0

2023-01-30 Thread via GitHub


dongjoon-hyun commented on PR #39811:
URL: https://github.com/apache/spark/pull/39811#issuecomment-1409795852

   It's insufficient, @LuciferYang. The runtime classes (`rt.jar`) are required during the build; otherwise, compiling on Java 11 assumes the Java 11 runtime classes.
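   A hypothetical build fragment to illustrate the point (not snappy-java's actual settings): plain `-source`/`-target 8` still compiles against the newer JDK's class library unless an old `rt.jar` is supplied on the bootclasspath, whereas `--release 8` on JDK 9+ links against the real Java 8 platform API without needing `rt.jar`:
   ```scala
   // build.sbt sketch; names and paths are assumptions for illustration.
   javacOptions ++= Seq("--release", "8")
   // The older equivalent needs the Java 8 runtime classes to be safe:
   // javacOptions ++= Seq("-source", "8", "-target", "8",
   //   "-bootclasspath", "/path/to/jdk8/jre/lib/rt.jar")
   ```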





[GitHub] [spark] gengliangwang commented on a diff in pull request #39812: [SPARK-42243][SQL] Use `spark.sql.inferTimestampNTZInDataSources.enabled` to infer timestamp type on partition columns

2023-01-30 Thread via GitHub


gengliangwang commented on code in PR #39812:
URL: https://github.com/apache/spark/pull/39812#discussion_r1091472715


##
sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala:
##
@@ -3511,8 +3511,9 @@ object SQLConf {
     buildConf("spark.sql.inferTimestampNTZInDataSources.enabled")
       .doc("When true, the TimestampNTZ type is the prior choice of the schema inference " +
        "over built-in data sources. Otherwise, the inference result will be TimestampLTZ for " +
-        "backward compatibility. As a result, for JSON/CSV files written with TimestampNTZ " +
-        "columns, the inference results will still be of TimestampLTZ types.")
+        "backward compatibility. As a result, for JSON/CSV files and partition directories " +
+        "written with TimestampNTZ columns, the inference results will still be of TimestampLTZ " +
+        "types.")
       .version("3.4.0")
       .booleanConf
       .createWithDefault(false)

Review Comment:
   Yup. Taking partition directory naming formats as an example, the outputs from TimestampNTZ and TimestampLTZ are exactly the same.






[GitHub] [spark] gengliangwang commented on a diff in pull request #39812: [SPARK-42243][SQL] Use `spark.sql.inferTimestampNTZInDataSources.enabled` to infer timestamp type on partition columns

2023-01-30 Thread via GitHub


gengliangwang commented on code in PR #39812:
URL: https://github.com/apache/spark/pull/39812#discussion_r1091472715


##
sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala:
##
@@ -3511,8 +3511,9 @@ object SQLConf {
     buildConf("spark.sql.inferTimestampNTZInDataSources.enabled")
      .doc("When true, the TimestampNTZ type is the prior choice of the schema inference " +
        "over built-in data sources. Otherwise, the inference result will be TimestampLTZ for " +
-        "backward compatibility. As a result, for JSON/CSV files written with TimestampNTZ " +
-        "columns, the inference results will still be of TimestampLTZ types.")
+        "backward compatibility. As a result, for JSON/CSV files and partition directories " +
+        "written with TimestampNTZ columns, the inference results will still be of TimestampLTZ " +
+        "types.")
       .version("3.4.0")
       .booleanConf
       .createWithDefault(false)

Review Comment:
   Yup. Taking partition directory naming as an example, the outputs from TimestampNTZ and TimestampLTZ are exactly the same.






[GitHub] [spark] cloud-fan commented on a diff in pull request #39812: [SPARK-42243][SQL] Use `spark.sql.inferTimestampNTZInDataSources.enabled` to infer timestamp type on partition columns

2023-01-30 Thread via GitHub


cloud-fan commented on code in PR #39812:
URL: https://github.com/apache/spark/pull/39812#discussion_r1091468483


##
sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala:
##
@@ -3511,8 +3511,9 @@ object SQLConf {
     buildConf("spark.sql.inferTimestampNTZInDataSources.enabled")
      .doc("When true, the TimestampNTZ type is the prior choice of the schema inference " +
        "over built-in data sources. Otherwise, the inference result will be TimestampLTZ for " +
-        "backward compatibility. As a result, for JSON/CSV files written with TimestampNTZ " +
-        "columns, the inference results will still be of TimestampLTZ types.")
+        "backward compatibility. As a result, for JSON/CSV files and partition directories " +
+        "written with TimestampNTZ columns, the inference results will still be of TimestampLTZ " +
+        "types.")
       .version("3.4.0")
       .booleanConf
       .createWithDefault(false)

Review Comment:
   does it mean users can't do NTZ roundtrip (write and read) in 3.4 by default?
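   A minimal sketch, not from the PR, of the roundtrip in question under the default setting (paths and values are made up):
   ```scala
   import org.apache.spark.sql.SparkSession

   val spark = SparkSession.builder().master("local[1]").getOrCreate()

   // Write a table partitioned by a TimestampNTZ column.
   spark.sql("SELECT timestamp_ntz'2023-01-30 10:00:00' AS ts, 1 AS id")
     .write.mode("overwrite").partitionBy("ts").parquet("/tmp/ntz_roundtrip")

   // With spark.sql.inferTimestampNTZInDataSources.enabled=false (the default), the partition
   // column comes back as TimestampType (LTZ), since directory values carry no type information.
   spark.read.parquet("/tmp/ntz_roundtrip").printSchema()

   // Enabling the flag before reading makes the partition column be inferred as TimestampNTZ again.
   spark.conf.set("spark.sql.inferTimestampNTZInDataSources.enabled", "true")
   spark.read.parquet("/tmp/ntz_roundtrip").printSchema()
   ```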






[GitHub] [spark] cloud-fan commented on a diff in pull request #39747: [SPARK-42191][SQL] Support udf 'luhn_check'

2023-01-30 Thread via GitHub


cloud-fan commented on code in PR #39747:
URL: https://github.com/apache/spark/pull/39747#discussion_r1091465324


##
sql/catalyst/src/main/java/org/apache/spark/sql/catalyst/expressions/ExpressionImplUtils.java:
##
@@ -35,6 +35,35 @@ public class ExpressionImplUtils {
   private static final int GCM_IV_LEN = 12;
   private static final int GCM_TAG_LEN = 128;
 
+  /**
+   * Function to check if a given number string is a valid Luhn number
+   * @param numberString
+   *  the number string to check
+   * @return
+   *  true if the number string is a valid Luhn number, false otherwise.
+   */
+  public static boolean isLuhnNumber(UTF8String numberString) {
+    String digits = numberString.toString();
+    // Check if all characters in the input string are digits.
+    if (digits.isEmpty()) return false;
+      int checkSum = 0;

Review Comment:
   nit: the indentation is off.






[GitHub] [spark] cloud-fan commented on a diff in pull request #39747: [SPARK-42191][SQL] Support udf 'luhn_check'

2023-01-30 Thread via GitHub


cloud-fan commented on code in PR #39747:
URL: https://github.com/apache/spark/pull/39747#discussion_r1091465224


##
sql/catalyst/src/main/java/org/apache/spark/sql/catalyst/expressions/ExpressionImplUtils.java:
##
@@ -35,6 +35,35 @@ public class ExpressionImplUtils {
   private static final int GCM_IV_LEN = 12;
   private static final int GCM_TAG_LEN = 128;
 
+  /**
+   * Function to check if a given number string is a valid Luhn number
+   * @param numberString
+   *  the number string to check
+   * @return
+   *  true if the number string is a valid Luhn number, false otherwise.
+   */
+  public static boolean isLuhnNumber(UTF8String numberString) {
+    String digits = numberString.toString();
+    // Check if all characters in the input string are digits.

Review Comment:
   ```suggestion
   // Empty string is not a valid Luhn number
   ```






[GitHub] [spark] itholic commented on pull request #39820: [SPARK-42249][SQL] Refining html link for documentation in error messages.

2023-01-30 Thread via GitHub


itholic commented on PR #39820:
URL: https://github.com/apache/spark/pull/39820#issuecomment-1409760795

   cc @srielau @MaxGekk This basically refines the documentation HTML link strings in the error classes.





[GitHub] [spark] itholic opened a new pull request, #39820: [SPARK-42249][SQL] Refining html link for documentation in error messages.

2023-01-30 Thread via GitHub


itholic opened a new pull request, #39820:
URL: https://github.com/apache/spark/pull/39820

   
   
   ### What changes were proposed in this pull request?
   
   This PR proposes to refine the HTML links to documentation in error messages by introducing `Utils.DOC_ROOT_DIR`, which holds the global documentation root link: `https://spark.apache.org/docs/latest`
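   As a rough sketch of the idea (illustrative names, not the PR's actual code):
   ```scala
   // Keep the documentation root in a single constant and build error-message links from it.
   object DocLinks {
     val DOC_ROOT_DIR = "https://spark.apache.org/docs/latest"
   }

   val errorHint = s"See ${DocLinks.DOC_ROOT_DIR}/sql-migration-guide.html for more details."
   ```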
   
   
   ### Why are the changes needed?
   
   To improve error class readability and make sure the right documentation root directory is used across the whole code base.
   
   
   ### Does this PR introduce _any_ user-facing change?
   
   
   No.
   
   ### How was this patch tested?
   
   
   Updated all the existing tests related to these changes.
   
   Basically, run `./build/sbt "sql/testOnly org.apache.spark.sql.SQLQueryTestSuite*"`
   





[GitHub] [spark] cloud-fan commented on a diff in pull request #39747: [SPARK-42191][SQL] Support udf 'luhn_check'

2023-01-30 Thread via GitHub


cloud-fan commented on code in PR #39747:
URL: https://github.com/apache/spark/pull/39747#discussion_r1091450516


##
sql/catalyst/src/main/java/org/apache/spark/sql/catalyst/expressions/ExpressionImplUtils.java:
##
@@ -35,6 +35,35 @@ public class ExpressionImplUtils {
   private static final int GCM_IV_LEN = 12;
   private static final int GCM_TAG_LEN = 128;
 
+  /**
+   * Function to check if a given number string is a valid Luhn number
+   * @param numberString
+   *  the number string to check
+   * @return
+   *  true if the number string is a valid Luhn number, false otherwise.
+   */
+  public static boolean isLuhnNumber(UTF8String numberString) {
+    String digits = numberString.toString();
+    // Check if all characters in the input string are digits.
+    if (!digits.isEmpty() && digits.chars().allMatch(Character::isDigit)) {

Review Comment:
   As @srowen mentioned, how about returning false in the for loop if a space is found? The code can be:
   ```
   if (digits.isEmpty()) return false;
   for ...
 if (!Character.isDigit(digits.charAt(i))) return false;
 ...
   ```
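   For reference, a Scala sketch (not the PR's Java code) of that single-pass shape: bail out on empty input or any non-digit inside the loop, and fold the Luhn checksum right-to-left, doubling every second digit:
   ```scala
   def isLuhnNumber(digits: String): Boolean = {
     if (digits.isEmpty) return false
     var checkSum = 0
     var isSecond = false
     var i = digits.length - 1
     while (i >= 0) {
       val c = digits.charAt(i)
       if (!Character.isDigit(c)) return false  // covers spaces and any other non-digit
       var d = c - '0'
       if (isSecond) {
         d *= 2
         if (d > 9) d -= 9
       }
       checkSum += d
       isSecond = !isSecond
       i -= 1
     }
     checkSum % 10 == 0
   }
   ```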






[GitHub] [spark] LuciferYang commented on pull request #39811: [SPARK-42242][BUILD] Upgrade `snappy-java` to 1.1.9.0

2023-01-30 Thread via GitHub


LuciferYang commented on PR #39811:
URL: https://github.com/apache/spark/pull/39811#issuecomment-1409758415

   https://github.com/xerial/snappy-java/blob/master/build.sbt#L30
   
   https://user-images.githubusercontent.com/1475305/215669005-9d91c3d5-e0ab-4dda-aae8-9677ef410118.png
   
   Interesting. It seems that the compilation target is still Java 8...
   
   





[GitHub] [spark] vinodkc commented on a diff in pull request #39747: [SPARK-42191][SQL] Support udf 'luhn_check'

2023-01-30 Thread via GitHub


vinodkc commented on code in PR #39747:
URL: https://github.com/apache/spark/pull/39747#discussion_r1091447034


##
sql/core/src/test/resources/sql-tests/inputs/string-functions.sql:
##
@@ -231,3 +231,39 @@ CREATE TEMPORARY VIEW fmtTable(fmtField) AS SELECT * FROM VALUES ('invalidFormat
 SELECT to_binary('abc', fmtField) FROM fmtTable;
 -- Clean up
 DROP VIEW IF EXISTS fmtTable;
+-- luhn_check
+-- basic cases
+select luhn_check('4111');
+select luhn_check('5504');
+select luhn_check('349');
+select luhn_check('60110004');
+select luhn_check('378282246310005');
+select luhn_check('6011000990139424');
+
+-- spaces in the beginning/middle/end
+select luhn_check('4111');
+select luhn_check('411 1');
+select luhn_check(' 4111');
+-- space
+select luhn_check('');
+select luhn_check('  ');
+-- non-digits
+select luhn_check('510B105105105106');
+select luhn_check('ABCDED');
+-- null
+select luhn_check(null);
+-- non string (test implicit cast)
+select luhn_check(6017);
+select luhn_check(6018);
+select luhn_check(123.456);
+
+select luhn_check('0');
+
+SELECT luhn_check(c1) from values ("4111"), ("5105105105105106") as tab(c1);

Review Comment:
   Removed `from values` tests



##
sql/core/src/test/scala/org/apache/spark/sql/ColumnExpressionSuite.scala:
##
@@ -3081,4 +3081,19 @@ class ColumnExpressionSuite extends QueryTest with SharedSparkSession {
     checkAnswer(df.select($"dd" / ($"num" + 3)),
       Seq((Duration.ofDays(2))).toDF)
   }
+
+  test("luhn check") {

Review Comment:
   Done






[GitHub] [spark] vinodkc commented on a diff in pull request #39747: [SPARK-42191][SQL] Support udf 'luhn_check'

2023-01-30 Thread via GitHub


vinodkc commented on code in PR #39747:
URL: https://github.com/apache/spark/pull/39747#discussion_r1091446740


##
sql/core/src/test/resources/sql-tests/inputs/string-functions.sql:
##
@@ -231,3 +231,39 @@ CREATE TEMPORARY VIEW fmtTable(fmtField) AS SELECT * FROM VALUES ('invalidFormat
 SELECT to_binary('abc', fmtField) FROM fmtTable;
 -- Clean up
 DROP VIEW IF EXISTS fmtTable;
+-- luhn_check
+-- basic cases
+select luhn_check('4111');
+select luhn_check('5504');
+select luhn_check('349');
+select luhn_check('60110004');
+select luhn_check('378282246310005');
+select luhn_check('6011000990139424');

Review Comment:
   Updated input values to cover all code branches. 






[GitHub] [spark] cxzl25 commented on pull request #39798: [MINOR] Fix typo `Exlude` to `Exclude` in `HealthTracker`

2023-01-30 Thread via GitHub


cxzl25 commented on PR #39798:
URL: https://github.com/apache/spark/pull/39798#issuecomment-1409749792

   > Could you try rerunning tests? The failure isn't related, I'm sure
   
   Thanks, now GA passed.





[GitHub] [spark] vinodkc commented on a diff in pull request #39747: [SPARK-42191][SQL] Support udf 'luhn_check'

2023-01-30 Thread via GitHub


vinodkc commented on code in PR #39747:
URL: https://github.com/apache/spark/pull/39747#discussion_r1091446189


##
sql/catalyst/src/main/java/org/apache/spark/sql/catalyst/expressions/ExpressionImplUtils.java:
##
@@ -35,6 +35,35 @@ public class ExpressionImplUtils {
   private static final int GCM_IV_LEN = 12;
   private static final int GCM_TAG_LEN = 128;
 
+  /**
+   * Function to check if a given number string is a valid Luhn number
+   * @param numberString
+   *  the number string to check
+   * @return
+   *  true if the number string is a valid Luhn number, false otherwise.
+   */
+  public static boolean isLuhnNumber(UTF8String numberString) {
+String digits = numberString.toString();
+// Check if all characters in the input string are digits.
+if (!digits.trim().isEmpty() && 
digits.chars().allMatch(Character::isDigit)) {

Review Comment:
   `trim` is not required; removed it now.



##
sql/core/src/test/resources/sql-tests/inputs/string-functions.sql:
##
@@ -231,3 +231,39 @@ CREATE TEMPORARY VIEW fmtTable(fmtField) AS SELECT * FROM 
VALUES ('invalidFormat
 SELECT to_binary('abc', fmtField) FROM fmtTable;
 -- Clean up
 DROP VIEW IF EXISTS fmtTable;
+-- luhn_check
+-- basic cases
+select luhn_check('4111');
+select luhn_check('5504');
+select luhn_check('349');
+select luhn_check('60110004');
+select luhn_check('378282246310005');
+select luhn_check('6011000990139424');
+
+-- spaces in the beginning/middle/end
+select luhn_check('4111');
+select luhn_check('411 1');
+select luhn_check(' 4111');
+-- space
+select luhn_check('');
+select luhn_check('  ');
+-- non-digits
+select luhn_check('510B105105105106');
+select luhn_check('ABCDED');
+-- null
+select luhn_check(null);
+-- non string (test implicit cast)
+select luhn_check(6017);
+select luhn_check(6018);
+select luhn_check(123.456);
+
+select luhn_check('0');

Review Comment:
   Done



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] vinodkc commented on a diff in pull request #39747: [SPARK-42191][SQL] Support udf 'luhn_check'

2023-01-30 Thread via GitHub


vinodkc commented on code in PR #39747:
URL: https://github.com/apache/spark/pull/39747#discussion_r1091445199


##
sql/catalyst/src/main/java/org/apache/spark/sql/catalyst/expressions/ExpressionImplUtils.java:
##
@@ -35,6 +35,39 @@ public class ExpressionImplUtils {
   private static final int GCM_IV_LEN = 12;
   private static final int GCM_TAG_LEN = 128;
 
+  /**
+   * Function to check if a given number string is a valid Luhn number
+   * @param numberString
+   *  the number string to check
+   * @return
+   *  true if the number string is a valid Luhn number, false otherwise.
+   */
+  public static boolean isLuhnNumber(UTF8String numberString) {
+String digits = numberString.toString();
+// Check if all characters in the input string are digits.
+java.util.function.Predicate isDigit = c -> 
Character.isDigit(c);
+java.util.function.Predicate isNotWhitespace = c -> 
!Character.isWhitespace(c);
+if (!digits.isEmpty() &&
+digits.chars().allMatch(
+c -> Character.isDigit(c) && 
!Character.isWhitespace((char) c))) {

Review Comment:
   That change was added by mistake; I've fixed the condition now.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] wayneguow opened a new pull request, #39819: [SPARK-42252][Core] Deprecate spark.shuffle.unsafe.file.output.buffer and add a new config

2023-01-30 Thread via GitHub


wayneguow opened a new pull request, #39819:
URL: https://github.com/apache/spark/pull/39819

   
   
   ### What changes were proposed in this pull request?
   Deprecate spark.shuffle.unsafe.file.output.buffer and add a new config 
spark.shuffle.localDisk.file.output.buffer instead.
   
   
   ### Why are the changes needed?
   The old config was designed to be used in UnsafeShuffleWriter, but it is now used by all local shuffle writers through LocalDiskShuffleMapOutputWriter.
   
   
   ### Does this PR introduce _any_ user-facing change?
   The old config still works, but users are advised to use the new one.
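
   A minimal sketch of how a job could opt in to the proposed key (the key name comes from this PR; the value and session setup here are only illustrative):

   ```python
   from pyspark.sql import SparkSession

   # Hypothetical example: the deprecated key keeps working, but new jobs would
   # set the proposed key instead ("64k" is an arbitrary example value).
   spark = (
       SparkSession.builder
       .appName("shuffle-output-buffer-example")
       .config("spark.shuffle.localDisk.file.output.buffer", "64k")
       .getOrCreate()
   )
   ```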
   
   ### How was this patch tested?
   Passed existing tests.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] srowen commented on a diff in pull request #39747: [SPARK-42191][SQL] Support udf 'luhn_check'

2023-01-30 Thread via GitHub


srowen commented on code in PR #39747:
URL: https://github.com/apache/spark/pull/39747#discussion_r1091441853


##
sql/catalyst/src/main/java/org/apache/spark/sql/catalyst/expressions/ExpressionImplUtils.java:
##
@@ -35,6 +35,39 @@ public class ExpressionImplUtils {
   private static final int GCM_IV_LEN = 12;
   private static final int GCM_TAG_LEN = 128;
 
+  /**
+   * Function to check if a given number string is a valid Luhn number
+   * @param numberString
+   *  the number string to check
+   * @return
+   *  true if the number string is a valid Luhn number, false otherwise.
+   */
+  public static boolean isLuhnNumber(UTF8String numberString) {
+String digits = numberString.toString();
+// Check if all characters in the input string are digits.
+java.util.function.Predicate isDigit = c -> 
Character.isDigit(c);
+java.util.function.Predicate isNotWhitespace = c -> 
!Character.isWhitespace(c);
+if (!digits.isEmpty() &&
+digits.chars().allMatch(
+c -> Character.isDigit(c) && 
!Character.isWhitespace((char) c))) {

Review Comment:
   (It does not)
   
   BTW, I think using `allMatch` is going to add non-trivial overhead. You're already iterating over the chars; just check in the loop below and short-circuit if something isn't a digit.
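
   A rough sketch of that single-pass idea (shown in Python only to illustrate the standard Luhn doubling-from-the-right logic, not the Java in this file):

   ```python
   def is_luhn_number(s: str) -> bool:
       """Validate and accumulate the Luhn checksum in one pass,
       short-circuiting on the first non-digit character."""
       if not s:
           return False
       total = 0
       double = False  # every second digit from the right is doubled
       for ch in reversed(s):
           if ch < "0" or ch > "9":
               return False  # short-circuit instead of a separate allMatch pass
           d = ord(ch) - ord("0")
           if double:
               d *= 2
               if d > 9:
                   d -= 9
           total += d
           double = not double
       return total % 10 == 0

   # e.g. is_luhn_number("378282246310005") -> True
   #      is_luhn_number("378282246310006") -> False
   ```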



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] anchovYu commented on a diff in pull request #39508: [SPARK-41985][SQL] Centralize more column resolution rules

2023-01-30 Thread via GitHub


anchovYu commented on code in PR #39508:
URL: https://github.com/apache/spark/pull/39508#discussion_r1091428757


##
sql/core/src/test/resources/sql-tests/inputs/column-resolution-aggregate.sql:
##
@@ -0,0 +1,30 @@
+-- Tests covering column resolution priority in Aggregate.
+
+CREATE TEMPORARY VIEW v1 AS VALUES (1, 1, 1), (2, 2, 1) AS t(a, b, k);
+CREATE TEMPORARY VIEW v2 AS VALUES (1, 1, 1), (2, 2, 1) AS t(x, y, all);
+
+-- Relation output columns have higher priority than lateral column alias. 
This query
+-- should fail as `b` is not in GROUP BY.
+SELECT max(a) AS b, b FROM v1 GROUP BY k;
+
+-- Lateral column alias has higher priority than outer reference.
+SELECT a FROM v1 WHERE (12, 13) IN (SELECT max(x + 10) AS a, a + 1 FROM v2);
+
+-- Relation output columns have higher priority than GROUP BY alias. This 
query should
+-- fail as `a` is not in GROUP BY.
+SELECT a AS k FROM v1 GROUP BY k;
+
+-- Relation output columns have higher priority than GROUP BY ALL. This query 
should
+-- fail as `x` is not in GROUP BY.
+SELECT x FROM v2 GROUP BY all;
+
+-- GROUP BY alias has higher priority than GROUP BY ALL, this query fails as 
`b` is not in GROUP BY.
+SELECT a AS all, b FROM v1 GROUP BY all;
+
+-- GROUP BY alias/ALL does not support lateral column alias.
+SELECT k AS lca, lca + 1 AS col FROM v1 GROUP BY k, col;
+SELECT k AS lca, lca + 1 AS col FROM v1 GROUP BY all;

Review Comment:
   Could you add one more test for the case I misunderstood from the error message:
   ```
   SELECT k AS lca, lca + 1 AS col FROM v1 GROUP BY lca;
   ```
   A query with an LCA in the SELECT list, where the GROUP BY alias does not contain the LCA, should still resolve.



##
sql/core/src/test/resources/sql-tests/inputs/column-resolution-sort.sql:
##
@@ -0,0 +1,20 @@
+--SET spark.sql.leafNodeDefaultParallelism=1
+-- Tests covering column resolution priority in Sort.
+
+CREATE TEMPORARY VIEW v1 AS VALUES (1, 2, 2), (2, 1, 1) AS t(a, b, k);
+CREATE TEMPORARY VIEW v2 AS VALUES (1, 2, 2), (2, 1, 1) AS t(a, b, all);
+
+-- Relation output columns have higher priority than missing reference.
+-- Results will be [2, 1] if we order by the column `v1.b`.

Review Comment:
   The test is good; just the comment needs correction.



##
sql/core/src/test/resources/sql-tests/inputs/column-resolution-sort.sql:
##
@@ -0,0 +1,20 @@
+--SET spark.sql.leafNodeDefaultParallelism=1
+-- Tests covering column resolution priority in Sort.
+
+CREATE TEMPORARY VIEW v1 AS VALUES (1, 2, 2), (2, 1, 1) AS t(a, b, k);
+CREATE TEMPORARY VIEW v2 AS VALUES (1, 2, 2), (2, 1, 1) AS t(a, b, all);
+
+-- Relation output columns have higher priority than missing reference.
+-- Results will be [2, 1] if we order by the column `v1.b`.

Review Comment:
   Missing-reference resolution won't cover this case; it should only add grouping expressions to the Aggregate. `v1.b` is not a grouping expression, so ordering by `v1.b` should throw an exception.



##
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/ResolveReferencesInSort.scala:
##
@@ -0,0 +1,81 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.spark.sql.catalyst.analysis
+
+import org.apache.spark.sql.catalyst.SQLConfHelper
+import org.apache.spark.sql.catalyst.expressions.SortOrder
+import org.apache.spark.sql.catalyst.plans.logical.{LogicalPlan, Project, Sort}
+
+/**
+ * A virtual rule to resolve [[UnresolvedAttribute]] in [[Sort]]. It's only 
used by the real
+ * rule `ResolveReferences`. The column resolution order for [[Sort]] is:
+ * 1. Resolves the column to [[AttributeReference]] with the output of the 
child plan. This
+ *includes metadata columns as well.
+ * 2. Resolves the column to a literal function which is allowed to be invoked 
without braces, e.g.
+ *`SELECT col, current_date FROM t`.
+ * 3. If the child plan is Aggregate, resolves the column to 
[[TempResolvedColumn]] with the output
+ *of Aggregate's child plan. This is to allow Sort to host grouping 
expressions and aggregate
+ *functions, which can be pushed down to the Aggregate later. For example,
+ *`SELECT max(a) FROM t GROUP BY b ORDER BY min(a)`.
+ * 4. Resolves the column to [[AttributeReference]

[GitHub] [spark] zhengruifeng opened a new pull request, #39818: [SPARK-42023][SPARK-42024][CONNECT][PYTHON] Make `createDataFrame` support `AtomicType -> StringType` coercion

2023-01-30 Thread via GitHub


zhengruifeng opened a new pull request, #39818:
URL: https://github.com/apache/spark/pull/39818

   ### What changes were proposed in this pull request?
   Make `createDataFrame` support `AtomicType -> StringType` coercion
   
   
   ### Why are the changes needed?
   To be consistent with PySpark, where this feature was added in https://github.com/apache/spark/commit/51b04406028e14fbe1986f6a3ffc67853bd82935
   
   
   ### Does this PR introduce _any_ user-facing change?
   no
   
   
   ### How was this patch tested?
   Added new unit tests and enabled existing ones.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] cloud-fan commented on a diff in pull request #39747: [SPARK-42191][SQL] Support udf 'luhn_check'

2023-01-30 Thread via GitHub


cloud-fan commented on code in PR #39747:
URL: https://github.com/apache/spark/pull/39747#discussion_r1091427571


##
sql/catalyst/src/main/java/org/apache/spark/sql/catalyst/expressions/ExpressionImplUtils.java:
##
@@ -35,6 +35,39 @@ public class ExpressionImplUtils {
   private static final int GCM_IV_LEN = 12;
   private static final int GCM_TAG_LEN = 128;
 
+  /**
+   * Function to check if a given number string is a valid Luhn number
+   * @param numberString
+   *  the number string to check
+   * @return
+   *  true if the number string is a valid Luhn number, false otherwise.
+   */
+  public static boolean isLuhnNumber(UTF8String numberString) {
+String digits = numberString.toString();
+// Check if all characters in the input string are digits.
+java.util.function.Predicate isDigit = c -> 
Character.isDigit(c);
+java.util.function.Predicate isNotWhitespace = c -> 
!Character.isWhitespace(c);
+if (!digits.isEmpty() &&
+digits.chars().allMatch(
+c -> Character.isDigit(c) && 
!Character.isWhitespace((char) c))) {

Review Comment:
   I'm confused. `Character.isDigit(c)` returns true for spaces?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] cloud-fan closed pull request #38760: [SPARK-41219][SQL] IntegralDivide use decimal(1, 0) to represent 0

2023-01-30 Thread via GitHub


cloud-fan closed pull request #38760: [SPARK-41219][SQL] IntegralDivide use 
decimal(1, 0) to represent 0
URL: https://github.com/apache/spark/pull/38760


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] cloud-fan commented on pull request #38760: [SPARK-41219][SQL] IntegralDivide use decimal(1, 0) to represent 0

2023-01-30 Thread via GitHub


cloud-fan commented on PR #38760:
URL: https://github.com/apache/spark/pull/38760#issuecomment-1409716946

   thanks, merging to master/3.4!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] HyukjinKwon commented on pull request #39817: [SPARK-42250][PYTHON][ML] `predict_batch_udf` with float fails when the batch size consists of single value

2023-01-30 Thread via GitHub


HyukjinKwon commented on PR #39817:
URL: https://github.com/apache/spark/pull/39817#issuecomment-1409712256

   cc @leewyang, @mengxr and @WeichenXu123.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] LuciferYang commented on pull request #39810: [SPARK-42241][CONNECT][TESTS] Fix the find connect jar condition in `SparkConnectServerUtils#findSparkConnectJar` for maven

2023-01-30 Thread via GitHub


LuciferYang commented on PR #39810:
URL: https://github.com/apache/spark/pull/39810#issuecomment-1409712130

   Thanks @HyukjinKwon
   
   I hope https://github.com/apache/spark/pull/39789 can be merged first; it seems to be blocking the version release.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] HyukjinKwon opened a new pull request, #39817: [SPARK-42250][PYTHON][ML] `predict_batch_udf` with float fails when the batch size consists of single value

2023-01-30 Thread via GitHub


HyukjinKwon opened a new pull request, #39817:
URL: https://github.com/apache/spark/pull/39817

   ### What changes were proposed in this pull request?
   
   This PR is sort of a follow-up of https://github.com/apache/spark/pull/37734, which handles the case when the batch is a single scalar value.
   
   Essentially it proposes to work around the pandas behaviour by explicitly casting back to the original data type of the NumPy array:
   
   ```python
   >>> import numpy as np
   >>> import pandas as pd
   >>> np.squeeze(np.array([1.])).dtype
   dtype('float64')
   >>> pd.Series(np.squeeze(np.array([1.]))).dtype
   dtype('O')
   >>> pd.Series(np.squeeze(np.array([1., 1.]))).dtype
   dtype('float64')
   ```
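
   A minimal sketch of that workaround (hypothetical helper, not the exact code in this PR): keep the squeezed batch at least one-dimensional and cast the resulting Series back to the array's own dtype:

   ```python
   import numpy as np
   import pandas as pd

   def batch_to_series(arr: np.ndarray) -> pd.Series:
       # np.squeeze() on a single-element batch yields a 0-d array, which pandas
       # would otherwise wrap as a single object value (dtype 'O'). Keeping the
       # result at least 1-d and casting back to the original dtype preserves
       # the numeric type that Arrow expects.
       squeezed = np.atleast_1d(np.squeeze(arr))
       return pd.Series(squeezed).astype(arr.dtype)

   print(batch_to_series(np.array([1.0], dtype=np.float32)).dtype)       # float32
   print(batch_to_series(np.array([1.0, 2.0], dtype=np.float32)).dtype)  # float32
   ```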
   
   
   ### Why are the changes needed?
   
   Using `predict_batch_udf` fails when a batch happens to contain a single value. For example, even when the batch size is set to 10, if the data has 21 rows, it fails because the last batch consists of a single value.
   
   ```python
   import numpy as np
   import pandas as pd
   from pyspark.ml.functions import predict_batch_udf
   from pyspark.sql.types import ArrayType, FloatType, StructType, StructField
   from typing import Mapping
   
   df = spark.createDataFrame([[[0.0, 1.0, 2.0, 3.0], [0.0, 1.0, 2.0]]], 
schema=["t1", "t2"])
   
   def make_multi_sum_fn():
   def predict(x1: np.ndarray, x2: np.ndarray) -> np.ndarray:
   return np.sum(x1, axis=1) + np.sum(x2, axis=1)
   return predict
   
   multi_sum_udf = predict_batch_udf(
   make_multi_sum_fn,
   return_type=FloatType(),
   batch_size=1,
   input_tensor_shapes=[[4], [3]],
   )
   
   df.select(multi_sum_udf("t1", "t2")).collect()
   ```
   
   **Before**
   
   ```
File "/.../spark/python/lib/pyspark.zip/pyspark/worker.py", line 829, in 
main
   process()
 File "/.../spark/python/lib/pyspark.zip/pyspark/worker.py", line 821, in 
process
   serializer.dump_stream(out_iter, outfile)
 File 
"/.../spark/python/lib/pyspark.zip/pyspark/sql/pandas/serializers.py", line 
345, in dump_stream
   return ArrowStreamSerializer.dump_stream(self, 
init_stream_yield_batches(), stream)
 File 
"/.../spark/python/lib/pyspark.zip/pyspark/sql/pandas/serializers.py", line 86, 
in dump_stream
   for batch in iterator:
 File 
"/.../spark/python/lib/pyspark.zip/pyspark/sql/pandas/serializers.py", line 
339, in init_stream_yield_batches
   batch = self._create_batch(series)
 File 
"/.../spark/python/lib/pyspark.zip/pyspark/sql/pandas/serializers.py", line 
275, in _create_batch
   arrs.append(create_array(s, t))
 File 
"/.../spark/python/lib/pyspark.zip/pyspark/sql/pandas/serializers.py", line 
245, in create_array
   raise e
 File 
"/.../spark/python/lib/pyspark.zip/pyspark/sql/pandas/serializers.py", line 
233, in create_array
   array = pa.Array.from_pandas(s, mask=mask, type=t, safe=self._safecheck)
 File "pyarrow/array.pxi", line 1044, in pyarrow.lib.Array.from_pandas
 File "pyarrow/array.pxi", line 316, in pyarrow.lib.array
 File "pyarrow/array.pxi", line 83, in pyarrow.lib._ndarray_to_array
 File "pyarrow/error.pxi", line 100, in pyarrow.lib.check_status
   pyarrow.lib.ArrowInvalid: Could not convert array(569.) with type 
numpy.ndarray: tried to convert to float32
   ```
   
   **After**
   
   ```
   [Row(predict(t1, t2)=9.0)]
   ```
   
   ### Does this PR introduce _any_ user-facing change?
   
   This feature has not been released yet, so there is no user-facing change for end users.
   It fixes a bug in the unreleased feature.
   
   ### How was this patch tested?
   
   A unit test was added.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] LuciferYang commented on pull request #39789: [SPARK-42228][BUILD][CONNECT] Add shade and relocation rule of grpc to connect-client-jvm module

2023-01-30 Thread via GitHub


LuciferYang commented on PR #39789:
URL: https://github.com/apache/spark/pull/39789#issuecomment-1409706471

   > The reason to use the server shaded jar is because that's the only way to 
start spark with spark connect.
   
   Why is this the only way? Is it because of the conflicting package names between the `sql` module and the `connect-client-jvm` module?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] viirya commented on pull request #38760: [SPARK-41219][SQL] IntegralDivide use decimal(1, 0) to represent 0

2023-01-30 Thread via GitHub


viirya commented on PR #38760:
URL: https://github.com/apache/spark/pull/38760#issuecomment-1409692951

   > I think we need it, but it's a breaking change, so how about creating a new PR for master
   
   Yea, let's do it in master branch. Thanks.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] ulysses-you commented on pull request #38760: [SPARK-41219][SQL] IntegralDivide use decimal(1, 0) to represent 0

2023-01-30 Thread via GitHub


ulysses-you commented on PR #38760:
URL: https://github.com/apache/spark/pull/38760#issuecomment-1409665435

   @viirya I think we need it, but it's a breaking change, so how about creating a new PR for master?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] cloud-fan commented on pull request #38760: [SPARK-41219][SQL] IntegralDivide use decimal(1, 0) to represent 0

2023-01-30 Thread via GitHub


cloud-fan commented on PR #38760:
URL: https://github.com/apache/spark/pull/38760#issuecomment-1409665363

   > Do we need to add some assert in DecimalType to prevent zero precision?
   
   I think we should, but let's do it in the master branch only to officially 
ban 0 decimal precision.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] beliefer commented on a diff in pull request #39667: [SPARK-42131][SQL] Extract the function that construct the select statement for JDBC dialect.

2023-01-30 Thread via GitHub


beliefer commented on code in PR #39667:
URL: https://github.com/apache/spark/pull/39667#discussion_r1091369388


##
sql/core/src/main/scala/org/apache/spark/sql/jdbc/JdbcDialects.scala:
##
@@ -551,6 +551,34 @@ abstract class JdbcDialect extends Serializable with 
Logging {
 if (offset > 0 ) s"OFFSET $offset" else ""
   }
 
+  /**
+   * returns the SQL text for the SELECT statement.
+   *
+   * @param options - JDBC options that contains url, table and other 
information.
+   * @param columnList - The names of the columns or aggregate columns to 
SELECT.
+   * @param tableSampleClause - The table sample clause for the SELECT 
statement.
+   * @param whereClause - The WHERE clause for the SELECT statement.
+   * @param groupByClause - The group by clause for the SELECT statement.
+   * @param orderByClause - The order by clause for the SELECT statement.
+   * @param limitClause - The LIMIT clause for the SELECT statement.
+   * @param offsetClause - The OFFSET clause for the SELECT statement.

Review Comment:
   `JdbcSQLBuilder` has been created.
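
   For intuition, a minimal sketch of what such a builder ultimately assembles (hypothetical Python helper; the real `JdbcSQLBuilder` lives in the Scala dialect code):

   ```python
   def build_select(table, column_list, table_sample="", where="", group_by="",
                    order_by="", limit="", offset=""):
       # Join the non-empty clauses in SQL order; dialects could override pieces.
       clauses = [f"SELECT {column_list} FROM {table}", table_sample, where,
                  group_by, order_by, limit, offset]
       return " ".join(c for c in clauses if c)

   # build_select("t", "a, MAX(b)", where="WHERE a > 1", group_by="GROUP BY a",
   #              limit="LIMIT 10")
   # -> 'SELECT a, MAX(b) FROM t WHERE a > 1 GROUP BY a LIMIT 10'
   ```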



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] cloud-fan closed pull request #39797: [SPARK-42231][SQL] Turn `MISSING_STATIC_PARTITION_COLUMN` into `internalError`

2023-01-30 Thread via GitHub


cloud-fan closed pull request #39797: [SPARK-42231][SQL] Turn 
`MISSING_STATIC_PARTITION_COLUMN` into `internalError`
URL: https://github.com/apache/spark/pull/39797


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] cloud-fan commented on pull request #39797: [SPARK-42231][SQL] Turn `MISSING_STATIC_PARTITION_COLUMN` into `internalError`

2023-01-30 Thread via GitHub


cloud-fan commented on PR #39797:
URL: https://github.com/apache/spark/pull/39797#issuecomment-1409651159

   thanks, merging to master/3.4!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] ulysses-you commented on pull request #38760: [SPARK-41219][SQL] IntegralDivide use decimal(1, 0) to represent 0

2023-01-30 Thread via GitHub


ulysses-you commented on PR #38760:
URL: https://github.com/apache/spark/pull/38760#issuecomment-1409650160

   @cloud-fan updated


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] cloud-fan commented on a diff in pull request #37525: [SPARK-40086][SPARK-42049][SQL] Improve AliasAwareOutputPartitioning and AliasAwareQueryOutputOrdering to take all aliases into a

2023-01-30 Thread via GitHub


cloud-fan commented on code in PR #37525:
URL: https://github.com/apache/spark/pull/37525#discussion_r1091361589


##
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/AliasAwareOutputExpression.scala:
##
@@ -0,0 +1,116 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.catalyst.plans
+
+import scala.collection.mutable
+
+import org.apache.spark.sql.catalyst.SQLConfHelper
+import org.apache.spark.sql.catalyst.expressions.{Alias, Attribute, 
AttributeSet, Empty2Null, Expression, NamedExpression, SortOrder}
+import org.apache.spark.sql.internal.SQLConf
+
+/**
+ * A trait that provides functionality to handle aliases in the 
`outputExpressions`.
+ */
+trait AliasAwareOutputExpression extends SQLConfHelper {
+  protected val aliasCandidateLimit = 
conf.getConf(SQLConf.EXPRESSION_PROJECTION_CANDIDATE_LIMIT)
+  protected def outputExpressions: Seq[NamedExpression]
+  /**
+   * This method can be used to strip expression which does not affect the 
result, for example:
+   * strip the expression which is ordering agnostic for output ordering.
+   */
+  protected def strip(expr: Expression): Expression = expr
+
+  // Build an `Expression` -> `Attribute` alias map.
+  // There can be multiple alias defined for the same expressions but it 
doesn't make sense to store
+  // more than `aliasCandidateLimit` attributes for an expression. In those 
cases the old logic
+  // handled only the last alias so we need to make sure that we give 
precedence to that.
+  // If the `outputExpressions` contain simple attributes we need to add those 
too to the map.
+  private lazy val aliasMap = {
+val aliases = mutable.Map[Expression, mutable.ArrayBuffer[Attribute]]()
+outputExpressions.reverse.foreach {
+  case a @ Alias(child, _) =>
+val buffer = aliases.getOrElseUpdate(strip(child.canonicalized), 
mutable.ArrayBuffer.empty)
+if (buffer.size < aliasCandidateLimit) {
+  buffer += a.toAttribute
+}
+  case _ =>
+}
+outputExpressions.foreach {
+  case a: Attribute if aliases.contains(a.canonicalized) =>
+val buffer = aliases(a.canonicalized)
+if (buffer.size < aliasCandidateLimit) {
+  buffer += a
+}
+  case _ =>
+}
+aliases
+  }
+
+  protected def hasAlias: Boolean = aliasMap.nonEmpty
+
+  /**
+   * Return a stream of expressions in which the original expression is 
projected with `aliasMap`.
+   */
+  protected def projectExpression(expr: Expression): Stream[Expression] = {
+val outputSet = AttributeSet(outputExpressions.map(_.toAttribute))
+expr.multiTransformDown {
+  // Mapping with aliases
+  case e: Expression if aliasMap.contains(e.canonicalized) =>
+aliasMap(e.canonicalized).toSeq ++ (if (e.containsChild.nonEmpty) 
Seq(e) else Seq.empty)
+
+  // Prune if we encounter an attribute that we can't map and it is not in 
output set.
+  // This prune will go up to the closest `multiTransformDown()` call and 
returns `Stream.empty`
+  // there.
+  case a: Attribute if !outputSet.contains(a) => Seq.empty
+}
+  }
+}
+
+/**
+ * A trait that handles aliases in the `orderingExpressions` to produce 
`outputOrdering` that
+ * satisfies ordering requirements.
+ */
+trait AliasAwareQueryOutputOrdering[T <: QueryPlan[T]]
+  extends AliasAwareOutputExpression { self: QueryPlan[T] =>
+  protected def orderingExpressions: Seq[SortOrder]
+
+  override protected def strip(expr: Expression): Expression = expr match {
+case e: Empty2Null => strip(e.child)
+case _ => expr
+  }
+
+  override final def outputOrdering: Seq[SortOrder] = {
+if (hasAlias) {
+  orderingExpressions.flatMap { sortOrder =>
+val orderingSet = mutable.Set.empty[Expression]
+val sameOrderings = sortOrder.children.toStream
+  .flatMap(projectExpression)
+  .filter(e => orderingSet.add(e.canonicalized))
+  .take(aliasCandidateLimit)
+if (sameOrderings.nonEmpty) {
+  Some(sortOrder.copy(child = sameOrderings.head,
+sameOrderExpressions = sameOrderings.tail))
+} else {
+  None
+}

Review Comment:
   let's also add te

[GitHub] [spark] cloud-fan commented on a diff in pull request #37525: [SPARK-40086][SPARK-42049][SQL] Improve AliasAwareOutputPartitioning and AliasAwareQueryOutputOrdering to take all aliases into a

2023-01-30 Thread via GitHub


cloud-fan commented on code in PR #37525:
URL: https://github.com/apache/spark/pull/37525#discussion_r1091361465


##
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/AliasAwareOutputExpression.scala:
##
@@ -0,0 +1,116 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.catalyst.plans
+
+import scala.collection.mutable
+
+import org.apache.spark.sql.catalyst.SQLConfHelper
+import org.apache.spark.sql.catalyst.expressions.{Alias, Attribute, 
AttributeSet, Empty2Null, Expression, NamedExpression, SortOrder}
+import org.apache.spark.sql.internal.SQLConf
+
+/**
+ * A trait that provides functionality to handle aliases in the 
`outputExpressions`.
+ */
+trait AliasAwareOutputExpression extends SQLConfHelper {
+  protected val aliasCandidateLimit = 
conf.getConf(SQLConf.EXPRESSION_PROJECTION_CANDIDATE_LIMIT)
+  protected def outputExpressions: Seq[NamedExpression]
+  /**
+   * This method can be used to strip expression which does not affect the 
result, for example:
+   * strip the expression which is ordering agnostic for output ordering.
+   */
+  protected def strip(expr: Expression): Expression = expr
+
+  // Build an `Expression` -> `Attribute` alias map.
+  // There can be multiple alias defined for the same expressions but it 
doesn't make sense to store
+  // more than `aliasCandidateLimit` attributes for an expression. In those 
cases the old logic
+  // handled only the last alias so we need to make sure that we give 
precedence to that.
+  // If the `outputExpressions` contain simple attributes we need to add those 
too to the map.
+  private lazy val aliasMap = {
+val aliases = mutable.Map[Expression, mutable.ArrayBuffer[Attribute]]()
+outputExpressions.reverse.foreach {
+  case a @ Alias(child, _) =>
+val buffer = aliases.getOrElseUpdate(strip(child.canonicalized), 
mutable.ArrayBuffer.empty)
+if (buffer.size < aliasCandidateLimit) {
+  buffer += a.toAttribute
+}
+  case _ =>
+}
+outputExpressions.foreach {
+  case a: Attribute if aliases.contains(a.canonicalized) =>
+val buffer = aliases(a.canonicalized)
+if (buffer.size < aliasCandidateLimit) {
+  buffer += a
+}
+  case _ =>
+}
+aliases
+  }
+
+  protected def hasAlias: Boolean = aliasMap.nonEmpty
+
+  /**
+   * Return a stream of expressions in which the original expression is 
projected with `aliasMap`.
+   */
+  protected def projectExpression(expr: Expression): Stream[Expression] = {
+val outputSet = AttributeSet(outputExpressions.map(_.toAttribute))
+expr.multiTransformDown {
+  // Mapping with aliases
+  case e: Expression if aliasMap.contains(e.canonicalized) =>
+aliasMap(e.canonicalized).toSeq ++ (if (e.containsChild.nonEmpty) 
Seq(e) else Seq.empty)
+
+  // Prune if we encounter an attribute that we can't map and it is not in 
output set.
+  // This prune will go up to the closest `multiTransformDown()` call and 
returns `Stream.empty`
+  // there.
+  case a: Attribute if !outputSet.contains(a) => Seq.empty
+}
+  }
+}
+
+/**
+ * A trait that handles aliases in the `orderingExpressions` to produce 
`outputOrdering` that
+ * satisfies ordering requirements.
+ */
+trait AliasAwareQueryOutputOrdering[T <: QueryPlan[T]]
+  extends AliasAwareOutputExpression { self: QueryPlan[T] =>
+  protected def orderingExpressions: Seq[SortOrder]
+
+  override protected def strip(expr: Expression): Expression = expr match {
+case e: Empty2Null => strip(e.child)
+case _ => expr
+  }
+
+  override final def outputOrdering: Seq[SortOrder] = {
+if (hasAlias) {
+  orderingExpressions.flatMap { sortOrder =>
+val orderingSet = mutable.Set.empty[Expression]
+val sameOrderings = sortOrder.children.toStream
+  .flatMap(projectExpression)
+  .filter(e => orderingSet.add(e.canonicalized))
+  .take(aliasCandidateLimit)
+if (sameOrderings.nonEmpty) {
+  Some(sortOrder.copy(child = sameOrderings.head,
+sameOrderExpressions = sameOrderings.tail))
+} else {
+  None
+}

Review Comment:
   Good point! Parti

[GitHub] [spark] cloud-fan commented on a diff in pull request #37525: [SPARK-40086][SPARK-42049][SQL] Improve AliasAwareOutputPartitioning and AliasAwareQueryOutputOrdering to take all aliases into a

2023-01-30 Thread via GitHub


cloud-fan commented on code in PR #37525:
URL: https://github.com/apache/spark/pull/37525#discussion_r1091360793


##
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/AliasAwareOutputExpression.scala:
##
@@ -0,0 +1,116 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.catalyst.plans
+
+import scala.collection.mutable
+
+import org.apache.spark.sql.catalyst.SQLConfHelper
+import org.apache.spark.sql.catalyst.expressions.{Alias, Attribute, 
AttributeSet, Empty2Null, Expression, NamedExpression, SortOrder}
+import org.apache.spark.sql.internal.SQLConf
+
+/**
+ * A trait that provides functionality to handle aliases in the 
`outputExpressions`.
+ */
+trait AliasAwareOutputExpression extends SQLConfHelper {
+  protected val aliasCandidateLimit = 
conf.getConf(SQLConf.EXPRESSION_PROJECTION_CANDIDATE_LIMIT)
+  protected def outputExpressions: Seq[NamedExpression]
+  /**
+   * This method can be used to strip expression which does not affect the 
result, for example:
+   * strip the expression which is ordering agnostic for output ordering.
+   */
+  protected def strip(expr: Expression): Expression = expr
+
+  // Build an `Expression` -> `Attribute` alias map.
+  // There can be multiple alias defined for the same expressions but it 
doesn't make sense to store
+  // more than `aliasCandidateLimit` attributes for an expression. In those 
cases the old logic
+  // handled only the last alias so we need to make sure that we give 
precedence to that.
+  // If the `outputExpressions` contain simple attributes we need to add those 
too to the map.
+  private lazy val aliasMap = {
+val aliases = mutable.Map[Expression, mutable.ArrayBuffer[Attribute]]()
+outputExpressions.reverse.foreach {
+  case a @ Alias(child, _) =>
+val buffer = aliases.getOrElseUpdate(strip(child.canonicalized), 
mutable.ArrayBuffer.empty)

Review Comment:
   +1



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] ulysses-you commented on a diff in pull request #37525: [SPARK-40086][SPARK-42049][SQL] Improve AliasAwareOutputPartitioning and AliasAwareQueryOutputOrdering to take all aliases into

2023-01-30 Thread via GitHub


ulysses-you commented on code in PR #37525:
URL: https://github.com/apache/spark/pull/37525#discussion_r1091358506


##
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/AliasAwareOutputExpression.scala:
##
@@ -0,0 +1,116 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.catalyst.plans
+
+import scala.collection.mutable
+
+import org.apache.spark.sql.catalyst.SQLConfHelper
+import org.apache.spark.sql.catalyst.expressions.{Alias, Attribute, 
AttributeSet, Empty2Null, Expression, NamedExpression, SortOrder}
+import org.apache.spark.sql.internal.SQLConf
+
+/**
+ * A trait that provides functionality to handle aliases in the 
`outputExpressions`.
+ */
+trait AliasAwareOutputExpression extends SQLConfHelper {
+  protected val aliasCandidateLimit = 
conf.getConf(SQLConf.EXPRESSION_PROJECTION_CANDIDATE_LIMIT)
+  protected def outputExpressions: Seq[NamedExpression]
+  /**
+   * This method can be used to strip expression which does not affect the 
result, for example:
+   * strip the expression which is ordering agnostic for output ordering.
+   */
+  protected def strip(expr: Expression): Expression = expr
+
+  // Build an `Expression` -> `Attribute` alias map.
+  // There can be multiple alias defined for the same expressions but it 
doesn't make sense to store
+  // more than `aliasCandidateLimit` attributes for an expression. In those 
cases the old logic
+  // handled only the last alias so we need to make sure that we give 
precedence to that.
+  // If the `outputExpressions` contain simple attributes we need to add those 
too to the map.
+  private lazy val aliasMap = {
+val aliases = mutable.Map[Expression, mutable.ArrayBuffer[Attribute]]()
+outputExpressions.reverse.foreach {
+  case a @ Alias(child, _) =>
+val buffer = aliases.getOrElseUpdate(strip(child.canonicalized), 
mutable.ArrayBuffer.empty)
+if (buffer.size < aliasCandidateLimit) {
+  buffer += a.toAttribute
+}
+  case _ =>
+}
+outputExpressions.foreach {
+  case a: Attribute if aliases.contains(a.canonicalized) =>
+val buffer = aliases(a.canonicalized)
+if (buffer.size < aliasCandidateLimit) {
+  buffer += a
+}
+  case _ =>
+}
+aliases
+  }
+
+  protected def hasAlias: Boolean = aliasMap.nonEmpty
+
+  /**
+   * Return a stream of expressions in which the original expression is 
projected with `aliasMap`.
+   */
+  protected def projectExpression(expr: Expression): Stream[Expression] = {
+val outputSet = AttributeSet(outputExpressions.map(_.toAttribute))
+expr.multiTransformDown {
+  // Mapping with aliases
+  case e: Expression if aliasMap.contains(e.canonicalized) =>
+aliasMap(e.canonicalized).toSeq ++ (if (e.containsChild.nonEmpty) 
Seq(e) else Seq.empty)
+
+  // Prune if we encounter an attribute that we can't map and it is not in 
output set.
+  // This prune will go up to the closest `multiTransformDown()` call and 
returns `Stream.empty`
+  // there.
+  case a: Attribute if !outputSet.contains(a) => Seq.empty
+}
+  }
+}
+
+/**
+ * A trait that handles aliases in the `orderingExpressions` to produce 
`outputOrdering` that
+ * satisfies ordering requirements.
+ */
+trait AliasAwareQueryOutputOrdering[T <: QueryPlan[T]]
+  extends AliasAwareOutputExpression { self: QueryPlan[T] =>
+  protected def orderingExpressions: Seq[SortOrder]
+
+  override protected def strip(expr: Expression): Expression = expr match {
+case e: Empty2Null => strip(e.child)
+case _ => expr
+  }
+
+  override final def outputOrdering: Seq[SortOrder] = {
+if (hasAlias) {
+  orderingExpressions.flatMap { sortOrder =>
+val orderingSet = mutable.Set.empty[Expression]
+val sameOrderings = sortOrder.children.toStream
+  .flatMap(projectExpression)
+  .filter(e => orderingSet.add(e.canonicalized))
+  .take(aliasCandidateLimit)
+if (sameOrderings.nonEmpty) {
+  Some(sortOrder.copy(child = sameOrderings.head,
+sameOrderExpressions = sameOrderings.tail))
+} else {
+  None
+}

Review Comment:
   Is it correct?

[GitHub] [spark] cloud-fan commented on pull request #39718: [SPARK-42163][SQL] Fix schema pruning for non-foldable array index or map key

2023-01-30 Thread via GitHub


cloud-fan commented on PR #39718:
URL: https://github.com/apache/spark/pull/39718#issuecomment-1409644259

   thanks, merging to master/3.4!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] cloud-fan closed pull request #39718: [SPARK-42163][SQL] Fix schema pruning for non-foldable array index or map key

2023-01-30 Thread via GitHub


cloud-fan closed pull request #39718: [SPARK-42163][SQL] Fix schema pruning for 
non-foldable array index or map key
URL: https://github.com/apache/spark/pull/39718


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] ulysses-you commented on a diff in pull request #37525: [SPARK-40086][SPARK-42049][SQL] Improve AliasAwareOutputPartitioning and AliasAwareQueryOutputOrdering to take all aliases into

2023-01-30 Thread via GitHub


ulysses-you commented on code in PR #37525:
URL: https://github.com/apache/spark/pull/37525#discussion_r1091353269


##
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/AliasAwareOutputExpression.scala:
##
@@ -0,0 +1,116 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.catalyst.plans
+
+import scala.collection.mutable
+
+import org.apache.spark.sql.catalyst.SQLConfHelper
+import org.apache.spark.sql.catalyst.expressions.{Alias, Attribute, 
AttributeSet, Empty2Null, Expression, NamedExpression, SortOrder}
+import org.apache.spark.sql.internal.SQLConf
+
+/**
+ * A trait that provides functionality to handle aliases in the 
`outputExpressions`.
+ */
+trait AliasAwareOutputExpression extends SQLConfHelper {
+  protected val aliasCandidateLimit = 
conf.getConf(SQLConf.EXPRESSION_PROJECTION_CANDIDATE_LIMIT)
+  protected def outputExpressions: Seq[NamedExpression]
+  /**
+   * This method can be used to strip expression which does not affect the 
result, for example:
+   * strip the expression which is ordering agnostic for output ordering.
+   */
+  protected def strip(expr: Expression): Expression = expr
+
+  // Build an `Expression` -> `Attribute` alias map.
+  // There can be multiple alias defined for the same expressions but it 
doesn't make sense to store
+  // more than `aliasCandidateLimit` attributes for an expression. In those 
cases the old logic
+  // handled only the last alias so we need to make sure that we give 
precedence to that.
+  // If the `outputExpressions` contain simple attributes we need to add those 
too to the map.
+  private lazy val aliasMap = {
+val aliases = mutable.Map[Expression, mutable.ArrayBuffer[Attribute]]()
+outputExpressions.reverse.foreach {
+  case a @ Alias(child, _) =>
+val buffer = aliases.getOrElseUpdate(strip(child.canonicalized), 
mutable.ArrayBuffer.empty)

Review Comment:
   I prefer `strip(child).canonicalized`. I have not seen other code places 
that match a canonicalized expression.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] cloud-fan commented on pull request #39808: [SPARK-41970][SQL][FOLLOWUP] Revert SparkPath changes to FileIndex and FileRelation

2023-01-30 Thread via GitHub


cloud-fan commented on PR #39808:
URL: https://github.com/apache/spark/pull/39808#issuecomment-1409640068

   late LGTM


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] ulysses-you commented on a diff in pull request #37525: [SPARK-40086][SPARK-42049][SQL] Improve AliasAwareOutputPartitioning and AliasAwareQueryOutputOrdering to take all aliases into

2023-01-30 Thread via GitHub


ulysses-you commented on code in PR #37525:
URL: https://github.com/apache/spark/pull/37525#discussion_r1091351690


##
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/AliasAwareOutputExpression.scala:
##
@@ -0,0 +1,116 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.catalyst.plans
+
+import scala.collection.mutable
+
+import org.apache.spark.sql.catalyst.SQLConfHelper
+import org.apache.spark.sql.catalyst.expressions.{Alias, Attribute, 
AttributeSet, Empty2Null, Expression, NamedExpression, SortOrder}
+import org.apache.spark.sql.internal.SQLConf
+
+/**
+ * A trait that provides functionality to handle aliases in the 
`outputExpressions`.
+ */
+trait AliasAwareOutputExpression extends SQLConfHelper {
+  protected val aliasCandidateLimit = 
conf.getConf(SQLConf.EXPRESSION_PROJECTION_CANDIDATE_LIMIT)
+  protected def outputExpressions: Seq[NamedExpression]
+  /**
+   * This method can be used to strip expression which does not affect the 
result, for example:
+   * strip the expression which is ordering agnostic for output ordering.
+   */
+  protected def strip(expr: Expression): Expression = expr
+
+  // Build an `Expression` -> `Attribute` alias map.
+  // There can be multiple alias defined for the same expressions but it 
doesn't make sense to store
+  // more than `aliasCandidateLimit` attributes for an expression. In those 
cases the old logic
+  // handled only the last alias so we need to make sure that we give 
precedence to that.
+  // If the `outputExpressions` contain simple attributes we need to add those 
too to the map.
+  private lazy val aliasMap = {
+val aliases = mutable.Map[Expression, mutable.ArrayBuffer[Attribute]]()
+outputExpressions.reverse.foreach {
+  case a @ Alias(child, _) =>
+val buffer = aliases.getOrElseUpdate(strip(child.canonicalized), 
mutable.ArrayBuffer.empty)
+if (buffer.size < aliasCandidateLimit) {
+  buffer += a.toAttribute
+}
+  case _ =>
+}
+outputExpressions.foreach {
+  case a: Attribute if aliases.contains(a.canonicalized) =>

Review Comment:
   What does this do? Do you mean `!aliases.contains(a.canonicalized)`?






[GitHub] [spark] ulysses-you commented on a diff in pull request #37525: [SPARK-40086][SPARK-42049][SQL] Improve AliasAwareOutputPartitioning and AliasAwareQueryOutputOrdering to take all aliases into

2023-01-30 Thread via GitHub


ulysses-you commented on code in PR #37525:
URL: https://github.com/apache/spark/pull/37525#discussion_r1091351231


##
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/AliasAwareOutputExpression.scala:
##
@@ -0,0 +1,116 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.catalyst.plans
+
+import scala.collection.mutable
+
+import org.apache.spark.sql.catalyst.SQLConfHelper
+import org.apache.spark.sql.catalyst.expressions.{Alias, Attribute, 
AttributeSet, Empty2Null, Expression, NamedExpression, SortOrder}
+import org.apache.spark.sql.internal.SQLConf
+
+/**
+ * A trait that provides functionality to handle aliases in the 
`outputExpressions`.
+ */
+trait AliasAwareOutputExpression extends SQLConfHelper {
+  protected val aliasCandidateLimit = 
conf.getConf(SQLConf.EXPRESSION_PROJECTION_CANDIDATE_LIMIT)
+  protected def outputExpressions: Seq[NamedExpression]
+  /**
+   * This method can be used to strip expression which does not affect the 
result, for example:
+   * strip the expression which is ordering agnostic for output ordering.
+   */
+  protected def strip(expr: Expression): Expression = expr
+
+  // Build an `Expression` -> `Attribute` alias map.
+  // There can be multiple alias defined for the same expressions but it 
doesn't make sense to store
+  // more than `aliasCandidateLimit` attributes for an expression. In those 
cases the old logic
+  // handled only the last alias so we need to make sure that we give 
precedence to that.
+  // If the `outputExpressions` contain simple attributes we need to add those 
too to the map.
+  private lazy val aliasMap = {
+val aliases = mutable.Map[Expression, mutable.ArrayBuffer[Attribute]]()

Review Comment:
   The current config actually limits the size of the final preserved
expressions. I think we should add one more config to limit the candidate size
in `aliasMap`, and it can be bigger by default. This `aliasMap` may harm driver
memory for wide tables.
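
   As a rough illustration of the suggestion, the extra knob could be an
additional `SQLConf` entry along these lines (the name and default below are
made up for illustration and are not part of the PR):

   ```scala
   // Hypothetical SQLConf entry: caps how many alias candidates are collected per
   // expression while building `aliasMap`, independently of the existing limit on
   // the final preserved expressions.
   val EXPRESSION_PROJECTION_ALIAS_MAP_CANDIDATE_LIMIT =
     buildConf("spark.sql.optimizer.expressionProjectionAliasMapCandidateLimit")
       .doc("Maximum number of alias candidates kept per expression when building " +
         "the alias map; bounds driver-side memory for wide projections.")
       .intConf
       .createWithDefault(1000)
   ```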






[GitHub] [spark] cloud-fan commented on pull request #39660: [SPARK-42128][SQL] Support TOP (N) for MS SQL Server dialect as an alternative to Limit pushdown

2023-01-30 Thread via GitHub


cloud-fan commented on PR #39660:
URL: https://github.com/apache/spark/pull/39660#issuecomment-1409637668

   This is a new feature and we don't need to rush (it can't catch 3.4 anyway).
I'd like to have an API refactor first.





[GitHub] [spark] cloud-fan commented on pull request #38760: [SPARK-41219][SQL] IntegralDivide use decimal(1, 0) to represent 0

2023-01-30 Thread via GitHub


cloud-fan commented on PR #38760:
URL: https://github.com/apache/spark/pull/38760#issuecomment-1409636538

   Please link to the PR of the decimal refactor. Then I think this is good to go.





[GitHub] [spark] cloud-fan commented on a diff in pull request #39747: [SPARK-42191][SQL] Support udf 'luhn_check'

2023-01-30 Thread via GitHub


cloud-fan commented on code in PR #39747:
URL: https://github.com/apache/spark/pull/39747#discussion_r1091350090


##
sql/core/src/test/scala/org/apache/spark/sql/ColumnExpressionSuite.scala:
##
@@ -3081,4 +3081,19 @@ class ColumnExpressionSuite extends QueryTest with 
SharedSparkSession {
 checkAnswer(df.select($"dd" / ($"num" + 3)),
   Seq((Duration.ofDays(2))).toDF)
   }
+
+  test("luhn check") {

Review Comment:
   I think the golden file tests fully cover it. If we add a Scala API for
luhn_check, we will need this test.






[GitHub] [spark] cloud-fan commented on a diff in pull request #39747: [SPARK-42191][SQL] Support udf 'luhn_check'

2023-01-30 Thread via GitHub


cloud-fan commented on code in PR #39747:
URL: https://github.com/apache/spark/pull/39747#discussion_r1091349674


##
sql/core/src/test/resources/sql-tests/inputs/string-functions.sql:
##
@@ -231,3 +231,39 @@ CREATE TEMPORARY VIEW fmtTable(fmtField) AS SELECT * FROM 
VALUES ('invalidFormat
 SELECT to_binary('abc', fmtField) FROM fmtTable;
 -- Clean up
 DROP VIEW IF EXISTS fmtTable;
+-- luhn_check
+-- basic cases
+select luhn_check('4111');
+select luhn_check('5504');
+select luhn_check('349');
+select luhn_check('60110004');
+select luhn_check('378282246310005');
+select luhn_check('6011000990139424');
+
+-- spaces in the beginning/middle/end
+select luhn_check('4111');
+select luhn_check('411 1');
+select luhn_check(' 4111');
+-- space
+select luhn_check('');
+select luhn_check('  ');
+-- non-digits
+select luhn_check('510B105105105106');
+select luhn_check('ABCDED');
+-- null
+select luhn_check(null);
+-- non string (test implicit cast)
+select luhn_check(6017);
+select luhn_check(6018);
+select luhn_check(123.456);
+
+select luhn_check('0');
+
+SELECT luhn_check(c1) from values ("4111"), ("5105105105105106") 
as tab(c1);

Review Comment:
   why do we need extra tests for `from values`?






[GitHub] [spark] cloud-fan commented on a diff in pull request #39747: [SPARK-42191][SQL] Support udf 'luhn_check'

2023-01-30 Thread via GitHub


cloud-fan commented on code in PR #39747:
URL: https://github.com/apache/spark/pull/39747#discussion_r1091349494


##
sql/core/src/test/resources/sql-tests/inputs/string-functions.sql:
##
@@ -231,3 +231,39 @@ CREATE TEMPORARY VIEW fmtTable(fmtField) AS SELECT * FROM 
VALUES ('invalidFormat
 SELECT to_binary('abc', fmtField) FROM fmtTable;
 -- Clean up
 DROP VIEW IF EXISTS fmtTable;
+-- luhn_check
+-- basic cases
+select luhn_check('4111');
+select luhn_check('5504');
+select luhn_check('349');
+select luhn_check('60110004');
+select luhn_check('378282246310005');
+select luhn_check('6011000990139424');

Review Comment:
   do these 6 tests cover different code branches?






[GitHub] [spark] cloud-fan commented on a diff in pull request #39747: [SPARK-42191][SQL] Support udf 'luhn_check'

2023-01-30 Thread via GitHub


cloud-fan commented on code in PR #39747:
URL: https://github.com/apache/spark/pull/39747#discussion_r1091349245


##
sql/core/src/test/resources/sql-tests/inputs/string-functions.sql:
##
@@ -231,3 +231,39 @@ CREATE TEMPORARY VIEW fmtTable(fmtField) AS SELECT * FROM 
VALUES ('invalidFormat
 SELECT to_binary('abc', fmtField) FROM fmtTable;
 -- Clean up
 DROP VIEW IF EXISTS fmtTable;
+-- luhn_check
+-- basic cases
+select luhn_check('4111');
+select luhn_check('5504');
+select luhn_check('349');
+select luhn_check('60110004');
+select luhn_check('378282246310005');
+select luhn_check('6011000990139424');
+
+-- spaces in the beginning/middle/end
+select luhn_check('4111');
+select luhn_check('411 1');
+select luhn_check(' 4111');
+-- space
+select luhn_check('');
+select luhn_check('  ');
+-- non-digits
+select luhn_check('510B105105105106');
+select luhn_check('ABCDED');
+-- null
+select luhn_check(null);
+-- non string (test implicit cast)
+select luhn_check(6017);
+select luhn_check(6018);
+select luhn_check(123.456);
+
+select luhn_check('0');

Review Comment:
   Shall we move it to the `-- basic cases` section? It's also a number-string
case.






[GitHub] [spark] cloud-fan commented on a diff in pull request #39747: [SPARK-42191][SQL] Support udf 'luhn_check'

2023-01-30 Thread via GitHub


cloud-fan commented on code in PR #39747:
URL: https://github.com/apache/spark/pull/39747#discussion_r1091348733


##
sql/catalyst/src/main/java/org/apache/spark/sql/catalyst/expressions/ExpressionImplUtils.java:
##
@@ -35,6 +35,35 @@ public class ExpressionImplUtils {
   private static final int GCM_IV_LEN = 12;
   private static final int GCM_TAG_LEN = 128;
 
+  /**
+   * Function to check if a given number string is a valid Luhn number
+   * @param numberString
+   *  the number string to check
+   * @return
+   *  true if the number string is a valid Luhn number, false otherwise.
+   */
+  public static boolean isLuhnNumber(UTF8String numberString) {
+String digits = numberString.toString();
+// Check if all characters in the input string are digits.
+if (!digits.trim().isEmpty() && 
digits.chars().allMatch(Character::isDigit)) {

Review Comment:
   Do we need the `trim()` call here? Luhn doesn't support spaces.






[GitHub] [spark] panbingkun opened a new pull request, #39816: [SPARK-42245][BUILD] Upgrade scalafmt from 3.6.1 to 3.7.1

2023-01-30 Thread via GitHub


panbingkun opened a new pull request, #39816:
URL: https://github.com/apache/spark/pull/39816

   ### What changes were proposed in this pull request?
   This PR aims to upgrade scalafmt from 3.6.1 to 3.7.1.
   
   ### Why are the changes needed?
   A. Release note:
   > https://github.com/scalameta/scalafmt/releases
   
   B. V3.6.1 VS V3.7.1
   > https://github.com/scalameta/scalafmt/compare/v3.6.1...v3.7.1
   
   C. Brings bug fixes & some improvements (screenshots):
   https://user-images.githubusercontent.com/15246973/215639186-47ad2abc-5827-4b0b-a401-10737bd05743.png
   https://user-images.githubusercontent.com/15246973/215639316-0df69d85-cb6b-40f8-acbf-d792193d1ba1.png
   
   ### Does this PR introduce _any_ user-facing change?
   No.
   
   ### How was this patch tested?
   Manually ran: `sh ./dev/scalafmt`
   Passed GA.





[GitHub] [spark] cloud-fan commented on pull request #37525: [SPARK-40086][SPARK-42049][SQL] Improve AliasAwareOutputPartitioning and AliasAwareQueryOutputOrdering to take all aliases into account

2023-01-30 Thread via GitHub


cloud-fan commented on PR #37525:
URL: https://github.com/apache/spark/pull/37525#issuecomment-1409625294

   @peter-toth can you retrigger the tests? The pyspark failures may be flaky.





[GitHub] [spark] hvanhovell closed pull request #39760: [SPARK-42202][Connect][Test] Improve the E2E test server stop logic

2023-01-30 Thread via GitHub


hvanhovell closed pull request #39760: [SPARK-42202][Connect][Test] Improve the 
E2E test server stop logic
URL: https://github.com/apache/spark/pull/39760





[GitHub] [spark] hvanhovell commented on pull request #39760: [SPARK-42202][Connect][Test] Improve the E2E test server stop logic

2023-01-30 Thread via GitHub


hvanhovell commented on PR #39760:
URL: https://github.com/apache/spark/pull/39760#issuecomment-1409623518

   Merging this.





[GitHub] [spark] vinodkc commented on a diff in pull request #39747: [SPARK-42191][SQL] Support udf 'luhn_check'

2023-01-30 Thread via GitHub


vinodkc commented on code in PR #39747:
URL: https://github.com/apache/spark/pull/39747#discussion_r1091338386


##
sql/core/src/test/resources/sql-tests/inputs/string-functions.sql:
##
@@ -231,3 +231,42 @@ CREATE TEMPORARY VIEW fmtTable(fmtField) AS SELECT * FROM 
VALUES ('invalidFormat
 SELECT to_binary('abc', fmtField) FROM fmtTable;
 -- Clean up
 DROP VIEW IF EXISTS fmtTable;
+-- luhn_check
+select luhn_check('4111');
+select luhn_check('4111');
+select luhn_check('5504');
+select luhn_check('349');
+select luhn_check('60110004');

Review Comment:
   I added the suggested test cases.






[GitHub] [spark] ulysses-you commented on pull request #38760: [SPARK-41219][SQL] IntegralDivide use decimal(1, 0) to represent 0

2023-01-30 Thread via GitHub


ulysses-you commented on PR #38760:
URL: https://github.com/apache/spark/pull/38760#issuecomment-1409618212

   @cloud-fan sure, I have updated the description!





[GitHub] [spark] itholic commented on pull request #39799: [SPARK-42232][SQL] Rename error class: `UNSUPPORTED_FEATURE.JDBC_TRANSACTION`

2023-01-30 Thread via GitHub


itholic commented on PR #39799:
URL: https://github.com/apache/spark/pull/39799#issuecomment-1409618136

   cc @MaxGekk @srielau @cloud-fan FYI





[GitHub] [spark] beliefer commented on a diff in pull request #39660: [SPARK-42128][SQL] Support TOP (N) for MS SQL Server dialect as an alternative to Limit pushdown

2023-01-30 Thread via GitHub


beliefer commented on code in PR #39660:
URL: https://github.com/apache/spark/pull/39660#discussion_r1091335134


##
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JDBCRDD.scala:
##
@@ -307,11 +307,12 @@ private[jdbc] class JDBCRDD(
   ""
 }
 
+val myTopExpression: String = dialect.getTopExpression(limit) // SQL 
Server Limit alternative
 val myLimitClause: String = dialect.getLimitClause(limit)
 val myOffsetClause: String = dialect.getOffsetClause(offset)
 
 val sqlText = options.prepareQuery +
-  s"SELECT $columnList FROM ${options.tableOrQuery} $myTableSampleClause" +
+  s"SELECT $myTopExpression $columnList FROM ${options.tableOrQuery} 
$myTableSampleClause" +

Review Comment:
   I will do the work.
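
   For context, a rough sketch of the dialect hook used in the diff above
(illustrative only, not the PR's final code): most dialects keep returning an
empty TOP expression, while the MS SQL Server dialect emits `TOP (n)` right
after `SELECT` because it has no `LIMIT` clause.

   ```scala
   // Base dialect: no TOP expression; LIMIT, if supported, is rendered elsewhere.
   def getTopExpression(limit: Int): String = ""

   // MS SQL Server: limit == 0 conventionally means "no limit pushed down".
   def msSqlServerTopExpression(limit: Int): String =
     if (limit > 0) s"TOP ($limit)" else ""

   // s"SELECT ${msSqlServerTopExpression(5)} name, age FROM people"
   //   -> "SELECT TOP (5) name, age FROM people"
   ```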






[GitHub] [spark] itholic opened a new pull request, #39815: [SPARK-42244][PYTHON] Refine error classes and messages

2023-01-30 Thread via GitHub


itholic opened a new pull request, #39815:
URL: https://github.com/apache/spark/pull/39815

   ### What changes were proposed in this pull request?
   
   This PR proposes to refine error classes and messages by using the Python 
type object name for error class name and its messages.
   
   nit - Also add missing dots at the end of error messages.
   
   
   ### Why are the changes needed?
   
   
   The type name was inconsistent in the previous error messages, e.g. string
vs. str and int vs. integer, and sometimes the error class name is too long to
read.
   
   ### Does this PR introduce _any_ user-facing change?
   
   
   No.
   
   
   ### How was this patch tested?
   
   
   The existing CI should pass.





[GitHub] [spark] xinrong-meng opened a new pull request, #39814: [WIP][SPARK-42208][CONNECT][PYTHON] Reuse UDF test cases under `pyspark.sql.tests`

2023-01-30 Thread via GitHub


xinrong-meng opened a new pull request, #39814:
URL: https://github.com/apache/spark/pull/39814

   ### What changes were proposed in this pull request?
   Reuse UDF test cases under `pyspark.sql.tests`.
   
   
   ### Why are the changes needed?
   Reach parity with vanilla PySpark.
   
   
   ### Does this PR introduce _any_ user-facing change?
   No.
   
   
   ### How was this patch tested?
   Unit tests.





[GitHub] [spark] xinrong-meng commented on pull request #39753: [SPARK-42125][CONNECT][PYTHON] Pandas UDF in Spark Connect

2023-01-30 Thread via GitHub


xinrong-meng commented on PR #39753:
URL: https://github.com/apache/spark/pull/39753#issuecomment-1409604165

   Merged it to unblock follow-up PRs. Feel free to leave comments, I will 
address them in the follow-ups.





[GitHub] [spark] srowen commented on pull request #39381: [SPARK-41554] fix changing of Decimal scale when scale decreased by m…

2023-01-30 Thread via GitHub


srowen commented on PR #39381:
URL: https://github.com/apache/spark/pull/39381#issuecomment-1409602431

   Roger that - any chance you can re-trigger these tests? They're hung for
some reason.





[GitHub] [spark] xinrong-meng commented on pull request #39753: [SPARK-42125][CONNECT][PYTHON] Pandas UDF in Spark Connect

2023-01-30 Thread via GitHub


xinrong-meng commented on PR #39753:
URL: https://github.com/apache/spark/pull/39753#issuecomment-1409601795

   Merged to master and branch-3.4. Thanks all!





[GitHub] [spark] xinrong-meng closed pull request #39753: [SPARK-42125][CONNECT][PYTHON] Pandas UDF in Spark Connect

2023-01-30 Thread via GitHub


xinrong-meng closed pull request #39753: [SPARK-42125][CONNECT][PYTHON] Pandas 
UDF in Spark Connect
URL: https://github.com/apache/spark/pull/39753





[GitHub] [spark] vinodkc commented on a diff in pull request #39747: [SPARK-42191][SQL] Support udf 'luhn_check'

2023-01-30 Thread via GitHub


vinodkc commented on code in PR #39747:
URL: https://github.com/apache/spark/pull/39747#discussion_r1091322757


##
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/stringExpressions.scala:
##
@@ -3039,3 +3039,43 @@ case class SplitPart (
   partNum = newChildren.apply(2))
   }
 }
+
+/**
+ * Function to check if a given number string is a valid Luhn number. Returns 
true, if the number
+ * string is a valid Luhn number, false otherwise.
+ */
+@ExpressionDescription(
+  usage = """
+_FUNC_(str ) - Checks that a string of digits is valid according to the 
Luhn algorithm.
+This checksum function is widely applied on credit card numbers and 
government identification
+numbers to distinguish valid numbers from mistyped, incorrect numbers.

Review Comment:
   To check how newlines render, I tried the following tests:
   
   `spark-sql> desc function luhn_check;`
   ```
   Function: luhn_check
   Class: org.apache.spark.sql.catalyst.expressions.Luhncheck
   Usage:
   luhn_check(str ) - Checks that a string of digits is valid according to 
the Luhn algorithm.
   This checksum function is widely applied on credit card numbers and 
government identification
   numbers to distinguish valid numbers from mistyped, incorrect numbers.
   ```
   
   I did the same test on other built-in functions like `substring_index` and
`rpad`:
   ```
   spark-sql> desc function substring_index;
   Function: substring_index
   Class: org.apache.spark.sql.catalyst.expressions.SubstringIndex
   Usage:
   substring_index(str, delim, count) - Returns the substring from `str` 
before `count` occurrences of the delimiter `delim`.
 If `count` is positive, everything to the left of the final delimiter 
(counting from the
 left) is returned. If `count` is negative, everything to the right of 
the final delimiter
 (counting from the right) is returned. The function substring_index 
performs a case-sensitive match
 when searching for `delim`.
   ```
   Please advise; if adding newlines is not recommended, I can change the
function doc.






[GitHub] [spark] zhenlineo commented on pull request #39760: [SPARK-42202][Connect][Test] Improve the E2E test server stop logic

2023-01-30 Thread via GitHub


zhenlineo commented on PR #39760:
URL: https://github.com/apache/spark/pull/39760#issuecomment-1409594157

   @LuciferYang could you have another look? Thanks!





[GitHub] [spark] fe2s commented on pull request #39381: [SPARK-41554] fix changing of Decimal scale when scale decreased by m…

2023-01-30 Thread via GitHub


fe2s commented on PR #39381:
URL: https://github.com/apache/spark/pull/39381#issuecomment-1409594093

   > Is there a backport for 3.3. too?
   
   Created https://github.com/apache/spark/pull/39813. Sorry for the delay.





[GitHub] [spark] gengliangwang commented on a diff in pull request #39773: [SPARK-42217][SQL] Support implicit lateral column alias in queries with Window

2023-01-30 Thread via GitHub


gengliangwang commented on code in PR #39773:
URL: https://github.com/apache/spark/pull/39773#discussion_r1091320427


##
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/ResolveLateralColumnAliasReference.scala:
##
@@ -17,7 +17,8 @@
 
 package org.apache.spark.sql.catalyst.analysis
 
-import org.apache.spark.sql.catalyst.expressions.{Alias, Attribute, 
AttributeMap, Expression, LateralColumnAliasReference, NamedExpression, 
ScalarSubquery}
+import org.apache.spark.sql.catalyst.expressions.{Alias, Attribute, 
AttributeMap, Expression, LateralColumnAliasReference, NamedExpression, 
ScalarSubquery, WindowExpression, WindowSpecDefinition}

Review Comment:
   nit: let's just use `import org.apache.spark.sql.catalyst.expressions._`






[GitHub] [spark] fe2s opened a new pull request, #39813: [SPARK-41554] fix changing of Decimal scale when scale decreased by m…

2023-01-30 Thread via GitHub


fe2s opened a new pull request, #39813:
URL: https://github.com/apache/spark/pull/39813

   …ore than 18
   
   This is a backport PR for https://github.com/apache/spark/pull/39099





[GitHub] [spark] zhenlineo commented on pull request #39789: [SPARK-42228][BUILD][CONNECT] Add shade and relocation rule of grpc to connect-client-jvm module

2023-01-30 Thread via GitHub


zhenlineo commented on PR #39789:
URL: https://github.com/apache/spark/pull/39789#issuecomment-1409593261

   Hi @LuciferYang, thanks for fixing this error.
   
   Regarding the E2E test, I want to clarify its purpose: it is used by the
Scala client to verify that client methods return the correct results. Due to
the nature of client development, we can hardly verify many behaviors without a
real server, so the E2E suite contains a lot of very basic tests and needs to
run daily. It was not meant to test server or client shading. The reason for
using the server's shaded jar is that it is the only way to start Spark with
Spark Connect.
   
   I am not much in favor of also using the E2E suite for client/server shading
tests. Can I suggest we add a separate test for client/server shading,
following the idea that each test checks only one thing?
   
   Many thanks for looking into better E2E tests. I also tried putting the E2E
tests in a separate module, but I did not get much benefit; it just created an
empty module that only contains tests. It would be really great if you could
work out a way to run these tests as if they were unit tests. Thanks again.





[GitHub] [spark] techaddict commented on pull request #39614: [SPARK-42002][CONNECT][PYTHON] Implement DataFrameWriterV2

2023-01-30 Thread via GitHub


techaddict commented on PR #39614:
URL: https://github.com/apache/spark/pull/39614#issuecomment-1409591664

   @HyukjinKwon thanks, updated the PR





[GitHub] [spark] gengliangwang commented on a diff in pull request #39773: [SPARK-42217][SQL] Support implicit lateral column alias in queries with Window

2023-01-30 Thread via GitHub


gengliangwang commented on code in PR #39773:
URL: https://github.com/apache/spark/pull/39773#discussion_r1091316642


##
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/ResolveLateralColumnAliasReference.scala:
##
@@ -104,7 +112,10 @@ object ResolveLateralColumnAliasReference extends 
Rule[LogicalPlan] {
   // phase 2: unwrap
   plan.resolveOperatorsUpWithPruning(
 _.containsPattern(LATERAL_COLUMN_ALIAS_REFERENCE), ruleId) {
-case p @ Project(projectList, child) if p.resolved
+case p @ Project(projectList, child) if
+  (p.resolved
+// can also work on nodes that contains Window but have all others 
resolved
+|| (projectList.exists(hasWindowExpression) && 
exprsAndChildResolved(p)))

Review Comment:
   `p.resolved || (projectList.exists(hasWindowExpression) &&
exprsAndChildResolved(p))` shows up multiple times. Shall we abstract it into a
method?






[GitHub] [spark] gengliangwang commented on a diff in pull request #39773: [SPARK-42217][SQL] Support implicit lateral column alias in queries with Window

2023-01-30 Thread via GitHub


gengliangwang commented on code in PR #39773:
URL: https://github.com/apache/spark/pull/39773#discussion_r1091315214


##
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/ResolveLateralColumnAliasReference.scala:
##
@@ -81,6 +82,9 @@ import org.apache.spark.sql.internal.SQLConf
  *+- Aggregate [dept#14] [avg(salary#16) AS avg(salary)#26, avg(bonus#17) 
AS avg(bonus)#27,
  *dept#14]
  *   +- Child [dept#14,name#15,salary#16,bonus#17]
+ *
+ *
+ * TODO: example with Window

Review Comment:
   @anchovYu could you fix the TODO here?






[GitHub] [spark] zhengruifeng commented on pull request #38947: [SPARK-41233][SQL] Add `array_prepend` function

2023-01-30 Thread via GitHub


zhengruifeng commented on PR #38947:
URL: https://github.com/apache/spark/pull/38947#issuecomment-1409581323

   @navinvishy mind fixing the Scala linter failure?
   
   





[GitHub] [spark] vinodkc commented on a diff in pull request #39747: [SPARK-42191][SQL] Support udf 'luhn_check'

2023-01-30 Thread via GitHub


vinodkc commented on code in PR #39747:
URL: https://github.com/apache/spark/pull/39747#discussion_r1091310570


##
sql/catalyst/src/main/java/org/apache/spark/sql/catalyst/expressions/ExpressionImplUtils.java:
##
@@ -35,6 +35,35 @@ public class ExpressionImplUtils {
   private static final int GCM_IV_LEN = 12;
   private static final int GCM_TAG_LEN = 128;
 
+  /**
+   * Function to check if a given number string is a valid Luhn number
+   * @param numberString
+   *  the number string to check
+   * @return
+   *  true if the number string is a valid Luhn number, false otherwise.
+   */
+  public static boolean isLuhnNumber(UTF8String numberString) {
+String digits = numberString.toString();
+// Check if all characters in the input string are digits.
+if (!digits.trim().isEmpty() && 
digits.chars().allMatch(Character::isDigit)) {
+  int checkSum = 0;
+  boolean isSecond = false;
+  for (int i = digits.length() - 1; i >= 0; i--) {
+int digit = Character.getNumericValue(digits.charAt(i));

Review Comment:
   Spaces are not supported in a Luhn number check, so it will return false.
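
   For reference, a minimal Scala sketch of the check being discussed
(illustrative, not the PR's Java implementation): any non-digit character,
including a space, fails the digits-only guard and the function returns false.

   ```scala
   def isLuhnNumber(s: String): Boolean = {
     // Digits-only guard: empty strings and any non-digit (spaces included) fail fast.
     if (s.isEmpty || !s.forall(_.isDigit)) return false
     // Luhn checksum: from the rightmost digit, double every second digit and
     // subtract 9 when the doubled value exceeds 9, then sum everything up.
     val checksum = s.reverse.zipWithIndex.map { case (c, i) =>
       val d = c.asDigit
       if (i % 2 == 1) { val doubled = d * 2; if (doubled > 9) doubled - 9 else doubled } else d
     }.sum
     checksum % 10 == 0
   }

   // isLuhnNumber("378282246310005")  -> true
   // isLuhnNumber("411 1")            -> false (the space fails the guard)
   ```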






[GitHub] [spark] itholic commented on a diff in pull request #39795: [SPARK-42234][SQL] Rename error class: `UNSUPPORTED_FEATURE.REPEATED_PIVOT`

2023-01-30 Thread via GitHub


itholic commented on code in PR #39795:
URL: https://github.com/apache/spark/pull/39795#discussion_r1091304905


##
core/src/main/resources/error/error-classes.json:
##
@@ -1207,6 +1207,12 @@
 ],
 "sqlState" : "42K03"
   },
+  "REPEATED_CLAUSE" : {
+"message" : [
+  "The <clause> clause may be used at most once per <operation> operation."
+],
+"sqlState" : "2202G"

Review Comment:
   Thanks for the correction! Just updated to 42614.






[GitHub] [spark] itholic commented on a diff in pull request #39705: [SPARK-41488][SQL] Assign name to _LEGACY_ERROR_TEMP_1176 (and 1177)

2023-01-30 Thread via GitHub


itholic commented on code in PR #39705:
URL: https://github.com/apache/spark/pull/39705#discussion_r1091295305


##
core/src/main/resources/error/error-classes.json:
##
@@ -598,6 +598,29 @@
   "Please try to re-create the view by running: ."
 ]
   },
+  "INCOMPLETE_TYPE_DEFINITION" : {
+"message" : [
+  "Incomplete complex type:"
+],
+"subClass" : {
+  "ARRAY" : {
+"message" : [
+  "The definition of \"ARRAY\" type is incomplete. You must provide an 
element type. For example: \"ARRAY\"."

Review Comment:
   Oh, actually `elementType` here is a parameter (unlike the other sub-error
classes), so we already show a message like the one you suggested, as below:
   ```scala
 case "array" =>
   new ParseException(
 errorClass = "INCOMPLETE_TYPE_DEFINITION.ARRAY",
 messageParameters = Map("elementType" -> ""),
 ctx)
 case "struct" =>
   new ParseException(
 errorClass = "INCOMPLETE_TYPE_DEFINITION.STRUCT",
 messageParameters = Map.empty,
 ctx)
 case "map" =>
   new ParseException(
 errorClass = "INCOMPLETE_TYPE_DEFINITION.MAP",
 messageParameters = Map.empty,
 ctx)
   ```
   
   It looks confusing tho... 😅 






[GitHub] [spark] srielau commented on a diff in pull request #39795: [SPARK-42234][SQL] Rename error class: `UNSUPPORTED_FEATURE.REPEATED_PIVOT`

2023-01-30 Thread via GitHub


srielau commented on code in PR #39795:
URL: https://github.com/apache/spark/pull/39795#discussion_r1091299476


##
core/src/main/resources/error/error-classes.json:
##
@@ -1207,6 +1207,12 @@
 ],
 "sqlState" : "42K03"
   },
+  "REPEATED_CLAUSE" : {
+"message" : [
+  "The <clause> clause may be used at most once per <operation> operation."
+],
+"sqlState" : "2202G"

Review Comment:
   I mistyped. It should have been 42614, "Duplicate clause".
   42616 is Db2 LUW specific (I drew on DB2 z/OS to fill the table).






[GitHub] [spark] srielau commented on a diff in pull request #39795: [SPARK-42234][SQL] Rename error class: `UNSUPPORTED_FEATURE.REPEATED_PIVOT`

2023-01-30 Thread via GitHub


srielau commented on code in PR #39795:
URL: https://github.com/apache/spark/pull/39795#discussion_r1090896713


##
core/src/main/resources/error/error-classes.json:
##
@@ -1207,6 +1207,12 @@
 ],
 "sqlState" : "42K03"
   },
+  "REPEATED_CLAUSE" : {
+"message" : [
+  "The <clause> clause may be used at most once per <operation> operation."
+],
+"sqlState" : "2202G"

Review Comment:
   2202G ("invalid repeat argument in a sample clause") is quite specific.
   ```suggestion
   "sqlState" : "42614"
   ```






[GitHub] [spark] HyukjinKwon commented on pull request #39614: [SPARK-42002][CONNECT][PYTHON] Implement DataFrameWriterV2

2023-01-30 Thread via GitHub


HyukjinKwon commented on PR #39614:
URL: https://github.com/apache/spark/pull/39614#issuecomment-1409559076

   @techaddict mind rebasing this so we can get the latest test results? Sorry
it slipped through my fingers.





[GitHub] [spark] itholic commented on pull request #39797: [SPARK-42231][SQL] Turn `MISSING_STATIC_PARTITION_COLUMN` into `internalError`

2023-01-30 Thread via GitHub


itholic commented on PR #39797:
URL: https://github.com/apache/spark/pull/39797#issuecomment-1409558942

   Test passed.
   cc @MaxGekk @srielau @cloud-fan FYI





[GitHub] [spark] HyukjinKwon commented on a diff in pull request #39614: [SPARK-42002][CONNECT][PYTHON] Implement DataFrameWriterV2

2023-01-30 Thread via GitHub


HyukjinKwon commented on code in PR #39614:
URL: https://github.com/apache/spark/pull/39614#discussion_r1091297030


##
python/pyspark/sql/connect/plan.py:
##
@@ -1413,6 +1413,95 @@ def _repr_html_(self) -> str:
 pass
 
 
+class WriteOperationV2(LogicalPlan):
+def __init__(self, child: "LogicalPlan", table_name: str) -> None:
+super(WriteOperationV2, self).__init__(child)
+self.table_name: Optional[str] = table_name
+self.provider: Optional[str] = None
+self.partitioning_columns: List["ColumnOrName"] = []
+self.options: dict[str, Optional[str]] = {}
+self.table_properties: dict[str, Optional[str]] = {}
+self.mode: Optional[str] = None
+self.overwrite_condition: Optional["ColumnOrName"] = None
+
+def col_to_expr(self, col: "ColumnOrName", session: "SparkConnectClient") 
-> proto.Expression:
+if isinstance(col, Column):
+return col.to_plan(session)
+else:
+return self.unresolved_attr(col)
+
+def command(self, session: "SparkConnectClient") -> proto.Command:
+assert self._child is not None
+plan = proto.Command()
+plan.write_operation_v2.input.CopyFrom(self._child.plan(session))
+if self.table_name is not None:
+plan.write_operation_v2.table_name = self.table_name
+if self.provider is not None:
+plan.write_operation_v2.provider = self.provider
+
+plan.write_operation_v2.partitioning_columns.extend(
+[self.col_to_expr(x, session) for x in self.partitioning_columns]
+)
+
+for k in self.options:
+if self.options[k] is None:
+plan.write_operation_v2.options.pop(k, None)
+else:
+plan.write_operation_v2.options[k] = cast(str, self.options[k])
+
+for k in self.table_properties:
+if self.table_properties[k] is None:
+plan.write_operation_v2.table_properties.pop(k, None)
+else:
+plan.write_operation_v2.table_properties[k] = cast(str, 
self.table_properties[k])
+
+if self.mode is not None:
+wm = self.mode.lower()
+if wm == "create":
+plan.write_operation_v2.mode = 
proto.WriteOperationV2.Mode.MODE_CREATE
+elif wm == "overwrite":
+plan.write_operation_v2.mode = 
proto.WriteOperationV2.Mode.MODE_OVERWRITE
+elif wm == "overwrite_partition":
+plan.write_operation_v2.mode = 
proto.WriteOperationV2.Mode.MODE_OVERWRITE_PARTITIONS
+elif wm == "append":
+plan.write_operation_v2.mode = 
proto.WriteOperationV2.Mode.MODE_APPEND
+elif wm == "replace":
+plan.write_operation_v2.mode = 
proto.WriteOperationV2.Mode.MODE_REPLACE
+if self.overwrite_condition is not None:
+plan.write_operation_v2.overwrite_condition.CopyFrom(
+self.col_to_expr(self.overwrite_condition, session)
+)
+elif wm == "create_or_replace":
+plan.write_operation_v2.mode = 
proto.WriteOperationV2.Mode.MODE_CREATE_OR_REPLACE
+else:
+raise ValueError(f"Unknown Mode value for DataFrame: 
{self.mode}")
+return plan
+
+def print(self, indent: int = 0) -> str:
+i = " " * indent
+return (
+f"{i}"
+f""
+)
+
+def _repr_html_(self) -> str:

Review Comment:
   I actually think you can remove the `print` and `_repr_html_`
implementations (and let the default implementations handle it).





