Re: [PR] [SPARK-47155][PYTHON] Fix Error Class Issue [spark]

2024-03-06 Thread via GitHub


HyukjinKwon closed pull request #45306: [SPARK-47155][PYTHON] Fix Error Class Issue
URL: https://github.com/apache/spark/pull/45306





Re: [PR] [SPARK-47155][PYTHON] Fix Error Class Issue [spark]

2024-03-06 Thread via GitHub


HyukjinKwon commented on PR #45306:
URL: https://github.com/apache/spark/pull/45306#issuecomment-1982702140

   Merged to master.





Re: [PR] [SPARK-43124][SQL] Add ConvertCommandResultToLocalRelation rule [spark]

2024-03-06 Thread via GitHub


wForget commented on code in PR #45397:
URL: https://github.com/apache/spark/pull/45397#discussion_r1515650747


##
sql/core/src/main/scala/org/apache/spark/sql/catalyst/optimizer/ConvertCommandResultToLocalRelation.scala:
##
@@ -0,0 +1,52 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.catalyst.optimizer
+
+import org.apache.spark.sql.catalyst.expressions.{AttributeReference, Expression, IntegerLiteral, InterpretedMutableProjection, Predicate, Unevaluable}
+import org.apache.spark.sql.catalyst.plans.logical.{CommandResult, Filter, Limit, LocalRelation, LogicalPlan, Project}
+import org.apache.spark.sql.catalyst.rules.Rule
+import org.apache.spark.sql.catalyst.trees.TreePattern.COMMAND_RESULT
+
+/**
+ * Converts local operations (i.e. ones that don't require data exchange) on `CommandResult`
+ * to `LocalRelation`.
+ */
+object ConvertCommandResultToLocalRelation extends Rule[LogicalPlan] {
+
+  override def apply(plan: LogicalPlan): LogicalPlan = plan.transformWithPruning(
+    _.containsPattern(COMMAND_RESULT)) {
+    case Project(projectList, CommandResult(output, _, _, rows))
+        if !projectList.exists(hasUnevaluableExpr) =>
+      val projection = new InterpretedMutableProjection(projectList, output)
+      projection.initialize(0)
+      LocalRelation(projectList.map(_.toAttribute), rows.map(projection(_).copy()))
+
+    case Limit(IntegerLiteral(limit), CommandResult(output, _, _, rows)) =>
+      LocalRelation(output, rows.take(limit))
+
+    case Filter(condition, CommandResult(output, _, _, rows))

Review Comment:
   Thanks for the explanation, it makes sense to me. Can we continue to review #45373?






Re: [PR] [SPARK-36691][PYTHON] PythonRunner failed should pass error message to ApplicationMaster too [spark]

2024-03-06 Thread via GitHub


AngersZh commented on PR #33934:
URL: https://github.com/apache/spark/pull/33934#issuecomment-1982645154

   ping @holdenk 





Re: [PR] [SPARK-47315][SQL][TEST] Clean up tempView for `createTempView` UT [spark]

2024-03-06 Thread via GitHub


wForget commented on code in PR #45417:
URL: https://github.com/apache/spark/pull/45417#discussion_r1515636612


##
sql/core/src/test/scala/org/apache/spark/sql/DatasetSuite.scala:
##
@@ -1420,18 +1420,20 @@ class DatasetSuite extends QueryTest
 dataset.sparkSession.catalog.dropTempView("tempView")
 
 withDatabase("test_db") {

Review Comment:
   > We're cleaning the temp db up so we're good?
   
   `tempView` does not seem to be cleaned up, because the temporary view does not belong to a db.
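
   A minimal sketch of the cleanup pattern under discussion, assuming the suite mixes in `SQLTestUtils` (which provides `withTempView`); the test body is illustrative, not the actual `DatasetSuite` code:
   ```
   // withTempView drops the named temp view in a finally block, so the
   // view is cleaned up even if an assertion fails. Temp views live in
   // the session catalog, not in a database, so withDatabase won't help.
   withTempView("tempView") {
     val dataset = Seq(1, 2, 3).toDS()  // assumes testImplicits._ in scope
     dataset.createOrReplaceTempView("tempView")
     assert(spark.catalog.tableExists("tempView"))
   }
   ```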






Re: [PR] [SPARK-43124][SQL] Add ConvertCommandResultToLocalRelation rule [spark]

2024-03-06 Thread via GitHub


HyukjinKwon commented on code in PR #45397:
URL: https://github.com/apache/spark/pull/45397#discussion_r1515634282


##
sql/core/src/main/scala/org/apache/spark/sql/catalyst/optimizer/ConvertCommandResultToLocalRelation.scala:
##
@@ -0,0 +1,52 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.catalyst.optimizer
+
+import org.apache.spark.sql.catalyst.expressions.{AttributeReference, Expression, IntegerLiteral, InterpretedMutableProjection, Predicate, Unevaluable}
+import org.apache.spark.sql.catalyst.plans.logical.{CommandResult, Filter, Limit, LocalRelation, LogicalPlan, Project}
+import org.apache.spark.sql.catalyst.rules.Rule
+import org.apache.spark.sql.catalyst.trees.TreePattern.COMMAND_RESULT
+
+/**
+ * Converts local operations (i.e. ones that don't require data exchange) on `CommandResult`
+ * to `LocalRelation`.
+ */
+object ConvertCommandResultToLocalRelation extends Rule[LogicalPlan] {
+
+  override def apply(plan: LogicalPlan): LogicalPlan = plan.transformWithPruning(
+    _.containsPattern(COMMAND_RESULT)) {
+    case Project(projectList, CommandResult(output, _, _, rows))
+        if !projectList.exists(hasUnevaluableExpr) =>
+      val projection = new InterpretedMutableProjection(projectList, output)
+      projection.initialize(0)
+      LocalRelation(projectList.map(_.toAttribute), rows.map(projection(_).copy()))
+
+    case Limit(IntegerLiteral(limit), CommandResult(output, _, _, rows)) =>
+      LocalRelation(output, rows.take(limit))
+
+    case Filter(condition, CommandResult(output, _, _, rows))

Review Comment:
   I see, thanks for the explanation.






Re: [PR] [SQL] Bind JDBC dialect to JDBCRDD at construction [spark]

2024-03-06 Thread via GitHub


HyukjinKwon commented on code in PR #45410:
URL: https://github.com/apache/spark/pull/45410#discussion_r1515631182


##
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JDBCRDD.scala:
##
@@ -153,12 +153,12 @@ object JDBCRDD extends Logging {
  */
 class JDBCRDD(

Review Comment:
   This class isn't actually an API. Even if we change this for API usage, we can't just change the signature; otherwise, it will break binary compatibility.






Re: [PR] [SPARK-47315][SQL][TEST] Clean up tempView for `createTempView` UT [spark]

2024-03-06 Thread via GitHub


HyukjinKwon commented on code in PR #45417:
URL: https://github.com/apache/spark/pull/45417#discussion_r1515629669


##
sql/core/src/test/scala/org/apache/spark/sql/DatasetSuite.scala:
##
@@ -1420,18 +1420,20 @@ class DatasetSuite extends QueryTest
 dataset.sparkSession.catalog.dropTempView("tempView")
 
 withDatabase("test_db") {

Review Comment:
   We're cleaning the temp db up so we're good?






Re: [PR] [SPARK-47265][SQL][TESTS] Replace `createTable(..., schema: StructType, ...)` with `createTable(..., columns: Array[Column], ...)` in UT [spark]

2024-03-06 Thread via GitHub


panbingkun commented on code in PR #45368:
URL: https://github.com/apache/spark/pull/45368#discussion_r1515628066


##
sql/catalyst/src/test/scala/org/apache/spark/sql/connector/catalog/InMemoryTableCatalog.scala:
##
@@ -84,28 +85,28 @@ class BasicInMemoryTableCatalog extends TableCatalog {
 invalidatedTables.add(ident)
   }
 
-  // TODO: remove it when no tests calling this deprecated method.
+  // TODO: remove it when the deprecated method `createTable(..., StructType, 
...)`

Review Comment:
   I'm actually a bit hesitant: if we eventually decide to completely remove this in a future version, keeping it now may be a good reminder. I'd like to hear everyone's opinions. @cloud-fan






Re: [PR] [SPARK-47265][SQL][TESTS] Replace `createTable(..., schema: StructType, ...)` with `createTable(..., columns: Array[Column], ...)` in UT [spark]

2024-03-06 Thread via GitHub


LuciferYang commented on code in PR #45368:
URL: https://github.com/apache/spark/pull/45368#discussion_r1515621179


##
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/V2SessionCatalog.scala:
##
@@ -249,6 +241,16 @@ class V2SessionCatalog(catalog: SessionCatalog)
 null // Return null to save the `loadTable` call for CREATE TABLE without 
AS SELECT.
   }
 
+  // TODO: remove it when the deprecated method `createTable(..., StructType, 
...)`

Review Comment:
   ditto






Re: [PR] [SPARK-47265][SQL][TESTS] Replace `createTable(..., schema: StructType, ...)` with `createTable(..., columns: Array[Column], ...)` in UT [spark]

2024-03-06 Thread via GitHub


LuciferYang commented on code in PR #45368:
URL: https://github.com/apache/spark/pull/45368#discussion_r1515620693


##
sql/catalyst/src/test/scala/org/apache/spark/sql/connector/catalog/InMemoryTableCatalog.scala:
##
@@ -84,28 +85,28 @@ class BasicInMemoryTableCatalog extends TableCatalog {
 invalidatedTables.add(ident)
   }
 
-  // TODO: remove it when no tests calling this deprecated method.
+  // TODO: remove it when the deprecated method `createTable(..., StructType, 
...)`

Review Comment:
   Should we remove this `TODO`? Isn't the current solution the final one?
   
   






Re: [PR] [SPARK-47300][SQL] `quoteIfNeeded` should quote identifier starts with digits [spark]

2024-03-06 Thread via GitHub


dongjoon-hyun commented on PR #45401:
URL: https://github.com/apache/spark/pull/45401#issuecomment-1982580157

   It seems that the same failure (which @MaxGekk reported) still exists at the last CI.
   ```
   StringUtilsSuite:
   [info] - escapeLikeRegex (1 millisecond)
   [info] - filter pattern (1 millisecond)
   [info] - string concatenation (1 millisecond)
   [info] - string concatenation with limit (1 millisecond)
   [info] - string concatenation return value (1 millisecond)
   [info] - SPARK-31916: StringConcat doesn't overflow on many inputs (190 milliseconds)
   [info] - SPARK-31916: verify that PlanStringConcat's output shows the actual length of the plan (3 milliseconds)
   [info] - SPARK-34872: quoteIfNeeded should quote a string which contains non-word characters *** FAILED *** (2 milliseconds)
   [info]   "[`1a`]" did not equal "[1a]" (StringUtilsSuite.scala:136)
   [info]   Analysis:
   [info]   "[`1a`]" -> "[1a]"
   ```





Re: [PR] [SPARK-47210][SQL][COLLATION][WIP] Implicit casting on collated expressions [spark]

2024-03-06 Thread via GitHub


uros-db commented on code in PR #45383:
URL: https://github.com/apache/spark/pull/45383#discussion_r1515617343


##
common/utils/src/main/resources/error/error-classes.json:
##
@@ -475,6 +475,24 @@
 ],
 "sqlState" : "42704"
   },
+  "COLLATION_MISMATCH" : {
+"message" : [
+  "Could not determine which collation to use for string comparison."
+],
+"subClass" : {
+  "EXPLICIT" : {
+"message" : [
+  "Error occurred due to the mismatch between explicit collations: 
"

Review Comment:
   Should we add an instruction for the user here? For example: `Please use the same collation for both strings.`






Re: [PR] [SPARK-47265][SQL][TESTS] Replace `createTable(..., schema: StructType, ...)` with `createTable(..., columns: Array[Column], ...)` in UT [spark]

2024-03-06 Thread via GitHub


panbingkun commented on code in PR #45368:
URL: https://github.com/apache/spark/pull/45368#discussion_r1515616923


##
sql/catalyst/src/test/scala/org/apache/spark/sql/connector/catalog/InMemoryTableCatalog.scala:
##
@@ -84,28 +84,29 @@ class BasicInMemoryTableCatalog extends TableCatalog {
 invalidatedTables.add(ident)
   }
 
-  // TODO: remove it when no tests calling this deprecated method.
+  // TODO: remove it when the deprecated method `createTable(..., StructType, 
...)`

Review Comment:
   Done






Re: [PR] [SPARK-47210][SQL][COLLATION][WIP] Implicit casting on collated expressions [spark]

2024-03-06 Thread via GitHub


uros-db commented on code in PR #45383:
URL: https://github.com/apache/spark/pull/45383#discussion_r1515613758


##
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/stringExpressions.scala:
##
@@ -509,18 +509,10 @@ abstract class StringPredicate extends BinaryExpression
   return checkResult
 }
 // Additional check needed for collation compatibility
-    val rightCollationId: Int = right.dataType.asInstanceOf[StringType].collationId
-    if (collationId != rightCollationId) {
-      DataTypeMismatch(
-        errorSubClass = "COLLATION_MISMATCH",
-        messageParameters = Map(
-          "collationNameLeft" -> CollationFactory.fetchCollation(collationId).collationName,
-          "collationNameRight" -> CollationFactory.fetchCollation(rightCollationId).collationName
-        )
-      )
-    } else {
-      TypeCheckResult.TypeCheckSuccess
-    }
+    val outputCollationId: Int = TypeCoercion

Review Comment:
   Also note that in
   `final lazy val collationId: Int = left.dataType.asInstanceOf[StringType].collationId`
   `left` is chosen arbitrarily, and shouldn't be used to check whether a function supports that collation type _before_ checking whether `right` has the same collation.






Re: [PR] [SPARK-47210][SQL][COLLATION][WIP] Implicit casting on collated expressions [spark]

2024-03-06 Thread via GitHub


uros-db commented on code in PR #45383:
URL: https://github.com/apache/spark/pull/45383#discussion_r1515609105


##
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/stringExpressions.scala:
##
@@ -509,18 +509,10 @@ abstract class StringPredicate extends BinaryExpression
   return checkResult
 }
 // Additional check needed for collation compatibility
-    val rightCollationId: Int = right.dataType.asInstanceOf[StringType].collationId
-    if (collationId != rightCollationId) {
-      DataTypeMismatch(
-        errorSubClass = "COLLATION_MISMATCH",
-        messageParameters = Map(
-          "collationNameLeft" -> CollationFactory.fetchCollation(collationId).collationName,
-          "collationNameRight" -> CollationFactory.fetchCollation(rightCollationId).collationName
-        )
-      )
-    } else {
-      TypeCheckResult.TypeCheckSuccess
-    }
+    val outputCollationId: Int = TypeCoercion

Review Comment:
   I'd say `COLLATION_MISMATCH` first, `UNSUPPORTED_COLLATION.FOR_FUNCTION` second. If the user specifies COLLATION_1 for `left` and COLLATION_2 for `right`, how would we know which one to use when checking whether the function supports this type of collation? (In this case, suppose a function supports COLLATION_1 but not COLLATION_2: does the `UNSUPPORTED_COLLATION.FOR_FUNCTION` check pass or fail?)
   
   Hence, I think we would first need to establish that COLLATION_1 and COLLATION_2 are the same (no `COLLATION_MISMATCH`), before checking whether the function supports the requested collation (no `UNSUPPORTED_COLLATION.FOR_FUNCTION`).
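
   To pin down the proposed ordering, a minimal standalone sketch; the type and helper names here are hypothetical, not Spark's internals:
   ```
   sealed trait CollationCheck
   case object Success extends CollationCheck
   final case class Mismatch(leftId: Int, rightId: Int) extends CollationCheck
   final case class UnsupportedForFunction(id: Int) extends CollationCheck

   // Report COLLATION_MISMATCH first; only once both sides agree does it
   // make sense to ask whether the function supports that collation.
   def checkCollations(leftId: Int, rightId: Int, supported: Int => Boolean): CollationCheck =
     if (leftId != rightId) Mismatch(leftId, rightId)
     else if (!supported(leftId)) UnsupportedForFunction(leftId)
     else Success
   ```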






Re: [PR] [SPARK-47302][SQL][Collation] Collate key word as identifier [spark]

2024-03-06 Thread via GitHub


uros-db commented on code in PR #45405:
URL: https://github.com/apache/spark/pull/45405#discussion_r1515603380


##
sql/api/src/main/antlr4/org/apache/spark/sql/catalyst/parser/SqlBaseParser.g4:
##
@@ -1096,7 +1096,7 @@ colPosition
 ;
 
 collateClause
-: COLLATE collationName=stringLit
+: COLLATE collationName=multipartIdentifier

Review Comment:
   @cloud-fan related to your comment, I'm just wondering what would be a 
better rule for this (if any)?






Re: [PR] [SPARK-44259][CONNECT][TESTS] Make `connect-client-jvm` pass on Java 21 except `RemoteSparkSession`-based tests [spark]

2024-03-06 Thread via GitHub


LuciferYang commented on PR #41805:
URL: https://github.com/apache/spark/pull/41805#issuecomment-1982513050

   > Ya, @LuciferYang is right.
   > 
   > To @Midhunpottammal , you need 
[SPARK-43831](https://issues.apache.org/jira/browse/SPARK-43831) for Java 21 
support.
   
   @Midhunpottammal As @dongjoon-hyun said, all the relevant patches in [SPARK-43831](https://issues.apache.org/jira/browse/SPARK-43831) are needed for Java 21 support, so this is not a job that can be accomplished with minor changes on Spark 3.5.





Re: [PR] [SPARK-47305][SQL] Fix PruneFilters to tag the isStreaming flag of LocalRelation correctly when the plan has both batch and streaming [spark]

2024-03-06 Thread via GitHub


HeartSaVioR closed pull request #45406: [SPARK-47305][SQL] Fix PruneFilters to tag the isStreaming flag of LocalRelation correctly when the plan has both batch and streaming
URL: https://github.com/apache/spark/pull/45406





Re: [PR] [SPARK-44259][CONNECT][TESTS] Make `connect-client-jvm` pass on Java 21 except `RemoteSparkSession`-based tests [spark]

2024-03-06 Thread via GitHub


Midhunpottammal commented on PR #41805:
URL: https://github.com/apache/spark/pull/41805#issuecomment-1982489254

   @dongjoon-hyun @LuciferYang Thank you.
   
   I experimented with different versions of Java, Spark, and Arrow. I managed to get Arrow working with a lower version of Java in Spark 3.5.0. Here's my stack:
   
   > pyarrow==15.0.0
   > pyspark==3.5.0
   > java == Java(TM) SE Runtime Environment (build 17.0.10+11-LTS-240)
   
   When I try to move to Java version 21, I encounter the same error.





Re: [PR] [SPARK-47302][SQL][Collation] Collate key word as identifier [spark]

2024-03-06 Thread via GitHub


uros-db commented on code in PR #45405:
URL: https://github.com/apache/spark/pull/45405#discussion_r1515587070


##
python/pyspark/sql/tests/test_types.py:
##
@@ -862,15 +862,13 @@ def test_parse_datatype_string(self):
         if k != "varchar" and k != "char":
             self.assertEqual(t(), _parse_datatype_string(k))
         self.assertEqual(IntegerType(), _parse_datatype_string("int"))
-        self.assertEqual(StringType(), _parse_datatype_string("string COLLATE 'UCS_BASIC'"))
+        self.assertEqual(StringType(), _parse_datatype_string("string COLLATE UCS_BASIC"))
         self.assertEqual(StringType(0), _parse_datatype_string("string"))
-        self.assertEqual(StringType(0), _parse_datatype_string("string COLLATE 'UCS_BASIC'"))
-        self.assertEqual(StringType(0), _parse_datatype_string("string   COLLATE 'UCS_BASIC'"))
-        self.assertEqual(StringType(0), _parse_datatype_string("string COLLATE'UCS_BASIC'"))
-        self.assertEqual(StringType(1), _parse_datatype_string("string COLLATE 'UCS_BASIC_LCASE'"))
-        self.assertEqual(StringType(1), _parse_datatype_string("string COLLATE 'UCS_BASIC_LCASE'"))
-        self.assertEqual(StringType(2), _parse_datatype_string("string COLLATE 'UNICODE'"))
-        self.assertEqual(StringType(3), _parse_datatype_string("string COLLATE 'UNICODE_CI'"))
+        self.assertEqual(StringType(0), _parse_datatype_string("string COLLATE UCS_BASIC"))

Review Comment:
   Perhaps that would be best as a separate change? This one already seems scattered enough, and I think there are plenty of other places that may require changes with respect to naming.






Re: [PR] [SPARK-45827] Move data type checks to CreatableRelationProvider [spark]

2024-03-06 Thread via GitHub


cloud-fan commented on code in PR #45409:
URL: https://github.com/apache/spark/pull/45409#discussion_r1515582120


##
sql/core/src/main/scala/org/apache/spark/sql/sources/interfaces.scala:
##
@@ -175,6 +175,25 @@ trait CreatableRelationProvider {
   mode: SaveMode,
   parameters: Map[String, String],
   data: DataFrame): BaseRelation
+
+  /**
+   * Check if the relation supports the given data type.
+   *
+   * @param dt Data type to check
+   * @return True if the data type is supported

Review Comment:
   let's add `@since 4.0.0`






Re: [PR] [SPARK-47305][SQL] Fix PruneFilters to tag the isStreaming flag of LocalRelation correctly when the plan has both batch and streaming [spark]

2024-03-06 Thread via GitHub


HeartSaVioR commented on PR #45406:
URL: https://github.com/apache/spark/pull/45406#issuecomment-1982445057

   Thanks! Merging to master/3.5/3.4.





Re: [PR] [SPARK-47248][SQL][COLLATION] Improved string function support: contains [spark]

2024-03-06 Thread via GitHub


uros-db commented on code in PR #45382:
URL: https://github.com/apache/spark/pull/45382#discussion_r1515579768


##
common/unsafe/src/main/java/org/apache/spark/sql/catalyst/util/CollationFactory.java:
##
@@ -145,6 +149,18 @@ public Collation(
 }
   }
 
+  /**
+   * Auxiliary methods for collation aware string operations.
+   */
+
+  public static StringSearch getStringSearch(final UTF8String left, final UTF8String right,
+      final int collationId) {

Review Comment:
   Definitely, that's in [SPARK-47295](https://issues.apache.org/jira/browse/SPARK-47295) and will serve as an onboarding task for a new engineer (of course, I will help them out with this).
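
   For context, a hedged usage sketch of ICU's `StringSearch`, the class the new helper returns; the wrapper name is illustrative, not CollationFactory's API:
   ```
   import java.text.StringCharacterIterator
   import com.ibm.icu.text.{RuleBasedCollator, SearchIterator, StringSearch}

   // Collation-aware "contains": look for the first match of `pattern` in
   // `target` under the given collator; SearchIterator.DONE means no match.
   def containsAware(target: String, pattern: String, collator: RuleBasedCollator): Boolean = {
     val search = new StringSearch(pattern, new StringCharacterIterator(target), collator)
     search.first() != SearchIterator.DONE
   }
   ```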






[PR] [MINOR][TEST] Clean up tempView for `createTempView` UT [spark]

2024-03-06 Thread via GitHub


wForget opened a new pull request, #45417:
URL: https://github.com/apache/spark/pull/45417

   
   
   ### What changes were proposed in this pull request?
   
   Clean up `tempView` for `createTempView` UT
   
   ### Why are the changes needed?
   
   
   `tempView` created in `createTempView` UT is not cleaned up.
   
   ### Does this PR introduce _any_ user-facing change?
   
   No
   
   ### How was this patch tested?
   
   No need.
   
   ### Was this patch authored or co-authored using generative AI tooling?
   
   No





Re: [PR] [SPARK-46743][SQL] Count bug after constant folding [spark]

2024-03-06 Thread via GitHub


cloud-fan commented on code in PR #45125:
URL: https://github.com/apache/spark/pull/45125#discussion_r1515561516


##
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/RewriteWithExpression.scala:
##
@@ -34,7 +34,7 @@ import org.apache.spark.sql.catalyst.trees.TreePattern.{COMMON_EXPR_REF, WITH_EX
  */
 object RewriteWithExpression extends Rule[LogicalPlan] {
   override def apply(plan: LogicalPlan): LogicalPlan = {
-    plan.transformWithPruning(_.containsPattern(WITH_EXPRESSION)) {
+    plan.transformDownWithSubqueriesAndPruning(_.containsPattern(WITH_EXPRESSION)) {

Review Comment:
   Then this rule becomes O(n^2) in complexity, as every level of subqueries runs this rule for all subqueries under its level.
   
   Thinking about it more, why do we handle the count bug in two places that are far apart?






Re: [PR] [SPARK-43124][SQL] Add ConvertCommandResultToLocalRelation rule [spark]

2024-03-06 Thread via GitHub


cloud-fan commented on code in PR #45397:
URL: https://github.com/apache/spark/pull/45397#discussion_r1515557219


##
sql/core/src/main/scala/org/apache/spark/sql/catalyst/optimizer/ConvertCommandResultToLocalRelation.scala:
##
@@ -0,0 +1,52 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.catalyst.optimizer
+
+import org.apache.spark.sql.catalyst.expressions.{AttributeReference, Expression, IntegerLiteral, InterpretedMutableProjection, Predicate, Unevaluable}
+import org.apache.spark.sql.catalyst.plans.logical.{CommandResult, Filter, Limit, LocalRelation, LogicalPlan, Project}
+import org.apache.spark.sql.catalyst.rules.Rule
+import org.apache.spark.sql.catalyst.trees.TreePattern.COMMAND_RESULT
+
+/**
+ * Converts local operations (i.e. ones that don't require data exchange) on `CommandResult`
+ * to `LocalRelation`.
+ */
+object ConvertCommandResultToLocalRelation extends Rule[LogicalPlan] {
+
+  override def apply(plan: LogicalPlan): LogicalPlan = plan.transformWithPruning(
+    _.containsPattern(COMMAND_RESULT)) {
+    case Project(projectList, CommandResult(output, _, _, rows))
+        if !projectList.exists(hasUnevaluableExpr) =>
+      val projection = new InterpretedMutableProjection(projectList, output)
+      projection.initialize(0)
+      LocalRelation(projectList.map(_.toAttribute), rows.map(projection(_).copy()))
+
+    case Limit(IntegerLiteral(limit), CommandResult(output, _, _, rows)) =>
+      LocalRelation(output, rows.take(limit))
+
+    case Filter(condition, CommandResult(output, _, _, rows))

Review Comment:
   By looking at this rule, I'm on the fence now. The original goal of `CommandResult` was better UI support: https://github.com/apache/spark/commit/8013f985a4d07a948b0c22638314162819bfb2be
   
   E.g., if you do `sql("show tables").filter(...)`, we do want to see a command result node under a filter node in the UI, even if it means extra jobs.
   
   I think certain DataFrame operations, such as `df.show()` and `df.isEmpty`, should just be exceptions. Each looks like a single operation to users, and we should not run extra jobs for it. But this should not be generalized to all operations on `CommandResult`.
   
   cc @HyukjinKwon
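
   To make the trade-off concrete, a sketch of what the proposed rule folds, assuming an active `SparkSession` named `spark`; the comments describe the rule in this PR's diff, not shipped behavior:
   ```
   // A command's rows are already materialized inside the CommandResult
   // node, so a local operation on top can be answered without a new job.
   val df = spark.sql("SHOW TABLES")  // plan root: CommandResult
   df.limit(1).collect()              // Limit(CommandResult) => LocalRelation(rows.take(1))
   df.isEmpty                         // one of the cases that should avoid extra jobs
   ```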






Re: [PR] [SPARK-43124][SQL] Add ConvertCommandResultToLocalRelation rule [spark]

2024-03-06 Thread via GitHub


ulysses-you commented on code in PR #45397:
URL: https://github.com/apache/spark/pull/45397#discussion_r1515554208


##
sql/core/src/main/scala/org/apache/spark/sql/catalyst/optimizer/ConvertCommandResultToLocalRelation.scala:
##
@@ -0,0 +1,52 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.catalyst.optimizer
+
+import org.apache.spark.sql.catalyst.expressions.{AttributeReference, Expression, IntegerLiteral, InterpretedMutableProjection, Predicate, Unevaluable}
+import org.apache.spark.sql.catalyst.plans.logical.{CommandResult, Filter, Limit, LocalRelation, LogicalPlan, Project}
+import org.apache.spark.sql.catalyst.rules.Rule
+import org.apache.spark.sql.catalyst.trees.TreePattern.COMMAND_RESULT
+
+/**
+ * Converts local operations (i.e. ones that don't require data exchange) on `CommandResult`
+ * to `LocalRelation`.
+ */
+object ConvertCommandResultToLocalRelation extends Rule[LogicalPlan] {

Review Comment:
   Probably, we can add a new trait `LocalRelationConverable` and have `LocalRelation` and `CommandResult` inherit it.
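
   A sketch of the suggested refactoring, with the trait name as proposed in the comment and a hypothetical method shape:
   ```
   // Both LocalRelation and CommandResult would expose their materialized
   // rows through one interface, so rules can match on the trait instead
   // of on the two concrete nodes. Member names here are illustrative.
   trait LocalRelationConverable { self: LogicalPlan =>
     def output: Seq[Attribute]
     def rows: Seq[InternalRow]
     def toLocalRelation: LocalRelation = LocalRelation(output, rows)
   }
   ```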






Re: [PR] [SPARK-43124][SQL] Add ConvertCommandResultToLocalRelation rule [spark]

2024-03-06 Thread via GitHub


cloud-fan commented on code in PR #45397:
URL: https://github.com/apache/spark/pull/45397#discussion_r1515551231


##
sql/core/src/main/scala/org/apache/spark/sql/catalyst/optimizer/ConvertCommandResultToLocalRelation.scala:
##
@@ -0,0 +1,52 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.catalyst.optimizer
+
+import org.apache.spark.sql.catalyst.expressions.{AttributeReference, Expression, IntegerLiteral, InterpretedMutableProjection, Predicate, Unevaluable}
+import org.apache.spark.sql.catalyst.plans.logical.{CommandResult, Filter, Limit, LocalRelation, LogicalPlan, Project}
+import org.apache.spark.sql.catalyst.rules.Rule
+import org.apache.spark.sql.catalyst.trees.TreePattern.COMMAND_RESULT
+
+/**
+ * Converts local operations (i.e. ones that don't require data exchange) on `CommandResult`
+ * to `LocalRelation`.
+ */
+object ConvertCommandResultToLocalRelation extends Rule[LogicalPlan] {

Review Comment:
   oh I see!






Re: [PR] [SPARK-45827] Move data type checks to CreatableRelationProvider [spark]

2024-03-06 Thread via GitHub


cashmand commented on PR #45409:
URL: https://github.com/apache/spark/pull/45409#issuecomment-1982352373

   @cloud-fan Thanks for the review! I've updated with your feedback.





Re: [PR] [SPARK-47238][SQL] Reduce executor memory usage by making generated code in WSCG a broadcast variable [spark]

2024-03-06 Thread via GitHub


cloud-fan closed pull request #45348: [SPARK-47238][SQL] Reduce executor memory usage by making generated code in WSCG a broadcast variable
URL: https://github.com/apache/spark/pull/45348





Re: [PR] [SPARK-47238][SQL] Reduce executor memory usage by making generated code in WSCG a broadcast variable [spark]

2024-03-06 Thread via GitHub


cloud-fan commented on PR #45348:
URL: https://github.com/apache/spark/pull/45348#issuecomment-1982346069

   thanks, merging to master!





Re: [PR] [SPARK-45827] Move data type checks to CreatableRelationProvider [spark]

2024-03-06 Thread via GitHub


cashmand commented on code in PR #45409:
URL: https://github.com/apache/spark/pull/45409#discussion_r1515515440


##
sql/core/src/main/scala/org/apache/spark/sql/sources/interfaces.scala:
##
@@ -175,6 +175,32 @@ trait CreatableRelationProvider {
   mode: SaveMode,
   parameters: Map[String, String],
   data: DataFrame): BaseRelation
+
+  /**
+   * Check if the relation supports the given data type.
+   *
+   * @param dt Data type to check
+   * @return True if the data type is supported
+   */
+  def supportsDataType(
+  dt: DataType
+  ): Boolean = {
+    dt match {
+      case ArrayType(e, _) => supportsDataType(e)
+      case MapType(k, v, _) =>
+        supportsDataType(k) && supportsDataType(v)
+      case StructType(fields) => fields.forall(f => supportsDataType(f.dataType))
+      case udt: UserDefinedType[_] => supportsDataType(udt.sqlType)
+      case _: AnsiIntervalType | CalendarIntervalType | VariantType => false
+      case BinaryType | BooleanType | ByteType | CalendarIntervalType | CharType(_) | DateType |
+           DayTimeIntervalType(_, _) | _: DecimalType | DoubleType | FloatType |

Review Comment:
   Oh, that was an accident. I'll remove it from this list, and have the match list only the supported types.
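
   To illustrate how a source could use the new hook, a hypothetical provider; `MyFormatProvider` and its write logic are placeholders, not part of this PR:
   ```
   import org.apache.spark.sql.{DataFrame, SQLContext, SaveMode}
   import org.apache.spark.sql.sources.{BaseRelation, CreatableRelationProvider}
   import org.apache.spark.sql.types.{BinaryType, DataType}

   class MyFormatProvider extends CreatableRelationProvider {
     override def createRelation(
         sqlContext: SQLContext,
         mode: SaveMode,
         parameters: Map[String, String],
         data: DataFrame): BaseRelation = ???  // write path elided

     // Tighten the default: a format with no binary encoding can reject
     // BinaryType up front, failing at analysis time rather than mid-write.
     override def supportsDataType(dt: DataType): Boolean = dt match {
       case BinaryType => false
       case _ => super.supportsDataType(dt)
     }
   }
   ```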






Re: [PR] [SPARK-47007][SQL][PYTHON][R][CONNECT] MapSort function [spark]

2024-03-06 Thread via GitHub


zhengruifeng commented on PR #45069:
URL: https://github.com/apache/spark/pull/45069#issuecomment-1982325384

   Updated the title, since it also touches python/r/connect.





Re: [PR] [SPARK-46812][CONNECT][PYTHON] Make mapInPandas / mapInArrow support ResourceProfile [spark]

2024-03-06 Thread via GitHub


wbo4958 commented on PR #45232:
URL: https://github.com/apache/spark/pull/45232#issuecomment-1982322729

   > > Does this PR introduce any user-facing change?
   > > Yes, users can pass ResourceProfile to mapInPandas/mapInArrow through the connect pyspark client.
   > 
   > I think you are adding the ResourceProfile api to spark connect for anything to use, correct? I guess since Spark Connect doesn't have SparkContext support its only usable by these apis? It would be nice to have more info in the description about what you are adding and if they match the current python api's exactly?
   
   No ResourceProfile APIs are changed or added; this PR only changes ResourceProfile internally to support Spark Connect. See [here](https://github.com/apache/spark/pull/45232/files/30531aeb7b6eb5f2235f1dabf7b568f3cf49d841#diff-245bd4d0d3aea466e32424806642d225ffa4f07e75ef1d45f7fd7a68fbb3e4c3R118-R141) and [here](https://github.com/apache/spark/pull/45232/files/30531aeb7b6eb5f2235f1dabf7b568f3cf49d841#diff-e2439904a861b6dd69efec067e87ff43070e3b673911ab286aca382897954c2bR167-R469).
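
   For reference, a hedged sketch of the client-side flow this enables; the `profile` parameter follows this PR's description, and `df` is assumed to be an existing DataFrame:
   ```
   from pyspark.resource import ResourceProfileBuilder, TaskResourceRequests

   # Build a ResourceProfile on the Connect client; per this PR, the
   # profile id is resolved on the server side, not at construction time.
   rp = ResourceProfileBuilder().require(TaskResourceRequests().cpus(2)).build

   def transform(batches):
       for batch in batches:  # each batch is a pandas.DataFrame
           yield batch

   df.mapInPandas(transform, df.schema, profile=rp)
   ```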





Re: [PR] [SPARK-46812][CONNECT][PYTHON] Make mapInPandas / mapInArrow support ResourceProfile [spark]

2024-03-06 Thread via GitHub


wbo4958 commented on code in PR #45232:
URL: https://github.com/apache/spark/pull/45232#discussion_r1515490084


##
dev/sparktestsupport/modules.py:
##
@@ -554,6 +554,7 @@ def __hash__(self):
 "pyspark.resource.profile",
 # unittests
 "pyspark.resource.tests.test_resources",
+"pyspark.resource.tests.test_connect_resources",

Review Comment:
   This pull request already includes [pyspark.sql.tests.connect.test_resources](https://github.com/apache/spark/pull/45232/files/30531aeb7b6eb5f2235f1dabf7b568f3cf49d841#diff-1ebc67a99155114aa6bced912ce7942956ce784eeb59aadab75b82814684bd27R1031) to test the general mapInPandas/mapInArrow functionality with ResourceProfile. On the other hand, `pyspark.resource.tests.test_connect_resources` is specifically for testing special cases, like creating a ResourceProfile before establishing a remote session. Therefore, it seems appropriate to keep the tests in their respective locations.






Re: [PR] [SPARK-46812][CONNECT][PYTHON] Make mapInPandas / mapInArrow support ResourceProfile [spark]

2024-03-06 Thread via GitHub


wbo4958 commented on code in PR #45232:
URL: https://github.com/apache/spark/pull/45232#discussion_r1515485716


##
python/pyspark/sql/connect/resource/profile.py:
##
@@ -0,0 +1,69 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+from typing import Optional, Dict
+
+from pyspark.resource import ExecutorResourceRequest, TaskResourceRequest
+
+import pyspark.sql.connect.proto as pb2
+
+
+class ResourceProfile:

Review Comment:
   This ResourceProfile is not user-facing; it is primarily used internally to 
create the resource profile on the server side and retrieve the associated 
resource profile ID.
   
   
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-47311][SQL][PYTHON] Suppress Python exceptions where PySpark is not in the Python path [spark]

2024-03-06 Thread via GitHub


HyukjinKwon closed pull request #45414: [SPARK-47311][SQL][PYTHON] Suppress 
Python exceptions where PySpark is not in the Python path
URL: https://github.com/apache/spark/pull/45414


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-47311][SQL][PYTHON] Suppress Python exceptions where PySpark is not in the Python path [spark]

2024-03-06 Thread via GitHub


HyukjinKwon commented on PR #45414:
URL: https://github.com/apache/spark/pull/45414#issuecomment-1982313126

   Merged to master.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-47276][PYTHON][CONNECT] Introduce `spark.profile.clear` for SparkSession-based profiling [spark]

2024-03-06 Thread via GitHub


zhengruifeng commented on code in PR #45378:
URL: https://github.com/apache/spark/pull/45378#discussion_r1515475836


##
python/pyspark/sql/tests/test_session.py:
##
@@ -531,6 +531,33 @@ def test_dump_invalid_type(self):
 },
 )
 
+def test_clear_memory_type(self):

Review Comment:
   nit: it seems we don't have a parity test for `test_session`. Does it make 
sense to move `SparkSessionProfileTests` out of `test_session` and add a parity 
test for it?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-46988][CONNECT][TESTS] Add tests for `map` type in `ProtoUtils.abbreviate` [spark]

2024-03-06 Thread via GitHub


zhengruifeng commented on PR #45413:
URL: https://github.com/apache/spark/pull/45413#issuecomment-1982294025

   merged to master


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-46988][CONNECT][TESTS] Add tests for `map` type in `ProtoUtils.abbreviate` [spark]

2024-03-06 Thread via GitHub


zhengruifeng closed pull request #45413: [SPARK-46988][CONNECT][TESTS] Add 
tests for `map` type in `ProtoUtils.abbreviate`
URL: https://github.com/apache/spark/pull/45413


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-46812][CONNECT][PYTHON] Make mapInPandas / mapInArrow support ResourceProfile [spark]

2024-03-06 Thread via GitHub


zhengruifeng commented on code in PR #45232:
URL: https://github.com/apache/spark/pull/45232#discussion_r1515455169


##
python/pyspark/sql/connect/resource/profile.py:
##
@@ -0,0 +1,69 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+from typing import Optional, Dict
+
+from pyspark.resource import ExecutorResourceRequest, TaskResourceRequest
+
+import pyspark.sql.connect.proto as pb2
+
+
+class ResourceProfile:

Review Comment:
   is this class user-facing?
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-47314][DOC] Correct the `ExternalSorter#writePartitionedMapOutput` method comment [spark]

2024-03-06 Thread via GitHub


zwangsheng commented on PR #45415:
URL: https://github.com/apache/spark/pull/45415#issuecomment-1982290244

   Hi @mridulm @yaooqinn @mccheah, this PR aims to correct the comment; please 
take a look, thanks.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-46812][CONNECT][PYTHON] Make mapInPandas / mapInArrow support ResourceProfile [spark]

2024-03-06 Thread via GitHub


zhengruifeng commented on code in PR #45232:
URL: https://github.com/apache/spark/pull/45232#discussion_r1515448887


##
dev/sparktestsupport/modules.py:
##
@@ -554,6 +554,7 @@ def __hash__(self):
 "pyspark.resource.profile",
 # unittests
 "pyspark.resource.tests.test_resources",
+"pyspark.resource.tests.test_connect_resources",

Review Comment:
   this test is for Spark Connect; I think we should move it to the 
`pyspark_connect` module?
   
   or maybe we can move the test cases in it to 
`pyspark.sql.tests.connect.test_resources`?





-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[PR] [WIP] Align GitHub workflow run Ubuntu version [spark]

2024-03-06 Thread via GitHub


panbingkun opened a new pull request, #45416:
URL: https://github.com/apache/spark/pull/45416

   
   
   ### What changes were proposed in this pull request?
   
   
   
   ### Why are the changes needed?
   
   
   
   ### Does this PR introduce _any_ user-facing change?
   
   
   
   ### How was this patch tested?
   
   
   
   ### Was this patch authored or co-authored using generative AI tooling?
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-45827] Move data type checks to CreatableRelationProvider [spark]

2024-03-06 Thread via GitHub


cloud-fan commented on code in PR #45409:
URL: https://github.com/apache/spark/pull/45409#discussion_r1515426129


##
sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/SaveIntoDataSourceCommandSuite.scala:
##
@@ -71,6 +71,39 @@ class SaveIntoDataSourceCommandSuite extends QueryTest with 
SharedSparkSession {
 
 FakeV1DataSource.data = null
   }
+
+  test("Data type support") {
+
+val dataSource = DataSource(
+  sparkSession = spark,
+  className = "jdbc",
+  partitionColumns = Nil,
+  options = Map())
+
+val df = spark.range(1).selectExpr(
+"cast('a' as binary) a", "true b", "cast(1 as byte) c", "1.23 d")
+dataSource.planForWriting(SaveMode.ErrorIfExists, df.logicalPlan)
+
+withSQLConf("spark.databricks.variant.enabled" -> "true") {
+  // Variant and Interval types are disallowed by default.
+  val unsupportedTypes = Seq(
+  "parse_json('1') v",
+  "array(parse_json('1'))",
+  "struct(1, parse_json('1')) s",
+  "map(1, parse_json('1')) s",
+  "INTERVAL '1' MONTH i",
+  "make_ym_interval(1, 2) ym",
+  "make_dt_interval(1, 2, 3, 4) dt")
+
+  unsupportedTypes.foreach { expr =>
+val df = spark.range(1).selectExpr(expr)
+val e = intercept[AnalysisException] {
+  dataSource.planForWriting(SaveMode.ErrorIfExists, df.logicalPlan)
+}
+assert(e.getMessage.contains("UNSUPPORTED_DATA_TYPE_FOR_DATASOURCE"))

Review Comment:
   let's use `checkError` for it.
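   
   For illustration, a minimal sketch of the `checkError` shape (the error 
class is the one asserted above; the parameter names and values here are 
hypothetical and would need to match the actual error definition):
   
   ```
   unsupportedTypes.foreach { expr =>
     val df = spark.range(1).selectExpr(expr)
     checkError(
       exception = intercept[AnalysisException] {
         dataSource.planForWriting(SaveMode.ErrorIfExists, df.logicalPlan)
       },
       errorClass = "UNSUPPORTED_DATA_TYPE_FOR_DATASOURCE",
       // hypothetical parameters; check error-classes.json for the real ones
       parameters = Map("columnType" -> "\"VARIANT\"", "format" -> "jdbc"))
   }
   ```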



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-45827] Move data type checks to CreatableRelationProvider [spark]

2024-03-06 Thread via GitHub


cloud-fan commented on code in PR #45409:
URL: https://github.com/apache/spark/pull/45409#discussion_r1515423962


##
sql/core/src/main/scala/org/apache/spark/sql/sources/interfaces.scala:
##
@@ -175,6 +175,32 @@ trait CreatableRelationProvider {
   mode: SaveMode,
   parameters: Map[String, String],
   data: DataFrame): BaseRelation
+
+  /**
+   * Check if the relation supports the given data type.
+   *
+   * @param dt Data type to check
+   * @return True if the data type is supported
+   */
+  def supportsDataType(
+  dt: DataType
+  ): Boolean = {
+dt match {
+  case ArrayType(e, _) => supportsDataType(e)
+  case MapType(k, v, _) =>
+supportsDataType(k) && supportsDataType(v)
+  case StructType(fields) => fields.forall(f => 
supportsDataType(f.dataType))
+  case udt: UserDefinedType[_] => supportsDataType(udt.sqlType)
+  case _: AnsiIntervalType | CalendarIntervalType | VariantType => false
+  case BinaryType | BooleanType | ByteType | CalendarIntervalType | 
CharType(_) | DateType |
+   DayTimeIntervalType(_, _) | _ : DecimalType |  DoubleType | 
FloatType |

Review Comment:
   It's confusing to have the interval types in both the true and false case 
matches. Shall we stick with the allowlist approach and only list the supported 
types?
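   
   For concreteness, a sketch of the allowlist-only shape (the leaf-type list 
below is illustrative, not exhaustive; intervals and variant are simply absent 
and fall through to `false`):
   
   ```
   def supportsDataType(dt: DataType): Boolean = dt match {
     case ArrayType(e, _) => supportsDataType(e)
     case MapType(k, v, _) => supportsDataType(k) && supportsDataType(v)
     case StructType(fields) => fields.forall(f => supportsDataType(f.dataType))
     case udt: UserDefinedType[_] => supportsDataType(udt.sqlType)
     case BinaryType | BooleanType | ByteType | CharType(_) | DateType |
          _: DecimalType | DoubleType | FloatType | IntegerType | LongType |
          NullType | ShortType | StringType | TimestampType |
          TimestampNTZType | VarcharType(_) => true
     // anything not listed (ANSI intervals, calendar interval, variant, ...)
     case _ => false
   }
   ```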



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-45827] Move data type checks to CreatableRelationProvider [spark]

2024-03-06 Thread via GitHub


cloud-fan commented on code in PR #45409:
URL: https://github.com/apache/spark/pull/45409#discussion_r1515421569


##
sql/core/src/main/scala/org/apache/spark/sql/sources/interfaces.scala:
##
@@ -175,6 +175,32 @@ trait CreatableRelationProvider {
   mode: SaveMode,
   parameters: Map[String, String],
   data: DataFrame): BaseRelation
+
+  /**
+   * Check if the relation supports the given data type.
+   *
+   * @param dt Data type to check
+   * @return True if the data type is supported
+   */
+  def supportsDataType(
+  dt: DataType
+  ): Boolean = {
+dt match {
+  case ArrayType(e, _) => supportsDataType(e)
+  case MapType(k, v, _) =>
+supportsDataType(k) && supportsDataType(v)

Review Comment:
   ```suggestion
 case MapType(k, v, _) => supportsDataType(k) && supportsDataType(v)
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-45827] Move data type checks to CreatableRelationProvider [spark]

2024-03-06 Thread via GitHub


cloud-fan commented on code in PR #45409:
URL: https://github.com/apache/spark/pull/45409#discussion_r1515421350


##
sql/core/src/main/scala/org/apache/spark/sql/sources/interfaces.scala:
##
@@ -175,6 +175,32 @@ trait CreatableRelationProvider {
   mode: SaveMode,
   parameters: Map[String, String],
   data: DataFrame): BaseRelation
+
+  /**
+   * Check if the relation supports the given data type.
+   *
+   * @param dt Data type to check
+   * @return True if the data type is supported
+   */
+  def supportsDataType(
+  dt: DataType
+  ): Boolean = {

Review Comment:
   ```suggestion
 def supportsDataType(dt: DataType): Boolean = {
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[PR] [SPARK-47314][DOC] Correct the `ExternalSorter#writePartitionedMapOutput` method comment [spark]

2024-03-06 Thread via GitHub


zwangsheng opened a new pull request, #45415:
URL: https://github.com/apache/spark/pull/45415

   ### What changes were proposed in this pull request?
   Correct the comment of `ExternalSorter#writePartitionedMapOutput`.
   
   `ExternalSorter#writePartitionedMapOutput` returns nothing; it updates each 
partition's length when the partition pair writer is closed.
   
   ### Why are the changes needed?
   
   Correct comment.
   
   
   ### Does this PR introduce _any_ user-facing change?
   No; developers will only see this change in the source code.
   
   ### How was this patch tested?
   
   No need.
   
   ### Was this patch authored or co-authored using generative AI tooling?
   
   No
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-47238][SQL] Reduce executor memory usage by making generated code in WSCG a broadcast variable [spark]

2024-03-06 Thread via GitHub


jwang0306 commented on code in PR #45348:
URL: https://github.com/apache/spark/pull/45348#discussion_r1514792504


##
sql/core/src/test/scala/org/apache/spark/sql/execution/WholeStageCodegenSuite.scala:
##
@@ -899,4 +900,28 @@ class WholeStageCodegenSuite extends QueryTest with 
SharedSparkSession
   }
 }
   }
+
+  test("SPARK-47238: Test broadcast threshold for generated code") {
+// case 1: threshold is -1, shouldn't broadcast since smaller than 0 means 
disabled.
+// case 2: threshold is a larger number, shouldn't broadcast since not yet 
exceeded.
+// case 3: threshold is 0, should broadcast since it's always smaller than 
generated code size.
+Seq((-1, false), (10, false), (0, true)).foreach { case 
(threshold, shouldBroadcast) =>
+  withSQLConf(SQLConf.WHOLESTAGE_BROADCAST_CLEANED_SOURCE_THRESHOLD.key -> 
threshold.toString) {
+val df = Seq(0, 1, 2).toDF().groupBy("value").sum()
+// Invoke tryBroadcastCleanedSource and make sure it returns the 
desired variables.

Review Comment:
   Good idea; otherwise I was even considering logging something to validate 
the broadcast. However, due to the way `evaluatorFactory` and 
`cleanedSourceOpt` are declared, they are not directly accessible, so we 
would need to retrieve them using reflection (unless we'd rather directly 
change the visibility of the constructor fields).
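   
   For reference, a generic sketch of the reflection approach (test-only; the 
helper is hypothetical, and the field name must match the actual declaration):
   
   ```
   // Read a private field off an object via Java reflection in a test.
   def getPrivateField[T](obj: AnyRef, name: String): T = {
     val f = obj.getClass.getDeclaredField(name)
     f.setAccessible(true)  // bypass the private modifier
     f.get(obj).asInstanceOf[T]
   }
   
   // e.g. getPrivateField[Option[AnyRef]](wscgExec, "cleanedSourceOpt")
   ```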



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-47241][SQL] Fix rule order issues for ExtractGenerator [spark]

2024-03-06 Thread via GitHub


cloud-fan commented on code in PR #45350:
URL: https://github.com/apache/spark/pull/45350#discussion_r1515378900


##
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala:
##
@@ -2876,28 +2876,36 @@ class Analyzer(override val catalogManager: 
CatalogManager) extends RuleExecutor
   }
 }
 
+// We must wait until all expressions except for generator functions are 
resolved before
+// rewriting generator functions in Project/Aggregate. This is necessary 
to make this rule
+// stable for different execution orders of analyzer rules. See also 
SPARK-47241.
+private def canRewriteGenerator(namedExprs: Seq[NamedExpression]): Boolean 
= {
+  namedExprs.forall { ne =>
+ne.resolved || {
+  trimNonTopLevelAliases(ne) match {
+case AliasedGenerator(_, _, _) => true
+case _ => false
+  }
+}
+  }
+}
+
 def apply(plan: LogicalPlan): LogicalPlan = 
plan.resolveOperatorsUpWithPruning(
   _.containsPattern(GENERATOR), ruleId) {
   case Project(projectList, _) if projectList.exists(hasNestedGenerator) =>
 val nestedGenerator = projectList.find(hasNestedGenerator).get
 throw 
QueryCompilationErrors.nestedGeneratorError(trimAlias(nestedGenerator))
 
-  case Project(projectList, _) if projectList.count(hasGenerator) > 1 =>
-val generators = projectList.filter(hasGenerator).map(trimAlias)
-throw QueryCompilationErrors.moreThanOneGeneratorError(generators, 
"SELECT")
-
   case Aggregate(_, aggList, _) if aggList.exists(hasNestedGenerator) =>
 val nestedGenerator = aggList.find(hasNestedGenerator).get
 throw 
QueryCompilationErrors.nestedGeneratorError(trimAlias(nestedGenerator))
 
   case Aggregate(_, aggList, _) if aggList.count(hasGenerator) > 1 =>
 val generators = aggList.filter(hasGenerator).map(trimAlias)
-throw QueryCompilationErrors.moreThanOneGeneratorError(generators, 
"aggregate")
+throw QueryCompilationErrors.moreThanOneGeneratorError(generators)

Review Comment:
   Another reason is that we can't always figure out whether it's an aggregate. 
If there is no GROUP BY, the plan is still a `Project`, and we may fail before 
the analyzer rewrites it to an `Aggregate`, so we would report the `SELECT` 
clause anyway.
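   
   For example (a hypothetical query, not taken from this PR):
   
   ```
   // Two generators plus an aggregate, but no GROUP BY: when the check runs,
   // the plan is still a Project (the analyzer's GlobalAggregates rule only
   // rewrites it to an Aggregate later), so the error raised here could only
   // ever name the SELECT clause.
   sql("SELECT explode(array('a', 'b')), explode(array(0L, 1L)), max(id) FROM range(10)")
   ```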



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-47241][SQL] Fix rule order issues for ExtractGenerator [spark]

2024-03-06 Thread via GitHub


viirya commented on code in PR #45350:
URL: https://github.com/apache/spark/pull/45350#discussion_r1515378800


##
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala:
##
@@ -2876,28 +2876,36 @@ class Analyzer(override val catalogManager: 
CatalogManager) extends RuleExecutor
   }
 }
 
+// We must wait until all expressions except for generator functions are 
resolved before
+// rewriting generator functions in Project/Aggregate. This is necessary 
to make this rule
+// stable for different execution orders of analyzer rules. See also 
SPARK-47241.
+private def canRewriteGenerator(namedExprs: Seq[NamedExpression]): Boolean 
= {
+  namedExprs.forall { ne =>
+ne.resolved || {
+  trimNonTopLevelAliases(ne) match {
+case AliasedGenerator(_, _, _) => true
+case _ => false
+  }
+}
+  }
+}
+
 def apply(plan: LogicalPlan): LogicalPlan = 
plan.resolveOperatorsUpWithPruning(
   _.containsPattern(GENERATOR), ruleId) {
   case Project(projectList, _) if projectList.exists(hasNestedGenerator) =>
 val nestedGenerator = projectList.find(hasNestedGenerator).get
 throw 
QueryCompilationErrors.nestedGeneratorError(trimAlias(nestedGenerator))
 
-  case Project(projectList, _) if projectList.count(hasGenerator) > 1 =>
-val generators = projectList.filter(hasGenerator).map(trimAlias)
-throw QueryCompilationErrors.moreThanOneGeneratorError(generators, 
"SELECT")
-
   case Aggregate(_, aggList, _) if aggList.exists(hasNestedGenerator) =>
 val nestedGenerator = aggList.find(hasNestedGenerator).get
 throw 
QueryCompilationErrors.nestedGeneratorError(trimAlias(nestedGenerator))
 
   case Aggregate(_, aggList, _) if aggList.count(hasGenerator) > 1 =>
 val generators = aggList.filter(hasGenerator).map(trimAlias)
-throw QueryCompilationErrors.moreThanOneGeneratorError(generators, 
"aggregate")
+throw QueryCompilationErrors.moreThanOneGeneratorError(generators)

Review Comment:
   Hmm, okay.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-47241][SQL] Fix rule order issues for ExtractGenerator [spark]

2024-03-06 Thread via GitHub


cloud-fan commented on code in PR #45350:
URL: https://github.com/apache/spark/pull/45350#discussion_r1515379202


##
sql/core/src/test/scala/org/apache/spark/sql/GeneratorFunctionSuite.scala:
##
@@ -553,6 +552,32 @@ class GeneratorFunctionSuite extends QueryTest with 
SharedSparkSession {
   checkAnswer(df, Row(0.7604953758285915d))
 }
   }
+
+  test("SPARK-47241: two generator functions in SELECT") {
+def testTwoGenerators(needImplicitCast: Boolean): Unit = {
+  val df = sql(
+s"""
+  |SELECT
+  |explode(array('a', 'b')) as c1,
+  |explode(array(0L, ${if (needImplicitCast) "0L + 1" else "1L"})) as 
c2
+  |""".stripMargin)
+  checkAnswer(df, Seq(Row("a", 0L), Row("a", 1L), Row("b", 0L), Row("b", 
1L)))
+}
+testTwoGenerators(needImplicitCast = true)
+testTwoGenerators(needImplicitCast = false)
+  }
+
+  test("SPARK-47241: generator function after SELECT *") {

Review Comment:
   ```suggestion
 test("SPARK-47241: generator function after wildcard in SELECT") {
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-47241][SQL] Fix rule order issues for ExtractGenerator [spark]

2024-03-06 Thread via GitHub


cloud-fan commented on code in PR #45350:
URL: https://github.com/apache/spark/pull/45350#discussion_r1515379047


##
sql/core/src/test/scala/org/apache/spark/sql/GeneratorFunctionSuite.scala:
##
@@ -553,6 +552,32 @@ class GeneratorFunctionSuite extends QueryTest with 
SharedSparkSession {
   checkAnswer(df, Row(0.7604953758285915d))
 }
   }
+
+  test("SPARK-47241: two generator functions in SELECT") {
+def testTwoGenerators(needImplicitCast: Boolean): Unit = {
+  val df = sql(
+s"""
+  |SELECT
+  |explode(array('a', 'b')) as c1,
+  |explode(array(0L, ${if (needImplicitCast) "0L + 1" else "1L"})) as 
c2
+  |""".stripMargin)
+  checkAnswer(df, Seq(Row("a", 0L), Row("a", 1L), Row("b", 0L), Row("b", 
1L)))
+}
+testTwoGenerators(needImplicitCast = true)
+testTwoGenerators(needImplicitCast = false)
+  }
+
+  test("SPARK-47241: generator function after SELECT *") {
+val df = sql(
+  s"""
+ |SELECT *, explode(array('a', 'b')) as c1
+ |FROM
+ |(
+ |  SELECT id FROM range(1) GROUP BY 1
+ |)
+ |""".stripMargin)
+checkAnswer(df, Seq(Row(0, "a"), Row(0, "b")))
+  }

Review Comment:
   good suggestion!



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-47241][SQL] Fix rule order issues for ExtractGenerator [spark]

2024-03-06 Thread via GitHub


cloud-fan commented on code in PR #45350:
URL: https://github.com/apache/spark/pull/45350#discussion_r1515378012


##
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala:
##
@@ -2876,28 +2876,36 @@ class Analyzer(override val catalogManager: 
CatalogManager) extends RuleExecutor
   }
 }
 
+// We must wait until all expressions except for generator functions are 
resolved before
+// rewriting generator functions in Project/Aggregate. This is necessary 
to make this rule
+// stable for different execution orders of analyzer rules. See also 
SPARK-47241.
+private def canRewriteGenerator(namedExprs: Seq[NamedExpression]): Boolean 
= {
+  namedExprs.forall { ne =>
+ne.resolved || {
+  trimNonTopLevelAliases(ne) match {
+case AliasedGenerator(_, _, _) => true
+case _ => false
+  }
+}
+  }
+}
+
 def apply(plan: LogicalPlan): LogicalPlan = 
plan.resolveOperatorsUpWithPruning(
   _.containsPattern(GENERATOR), ruleId) {
   case Project(projectList, _) if projectList.exists(hasNestedGenerator) =>
 val nestedGenerator = projectList.find(hasNestedGenerator).get
 throw 
QueryCompilationErrors.nestedGeneratorError(trimAlias(nestedGenerator))
 
-  case Project(projectList, _) if projectList.count(hasGenerator) > 1 =>
-val generators = projectList.filter(hasGenerator).map(trimAlias)
-throw QueryCompilationErrors.moreThanOneGeneratorError(generators, 
"SELECT")
-
   case Aggregate(_, aggList, _) if aggList.exists(hasNestedGenerator) =>
 val nestedGenerator = aggList.find(hasNestedGenerator).get
 throw 
QueryCompilationErrors.nestedGeneratorError(trimAlias(nestedGenerator))
 
   case Aggregate(_, aggList, _) if aggList.count(hasGenerator) > 1 =>
 val generators = aggList.filter(hasGenerator).map(trimAlias)
-throw QueryCompilationErrors.moreThanOneGeneratorError(generators, 
"aggregate")
+throw QueryCompilationErrors.moreThanOneGeneratorError(generators)

Review Comment:
   I find `aggregate clause` confusing, as what end users write is a `SELECT` 
query with GROUP BY or aggregate functions.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-47070][SQL][FOLLOW-UP] Add a flag guarding a subquery in aggregate rewrite [spark]

2024-03-06 Thread via GitHub


HyukjinKwon closed pull request #45412: [SPARK-47070][SQL][FOLLOW-UP] Add a 
flag guarding a subquery in aggregate rewrite
URL: https://github.com/apache/spark/pull/45412


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-47070][SQL][FOLLOW-UP] Add a flag guarding a subquery in aggregate rewrite [spark]

2024-03-06 Thread via GitHub


HyukjinKwon commented on PR #45412:
URL: https://github.com/apache/spark/pull/45412#issuecomment-1982197769

   Merged to master.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-47070][SQL][FOLLOW-UP] Add a flag guarding a subquery in aggregate rewrite [spark]

2024-03-06 Thread via GitHub


zhengruifeng commented on code in PR #45412:
URL: https://github.com/apache/spark/pull/45412#discussion_r1515374917


##
sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala:
##
@@ -3509,6 +3509,17 @@ object SQLConf {
   .booleanConf
   .createWithDefault(true)
 
+  val WRAP_EXISTS_IN_AGGREGATE_FUNCTION =
+buildConf("spark.sql.optimizer.wrapExistsInAggregateFunction")
+  .internal()
+  .doc("When true, the optimizer will wrap newly introduced `exists` 
attributes in an" +
+  "aggregate function to ensure that Aggregate nodes preserve semantic 
invariant that each" +
+  "variable among agg expressions appears either in grouping expressions 
or belongs to" +

Review Comment:
   nit, missing spaces
   
   ```suggestion
 .doc("When true, the optimizer will wrap newly introduced `exists` 
attributes in an " +
 "aggregate function to ensure that Aggregate nodes preserve semantic 
invariant that each " +
 "variable among agg expressions appears either in grouping expressions 
or belongs to " +
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-47311][SQL][PYTHON] Suppress Python exceptions where PySpark is not in the Python path [spark]

2024-03-06 Thread via GitHub


HyukjinKwon commented on PR #45414:
URL: https://github.com/apache/spark/pull/45414#issuecomment-1982190942

   cc @cloud-fan @allisonwang-db 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[PR] [SPARK-47311][SQL][PYTHON] Suppress Python exceptions where PySpark is not in the Python path [spark]

2024-03-06 Thread via GitHub


HyukjinKwon opened a new pull request, #45414:
URL: https://github.com/apache/spark/pull/45414

   ### What changes were proposed in this pull request?
   
   This PR proposes to suppress Python exceptions when PySpark is not in the 
Python path.
   
   ### Why are the changes needed?
   
   The `pyspark` library itself might be missing when users run Scala/Java or R 
Spark applications. In that case, these warning messages might be too noisy.
   
   ### Does this PR introduce _any_ user-facing change?
   
   Yes, it will hide the warning message when a user meets all the conditions below:
   - Uses a Scala- or R-only Spark application
   - Has Python installed, but PySpark is not in its Python path
 - Either because `SPARK_HOME` is undefined for some reason,
 - Or because of other environment problems.
   
   ### How was this patch tested?
   
   Manually tested.
   
   ### Was this patch authored or co-authored using generative AI tooling?
   
   No.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-47276][PYTHON][CONNECT] Introduce `spark.profile.clear` for SparkSession-based profiling [spark]

2024-03-06 Thread via GitHub


xinrong-meng commented on code in PR #45378:
URL: https://github.com/apache/spark/pull/45378#discussion_r1515363804


##
python/pyspark/sql/profiler.py:
##
@@ -236,18 +236,22 @@ def clear_perf_profiles(self, id: Optional[int] = None) 
-> None:
 The UDF ID whose profiling results should be cleared.
 If not specified, all the results will be cleared.
 """
-ids_to_remove = [
-result_id
-for result_id, (perf, _, *_) in self._profile_results.items()
-if perf is not None
-]
 with self._lock:
 if id is not None:
-if id in ids_to_remove:
-self._profile_results.pop(id, None)
+if id in self._profile_results:
+perf, mem, *rest = self._profile_results[id]
+self._profile_results[id] = (None, mem, *rest)
+if mem is None:
+self._profile_results.pop(id, None)
 else:
-for id_to_remove in ids_to_remove:
-self._profile_results.pop(id_to_remove, None)
+ids_to_remove = []
+for id, (perf, mem, *rest) in 
list(self._profile_results.items()):
+self._profile_results[id] = (None, mem, *rest)
+if mem is None:
+ids_to_remove.append(id)

Review Comment:
   Good idea! Adjusted.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[PR] [SPARK-46988][CONNECT][TESTS] Add tests for `map` type in `ProtoUtils.abbreviate` [spark]

2024-03-06 Thread via GitHub


zhengruifeng opened a new pull request, #45413:
URL: https://github.com/apache/spark/pull/45413

   ### What changes were proposed in this pull request?
   Add tests for map types in `ProtoUtils.abbreviate`
   
   
   ### Why are the changes needed?
   When I started working on `SPARK-46988` to make `ProtoUtils.abbreviate` 
support the `map` type, I found that it was already supported as part of the 
`repeated Message` support.
   
   Internally, a `map` field is also a `repeated` field, and each element is a 
`Message`.
   
   see 
https://cloud.google.com/java/docs/reference/protobuf/latest/com.google.protobuf.MapEntry
   
   > In reflection API, map fields will be treated as repeated message fields 
and each map entry is accessed as a message. This MapEntry class is used to 
represent these map entry messages in reflection API.
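   
   For illustration, a sketch of that equivalence via the Java reflection API 
(assuming some message `msg` with a map field named `my_map`; both names are 
hypothetical):
   
   ```
   val field = msg.getDescriptorForType.findFieldByName("my_map")
   // a map field reports itself as a repeated field of MapEntry messages
   assert(field.isRepeated && field.isMapField)
   val entries = msg.getField(field)
     .asInstanceOf[java.util.List[com.google.protobuf.Message]]
   ```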
   
   
   ### Does this PR introduce _any_ user-facing change?
   no, test-only
   
   
   ### How was this patch tested?
   added tests
   
   
   ### Was this patch authored or co-authored using generative AI tooling?
   no


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-47272][SS] Add MapState implementation for State API v2. [spark]

2024-03-06 Thread via GitHub


anishshri-db commented on PR #45341:
URL: https://github.com/apache/spark/pull/45341#issuecomment-1982166978

   @jingz-db - could you please fix this style error?
   
   ```
   [info] compiling 1 Scala source to 
/home/runner/work/spark/spark/tools/target/scala-2.13/classes ...
   [error] 
/home/runner/work/spark/spark/sql/core/src/test/scala/org/apache/spark/sql/execution/streaming/state/ValueStateSuite.scala:195:0:
 Whitespace at end of line
   [info]   Compilation completed in 14.316s.
   ```


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-42746][SQL] Add the LISTAGG() aggregate function [spark]

2024-03-06 Thread via GitHub


github-actions[bot] commented on PR #42398:
URL: https://github.com/apache/spark/pull/42398#issuecomment-1982086020

   We're closing this PR because it hasn't been updated in a while. This isn't 
a judgement on the merit of the PR in any way. It's just a way of keeping the 
PR queue manageable.
   If you'd like to revive this PR, please reopen it and ask a committer to 
remove the Stale tag!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-47272][SS] Add MapState implementation for State API v2. [spark]

2024-03-06 Thread via GitHub


jingz-db commented on code in PR #45341:
URL: https://github.com/apache/spark/pull/45341#discussion_r1515298710


##
sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/StateTypesEncoderUtils.scala:
##
@@ -86,3 +88,53 @@ object StateTypesEncoder {
 new StateTypesEncoder[GK](keySerializer, stateName)
   }
 }
+
+class CompositeKeyStateEncoder[GK, K](
+keySerializer: Serializer[GK],
+stateName: String,
+userKeyEnc: Encoder[K])
+  extends StateTypesEncoder[GK](keySerializer: Serializer[GK], stateName: 
String) {
+
+  private val schemaForCompositeKeyRow: StructType =
+new StructType()
+  .add("key", BinaryType)
+  .add("userKey", BinaryType)
+  private val compositeKeyProjection = 
UnsafeProjection.create(schemaForCompositeKeyRow)
+  private val reuseRow = new UnsafeRow(userKeyEnc.schema.fields.length)
+  private val userKeyExpressionEnc = encoderFor(userKeyEnc)
+
+  private val userKeyRowToObjDeserializer =
+userKeyExpressionEnc.resolveAndBind().createDeserializer()
+  private val userKeySerializer = encoderFor(userKeyEnc).createSerializer()
+
+  def encodeCompositeKey(userKey: K): UnsafeRow = {
+val keyOption = ImplicitGroupingKeyTracker.getImplicitKeyOption
+if (keyOption.isEmpty) {
+  throw StateStoreErrors.implicitKeyNotFound(stateName)
+}
+val groupingKey = keyOption.get.asInstanceOf[GK]
+// generate grouping key byte array
+val groupingKeyByteArr = 
keySerializer.apply(groupingKey).asInstanceOf[UnsafeRow].getBytes()
+// generate user key byte array
+val userKeyBytesArr = 
userKeySerializer.apply(userKey).asInstanceOf[UnsafeRow].getBytes()
+
+val compositeKeyRow = 
compositeKeyProjection(InternalRow(groupingKeyByteArr, userKeyBytesArr))
+compositeKeyRow
+  }
+
+  def decodeCompositeKey(row: UnsafeRow): K = {
+val bytes = row.getBinary(1)
+reuseRow.pointTo(bytes, bytes.length)
+val value = userKeyRowToObjDeserializer.apply(reuseRow)
+value
+  }
+}
+
+object CompositeKeyStateEncoder {

Review Comment:
   I was following Bhuwan's style in the base class. Maybe I am missing 
something, but I did not find anything relevant in the [style 
guide](https://github.com/databricks/scala-style-guide).



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-47155][PYTHON] Fix Error Class Issue [spark]

2024-03-06 Thread via GitHub


HyukjinKwon commented on PR #45306:
URL: https://github.com/apache/spark/pull/45306#issuecomment-1982064029

   ```
   ==
   FAIL [0.001s]: test_error_classes_sorted 
(pyspark.errors.tests.test_errors.ErrorsTest)
   --
   Traceback (most recent call last):
 File "/__w/spark/spark/python/pyspark/errors/tests/test_errors.py", line 
33, in test_error_classes_sorted
   self.assertTrue(
   AssertionError: False is not true : Error class 
[DATA_SOURCE_RETURN_SCHEMA_MISMATCH] should place after 
[DATA_SOURCE_CREATE_ERROR].
   
   Run 'cd $SPARK_HOME; bin/pyspark' and 'from pyspark.errors.exceptions import 
_write_self; _write_self()' to automatically sort them.
   ```


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-47070][SQL][FOLLOW-UP] Add a flag guarding a subquery in aggregate rewrite [spark]

2024-03-06 Thread via GitHub


HyukjinKwon commented on PR #45412:
URL: https://github.com/apache/spark/pull/45412#issuecomment-1982062236

   cc @cloud-fan 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-47271][DOCS] Explain importance of statistics on SQL performance tuning page [spark]

2024-03-06 Thread via GitHub


HyukjinKwon closed pull request #45374: [SPARK-47271][DOCS] Explain importance 
of statistics on SQL performance tuning page
URL: https://github.com/apache/spark/pull/45374


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-47271][DOCS] Explain importance of statistics on SQL performance tuning page [spark]

2024-03-06 Thread via GitHub


HyukjinKwon commented on PR #45374:
URL: https://github.com/apache/spark/pull/45374#issuecomment-1982061440

   Merged to master.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-47070][SQL][FOLLOW-UP] Add a flag guarding a subquery in aggregate rewrite [spark]

2024-03-06 Thread via GitHub


HyukjinKwon commented on PR #45412:
URL: https://github.com/apache/spark/pull/45412#issuecomment-1982034513

   @anton5798 mind keeping the PR description template? 
https://github.com/apache/spark/blob/master/.github/PULL_REQUEST_TEMPLATE


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-47070][FOLLOW-UP] Add a flag guarding a subquery in aggregate rewrite [spark]

2024-03-06 Thread via GitHub


HyukjinKwon commented on code in PR #45412:
URL: https://github.com/apache/spark/pull/45412#discussion_r1515285320


##
sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala:
##
@@ -3509,6 +3509,17 @@ object SQLConf {
   .booleanConf
   .createWithDefault(true)
 
+  val WRAP_EXISTS_IN_AGGREGATE_FUNCTION =
+buildConf("spark.sql.optimizer.wrapExistsInAggregateFunction")
+  .internal()
+  .doc("When true, the optimizer will wrap newly introduced `exists` 
attributes in an" +
+  "aggregate function to ensure that Aggregate nodes preserve semantic 
invariant that each" +
+  "variable among agg expressions appears either in grouping expressions 
or belongs to" +
+  "and aggregate function.")
+  .version("3.5.0")

Review Comment:
   3.5.2 or 4.0.0



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-45959][SQL] : Deduplication of relations causes ambiguous exception to be raised, though logically it is Unambiguous [spark]

2024-03-06 Thread via GitHub


ahshahid commented on PR #45343:
URL: https://github.com/apache/spark/pull/45343#issuecomment-1982020269

   While fixing the issue, I found that tests in DataFrameSelfJoinSuite where 
the two legs have distinct datasetIds expect an AnalysisException due to 
ambiguity in the join condition, which IMO is not correct, as it is possible 
to disambiguate using resolution via the Dataset ID.
   This PR fixes the above and modifies the tests accordingly.
   The same applies to some tests in DataFrameAsofJoinSuite.
   I believe some enhancement is still needed in the existing source code 
related to AsofJoin, as the condition part does not yet try to disambiguate 
properly (unlike the changes done for Join). 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-47276][PYTHON][CONNECT] Introduce `spark.profile.clear` for SparkSession-based profiling [spark]

2024-03-06 Thread via GitHub


ueshin commented on code in PR #45378:
URL: https://github.com/apache/spark/pull/45378#discussion_r1515240180


##
python/pyspark/sql/profiler.py:
##
@@ -236,18 +236,22 @@ def clear_perf_profiles(self, id: Optional[int] = None) 
-> None:
 The UDF ID whose profiling results should be cleared.
 If not specified, all the results will be cleared.
 """
-ids_to_remove = [
-result_id
-for result_id, (perf, _, *_) in self._profile_results.items()
-if perf is not None
-]
 with self._lock:
 if id is not None:
-if id in ids_to_remove:
-self._profile_results.pop(id, None)
+if id in self._profile_results:
+perf, mem, *rest = self._profile_results[id]
+self._profile_results[id] = (None, mem, *rest)
+if mem is None:
+self._profile_results.pop(id, None)
 else:
-for id_to_remove in ids_to_remove:
-self._profile_results.pop(id_to_remove, None)
+ids_to_remove = []
+for id, (perf, mem, *rest) in 
list(self._profile_results.items()):
+self._profile_results[id] = (None, mem, *rest)
+if mem is None:
+ids_to_remove.append(id)

Review Comment:
   nit: Can't we pop it here?



##
python/pyspark/sql/profiler.py:
##
@@ -262,15 +266,21 @@ def clear_memory_profiles(self, id: Optional[int] = None) 
-> None:
 If not specified, all the results will be cleared.
 """
 with self._lock:
-ids_to_remove = [
-id for id, (_, mem, *_) in self._profile_results.items() if 
mem is not None
-]
 if id is not None:
-if id in ids_to_remove:
-self._profile_results.pop(id, None)
+if id in self._profile_results:
+perf, mem, *rest = self._profile_results[id]
+self._profile_results[id] = (perf, None, *rest)
+if perf is None:
+self._profile_results.pop(id, None)
 else:
-for id_to_remove in ids_to_remove:
-self._profile_results.pop(id_to_remove, None)
+ids_to_remove = []
+for id, (perf, mem, *rest) in 
list(self._profile_results.items()):
+self._profile_results[id] = (perf, None, *rest)
+if perf is None:
+ids_to_remove.append(id)

Review Comment:
   ditto.
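
   A minimal sketch of the suggested single-pass variant, popping exhausted entries directly inside the loop instead of collecting ids first (illustrative names, not the actual pyspark internals):

   ```python
   from threading import RLock
   from typing import Dict, Optional, Tuple

   class ProfileResultsSketch:
       def __init__(self) -> None:
           self._lock = RLock()
           # UDF id -> (perf_stats, memory_stats); None means cleared.
           self._profile_results: Dict[int, Tuple[Optional[object], Optional[object]]] = {}

       def clear_perf(self, id: Optional[int] = None) -> None:
           with self._lock:
               if id is not None:
                   if id in self._profile_results:
                       _, mem = self._profile_results[id]
                       if mem is None:
                           # Nothing left to keep; drop the entry entirely.
                           self._profile_results.pop(id, None)
                       else:
                           self._profile_results[id] = (None, mem)
               else:
                   # Iterate over a snapshot so popping inside the loop is safe.
                   for key, (_, mem) in list(self._profile_results.items()):
                       if mem is None:
                           self._profile_results.pop(key, None)
                       else:
                           self._profile_results[key] = (None, mem)
   ```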



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-47148][SQL] Avoid to materialize AQE ExchangeQueryStageExec on the cancellation [spark]

2024-03-06 Thread via GitHub


erenavsarogullari commented on code in PR #45234:
URL: https://github.com/apache/spark/pull/45234#discussion_r1515243849


##
sql/core/src/main/scala/org/apache/spark/sql/execution/adaptive/QueryStageExec.scala:
##
@@ -148,6 +148,17 @@ abstract class QueryStageExec extends LeafExecNode {
  */
 abstract class ExchangeQueryStageExec extends QueryStageExec {
 
+  /**
+   * This flag aims to detect if the stage materialization is started. This 
helps
+   * to avoid unnecessary stage materialization when the stage is canceled.
+   */
+  @transient protected val materializationStarted = new AtomicBoolean()

Review Comment:
   Yes, it is addressed.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-47148][SQL] Avoid to materialize AQE ExchangeQueryStageExec on the cancellation [spark]

2024-03-06 Thread via GitHub


erenavsarogullari commented on code in PR #45234:
URL: https://github.com/apache/spark/pull/45234#discussion_r1515240053


##
sql/core/src/main/scala/org/apache/spark/sql/execution/adaptive/AdaptiveSparkPlanExec.scala:
##
@@ -790,7 +790,8 @@ case class AdaptiveSparkPlanExec(
 currentPhysicalPlan.foreach {
   // earlyFailedStage is the stage which failed before calling 
doMaterialize,
   // so we should avoid calling cancel on it to re-trigger the failure 
again.
-  case s: ExchangeQueryStageExec if !earlyFailedStage.contains(s.id) =>
+  case s: ExchangeQueryStageExec if !earlyFailedStage.contains(s.id) &&
+s.isMaterializationStarted() =>

Review Comment:
   This depends on the query complexity (`numOfExchangeQueryStage`) and on 
which stage's materialization fails, such as the first stage or the last 
stage. For example, if the query introduces multiple `ShuffleQueryStages` and 
the first stage's materialization fails, cancellation is currently skipped 
for all of the remaining stages by the new logic because their 
`materializationStarted = false`, such as:
   Say there are 3 `ShuffleQueryStages`:
   ```
   - ShuffleQueryStage-0's materialization fails (and its 
materializationStarted = true),
   - ShuffleQueryStage-1's materialization has not started yet, so it is not 
cancelled because materializationStarted = false,
   - ShuffleQueryStage-2's materialization has not started yet, so it is not 
cancelled because materializationStarted = false,
   ...
   ```
   I can also reproduce the same behaviour for more complex queries.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-47303][CORE][TESTS] Restructure MasterSuite [spark]

2024-03-06 Thread via GitHub


HyukjinKwon closed pull request #45366: [SPARK-47303][CORE][TESTS] Restructure 
MasterSuite
URL: https://github.com/apache/spark/pull/45366


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-47303][CORE][TESTS] Restructure MasterSuite [spark]

2024-03-06 Thread via GitHub


HyukjinKwon commented on PR #45366:
URL: https://github.com/apache/spark/pull/45366#issuecomment-1981968073

   Thanks.
   
   Merged to master.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-47276][PYTHON][CONNECT] Introduce `spark.profile.clear` for SparkSession-based profiling [spark]

2024-03-06 Thread via GitHub


xinrong-meng commented on code in PR #45378:
URL: https://github.com/apache/spark/pull/45378#discussion_r1515212345


##
python/pyspark/sql/profiler.py:
##
@@ -224,6 +224,54 @@ def dump(id: int) -> None:
 for id in sorted(code_map.keys()):
 dump(id)
 
+def clear_perf_profiles(self, id: Optional[int] = None) -> None:
+"""
+Clear the perf profile results.
+
+.. versionadded:: 4.0.0
+
+Parameters
+--
+id : int, optional
+The UDF ID whose profiling results should be cleared.
+If not specified, all the results will be cleared.
+"""
+ids_to_remove = [
+result_id
+for result_id, (perf, _, *_) in self._profile_results.items()
+if perf is not None
+]
+with self._lock:
+if id is not None:
+if id in ids_to_remove:
+self._profile_results.pop(id, None)

Review Comment:
   Good catch! Thanks for the example. I adjusted the code and added tests.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-47272][SS] Add MapState implementation for State API v2. [spark]

2024-03-06 Thread via GitHub


jingz-db commented on PR #45341:
URL: https://github.com/apache/spark/pull/45341#issuecomment-1981833646

   > @jingz-db - test failure seems related ?
   
   Weirdly, it passes locally. Let me resolve your comments, retrigger the 
CI, and see if it still fails. Thanks for the review!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-47307] Replace RFC 2045 base64 encoder with RFC 4648 encoder [spark]

2024-03-06 Thread via GitHub


dongjoon-hyun commented on PR #45408:
URL: https://github.com/apache/spark/pull/45408#issuecomment-1981829489

   Hi, @ted-jenks. Could you elaborate on your correctness situation a little 
more? It sounds like you have other systems that read Spark's data.
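
   For background, the two encodings differ in line wrapping: RFC 2045 (MIME) base64 inserts a newline every 76 encoded characters, while RFC 4648 emits one continuous string. A quick demonstration with Python's standard library:

   ```python
   import base64

   data = b"x" * 100  # long enough to exceed one 76-character output line

   mime = base64.encodebytes(data)   # RFC 2045 style: wrapped output
   plain = base64.b64encode(data)    # RFC 4648 style: no line breaks

   assert b"\n" in mime
   assert b"\n" not in plain
   ```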


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[PR] [SPARK-47070][FOLLOW-UP] Add a flag guarding a subquery in aggregate rewrite [spark]

2024-03-06 Thread via GitHub


anton5798 opened a new pull request, #45412:
URL: https://github.com/apache/spark/pull/45412

   ### What changes were proposed in this pull request?
   
   Add a flag that guards a recently introduced codepath inside the optimizer 
that wraps `exists` variables into an agg function. See 
[#45133](https://github.com/apache/spark/pull/45133) for details.
   
   ### Tests
   No additional tests.
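
   Assuming the conf name from the diff reviewed earlier in this thread, the guard can be toggled at runtime to fall back to the previous behavior; a sketch only, since the conf is internal:

   ```python
   from pyspark.sql import SparkSession

   spark = SparkSession.builder.getOrCreate()
   # Internal flag guarding the exists-in-aggregate rewrite (name from the diff).
   spark.conf.set("spark.sql.optimizer.wrapExistsInAggregateFunction", "false")
   ```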


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[PR] [SPARK-47309][SQL][XML] Add schema inference unit tests and fix schema inference issues [spark]

2024-03-06 Thread via GitHub


shujingyang-db opened a new pull request, #45411:
URL: https://github.com/apache/spark/pull/45411

   
   
   ### What changes were proposed in this pull request?
   
   
   As titled.
   
   It also fixes two schema inference issues:
   
   1) when there's an empty tag
   
   2) when merging schemas for NullType
   
   ### Why are the changes needed?
   
   Fix a bug.
   
   ### Does this PR introduce _any_ user-facing change?
   
   Yes
   
   ### How was this patch tested?
   
   Unit tests
   
   ### Was this patch authored or co-authored using generative AI tooling?
   
   No
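
   A hypothetical repro of the empty-tag case, assuming the built-in XML data source is available; the path and data are illustrative:

   ```python
   from pyspark.sql import SparkSession

   spark = SparkSession.builder.getOrCreate()

   # One row has an empty <b/> tag; inference previously mishandled it, and
   # merging its NullType with the string from the second row could fail.
   xml = "<rows><row><a>1</a><b/></row><row><a>2</a><b>x</b></row></rows>"
   path = "/tmp/sample.xml"  # illustrative path
   with open(path, "w") as f:
       f.write(xml)

   df = spark.read.format("xml").option("rowTag", "row").load(path)
   df.printSchema()  # `b` should be inferred as string despite the empty tag
   ```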


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-45278] [YARN] Allow configuring Yarn executor bind address in Yarn [spark]

2024-03-06 Thread via GitHub


gedeh commented on PR #42870:
URL: https://github.com/apache/spark/pull/42870#issuecomment-1981727449

   Hello, I noticed this PR is closed; this is blocking Spark with YARN on 
Kubernetes. I don't understand what's still missing for this PR. If anyone in 
the Spark project can shed light on what is required to merge this for the 
next release of Spark, that would be great. Thanks!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-47276][PYTHON][CONNECT] Introduce `spark.profile.clear` for SparkSession-based profiling [spark]

2024-03-06 Thread via GitHub


ueshin commented on code in PR #45378:
URL: https://github.com/apache/spark/pull/45378#discussion_r1515098224


##
python/pyspark/sql/profiler.py:
##
@@ -224,6 +224,54 @@ def dump(id: int) -> None:
 for id in sorted(code_map.keys()):
 dump(id)
 
+def clear_perf_profiles(self, id: Optional[int] = None) -> None:
+"""
+Clear the perf profile results.
+
+.. versionadded:: 4.0.0
+
+Parameters
+--
+id : int, optional
+The UDF ID whose profiling results should be cleared.
+If not specified, all the results will be cleared.
+"""
+ids_to_remove = [
+result_id
+for result_id, (perf, _, *_) in self._profile_results.items()
+if perf is not None
+]
+with self._lock:
+if id is not None:
+if id in ids_to_remove:
+self._profile_results.pop(id, None)

Review Comment:
   On Jupyter:
   
   ```py
   from pyspark.sql.functions import pandas_udf
   df = spark.range(3)
   
   @pandas_udf("long")
   def add1(x):
   return x + 1
   
   added = df.select(add1("id"))
   
   spark.conf.set("spark.sql.pyspark.udf.profiler", "perf")
   added.show()
   
   spark.conf.set("spark.sql.pyspark.udf.profiler", "memory")
   added.show()
   
   spark.profile.show()
   ...
   
   spark.profile.clear(type="memory")
   
   spark.profile.show()  # should still show the perf results?
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-39771][CORE] Add a warning msg in `Dependency` when a too large number of shuffle blocks is to be created. [spark]

2024-03-06 Thread via GitHub


xuanyuanking commented on PR #45266:
URL: https://github.com/apache/spark/pull/45266#issuecomment-1981661582

   Thanks for the contribution, @y-wei!
   
   Merged to master.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org


