Re: [PR] [SPARK-47155][PYTHON] Fix Error Class Issue [spark]
HyukjinKwon closed pull request #45306: [SPARK-47155][PYTHON] Fix Error Class Issue URL: https://github.com/apache/spark/pull/45306 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-47155][PYTHON] Fix Error Class Issue [spark]
HyukjinKwon commented on PR #45306: URL: https://github.com/apache/spark/pull/45306#issuecomment-1982702140 Merged to master. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-43124][SQL] Add ConvertCommandResultToLocalRelation rule [spark]
wForget commented on code in PR #45397: URL: https://github.com/apache/spark/pull/45397#discussion_r1515650747 ## sql/core/src/main/scala/org/apache/spark/sql/catalyst/optimizer/ConvertCommandResultToLocalRelation.scala: ## @@ -0,0 +1,52 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.sql.catalyst.optimizer + +import org.apache.spark.sql.catalyst.expressions.{AttributeReference, Expression, IntegerLiteral, InterpretedMutableProjection, Predicate, Unevaluable} +import org.apache.spark.sql.catalyst.plans.logical.{CommandResult, Filter, Limit, LocalRelation, LogicalPlan, Project} +import org.apache.spark.sql.catalyst.rules.Rule +import org.apache.spark.sql.catalyst.trees.TreePattern.COMMAND_RESULT + +/** + * Converts local operations (i.e. ones that don't require data exchange) on `CommandResult` + * to `LocalRelation`. + */ +object ConvertCommandResultToLocalRelation extends Rule[LogicalPlan] { + + override def apply(plan: LogicalPlan): LogicalPlan = plan.transformWithPruning( +_.containsPattern(COMMAND_RESULT)) { +case Project(projectList, CommandResult(output, _, _, rows)) + if !projectList.exists(hasUnevaluableExpr) => + val projection = new InterpretedMutableProjection(projectList, output) + projection.initialize(0) + LocalRelation(projectList.map(_.toAttribute), rows.map(projection(_).copy())) + +case Limit(IntegerLiteral(limit), CommandResult(output, _, _, rows)) => + LocalRelation(output, rows.take(limit)) + +case Filter(condition, CommandResult(output, _, _, rows)) Review Comment: Thanks for the explanation, it makes sense to me, can we continue to review #45373? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-36691][PYTHON] PythonRunner failed should pass error message to ApplicationMaster too [spark]
AngersZh commented on PR #33934: URL: https://github.com/apache/spark/pull/33934#issuecomment-1982645154 ping @holdenk -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-47315][SQL][TEST] Clean up tempView for `createTempView` UT [spark]
wForget commented on code in PR #45417: URL: https://github.com/apache/spark/pull/45417#discussion_r1515636612 ## sql/core/src/test/scala/org/apache/spark/sql/DatasetSuite.scala: ## @@ -1420,18 +1420,20 @@ class DatasetSuite extends QueryTest dataset.sparkSession.catalog.dropTempView("tempView") withDatabase("test_db") { Review Comment: > We're cleaning the temp db up so we're good? `tempView` does not seem to be cleaned up because the temporary view does not belong to a db. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-43124][SQL] Add ConvertCommandResultToLocalRelation rule [spark]
HyukjinKwon commented on code in PR #45397: URL: https://github.com/apache/spark/pull/45397#discussion_r1515634282 ## sql/core/src/main/scala/org/apache/spark/sql/catalyst/optimizer/ConvertCommandResultToLocalRelation.scala: ## @@ -0,0 +1,52 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.sql.catalyst.optimizer + +import org.apache.spark.sql.catalyst.expressions.{AttributeReference, Expression, IntegerLiteral, InterpretedMutableProjection, Predicate, Unevaluable} +import org.apache.spark.sql.catalyst.plans.logical.{CommandResult, Filter, Limit, LocalRelation, LogicalPlan, Project} +import org.apache.spark.sql.catalyst.rules.Rule +import org.apache.spark.sql.catalyst.trees.TreePattern.COMMAND_RESULT + +/** + * Converts local operations (i.e. ones that don't require data exchange) on `CommandResult` + * to `LocalRelation`. + */ +object ConvertCommandResultToLocalRelation extends Rule[LogicalPlan] { + + override def apply(plan: LogicalPlan): LogicalPlan = plan.transformWithPruning( +_.containsPattern(COMMAND_RESULT)) { +case Project(projectList, CommandResult(output, _, _, rows)) + if !projectList.exists(hasUnevaluableExpr) => + val projection = new InterpretedMutableProjection(projectList, output) + projection.initialize(0) + LocalRelation(projectList.map(_.toAttribute), rows.map(projection(_).copy())) + +case Limit(IntegerLiteral(limit), CommandResult(output, _, _, rows)) => + LocalRelation(output, rows.take(limit)) + +case Filter(condition, CommandResult(output, _, _, rows)) Review Comment: ic thanks for explanation. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SQL] Bind JDBC dialect to JDBCRDD at construction [spark]
HyukjinKwon commented on code in PR #45410: URL: https://github.com/apache/spark/pull/45410#discussion_r1515631182 ## sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JDBCRDD.scala: ## @@ -153,12 +153,12 @@ object JDBCRDD extends Logging { */ class JDBCRDD( Review Comment: This class isn't actually an API. Even we change this for API usage, we can't just change the signature; otherwise, it will break binary compatibility -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-47315][SQL][TEST] Clean up tempView for `createTempView` UT [spark]
HyukjinKwon commented on code in PR #45417: URL: https://github.com/apache/spark/pull/45417#discussion_r1515629669 ## sql/core/src/test/scala/org/apache/spark/sql/DatasetSuite.scala: ## @@ -1420,18 +1420,20 @@ class DatasetSuite extends QueryTest dataset.sparkSession.catalog.dropTempView("tempView") withDatabase("test_db") { Review Comment: We're cleaning the temp db up so we're good? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-47265][SQL][TESTS] Replace `createTable(..., schema: StructType, ...)` with `createTable(..., columns: Array[Column], ...)` in UT [spark]
panbingkun commented on code in PR #45368: URL: https://github.com/apache/spark/pull/45368#discussion_r1515628066 ## sql/catalyst/src/test/scala/org/apache/spark/sql/connector/catalog/InMemoryTableCatalog.scala: ## @@ -84,28 +85,28 @@ class BasicInMemoryTableCatalog extends TableCatalog { invalidatedTables.add(ident) } - // TODO: remove it when no tests calling this deprecated method. + // TODO: remove it when the deprecated method `createTable(..., StructType, ...)` Review Comment: I'm actually a bit hesitant, because if we `eventually` decide to `completely remove` this in `a future version`, keeping it now may be a good reminder, and I want to hear everyone's opinions. @cloud-fan -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-47265][SQL][TESTS] Replace `createTable(..., schema: StructType, ...)` with `createTable(..., columns: Array[Column], ...)` in UT [spark]
LuciferYang commented on code in PR #45368: URL: https://github.com/apache/spark/pull/45368#discussion_r1515621179 ## sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/V2SessionCatalog.scala: ## @@ -249,6 +241,16 @@ class V2SessionCatalog(catalog: SessionCatalog) null // Return null to save the `loadTable` call for CREATE TABLE without AS SELECT. } + // TODO: remove it when the deprecated method `createTable(..., StructType, ...)` Review Comment: ditto -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-47265][SQL][TESTS] Replace `createTable(..., schema: StructType, ...)` with `createTable(..., columns: Array[Column], ...)` in UT [spark]
LuciferYang commented on code in PR #45368: URL: https://github.com/apache/spark/pull/45368#discussion_r1515620693 ## sql/catalyst/src/test/scala/org/apache/spark/sql/connector/catalog/InMemoryTableCatalog.scala: ## @@ -84,28 +85,28 @@ class BasicInMemoryTableCatalog extends TableCatalog { invalidatedTables.add(ident) } - // TODO: remove it when no tests calling this deprecated method. + // TODO: remove it when the deprecated method `createTable(..., StructType, ...)` Review Comment: Should we remove this `TODO`? Isn't the current solution the final one? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-47265][SQL][TESTS] Replace `createTable(..., schema: StructType, ...)` with `createTable(..., columns: Array[Column], ...)` in UT [spark]
LuciferYang commented on code in PR #45368: URL: https://github.com/apache/spark/pull/45368#discussion_r1515620693 ## sql/catalyst/src/test/scala/org/apache/spark/sql/connector/catalog/InMemoryTableCatalog.scala: ## @@ -84,28 +85,28 @@ class BasicInMemoryTableCatalog extends TableCatalog { invalidatedTables.add(ident) } - // TODO: remove it when no tests calling this deprecated method. + // TODO: remove it when the deprecated method `createTable(..., StructType, ...)` Review Comment: Should we remove this TODO? Isn't the current solution the final one? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-47300][SQL] `quoteIfNeeded` should quote identifier starts with digits [spark]
dongjoon-hyun commented on PR #45401: URL: https://github.com/apache/spark/pull/45401#issuecomment-1982580157 It seems that the same failure (which @MaxGekk 's reported) still exists at the last CI. ``` StringUtilsSuite: [info] - escapeLikeRegex (1 millisecond) [info] - filter pattern (1 millisecond) [info] - string concatenation (1 millisecond) [info] - string concatenation with limit (1 millisecond) [info] - string concatenation return value (1 millisecond) [info] - SPARK-31916: StringConcat doesn't overflow on many inputs (190 milliseconds) [info] - SPARK-31916: verify that PlanStringConcat's output shows the actual length of the plan (3 milliseconds) [info] - SPARK-34872: quoteIfNeeded should quote a string which contains non-word characters *** FAILED *** (2 milliseconds) [info] "[`1a`]" did not equal "[1a]" (StringUtilsSuite.scala:136) [info] Analysis: [info] "[`1a`]" -> "[1a]" ``` -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-47210][SQL][COLLATION][WIP] Implicit casting on collated expressions [spark]
uros-db commented on code in PR #45383: URL: https://github.com/apache/spark/pull/45383#discussion_r1515617343 ## common/utils/src/main/resources/error/error-classes.json: ## @@ -475,6 +475,24 @@ ], "sqlState" : "42704" }, + "COLLATION_MISMATCH" : { +"message" : [ + "Could not determine which collation to use for string comparison." +], +"subClass" : { + "EXPLICIT" : { +"message" : [ + "Error occurred due to the mismatch between explicit collations: " Review Comment: should we add an instruction for the user here? for example: `Please use the same collation for both strings.` -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-47265][SQL][TESTS] Replace `createTable(..., schema: StructType, ...)` with `createTable(..., columns: Array[Column], ...)` in UT [spark]
panbingkun commented on code in PR #45368: URL: https://github.com/apache/spark/pull/45368#discussion_r1515616923 ## sql/catalyst/src/test/scala/org/apache/spark/sql/connector/catalog/InMemoryTableCatalog.scala: ## @@ -84,28 +84,29 @@ class BasicInMemoryTableCatalog extends TableCatalog { invalidatedTables.add(ident) } - // TODO: remove it when no tests calling this deprecated method. + // TODO: remove it when the deprecated method `createTable(..., StructType, ...)` Review Comment: Done -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-47210][SQL][COLLATION][WIP] Implicit casting on collated expressions [spark]
uros-db commented on code in PR #45383: URL: https://github.com/apache/spark/pull/45383#discussion_r1515613758 ## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/stringExpressions.scala: ## @@ -509,18 +509,10 @@ abstract class StringPredicate extends BinaryExpression return checkResult } // Additional check needed for collation compatibility -val rightCollationId: Int = right.dataType.asInstanceOf[StringType].collationId -if (collationId != rightCollationId) { - DataTypeMismatch( -errorSubClass = "COLLATION_MISMATCH", -messageParameters = Map( - "collationNameLeft" -> CollationFactory.fetchCollation(collationId).collationName, - "collationNameRight" -> CollationFactory.fetchCollation(rightCollationId).collationName -) - ) -} else { - TypeCheckResult.TypeCheckSuccess -} +val outputCollationId: Int = TypeCoercion Review Comment: also note that in: `final lazy val collationId: Int = left.dataType.asInstanceOf[StringType].collationId` `left` is chosen arbitrarily, and shouldn't be used to check whether a function supports that collation type _before_ checking whether `right` has the same collation -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-47210][SQL][COLLATION][WIP] Implicit casting on collated expressions [spark]
uros-db commented on code in PR #45383: URL: https://github.com/apache/spark/pull/45383#discussion_r1515609105 ## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/stringExpressions.scala: ## @@ -509,18 +509,10 @@ abstract class StringPredicate extends BinaryExpression return checkResult } // Additional check needed for collation compatibility -val rightCollationId: Int = right.dataType.asInstanceOf[StringType].collationId -if (collationId != rightCollationId) { - DataTypeMismatch( -errorSubClass = "COLLATION_MISMATCH", -messageParameters = Map( - "collationNameLeft" -> CollationFactory.fetchCollation(collationId).collationName, - "collationNameRight" -> CollationFactory.fetchCollation(rightCollationId).collationName -) - ) -} else { - TypeCheckResult.TypeCheckSuccess -} +val outputCollationId: Int = TypeCoercion Review Comment: I'd say `COLLATION_MISMATCH` first, `UNSUPPORTED_COLLATION.FOR_FUNCTION` second. If the user specifies COLLATION_1 for `left` and COLLATION_2 for `right`, how would we know which one to use when checking whether the functions supports this type of collation? (in this case, suppose a function supports COLLATION_1, but not COLLATION_2 - does the `UNSUPPORTED_COLLATION.FOR_FUNCTION` pass or fail?) Hence, I think we would first need to establish that COLLATION_1 and COLLATION_2 are the same (no `COLLATION_MISMATCH`), before checking whether the function supports the requested collation (no `UNSUPPORTED_COLLATION.FOR_FUNCTION`) -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-47210][SQL][COLLATION][WIP] Implicit casting on collated expressions [spark]
uros-db commented on code in PR #45383: URL: https://github.com/apache/spark/pull/45383#discussion_r1515609105 ## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/stringExpressions.scala: ## @@ -509,18 +509,10 @@ abstract class StringPredicate extends BinaryExpression return checkResult } // Additional check needed for collation compatibility -val rightCollationId: Int = right.dataType.asInstanceOf[StringType].collationId -if (collationId != rightCollationId) { - DataTypeMismatch( -errorSubClass = "COLLATION_MISMATCH", -messageParameters = Map( - "collationNameLeft" -> CollationFactory.fetchCollation(collationId).collationName, - "collationNameRight" -> CollationFactory.fetchCollation(rightCollationId).collationName -) - ) -} else { - TypeCheckResult.TypeCheckSuccess -} +val outputCollationId: Int = TypeCoercion Review Comment: I'd say `COLLATION_MISMATCH` first, `UNSUPPORTED_COLLATION.FOR_FUNCTION` second. If the user specifies COLLATION_1 for `left` and COLLATION_2 for `right`, how would we know which one to use when checking whether the functions supports this type of collation? (in this case, suppose a function supports COLLATION_1, but not COLLATION_2 - does the `UNSUPPORTED_COLLATION.FOR_FUNCTION` pass or fail?) Hence, I think we would first need to establish that COLLATION_1 and COLLATION_2 are the same (no `COLLATION_MISMATCH `), before checking whether the function supports the requested collation (no `UNSUPPORTED_COLLATION.FOR_FUNCTION`) -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-47302][SQL][Collation] Collate key word as identifier [spark]
uros-db commented on code in PR #45405: URL: https://github.com/apache/spark/pull/45405#discussion_r1515603380 ## sql/api/src/main/antlr4/org/apache/spark/sql/catalyst/parser/SqlBaseParser.g4: ## @@ -1096,7 +1096,7 @@ colPosition ; collateClause -: COLLATE collationName=stringLit +: COLLATE collationName=multipartIdentifier Review Comment: @cloud-fan related to your comment, I'm just wondering what would be a better rule for this (if any)? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-47302][SQL][Collation] Collate key word as identifier [spark]
uros-db commented on code in PR #45405: URL: https://github.com/apache/spark/pull/45405#discussion_r1515603380 ## sql/api/src/main/antlr4/org/apache/spark/sql/catalyst/parser/SqlBaseParser.g4: ## @@ -1096,7 +1096,7 @@ colPosition ; collateClause -: COLLATE collationName=stringLit +: COLLATE collationName=multipartIdentifier Review Comment: @cloud-fan related to your comment, I'm just wondering what would be a better rule for this? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-44259][CONNECT][TESTS] Make `connect-client-jvm` pass on Java 21 except `RemoteSparkSession`-based tests [spark]
LuciferYang commented on PR #41805: URL: https://github.com/apache/spark/pull/41805#issuecomment-1982513050 > Ya, @LuciferYang is right. > > To @Midhunpottammal , you need [SPARK-43831](https://issues.apache.org/jira/browse/SPARK-43831) for Java 21 support. @Midhunpottammal As @dongjoon-hyun said, all the relevant patches in [SPARK-43831](https://issues.apache.org/jira/browse/SPARK-43831) are needed for Java 21 support. So this is not a job that can be accomplished with minor changes on Spark 3.5. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-47305][SQL] Fix PruneFilters to tag the isStreaming flag of LocalRelation correctly when the plan has both batch and streaming [spark]
HeartSaVioR closed pull request #45406: [SPARK-47305][SQL] Fix PruneFilters to tag the isStreaming flag of LocalRelation correctly when the plan has both batch and streaming URL: https://github.com/apache/spark/pull/45406 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-44259][CONNECT][TESTS] Make `connect-client-jvm` pass on Java 21 except `RemoteSparkSession`-based tests [spark]
Midhunpottammal commented on PR #41805: URL: https://github.com/apache/spark/pull/41805#issuecomment-1982489254 @dongjoon-hyun @LuciferYang Thank you, I experimented with different versions of Java, Spark, and Arrow.I managed to get Arrow working with a lower version of Java in Spark 3.5.0. Here's my stack: > pyarrow==15.0.0 > pyspark==3.5.0 > java == Java(TM) SE Runtime Environment (build 17.0.10+11-LTS-240) When I try to move to Java version 21, I encounter the same error -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-47302][SQL][Collation] Collate key word as identifier [spark]
uros-db commented on code in PR #45405: URL: https://github.com/apache/spark/pull/45405#discussion_r1515587070 ## python/pyspark/sql/tests/test_types.py: ## @@ -862,15 +862,13 @@ def test_parse_datatype_string(self): if k != "varchar" and k != "char": self.assertEqual(t(), _parse_datatype_string(k)) self.assertEqual(IntegerType(), _parse_datatype_string("int")) -self.assertEqual(StringType(), _parse_datatype_string("string COLLATE 'UCS_BASIC'")) +self.assertEqual(StringType(), _parse_datatype_string("string COLLATE UCS_BASIC")) self.assertEqual(StringType(0), _parse_datatype_string("string")) -self.assertEqual(StringType(0), _parse_datatype_string("string COLLATE 'UCS_BASIC'")) -self.assertEqual(StringType(0), _parse_datatype_string("string COLLATE 'UCS_BASIC'")) -self.assertEqual(StringType(0), _parse_datatype_string("string COLLATE'UCS_BASIC'")) -self.assertEqual(StringType(1), _parse_datatype_string("string COLLATE 'UCS_BASIC_LCASE'")) -self.assertEqual(StringType(1), _parse_datatype_string("string COLLATE 'UCS_BASIC_LCASE'")) -self.assertEqual(StringType(2), _parse_datatype_string("string COLLATE 'UNICODE'")) -self.assertEqual(StringType(3), _parse_datatype_string("string COLLATE 'UNICODE_CI'")) +self.assertEqual(StringType(0), _parse_datatype_string("string COLLATE UCS_BASIC")) Review Comment: perhaps that would be best as a separate change? this one seems already scattered enough, and I think there's plenty of other places that may require changes w/ respect to naming -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-47302][SQL][Collation] Collate key word as identifier [spark]
uros-db commented on code in PR #45405: URL: https://github.com/apache/spark/pull/45405#discussion_r1515587070 ## python/pyspark/sql/tests/test_types.py: ## @@ -862,15 +862,13 @@ def test_parse_datatype_string(self): if k != "varchar" and k != "char": self.assertEqual(t(), _parse_datatype_string(k)) self.assertEqual(IntegerType(), _parse_datatype_string("int")) -self.assertEqual(StringType(), _parse_datatype_string("string COLLATE 'UCS_BASIC'")) +self.assertEqual(StringType(), _parse_datatype_string("string COLLATE UCS_BASIC")) self.assertEqual(StringType(0), _parse_datatype_string("string")) -self.assertEqual(StringType(0), _parse_datatype_string("string COLLATE 'UCS_BASIC'")) -self.assertEqual(StringType(0), _parse_datatype_string("string COLLATE 'UCS_BASIC'")) -self.assertEqual(StringType(0), _parse_datatype_string("string COLLATE'UCS_BASIC'")) -self.assertEqual(StringType(1), _parse_datatype_string("string COLLATE 'UCS_BASIC_LCASE'")) -self.assertEqual(StringType(1), _parse_datatype_string("string COLLATE 'UCS_BASIC_LCASE'")) -self.assertEqual(StringType(2), _parse_datatype_string("string COLLATE 'UNICODE'")) -self.assertEqual(StringType(3), _parse_datatype_string("string COLLATE 'UNICODE_CI'")) +self.assertEqual(StringType(0), _parse_datatype_string("string COLLATE UCS_BASIC")) Review Comment: perhaps that would be best as a separate change? this one seems already scattered enough -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-45827] Move data type checks to CreatableRelationProvider [spark]
cloud-fan commented on code in PR #45409: URL: https://github.com/apache/spark/pull/45409#discussion_r1515582120 ## sql/core/src/main/scala/org/apache/spark/sql/sources/interfaces.scala: ## @@ -175,6 +175,25 @@ trait CreatableRelationProvider { mode: SaveMode, parameters: Map[String, String], data: DataFrame): BaseRelation + + /** + * Check if the relation supports the given data type. + * + * @param dt Data type to check + * @return True if the data type is supported Review Comment: let's add `@since 4.0.0` -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-47305][SQL] Fix PruneFilters to tag the isStreaming flag of LocalRelation correctly when the plan has both batch and streaming [spark]
HeartSaVioR commented on PR #45406: URL: https://github.com/apache/spark/pull/45406#issuecomment-1982445057 Thanks! Merging to master/3.5/3.4. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-47248][SQL][COLLATION] Improved string function support: contains [spark]
uros-db commented on code in PR #45382: URL: https://github.com/apache/spark/pull/45382#discussion_r1515579768 ## common/unsafe/src/main/java/org/apache/spark/sql/catalyst/util/CollationFactory.java: ## @@ -145,6 +149,18 @@ public Collation( } } + /** + * Auxiliary methods for collation aware string operations. + */ + + public static StringSearch getStringSearch(final UTF8String left, final UTF8String right, + final int collationId) { Review Comment: definitely, that's in ([SPARK-47295](https://issues.apache.org/jira/browse/SPARK-47295)) and will serve as an onboarding task for a new eng (of course, I will help them out with this) -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[PR] [MINOR][TEXT] Clean up tempView for `createTempView` UT [spark]
wForget opened a new pull request, #45417: URL: https://github.com/apache/spark/pull/45417 ### What changes were proposed in this pull request? Clean up `tempView` for `createTempView` UT ### Why are the changes needed? `tempView` created in `createTempView` UT is not cleaned up. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? no need ### Was this patch authored or co-authored using generative AI tooling? No -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-46743][SQL] Count bug after constant folding [spark]
cloud-fan commented on code in PR #45125: URL: https://github.com/apache/spark/pull/45125#discussion_r1515561516 ## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/RewriteWithExpression.scala: ## @@ -34,7 +34,7 @@ import org.apache.spark.sql.catalyst.trees.TreePattern.{COMMON_EXPR_REF, WITH_EX */ object RewriteWithExpression extends Rule[LogicalPlan] { override def apply(plan: LogicalPlan): LogicalPlan = { -plan.transformWithPruning(_.containsPattern(WITH_EXPRESSION)) { + plan.transformDownWithSubqueriesAndPruning(_.containsPattern(WITH_EXPRESSION)) { Review Comment: Then this rule becomes O(n^2) complexity as every level of subqueries runs this rule for all subqueries under its level. Thinking about it more, why do we handle the count bug in two places that are far away? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-43124][SQL] Add ConvertCommandResultToLocalRelation rule [spark]
cloud-fan commented on code in PR #45397: URL: https://github.com/apache/spark/pull/45397#discussion_r1515557219 ## sql/core/src/main/scala/org/apache/spark/sql/catalyst/optimizer/ConvertCommandResultToLocalRelation.scala: ## @@ -0,0 +1,52 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.sql.catalyst.optimizer + +import org.apache.spark.sql.catalyst.expressions.{AttributeReference, Expression, IntegerLiteral, InterpretedMutableProjection, Predicate, Unevaluable} +import org.apache.spark.sql.catalyst.plans.logical.{CommandResult, Filter, Limit, LocalRelation, LogicalPlan, Project} +import org.apache.spark.sql.catalyst.rules.Rule +import org.apache.spark.sql.catalyst.trees.TreePattern.COMMAND_RESULT + +/** + * Converts local operations (i.e. ones that don't require data exchange) on `CommandResult` + * to `LocalRelation`. + */ +object ConvertCommandResultToLocalRelation extends Rule[LogicalPlan] { + + override def apply(plan: LogicalPlan): LogicalPlan = plan.transformWithPruning( +_.containsPattern(COMMAND_RESULT)) { +case Project(projectList, CommandResult(output, _, _, rows)) + if !projectList.exists(hasUnevaluableExpr) => + val projection = new InterpretedMutableProjection(projectList, output) + projection.initialize(0) + LocalRelation(projectList.map(_.toAttribute), rows.map(projection(_).copy())) + +case Limit(IntegerLiteral(limit), CommandResult(output, _, _, rows)) => + LocalRelation(output, rows.take(limit)) + +case Filter(condition, CommandResult(output, _, _, rows)) Review Comment: By looking at this rule, I'm on the fence now. The original target of `CommandResult` is for better UI support: https://github.com/apache/spark/commit/8013f985a4d07a948b0c22638314162819bfb2be e.g. , if you do `sql("show tables").filter(...)`, we do want to see a command result node under a filter node in the UI, even if it means extra jobs. I think certain DataFrame operations such as `df.show()`, `df.isEmpty` should just be exceptions. It looks like a single operation to users and we should not have extra jobs. But this should not be general to all operations on `CommandResult` cc @HyukjinKwon -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-43124][SQL] Add ConvertCommandResultToLocalRelation rule [spark]
ulysses-you commented on code in PR #45397: URL: https://github.com/apache/spark/pull/45397#discussion_r1515554208 ## sql/core/src/main/scala/org/apache/spark/sql/catalyst/optimizer/ConvertCommandResultToLocalRelation.scala: ## @@ -0,0 +1,52 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.sql.catalyst.optimizer + +import org.apache.spark.sql.catalyst.expressions.{AttributeReference, Expression, IntegerLiteral, InterpretedMutableProjection, Predicate, Unevaluable} +import org.apache.spark.sql.catalyst.plans.logical.{CommandResult, Filter, Limit, LocalRelation, LogicalPlan, Project} +import org.apache.spark.sql.catalyst.rules.Rule +import org.apache.spark.sql.catalyst.trees.TreePattern.COMMAND_RESULT + +/** + * Converts local operations (i.e. ones that don't require data exchange) on `CommandResult` + * to `LocalRelation`. + */ +object ConvertCommandResultToLocalRelation extends Rule[LogicalPlan] { Review Comment: Probably, we can add a new trait `LocalRelationConverable` to let `LocalRelation` and `CommandResult` inherit it. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-43124][SQL] Add ConvertCommandResultToLocalRelation rule [spark]
cloud-fan commented on code in PR #45397: URL: https://github.com/apache/spark/pull/45397#discussion_r1515551231 ## sql/core/src/main/scala/org/apache/spark/sql/catalyst/optimizer/ConvertCommandResultToLocalRelation.scala: ## @@ -0,0 +1,52 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.sql.catalyst.optimizer + +import org.apache.spark.sql.catalyst.expressions.{AttributeReference, Expression, IntegerLiteral, InterpretedMutableProjection, Predicate, Unevaluable} +import org.apache.spark.sql.catalyst.plans.logical.{CommandResult, Filter, Limit, LocalRelation, LogicalPlan, Project} +import org.apache.spark.sql.catalyst.rules.Rule +import org.apache.spark.sql.catalyst.trees.TreePattern.COMMAND_RESULT + +/** + * Converts local operations (i.e. ones that don't require data exchange) on `CommandResult` + * to `LocalRelation`. + */ +object ConvertCommandResultToLocalRelation extends Rule[LogicalPlan] { Review Comment: oh I see! -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-45827] Move data type checks to CreatableRelationProvider [spark]
cashmand commented on PR #45409: URL: https://github.com/apache/spark/pull/45409#issuecomment-1982352373 @cloud-fan Thanks for the review! I've updated with your feedback. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-47238][SQL] Reduce executor memory usage by making generated code in WSCG a broadcast variable [spark]
cloud-fan closed pull request #45348: [SPARK-47238][SQL] Reduce executor memory usage by making generated code in WSCG a broadcast variable URL: https://github.com/apache/spark/pull/45348 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-47238][SQL] Reduce executor memory usage by making generated code in WSCG a broadcast variable [spark]
cloud-fan commented on PR #45348: URL: https://github.com/apache/spark/pull/45348#issuecomment-1982346069 thanks, merging to master! -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-45827] Move data type checks to CreatableRelationProvider [spark]
cashmand commented on code in PR #45409: URL: https://github.com/apache/spark/pull/45409#discussion_r1515515440 ## sql/core/src/main/scala/org/apache/spark/sql/sources/interfaces.scala: ## @@ -175,6 +175,32 @@ trait CreatableRelationProvider { mode: SaveMode, parameters: Map[String, String], data: DataFrame): BaseRelation + + /** + * Check if the relation supports the given data type. + * + * @param dt Data type to check + * @return True if the data type is supported + */ + def supportsDataType( + dt: DataType + ): Boolean = { +dt match { + case ArrayType(e, _) => supportsDataType(e) + case MapType(k, v, _) => +supportsDataType(k) && supportsDataType(v) + case StructType(fields) => fields.forall(f => supportsDataType(f.dataType)) + case udt: UserDefinedType[_] => supportsDataType(udt.sqlType) + case _: AnsiIntervalType | CalendarIntervalType | VariantType => false + case BinaryType | BooleanType | ByteType | CalendarIntervalType | CharType(_) | DateType | + DayTimeIntervalType(_, _) | _ : DecimalType | DoubleType | FloatType | Review Comment: Oh, that was an accident. I'll remove from this list. I can have it only list supported types. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-47007][SQL][PYTHON][R][CONNECT] MapSort function [spark]
zhengruifeng commented on PR #45069: URL: https://github.com/apache/spark/pull/45069#issuecomment-1982325384 update the title since it also touch python/r/connect -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-46812][CONNECT][PYTHON] Make mapInPandas / mapInArrow support ResourceProfile [spark]
wbo4958 commented on PR #45232: URL: https://github.com/apache/spark/pull/45232#issuecomment-1982322729 > > Does this PR introduce any user-facing change? > > Yes, Users can pass ResourceProfile to mapInPandas/mapInArrow through the connect pysprark client. > > I think you are adding the ResourceProfile api to spark connect for anything to use, correct? I guess since Spark Connect doesn't have SparkContext support its only usable by these apis? It would be nice to have more info in the description about what you are adding and if they match the current python api's exactly? There are no APIs changed/added for ResourceProfile, this PR just changed ResourceProfile internally to support Spark Connect. Like, [here](https://github.com/apache/spark/pull/45232/files/30531aeb7b6eb5f2235f1dabf7b568f3cf49d841#diff-245bd4d0d3aea466e32424806642d225ffa4f07e75ef1d45f7fd7a68fbb3e4c3R118-R141) [here](https://github.com/apache/spark/pull/45232/files/30531aeb7b6eb5f2235f1dabf7b568f3cf49d841#diff-e2439904a861b6dd69efec067e87ff43070e3b673911ab286aca382897954c2bR167-R469) -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-46812][CONNECT][PYTHON] Make mapInPandas / mapInArrow support ResourceProfile [spark]
wbo4958 commented on code in PR #45232: URL: https://github.com/apache/spark/pull/45232#discussion_r1515490084 ## dev/sparktestsupport/modules.py: ## @@ -554,6 +554,7 @@ def __hash__(self): "pyspark.resource.profile", # unittests "pyspark.resource.tests.test_resources", +"pyspark.resource.tests.test_connect_resources", Review Comment: This pull request already includes [pyspark.sql.tests.connect.test_resources](https://github.com/apache/spark/pull/45232/files/30531aeb7b6eb5f2235f1dabf7b568f3cf49d841#diff-1ebc67a99155114aa6bced912ce7942956ce784eeb59aadab75b82814684bd27R1031) to test the general mapInPandas/mapInArrow functionality with ResourceProfile. On the other hand, `pyspark.resource.tests.test_connect_resources` is specifically for testing special cases like creating a ResourceProfile before establishing a remote session. Therefore, it seems appropriate to keep the tests in their respective locations. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-46812][CONNECT][PYTHON] Make mapInPandas / mapInArrow support ResourceProfile [spark]
wbo4958 commented on code in PR #45232: URL: https://github.com/apache/spark/pull/45232#discussion_r1515485716 ## python/pyspark/sql/connect/resource/profile.py: ## @@ -0,0 +1,69 @@ +# +# Licensed to the Apache Software Foundation (ASF) under one or more +# contributor license agreements. See the NOTICE file distributed with +# this work for additional information regarding copyright ownership. +# The ASF licenses this file to You under the Apache License, Version 2.0 +# (the "License"); you may not use this file except in compliance with +# the License. You may obtain a copy of the License at +# +#http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +# + +from typing import Optional, Dict + +from pyspark.resource import ExecutorResourceRequest, TaskResourceRequest + +import pyspark.sql.connect.proto as pb2 + + +class ResourceProfile: Review Comment: This ResourceProfile is not user-facing; it is primarily used internally to create the resource profile on the server side and retrieve the associated resource profile ID. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-47311][SQL][PYTHON] Suppress Python exceptions where PySpark is not in the Python path [spark]
HyukjinKwon closed pull request #45414: [SPARK-47311][SQL][PYTHON] Suppress Python exceptions where PySpark is not in the Python path URL: https://github.com/apache/spark/pull/45414 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-47311][SQL][PYTHON] Suppress Python exceptions where PySpark is not in the Python path [spark]
HyukjinKwon commented on PR #45414: URL: https://github.com/apache/spark/pull/45414#issuecomment-1982313126 Merged to master. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-47276][PYTHON][CONNECT] Introduce `spark.profile.clear` for SparkSession-based profiling [spark]
zhengruifeng commented on code in PR #45378: URL: https://github.com/apache/spark/pull/45378#discussion_r1515475836 ## python/pyspark/sql/tests/test_session.py: ## @@ -531,6 +531,33 @@ def test_dump_invalid_type(self): }, ) +def test_clear_memory_type(self): Review Comment: nit, it seems we don't have a parity test for `test_session`. does it make sense to move `SparkSessionProfileTests` out of `test_session` and add parity test for it? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-46988][CONNECT][TESTS] Add tests for `map` type in `ProtoUtils.abbreviate` [spark]
zhengruifeng commented on PR #45413: URL: https://github.com/apache/spark/pull/45413#issuecomment-1982294025 merged to master -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-46988][CONNECT][TESTS] Add tests for `map` type in `ProtoUtils.abbreviate` [spark]
zhengruifeng closed pull request #45413: [SPARK-46988][CONNECT][TESTS] Add tests for `map` type in `ProtoUtils.abbreviate` URL: https://github.com/apache/spark/pull/45413 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-46812][CONNECT][PYTHON] Make mapInPandas / mapInArrow support ResourceProfile [spark]
zhengruifeng commented on code in PR #45232: URL: https://github.com/apache/spark/pull/45232#discussion_r1515455169 ## python/pyspark/sql/connect/resource/profile.py: ## @@ -0,0 +1,69 @@ +# +# Licensed to the Apache Software Foundation (ASF) under one or more +# contributor license agreements. See the NOTICE file distributed with +# this work for additional information regarding copyright ownership. +# The ASF licenses this file to You under the Apache License, Version 2.0 +# (the "License"); you may not use this file except in compliance with +# the License. You may obtain a copy of the License at +# +#http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +# + +from typing import Optional, Dict + +from pyspark.resource import ExecutorResourceRequest, TaskResourceRequest + +import pyspark.sql.connect.proto as pb2 + + +class ResourceProfile: Review Comment: is this class user-facing? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-47314][DOC] Correct the `ExternalSorter#writePartitionedMapOutput` method comment [spark]
zwangsheng commented on PR #45415: URL: https://github.com/apache/spark/pull/45415#issuecomment-1982290244 Hi @mridulm @yaooqinn @mccheah, aims to correct the comment, please take a review, thanks. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-46812][CONNECT][PYTHON] Make mapInPandas / mapInArrow support ResourceProfile [spark]
zhengruifeng commented on code in PR #45232: URL: https://github.com/apache/spark/pull/45232#discussion_r1515448887 ## dev/sparktestsupport/modules.py: ## @@ -554,6 +554,7 @@ def __hash__(self): "pyspark.resource.profile", # unittests "pyspark.resource.tests.test_resources", +"pyspark.resource.tests.test_connect_resources", Review Comment: this test is for spark connect, I think we should move it to Module `pyspark_connect`? maybe we can move the test cases in it to `pyspark.sql.tests.connect.test_resources`? ## dev/sparktestsupport/modules.py: ## @@ -554,6 +554,7 @@ def __hash__(self): "pyspark.resource.profile", # unittests "pyspark.resource.tests.test_resources", +"pyspark.resource.tests.test_connect_resources", Review Comment: this test is for spark connect, I think we should move it to Module `pyspark_connect`? or maybe we can move the test cases in it to `pyspark.sql.tests.connect.test_resources`? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[PR] [WIP] align gihub workflow run ubuntu version [spark]
panbingkun opened a new pull request, #45416: URL: https://github.com/apache/spark/pull/45416 ### What changes were proposed in this pull request? ### Why are the changes needed? ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? ### Was this patch authored or co-authored using generative AI tooling? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-45827] Move data type checks to CreatableRelationProvider [spark]
cloud-fan commented on code in PR #45409: URL: https://github.com/apache/spark/pull/45409#discussion_r1515426129 ## sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/SaveIntoDataSourceCommandSuite.scala: ## @@ -71,6 +71,39 @@ class SaveIntoDataSourceCommandSuite extends QueryTest with SharedSparkSession { FakeV1DataSource.data = null } + + test("Data type support") { + +val dataSource = DataSource( + sparkSession = spark, + className = "jdbc", + partitionColumns = Nil, + options = Map()) + +val df = spark.range(1).selectExpr( +"cast('a' as binary) a", "true b", "cast(1 as byte) c", "1.23 d") +dataSource.planForWriting(SaveMode.ErrorIfExists, df.logicalPlan) + +withSQLConf("spark.databricks.variant.enabled" -> "true") { + // Variant and Interval types are disallowed by default. + val unsupportedTypes = Seq( + "parse_json('1') v", + "array(parse_json('1'))", + "struct(1, parse_json('1')) s", + "map(1, parse_json('1')) s", + "INTERVAL '1' MONTH i", + "make_ym_interval(1, 2) ym", + "make_dt_interval(1, 2, 3, 4) dt") + + unsupportedTypes.foreach { expr => +val df = spark.range(1).selectExpr(expr) +val e = intercept[AnalysisException] { + dataSource.planForWriting(SaveMode.ErrorIfExists, df.logicalPlan) +} +assert(e.getMessage.contains("UNSUPPORTED_DATA_TYPE_FOR_DATASOURCE")) Review Comment: let's use `checkError` for it. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-45827] Move data type checks to CreatableRelationProvider [spark]
cloud-fan commented on code in PR #45409: URL: https://github.com/apache/spark/pull/45409#discussion_r1515423962 ## sql/core/src/main/scala/org/apache/spark/sql/sources/interfaces.scala: ## @@ -175,6 +175,32 @@ trait CreatableRelationProvider { mode: SaveMode, parameters: Map[String, String], data: DataFrame): BaseRelation + + /** + * Check if the relation supports the given data type. + * + * @param dt Data type to check + * @return True if the data type is supported + */ + def supportsDataType( + dt: DataType + ): Boolean = { +dt match { + case ArrayType(e, _) => supportsDataType(e) + case MapType(k, v, _) => +supportsDataType(k) && supportsDataType(v) + case StructType(fields) => fields.forall(f => supportsDataType(f.dataType)) + case udt: UserDefinedType[_] => supportsDataType(udt.sqlType) + case _: AnsiIntervalType | CalendarIntervalType | VariantType => false + case BinaryType | BooleanType | ByteType | CalendarIntervalType | CharType(_) | DateType | + DayTimeIntervalType(_, _) | _ : DecimalType | DoubleType | FloatType | Review Comment: It's confusing to have the interval types in both the true and false case matches. Shall we stick with the allowlist approach and only list the supported types? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-45827] Move data type checks to CreatableRelationProvider [spark]
cloud-fan commented on code in PR #45409: URL: https://github.com/apache/spark/pull/45409#discussion_r1515421569 ## sql/core/src/main/scala/org/apache/spark/sql/sources/interfaces.scala: ## @@ -175,6 +175,32 @@ trait CreatableRelationProvider { mode: SaveMode, parameters: Map[String, String], data: DataFrame): BaseRelation + + /** + * Check if the relation supports the given data type. + * + * @param dt Data type to check + * @return True if the data type is supported + */ + def supportsDataType( + dt: DataType + ): Boolean = { +dt match { + case ArrayType(e, _) => supportsDataType(e) + case MapType(k, v, _) => +supportsDataType(k) && supportsDataType(v) Review Comment: ```suggestion case MapType(k, v, _) => supportsDataType(k) && supportsDataType(v) ``` -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-45827] Move data type checks to CreatableRelationProvider [spark]
cloud-fan commented on code in PR #45409: URL: https://github.com/apache/spark/pull/45409#discussion_r1515421350 ## sql/core/src/main/scala/org/apache/spark/sql/sources/interfaces.scala: ## @@ -175,6 +175,32 @@ trait CreatableRelationProvider { mode: SaveMode, parameters: Map[String, String], data: DataFrame): BaseRelation + + /** + * Check if the relation supports the given data type. + * + * @param dt Data type to check + * @return True if the data type is supported + */ + def supportsDataType( + dt: DataType + ): Boolean = { Review Comment: ```suggestion def supportsDataType(dt: DataType): Boolean = { ``` -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[PR] [SPARK-47314][DOC] Correct the `ExternalSorter#writePartitionedMapOutput` method comment [spark]
zwangsheng opened a new pull request, #45415: URL: https://github.com/apache/spark/pull/45415 ### What changes were proposed in this pull request? Correct the comment of `ExternalSorter#writePartitionedMapOutput`. `ExternalSorter#writePartitionedMapOutput` return nothing, and will update the partition's length when call partition pair writer to close. ### Why are the changes needed? Correct comment. ### Does this PR introduce _any_ user-facing change? No, developers will meet this change in source code. ### How was this patch tested? No need. ### Was this patch authored or co-authored using generative AI tooling? No -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-47238][SQL] Reduce executor memory usage by making generated code in WSCG a broadcast variable [spark]
jwang0306 commented on code in PR #45348: URL: https://github.com/apache/spark/pull/45348#discussion_r1514792504 ## sql/core/src/test/scala/org/apache/spark/sql/execution/WholeStageCodegenSuite.scala: ## @@ -899,4 +900,28 @@ class WholeStageCodegenSuite extends QueryTest with SharedSparkSession } } } + + test("SPARK-47238: Test broadcast threshold for generated code") { +// case 1: threshold is -1, shouldn't broadcast since smaller than 0 means disabled. +// case 2: threshold is a larger number, shouldn't broadcast since not yet exceeded. +// case 3: threshold is 0, should broadcast since it's always smaller than generated code size. +Seq((-1, false), (10, false), (0, true)).foreach { case (threshold, shouldBroadcast) => + withSQLConf(SQLConf.WHOLESTAGE_BROADCAST_CLEANED_SOURCE_THRESHOLD.key -> threshold.toString) { +val df = Seq(0, 1, 2).toDF().groupBy("value").sum() +// Invoke tryBroadcastCleanedSource and make sure it returns the desired variables. Review Comment: Good idea, otherwise I was even thinking of logging something to validate the broadcast. However, due to the way `evaluatorFactory` and `cleanedSourceOpt` are declared, they are not directly accessible, and thus we would need to retrieve them using reflection (unless we'd rather directly modify the visibility of the constructor fields). -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-47238][SQL] Reduce executor memory usage by making generated code in WSCG a broadcast variable [spark]
jwang0306 commented on code in PR #45348: URL: https://github.com/apache/spark/pull/45348#discussion_r1514792504 ## sql/core/src/test/scala/org/apache/spark/sql/execution/WholeStageCodegenSuite.scala: ## @@ -899,4 +900,28 @@ class WholeStageCodegenSuite extends QueryTest with SharedSparkSession } } } + + test("SPARK-47238: Test broadcast threshold for generated code") { +// case 1: threshold is -1, shouldn't broadcast since smaller than 0 means disabled. +// case 2: threshold is a larger number, shouldn't broadcast since not yet exceeded. +// case 3: threshold is 0, should broadcast since it's always smaller than generated code size. +Seq((-1, false), (10, false), (0, true)).foreach { case (threshold, shouldBroadcast) => + withSQLConf(SQLConf.WHOLESTAGE_BROADCAST_CLEANED_SOURCE_THRESHOLD.key -> threshold.toString) { +val df = Seq(0, 1, 2).toDF().groupBy("value").sum() +// Invoke tryBroadcastCleanedSource and make sure it returns the desired variables. Review Comment: Good idea, otherwise I was even thinking of logging something to validate the broadcast. However, due to the way `evaluatorFactory` and `cleanedSourceOpt` are declared, they are not directly accessible, and thus we would need to retrieve them using reflection (unless we'd rather directly modify the visibility of the constructor definition). -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-47241][SQL] Fix rule order issues for ExtractGenerator [spark]
cloud-fan commented on code in PR #45350: URL: https://github.com/apache/spark/pull/45350#discussion_r1515378900 ## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala: ## @@ -2876,28 +2876,36 @@ class Analyzer(override val catalogManager: CatalogManager) extends RuleExecutor } } +// We must wait until all expressions except for generator functions are resolved before +// rewriting generator functions in Project/Aggregate. This is necessary to make this rule +// stable for different execution orders of analyzer rules. See also SPARK-47241. +private def canRewriteGenerator(namedExprs: Seq[NamedExpression]): Boolean = { + namedExprs.forall { ne => +ne.resolved || { + trimNonTopLevelAliases(ne) match { +case AliasedGenerator(_, _, _) => true +case _ => false + } +} + } +} + def apply(plan: LogicalPlan): LogicalPlan = plan.resolveOperatorsUpWithPruning( _.containsPattern(GENERATOR), ruleId) { case Project(projectList, _) if projectList.exists(hasNestedGenerator) => val nestedGenerator = projectList.find(hasNestedGenerator).get throw QueryCompilationErrors.nestedGeneratorError(trimAlias(nestedGenerator)) - case Project(projectList, _) if projectList.count(hasGenerator) > 1 => -val generators = projectList.filter(hasGenerator).map(trimAlias) -throw QueryCompilationErrors.moreThanOneGeneratorError(generators, "SELECT") - case Aggregate(_, aggList, _) if aggList.exists(hasNestedGenerator) => val nestedGenerator = aggList.find(hasNestedGenerator).get throw QueryCompilationErrors.nestedGeneratorError(trimAlias(nestedGenerator)) case Aggregate(_, aggList, _) if aggList.count(hasGenerator) > 1 => val generators = aggList.filter(hasGenerator).map(trimAlias) -throw QueryCompilationErrors.moreThanOneGeneratorError(generators, "aggregate") +throw QueryCompilationErrors.moreThanOneGeneratorError(generators) Review Comment: Another reason is we can't always figure out if it's aggregate or not. If there is no GROUP BY, the plan is still `Project` and we may fail before analyzer rewrite it to `Aggregate`, then we report `SELECT clause` anyway. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-47241][SQL] Fix rule order issues for ExtractGenerator [spark]
viirya commented on code in PR #45350: URL: https://github.com/apache/spark/pull/45350#discussion_r1515378800 ## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala: ## @@ -2876,28 +2876,36 @@ class Analyzer(override val catalogManager: CatalogManager) extends RuleExecutor } } +// We must wait until all expressions except for generator functions are resolved before +// rewriting generator functions in Project/Aggregate. This is necessary to make this rule +// stable for different execution orders of analyzer rules. See also SPARK-47241. +private def canRewriteGenerator(namedExprs: Seq[NamedExpression]): Boolean = { + namedExprs.forall { ne => +ne.resolved || { + trimNonTopLevelAliases(ne) match { +case AliasedGenerator(_, _, _) => true +case _ => false + } +} + } +} + def apply(plan: LogicalPlan): LogicalPlan = plan.resolveOperatorsUpWithPruning( _.containsPattern(GENERATOR), ruleId) { case Project(projectList, _) if projectList.exists(hasNestedGenerator) => val nestedGenerator = projectList.find(hasNestedGenerator).get throw QueryCompilationErrors.nestedGeneratorError(trimAlias(nestedGenerator)) - case Project(projectList, _) if projectList.count(hasGenerator) > 1 => -val generators = projectList.filter(hasGenerator).map(trimAlias) -throw QueryCompilationErrors.moreThanOneGeneratorError(generators, "SELECT") - case Aggregate(_, aggList, _) if aggList.exists(hasNestedGenerator) => val nestedGenerator = aggList.find(hasNestedGenerator).get throw QueryCompilationErrors.nestedGeneratorError(trimAlias(nestedGenerator)) case Aggregate(_, aggList, _) if aggList.count(hasGenerator) > 1 => val generators = aggList.filter(hasGenerator).map(trimAlias) -throw QueryCompilationErrors.moreThanOneGeneratorError(generators, "aggregate") +throw QueryCompilationErrors.moreThanOneGeneratorError(generators) Review Comment: Hmm, okay. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-47241][SQL] Fix rule order issues for ExtractGenerator [spark]
cloud-fan commented on code in PR #45350: URL: https://github.com/apache/spark/pull/45350#discussion_r1515378900 ## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala: ## @@ -2876,28 +2876,36 @@ class Analyzer(override val catalogManager: CatalogManager) extends RuleExecutor } } +// We must wait until all expressions except for generator functions are resolved before +// rewriting generator functions in Project/Aggregate. This is necessary to make this rule +// stable for different execution orders of analyzer rules. See also SPARK-47241. +private def canRewriteGenerator(namedExprs: Seq[NamedExpression]): Boolean = { + namedExprs.forall { ne => +ne.resolved || { + trimNonTopLevelAliases(ne) match { +case AliasedGenerator(_, _, _) => true +case _ => false + } +} + } +} + def apply(plan: LogicalPlan): LogicalPlan = plan.resolveOperatorsUpWithPruning( _.containsPattern(GENERATOR), ruleId) { case Project(projectList, _) if projectList.exists(hasNestedGenerator) => val nestedGenerator = projectList.find(hasNestedGenerator).get throw QueryCompilationErrors.nestedGeneratorError(trimAlias(nestedGenerator)) - case Project(projectList, _) if projectList.count(hasGenerator) > 1 => -val generators = projectList.filter(hasGenerator).map(trimAlias) -throw QueryCompilationErrors.moreThanOneGeneratorError(generators, "SELECT") - case Aggregate(_, aggList, _) if aggList.exists(hasNestedGenerator) => val nestedGenerator = aggList.find(hasNestedGenerator).get throw QueryCompilationErrors.nestedGeneratorError(trimAlias(nestedGenerator)) case Aggregate(_, aggList, _) if aggList.count(hasGenerator) > 1 => val generators = aggList.filter(hasGenerator).map(trimAlias) -throw QueryCompilationErrors.moreThanOneGeneratorError(generators, "aggregate") +throw QueryCompilationErrors.moreThanOneGeneratorError(generators) Review Comment: Another reason is we can't always figure out if it's aggregate or nor. If there is no GROUP BY, the plan is still `Project` and we may fail before analyzer rewrite it to `Aggregate`, then we report `SELECT clause` anyway. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-47241][SQL] Fix rule order issues for ExtractGenerator [spark]
cloud-fan commented on code in PR #45350: URL: https://github.com/apache/spark/pull/45350#discussion_r1515379202 ## sql/core/src/test/scala/org/apache/spark/sql/GeneratorFunctionSuite.scala: ## @@ -553,6 +552,32 @@ class GeneratorFunctionSuite extends QueryTest with SharedSparkSession { checkAnswer(df, Row(0.7604953758285915d)) } } + + test("SPARK-47241: two generator functions in SELECT") { +def testTwoGenerators(needImplicitCast: Boolean): Unit = { + val df = sql( +s""" + |SELECT + |explode(array('a', 'b')) as c1, + |explode(array(0L, ${if (needImplicitCast) "0L + 1" else "1L"})) as c2 + |""".stripMargin) + checkAnswer(df, Seq(Row("a", 0L), Row("a", 1L), Row("b", 0L), Row("b", 1L))) +} +testTwoGenerators(needImplicitCast = true) +testTwoGenerators(needImplicitCast = false) + } + + test("SPARK-47241: generator function after SELECT *") { Review Comment: ```suggestion test("SPARK-47241: generator function after wildcard in SELECT") { ``` -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-47241][SQL] Fix rule order issues for ExtractGenerator [spark]
cloud-fan commented on code in PR #45350: URL: https://github.com/apache/spark/pull/45350#discussion_r1515379047 ## sql/core/src/test/scala/org/apache/spark/sql/GeneratorFunctionSuite.scala: ## @@ -553,6 +552,32 @@ class GeneratorFunctionSuite extends QueryTest with SharedSparkSession { checkAnswer(df, Row(0.7604953758285915d)) } } + + test("SPARK-47241: two generator functions in SELECT") { +def testTwoGenerators(needImplicitCast: Boolean): Unit = { + val df = sql( +s""" + |SELECT + |explode(array('a', 'b')) as c1, + |explode(array(0L, ${if (needImplicitCast) "0L + 1" else "1L"})) as c2 + |""".stripMargin) + checkAnswer(df, Seq(Row("a", 0L), Row("a", 1L), Row("b", 0L), Row("b", 1L))) +} +testTwoGenerators(needImplicitCast = true) +testTwoGenerators(needImplicitCast = false) + } + + test("SPARK-47241: generator function after SELECT *") { +val df = sql( + s""" + |SELECT *, explode(array('a', 'b')) as c1 + |FROM + |( + | SELECT id FROM range(1) GROUP BY 1 + |) + |""".stripMargin) +checkAnswer(df, Seq(Row(0, "a"), Row(0, "b"))) + } Review Comment: good suggestion! -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-47241][SQL] Fix rule order issues for ExtractGenerator [spark]
cloud-fan commented on code in PR #45350: URL: https://github.com/apache/spark/pull/45350#discussion_r1515378012 ## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala: ## @@ -2876,28 +2876,36 @@ class Analyzer(override val catalogManager: CatalogManager) extends RuleExecutor } } +// We must wait until all expressions except for generator functions are resolved before +// rewriting generator functions in Project/Aggregate. This is necessary to make this rule +// stable for different execution orders of analyzer rules. See also SPARK-47241. +private def canRewriteGenerator(namedExprs: Seq[NamedExpression]): Boolean = { + namedExprs.forall { ne => +ne.resolved || { + trimNonTopLevelAliases(ne) match { +case AliasedGenerator(_, _, _) => true +case _ => false + } +} + } +} + def apply(plan: LogicalPlan): LogicalPlan = plan.resolveOperatorsUpWithPruning( _.containsPattern(GENERATOR), ruleId) { case Project(projectList, _) if projectList.exists(hasNestedGenerator) => val nestedGenerator = projectList.find(hasNestedGenerator).get throw QueryCompilationErrors.nestedGeneratorError(trimAlias(nestedGenerator)) - case Project(projectList, _) if projectList.count(hasGenerator) > 1 => -val generators = projectList.filter(hasGenerator).map(trimAlias) -throw QueryCompilationErrors.moreThanOneGeneratorError(generators, "SELECT") - case Aggregate(_, aggList, _) if aggList.exists(hasNestedGenerator) => val nestedGenerator = aggList.find(hasNestedGenerator).get throw QueryCompilationErrors.nestedGeneratorError(trimAlias(nestedGenerator)) case Aggregate(_, aggList, _) if aggList.count(hasGenerator) > 1 => val generators = aggList.filter(hasGenerator).map(trimAlias) -throw QueryCompilationErrors.moreThanOneGeneratorError(generators, "aggregate") +throw QueryCompilationErrors.moreThanOneGeneratorError(generators) Review Comment: I find `aggregate clause` confusing, as what end users write is a `SELECT` query with GROUP BY or aggregate functions. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-47070][SQL][FOLLOW-UP] Add a flag guarding a subquery in aggregate rewrite [spark]
HyukjinKwon closed pull request #45412: [SPARK-47070][SQL][FOLLOW-UP] Add a flag guarding a subquery in aggregate rewrite URL: https://github.com/apache/spark/pull/45412 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-47070][SQL][FOLLOW-UP] Add a flag guarding a subquery in aggregate rewrite [spark]
HyukjinKwon commented on PR #45412: URL: https://github.com/apache/spark/pull/45412#issuecomment-1982197769 Merged to master. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-47070][SQL][FOLLOW-UP] Add a flag guarding a subquery in aggregate rewrite [spark]
zhengruifeng commented on code in PR #45412: URL: https://github.com/apache/spark/pull/45412#discussion_r1515374917 ## sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala: ## @@ -3509,6 +3509,17 @@ object SQLConf { .booleanConf .createWithDefault(true) + val WRAP_EXISTS_IN_AGGREGATE_FUNCTION = +buildConf("spark.sql.optimizer.wrapExistsInAggregateFunction") + .internal() + .doc("When true, the optimizer will wrap newly introduced `exists` attributes in an" + + "aggregate function to ensure that Aggregate nodes preserve semantic invariant that each" + + "variable among agg expressions appears either in grouping expressions or belongs to" + Review Comment: nit, missing spaces ```suggestion .doc("When true, the optimizer will wrap newly introduced `exists` attributes in an " + "aggregate function to ensure that Aggregate nodes preserve semantic invariant that each " + "variable among agg expressions appears either in grouping expressions or belongs to " + ``` -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-47238][SQL] Reduce executor memory usage by making generated code in WSCG a broadcast variable [spark]
jwang0306 commented on code in PR #45348: URL: https://github.com/apache/spark/pull/45348#discussion_r1514792504 ## sql/core/src/test/scala/org/apache/spark/sql/execution/WholeStageCodegenSuite.scala: ## @@ -899,4 +900,28 @@ class WholeStageCodegenSuite extends QueryTest with SharedSparkSession } } } + + test("SPARK-47238: Test broadcast threshold for generated code") { +// case 1: threshold is -1, shouldn't broadcast since smaller than 0 means disabled. +// case 2: threshold is a larger number, shouldn't broadcast since not yet exceeded. +// case 3: threshold is 0, should broadcast since it's always smaller than generated code size. +Seq((-1, false), (10, false), (0, true)).foreach { case (threshold, shouldBroadcast) => + withSQLConf(SQLConf.WHOLESTAGE_BROADCAST_CLEANED_SOURCE_THRESHOLD.key -> threshold.toString) { +val df = Seq(0, 1, 2).toDF().groupBy("value").sum() +// Invoke tryBroadcastCleanedSource and make sure it returns the desired variables. Review Comment: Good idea, otherwise I was even thinking of logging something to validate the broadcast. However, due to the way `evaluatorFactory` and `cleanedSourceOpt` are declared, they are not directly accessible, and thus we would need to retrieve them using reflection (unless we'd rather directly modify the visibility of the constructor fields). -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-47070][SQL][FOLLOW-UP] Add a flag guarding a subquery in aggregate rewrite [spark]
zhengruifeng commented on code in PR #45412: URL: https://github.com/apache/spark/pull/45412#discussion_r1515374917 ## sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala: ## @@ -3509,6 +3509,17 @@ object SQLConf { .booleanConf .createWithDefault(true) + val WRAP_EXISTS_IN_AGGREGATE_FUNCTION = +buildConf("spark.sql.optimizer.wrapExistsInAggregateFunction") + .internal() + .doc("When true, the optimizer will wrap newly introduced `exists` attributes in an" + + "aggregate function to ensure that Aggregate nodes preserve semantic invariant that each" + + "variable among agg expressions appears either in grouping expressions or belongs to" + Review Comment: nit, missing spaces ```suggestion .doc("When true, the optimizer will wrap newly introduced `exists` attributes in an " + "aggregate function to ensure that Aggregate nodes preserve semantic invariant that each " + "variable among agg expressions appears either in grouping expressions or belongs to " + ``` -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-47311][SQL][PYTHON] Suppress Python exceptions where PySpark is not in the Python path [spark]
HyukjinKwon commented on PR #45414: URL: https://github.com/apache/spark/pull/45414#issuecomment-1982190942 cc @cloud-fan @allisonwang-db -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[PR] [SPARK-47311][SQL][PYTHON] Suppress Python exceptions where PySpark is not in the Python path [spark]
HyukjinKwon opened a new pull request, #45414: URL: https://github.com/apache/spark/pull/45414 ### What changes were proposed in this pull request? This PR proposes to suppress Python exceptions where PySpark is not in the Python path ### Why are the changes needed? `pyspark` library itself might be missing when users run Scala/Java and R Spark applications. In that case, this warning messages might too nosiy. ### Does this PR introduce _any_ user-facing change? Yes, it will hide the warning message when a user meet all conditions below: - Use Scala or R only Spark application - Have a Python, but the Python does not have have PySpark in their Python path - Either because `SPARK_HOME` is undefined for some reasons, - Or, by other environment problems. ### How was this patch tested? Manually tested. ### Was this patch authored or co-authored using generative AI tooling? No. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-47276][PYTHON][CONNECT] Introduce `spark.profile.clear` for SparkSession-based profiling [spark]
xinrong-meng commented on code in PR #45378: URL: https://github.com/apache/spark/pull/45378#discussion_r1515363804 ## python/pyspark/sql/profiler.py: ## @@ -236,18 +236,22 @@ def clear_perf_profiles(self, id: Optional[int] = None) -> None: The UDF ID whose profiling results should be cleared. If not specified, all the results will be cleared. """ -ids_to_remove = [ -result_id -for result_id, (perf, _, *_) in self._profile_results.items() -if perf is not None -] with self._lock: if id is not None: -if id in ids_to_remove: -self._profile_results.pop(id, None) +if id in self._profile_results: +perf, mem, *rest = self._profile_results[id] +self._profile_results[id] = (None, mem, *rest) +if mem is None: +self._profile_results.pop(id, None) else: -for id_to_remove in ids_to_remove: -self._profile_results.pop(id_to_remove, None) +ids_to_remove = [] +for id, (perf, mem, *rest) in list(self._profile_results.items()): +self._profile_results[id] = (None, mem, *rest) +if mem is None: +ids_to_remove.append(id) Review Comment: Good idea! Adjusted. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[PR] [SPARK-46988][CONNECT][TESTS] Add tests for `map` type in `ProtoUtils.abbreviate` [spark]
zhengruifeng opened a new pull request, #45413: URL: https://github.com/apache/spark/pull/45413 ### What changes were proposed in this pull request? Add tests for map types in `ProtoUtils.abbreviate` ### Why are the changes needed? when I started working on `SPARK-46988` to make `ProtoUtils.abbreviate` support `map` type, I just found that it had already been supported when we supported `repeated Message`. A `map` field internally is also a `repeated` field, and each element is a `Message`. see https://cloud.google.com/java/docs/reference/protobuf/latest/com.google.protobuf.MapEntry > In reflection API, map fields will be treated as repeated message fields and each map entry is accessed as a message. This MapEntry class is used to represent these map entry messages in reflection API. ### Does this PR introduce _any_ user-facing change? no, test-only ### How was this patch tested? added tests ### Was this patch authored or co-authored using generative AI tooling? no -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-47272][SS] Add MapState implementation for State API v2. [spark]
anishshri-db commented on PR #45341: URL: https://github.com/apache/spark/pull/45341#issuecomment-1982166978 @jingz-db - could you please fix this style error ? ``` [info] compiling 1 Scala source to /home/runner/work/spark/spark/tools/target/scala-2.13/classes ... [error] /home/runner/work/spark/spark/sql/core/src/test/scala/org/apache/spark/sql/execution/streaming/state/ValueStateSuite.scala:195:0: Whitespace at end of line [info] Compilation completed in 14.316s. ``` -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-42746][SQL] Add the LISTAGG() aggregate function [spark]
github-actions[bot] commented on PR #42398: URL: https://github.com/apache/spark/pull/42398#issuecomment-1982086020 We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable. If you'd like to revive this PR, please reopen it and ask a committer to remove the Stale tag! -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-47272][SS] Add MapState implementation for State API v2. [spark]
jingz-db commented on code in PR #45341: URL: https://github.com/apache/spark/pull/45341#discussion_r1515298710 ## sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/StateTypesEncoderUtils.scala: ## @@ -86,3 +88,53 @@ object StateTypesEncoder { new StateTypesEncoder[GK](keySerializer, stateName) } } + +class CompositeKeyStateEncoder[GK, K]( +keySerializer: Serializer[GK], +stateName: String, +userKeyEnc: Encoder[K]) + extends StateTypesEncoder[GK](keySerializer: Serializer[GK], stateName: String) { + + private val schemaForCompositeKeyRow: StructType = +new StructType() + .add("key", BinaryType) + .add("userKey", BinaryType) + private val compositeKeyProjection = UnsafeProjection.create(schemaForCompositeKeyRow) + private val reuseRow = new UnsafeRow(userKeyEnc.schema.fields.length) + private val userKeyExpressionEnc = encoderFor(userKeyEnc) + + private val userKeyRowToObjDeserializer = +userKeyExpressionEnc.resolveAndBind().createDeserializer() + private val userKeySerializer = encoderFor(userKeyEnc).createSerializer() + + def encodeCompositeKey(userKey: K): UnsafeRow = { +val keyOption = ImplicitGroupingKeyTracker.getImplicitKeyOption +if (keyOption.isEmpty) { + throw StateStoreErrors.implicitKeyNotFound(stateName) +} +val groupingKey = keyOption.get.asInstanceOf[GK] +// generate grouping key byte array +val groupingKeyByteArr = keySerializer.apply(groupingKey).asInstanceOf[UnsafeRow].getBytes() +// generate user key byte array +val userKeyBytesArr = userKeySerializer.apply(userKey).asInstanceOf[UnsafeRow].getBytes() + +val compositeKeyRow = compositeKeyProjection(InternalRow(groupingKeyByteArr, userKeyBytesArr)) +compositeKeyRow + } + + def decodeCompositeKey(row: UnsafeRow): K = { +val bytes = row.getBinary(1) +reuseRow.pointTo(bytes, bytes.length) +val value = userKeyRowToObjDeserializer.apply(reuseRow) +value + } +} + +object CompositeKeyStateEncoder { Review Comment: I was following Bhuwan's style in the base class. Maybe I am missing something but did not find anything useful in the [style guide](https://github.com/databricks/scala-style-guide). -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-47155][PYTHON] Fix Error Class Issue [spark]
HyukjinKwon commented on PR #45306: URL: https://github.com/apache/spark/pull/45306#issuecomment-1982064029 ``` == FAIL [0.001s]: test_error_classes_sorted (pyspark.errors.tests.test_errors.ErrorsTest) -- Traceback (most recent call last): File "/__w/spark/spark/python/pyspark/errors/tests/test_errors.py", line 33, in test_error_classes_sorted self.assertTrue( AssertionError: False is not true : Error class [DATA_SOURCE_RETURN_SCHEMA_MISMATCH] should place after [DATA_SOURCE_CREATE_ERROR]. Run 'cd $SPARK_HOME; bin/pyspark' and 'from pyspark.errors.exceptions import _write_self; _write_self()' to automatically sort them. ``` -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-47070][SQL][FOLLOW-UP] Add a flag guarding a subquery in aggregate rewrite [spark]
HyukjinKwon commented on PR #45412: URL: https://github.com/apache/spark/pull/45412#issuecomment-1982062236 cc @cloud-fan -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-47271][DOCS] Explain importance of statistics on SQL performance tuning page [spark]
HyukjinKwon closed pull request #45374: [SPARK-47271][DOCS] Explain importance of statistics on SQL performance tuning page URL: https://github.com/apache/spark/pull/45374 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-47271][DOCS] Explain importance of statistics on SQL performance tuning page [spark]
HyukjinKwon commented on PR #45374: URL: https://github.com/apache/spark/pull/45374#issuecomment-1982061440 Merged to master. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-47070][SQL][FOLLOW-UP] Add a flag guarding a subquery in aggregate rewrite [spark]
HyukjinKwon commented on PR #45412: URL: https://github.com/apache/spark/pull/45412#issuecomment-1982034513 @anton5798 mind keeping the PR description template? https://github.com/apache/spark/blob/master/.github/PULL_REQUEST_TEMPLATE -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-47070][FOLLOW-UP] Add a flag guarding a subquery in aggregate rewrite [spark]
HyukjinKwon commented on code in PR #45412: URL: https://github.com/apache/spark/pull/45412#discussion_r1515285320 ## sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala: ## @@ -3509,6 +3509,17 @@ object SQLConf { .booleanConf .createWithDefault(true) + val WRAP_EXISTS_IN_AGGREGATE_FUNCTION = +buildConf("spark.sql.optimizer.wrapExistsInAggregateFunction") + .internal() + .doc("When true, the optimizer will wrap newly introduced `exists` attributes in an" + + "aggregate function to ensure that Aggregate nodes preserve semantic invariant that each" + + "variable among agg expressions appears either in grouping expressions or belongs to" + + "and aggregate function.") + .version("3.5.0") Review Comment: 3.5.2 or 4.0.0 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-45959][SQL] : Deduplication of relations causes ambiguous exception to be raised, though logically it is Unambiguous [spark]
ahshahid commented on PR #45343: URL: https://github.com/apache/spark/pull/45343#issuecomment-1982020269 While fixing the issue, found that tests in DataFrameSelfJoinSuite, where the two legs have distinct datasetId , the tests are expecting AnalysisException due to ambiguity in the join condition, which IMO is not correct as it is possible to dis-ambiguate using resolution via Dataset ID. This PR fixes the above and have test modifications accordingly. The same applies to some tests in DataFrameAsofJoinSuite. I believe there is still need to do some enhancement in the existing source code related to AsofJoin, as the condition part is still not trying to disambiguate properly ( unlike the changes done for Join). -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-47276][PYTHON][CONNECT] Introduce `spark.profile.clear` for SparkSession-based profiling [spark]
ueshin commented on code in PR #45378: URL: https://github.com/apache/spark/pull/45378#discussion_r1515240180 ## python/pyspark/sql/profiler.py: ## @@ -236,18 +236,22 @@ def clear_perf_profiles(self, id: Optional[int] = None) -> None: The UDF ID whose profiling results should be cleared. If not specified, all the results will be cleared. """ -ids_to_remove = [ -result_id -for result_id, (perf, _, *_) in self._profile_results.items() -if perf is not None -] with self._lock: if id is not None: -if id in ids_to_remove: -self._profile_results.pop(id, None) +if id in self._profile_results: +perf, mem, *rest = self._profile_results[id] +self._profile_results[id] = (None, mem, *rest) +if mem is None: +self._profile_results.pop(id, None) else: -for id_to_remove in ids_to_remove: -self._profile_results.pop(id_to_remove, None) +ids_to_remove = [] +for id, (perf, mem, *rest) in list(self._profile_results.items()): +self._profile_results[id] = (None, mem, *rest) +if mem is None: +ids_to_remove.append(id) Review Comment: nit: Can't we pop it here? ## python/pyspark/sql/profiler.py: ## @@ -262,15 +266,21 @@ def clear_memory_profiles(self, id: Optional[int] = None) -> None: If not specified, all the results will be cleared. """ with self._lock: -ids_to_remove = [ -id for id, (_, mem, *_) in self._profile_results.items() if mem is not None -] if id is not None: -if id in ids_to_remove: -self._profile_results.pop(id, None) +if id in self._profile_results: +perf, mem, *rest = self._profile_results[id] +self._profile_results[id] = (perf, None, *rest) +if perf is None: +self._profile_results.pop(id, None) else: -for id_to_remove in ids_to_remove: -self._profile_results.pop(id_to_remove, None) +ids_to_remove = [] +for id, (perf, mem, *rest) in list(self._profile_results.items()): +self._profile_results[id] = (perf, None, *rest) +if perf is None: +ids_to_remove.append(id) Review Comment: ditto. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-47148][SQL] Avoid to materialize AQE ExchangeQueryStageExec on the cancellation [spark]
erenavsarogullari commented on code in PR #45234: URL: https://github.com/apache/spark/pull/45234#discussion_r1515243849 ## sql/core/src/main/scala/org/apache/spark/sql/execution/adaptive/QueryStageExec.scala: ## @@ -148,6 +148,17 @@ abstract class QueryStageExec extends LeafExecNode { */ abstract class ExchangeQueryStageExec extends QueryStageExec { + /** + * This flag aims to detect if the stage materialization is started. This helps + * to avoid unnecessary stage materialization when the stage is canceled. + */ + @transient protected val materializationStarted = new AtomicBoolean() Review Comment: Yes, it is addressed. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-47148][SQL] Avoid to materialize AQE ExchangeQueryStageExec on the cancellation [spark]
erenavsarogullari commented on code in PR #45234: URL: https://github.com/apache/spark/pull/45234#discussion_r1515240053 ## sql/core/src/main/scala/org/apache/spark/sql/execution/adaptive/AdaptiveSparkPlanExec.scala: ## @@ -790,7 +790,8 @@ case class AdaptiveSparkPlanExec( currentPhysicalPlan.foreach { // earlyFailedStage is the stage which failed before calling doMaterialize, // so we should avoid calling cancel on it to re-trigger the failure again. - case s: ExchangeQueryStageExec if !earlyFailedStage.contains(s.id) => + case s: ExchangeQueryStageExec if !earlyFailedStage.contains(s.id) && +s.isMaterializationStarted() => Review Comment: This depends on the query complexity (`numOfExchangeQueryStage`) and which stage' s materialization is failed such as first stage or last stage. For example, if the query introduces multiple `ShuffleQueryStages` and if the first stage' s materialization is failed, currently, cancellation is being skipped for all of the remaining stages by new logic due to their `materializationStarted = false` such as: Let there are 3 `ShuffleQueryStages`: ``` - ShuffleQueryStage-0' s materialization is failed (and its materializationStarted = true), - ShuffleQueryStage-1' s materialization is not started yet so it is not tried to cancel due to materializationStarted = false, - ShuffleQueryStage-2' s materialization is not started yet so it is not tried to cancel due to materializationStarted = false, ... ``` I can also reproduce same behaviour for more complex queries. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-47148][SQL] Avoid to materialize AQE ExchangeQueryStageExec on the cancellation [spark]
erenavsarogullari commented on code in PR #45234: URL: https://github.com/apache/spark/pull/45234#discussion_r1515240053 ## sql/core/src/main/scala/org/apache/spark/sql/execution/adaptive/AdaptiveSparkPlanExec.scala: ## @@ -790,7 +790,8 @@ case class AdaptiveSparkPlanExec( currentPhysicalPlan.foreach { // earlyFailedStage is the stage which failed before calling doMaterialize, // so we should avoid calling cancel on it to re-trigger the failure again. - case s: ExchangeQueryStageExec if !earlyFailedStage.contains(s.id) => + case s: ExchangeQueryStageExec if !earlyFailedStage.contains(s.id) && +s.isMaterializationStarted() => Review Comment: This depends on the query complexity (`numOfExchangeQueryStage`) and which stage' s materialization is failed such as first stage or last stage. For example, if the query introduces multiple `ShuffleQueryStages` and if the first stage' s materialization is failed, currently, cancellation is being skipped for all of the remaining stages by new logic due to their `materializationStarted = false` such as: Let there are 3 `ShuffleQueryStages`: ``` - ShuffleQueryStage-0' s materialization is failed (and its materializationStarted = true), - ShuffleQueryStage-1' s materialization is not started yet so it is not tried to cancel due to materializationStarted = false, - ShuffleQueryStage-2' s materialization is not started yet so it is not tried to cancel due to materializationStarted = false. ``` I can also reproduce same behaviour for more complex queries. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-47148][SQL] Avoid to materialize AQE ExchangeQueryStageExec on the cancellation [spark]
erenavsarogullari commented on code in PR #45234: URL: https://github.com/apache/spark/pull/45234#discussion_r1515240053 ## sql/core/src/main/scala/org/apache/spark/sql/execution/adaptive/AdaptiveSparkPlanExec.scala: ## @@ -790,7 +790,8 @@ case class AdaptiveSparkPlanExec( currentPhysicalPlan.foreach { // earlyFailedStage is the stage which failed before calling doMaterialize, // so we should avoid calling cancel on it to re-trigger the failure again. - case s: ExchangeQueryStageExec if !earlyFailedStage.contains(s.id) => + case s: ExchangeQueryStageExec if !earlyFailedStage.contains(s.id) && +s.isMaterializationStarted() => Review Comment: This depends on the query complexity (`numOfExchangeQueryStage`) and which stage' s materialization is failed such as first stage or last stage. For example, if the query introduces multiple `ShuffleQueryStages` and if the first stage' s materialization is failed, currently, cancellation is being skipped for all of the remaining stages by new logic due to their `materializationStarted = false` such as: Let there are 3 `ShuffleQueryStages`: ``` - ShuffleQueryStage-0' s materialization is failed, - ShuffleQueryStage-1' s materialization is not started yet so it is not tried to cancel due to materializationStarted = false, - ShuffleQueryStage-2' s materialization is not started yet so it is not tried to cancel due to materializationStarted = false. ``` I can also reproduce same behaviour for more complex queries. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-47303][CORE][TESTS] Restructure MasterSuite [spark]
HyukjinKwon closed pull request #45366: [SPARK-47303][CORE][TESTS] Restructure MasterSuite URL: https://github.com/apache/spark/pull/45366 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-47148][SQL] Avoid to materialize AQE ExchangeQueryStageExec on the cancellation [spark]
erenavsarogullari commented on code in PR #45234: URL: https://github.com/apache/spark/pull/45234#discussion_r1515240053 ## sql/core/src/main/scala/org/apache/spark/sql/execution/adaptive/AdaptiveSparkPlanExec.scala: ## @@ -790,7 +790,8 @@ case class AdaptiveSparkPlanExec( currentPhysicalPlan.foreach { // earlyFailedStage is the stage which failed before calling doMaterialize, // so we should avoid calling cancel on it to re-trigger the failure again. - case s: ExchangeQueryStageExec if !earlyFailedStage.contains(s.id) => + case s: ExchangeQueryStageExec if !earlyFailedStage.contains(s.id) && +s.isMaterializationStarted() => Review Comment: This depends on the query complexity / `numOfExchangeQueryStage` and which stage' s materialization is failed such as first stage or last stage. For example, if the query introduces multiple `ShuffleQueryStages` and if the first stage' s materialization is failed, currently, cancellation is being skipped for all of the remaining stages by new logic due to their `materializationStarted = false` such as: Let there are 3 `ShuffleQueryStages`: ``` - ShuffleQueryStage-0' s materialization is failed, - ShuffleQueryStage-1' s materialization is not started yet so it is not tried to cancel due to materializationStarted = false, - ShuffleQueryStage-2' s materialization is not started yet so it is not tried to cancel due to materializationStarted = false. ``` I can also reproduce same behaviour for more complex queries. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-47303][CORE][TESTS] Restructure MasterSuite [spark]
HyukjinKwon commented on PR #45366: URL: https://github.com/apache/spark/pull/45366#issuecomment-1981968073 Thanks. Merged to master. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-47148][SQL] Avoid to materialize AQE ExchangeQueryStageExec on the cancellation [spark]
erenavsarogullari commented on code in PR #45234: URL: https://github.com/apache/spark/pull/45234#discussion_r1515240053 ## sql/core/src/main/scala/org/apache/spark/sql/execution/adaptive/AdaptiveSparkPlanExec.scala: ## @@ -790,7 +790,8 @@ case class AdaptiveSparkPlanExec( currentPhysicalPlan.foreach { // earlyFailedStage is the stage which failed before calling doMaterialize, // so we should avoid calling cancel on it to re-trigger the failure again. - case s: ExchangeQueryStageExec if !earlyFailedStage.contains(s.id) => + case s: ExchangeQueryStageExec if !earlyFailedStage.contains(s.id) && +s.isMaterializationStarted() => Review Comment: This depends on the query complexity/numOfExchangeQueryStage and which stage' s materialization is failed such as first stage or last stage. For example, if the query introduces multiple `ShuffleQueryStages` and if the first stage' s materialization is failed, currently, cancellation is being skipped for all of the remaining stages by new logic due to their `materializationStarted = false` such as: Let there are 3 `ShuffleQueryStages`: ``` - ShuffleQueryStage-0' s materialization is failed, - ShuffleQueryStage-1' s materialization is not started yet so it is not tried to cancel due to materializationStarted = false, - ShuffleQueryStage-2' s materialization is not started yet so it is not tried to cancel due to materializationStarted = false. ``` I can also reproduce same behaviour for more complex queries. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-47148][SQL] Avoid to materialize AQE ExchangeQueryStageExec on the cancellation [spark]
erenavsarogullari commented on code in PR #45234: URL: https://github.com/apache/spark/pull/45234#discussion_r1515240053 ## sql/core/src/main/scala/org/apache/spark/sql/execution/adaptive/AdaptiveSparkPlanExec.scala: ## @@ -790,7 +790,8 @@ case class AdaptiveSparkPlanExec( currentPhysicalPlan.foreach { // earlyFailedStage is the stage which failed before calling doMaterialize, // so we should avoid calling cancel on it to re-trigger the failure again. - case s: ExchangeQueryStageExec if !earlyFailedStage.contains(s.id) => + case s: ExchangeQueryStageExec if !earlyFailedStage.contains(s.id) && +s.isMaterializationStarted() => Review Comment: This depends on the query complexity and which stage' s materialization is failed such as first stage or last stage. For example, if the query introduces multiple `ShuffleQueryStages` and if the first stage' s materialization is failed, currently, cancellation is being skipped for all of the remaining stages by new logic due to their `materializationStarted = false` such as: Let there are 3 `ShuffleQueryStages`: ``` - ShuffleQueryStage-0' s materialization is failed, - ShuffleQueryStage-1' s materialization is not started yet so it is not tried to cancel due to materializationStarted = false, - ShuffleQueryStage-2' s materialization is not started yet so it is not tried to cancel due to materializationStarted = false. ``` I can also reproduce same behaviour for more complex queries. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-47276][PYTHON][CONNECT] Introduce `spark.profile.clear` for SparkSession-based profiling [spark]
xinrong-meng commented on code in PR #45378: URL: https://github.com/apache/spark/pull/45378#discussion_r1515212345 ## python/pyspark/sql/profiler.py: ## @@ -224,6 +224,54 @@ def dump(id: int) -> None: for id in sorted(code_map.keys()): dump(id) +def clear_perf_profiles(self, id: Optional[int] = None) -> None: +""" +Clear the perf profile results. + +.. versionadded:: 4.0.0 + +Parameters +-- +id : int, optional +The UDF ID whose profiling results should be cleared. +If not specified, all the results will be cleared. +""" +ids_to_remove = [ +result_id +for result_id, (perf, _, *_) in self._profile_results.items() +if perf is not None +] +with self._lock: +if id is not None: +if id in ids_to_remove: +self._profile_results.pop(id, None) Review Comment: Good catch! Thanks for the example. I adjusted the code and added tests. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-47272][SS] Add MapState implementation for State API v2. [spark]
jingz-db commented on PR #45341: URL: https://github.com/apache/spark/pull/45341#issuecomment-1981833646 > @jingz-db - test failure seems related ? Weirdly is passing locally. Let me resolve your comments and retrigger the CI and see if it still fails. Thanks for the review! -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-47307] Replace RFC 2045 base64 encoder with RFC 4648 encoder [spark]
dongjoon-hyun commented on PR #45408: URL: https://github.com/apache/spark/pull/45408#issuecomment-1981829489 Hi, @ted-jenks . Could you elaborate your correctness situation a little more? It sounds like you have other systems to read Spark's data. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[PR] [SPARK-47070][FOLLOW-UP] Add a flag guarding a subquery in aggregate rewrite [spark]
anton5798 opened a new pull request, #45412: URL: https://github.com/apache/spark/pull/45412 ### What changes were proposed in this pull request? Add a flag that guards a recently introduced new codepath inside optimizer that wraps `exists` variables into an agg function. See [#45133](https://github.com/apache/spark/pull/45133) for details. ### Tests No additional tests. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[PR] [SPARK-47309][SQL][XML] Add schema inference unit tests and fix schema inference issues [spark]
shujingyang-db opened a new pull request, #45411: URL: https://github.com/apache/spark/pull/45411 ### What changes were proposed in this pull request? As titled. It also fixes schema inference issue 1) when there's an empty tag 2) when merging schema for NullType ### Why are the changes needed? Fix a bug. ### Does this PR introduce _any_ user-facing change? Yes ### How was this patch tested? Unit tests ### Was this patch authored or co-authored using generative AI tooling? No -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-45278] [YARN] Allow configuring Yarn executor bind address in Yarn [spark]
gedeh commented on PR #42870: URL: https://github.com/apache/spark/pull/42870#issuecomment-1981727449 Hello, I noticed this PR is closed, this is blocking Spark with Yarn in kubernetes. I dont understand whats left missing for this PR. If anyone in Spark project can shed a light what required to merge this for next release of Spark that would be great. Thanks! -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-47276][PYTHON][CONNECT] Introduce `spark.profile.clear` for SparkSession-based profiling [spark]
ueshin commented on code in PR #45378: URL: https://github.com/apache/spark/pull/45378#discussion_r1515098224 ## python/pyspark/sql/profiler.py: ## @@ -224,6 +224,54 @@ def dump(id: int) -> None: for id in sorted(code_map.keys()): dump(id) +def clear_perf_profiles(self, id: Optional[int] = None) -> None: +""" +Clear the perf profile results. + +.. versionadded:: 4.0.0 + +Parameters +-- +id : int, optional +The UDF ID whose profiling results should be cleared. +If not specified, all the results will be cleared. +""" +ids_to_remove = [ +result_id +for result_id, (perf, _, *_) in self._profile_results.items() +if perf is not None +] +with self._lock: +if id is not None: +if id in ids_to_remove: +self._profile_results.pop(id, None) Review Comment: On Jupyter: ```py from pyspark.sql.functions import pandas_udf df = spark.range(3) @pandas_udf("long") def add1(x): return x + 1 added = df.select(add1("id")) spark.conf.set("spark.sql.pyspark.udf.profiler", "perf") added.show() spark.conf.set("spark.sql.pyspark.udf.profiler", "memory") added.show() spark.profile.show() ... spark.profile.clear(type="memory") spark.profile.show() # should still show the perf results? ``` -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-39771][CORE] Add a warning msg in `Dependency` when a too large number of shuffle blocks is to be created. [spark]
xuanyuanking commented on PR #45266: URL: https://github.com/apache/spark/pull/45266#issuecomment-1981661582 Thanks for the contribution @y-wei ! Merged to master. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org