[GitHub] [spark] MaxGekk closed pull request #39332: [WIP][SPARK-40822][SQL] Stable derived column aliases
MaxGekk closed pull request #39332: [WIP][SPARK-40822][SQL] Stable derived column aliases URL: https://github.com/apache/spark/pull/39332
[GitHub] [spark] MaxGekk closed pull request #40126: [SPARK-40822][SQL] Stable derived column aliases
MaxGekk closed pull request #40126: [SPARK-40822][SQL] Stable derived column aliases URL: https://github.com/apache/spark/pull/40126
[GitHub] [spark] MaxGekk commented on pull request #40126: [SPARK-40822][SQL] Stable derived column aliases
MaxGekk commented on PR #40126: URL: https://github.com/apache/spark/pull/40126#issuecomment-1477333100 Merging to master. Thank you, @srielau and @cloud-fan for review.
[GitHub] [spark] amaliujia commented on a diff in pull request #40498: [SPARK-42878][CONNECT] The table API in DataFrameReader could also accept options
amaliujia commented on code in PR #40498: URL: https://github.com/apache/spark/pull/40498#discussion_r1142929343 ## connector/connect/common/src/main/protobuf/spark/connect/relations.proto: ## @@ -148,6 +143,13 @@ message Read { // This is only supported by the JDBC data source. repeated string predicates = 5; } + + // Options for data sources and named table. + // + // When using for data sources, the context of this map varies based on the + // data source format. This options could be empty for valid data source format. + // The map key is case insensitive. + map options = 3; Review Comment: Chose to use a non-breaking way to change the proto.
[GitHub] [spark] amaliujia commented on pull request #40498: [SPARK-42878][CONNECT] The table API in DataFrameReader could also accept options
amaliujia commented on PR #40498: URL: https://github.com/apache/spark/pull/40498#issuecomment-1477332325 @hvanhovell
[GitHub] [spark] amaliujia commented on a diff in pull request #40498: [SPARK-42878][CONNECT] The table API in DataFrameReader could also accept options
amaliujia commented on code in PR #40498: URL: https://github.com/apache/spark/pull/40498#discussion_r1142929171 ## connector/connect/client/jvm/src/main/scala/org/apache/spark/sql/DataFrameReader.scala: ## @@ -183,7 +183,7 @@ class DataFrameReader private[sql] (sparkSession: SparkSession) extends Logging dataSourceBuilder.setFormat(source) userSpecifiedSchema.foreach(schema => dataSourceBuilder.setSchema(schema.toDDL)) extraOptions.foreach { case (k, v) => -dataSourceBuilder.putOptions(k, v) +builder.getReadBuilder.putOptions(k, v) Review Comment: I found I can only add a meaningful test on the server side. On the client side there is no way to verify that an option has passed through. In the existing codebase, it is tested because we can do `df.queryExecution.analyzed`, then get the complete plan to fetch the options and verify it.
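For context, a minimal Scala sketch of the kind of check that is possible in the classic (non-Connect) codebase, where the analyzed plan is reachable on the driver; the catalog/table name and option key are illustrative, and it assumes a V2 table so the options surface on `DataSourceV2Relation`:

```scala
// Sketch only: in classic Spark the options given to DataFrameReader.table can be
// read back from the analyzed plan; the Connect client has no such access.
import org.apache.spark.sql.execution.datasources.v2.DataSourceV2Relation

val df = spark.read.option("mergeSchema", "true").table("testcat.ns.tbl")
val readOptions = df.queryExecution.analyzed.collect {
  case r: DataSourceV2Relation => r.options
}.head
assert(readOptions.get("mergeSchema") == "true")
```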
[GitHub] [spark] yliou opened a new pull request, #40503: [SPARK-42830] [UI] Link skipped stages on Spark UI
yliou opened a new pull request, #40503: URL: https://github.com/apache/spark/pull/40503 ### What changes were proposed in this pull request? Adds text on the UI that shows which executed stage a skipped stage on the UI refers to. ### Why are the changes needed? Helps with finding execution details, i.e. figuring out which stages from earlier jobs feed into a later stage in a specific job. ### Does this PR introduce _any_ user-facing change? Yes, the Jobs page can look like the following, where the skipped stage is now clickable and redirects to the stage that was actually executed: https://user-images.githubusercontent.com/16739760/226529417-a2384904-7e46-48f6-bcd5-c708416e6353.png ### How was this patch tested? Manually tested locally on the UI.
[GitHub] [spark] pan3793 commented on pull request #40444: [SPARK-42813][K8S] Print application info when waitAppCompletion is false
pan3793 commented on PR #40444: URL: https://github.com/apache/spark/pull/40444#issuecomment-1477316443 @dongjoon-hyun thank you, I updated the log message to ``` "Application $appName with application ID $appId and submission ID $sId finished" ```
[GitHub] [spark] yliou opened a new pull request, #40502: [SPARK-42829] [UI] add repeat identifier to cached RDD on stage page
yliou opened a new pull request, #40502: URL: https://github.com/apache/spark/pull/40502 ### What changes were proposed in this pull request? Adds a "Repeat Identifier:" label to the cached RDD node on the Stages page. The "Repeat Identifier:" text is bolded so that it is easier to distinguish from the rest of the text. ### Why are the changes needed? Currently there is no way to distinguish which cached RDD is being executed in a particular stage. This aims to fix that. ### Does this PR introduce _any_ user-facing change? Yes, on the Stages page in the UI when there is a cached RDD. One example is https://user-images.githubusercontent.com/16739760/226527044-4c0719f4-aaf8-4e99-8bce-fd033106379c.png ### How was this patch tested? Manually tested locally in the SQL UI.
[GitHub] [spark] zhengruifeng closed pull request #40501: [SPARK-42864][ML] Make IsotonicRegression.PointsAccumulator private
zhengruifeng closed pull request #40501: [SPARK-42864][ML] Make IsotonicRegression.PointsAccumulator private URL: https://github.com/apache/spark/pull/40501
[GitHub] [spark] xinrong-meng commented on pull request #40500: [SPARK-42864][ML][3.4] Make `IsotonicRegression.PointsAccumulator` private
xinrong-meng commented on PR #40500: URL: https://github.com/apache/spark/pull/40500#issuecomment-1477308745 Merged to branch-3.4, thanks!
[GitHub] [spark] xinrong-meng closed pull request #40500: [SPARK-42864][ML][3.4] Make `IsotonicRegression.PointsAccumulator` private
xinrong-meng closed pull request #40500: [SPARK-42864][ML][3.4] Make `IsotonicRegression.PointsAccumulator` private URL: https://github.com/apache/spark/pull/40500
[GitHub] [spark] zhengruifeng commented on pull request #40500: [SPARK-42864][ML][3.4] Make `IsotonicRegression.PointsAccumulator` private
zhengruifeng commented on PR #40500: URL: https://github.com/apache/spark/pull/40500#issuecomment-1477304948 @xinrong-meng it seems this PR was already merged?
[GitHub] [spark] ulysses-you commented on a diff in pull request #40446: [SPARK-42815][SQL] Subexpression elimination support shortcut expression
ulysses-you commented on code in PR #40446: URL: https://github.com/apache/spark/pull/40446#discussion_r1142906959 ## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/EquivalentExpressions.scala: ## @@ -125,13 +128,27 @@ class EquivalentExpressions { } } + private def skipForShortcut(expr: Expression): Expression = { +if (skipForShortcutEnable) { + // The subexpression may not need to eval even if it appears more than once. + // e.g., `if(or(a, and(b, b)))`, the expression `b` would be skipped if `a` is true. + expr match { +case and: And => skipForShortcut(and.left) Review Comment: Yea, that is the case, but on second thought, `updateExprTree` has the side effect of `updateExprInMap` during recursion. If we decide to skip for the shortcut, would it be better to return the final valid expression in one shot?
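To make the shortcut argument concrete, here is a standalone toy sketch in Scala (not the catalyst classes; the names are illustrative): under short-circuit evaluation only the left-most operand of an `And`/`Or` chain is guaranteed to run, so it is the only branch that is safe to consider for common-subexpression extraction.

```scala
// Toy boolean expression tree; only models the shortcut-skipping idea.
sealed trait Expr
case class And(left: Expr, right: Expr) extends Expr
case class Or(left: Expr, right: Expr) extends Expr
case class Leaf(name: String) extends Expr

// Recurse into the left child only: the right side of And/Or may never be evaluated.
def skipForShortcut(e: Expr): Expr = e match {
  case And(l, _) => skipForShortcut(l)
  case Or(l, _)  => skipForShortcut(l)
  case other     => other
}

// In or(a, and(b, b)), `b` appears twice but is skipped entirely when `a` is true.
assert(skipForShortcut(Or(Leaf("a"), And(Leaf("b"), Leaf("b")))) == Leaf("a"))
```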
[GitHub] [spark] sudoliyang commented on pull request #40494: [MINOR][DOCS] Fix typos
sudoliyang commented on PR #40494: URL: https://github.com/apache/spark/pull/40494#issuecomment-1477300074 I did enable Actions on my forked repo and rebased onto `apache/spark` master. Can anyone re-run the workflows?
[GitHub] [spark] rednaxelafx commented on a diff in pull request #40488: [SPARK-42851][SQL] Replace EquivalentExpressions with mutable map in PhysicalAggregation
rednaxelafx commented on code in PR #40488: URL: https://github.com/apache/spark/pull/40488#discussion_r1142905889 ## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/planning/patterns.scala: ## @@ -296,12 +298,17 @@ object PhysicalAggregation { // build a set of semantically distinct aggregate expressions and re-write expressions so // that they reference the single copy of the aggregate function which actually gets computed. // Non-deterministic aggregate expressions are not deduplicated. - val equivalentAggregateExpressions = new EquivalentExpressions + val equivalentAggregateExpressions = mutable.Map.empty[Expression, Expression] val aggregateExpressions = resultExpressions.flatMap { expr => expr.collect { - // addExpr() always returns false for non-deterministic expressions and do not add them. case a -if AggregateExpression.isAggregate(a) && !equivalentAggregateExpressions.addExpr(a) => Review Comment: The line of thought would be: adding the `supportedExpression` guard to `addExpr()` would cause a performance regression, so let's just close our eyes and make the only remaining use of `addExpr` break away and do its own deduplication in the old logic without taking things like `NamedLambdaVariable` into account -- which is the way it's been for quite a few releases. This PR essentially inlines the `addExpr` path of the old `EquivalentExpressions` into `PhysicalAggregation` to recover what it used to do.
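A minimal Scala sketch of the inlined deduplication described above, under the assumption that semantic equality is approximated by keying the map on `canonicalized`; the real PR may differ in details:

```scala
// Sketch: deduplicate deterministic aggregate expressions with a plain mutable map,
// mirroring what EquivalentExpressions.addExpr used to do for PhysicalAggregation.
import scala.collection.mutable
import org.apache.spark.sql.catalyst.expressions.Expression

val equivalentAggregateExpressions = mutable.Map.empty[Expression, Expression]

def dedup(agg: Expression): Expression = {
  if (agg.deterministic) {
    // First occurrence wins; later semantically-equal aggregates resolve to it.
    equivalentAggregateExpressions.getOrElseUpdate(agg.canonicalized, agg)
  } else {
    agg // non-deterministic aggregates are never deduplicated
  }
}
```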
[GitHub] [spark] pan3793 commented on pull request #40444: [SPARK-42813][K8S] Print application info when waitAppCompletion is false
pan3793 commented on PR #40444: URL: https://github.com/apache/spark/pull/40444#issuecomment-1477299823 Kindly ping @dongjoon-hyun
[GitHub] [spark] huaxingao commented on a diff in pull request #40495: test for reading footer within file range
huaxingao commented on code in PR #40495: URL: https://github.com/apache/spark/pull/40495#discussion_r1142903084 ## sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/parquet/ParquetPartitionReaderFactory.scala: ## @@ -92,8 +93,13 @@ case class ParquetPartitionReaderFactory( if (aggregation.isEmpty) { ParquetFooterReader.readFooter(conf, filePath, SKIP_ROW_GROUPS) } else { + val split = new FileSplit(file.toPath, file.start, file.length, Array.empty[String]) // For aggregate push down, we will get max/min/count from footer statistics. - ParquetFooterReader.readFooter(conf, filePath, NO_FILTER) Review Comment: @sunchao What do you think?
[GitHub] [spark] huaxingao commented on a diff in pull request #40495: test for reading footer within file range
huaxingao commented on code in PR #40495: URL: https://github.com/apache/spark/pull/40495#discussion_r1142902856 ## sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/parquet/ParquetPartitionReaderFactory.scala: ## @@ -92,8 +93,13 @@ case class ParquetPartitionReaderFactory( if (aggregation.isEmpty) { ParquetFooterReader.readFooter(conf, filePath, SKIP_ROW_GROUPS) } else { + val split = new FileSplit(file.toPath, file.start, file.length, Array.empty[String]) // For aggregate push down, we will get max/min/count from footer statistics. - ParquetFooterReader.readFooter(conf, filePath, NO_FILTER) Review Comment: @yabola Thanks for the ping. I was assuming that we always need to read the whole `PartitionedFile`, i.e. the start is always 0 and the length is always the file length. Maybe it's safer to add the range.
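For reference, a hedged Scala sketch of what adding the range could look like, following the `SpecificParquetRecordReaderBase` pattern linked later in this thread; it assumes the three-argument `ParquetFooterReader.readFooter(Configuration, Path, MetadataFilter)` overload shown in the diff above:

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import org.apache.parquet.format.converter.ParquetMetadataConverter
import org.apache.parquet.hadoop.metadata.ParquetMetadata
import org.apache.spark.sql.execution.datasources.parquet.ParquetFooterReader

// Sketch only: read footer metadata restricted to one split's byte range,
// instead of NO_FILTER, which returns row-group metadata for the whole file.
def readFooterInRange(conf: Configuration, path: Path, start: Long, length: Long): ParquetMetadata =
  ParquetFooterReader.readFooter(conf, path, ParquetMetadataConverter.range(start, start + length))
```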
[GitHub] [spark] dtenedor commented on a diff in pull request #40496: [SPARK-42874][SQL] Enable new golden file test framework for analysis for all input files
dtenedor commented on code in PR #40496: URL: https://github.com/apache/spark/pull/40496#discussion_r1142891754 ## sql/core/src/test/resources/sql-tests/analyzer-results/ansi/cast.sql.out: ## @@ -0,0 +1,881 @@ +-- Automatically generated by SQLQueryTestSuite +-- !query +SELECT CAST('1.23' AS int) +-- !query analysis +Project [cast(1.23 as int) AS CAST(1.23 AS INT)#x] ++- OneRowRelation + + +-- !query +SELECT CAST('1.23' AS long) +-- !query analysis +Project [cast(1.23 as bigint) AS CAST(1.23 AS BIGINT)#xL] ++- OneRowRelation + + +-- !query +SELECT CAST('-4.56' AS int) +-- !query analysis +Project [cast(-4.56 as int) AS CAST(-4.56 AS INT)#x] ++- OneRowRelation + + +-- !query +SELECT CAST('-4.56' AS long) +-- !query analysis +Project [cast(-4.56 as bigint) AS CAST(-4.56 AS BIGINT)#xL] ++- OneRowRelation + + +-- !query +SELECT CAST('abc' AS int) +-- !query analysis +Project [cast(abc as int) AS CAST(abc AS INT)#x] ++- OneRowRelation + + +-- !query +SELECT CAST('abc' AS long) +-- !query analysis +Project [cast(abc as bigint) AS CAST(abc AS BIGINT)#xL] ++- OneRowRelation + + +-- !query +SELECT CAST('abc' AS float) +-- !query analysis +Project [cast(abc as float) AS CAST(abc AS FLOAT)#x] ++- OneRowRelation + + +-- !query +SELECT CAST('abc' AS double) +-- !query analysis +Project [cast(abc as double) AS CAST(abc AS DOUBLE)#x] ++- OneRowRelation + + +-- !query +SELECT CAST('1234567890123' AS int) +-- !query analysis +Project [cast(1234567890123 as int) AS CAST(1234567890123 AS INT)#x] ++- OneRowRelation + + +-- !query +SELECT CAST('12345678901234567890123' AS long) +-- !query analysis +Project [cast(12345678901234567890123 as bigint) AS CAST(12345678901234567890123 AS BIGINT)#xL] ++- OneRowRelation + + +-- !query +SELECT CAST('' AS int) +-- !query analysis +Project [cast( as int) AS CAST( AS INT)#x] ++- OneRowRelation + + +-- !query +SELECT CAST('' AS long) +-- !query analysis +Project [cast( as bigint) AS CAST( AS BIGINT)#xL] ++- OneRowRelation + + +-- !query +SELECT CAST('' AS float) +-- !query analysis +Project [cast( as float) AS CAST( AS FLOAT)#x] ++- OneRowRelation + + +-- !query +SELECT CAST('' AS double) +-- !query analysis +Project [cast( as double) AS CAST( AS DOUBLE)#x] ++- OneRowRelation + + +-- !query +SELECT CAST(NULL AS int) +-- !query analysis +Project [cast(null as int) AS CAST(NULL AS INT)#x] ++- OneRowRelation + + +-- !query +SELECT CAST(NULL AS long) +-- !query analysis +Project [cast(null as bigint) AS CAST(NULL AS BIGINT)#xL] ++- OneRowRelation + + +-- !query +SELECT CAST('123.a' AS int) +-- !query analysis +Project [cast(123.a as int) AS CAST(123.a AS INT)#x] ++- OneRowRelation + + +-- !query +SELECT CAST('123.a' AS long) +-- !query analysis +Project [cast(123.a as bigint) AS CAST(123.a AS BIGINT)#xL] ++- OneRowRelation + + +-- !query +SELECT CAST('123.a' AS float) +-- !query analysis +Project [cast(123.a as float) AS CAST(123.a AS FLOAT)#x] ++- OneRowRelation + + +-- !query +SELECT CAST('123.a' AS double) +-- !query analysis +Project [cast(123.a as double) AS CAST(123.a AS DOUBLE)#x] ++- OneRowRelation + + +-- !query +SELECT CAST('-2147483648' AS int) +-- !query analysis +Project [cast(-2147483648 as int) AS CAST(-2147483648 AS INT)#x] ++- OneRowRelation + + +-- !query +SELECT CAST('-2147483649' AS int) +-- !query analysis +Project [cast(-2147483649 as int) AS CAST(-2147483649 AS INT)#x] ++- OneRowRelation + + +-- !query +SELECT CAST('2147483647' AS int) +-- !query analysis +Project [cast(2147483647 as int) AS CAST(2147483647 AS INT)#x] ++- OneRowRelation + + +-- !query 
+SELECT CAST('2147483648' AS int) +-- !query analysis +Project [cast(2147483648 as int) AS CAST(2147483648 AS INT)#x] ++- OneRowRelation + + +-- !query +SELECT CAST('-9223372036854775808' AS long) +-- !query analysis +Project [cast(-9223372036854775808 as bigint) AS CAST(-9223372036854775808 AS BIGINT)#xL] ++- OneRowRelation + + +-- !query +SELECT CAST('-9223372036854775809' AS long) +-- !query analysis +Project [cast(-9223372036854775809 as bigint) AS CAST(-9223372036854775809 AS BIGINT)#xL] ++- OneRowRelation + + +-- !query +SELECT CAST('9223372036854775807' AS long) +-- !query analysis +Project [cast(9223372036854775807 as bigint) AS CAST(9223372036854775807 AS BIGINT)#xL] ++- OneRowRelation + + +-- !query +SELECT CAST('9223372036854775808' AS long) +-- !query analysis +Project [cast(9223372036854775808 as bigint) AS CAST(9223372036854775808 AS BIGINT)#xL] ++- OneRowRelation + + +-- !query +SELECT HEX(CAST('abc' AS binary)) +-- !query analysis +Project [hex(cast(abc as binary)) AS hex(CAST(abc AS BINARY))#x] ++- OneRowRelation + + +-- !query +SELECT HEX(CAST(CAST(123 AS byte) AS binary)) +-- !query analysis +org.apache.spark.sql.AnalysisException +{ + "errorClass" : "DATATYPE_MISMATCH.CAST_WITH_CONF_SUGGESTION", + "sqlState" : "42K09", + "messageParameters" : { +"config" : "\"spark.sql.ansi.enabled\"", +"configVal" : "'false'", +"sqlExpr" : "\"CAST(CAST(123 AS TINYINT) AS BINARY)\""
[GitHub] [spark] LuciferYang commented on a diff in pull request #40496: [SPARK-42874][SQL] Enable new golden file test framework for analysis for all input files
LuciferYang commented on code in PR #40496: URL: https://github.com/apache/spark/pull/40496#discussion_r1142891030 ## sql/core/src/test/resources/sql-tests/analyzer-results/ansi/cast.sql.out: ## @@ -0,0 +1,881 @@ +-- Automatically generated by SQLQueryTestSuite +-- !query +SELECT CAST('1.23' AS int) +-- !query analysis +Project [cast(1.23 as int) AS CAST(1.23 AS INT)#x] ++- OneRowRelation + + +-- !query +SELECT CAST('1.23' AS long) +-- !query analysis +Project [cast(1.23 as bigint) AS CAST(1.23 AS BIGINT)#xL] ++- OneRowRelation + + +-- !query +SELECT CAST('-4.56' AS int) +-- !query analysis +Project [cast(-4.56 as int) AS CAST(-4.56 AS INT)#x] ++- OneRowRelation + + +-- !query +SELECT CAST('-4.56' AS long) +-- !query analysis +Project [cast(-4.56 as bigint) AS CAST(-4.56 AS BIGINT)#xL] ++- OneRowRelation + + +-- !query +SELECT CAST('abc' AS int) +-- !query analysis +Project [cast(abc as int) AS CAST(abc AS INT)#x] ++- OneRowRelation + + +-- !query +SELECT CAST('abc' AS long) +-- !query analysis +Project [cast(abc as bigint) AS CAST(abc AS BIGINT)#xL] ++- OneRowRelation + + +-- !query +SELECT CAST('abc' AS float) +-- !query analysis +Project [cast(abc as float) AS CAST(abc AS FLOAT)#x] ++- OneRowRelation + + +-- !query +SELECT CAST('abc' AS double) +-- !query analysis +Project [cast(abc as double) AS CAST(abc AS DOUBLE)#x] ++- OneRowRelation + + +-- !query +SELECT CAST('1234567890123' AS int) +-- !query analysis +Project [cast(1234567890123 as int) AS CAST(1234567890123 AS INT)#x] ++- OneRowRelation + + +-- !query +SELECT CAST('12345678901234567890123' AS long) +-- !query analysis +Project [cast(12345678901234567890123 as bigint) AS CAST(12345678901234567890123 AS BIGINT)#xL] ++- OneRowRelation + + +-- !query +SELECT CAST('' AS int) +-- !query analysis +Project [cast( as int) AS CAST( AS INT)#x] ++- OneRowRelation + + +-- !query +SELECT CAST('' AS long) +-- !query analysis +Project [cast( as bigint) AS CAST( AS BIGINT)#xL] ++- OneRowRelation + + +-- !query +SELECT CAST('' AS float) +-- !query analysis +Project [cast( as float) AS CAST( AS FLOAT)#x] ++- OneRowRelation + + +-- !query +SELECT CAST('' AS double) +-- !query analysis +Project [cast( as double) AS CAST( AS DOUBLE)#x] ++- OneRowRelation + + +-- !query +SELECT CAST(NULL AS int) +-- !query analysis +Project [cast(null as int) AS CAST(NULL AS INT)#x] ++- OneRowRelation + + +-- !query +SELECT CAST(NULL AS long) +-- !query analysis +Project [cast(null as bigint) AS CAST(NULL AS BIGINT)#xL] ++- OneRowRelation + + +-- !query +SELECT CAST('123.a' AS int) +-- !query analysis +Project [cast(123.a as int) AS CAST(123.a AS INT)#x] ++- OneRowRelation + + +-- !query +SELECT CAST('123.a' AS long) +-- !query analysis +Project [cast(123.a as bigint) AS CAST(123.a AS BIGINT)#xL] ++- OneRowRelation + + +-- !query +SELECT CAST('123.a' AS float) +-- !query analysis +Project [cast(123.a as float) AS CAST(123.a AS FLOAT)#x] ++- OneRowRelation + + +-- !query +SELECT CAST('123.a' AS double) +-- !query analysis +Project [cast(123.a as double) AS CAST(123.a AS DOUBLE)#x] ++- OneRowRelation + + +-- !query +SELECT CAST('-2147483648' AS int) +-- !query analysis +Project [cast(-2147483648 as int) AS CAST(-2147483648 AS INT)#x] ++- OneRowRelation + + +-- !query +SELECT CAST('-2147483649' AS int) +-- !query analysis +Project [cast(-2147483649 as int) AS CAST(-2147483649 AS INT)#x] ++- OneRowRelation + + +-- !query +SELECT CAST('2147483647' AS int) +-- !query analysis +Project [cast(2147483647 as int) AS CAST(2147483647 AS INT)#x] ++- OneRowRelation + + +-- !query 
+SELECT CAST('2147483648' AS int) +-- !query analysis +Project [cast(2147483648 as int) AS CAST(2147483648 AS INT)#x] ++- OneRowRelation + + +-- !query +SELECT CAST('-9223372036854775808' AS long) +-- !query analysis +Project [cast(-9223372036854775808 as bigint) AS CAST(-9223372036854775808 AS BIGINT)#xL] ++- OneRowRelation + + +-- !query +SELECT CAST('-9223372036854775809' AS long) +-- !query analysis +Project [cast(-9223372036854775809 as bigint) AS CAST(-9223372036854775809 AS BIGINT)#xL] ++- OneRowRelation + + +-- !query +SELECT CAST('9223372036854775807' AS long) +-- !query analysis +Project [cast(9223372036854775807 as bigint) AS CAST(9223372036854775807 AS BIGINT)#xL] ++- OneRowRelation + + +-- !query +SELECT CAST('9223372036854775808' AS long) +-- !query analysis +Project [cast(9223372036854775808 as bigint) AS CAST(9223372036854775808 AS BIGINT)#xL] ++- OneRowRelation + + +-- !query +SELECT HEX(CAST('abc' AS binary)) +-- !query analysis +Project [hex(cast(abc as binary)) AS hex(CAST(abc AS BINARY))#x] ++- OneRowRelation + + +-- !query +SELECT HEX(CAST(CAST(123 AS byte) AS binary)) +-- !query analysis +org.apache.spark.sql.AnalysisException +{ + "errorClass" : "DATATYPE_MISMATCH.CAST_WITH_CONF_SUGGESTION", + "sqlState" : "42K09", + "messageParameters" : { +"config" : "\"spark.sql.ansi.enabled\"", +"configVal" : "'false'", +"sqlExpr" : "\"CAST(CAST(123 AS TINYINT) AS BINARY)
[GitHub] [spark] zhengruifeng opened a new pull request, #40501: [SPARK-42864][ML] Make IsotonicRegression.PointsAccumulator private
zhengruifeng opened a new pull request, #40501: URL: https://github.com/apache/spark/pull/40501 ### What changes were proposed in this pull request? Make `IsotonicRegression.PointsAccumulator` private, which was introduced in https://github.com/apache/spark/commit/3d05c7e037eff79de8ef9f6231aca8340bcc65ef ### Why are the changes needed? `PointsAccumulator` is an implementation detail and should not be exposed. ### Does this PR introduce _any_ user-facing change? no ### How was this patch tested? existing UT
[GitHub] [spark] xinrong-meng commented on pull request #40500: [SPARK-42864][ML][3.4] Make `IsotonicRegression.PointsAccumulator` private
xinrong-meng commented on PR #40500: URL: https://github.com/apache/spark/pull/40500#issuecomment-1477276287 LGTM, thank you @zhengruifeng !
[GitHub] [spark] amaliujia commented on pull request #40485: [SPARK-42870][CONNECT] Move `toCatalystValue` to `connect-common`
amaliujia commented on PR #40485: URL: https://github.com/apache/spark/pull/40485#issuecomment-1477271752 late LGTM
[GitHub] [spark] ueshin commented on pull request #40402: [SPARK-42020][CONNECT][PYTHON] Support UserDefinedType in Spark Connect
ueshin commented on PR #40402: URL: https://github.com/apache/spark/pull/40402#issuecomment-1477260665 @zhengruifeng ah, seems like something is wrong when the schema is a column name list. Could you use `StructType` to specify the schema as a workaround? I'll take a look later.
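A sketch of the suggested workaround, assuming the goal is simply to bypass the column-name-list schema path: pass an explicit `StructType` that declares the UDT.

```python
# Sketch of the workaround: give createDataFrame an explicit StructType with the
# UDT instead of a bare list of column names.
from pyspark.ml.linalg import Vectors, VectorUDT
from pyspark.sql.types import DoubleType, StructField, StructType

schema = StructType([
    StructField("label", DoubleType()),
    StructField("weight", DoubleType()),
    StructField("features", VectorUDT()),
])
df = spark.createDataFrame(
    [(1.0, 1.0, Vectors.dense(0.0, 5.0)), (0.0, 2.0, Vectors.dense(1.0, 2.0))],
    schema,
)
```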
[GitHub] [spark] srowen commented on pull request #40263: [SPARK-42659][ML] Reimplement `FPGrowthModel.transform` with dataframe operations
srowen commented on PR #40263: URL: https://github.com/apache/spark/pull/40263#issuecomment-1477246959 I don't know enough to say whether it's worth a new method. Can we start with the change that needs no new API? Is it a big enough win?
[GitHub] [spark] mskapilks commented on pull request #40266: [SPARK-42660][SQL] Infer filters for Join produced by IN and EXISTS clause (RewritePredicateSubquery rule)
mskapilks commented on PR #40266: URL: https://github.com/apache/spark/pull/40266#issuecomment-1477245713 > Looks like there are a few failures after moving the rule ([22e7886](https://github.com/apache/spark/commit/22e7886ff1059b98d1525380b2cb22718fd5dd09)). @mskapilks, do you think you can look into those failures? Yup, I am working on them. I had the wrong SPARK_HOME setup, so I missed the plan changes.
[GitHub] [spark] zhengruifeng commented on a diff in pull request #40500: [WIP][ML][3.4] Make `IsotonicRegression.PointsAccumulator` private
zhengruifeng commented on code in PR #40500: URL: https://github.com/apache/spark/pull/40500#discussion_r1142863924 ## mllib/src/main/scala/org/apache/spark/mllib/regression/IsotonicRegression.scala: ## @@ -490,15 +490,13 @@ class IsotonicRegression private (private var isotonic: Boolean) extends Seriali .sortBy(_._2) poolAdjacentViolators(parallelStepResult) } -} -object IsotonicRegression { /** * Utility class, holds a buffer of all points with unique features so far, and performs * weighted sum accumulation of points. Hides these details for better readability of the * main algorithm. */ - class PointsAccumulator { + private class PointsAccumulator { Review Comment: right now it is only used in `IsotonicRegression`
[GitHub] [spark] WeichenXu123 commented on a diff in pull request #40500: [WIP][ML][3.4] Make `IsotonicRegression.PointsAccumulator` private
WeichenXu123 commented on code in PR #40500: URL: https://github.com/apache/spark/pull/40500#discussion_r1142863633 ## mllib/src/main/scala/org/apache/spark/mllib/regression/IsotonicRegression.scala: ## @@ -490,15 +490,13 @@ class IsotonicRegression private (private var isotonic: Boolean) extends Seriali .sortBy(_._2) poolAdjacentViolators(parallelStepResult) } -} -object IsotonicRegression { /** * Utility class, holds a buffer of all points with unique features so far, and performs * weighted sum accumulation of points. Hides these details for better readability of the * main algorithm. */ - class PointsAccumulator { + private class PointsAccumulator { Review Comment: private[ml] ?
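For readers less familiar with Scala access qualifiers, an illustrative sketch of the options discussed; the class and member names here are placeholders, not the real code:

```scala
package org.apache.spark.mllib.regression

// Placeholder sketch: `private` limits the nested class to the enclosing class,
// while a qualified modifier such as private[regression] keeps it visible to the
// whole enclosing package (private[mllib] or private[spark] would widen it further).
class EnclosingSketch {
  private class ClassOnlyAccumulator           // usable only inside EnclosingSketch
  private[regression] class PackageAccumulator // usable anywhere under ...mllib.regression
}
```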
[GitHub] [spark] itholic closed pull request #40270: [WIP][SPARK-42662][CONNECT][PYTHON][PS] Support `withSequenceColumn` as PySpark DataFrame internal function.
itholic closed pull request #40270: [WIP][SPARK-42662][CONNECT][PYTHON][PS] Support `withSequenceColumn` as PySpark DataFrame internal function. URL: https://github.com/apache/spark/pull/40270
[GitHub] [spark] itholic commented on pull request #40270: [WIP][SPARK-42662][CONNECT][PYTHON][PS] Support `withSequenceColumn` as PySpark DataFrame internal function.
itholic commented on PR #40270: URL: https://github.com/apache/spark/pull/40270#issuecomment-1477238760 As #40456 has been completed, I will resume this one. Let me close this PR for convenience and open a new one.
[GitHub] [spark] lyy-pineapple commented on a diff in pull request #38171: [SPARK-9213] [SQL] Improve regular expression performance (via joni)
lyy-pineapple commented on code in PR #38171: URL: https://github.com/apache/spark/pull/38171#discussion_r1142860063 ## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/regexpExpressionsJoni.scala: ## @@ -0,0 +1,471 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.sql.catalyst.expressions + +import java.util.Locale + +import scala.collection.JavaConverters._ + +import org.apache.commons.text.StringEscapeUtils +import org.jcodings.specific.UTF8Encoding +import org.joni.{Option, Regex, Syntax} + +import org.apache.spark.sql.catalyst.InternalRow +import org.apache.spark.sql.catalyst.expressions.codegen._ +import org.apache.spark.sql.catalyst.expressions.codegen.Block._ +import org.apache.spark.sql.catalyst.trees.TreePattern.{LIKE_FAMLIY, TreePattern} +import org.apache.spark.sql.catalyst.util.StringUtils +import org.apache.spark.sql.types._ +import org.apache.spark.unsafe.types.UTF8String + + +abstract class StringRegexExpressionJoni extends BinaryExpression + with ImplicitCastInputTypes with NullIntolerant with Predicate { + + def escape(v: Array[Byte]): Array[Byte] + def matches(regex: Regex, str: Array[Byte]): Boolean + + override def dataType: DataType = BooleanType + override def inputTypes: Seq[DataType] = Seq(StringType, StringType) + + // try cache foldable pattern + private lazy val cache: Regex = right match { +case p: Expression if p.foldable => + compile(p.eval().asInstanceOf[UTF8String].getBytes) +case _ => null + } + + protected def compile(pattern: Array[Byte]): Regex = if (pattern == null) { +null + } else { +// Let it raise exception if couldn't compile the regex string +val escapedPattern = escape(pattern) +new Regex(escapedPattern, 0, escapedPattern.length, + Option.NONE, UTF8Encoding.INSTANCE, Syntax.Java) + } + + protected def pattern(pattern: Array[Byte]) = if (cache == null) compile(pattern) else cache + + protected override def nullSafeEval(input1: Any, input2: Any): Any = { +val regex = pattern(input2.asInstanceOf[UTF8String].getBytes) +if(regex == null) { + null +} else { + matches(regex, input1.asInstanceOf[UTF8String].getBytes) +} + } + + override def sql: String = s"${left.sql} ${prettyName.toUpperCase(Locale.ROOT)} ${right.sql}" +} + +// scalastyle:off line.contains.tab +/** + * Simple RegEx pattern matching function + */ +@ExpressionDescription( + usage = "str _FUNC_ pattern[ ESCAPE escape] - Returns true if str matches `pattern` with " + +"`escape`, null if any arguments are null, false otherwise.", + arguments = """ +Arguments: + * str - a string expression + * pattern - a string expression. 
The pattern is a string which is matched literally, with + exception to the following special symbols: + exception to the following special symbols: +_ matches any one character in the input (similar to . in posix regular expressions) + % matches zero or more characters in the input (similar to .* in posix regular + expressions) + expressions) + Since Spark 2.0, string literals are unescaped in our SQL parser. For example, in order + to match "\abc", the pattern should be "\\abc". + When SQL config 'spark.sql.parser.escapedStringLiterals' is enabled, it fallbacks + to Spark 1.6 behavior regarding string literal parsing. For example, if the config is + enabled, the pattern to match "\abc" should be "\abc". + * escape - an character added since Spark 3.0. The default escape character is the '\'. + If an escape character precedes a special symbol or another escape character, the + following character is matched literally. It is invalid to escape any other character. + """, + examples = """ +Examples: + > SELECT _FUNC_('Spark', '_park'); + true + > SET spark.sql.parser.escapedStringLiterals=true; + spark.sql.parser.escapedStringLiterals true + > SELECT '%SystemDrive%\Users\John' _FUNC_ '\%SystemDrive\%\\Users%'; + true + > SET spark.sql.parser.escapedStringLiterals=false; + spark.sql.parser.esc
[GitHub] [spark] zhengruifeng opened a new pull request, #40500: [WIP][ML][3.4] Make `IsotonicRegression.PointsAccumulator` private
zhengruifeng opened a new pull request, #40500: URL: https://github.com/apache/spark/pull/40500 ### What changes were proposed in this pull request? Make `IsotonicRegression.PointsAccumulator` private ### Why are the changes needed? `PointsAccumulator` is an implementation detail and should not be exposed. ### Does this PR introduce _any_ user-facing change? no ### How was this patch tested? existing UT
[GitHub] [spark] Stove-hust commented on a diff in pull request #40393: [SPARK-40082] Schedule mergeFinalize when push merge shuffleMapStage retry but no running tasks
Stove-hust commented on code in PR #40393: URL: https://github.com/apache/spark/pull/40393#discussion_r1142853303 ## pom.xml: ## @@ -114,7 +114,7 @@ 1.8 ${java.version} ${java.version} -3.8.7 +3.6.3 Review Comment: 😂 Forgot to change it back after local compilation
[GitHub] [spark] yabola commented on a diff in pull request #39950: [SPARK-42388][SQL] Avoid parquet footer reads twice when no filters in vectorized reader
yabola commented on code in PR #39950: URL: https://github.com/apache/spark/pull/39950#discussion_r1142848071 ## sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFileFormat.scala: ## @@ -205,11 +212,21 @@ class ParquetFileFormat val sharedConf = broadcastedHadoopConf.value.value - lazy val footerFileMetaData = -ParquetFooterReader.readFooter(sharedConf, filePath, SKIP_ROW_GROUPS).getFileMetaData + val fileRange = HadoopReadOptions.builder(sharedConf, split.getPath) Review Comment: Yes, before, I created and passed a `ParquetFileReader` because I wanted to create one less `file.newStream()` inside it (if there is no filter pushdown), but it doesn't seem to make much sense. I have changed it to pass the footer here. Please take a look, thank you!
[GitHub] [spark] amaliujia commented on a diff in pull request #40499: [SPARK-42876][SQL] DataType's physicalDataType should be private[sql]
amaliujia commented on code in PR #40499: URL: https://github.com/apache/spark/pull/40499#discussion_r1142847660 ## sql/catalyst/src/main/scala/org/apache/spark/sql/types/DataType.scala: ## @@ -119,7 +119,7 @@ abstract class DataType extends AbstractDataType { override private[sql] def acceptsType(other: DataType): Boolean = sameType(other) - def physicalDataType: PhysicalDataType = UninitializedPhysicalType + private[sql] def physicalDataType: PhysicalDataType = UninitializedPhysicalType Review Comment: done
[GitHub] [spark] zhengruifeng commented on pull request #40402: [SPARK-42020][CONNECT][PYTHON] Support UserDefinedType in Spark Connect
zhengruifeng commented on PR #40402: URL: https://github.com/apache/spark/pull/40402#issuecomment-1477209035 I saved a df with a UDT in vanilla PySpark and then read it back in the Python client, and it works fine. So I guess something is wrong in `createDataFrame`. Vanilla PySpark:
```
In [1]: from pyspark.ml.linalg import Vectors

In [2]: df = spark.createDataFrame([(1.0, 1.0, Vectors.dense(0.0, 5.0)), (0.0, 2.0, Vectors.dense(1.0, 2.0)), (1.0, 3.0, Vectors.dense(2.0, 1.0)), (0.0, 4.0, Vectors.dense(3.0, 3.0)),], ["label", "weight", "features"],)

In [3]: df.write.parquet("/tmp/tmp.pq")
```
Python client:
```
In [6]: df = spark.read.parquet("/tmp/tmp.pq")

In [7]: df.schema
Out[7]: StructType([StructField('label', DoubleType(), True), StructField('weight', DoubleType(), True), StructField('features', VectorUDT(), True)])

In [8]: df.collect()
Out[8]: [Row(label=0.0, weight=4.0, features=DenseVector([3.0, 3.0])), Row(label=0.0, weight=2.0, features=DenseVector([1.0, 2.0])), Row(label=1.0, weight=3.0, features=DenseVector([2.0, 1.0])), Row(label=1.0, weight=1.0, features=DenseVector([0.0, 5.0]))]
```
[GitHub] [spark] yabola commented on a diff in pull request #40495: test for reading footer within file range
yabola commented on code in PR #40495: URL: https://github.com/apache/spark/pull/40495#discussion_r1142841503 ## sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/parquet/ParquetPartitionReaderFactory.scala: ## @@ -92,8 +93,13 @@ case class ParquetPartitionReaderFactory( if (aggregation.isEmpty) { ParquetFooterReader.readFooter(conf, filePath, SKIP_ROW_GROUPS) } else { + val split = new FileSplit(file.toPath, file.start, file.length, Array.empty[String]) // For aggregate push down, we will get max/min/count from footer statistics. - ParquetFooterReader.readFooter(conf, filePath, NO_FILTER) Review Comment: @huaxingao Hi~ while reviewing the code, I have a point of doubt: why not take the RowGroup metrics information for the current file range here? I feel like there's going to be a problem here... Here is an example in `SpecificParquetRecordReaderBase`: https://github.com/apache/spark/blob/61035129a354d0b31c66908106238b12b1f2f7b0/sql/core/src/main/java/org/apache/spark/sql/execution/datasources/parquet/SpecificParquetRecordReaderBase.java#L96-L102
[GitHub] [spark] zhengruifeng commented on pull request #40402: [SPARK-42020][CONNECT][PYTHON] Support UserDefinedType in Spark Connect
zhengruifeng commented on PR #40402: URL: https://github.com/apache/spark/pull/40402#issuecomment-1477202593 @ueshin it seems that `createDataFrame` always uses the underlying `sqlType` rather than the UDT itself:
```
In [1]: from pyspark.ml.linalg import Vectors

In [2]: df = spark.createDataFrame([(1.0, 1.0, Vectors.dense(0.0, 5.0)), (0.0, 2.0, Vectors.dense(1.0, 2.0)), (1.0, 3.0, Vectors.dense(2.0, 1.0)), (0.0, 4.0, Vectors.dense(3.0, 3.0)),], ["label", "weight", "features"],)

In [3]: df.schema
Out[3]: StructType([StructField('label', DoubleType(), True), StructField('weight', DoubleType(), True), StructField('features', StructType([StructField('type', ByteType(), False), StructField('size', IntegerType(), True), StructField('indices', ArrayType(IntegerType(), False), True), StructField('values', ArrayType(DoubleType(), False), True)]), True)])

In [4]: df.collect()
Out[4]: [Row(label=1.0, weight=1.0, features=Row(type=1, size=None, indices=None, values=[0.0, 5.0])), Row(label=0.0, weight=2.0, features=Row(type=1, size=None, indices=None, values=[1.0, 2.0])), Row(label=1.0, weight=3.0, features=Row(type=1, size=None, indices=None, values=[2.0, 1.0])), Row(label=0.0, weight=4.0, features=Row(type=1, size=None, indices=None, values=[3.0, 3.0]))]
```
while in vanilla PySpark:
```
In [1]: from pyspark.ml.linalg import Vectors

In [2]: df = spark.createDataFrame([(1.0, 1.0, Vectors.dense(0.0, 5.0)), (0.0, 2.0, Vectors.dense(1.0, 2.0)), (1.0, 3.0, Vectors.dense(2.0, 1.0)), (0.0, 4.0, Vectors.dense(3.0, 3.0)),], ["label", "weight", "features"],)

In [3]: df.schema
Out[3]: StructType([StructField('label', DoubleType(), True), StructField('weight', DoubleType(), True), StructField('features', VectorUDT(), True)])

In [4]: df.collect()
Out[4]: [Row(label=1.0, weight=1.0, features=DenseVector([0.0, 5.0])), Row(label=0.0, weight=2.0, features=DenseVector([1.0, 2.0])), Row(label=1.0, weight=3.0, features=DenseVector([2.0, 1.0])), Row(label=0.0, weight=4.0, features=DenseVector([3.0, 3.0]))]
```
also cc @WeichenXu123
[GitHub] [spark] LuciferYang commented on a diff in pull request #40408: [SPARK-42780][BUILD] Upgrade `Tink` to 1.8.0
LuciferYang commented on code in PR #40408: URL: https://github.com/apache/spark/pull/40408#discussion_r1142839820 ## pom.xml: ## @@ -214,7 +214,7 @@ 1.1.0 1.5.0 1.60 -1.7.0 +1.8.0 Review Comment: @bjornjorgensen We can exclude `androidx.annotation:annotation` from `tink`, and then we should be able to remove the newly added maven repository. I have tested it: only `androidx.annotation:annotation` cannot be downloaded from the central repository, and excluding it does not cause UT failures. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] cloud-fan commented on a diff in pull request #40499: [SPARK-42876][SQL] DataType's physicalDataType should be private[sql]
cloud-fan commented on code in PR #40499: URL: https://github.com/apache/spark/pull/40499#discussion_r1142839025 ## sql/catalyst/src/main/scala/org/apache/spark/sql/types/DataType.scala: ## @@ -119,7 +119,7 @@ abstract class DataType extends AbstractDataType { override private[sql] def acceptsType(other: DataType): Boolean = sameType(other) - def physicalDataType: PhysicalDataType = UninitializedPhysicalType + private[sql] def physicalDataType: PhysicalDataType = UninitializedPhysicalType Review Comment: can we change the modifier in subclasses as well? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] cloud-fan commented on a diff in pull request #40488: [SPARK-42851][SQL] Replace EquivalentExpressions with mutable map in PhysicalAggregation
cloud-fan commented on code in PR #40488: URL: https://github.com/apache/spark/pull/40488#discussion_r1142837488 ## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/planning/patterns.scala: ## @@ -296,12 +298,17 @@ object PhysicalAggregation { // build a set of semantically distinct aggregate expressions and re-write expressions so // that they reference the single copy of the aggregate function which actually gets computed. // Non-deterministic aggregate expressions are not deduplicated. - val equivalentAggregateExpressions = new EquivalentExpressions + val equivalentAggregateExpressions = mutable.Map.empty[Expression, Expression] val aggregateExpressions = resultExpressions.flatMap { expr => expr.collect { - // addExpr() always returns false for non-deterministic expressions and do not add them. case a -if AggregateExpression.isAggregate(a) && !equivalentAggregateExpressions.addExpr(a) => Review Comment: what's wrong with `addExpr` here? It does simplify the code IMO. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] zhengruifeng commented on pull request #40263: [SPARK-42659][ML] Reimplement `FPGrowthModel.transform` with dataframe operations
zhengruifeng commented on PR #40263: URL: https://github.com/apache/spark/pull/40263#issuecomment-1477193966 TL;DR I want to apply a scalar subquery to optimize `FPGrowthModel.transform`; there are two options: 1, create temp views and use `spark.sql`, see https://github.com/apache/spark/commit/63595ba03d9f18fe0b43bfb09f974ea50cb2c651; 2, add `private[spark] def withScalarSubquery(colName: String, subquery: Dataset[_]): DataFrame`, which seems much more convenient, but I am not sure whether it is a proper way. cc @cloud-fan @HyukjinKwon -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
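For context, a rough sketch of what option 1 looks like (temp views plus `spark.sql` with a scalar subquery); the view names, columns, and data below are illustrative, not the actual `FPGrowthModel` code:

```scala
import org.apache.spark.sql.SparkSession

object ScalarSubquerySketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").getOrCreate()
    import spark.implicits._

    val dataset = Seq((1, "a"), (2, "b")).toDF("id", "items")
    val rules = Seq(("a", "b", 0.5)).toDF("antecedent", "consequent", "confidence")

    dataset.createOrReplaceTempView("dataset")
    rules.createOrReplaceTempView("rules")

    // The scalar subquery is evaluated once by the engine and attached as a
    // column of the main query.
    val withCount = spark.sql(
      """SELECT d.*, (SELECT count(*) FROM rules) AS num_rules
        |FROM dataset d""".stripMargin)
    withCount.show()
    spark.stop()
  }
}
```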
[GitHub] [spark] LuciferYang commented on pull request #40492: [SPARK-42791][SQL][FOLLOWUP] Re-generate golden files for `array_prepend`
LuciferYang commented on PR #40492: URL: https://github.com/apache/spark/pull/40492#issuecomment-1477180380 Thanks All ~ -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] beliefer commented on a diff in pull request #40467: [SPARK-42584][CONNECT] Improve output of `Column.explain`
beliefer commented on code in PR #40467: URL: https://github.com/apache/spark/pull/40467#discussion_r1142827066 ## connector/connect/common/src/main/protobuf/spark/connect/base.proto: ## @@ -163,6 +164,14 @@ message AnalyzePlanRequest { // (Required) The logical plan to get a hashCode. Plan plan = 1; } + + // Explains the expression based on extended is true or not. + message ExplainExpression { +// (Required) The expression to be analyzed. +Expression expr = 1; + +bool extended = 2; Review Comment: Thank you! I forgot it. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] beliefer commented on pull request #40467: [SPARK-42584][CONNECT] Improve output of `Column.explain`
beliefer commented on PR #40467: URL: https://github.com/apache/spark/pull/40467#issuecomment-1477176898 > Do we need python side compatible API? Maybe. I will check the python side. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] beliefer commented on a diff in pull request #40467: [SPARK-42584][CONNECT] Improve output of `Column.explain`
beliefer commented on code in PR #40467: URL: https://github.com/apache/spark/pull/40467#discussion_r1142826734 ## connector/connect/client/jvm/src/main/scala/org/apache/spark/sql/Column.scala: ## @@ -1211,13 +1211,11 @@ class Column private[sql] (private[sql] val expr: proto.Expression) extends Logg * @group df_ops * @since 3.4.0 */ - def explain(extended: Boolean): Unit = { + def explain(extended: Boolean)(implicit spark: SparkSession): Unit = { Review Comment: Good question. It seems we must send the message to the server side, so we require a `SparkSession`. But the question is how to get a `SparkSession` in Connect's `Column`, or whether we should define the `SparkSession` in `Column` directly. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] beliefer commented on a diff in pull request #40466: [SPARK-42835][SQL][TESTS] Add test cases for `Column.explain`
beliefer commented on code in PR #40466: URL: https://github.com/apache/spark/pull/40466#discussion_r1142825198 ## sql/core/src/test/scala/org/apache/spark/sql/ColumnExpressionSuite.scala: ## @@ -921,6 +922,132 @@ class ColumnExpressionSuite extends QueryTest with SharedSparkSession { } } + private def captureStdOut(block: => Unit): String = { +val capturedOut = new ByteArrayOutputStream() +Console.withOut(capturedOut)(block) +capturedOut.toString() + } + + test("explain") { Review Comment: It sounds better. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] yaooqinn commented on a diff in pull request #40476: [MINOR][BUILD] Remove unused properties in pom file
yaooqinn commented on code in PR #40476: URL: https://github.com/apache/spark/pull/40476#discussion_r1142823591 ## resource-managers/kubernetes/integration-tests/pom.xml: ## @@ -26,8 +26,6 @@ spark-kubernetes-integration-tests_2.12 -1.3.0 - Review Comment: So my guess is that `extraScalaTestArgs` can be helpful for local builds/tests, letting developers add custom Java options, agents, etc. It just defaults to empty; it is not useless. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] xinrong-meng commented on a diff in pull request #40487: [WIP] Implement CoGrouped Map API
xinrong-meng commented on code in PR #40487: URL: https://github.com/apache/spark/pull/40487#discussion_r1142823170 ## connector/connect/server/src/main/scala/org/apache/spark/sql/connect/planner/SparkConnectPlanner.scala: ## @@ -509,6 +511,26 @@ class SparkConnectPlanner(val session: SparkSession) { .logicalPlan } + private def transformCoGroupMap(rel: proto.GroupMap): LogicalPlan = { Review Comment: Thanks. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] beliefer commented on pull request #40418: [SPARK-42790][SQL] Abstract the excluded method for better test for JDBC docker tests.
beliefer commented on PR #40418: URL: https://github.com/apache/spark/pull/40418#issuecomment-1477169805 @srowen Thank you! @cloud-fan @huaxingao Thank you too! -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] zhengruifeng commented on pull request #40497: [SPARK-42875][CONNECT][PYTHON] Fix toPandas to handle timezone and map types properly
zhengruifeng commented on PR #40497: URL: https://github.com/apache/spark/pull/40497#issuecomment-1477168751 merged to master/branch-3.4 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] zhengruifeng closed pull request #40497: [SPARK-42875][CONNECT][PYTHON] Fix toPandas to handle timezone and map types properly
zhengruifeng closed pull request #40497: [SPARK-42875][CONNECT][PYTHON] Fix toPandas to handle timezone and map types properly URL: https://github.com/apache/spark/pull/40497 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] cloud-fan commented on pull request #40473: [SPARK-42851][SQL] Guard EquivalentExpressions.addExpr() with supportedExpression()
cloud-fan commented on PR #40473: URL: https://github.com/apache/spark/pull/40473#issuecomment-1477165259 The check was added to `getExprState` in https://github.com/apache/spark/pull/39010, which is to avoid canonicalizing a subquery expression and leading to NPE. I agree that we should be consistent and this PR LGTM. Can we update the test case to use `LambdaVariable` as `NamedLambdaVariable` has been removed in https://github.com/apache/spark/pull/40475 ? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] amaliujia opened a new pull request, #40499: [SPARK-42876][SQL] DataType's physicalDataType should be private[sql]
amaliujia opened a new pull request, #40499: URL: https://github.com/apache/spark/pull/40499 ### What changes were proposed in this pull request? `physicalDataType` should not be a public API; it should be private[sql]. ### Why are the changes needed? This limits the API scope so that unnecessary APIs are not exposed publicly. ### Does this PR introduce _any_ user-facing change? No, since we have not released Spark 3.4.0 yet. ### How was this patch tested? N/A -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] StevenChenDatabricks commented on a diff in pull request #40385: [SPARK-42753] ReusedExchange refers to non-existent nodes
StevenChenDatabricks commented on code in PR #40385: URL: https://github.com/apache/spark/pull/40385#discussion_r1142811318 ## sql/core/src/main/scala/org/apache/spark/sql/execution/ExplainUtils.scala: ## @@ -73,14 +78,34 @@ object ExplainUtils extends AdaptiveSparkPlanHelper { */ def processPlan[T <: QueryPlan[T]](plan: T, append: String => Unit): Unit = { try { + // Initialize a reference-unique set of Operators to avoid accdiental overwrites and to allow + // intentional overwriting of IDs generated in previous AQE iteration + val operators = newSetFromMap[QueryPlan[_]](new IdentityHashMap()) + // Initialize an array of ReusedExchanges to help find Adaptively Optimized Out + // Exchanges as part of SPARK-42753 + val reusedExchanges = ArrayBuffer.empty[ReusedExchangeExec] Review Comment: This `ArrayBuffer` is guaranteed unique because we only insert `ReusedExchange` nodes in the `setOpId` function which checks against the `IdentityHashMap`. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] StevenChenDatabricks commented on a diff in pull request #40385: [SPARK-42753] ReusedExchange refers to non-existent nodes
StevenChenDatabricks commented on code in PR #40385: URL: https://github.com/apache/spark/pull/40385#discussion_r1142810230 ## sql/core/src/main/scala/org/apache/spark/sql/execution/ExplainUtils.scala: ## @@ -73,14 +78,34 @@ object ExplainUtils extends AdaptiveSparkPlanHelper { */ def processPlan[T <: QueryPlan[T]](plan: T, append: String => Unit): Unit = { try { + // Initialize a reference-unique set of Operators to avoid accdiental overwrites and to allow + // intentional overwriting of IDs generated in previous AQE iteration + val operators = newSetFromMap[QueryPlan[_]](new IdentityHashMap()) Review Comment: `SparkPlan` has an ID, but not all `QueryPlan`s have one, and this function accepts any `QueryPlan`. I attempted creating a HashMap of IDs and casting the `plan` argument to a `SparkPlan`, but it would fail some test cases where the node isn't a `SparkPlan`. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
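A small, self-contained sketch of the reference-identity set being discussed (the `Node` class is made up for illustration): structurally equal objects are still added separately unless they are the very same reference, which is what lets the quoted code overwrite IDs intentionally while guarding against accidental overwrites.

```scala
import java.util.{Collections, IdentityHashMap}

object IdentitySetSketch {
  def main(args: Array[String]): Unit = {
    case class Node(id: Int)

    // Membership is decided by reference identity, not equals/hashCode.
    val seen = Collections.newSetFromMap(new IdentityHashMap[Node, java.lang.Boolean]())

    val a = Node(1)
    val b = Node(1)      // equal to `a`, but a different object

    println(seen.add(a)) // true  -- first time this reference is seen
    println(seen.add(a)) // false -- same reference, already present
    println(seen.add(b)) // true  -- equal value, different reference, still added
  }
}
```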
[GitHub] [spark] Kimahriman commented on pull request #40473: [SPARK-42851][SQL] Guard EquivalentExpressions.addExpr() with supportedExpression()
Kimahriman commented on PR #40473: URL: https://github.com/apache/spark/pull/40473#issuecomment-1477140427 > @Kimahriman I'd love to see a good CSE implementation for higher-order functions too. But for backporting the fix (which is this PR's primary intent) that would have been too much. For this one (or the one @peter-toth forked off) we're just aiming for a narrow fix that allows the aggregate to work again. Yeah I was just commenting on the related PR that broke CSE for anything using a HOF. I had plans for trying to do CSE inside a HOF, but that stalled when I didn't get any traction on initially adding codegen support. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] rednaxelafx commented on pull request #40473: [SPARK-42851][SQL] Guard EquivalentExpressions.addExpr() with supportedExpression()
rednaxelafx commented on PR #40473: URL: https://github.com/apache/spark/pull/40473#issuecomment-1477136150 @Kimahriman I'd love to see a good CSE implementation for higher-order functions too. But for backporting the fix (which is this PR's primary intent) that would have been too much. For this one (or the one @peter-toth forked off) we're just aiming for a narrow fix that allows the aggregate to work again. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] rednaxelafx commented on pull request #40473: [SPARK-42851][SQL] Guard EquivalentExpressions.addExpr() with supportedExpression()
rednaxelafx commented on PR #40473: URL: https://github.com/apache/spark/pull/40473#issuecomment-1477132202 @peter-toth could you please clarify why `supportedExpression()` was needed in `getExprState()` in the first place? i.e. why isn't it sufficient to add it to `addExprTree()`? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] rednaxelafx commented on pull request #40488: [SPARK-42851][SQL] Replace EquivalentExpressions with mutable map in PhysicalAggregation
rednaxelafx commented on PR #40488: URL: https://github.com/apache/spark/pull/40488#issuecomment-1477131544 Before the recent rounds of changes to EquivalentExpressions, the old `addExprTree` used to call `addExpr` in its core: https://github.com/apache/spark/blob/branch-2.1/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/EquivalentExpressions.scala#L90 That was still the case when `PhysicalAggregation` started using this mechanism to deduplicate expressions. I guess it started becoming "detached" from the main path when the recent refactoring happened that allows updating a separate equivalence map instead of the "main" one. Your proposed PR here further orphans that function from any actual use. Which is okay for keeping binary compatibility as much as possible. The inlined dedup logic in `PhysicalAggregation` looks rather ugly though. I don't have a strong opinion about that as long as other reviewers are okay. I'd prefer still retaining some sort of encapsulated collection for the dedup usage. BTW I updated my PR's test case because it makes more sense to check the return value from `addExpr` on the second invocation rather than on the first (both "not supported expression" and actual new expression cases would have gotten a `false` return value if we add that guard to the `addExpr()` function). https://github.com/apache/spark/pull/40473/commits/28d101ee6765c5453189fa62d6b8ade1568d99d2 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
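A tiny runnable illustration of the dedup technique the inlined logic relies on; strings stand in for Catalyst expressions and lower-casing stands in for `canonicalized`, so this is not the PR's actual code:

```scala
import scala.collection.mutable

object DedupByCanonicalForm {
  def main(args: Array[String]): Unit = {
    val equivalent = mutable.Map.empty[String, String]
    val exprs = Seq("SUM(a)", "sum(A)", "COUNT(b)")

    // getOrElseUpdate registers the first occurrence and returns the stored
    // representative for every later semantically-equal expression.
    val deduped = exprs.map(e => equivalent.getOrElseUpdate(e.toLowerCase, e))

    println(deduped) // List(SUM(a), SUM(a), COUNT(b))
  }
}
```

Per the comment in the quoted `PhysicalAggregation` diff, the real code additionally leaves non-deterministic aggregate expressions out of the deduplication.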
[GitHub] [spark] github-actions[bot] commented on pull request #38534: [SPARK-38505][SQL] Make partial aggregation adaptive
github-actions[bot] commented on PR #38534: URL: https://github.com/apache/spark/pull/38534#issuecomment-1477119915 We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable. If you'd like to revive this PR, please reopen it and ask a committer to remove the Stale tag! -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] github-actions[bot] closed pull request #38661: [SPARK-41085][SQL] Support Bit manipulation function COUNTSET
github-actions[bot] closed pull request #38661: [SPARK-41085][SQL] Support Bit manipulation function COUNTSET URL: https://github.com/apache/spark/pull/38661 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] github-actions[bot] commented on pull request #38608: [SPARK-41080][SQL] Support Bit manipulation function SETBIT
github-actions[bot] commented on PR #38608: URL: https://github.com/apache/spark/pull/38608#issuecomment-1477119894 We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable. If you'd like to revive this PR, please reopen it and ask a committer to remove the Stale tag! -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] WeichenXu123 commented on a diff in pull request #40479: [CONNECT][ML][WIP] Spark connect ML for scala client
WeichenXu123 commented on code in PR #40479: URL: https://github.com/apache/spark/pull/40479#discussion_r1142784162 ## connector/connect/client/jvm/src/main/scala/org/apache/spark/ml/Pipeline.scala: ## @@ -17,47 +17,13 @@ package org.apache.spark.ml -import org.apache.spark.annotation.DeveloperApi import org.apache.spark.internal.Logging import org.apache.spark.ml.param.{ParamMap, Params} -import org.apache.spark.sql.types.StructType /** * A stage in a pipeline, either an [[Estimator]] or a [[Transformer]]. */ abstract class PipelineStage extends Params with Logging { - /** - * Check transform validity and derive the output schema from the input schema. - * - * We check validity for interactions between parameters during `transformSchema` and raise an - * exception if any parameter value is invalid. Parameter value checks which do not depend on - * other parameters are handled by `Param.validate()`. - * - * Typical implementation should first conduct verification on schema change and parameter - * validity, including complex parameter interaction checks. - */ - def transformSchema(schema: StructType): StructType Review Comment: Similarly, on the PySpark side, `Transformer`/`JavaTransformer` does not do any schema transformation either. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] amaliujia commented on a diff in pull request #40498: [WIP] reader table API could also accept options
amaliujia commented on code in PR #40498: URL: https://github.com/apache/spark/pull/40498#discussion_r1142782057 ## connector/connect/common/src/main/protobuf/spark/connect/relations.proto: ## @@ -148,6 +143,13 @@ message Read { // This is only supported by the JDBC data source. repeated string predicates = 5; } + + // Options for data sources and named table. + // + // When using for data sources, the context of this map varies based on the + // data source format. This options could be empty for valid data source format. + // The map key is case insensitive. + map options = 3; Review Comment: Sounds good. I will probably go with the non-breaking approach that only adds `options` to `NamedTable`. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] grundprinzip commented on a diff in pull request #40498: [WIP] reader table API could also accept options
grundprinzip commented on code in PR #40498: URL: https://github.com/apache/spark/pull/40498#discussion_r1142781033 ## connector/connect/common/src/main/protobuf/spark/connect/relations.proto: ## @@ -148,6 +143,13 @@ message Read { // This is only supported by the JDBC data source. repeated string predicates = 5; } + + // Options for data sources and named table. + // + // When using for data sources, the context of this map varies based on the + // data source format. This options could be empty for valid data source format. + // The map key is case insensitive. + map options = 3; Review Comment: I agree that adding the options to named table is probably the better approach given that we have many relations. Alternatively simply support both. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] grundprinzip commented on a diff in pull request #40498: [WIP] reader table API could also accept options
grundprinzip commented on code in PR #40498: URL: https://github.com/apache/spark/pull/40498#discussion_r1142780483 ## connector/connect/common/src/main/protobuf/spark/connect/relations.proto: ## @@ -148,6 +143,13 @@ message Read { // This is only supported by the JDBC data source. repeated string predicates = 5; } + + // Options for data sources and named table. + // + // When using for data sources, the context of this map varies based on the + // data source format. This options could be empty for valid data source format. + // The map key is case insensitive. + map options = 3; Review Comment: You must not break. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] zhenlineo commented on a diff in pull request #40498: [WIP] reader table API could also accept options
zhenlineo commented on code in PR #40498: URL: https://github.com/apache/spark/pull/40498#discussion_r1142775827 ## connector/connect/common/src/main/protobuf/spark/connect/relations.proto: ## @@ -148,6 +143,13 @@ message Read { // This is only supported by the JDBC data source. repeated string predicates = 5; } + + // Options for data sources and named table. + // + // When using for data sources, the context of this map varies based on the + // data source format. This options could be empty for valid data source format. + // The map key is case insensitive. + map options = 3; Review Comment: What's the policy around "breaking changes"? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] zhenlineo commented on a diff in pull request #40498: [WIP] reader table API could also accept options
zhenlineo commented on code in PR #40498: URL: https://github.com/apache/spark/pull/40498#discussion_r1142775413 ## connector/connect/client/jvm/src/main/scala/org/apache/spark/sql/DataFrameReader.scala: ## @@ -183,7 +183,7 @@ class DataFrameReader private[sql] (sparkSession: SparkSession) extends Logging dataSourceBuilder.setFormat(source) userSpecifiedSchema.foreach(schema => dataSourceBuilder.setSchema(schema.toDDL)) extraOptions.foreach { case (k, v) => -dataSourceBuilder.putOptions(k, v) +builder.getReadBuilder.putOptions(k, v) Review Comment: +1 Can you also add a simple test `reader.options().table()`? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] ueshin commented on a diff in pull request #40498: [WIP] reader table API could also accept options
ueshin commented on code in PR #40498: URL: https://github.com/apache/spark/pull/40498#discussion_r1142758041 ## connector/connect/client/jvm/src/main/scala/org/apache/spark/sql/DataFrameReader.scala: ## @@ -183,7 +183,7 @@ class DataFrameReader private[sql] (sparkSession: SparkSession) extends Logging dataSourceBuilder.setFormat(source) userSpecifiedSchema.foreach(schema => dataSourceBuilder.setSchema(schema.toDDL)) extraOptions.foreach { case (k, v) => -dataSourceBuilder.putOptions(k, v) +builder.getReadBuilder.putOptions(k, v) Review Comment: should modify `table` method as well? ## connector/connect/common/src/main/protobuf/spark/connect/relations.proto: ## @@ -148,6 +143,13 @@ message Read { // This is only supported by the JDBC data source. repeated string predicates = 5; } + + // Options for data sources and named table. + // + // When using for data sources, the context of this map varies based on the + // data source format. This options could be empty for valid data source format. + // The map key is case insensitive. + map options = 3; Review Comment: I guess just adding the field to `NamedTable` is simpler and doesn't break anything? ## python/pyspark/sql/connect/plan.py: ## @@ -293,7 +293,7 @@ def plan(self, session: "SparkConnectClient") -> proto.Relation: plan.read.data_source.schema = self._schema if self._options is not None and len(self._options) > 0: for k, v in self._options.items(): -plan.read.data_source.options[k] = v +plan.read.options[k] = v Review Comment: should modify `Read` as well? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] amaliujia opened a new pull request, #40498: [WIP] reader table API could also accept options
amaliujia opened a new pull request, #40498: URL: https://github.com/apache/spark/pull/40498 ### What changes were proposed in this pull request? It turns out that `spark.read.option.table` is a valid call chain and the `table` API does accept options when opening a table. The existing Spark Connect implementation does not take this into account. ### Why are the changes needed? Feature parity. ### Does this PR introduce _any_ user-facing change? NO ### How was this patch tested? UT -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
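For illustration, the call chain under discussion as it looks in the Scala client; the option key and table name are made up, and whether a given option is honored depends on the table's data source:

```scala
import org.apache.spark.sql.SparkSession

object ReadTableWithOptions {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").getOrCreate()

    // Options set on the reader should be carried along when reading a named table.
    val df = spark.read
      .option("mergeSchema", "true") // illustrative option only
      .table("some_db.some_table")   // assumes this table exists in the catalog

    df.printSchema()
    spark.stop()
  }
}
```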
[GitHub] [spark] JohnTortugo commented on pull request #40225: [SPARK-42625][BUILD] Upgrade `zstd-jni` to 1.5.4-2
JohnTortugo commented on PR #40225: URL: https://github.com/apache/spark/pull/40225#issuecomment-1477026864 Thanks a lot! -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] dongjoon-hyun commented on pull request #40225: [SPARK-42625][BUILD] Upgrade `zstd-jni` to 1.5.4-2
dongjoon-hyun commented on PR #40225: URL: https://github.com/apache/spark/pull/40225#issuecomment-1477024392 Here is the official documentation on how to run the benchmarks in your GitHub Actions. Please see the `Running benchmarks in your forked repository` section. - https://spark.apache.org/developer-tools.html In addition, you can check the GitHub Actions benchmark job to see the logic. - https://github.com/apache/spark/blob/master/.github/workflows/benchmark.yml Lastly, each benchmark file has command-line directions in its header, @JohnTortugo. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] JohnTortugo commented on pull request #40225: [SPARK-42625][BUILD] Upgrade `zstd-jni` to 1.5.4-2
JohnTortugo commented on PR #40225: URL: https://github.com/apache/spark/pull/40225#issuecomment-1477003627 Hey @dongjoon-hyun - Can you please point me to some documentation about how this benchmarking is done? I'd like to run the same benchmark locally. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] ueshin opened a new pull request, #40497: [SPARK-42875][CONNECT][PYTHON] Fix toPandas to handle timezone and map types properly
ueshin opened a new pull request, #40497: URL: https://github.com/apache/spark/pull/40497 ### What changes were proposed in this pull request? Fix `DataFrame.toPandas()` to handle timezone and map types properly. ### Why are the changes needed? Currently `DataFrame.toPandas()` doesn't handle timezone for timestamp type, and map types properly. For example: ```py >>> schema = StructType().add("ts", TimestampType()) >>> spark.createDataFrame([(datetime(1969, 1, 1, 1, 1, 1),), (datetime(2012, 3, 3, 3, 3, 3),), (datetime(2100, 4, 4, 4, 4, 4),)], schema).toPandas() ts 0 1969-01-01 01:01:01-08:00 1 2012-03-03 03:03:03-08:00 2 2100-04-04 03:04:04-08:00 ``` which should be: ```py ts 0 1969-01-01 01:01:01 1 2012-03-03 03:03:03 2 2100-04-04 04:04:04 ``` ### Does this PR introduce _any_ user-facing change? The result of `DataFrame.toPandas()` with timestamp type and map type will be the same as PySpark. ### How was this patch tested? Enabled the related tests. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] HyukjinKwon commented on pull request #40496: [SPARK-42874][SQL] Enable new golden file test framework for analysis for all input files
HyukjinKwon commented on PR #40496: URL: https://github.com/apache/spark/pull/40496#issuecomment-1476856830 LGTM if tests pass -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] HyukjinKwon closed pull request #40485: [SPARK-42870][CONNECT] Move `toCatalystValue` to `connect-common`
HyukjinKwon closed pull request #40485: [SPARK-42870][CONNECT] Move `toCatalystValue` to `connect-common` URL: https://github.com/apache/spark/pull/40485 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] HyukjinKwon commented on pull request #40485: [SPARK-42870][CONNECT] Move `toCatalystValue` to `connect-common`
HyukjinKwon commented on PR #40485: URL: https://github.com/apache/spark/pull/40485#issuecomment-1476846845 Merged to master. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] mridulm commented on a diff in pull request #40393: [SPARK-40082] Schedule mergeFinalize when push merge shuffleMapStage retry but no running tasks
mridulm commented on code in PR #40393: URL: https://github.com/apache/spark/pull/40393#discussion_r1142572022 ## core/src/test/scala/org/apache/spark/scheduler/DAGSchedulerSuite.scala: ## @@ -4595,6 +4595,184 @@ class DAGSchedulerSuite extends SparkFunSuite with TempLocalSparkContext with Ti } } + test("SPARK-40082: recomputation of shuffle map stage with no pending partitions" + +"should not hang") { + +initPushBasedShuffleConfs(conf) +conf.set(config.SHUFFLE_MERGER_LOCATIONS_MIN_STATIC_THRESHOLD, 3) +DAGSchedulerSuite.clearMergerLocs() +DAGSchedulerSuite.addMergerLocs(Seq("host1", "host2", "host3")) + +val rddA = new MyRDD(sc, 2, Nil) +val shuffleDepA = new ShuffleDependency(rddA, new HashPartitioner(2)) +val shuffleIdA = shuffleDepA.shuffleId + +val rddB = new MyRDD(sc, 2, List(shuffleDepA), tracker = mapOutputTracker) +val shuffleDepB = new ShuffleDependency(rddB, new HashPartitioner(2)) + +val rddC = new MyRDD(sc, 2, List(shuffleDepB), tracker = mapOutputTracker) + +submit(rddC, Array(0, 1)) + +completeShuffleMapStageSuccessfully(0, 0, 2, Seq("hostA", "hostA")) + +// Fetch failed +runEvent(makeCompletionEvent( + taskSets(1).tasks(0), + FetchFailed(makeBlockManagerId("hostC"), shuffleIdA, 0L, 0, 0, +"Fetch failure of task: stageId=1, stageAttempt=0, partitionId=0"), + result = null)) + +// long running task complete +completeShuffleMapStageSuccessfully(1, 0, 2, Seq("hostA", "hostA")) +assert(!shuffleDepB.shuffleMergeFinalized) + +// stage1`s tasks have all completed +val shuffleStage1 = scheduler.stageIdToStage(1).asInstanceOf[ShuffleMapStage] +assert(shuffleStage1.pendingPartitions.isEmpty) + +// resubmit +scheduler.resubmitFailedStages() + +// complete parentStage0 +completeShuffleMapStageSuccessfully(0, 0, 2, Seq("hostA", "hostA")) + +// stage1 should be shuffleMergeFinalized +assert(shuffleDepB.shuffleMergeFinalized) + } + + for (pushBasedShuffleEnabled <- Seq(true, false)) { +test("SPARK-40082: recomputation of shuffle map stage with no pending partitions should not " + + s"hang. pushBasedShuffleEnabled = $pushBasedShuffleEnabled") { Review Comment: Good question - I have not tried that :-) This is a pattern used for other tests as well when we want to do a config sweep. Does specifying pushBasedShuffleEnabled = true in that string work ? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] tgravescs commented on pull request #39127: [SPARK-41585][YARN] The Spark exclude node functionality for YARN should work independently of dynamic allocation
tgravescs commented on PR #39127: URL: https://github.com/apache/spark/pull/39127#issuecomment-1476782666 merged to master -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] asfgit closed pull request #39127: [SPARK-41585][YARN] The Spark exclude node functionality for YARN should work independently of dynamic allocation
asfgit closed pull request #39127: [SPARK-41585][YARN] The Spark exclude node functionality for YARN should work independently of dynamic allocation URL: https://github.com/apache/spark/pull/39127 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] MaxGekk closed pull request #40492: [SPARK-42791][SQL][FOLLOWUP] Re-generate golden files for `array_prepend`
MaxGekk closed pull request #40492: [SPARK-42791][SQL][FOLLOWUP] Re-generate golden files for `array_prepend` URL: https://github.com/apache/spark/pull/40492 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] aokolnychyi commented on pull request #40478: [SPARK-42779][SQL][FOLLOWUP] Allow V2 writes to indicate advisory shuffle partition size
aokolnychyi commented on PR #40478: URL: https://github.com/apache/spark/pull/40478#issuecomment-1476769291 Thanks, @dongjoon-hyun @cloud-fan! -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] MaxGekk commented on pull request #40492: [SPARK-42791][SQL][FOLLOWUP] Re-generate golden files for `array_prepend`
MaxGekk commented on PR #40492: URL: https://github.com/apache/spark/pull/40492#issuecomment-1476767577 +1, LGTM. Merging to master. Thank you, @LuciferYang and @dongjoon-hyun @gengliangwang @dtenedor for review. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] LuciferYang commented on pull request #40492: [SPARK-42791][SQL][FOLLOWUP] Re-generate golden files for `array_prepend`
LuciferYang commented on PR #40492: URL: https://github.com/apache/spark/pull/40492#issuecomment-1476757882 GA passed -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] ueshin commented on a diff in pull request #40467: [SPARK-42584][CONNECT] Improve output of `Column.explain`
ueshin commented on code in PR #40467: URL: https://github.com/apache/spark/pull/40467#discussion_r1142537848 ## connector/connect/client/jvm/src/main/scala/org/apache/spark/sql/Column.scala: ## @@ -1211,13 +1211,11 @@ class Column private[sql] (private[sql] val expr: proto.Expression) extends Logg * @group df_ops * @since 3.4.0 */ - def explain(extended: Boolean): Unit = { + def explain(extended: Boolean)(implicit spark: SparkSession): Unit = { Review Comment: Who will provide the implicit `SparkSession` in the users' context? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] dtenedor opened a new pull request, #40496: [SPARK-42874][SQL] Enable new golden file test framework for analysis for all input files
dtenedor opened a new pull request, #40496: URL: https://github.com/apache/spark/pull/40496 ### What changes were proposed in this pull request? This PR enables the new golden file test framework for analysis for all input files. Background: * In https://github.com/apache/spark/pull/40449 we added the ability to exercise the analyzer on the SQL queries in existing golden files in the `sql/core/src/test/resources/sql-tests/inputs` directory, writing separate output test files in the new `sql/core/src/test/resources/sql-tests/analyzer-results` directory in addition to the original output directory for full end-to-end query execution results. * That PR also added an allowlist of input files to include in this new dual-run mode. * In this PR, we remove that allowlist and exercise the new dual-run mode for all the input files. We also extend the analyzer testing to support separate test cases for ANSI mode, TimestampNTZ, and UDFs. ### Why are the changes needed? This improves test coverage and helps prevent accidental regressions in the future as we edit the code. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? This PR adds testing only. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] MaxGekk commented on a diff in pull request #40493: modified for SPARK-42839: Assign a name to the error class _LEGACY_ER…
MaxGekk commented on code in PR #40493: URL: https://github.com/apache/spark/pull/40493#discussion_r1142511975

## sql/core/src/test/scala/org/apache/spark/sql/LegacyErrorTempSuit.scala:
@@ -0,0 +1,79 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+/*
+ * Created by ruilibuaa
+ * 2023-3-19
+ */
+
+package org.apache.spark.sql
+
+import org.apache.spark.SparkException
+import org.apache.spark.sql.expressions.UserDefinedFunction
+import org.apache.spark.sql.functions._
+import org.apache.spark.sql.test.SharedSparkSession
+
+
+class LegacyErrorTempSuit extends SharedSparkSession {
+
+
+  test("CANNOT_ZIP_MAPS") {

Review Comment:
Could you place the test in `QueryCompilationErrorsSuite`, please? Usually, we put tests for exceptions from:
- QueryParsingErrors to QueryParsingErrorsSuite
- QueryCompilationErrors to QueryCompilationErrorsSuite
- QueryExecutionErrors to QueryExecutionErrorsSuite

## core/src/main/resources/error/error-classes.json:
@@ -3684,7 +3684,7 @@
       ". If necessary set to false to bypass this error."
     ]
   },
-  "_LEGACY_ERROR_TEMP_2003" : {
+  "CANNOT_ZIP_MAPS" : {

Review Comment:
Highly likely, you need to reorder the error class, since the error classes in the file should be ordered alphabetically.

## sql/core/src/test/scala/org/apache/spark/sql/LegacyErrorTempSuit.scala:
@@ -0,0 +1,79 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+/*
+ * Created by ruilibuaa
+ * 2023-3-19
+ */
+
+package org.apache.spark.sql
+
+import org.apache.spark.SparkException
+import org.apache.spark.sql.expressions.UserDefinedFunction
+import org.apache.spark.sql.functions._
+import org.apache.spark.sql.test.SharedSparkSession
+
+
+class LegacyErrorTempSuit extends SharedSparkSession {
+
+
+  test("CANNOT_ZIP_MAPS") {
+
+    val spark = SparkSession.builder().getOrCreate()
+    import spark.implicits._
+
+    val maxRoundedArrayLength = 2
+
+    val zipMapsbyUniqueKey: UserDefinedFunction = udf(
+      (mapA: Map[String, Int], mapB: Map[String, Int]) => {
+        if (mapA.keySet.intersect(mapB.keySet).nonEmpty) {
+          throw new
+            IllegalArgumentException(s"Maps have overlapping keys")
+        }
+        val mergedMap = mapA ++ mapB
+        val mergeMapSize = mergedMap.size
+        // GlobalVariables.mergeMapSize = Some(mergeMapSize)
+        if (mergeMapSize > maxRoundedArrayLength) {
+          throw new
+            SparkException(s"Unsuccessful try to zip maps with $mergeMapSize unique keys " +
+              s"due to exceeding the array size limit $maxRoundedArrayLength.")
+        }
+        mergedMap
+      }
+    )
+
+    val data = Seq(
+      (1, Map("a" -> 1, "b" -> 2), Map("c" -> 3, "d" -> 4)),
+      (2, Map("e" -> 5, "f" -> 6), Map("g" -> 7, "h" -> 8))
+    )
+
+    val exception = intercept[SparkException] {
+      val df = data.toDF("id", "map1", "map2")
+      val zippedDf = df.withColumn("zipped_map", zipMapsbyUniqueKey(col("map1"), col("map2")))
+      zippedDf.collect()
+    }
+
+    val errorMessage = exception.getMessage
+    if (errorMessage.contains("exceeding the array size limit")) {
+      assert(errorMessage.contains(s"$maxRoundedArrayLength"))

Review Comment:
Please use `checkError` instead.

-- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
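As a reference for the `checkError` suggestion above, such an assertion typically looks roughly like the sketch below. The triggering statement (`triggeringQuery`) and the parameter names/values are placeholders; the real ones come from the entry for this error class in `error-classes.json`.

```scala
// Placeholder shape only: `triggeringQuery` and the parameter map are hypothetical.
checkError(
  exception = intercept[SparkException] {
    sql(triggeringQuery).collect()
  },
  errorClass = "CANNOT_ZIP_MAPS",
  parameters = Map("size" -> "3", "maxRoundedArrayLength" -> "2"))
```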
[GitHub] [spark] otterc commented on a diff in pull request #40393: [SPARK-40082] Schedule mergeFinalize when push merge shuffleMapStage retry but no running tasks
otterc commented on code in PR #40393: URL: https://github.com/apache/spark/pull/40393#discussion_r1142475865

## core/src/test/scala/org/apache/spark/scheduler/DAGSchedulerSuite.scala:
@@ -4595,6 +4595,184 @@ class DAGSchedulerSuite extends SparkFunSuite with TempLocalSparkContext with Ti
     }
   }

+  test("SPARK-40082: recomputation of shuffle map stage with no pending partitions" +
+    "should not hang") {
+
+    initPushBasedShuffleConfs(conf)
+    conf.set(config.SHUFFLE_MERGER_LOCATIONS_MIN_STATIC_THRESHOLD, 3)
+    DAGSchedulerSuite.clearMergerLocs()
+    DAGSchedulerSuite.addMergerLocs(Seq("host1", "host2", "host3"))
+
+    val rddA = new MyRDD(sc, 2, Nil)
+    val shuffleDepA = new ShuffleDependency(rddA, new HashPartitioner(2))
+    val shuffleIdA = shuffleDepA.shuffleId
+
+    val rddB = new MyRDD(sc, 2, List(shuffleDepA), tracker = mapOutputTracker)
+    val shuffleDepB = new ShuffleDependency(rddB, new HashPartitioner(2))
+
+    val rddC = new MyRDD(sc, 2, List(shuffleDepB), tracker = mapOutputTracker)
+
+    submit(rddC, Array(0, 1))
+
+    completeShuffleMapStageSuccessfully(0, 0, 2, Seq("hostA", "hostA"))
+
+    // Fetch failed
+    runEvent(makeCompletionEvent(
+      taskSets(1).tasks(0),
+      FetchFailed(makeBlockManagerId("hostC"), shuffleIdA, 0L, 0, 0,
+        "Fetch failure of task: stageId=1, stageAttempt=0, partitionId=0"),
+      result = null))
+
+    // long running task complete
+    completeShuffleMapStageSuccessfully(1, 0, 2, Seq("hostA", "hostA"))
+    assert(!shuffleDepB.shuffleMergeFinalized)
+
+    // stage1`s tasks have all completed
+    val shuffleStage1 = scheduler.stageIdToStage(1).asInstanceOf[ShuffleMapStage]
+    assert(shuffleStage1.pendingPartitions.isEmpty)
+
+    // resubmit
+    scheduler.resubmitFailedStages()
+
+    // complete parentStage0
+    completeShuffleMapStageSuccessfully(0, 0, 2, Seq("hostA", "hostA"))
+
+    // stage1 should be shuffleMergeFinalized
+    assert(shuffleDepB.shuffleMergeFinalized)
+  }
+
+  for (pushBasedShuffleEnabled <- Seq(true, false)) {
+    test("SPARK-40082: recomputation of shuffle map stage with no pending partitions should not " +
+      s"hang. pushBasedShuffleEnabled = $pushBasedShuffleEnabled") {

Review Comment:
How do we run just the individual tests?

## core/src/test/scala/org/apache/spark/scheduler/DAGSchedulerSuite.scala:
@@ -4595,6 +4595,184 @@ class DAGSchedulerSuite extends SparkFunSuite with TempLocalSparkContext with Ti
     }
   }

+  test("SPARK-40082: recomputation of shuffle map stage with no pending partitions" +
+    "should not hang") {

Review Comment:
nit: missing ` ` between `partitions` and `should`. Since this test is checking that the stage gets finalized, we should change the name to "SPARK-40082: re-computation of shuffle map stage with no pending partitions should finalize the stage"

-- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
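(As an aside on the question above: an individual ScalaTest case can usually be selected by a substring of its name, for example `build/sbt "core/testOnly *DAGSchedulerSuite -- -z SPARK-40082"`; the module name and filter string here are illustrative and may need adjusting for your environment.)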
[GitHub] [spark] bjornjorgensen commented on pull request #40494: [MINOR][DOCS] Fix typos
bjornjorgensen commented on PR #40494: URL: https://github.com/apache/spark/pull/40494#issuecomment-1476637999 @srowen FYI -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] StevenChenDatabricks commented on a diff in pull request #40385: [SPARK-42753] ReusedExchange refers to non-existent nodes
StevenChenDatabricks commented on code in PR #40385: URL: https://github.com/apache/spark/pull/40385#discussion_r114291

## sql/core/src/main/scala/org/apache/spark/sql/execution/ExplainUtils.scala:
@@ -119,17 +155,40 @@ object ExplainUtils extends AdaptiveSparkPlanHelper {
    * @param plan Input query plan to process
    * @param startOperatorID The start value of operation id. The subsequent operations will be
    *                        assigned higher value.
+   * @param visited A unique set of operators visited by generateOperatorIds. The set is scoped
+   *                at the callsite function processPlan. It serves two purposes: Firstly, it is
+   *                used to avoid accidentally overwriting existing IDs that were generated in
+   *                the same processPlan call. Secondly, it is used to allow for intentional ID
+   *                overwriting as part of SPARK-42753 where an Adaptively Optimized Out Exchange
+   *                and its subtree may contain IDs that were generated in a previous AQE
+   *                iteration's processPlan call which would result in incorrect IDs.
+   * @param reusedExchanges A unique set of ReusedExchange nodes visited which will be used to
+   *                        identify adaptively optimized out exchanges in SPARK-42753.
+   * @param addReusedExchanges Whether to add ReusedExchange nodes to the reusedExchanges set. We set it
+   *                           to false to avoid processing more nested ReusedExchange nodes in the
+   *                           subtree of an Adaptively Optimized Out Exchange.
    * @return The last generated operation id for this input plan. This is to ensure we always
    *         assign incrementing unique id to each operator.
    */
-  private def generateOperatorIDs(plan: QueryPlan[_], startOperatorID: Int): Int = {
+  private def generateOperatorIDs(
+      plan: QueryPlan[_],
+      startOperatorID: Int,
+      visited: Set[QueryPlan[_]],
+      reusedExchanges: ArrayBuffer[ReusedExchangeExec],
+      addReusedExchanges: Boolean): Int = {
     var currentOperationID = startOperatorID
     // Skip the subqueries as they are not printed as part of main query block.
     if (plan.isInstanceOf[BaseSubqueryExec]) {
       return currentOperationID
     }
-    def setOpId(plan: QueryPlan[_]): Unit = if (plan.getTagValue(QueryPlan.OP_ID_TAG).isEmpty) {

Review Comment:
I'm removing the "if OP_ID_TAG is empty" check because we now allow overwriting OP_ID_TAG: the subtree of an adaptively optimized-out Exchange may contain nodes whose OP_ID_TAG was generated by a processPlan call in a previous AQE iteration, so those IDs would be incorrect and need to be overwritten.

-- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
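A stripped-down sketch of the guard being described (not the actual `ExplainUtils` code; the helper shape and tag here are simplified stand-ins) might look like this: the visited set prevents renumbering nodes already handled in the current `processPlan` call, while anything outside that set, including stale IDs from an earlier AQE iteration, gets overwritten.

```scala
import scala.collection.mutable

import org.apache.spark.sql.catalyst.plans.QueryPlan
import org.apache.spark.sql.catalyst.trees.TreeNodeTag

// Simplified sketch of the overwrite-unless-visited logic discussed above
// (OP_ID_TAG here stands in for the tag used by ExplainUtils).
object OpIdSketch {
  private val OP_ID_TAG = TreeNodeTag[Int]("operatorId")

  def setOpId(
      plan: QueryPlan[_],
      visited: mutable.Set[QueryPlan[_]],
      nextOperatorId: () => Int): Unit = {
    if (!visited.contains(plan)) {
      visited += plan                                // already numbered in this processPlan call
      plan.setTagValue(OP_ID_TAG, nextOperatorId())  // may overwrite a stale ID from a prior AQE pass
    }
  }
}
```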
[GitHub] [spark] amaliujia commented on pull request #40467: [SPARK-42584][CONNECT] Improve output of `Column.explain`
amaliujia commented on PR #40467: URL: https://github.com/apache/spark/pull/40467#issuecomment-1476589894

Do we need a compatible Python-side API?

-- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] amaliujia commented on a diff in pull request #40467: [SPARK-42584][CONNECT] Improve output of `Column.explain`
amaliujia commented on code in PR #40467: URL: https://github.com/apache/spark/pull/40467#discussion_r1142418545

## connector/connect/common/src/main/protobuf/spark/connect/base.proto:
@@ -163,6 +164,14 @@ message AnalyzePlanRequest {
     // (Required) The logical plan to get a hashCode.
     Plan plan = 1;
   }
+
+  // Explains the expression based on whether `extended` is true or not.
+  message ExplainExpression {
+    // (Required) The expression to be analyzed.
+    Expression expr = 1;
+
+    bool extended = 2;

Review Comment:
Nit: please follow the style guide to indicate whether this field is required or optional, and also document its semantics.

-- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
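Following the convention already used for `expr` in the same message, the field comment might, for example, read `// (Optional) When true, the extended (verbose) explain output is generated; otherwise the simple form is used.` — the exact wording is up to the author.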
[GitHub] [spark] mridulm commented on pull request #40393: [SPARK-40082] Schedule mergeFinalize when push merge shuffleMapStage retry but no running tasks
mridulm commented on PR #40393: URL: https://github.com/apache/spark/pull/40393#issuecomment-1476585572 The test failure is unrelated to this PR - once the changes above are made, the reexecution should pass -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] mridulm commented on a diff in pull request #40393: [SPARK-40082] Schedule mergeFinalize when push merge shuffleMapStage retry but no running tasks
mridulm commented on code in PR #40393: URL: https://github.com/apache/spark/pull/40393#discussion_r1142362530

## core/src/test/scala/org/apache/spark/scheduler/DAGSchedulerSuite.scala:
@@ -4595,6 +4595,184 @@ class DAGSchedulerSuite extends SparkFunSuite with TempLocalSparkContext with Ti
     }
   }

+  test("SPARK-40082: recomputation of shuffle map stage with no pending partitions" +
+    "should not hang") {
+
+    initPushBasedShuffleConfs(conf)
+    conf.set(config.SHUFFLE_MERGER_LOCATIONS_MIN_STATIC_THRESHOLD, 3)
+    DAGSchedulerSuite.clearMergerLocs()
+    DAGSchedulerSuite.addMergerLocs(Seq("host1", "host2", "host3"))
+
+    val rddA = new MyRDD(sc, 2, Nil)
+    val shuffleDepA = new ShuffleDependency(rddA, new HashPartitioner(2))
+    val shuffleIdA = shuffleDepA.shuffleId
+
+    val rddB = new MyRDD(sc, 2, List(shuffleDepA), tracker = mapOutputTracker)
+    val shuffleDepB = new ShuffleDependency(rddB, new HashPartitioner(2))
+
+    val rddC = new MyRDD(sc, 2, List(shuffleDepB), tracker = mapOutputTracker)
+
+    submit(rddC, Array(0, 1))
+
+    completeShuffleMapStageSuccessfully(0, 0, 2, Seq("hostA", "hostA"))

Review Comment:
nit: For consistency, let us make it `hostA` and `hostB` ... and have `hostA` (say) fail.

## core/src/test/scala/org/apache/spark/scheduler/DAGSchedulerSuite.scala:
@@ -4595,6 +4595,184 @@ class DAGSchedulerSuite extends SparkFunSuite with TempLocalSparkContext with Ti
     }
   }

+  test("SPARK-40082: recomputation of shuffle map stage with no pending partitions" +
+    "should not hang") {
+
+    initPushBasedShuffleConfs(conf)
+    conf.set(config.SHUFFLE_MERGER_LOCATIONS_MIN_STATIC_THRESHOLD, 3)
+    DAGSchedulerSuite.clearMergerLocs()
+    DAGSchedulerSuite.addMergerLocs(Seq("host1", "host2", "host3"))
+
+    val rddA = new MyRDD(sc, 2, Nil)
+    val shuffleDepA = new ShuffleDependency(rddA, new HashPartitioner(2))
+    val shuffleIdA = shuffleDepA.shuffleId
+
+    val rddB = new MyRDD(sc, 2, List(shuffleDepA), tracker = mapOutputTracker)
+    val shuffleDepB = new ShuffleDependency(rddB, new HashPartitioner(2))
+
+    val rddC = new MyRDD(sc, 2, List(shuffleDepB), tracker = mapOutputTracker)
+
+    submit(rddC, Array(0, 1))
+
+    completeShuffleMapStageSuccessfully(0, 0, 2, Seq("hostA", "hostA"))
+
+    // Fetch failed
+    runEvent(makeCompletionEvent(
+      taskSets(1).tasks(0),
+      FetchFailed(makeBlockManagerId("hostC"), shuffleIdA, 0L, 0, 0,

Review Comment:
nit: Since map tasks did not run on `hostC`, let us change it to `hostA` as per comment above.

## pom.xml:
@@ -114,7 +114,7 @@
     1.8
     ${java.version}
     ${java.version}
-    3.8.7
+    3.6.3

Review Comment:
Revert this ?

## core/src/test/scala/org/apache/spark/scheduler/DAGSchedulerSuite.scala:
@@ -4595,6 +4595,184 @@ class DAGSchedulerSuite extends SparkFunSuite with TempLocalSparkContext with Ti
     }
   }

+  test("SPARK-40082: recomputation of shuffle map stage with no pending partitions" +
+    "should not hang") {
+
+    initPushBasedShuffleConfs(conf)
+    conf.set(config.SHUFFLE_MERGER_LOCATIONS_MIN_STATIC_THRESHOLD, 3)
+    DAGSchedulerSuite.clearMergerLocs()
+    DAGSchedulerSuite.addMergerLocs(Seq("host1", "host2", "host3"))
+
+    val rddA = new MyRDD(sc, 2, Nil)
+    val shuffleDepA = new ShuffleDependency(rddA, new HashPartitioner(2))
+    val shuffleIdA = shuffleDepA.shuffleId
+
+    val rddB = new MyRDD(sc, 2, List(shuffleDepA), tracker = mapOutputTracker)
+    val shuffleDepB = new ShuffleDependency(rddB, new HashPartitioner(2))
+
+    val rddC = new MyRDD(sc, 2, List(shuffleDepB), tracker = mapOutputTracker)
+
+    submit(rddC, Array(0, 1))
+
+    completeShuffleMapStageSuccessfully(0, 0, 2, Seq("hostA", "hostA"))
+
+    // Fetch failed
+    runEvent(makeCompletionEvent(
+      taskSets(1).tasks(0),
+      FetchFailed(makeBlockManagerId("hostC"), shuffleIdA, 0L, 0, 0,
+        "Fetch failure of task: stageId=1, stageAttempt=0, partitionId=0"),
+      result = null))
+
+    // long running task complete
+    completeShuffleMapStageSuccessfully(1, 0, 2, Seq("hostA", "hostA"))

Review Comment:
nit: let us change the host to `hostB` for successes

-- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
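Putting the three nits above together, the relevant lines of the test could look roughly like this (illustrative only; it relies on the helpers already defined in `DAGSchedulerSuite`): successes are reported on `hostA`/`hostB`, and the fetch failure is attributed to `hostA`, a host that actually ran a map task.

```scala
// Map tasks for stage 0 succeed on two distinct hosts.
completeShuffleMapStageSuccessfully(0, 0, 2, Seq("hostA", "hostB"))

// The fetch failure is now reported against hostA, which really served map output.
runEvent(makeCompletionEvent(
  taskSets(1).tasks(0),
  FetchFailed(makeBlockManagerId("hostA"), shuffleIdA, 0L, 0, 0,
    "Fetch failure of task: stageId=1, stageAttempt=0, partitionId=0"),
  result = null))

// The long-running tasks of stage 1 complete successfully on hostB.
completeShuffleMapStageSuccessfully(1, 0, 2, Seq("hostB", "hostB"))
```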
[GitHub] [spark] amaliujia commented on a diff in pull request #40466: [SPARK-42835][SQL][TESTS] Add test cases for `Column.explain`
amaliujia commented on code in PR #40466: URL: https://github.com/apache/spark/pull/40466#discussion_r1142414407

## sql/core/src/test/scala/org/apache/spark/sql/ColumnExpressionSuite.scala:
@@ -921,6 +922,132 @@ class ColumnExpressionSuite extends QueryTest with SharedSparkSession {
     }
   }

+  private def captureStdOut(block: => Unit): String = {
+    val capturedOut = new ByteArrayOutputStream()
+    Console.withOut(capturedOut)(block)
+    capturedOut.toString()
+  }
+
+  test("explain") {

Review Comment:
+1. If we still want to be aware when it changes, golden files are a better way.

-- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
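A minimal sketch of the golden-file alternative discussed above (the file path and surrounding setup are hypothetical; `captureStdOut` is the helper shown in the diff) could look like this:

```scala
import java.nio.file.{Files, Paths}

// Compare the captured explain output against a checked-in expected file
// instead of asserting on fragments of the output.
val actual = captureStdOut(testData("key").explain(extended = true))
val goldenFile = Paths.get("src/test/resources/column-explain-extended.txt")
val expected = new String(Files.readAllBytes(goldenFile), "UTF-8")
assert(actual.trim == expected.trim)
```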