[GitHub] [spark] MaxGekk closed pull request #39332: [WIP][SPARK-40822][SQL] Stable derived column aliases

2023-03-20 Thread via GitHub


MaxGekk closed pull request #39332: [WIP][SPARK-40822][SQL] Stable derived 
column aliases
URL: https://github.com/apache/spark/pull/39332





[GitHub] [spark] MaxGekk closed pull request #40126: [SPARK-40822][SQL] Stable derived column aliases

2023-03-20 Thread via GitHub


MaxGekk closed pull request #40126: [SPARK-40822][SQL] Stable derived column 
aliases
URL: https://github.com/apache/spark/pull/40126





[GitHub] [spark] MaxGekk commented on pull request #40126: [SPARK-40822][SQL] Stable derived column aliases

2023-03-20 Thread via GitHub


MaxGekk commented on PR #40126:
URL: https://github.com/apache/spark/pull/40126#issuecomment-1477333100

   Merging to master. Thank you, @srielau and @cloud-fan for review.





[GitHub] [spark] amaliujia commented on a diff in pull request #40498: [SPARK-42878][CONNECT] The table API in DataFrameReader could also accept options

2023-03-20 Thread via GitHub


amaliujia commented on code in PR #40498:
URL: https://github.com/apache/spark/pull/40498#discussion_r1142929343


##
connector/connect/common/src/main/protobuf/spark/connect/relations.proto:
##
@@ -148,6 +143,13 @@ message Read {
 // This is only supported by the JDBC data source.
 repeated string predicates = 5;
   }
+
+  // Options for data sources and named table.
+  //
+  // When using for data sources, the context of this map varies based on the
+  // data source format. This options could be empty for valid data source 
format.
+  // The map key is case insensitive.
+  map<string, string> options = 3;

Review Comment:
   Chose to use a non-breaking way to change the proto.
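
   For illustration, a hedged usage sketch of what this options support enables on
the DataFrameReader side (the option name and table name below are illustrative
only, not from this PR):

   ```
   // With SPARK-42878, options set on DataFrameReader should be forwarded to
   // named-table reads as well, not only to path-based data sources.
   val df = spark.read
     .option("mergeSchema", "true")   // illustrative option name
     .table("my_db.my_table")         // illustrative table name
   ```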






[GitHub] [spark] amaliujia commented on pull request #40498: [SPARK-42878][CONNECT] The table API in DataFrameReader could also accept options

2023-03-20 Thread via GitHub


amaliujia commented on PR #40498:
URL: https://github.com/apache/spark/pull/40498#issuecomment-1477332325

   @hvanhovell 





[GitHub] [spark] amaliujia commented on a diff in pull request #40498: [SPARK-42878][CONNECT] The table API in DataFrameReader could also accept options

2023-03-20 Thread via GitHub


amaliujia commented on code in PR #40498:
URL: https://github.com/apache/spark/pull/40498#discussion_r1142929171


##
connector/connect/client/jvm/src/main/scala/org/apache/spark/sql/DataFrameReader.scala:
##
@@ -183,7 +183,7 @@ class DataFrameReader private[sql] (sparkSession: 
SparkSession) extends Logging
   dataSourceBuilder.setFormat(source)
   userSpecifiedSchema.foreach(schema => 
dataSourceBuilder.setSchema(schema.toDDL))
   extraOptions.foreach { case (k, v) =>
-dataSourceBuilder.putOptions(k, v)
+builder.getReadBuilder.putOptions(k, v)

Review Comment:
   I found I can only add a meaningful test on the server side. On the client side
there is no way to verify that an option has been passed through.
   
   In the existing codebase this is testable because we can call
`df.queryExecution.analyzed`, get the complete plan, and then fetch the
options and verify them.
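
   For illustration, a hedged sketch (not this PR's actual test) of the classic-API
check described above, assuming the table resolves to a DataSourceV2Relation and
using illustrative option/table names:

   ```
   import org.apache.spark.sql.execution.datasources.v2.DataSourceV2Relation

   val df = spark.read.option("someOption", "someValue").table("some_table")
   // In the non-Connect codebase the resolved plan can be inspected directly:
   val seenOptions = df.queryExecution.analyzed.collect {
     case r: DataSourceV2Relation => r.options
   }
   ```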






[GitHub] [spark] yliou opened a new pull request, #40503: [SPARK-42830] [UI] Link skipped stages on Spark UI

2023-03-20 Thread via GitHub


yliou opened a new pull request, #40503:
URL: https://github.com/apache/spark/pull/40503

   
   
   ### What changes were proposed in this pull request?
   Adds text on the UI that shows which executed stage a skipped stage refers to.
   
   
   
   ### Why are the changes needed?
   Helps with finding execution details, in particular figuring out which stages
from earlier jobs feed into a later stage in a specific job.
   
   
   
   ### Does this PR introduce _any_ user-facing change?
   Yes, the jobs page can look like the following, where the skipped stage is
now clickable and redirects to the stage that was actually executed:
   https://user-images.githubusercontent.com/16739760/226529417-a2384904-7e46-48f6-bcd5-c708416e6353.png
   
   
   
   ### How was this patch tested?
   Manually tested locally on the UI.
   
   





[GitHub] [spark] pan3793 commented on pull request #40444: [SPARK-42813][K8S] Print application info when waitAppCompletion is false

2023-03-20 Thread via GitHub


pan3793 commented on PR #40444:
URL: https://github.com/apache/spark/pull/40444#issuecomment-1477316443

   @dongjoon-hyun thank you, I updated the log message to
   ```
   "Application $appName with application ID $appId and submission ID $sId 
finished"
   ```





[GitHub] [spark] yliou opened a new pull request, #40502: [SPARK-42829] [UI] add repeat identifier to cached RDD on stage page

2023-03-20 Thread via GitHub


yliou opened a new pull request, #40502:
URL: https://github.com/apache/spark/pull/40502

   
   
   ### What changes were proposed in this pull request?
   Adds a "Repeat Identifier:" label to the cached RDD node on the Stages page. The
"Repeat Identifier:" text is bolded so that it's easier to distinguish from the
rest of the text.
   
   
   
   ### Why are the changes needed?
   Currently there is no way to distinguish which cached RDD is being executed 
in a particular stage. This aims to fix that.
   
   
   
   ### Does this PR introduce _any_ user-facing change?
   Yes, on the Stages page in the UI when there is a cached RDD. One example is
   https://user-images.githubusercontent.com/16739760/226527044-4c0719f4-aaf8-4e99-8bce-fd033106379c.png
   
   
   
   ### How was this patch tested?
   Manually tested locally in the SQL UI.
   
   





[GitHub] [spark] zhengruifeng closed pull request #40501: [SPARK-42864][ML] Make IsotonicRegression.PointsAccumulator private

2023-03-20 Thread via GitHub


zhengruifeng closed pull request #40501: [SPARK-42864][ML] Make 
IsotonicRegression.PointsAccumulator private
URL: https://github.com/apache/spark/pull/40501





[GitHub] [spark] xinrong-meng commented on pull request #40500: [SPARK-42864][ML][3.4] Make `IsotonicRegression.PointsAccumulator` private

2023-03-20 Thread via GitHub


xinrong-meng commented on PR #40500:
URL: https://github.com/apache/spark/pull/40500#issuecomment-1477308745

   Merged to branch-3.4, thanks!





[GitHub] [spark] xinrong-meng closed pull request #40500: [SPARK-42864][ML][3.4] Make `IsotonicRegression.PointsAccumulator` private

2023-03-20 Thread via GitHub


xinrong-meng closed pull request #40500: [SPARK-42864][ML][3.4] Make 
`IsotonicRegression.PointsAccumulator` private
URL: https://github.com/apache/spark/pull/40500





[GitHub] [spark] zhengruifeng commented on pull request #40500: [SPARK-42864][ML][3.4] Make `IsotonicRegression.PointsAccumulator` private

2023-03-20 Thread via GitHub


zhengruifeng commented on PR #40500:
URL: https://github.com/apache/spark/pull/40500#issuecomment-1477304948

   @xinrong-meng  it seems this PR was already merged? 





[GitHub] [spark] ulysses-you commented on a diff in pull request #40446: [SPARK-42815][SQL] Subexpression elimination support shortcut expression

2023-03-20 Thread via GitHub


ulysses-you commented on code in PR #40446:
URL: https://github.com/apache/spark/pull/40446#discussion_r1142906959


##
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/EquivalentExpressions.scala:
##
@@ -125,13 +128,27 @@ class EquivalentExpressions {
 }
   }
 
+  private def skipForShortcut(expr: Expression): Expression = {
+if (skipForShortcutEnable) {
+  // The subexpression may not need to eval even if it appears more than 
once.
+  // e.g., `if(or(a, and(b, b)))`, the expression `b` would be skipped if 
`a` is true.
+  expr match {
+case and: And => skipForShortcut(and.left)

Review Comment:
   Yeah, that is the case, but on second thought `updateExprTree` has the side
effect of calling `updateExprInMap` during recursion. If we decide to skip for
shortcut, would it be better to return the final valid expression in one shot?
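
   For context, a minimal sketch (assumed, not the exact PR code) of the shortcut
rule being discussed: under short-circuit evaluation only the left child of
`And`/`Or` is guaranteed to be evaluated, so only that side is walked when
collecting common subexpressions.

   ```
   import org.apache.spark.sql.catalyst.expressions.{And, Expression, Or}

   def skipForShortcut(expr: Expression): Expression = expr match {
     case and: And => skipForShortcut(and.left)
     case or: Or   => skipForShortcut(or.left)
     case _        => expr
   }
   ```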








[GitHub] [spark] sudoliyang commented on pull request #40494: [MINOR][DOCS] Fix typos

2023-03-20 Thread via GitHub


sudoliyang commented on PR #40494:
URL: https://github.com/apache/spark/pull/40494#issuecomment-1477300074

   I enabled GitHub Actions on my forked repo and rebased onto `apache/spark` master.
   Can anyone re-run the workflows?





[GitHub] [spark] rednaxelafx commented on a diff in pull request #40488: [SPARK-42851][SQL] Replace EquivalentExpressions with mutable map in PhysicalAggregation

2023-03-20 Thread via GitHub


rednaxelafx commented on code in PR #40488:
URL: https://github.com/apache/spark/pull/40488#discussion_r1142905889


##
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/planning/patterns.scala:
##
@@ -296,12 +298,17 @@ object PhysicalAggregation {
   // build a set of semantically distinct aggregate expressions and 
re-write expressions so
   // that they reference the single copy of the aggregate function which 
actually gets computed.
   // Non-deterministic aggregate expressions are not deduplicated.
-  val equivalentAggregateExpressions = new EquivalentExpressions
+  val equivalentAggregateExpressions = mutable.Map.empty[Expression, 
Expression]
   val aggregateExpressions = resultExpressions.flatMap { expr =>
 expr.collect {
-  // addExpr() always returns false for non-deterministic expressions 
and do not add them.
   case a
-if AggregateExpression.isAggregate(a) && 
!equivalentAggregateExpressions.addExpr(a) =>

Review Comment:
   The line of thought would be: adding the `supportedExpression` guard to 
`addExpr()` would cause performance regression, so let's just close our eyes 
and make the only remaining use of `addExpr` break away and do its own 
deduplication in the old logic without taking things like `NamedLambdaVariable` 
into account -- which is the way it's been for quite a few releases. This PR 
essentially inlines the `addExpr` path of the old `EquivalentExpressions` into 
`PhysicalAggregation` to recover what it used to do.
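
   For illustration, a minimal sketch (not the exact PR code) of what inlining the
`addExpr` path into `PhysicalAggregation` amounts to: deterministic aggregates are
deduplicated by their canonicalized form in a plain mutable map, while
non-deterministic ones are never deduplicated.

   ```
   import scala.collection.mutable
   import org.apache.spark.sql.catalyst.expressions.Expression
   import org.apache.spark.sql.catalyst.expressions.aggregate.AggregateExpression

   def dedupAggregates(resultExpressions: Seq[Expression]): Seq[Expression] = {
     val seen = mutable.Map.empty[Expression, Expression]
     resultExpressions.flatMap { expr =>
       expr.collect {
         // keep only the first occurrence of each canonicalized aggregate
         case a: AggregateExpression
             if !a.deterministic || seen.put(a.canonicalized, a).isEmpty => a
       }
     }
   }
   ```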






[GitHub] [spark] pan3793 commented on pull request #40444: [SPARK-42813][K8S] Print application info when waitAppCompletion is false

2023-03-20 Thread via GitHub


pan3793 commented on PR #40444:
URL: https://github.com/apache/spark/pull/40444#issuecomment-1477299823

   Kindly ping @dongjoon-hyun





[GitHub] [spark] huaxingao commented on a diff in pull request #40495: test for reading footer within file range

2023-03-20 Thread via GitHub


huaxingao commented on code in PR #40495:
URL: https://github.com/apache/spark/pull/40495#discussion_r1142903084


##
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/parquet/ParquetPartitionReaderFactory.scala:
##
@@ -92,8 +93,13 @@ case class ParquetPartitionReaderFactory(
 if (aggregation.isEmpty) {
   ParquetFooterReader.readFooter(conf, filePath, SKIP_ROW_GROUPS)
 } else {
+  val split = new FileSplit(file.toPath, file.start, file.length, 
Array.empty[String])
   // For aggregate push down, we will get max/min/count from footer 
statistics.
-  ParquetFooterReader.readFooter(conf, filePath, NO_FILTER)

Review Comment:
   @sunchao What do you think?






[GitHub] [spark] huaxingao commented on a diff in pull request #40495: test for reading footer within file range

2023-03-20 Thread via GitHub


huaxingao commented on code in PR #40495:
URL: https://github.com/apache/spark/pull/40495#discussion_r1142902856


##
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/parquet/ParquetPartitionReaderFactory.scala:
##
@@ -92,8 +93,13 @@ case class ParquetPartitionReaderFactory(
 if (aggregation.isEmpty) {
   ParquetFooterReader.readFooter(conf, filePath, SKIP_ROW_GROUPS)
 } else {
+  val split = new FileSplit(file.toPath, file.start, file.length, 
Array.empty[String])
   // For aggregate push down, we will get max/min/count from footer 
statistics.
-  ParquetFooterReader.readFooter(conf, filePath, NO_FILTER)

Review Comment:
   @yabola Thanks for the ping. I was assuming that we always need to read the 
whole `PartitionedFile`, i.e. the start is always 0 and the length is always 
the file length. Maybe it's safer to add the range.
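
   For reference, a hedged sketch of reading a footer restricted to a file range,
assuming parquet-mr's `HadoopReadOptions`/`ParquetFileReader` APIs (not necessarily
what this PR will end up doing); with a range set, only the row groups whose
midpoints fall inside `[start, start + length)` should be kept:

   ```
   import org.apache.hadoop.conf.Configuration
   import org.apache.hadoop.fs.Path
   import org.apache.parquet.HadoopReadOptions
   import org.apache.parquet.hadoop.ParquetFileReader
   import org.apache.parquet.hadoop.util.HadoopInputFile

   def readFooterInRange(conf: Configuration, path: Path, start: Long, length: Long) = {
     val options = HadoopReadOptions.builder(conf, path)
       .withRange(start, start + length)   // limit footer row groups to this split
       .build()
     val reader = ParquetFileReader.open(HadoopInputFile.fromPath(path, conf), options)
     try reader.getFooter finally reader.close()
   }
   ```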






[GitHub] [spark] dtenedor commented on a diff in pull request #40496: [SPARK-42874][SQL] Enable new golden file test framework for analysis for all input files

2023-03-20 Thread via GitHub


dtenedor commented on code in PR #40496:
URL: https://github.com/apache/spark/pull/40496#discussion_r1142891754


##
sql/core/src/test/resources/sql-tests/analyzer-results/ansi/cast.sql.out:
##
@@ -0,0 +1,881 @@
+-- Automatically generated by SQLQueryTestSuite
+-- !query
+SELECT CAST('1.23' AS int)
+-- !query analysis
+Project [cast(1.23 as int) AS CAST(1.23 AS INT)#x]
++- OneRowRelation
+
+
+-- !query
+SELECT CAST('1.23' AS long)
+-- !query analysis
+Project [cast(1.23 as bigint) AS CAST(1.23 AS BIGINT)#xL]
++- OneRowRelation
+
+
+-- !query
+SELECT CAST('-4.56' AS int)
+-- !query analysis
+Project [cast(-4.56 as int) AS CAST(-4.56 AS INT)#x]
++- OneRowRelation
+
+
+-- !query
+SELECT CAST('-4.56' AS long)
+-- !query analysis
+Project [cast(-4.56 as bigint) AS CAST(-4.56 AS BIGINT)#xL]
++- OneRowRelation
+
+
+-- !query
+SELECT CAST('abc' AS int)
+-- !query analysis
+Project [cast(abc as int) AS CAST(abc AS INT)#x]
++- OneRowRelation
+
+
+-- !query
+SELECT CAST('abc' AS long)
+-- !query analysis
+Project [cast(abc as bigint) AS CAST(abc AS BIGINT)#xL]
++- OneRowRelation
+
+
+-- !query
+SELECT CAST('abc' AS float)
+-- !query analysis
+Project [cast(abc as float) AS CAST(abc AS FLOAT)#x]
++- OneRowRelation
+
+
+-- !query
+SELECT CAST('abc' AS double)
+-- !query analysis
+Project [cast(abc as double) AS CAST(abc AS DOUBLE)#x]
++- OneRowRelation
+
+
+-- !query
+SELECT CAST('1234567890123' AS int)
+-- !query analysis
+Project [cast(1234567890123 as int) AS CAST(1234567890123 AS INT)#x]
++- OneRowRelation
+
+
+-- !query
+SELECT CAST('12345678901234567890123' AS long)
+-- !query analysis
+Project [cast(12345678901234567890123 as bigint) AS 
CAST(12345678901234567890123 AS BIGINT)#xL]
++- OneRowRelation
+
+
+-- !query
+SELECT CAST('' AS int)
+-- !query analysis
+Project [cast( as int) AS CAST( AS INT)#x]
++- OneRowRelation
+
+
+-- !query
+SELECT CAST('' AS long)
+-- !query analysis
+Project [cast( as bigint) AS CAST( AS BIGINT)#xL]
++- OneRowRelation
+
+
+-- !query
+SELECT CAST('' AS float)
+-- !query analysis
+Project [cast( as float) AS CAST( AS FLOAT)#x]
++- OneRowRelation
+
+
+-- !query
+SELECT CAST('' AS double)
+-- !query analysis
+Project [cast( as double) AS CAST( AS DOUBLE)#x]
++- OneRowRelation
+
+
+-- !query
+SELECT CAST(NULL AS int)
+-- !query analysis
+Project [cast(null as int) AS CAST(NULL AS INT)#x]
++- OneRowRelation
+
+
+-- !query
+SELECT CAST(NULL AS long)
+-- !query analysis
+Project [cast(null as bigint) AS CAST(NULL AS BIGINT)#xL]
++- OneRowRelation
+
+
+-- !query
+SELECT CAST('123.a' AS int)
+-- !query analysis
+Project [cast(123.a as int) AS CAST(123.a AS INT)#x]
++- OneRowRelation
+
+
+-- !query
+SELECT CAST('123.a' AS long)
+-- !query analysis
+Project [cast(123.a as bigint) AS CAST(123.a AS BIGINT)#xL]
++- OneRowRelation
+
+
+-- !query
+SELECT CAST('123.a' AS float)
+-- !query analysis
+Project [cast(123.a as float) AS CAST(123.a AS FLOAT)#x]
++- OneRowRelation
+
+
+-- !query
+SELECT CAST('123.a' AS double)
+-- !query analysis
+Project [cast(123.a as double) AS CAST(123.a AS DOUBLE)#x]
++- OneRowRelation
+
+
+-- !query
+SELECT CAST('-2147483648' AS int)
+-- !query analysis
+Project [cast(-2147483648 as int) AS CAST(-2147483648 AS INT)#x]
++- OneRowRelation
+
+
+-- !query
+SELECT CAST('-2147483649' AS int)
+-- !query analysis
+Project [cast(-2147483649 as int) AS CAST(-2147483649 AS INT)#x]
++- OneRowRelation
+
+
+-- !query
+SELECT CAST('2147483647' AS int)
+-- !query analysis
+Project [cast(2147483647 as int) AS CAST(2147483647 AS INT)#x]
++- OneRowRelation
+
+
+-- !query
+SELECT CAST('2147483648' AS int)
+-- !query analysis
+Project [cast(2147483648 as int) AS CAST(2147483648 AS INT)#x]
++- OneRowRelation
+
+
+-- !query
+SELECT CAST('-9223372036854775808' AS long)
+-- !query analysis
+Project [cast(-9223372036854775808 as bigint) AS CAST(-9223372036854775808 AS 
BIGINT)#xL]
++- OneRowRelation
+
+
+-- !query
+SELECT CAST('-9223372036854775809' AS long)
+-- !query analysis
+Project [cast(-9223372036854775809 as bigint) AS CAST(-9223372036854775809 AS 
BIGINT)#xL]
++- OneRowRelation
+
+
+-- !query
+SELECT CAST('9223372036854775807' AS long)
+-- !query analysis
+Project [cast(9223372036854775807 as bigint) AS CAST(9223372036854775807 AS 
BIGINT)#xL]
++- OneRowRelation
+
+
+-- !query
+SELECT CAST('9223372036854775808' AS long)
+-- !query analysis
+Project [cast(9223372036854775808 as bigint) AS CAST(9223372036854775808 AS 
BIGINT)#xL]
++- OneRowRelation
+
+
+-- !query
+SELECT HEX(CAST('abc' AS binary))
+-- !query analysis
+Project [hex(cast(abc as binary)) AS hex(CAST(abc AS BINARY))#x]
++- OneRowRelation
+
+
+-- !query
+SELECT HEX(CAST(CAST(123 AS byte) AS binary))
+-- !query analysis
+org.apache.spark.sql.AnalysisException
+{
+  "errorClass" : "DATATYPE_MISMATCH.CAST_WITH_CONF_SUGGESTION",
+  "sqlState" : "42K09",
+  "messageParameters" : {
+"config" : "\"spark.sql.ansi.enabled\"",
+"configVal" : "'false'",
+"sqlExpr" : "\"CAST(CAST(123 AS TINYINT) AS BINARY)\""

[GitHub] [spark] LuciferYang commented on a diff in pull request #40496: [SPARK-42874][SQL] Enable new golden file test framework for analysis for all input files

2023-03-20 Thread via GitHub


LuciferYang commented on code in PR #40496:
URL: https://github.com/apache/spark/pull/40496#discussion_r1142891030


##
sql/core/src/test/resources/sql-tests/analyzer-results/ansi/cast.sql.out:
##
@@ -0,0 +1,881 @@
+-- Automatically generated by SQLQueryTestSuite
+-- !query
+SELECT CAST('1.23' AS int)
+-- !query analysis
+Project [cast(1.23 as int) AS CAST(1.23 AS INT)#x]
++- OneRowRelation
+
+
+-- !query
+SELECT CAST('1.23' AS long)
+-- !query analysis
+Project [cast(1.23 as bigint) AS CAST(1.23 AS BIGINT)#xL]
++- OneRowRelation
+
+
+-- !query
+SELECT CAST('-4.56' AS int)
+-- !query analysis
+Project [cast(-4.56 as int) AS CAST(-4.56 AS INT)#x]
++- OneRowRelation
+
+
+-- !query
+SELECT CAST('-4.56' AS long)
+-- !query analysis
+Project [cast(-4.56 as bigint) AS CAST(-4.56 AS BIGINT)#xL]
++- OneRowRelation
+
+
+-- !query
+SELECT CAST('abc' AS int)
+-- !query analysis
+Project [cast(abc as int) AS CAST(abc AS INT)#x]
++- OneRowRelation
+
+
+-- !query
+SELECT CAST('abc' AS long)
+-- !query analysis
+Project [cast(abc as bigint) AS CAST(abc AS BIGINT)#xL]
++- OneRowRelation
+
+
+-- !query
+SELECT CAST('abc' AS float)
+-- !query analysis
+Project [cast(abc as float) AS CAST(abc AS FLOAT)#x]
++- OneRowRelation
+
+
+-- !query
+SELECT CAST('abc' AS double)
+-- !query analysis
+Project [cast(abc as double) AS CAST(abc AS DOUBLE)#x]
++- OneRowRelation
+
+
+-- !query
+SELECT CAST('1234567890123' AS int)
+-- !query analysis
+Project [cast(1234567890123 as int) AS CAST(1234567890123 AS INT)#x]
++- OneRowRelation
+
+
+-- !query
+SELECT CAST('12345678901234567890123' AS long)
+-- !query analysis
+Project [cast(12345678901234567890123 as bigint) AS 
CAST(12345678901234567890123 AS BIGINT)#xL]
++- OneRowRelation
+
+
+-- !query
+SELECT CAST('' AS int)
+-- !query analysis
+Project [cast( as int) AS CAST( AS INT)#x]
++- OneRowRelation
+
+
+-- !query
+SELECT CAST('' AS long)
+-- !query analysis
+Project [cast( as bigint) AS CAST( AS BIGINT)#xL]
++- OneRowRelation
+
+
+-- !query
+SELECT CAST('' AS float)
+-- !query analysis
+Project [cast( as float) AS CAST( AS FLOAT)#x]
++- OneRowRelation
+
+
+-- !query
+SELECT CAST('' AS double)
+-- !query analysis
+Project [cast( as double) AS CAST( AS DOUBLE)#x]
++- OneRowRelation
+
+
+-- !query
+SELECT CAST(NULL AS int)
+-- !query analysis
+Project [cast(null as int) AS CAST(NULL AS INT)#x]
++- OneRowRelation
+
+
+-- !query
+SELECT CAST(NULL AS long)
+-- !query analysis
+Project [cast(null as bigint) AS CAST(NULL AS BIGINT)#xL]
++- OneRowRelation
+
+
+-- !query
+SELECT CAST('123.a' AS int)
+-- !query analysis
+Project [cast(123.a as int) AS CAST(123.a AS INT)#x]
++- OneRowRelation
+
+
+-- !query
+SELECT CAST('123.a' AS long)
+-- !query analysis
+Project [cast(123.a as bigint) AS CAST(123.a AS BIGINT)#xL]
++- OneRowRelation
+
+
+-- !query
+SELECT CAST('123.a' AS float)
+-- !query analysis
+Project [cast(123.a as float) AS CAST(123.a AS FLOAT)#x]
++- OneRowRelation
+
+
+-- !query
+SELECT CAST('123.a' AS double)
+-- !query analysis
+Project [cast(123.a as double) AS CAST(123.a AS DOUBLE)#x]
++- OneRowRelation
+
+
+-- !query
+SELECT CAST('-2147483648' AS int)
+-- !query analysis
+Project [cast(-2147483648 as int) AS CAST(-2147483648 AS INT)#x]
++- OneRowRelation
+
+
+-- !query
+SELECT CAST('-2147483649' AS int)
+-- !query analysis
+Project [cast(-2147483649 as int) AS CAST(-2147483649 AS INT)#x]
++- OneRowRelation
+
+
+-- !query
+SELECT CAST('2147483647' AS int)
+-- !query analysis
+Project [cast(2147483647 as int) AS CAST(2147483647 AS INT)#x]
++- OneRowRelation
+
+
+-- !query
+SELECT CAST('2147483648' AS int)
+-- !query analysis
+Project [cast(2147483648 as int) AS CAST(2147483648 AS INT)#x]
++- OneRowRelation
+
+
+-- !query
+SELECT CAST('-9223372036854775808' AS long)
+-- !query analysis
+Project [cast(-9223372036854775808 as bigint) AS CAST(-9223372036854775808 AS 
BIGINT)#xL]
++- OneRowRelation
+
+
+-- !query
+SELECT CAST('-9223372036854775809' AS long)
+-- !query analysis
+Project [cast(-9223372036854775809 as bigint) AS CAST(-9223372036854775809 AS 
BIGINT)#xL]
++- OneRowRelation
+
+
+-- !query
+SELECT CAST('9223372036854775807' AS long)
+-- !query analysis
+Project [cast(9223372036854775807 as bigint) AS CAST(9223372036854775807 AS 
BIGINT)#xL]
++- OneRowRelation
+
+
+-- !query
+SELECT CAST('9223372036854775808' AS long)
+-- !query analysis
+Project [cast(9223372036854775808 as bigint) AS CAST(9223372036854775808 AS 
BIGINT)#xL]
++- OneRowRelation
+
+
+-- !query
+SELECT HEX(CAST('abc' AS binary))
+-- !query analysis
+Project [hex(cast(abc as binary)) AS hex(CAST(abc AS BINARY))#x]
++- OneRowRelation
+
+
+-- !query
+SELECT HEX(CAST(CAST(123 AS byte) AS binary))
+-- !query analysis
+org.apache.spark.sql.AnalysisException
+{
+  "errorClass" : "DATATYPE_MISMATCH.CAST_WITH_CONF_SUGGESTION",
+  "sqlState" : "42K09",
+  "messageParameters" : {
+"config" : "\"spark.sql.ansi.enabled\"",
+"configVal" : "'false'",
+"sqlExpr" : "\"CAST(CAST(123 AS TINYINT) AS BINARY)

[GitHub] [spark] zhengruifeng opened a new pull request, #40501: [SPARK-42864][ML] Make IsotonicRegression.PointsAccumulator private

2023-03-20 Thread via GitHub


zhengruifeng opened a new pull request, #40501:
URL: https://github.com/apache/spark/pull/40501

   ### What changes were proposed in this pull request?
   Make `IsotonicRegression.PointsAccumulator` private, which was introduced in 
https://github.com/apache/spark/commit/3d05c7e037eff79de8ef9f6231aca8340bcc65ef
   
   
   ### Why are the changes needed?
   `PointsAccumulator` is an implementation detail and should not be exposed
   
   
   ### Does this PR introduce _any_ user-facing change?
   no
   
   
   ### How was this patch tested?
   existing UT





[GitHub] [spark] xinrong-meng commented on pull request #40500: [SPARK-42864][ML][3.4] Make `IsotonicRegression.PointsAccumulator` private

2023-03-20 Thread via GitHub


xinrong-meng commented on PR #40500:
URL: https://github.com/apache/spark/pull/40500#issuecomment-1477276287

   LGTM, thank you @zhengruifeng !





[GitHub] [spark] amaliujia commented on pull request #40485: [SPARK-42870][CONNECT] Move `toCatalystValue` to `connect-common`

2023-03-20 Thread via GitHub


amaliujia commented on PR #40485:
URL: https://github.com/apache/spark/pull/40485#issuecomment-1477271752

   late LGTM





[GitHub] [spark] ueshin commented on pull request #40402: [SPARK-42020][CONNECT][PYTHON] Support UserDefinedType in Spark Connect

2023-03-20 Thread via GitHub


ueshin commented on PR #40402:
URL: https://github.com/apache/spark/pull/40402#issuecomment-1477260665

   @zhengruifeng ah, seems like something is wrong when the schema is a column 
name list.
   Could you use `StructType` to specify the schema as a workaround?
   I'll take a look later.





[GitHub] [spark] srowen commented on pull request #40263: [SPARK-42659][ML] Reimplement `FPGrowthModel.transform` with dataframe operations

2023-03-20 Thread via GitHub


srowen commented on PR #40263:
URL: https://github.com/apache/spark/pull/40263#issuecomment-1477246959

   I don't know enough to say whether it's worth a new method. Can we start 
with the change that needs no new API, is it a big enough win?





[GitHub] [spark] mskapilks commented on pull request #40266: [SPARK-42660][SQL] Infer filters for Join produced by IN and EXISTS clause (RewritePredicateSubquery rule)

2023-03-20 Thread via GitHub


mskapilks commented on PR #40266:
URL: https://github.com/apache/spark/pull/40266#issuecomment-1477245713

   > Looks like there are a few failures after moving the rule 
([22e7886](https://github.com/apache/spark/commit/22e7886ff1059b98d1525380b2cb22718fd5dd09)).
 @mskapilks, do you think you can look into those failures?
   
   Yup, I am working on them. I had the wrong SPARK_HOME set up, so I missed the
plan changes.





[GitHub] [spark] zhengruifeng commented on a diff in pull request #40500: [WIP][ML][3.4] Make `IsotonicRegression.PointsAccumulator` private

2023-03-20 Thread via GitHub


zhengruifeng commented on code in PR #40500:
URL: https://github.com/apache/spark/pull/40500#discussion_r1142863924


##
mllib/src/main/scala/org/apache/spark/mllib/regression/IsotonicRegression.scala:
##
@@ -490,15 +490,13 @@ class IsotonicRegression private (private var isotonic: 
Boolean) extends Seriali
   .sortBy(_._2)
 poolAdjacentViolators(parallelStepResult)
   }
-}
 
-object IsotonicRegression {
   /**
* Utility class, holds a buffer of all points with unique features so far, 
and performs
* weighted sum accumulation of points. Hides these details for better 
readability of the
* main algorithm.
*/
-  class PointsAccumulator {
+  private class PointsAccumulator {

Review Comment:
   right now it is only used in `IsotonicRegression`






[GitHub] [spark] WeichenXu123 commented on a diff in pull request #40500: [WIP][ML][3.4] Make `IsotonicRegression.PointsAccumulator` private

2023-03-20 Thread via GitHub


WeichenXu123 commented on code in PR #40500:
URL: https://github.com/apache/spark/pull/40500#discussion_r1142863633


##
mllib/src/main/scala/org/apache/spark/mllib/regression/IsotonicRegression.scala:
##
@@ -490,15 +490,13 @@ class IsotonicRegression private (private var isotonic: 
Boolean) extends Seriali
   .sortBy(_._2)
 poolAdjacentViolators(parallelStepResult)
   }
-}
 
-object IsotonicRegression {
   /**
* Utility class, holds a buffer of all points with unique features so far, 
and performs
* weighted sum accumulation of points. Hides these details for better 
readability of the
* main algorithm.
*/
-  class PointsAccumulator {
+  private class PointsAccumulator {

Review Comment:
   private[ml] ?






[GitHub] [spark] itholic closed pull request #40270: [WIP][SPARK-42662][CONNECT][PYTHON][PS] Support `withSequenceColumn` as PySpark DataFrame internal function.

2023-03-20 Thread via GitHub


itholic closed pull request #40270: [WIP][SPARK-42662][CONNECT][PYTHON][PS] 
Support `withSequenceColumn` as PySpark DataFrame internal function.
URL: https://github.com/apache/spark/pull/40270





[GitHub] [spark] itholic commented on pull request #40270: [WIP][SPARK-42662][CONNECT][PYTHON][PS] Support `withSequenceColumn` as PySpark DataFrame internal function.

2023-03-20 Thread via GitHub


itholic commented on PR #40270:
URL: https://github.com/apache/spark/pull/40270#issuecomment-1477238760

   As #40456 has been completed, I will resume this one. Let me close this PR
for convenience and open a new one.





[GitHub] [spark] lyy-pineapple commented on a diff in pull request #38171: [SPARK-9213] [SQL] Improve regular expression performance (via joni)

2023-03-20 Thread via GitHub


lyy-pineapple commented on code in PR #38171:
URL: https://github.com/apache/spark/pull/38171#discussion_r1142860063


##
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/regexpExpressionsJoni.scala:
##
@@ -0,0 +1,471 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.catalyst.expressions
+
+import java.util.Locale
+
+import scala.collection.JavaConverters._
+
+import org.apache.commons.text.StringEscapeUtils
+import org.jcodings.specific.UTF8Encoding
+import org.joni.{Option, Regex, Syntax}
+
+import org.apache.spark.sql.catalyst.InternalRow
+import org.apache.spark.sql.catalyst.expressions.codegen._
+import org.apache.spark.sql.catalyst.expressions.codegen.Block._
+import org.apache.spark.sql.catalyst.trees.TreePattern.{LIKE_FAMLIY, 
TreePattern}
+import org.apache.spark.sql.catalyst.util.StringUtils
+import org.apache.spark.sql.types._
+import org.apache.spark.unsafe.types.UTF8String
+
+
+abstract class StringRegexExpressionJoni extends BinaryExpression
+  with ImplicitCastInputTypes with NullIntolerant with Predicate {
+
+  def escape(v: Array[Byte]): Array[Byte]
+  def matches(regex: Regex, str: Array[Byte]): Boolean
+
+  override def dataType: DataType = BooleanType
+  override def inputTypes: Seq[DataType] = Seq(StringType, StringType)
+
+  // try cache foldable pattern
+  private lazy val cache: Regex = right match {
+case p: Expression if p.foldable =>
+  compile(p.eval().asInstanceOf[UTF8String].getBytes)
+case _ => null
+  }
+
+  protected def compile(pattern: Array[Byte]): Regex = if (pattern == null) {
+null
+  } else {
+// Let it raise exception if couldn't compile the regex string
+val escapedPattern = escape(pattern)
+new Regex(escapedPattern, 0, escapedPattern.length,
+  Option.NONE, UTF8Encoding.INSTANCE, Syntax.Java)
+  }
+
+  protected def pattern(pattern: Array[Byte]) = if (cache == null) 
compile(pattern) else cache
+
+  protected override def nullSafeEval(input1: Any, input2: Any): Any = {
+val regex = pattern(input2.asInstanceOf[UTF8String].getBytes)
+if(regex == null) {
+  null
+} else {
+  matches(regex, input1.asInstanceOf[UTF8String].getBytes)
+}
+  }
+
+  override def sql: String = s"${left.sql} 
${prettyName.toUpperCase(Locale.ROOT)} ${right.sql}"
+}
+
+// scalastyle:off line.contains.tab
+/**
+ * Simple RegEx pattern matching function
+ */
+@ExpressionDescription(
+  usage = "str _FUNC_ pattern[ ESCAPE escape] - Returns true if str matches 
`pattern` with " +
+"`escape`, null if any arguments are null, false otherwise.",
+  arguments = """
+Arguments:
+  * str - a string expression
+  * pattern - a string expression. The pattern is a string which is 
matched literally, with
+  exception to the following special symbols:
+  exception to the following special symbols:
+_ matches any one character in the input (similar to . in posix 
regular expressions)
+  % matches zero or more characters in the input (similar to .* in 
posix regular
+  expressions)
+  expressions)
+  Since Spark 2.0, string literals are unescaped in our SQL parser. 
For example, in order
+  to match "\abc", the pattern should be "\\abc".
+  When SQL config 'spark.sql.parser.escapedStringLiterals' is enabled, 
it fallbacks
+  to Spark 1.6 behavior regarding string literal parsing. For example, 
if the config is
+  enabled, the pattern to match "\abc" should be "\abc".
+  * escape - an character added since Spark 3.0. The default escape 
character is the '\'.
+  If an escape character precedes a special symbol or another escape 
character, the
+  following character is matched literally. It is invalid to escape 
any other character.
+  """,
+  examples = """
+Examples:
+  > SELECT _FUNC_('Spark', '_park');
+  true
+  > SET spark.sql.parser.escapedStringLiterals=true;
+  spark.sql.parser.escapedStringLiterals   true
+  > SELECT '%SystemDrive%\Users\John' _FUNC_ '\%SystemDrive\%\\Users%';
+  true
+  > SET spark.sql.parser.escapedStringLiterals=false;
+  spark.sql.parser.esc
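
   For context, a minimal, illustrative sketch of compiling and matching a joni
regex over UTF-8 bytes, using the same `Regex` constructor as the new expressions
above (pattern and input strings are illustrative only):

   ```
   import org.jcodings.specific.UTF8Encoding
   import org.joni.{Option => JOption, Regex, Syntax}

   val pattern = "Sp.rk".getBytes("UTF-8")
   val regex = new Regex(pattern, 0, pattern.length, JOption.NONE,
     UTF8Encoding.INSTANCE, Syntax.Java)

   val input = "Spark".getBytes("UTF-8")
   val matcher = regex.matcher(input)
   // search returns the match offset, or -1 if the pattern does not match anywhere
   val matched = matcher.search(0, input.length, JOption.DEFAULT) >= 0
   ```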

[GitHub] [spark] zhengruifeng opened a new pull request, #40500: [WIP][ML][3.4] Make `IsotonicRegression.PointsAccumulator` private

2023-03-20 Thread via GitHub


zhengruifeng opened a new pull request, #40500:
URL: https://github.com/apache/spark/pull/40500

   ### What changes were proposed in this pull request?
   Make `IsotonicRegression.PointsAccumulator` private
   
   
   ### Why are the changes needed?
   `PointsAccumulator` is an implementation detail and should not be exposed
   
   
   ### Does this PR introduce _any_ user-facing change?
   no
   
   
   ### How was this patch tested?
   existing UT
   





[GitHub] [spark] Stove-hust commented on a diff in pull request #40393: [SPARK-40082] Schedule mergeFinalize when push merge shuffleMapStage retry but no running tasks

2023-03-20 Thread via GitHub


Stove-hust commented on code in PR #40393:
URL: https://github.com/apache/spark/pull/40393#discussion_r1142853303


##
pom.xml:
##
@@ -114,7 +114,7 @@
 1.8
 ${java.version}
 ${java.version}
-<maven.version>3.8.7</maven.version>
+<maven.version>3.6.3</maven.version>

Review Comment:
   😂 Forgot to change it back after local compilation






[GitHub] [spark] yabola commented on a diff in pull request #39950: [SPARK-42388][SQL] Avoid parquet footer reads twice when no filters in vectorized reader

2023-03-20 Thread via GitHub


yabola commented on code in PR #39950:
URL: https://github.com/apache/spark/pull/39950#discussion_r1142848071


##
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFileFormat.scala:
##
@@ -205,11 +212,21 @@ class ParquetFileFormat
 
   val sharedConf = broadcastedHadoopConf.value.value
 
-  lazy val footerFileMetaData =
-ParquetFooterReader.readFooter(sharedConf, filePath, 
SKIP_ROW_GROUPS).getFileMetaData
+  val fileRange = HadoopReadOptions.builder(sharedConf, split.getPath)

Review Comment:
   Yes, before this I created and passed a `ParquetFileReader` because I wanted to
avoid one extra `file.newStream()` call (when there is no filter pushdown), but it
doesn't seem to make much sense. I have changed it to pass the footer here.
   Please take a look, thank you!






[GitHub] [spark] amaliujia commented on a diff in pull request #40499: [SPARK-42876][SQL] DataType's physicalDataType should be private[sql]

2023-03-20 Thread via GitHub


amaliujia commented on code in PR #40499:
URL: https://github.com/apache/spark/pull/40499#discussion_r1142847660


##
sql/catalyst/src/main/scala/org/apache/spark/sql/types/DataType.scala:
##
@@ -119,7 +119,7 @@ abstract class DataType extends AbstractDataType {
 
   override private[sql] def acceptsType(other: DataType): Boolean = 
sameType(other)
 
-  def physicalDataType: PhysicalDataType = UninitializedPhysicalType
+  private[sql] def physicalDataType: PhysicalDataType = 
UninitializedPhysicalType

Review Comment:
   done






[GitHub] [spark] zhengruifeng commented on pull request #40402: [SPARK-42020][CONNECT][PYTHON] Support UserDefinedType in Spark Connect

2023-03-20 Thread via GitHub


zhengruifeng commented on PR #40402:
URL: https://github.com/apache/spark/pull/40402#issuecomment-1477209035

   I saved a df with a UDT in vanilla PySpark, and then read it in the Python
client, and it works fine. So I guess something is wrong in `createDataFrame`.
   
   vanilla PySpark:
   ```
   In [1]: from pyspark.ml.linalg import Vectors
   
   In [2]: df = spark.createDataFrame([(1.0, 1.0, Vectors.dense(0.0, 5.0)), 
(0.0, 2.0, Vectors.dense(1.0, 2.0)), (1.0, 3.0, Vectors.dense(2.0,
  ...:...: 1.0)), (0.0, 4.0, Vectors.dense(3.0, 3.0)),], ["label", 
"weight", "features"],)
   
   In [3]: df.write.parquet("/tmp/tmp.pq")
   ```
   
   Python Client:
   ```
   In [6]: df = spark.read.parquet("/tmp/tmp.pq")
   
   In [7]: df.schema
   Out[7]: StructType([StructField('label', DoubleType(), True), 
StructField('weight', DoubleType(), True), StructField('features', VectorUDT(), 
True)])
   
   In [8]: df.collect()
   Out[8]: 
   [Row(label=0.0, weight=4.0, features=DenseVector([3.0, 3.0])),
Row(label=0.0, weight=2.0, features=DenseVector([1.0, 2.0])),
Row(label=1.0, weight=3.0, features=DenseVector([2.0, 1.0])),
Row(label=1.0, weight=1.0, features=DenseVector([0.0, 5.0]))]
   ```





[GitHub] [spark] yabola commented on a diff in pull request #40495: test for reading footer within file range

2023-03-20 Thread via GitHub


yabola commented on code in PR #40495:
URL: https://github.com/apache/spark/pull/40495#discussion_r1142841503


##
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/parquet/ParquetPartitionReaderFactory.scala:
##
@@ -92,8 +93,13 @@ case class ParquetPartitionReaderFactory(
 if (aggregation.isEmpty) {
   ParquetFooterReader.readFooter(conf, filePath, SKIP_ROW_GROUPS)
 } else {
+  val split = new FileSplit(file.toPath, file.start, file.length, 
Array.empty[String])
   // For aggregate push down, we will get max/min/count from footer 
statistics.
-  ParquetFooterReader.readFooter(conf, filePath, NO_FILTER)

Review Comment:
   @huaxingao Hi~ When reviewing the code, I had a point of doubt: why not take
the RowGroup metrics information for the current file range here? I feel like
there's going to be a problem here...
   Here is an example in `SpecificParquetRecordReaderBase`:
   
https://github.com/apache/spark/blob/61035129a354d0b31c66908106238b12b1f2f7b0/sql/core/src/main/java/org/apache/spark/sql/execution/datasources/parquet/SpecificParquetRecordReaderBase.java#L96-L102






[GitHub] [spark] zhengruifeng commented on pull request #40402: [SPARK-42020][CONNECT][PYTHON] Support UserDefinedType in Spark Connect

2023-03-20 Thread via GitHub


zhengruifeng commented on PR #40402:
URL: https://github.com/apache/spark/pull/40402#issuecomment-1477202593

   @ueshin it seems that `createDataFrame` always uses the underlying `sqlType`
rather than the UDT itself:
   
   ```
   In [1]: from pyspark.ml.linalg import Vectors
   
   In [2]: df = spark.createDataFrame([(1.0, 1.0, Vectors.dense(0.0, 5.0)), 
(0.0, 2.0, Vectors.dense(1.0, 2.0)), (1.0, 3.0, Vectors.dense(2.0,
  ...: 1.0)), (0.0, 4.0, Vectors.dense(3.0, 3.0)),], ["label", "weight", 
"features"],)
   
   In [3]: df.schema
   Out[3]: StructType([StructField('label', DoubleType(), True), 
StructField('weight', DoubleType(), True), StructField('features', 
StructType([StructField('type', ByteType(), False), StructField('size', 
IntegerType(), True), StructField('indices', ArrayType(IntegerType(), False), 
True), StructField('values', ArrayType(DoubleType(), False), True)]), True)])
   
   In [4]: df.collect()
   Out[4]: :>  (0 + 4) 
/ 4]
   [Row(label=1.0, weight=1.0, features=Row(type=1, size=None, indices=None, 
values=[0.0, 5.0])),
Row(label=0.0, weight=2.0, features=Row(type=1, size=None, indices=None, 
values=[1.0, 2.0])),
Row(label=1.0, weight=3.0, features=Row(type=1, size=None, indices=None, 
values=[2.0, 1.0])),
Row(label=0.0, weight=4.0, features=Row(type=1, size=None, indices=None, 
values=[3.0, 3.0]))]
   ```
   
   while in vanilla PySpark:
   ```
   In [1]: from pyspark.ml.linalg import Vectors

   In [2]: df = spark.createDataFrame([(1.0, 1.0, Vectors.dense(0.0, 5.0)), (0.0, 2.0, Vectors.dense(1.0, 2.0)), (1.0, 3.0, Vectors.dense(2.0, 1.0)), (0.0, 4.0, Vectors.dense(3.0, 3.0)),], ["label", "weight", "features"],)

   In [3]: df.schema
   Out[3]: StructType([StructField('label', DoubleType(), True), StructField('weight', DoubleType(), True), StructField('features', VectorUDT(), True)])

   In [4]: df.collect()
   Out[4]:
   [Row(label=1.0, weight=1.0, features=DenseVector([0.0, 5.0])),
    Row(label=0.0, weight=2.0, features=DenseVector([1.0, 2.0])),
    Row(label=1.0, weight=3.0, features=DenseVector([2.0, 1.0])),
    Row(label=0.0, weight=4.0, features=DenseVector([3.0, 3.0]))]
   ```
   
   
   also cc @WeichenXu123 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] LuciferYang commented on a diff in pull request #40408: [SPARK-42780][BUILD] Upgrade `Tink` to 1.8.0

2023-03-20 Thread via GitHub


LuciferYang commented on code in PR #40408:
URL: https://github.com/apache/spark/pull/40408#discussion_r1142839820


##
pom.xml:
##
@@ -214,7 +214,7 @@
 1.1.0
 1.5.0
 1.60
-1.7.0
+1.8.0

Review Comment:
   @bjornjorgensen We can exclude `androidx.annotation:annotation` from `tink`,
and then we should be able to remove the newly added maven repository. I have
tested it: only `androidx.annotation:annotation` cannot be downloaded from the
central repository, and excluding it does not cause UT failures.
   
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] cloud-fan commented on a diff in pull request #40499: [SPARK-42876][SQL] DataType's physicalDataType should be private[sql]

2023-03-20 Thread via GitHub


cloud-fan commented on code in PR #40499:
URL: https://github.com/apache/spark/pull/40499#discussion_r1142839025


##
sql/catalyst/src/main/scala/org/apache/spark/sql/types/DataType.scala:
##
@@ -119,7 +119,7 @@ abstract class DataType extends AbstractDataType {
 
   override private[sql] def acceptsType(other: DataType): Boolean = sameType(other)
 
-  def physicalDataType: PhysicalDataType = UninitializedPhysicalType
+  private[sql] def physicalDataType: PhysicalDataType = UninitializedPhysicalType

Review Comment:
   can we change the modifier in subclasses as well?
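
   For illustration, a tiny self-contained sketch of what carrying the modifier
into an override looks like (the package and class names below are made up, not
Spark's actual types):
   ```scala
   package org.example.sql

   abstract class BaseType {
     // Restricted to the enclosing `sql` package, like the base-class change in the diff.
     private[sql] def physicalName: String = "uninitialized"
   }

   class IntLikeType extends BaseType {
     // The subclass override carries the same qualifier, so it stays package-private too.
     override private[sql] def physicalName: String = "int"
   }
   ```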



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] cloud-fan commented on a diff in pull request #40488: [SPARK-42851][SQL] Replace EquivalentExpressions with mutable map in PhysicalAggregation

2023-03-20 Thread via GitHub


cloud-fan commented on code in PR #40488:
URL: https://github.com/apache/spark/pull/40488#discussion_r1142837488


##
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/planning/patterns.scala:
##
@@ -296,12 +298,17 @@ object PhysicalAggregation {
   // build a set of semantically distinct aggregate expressions and re-write expressions so
   // that they reference the single copy of the aggregate function which actually gets computed.
   // Non-deterministic aggregate expressions are not deduplicated.
-  val equivalentAggregateExpressions = new EquivalentExpressions
+  val equivalentAggregateExpressions = mutable.Map.empty[Expression, Expression]
   val aggregateExpressions = resultExpressions.flatMap { expr =>
 expr.collect {
-  // addExpr() always returns false for non-deterministic expressions and do not add them.
   case a
-if AggregateExpression.isAggregate(a) && !equivalentAggregateExpressions.addExpr(a) =>

Review Comment:
   what's wrong with `addExpr` here? It does simplify the code IMO.
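
   For context, a small self-contained sketch of the map-based dedup pattern the
diff above moves to; the `canonical` key and the names here are illustrative
assumptions, not the exact PR code:
   ```scala
   import scala.collection.mutable

   object DedupSketch extends App {
     // Toy stand-in for Catalyst expressions: deduplicate by a canonical form.
     case class Expr(canonical: String)

     val seen = mutable.Map.empty[String, Expr]
     // getOrElseUpdate keeps the first equivalent expression, so later occurrences
     // can be rewritten to reference the single copy that actually gets computed.
     def dedup(e: Expr): Expr = seen.getOrElseUpdate(e.canonical, e)

     val first = Expr("sum(x)")
     val second = Expr("sum(x)")
     assert(dedup(first) eq first)
     assert(dedup(second) eq first) // the duplicate maps back to the kept copy
   }
   ```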



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] zhengruifeng commented on pull request #40263: [SPARK-42659][ML] Reimplement `FPGrowthModel.transform` with dataframe operations

2023-03-20 Thread via GitHub


zhengruifeng commented on PR #40263:
URL: https://github.com/apache/spark/pull/40263#issuecomment-1477193966

   TL;DR: I want to apply a scalar subquery to optimize `FPGrowthModel.transform`;
there are two options:
   
   1, create temp views and use `spark.sql`, see 
https://github.com/apache/spark/commit/63595ba03d9f18fe0b43bfb09f974ea50cb2c651;
   
   2, add `private[spark] def withScalarSubquery(colName: String, subquery: Dataset[_]): DataFrame`,
which seems much more convenient, but I am not sure whether it is a proper way.
(A rough, self-contained sketch of option 1 is included below.)
   
   cc @cloud-fan @HyukjinKwon 
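
   As mentioned above, a rough, self-contained sketch of option 1 (temp views plus
`spark.sql` with a scalar subquery); the table and column names are made up and
unrelated to `FPGrowthModel` internals:
   ```scala
   import org.apache.spark.sql.SparkSession

   object ScalarSubquerySketch extends App {
     val spark = SparkSession.builder().master("local[1]").appName("sketch").getOrCreate()
     import spark.implicits._

     Seq(("a", 3L), ("b", 5L)).toDF("item", "freq").createOrReplaceTempView("freqs")
     Seq("a", "b", "c").toDF("item").createOrReplaceTempView("items")

     // The scalar subquery is evaluated once and its single value is attached to every row.
     spark.sql("SELECT item, (SELECT max(freq) FROM freqs) AS max_freq FROM items").show()

     spark.stop()
   }
   ```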


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] LuciferYang commented on pull request #40492: [SPARK-42791][SQL][FOLLOWUP] Re-generate golden files for `array_prepend`

2023-03-20 Thread via GitHub


LuciferYang commented on PR #40492:
URL: https://github.com/apache/spark/pull/40492#issuecomment-1477180380

   Thanks All ~


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] beliefer commented on a diff in pull request #40467: [SPARK-42584][CONNECT] Improve output of `Column.explain`

2023-03-20 Thread via GitHub


beliefer commented on code in PR #40467:
URL: https://github.com/apache/spark/pull/40467#discussion_r1142827066


##
connector/connect/common/src/main/protobuf/spark/connect/base.proto:
##
@@ -163,6 +164,14 @@ message AnalyzePlanRequest {
 // (Required) The logical plan to get a hashCode.
 Plan plan = 1;
   }
+
+  // Explains the expression based on extended is true or not.
+  message ExplainExpression {
+// (Required) The expression to be analyzed.
+Expression expr = 1;
+
+bool extended = 2;

Review Comment:
   Thank you! I forgot it.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] beliefer commented on pull request #40467: [SPARK-42584][CONNECT] Improve output of `Column.explain`

2023-03-20 Thread via GitHub


beliefer commented on PR #40467:
URL: https://github.com/apache/spark/pull/40467#issuecomment-1477176898

   > Do we need python side compatible API?
   
   Maybe. I will check the python side.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] beliefer commented on a diff in pull request #40467: [SPARK-42584][CONNECT] Improve output of `Column.explain`

2023-03-20 Thread via GitHub


beliefer commented on code in PR #40467:
URL: https://github.com/apache/spark/pull/40467#discussion_r1142826734


##
connector/connect/client/jvm/src/main/scala/org/apache/spark/sql/Column.scala:
##
@@ -1211,13 +1211,11 @@ class Column private[sql] (private[sql] val expr: proto.Expression) extends Logg
* @group df_ops
* @since 3.4.0
*/
-  def explain(extended: Boolean): Unit = {
+  def explain(extended: Boolean)(implicit spark: SparkSession): Unit = {

Review Comment:
   Good question. It seems we must send the message to the server side, so we need
a `SparkSession`.
   But the question is how to get a `SparkSession` in Connect's `Column`? Or should
we define the SparkSession in `Column` directly?
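
   For illustration, a REPL-style sketch of how the implicit-parameter signature in
the diff above would be used from caller code (assuming a Connect session built
via the builder's `remote()`; this is not part of the PR):
   ```scala
   import org.apache.spark.sql.SparkSession
   import org.apache.spark.sql.functions.col

   // The caller keeps an implicit session in scope so that explain() can send the
   // analyze request to the server.
   implicit val spark: SparkSession = SparkSession.builder().remote("sc://localhost").build()

   val c = col("a") + col("b")
   c.explain(extended = true) // resolved against the implicit `spark` above
   ```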



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] beliefer commented on a diff in pull request #40466: [SPARK-42835][SQL][TESTS] Add test cases for `Column.explain`

2023-03-20 Thread via GitHub


beliefer commented on code in PR #40466:
URL: https://github.com/apache/spark/pull/40466#discussion_r1142825198


##
sql/core/src/test/scala/org/apache/spark/sql/ColumnExpressionSuite.scala:
##
@@ -921,6 +922,132 @@ class ColumnExpressionSuite extends QueryTest with SharedSparkSession {
 }
   }
 
+  private def captureStdOut(block: => Unit): String = {
+val capturedOut = new ByteArrayOutputStream()
+Console.withOut(capturedOut)(block)
+capturedOut.toString()
+  }
+
+  test("explain") {

Review Comment:
   It sounds better.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] yaooqinn commented on a diff in pull request #40476: [MINOR][BUILD] Remove unused properties in pom file

2023-03-20 Thread via GitHub


yaooqinn commented on code in PR #40476:
URL: https://github.com/apache/spark/pull/40476#discussion_r1142823591


##
resource-managers/kubernetes/integration-tests/pom.xml:
##
@@ -26,8 +26,6 @@
 
   spark-kubernetes-integration-tests_2.12
   
-1.3.0
-

Review Comment:
   So my guess is that `extraScalaTestArgs` can be helpful for local builds and
tests, letting developers add custom Java options, agents, etc. It just defaults
to empty; it is not useless.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] xinrong-meng commented on a diff in pull request #40487: [WIP] Implement CoGrouped Map API

2023-03-20 Thread via GitHub


xinrong-meng commented on code in PR #40487:
URL: https://github.com/apache/spark/pull/40487#discussion_r1142823170


##
connector/connect/server/src/main/scala/org/apache/spark/sql/connect/planner/SparkConnectPlanner.scala:
##
@@ -509,6 +511,26 @@ class SparkConnectPlanner(val session: SparkSession) {
   .logicalPlan
   }
 
+  private def transformCoGroupMap(rel: proto.GroupMap): LogicalPlan = {

Review Comment:
   Thanks.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] beliefer commented on pull request #40418: [SPARK-42790][SQL] Abstract the excluded method for better test for JDBC docker tests.

2023-03-20 Thread via GitHub


beliefer commented on PR #40418:
URL: https://github.com/apache/spark/pull/40418#issuecomment-1477169805

   @srowen Thank you! @cloud-fan @huaxingao Thank you too!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] zhengruifeng commented on pull request #40497: [SPARK-42875][CONNECT][PYTHON] Fix toPandas to handle timezone and map types properly

2023-03-20 Thread via GitHub


zhengruifeng commented on PR #40497:
URL: https://github.com/apache/spark/pull/40497#issuecomment-1477168751

   merged to master/branch-3.4


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] zhengruifeng closed pull request #40497: [SPARK-42875][CONNECT][PYTHON] Fix toPandas to handle timezone and map types properly

2023-03-20 Thread via GitHub


zhengruifeng closed pull request #40497: [SPARK-42875][CONNECT][PYTHON] Fix 
toPandas to handle timezone and map types properly
URL: https://github.com/apache/spark/pull/40497


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] cloud-fan commented on pull request #40473: [SPARK-42851][SQL] Guard EquivalentExpressions.addExpr() with supportedExpression()

2023-03-20 Thread via GitHub


cloud-fan commented on PR #40473:
URL: https://github.com/apache/spark/pull/40473#issuecomment-1477165259

   The check was added to `getExprState` in 
https://github.com/apache/spark/pull/39010 to avoid canonicalizing a subquery
expression and hitting an NPE.
   
   I agree that we should be consistent, and this PR LGTM. Can we update the test
case to use `LambdaVariable`, as `NamedLambdaVariable` has been removed in 
https://github.com/apache/spark/pull/40475?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] amaliujia opened a new pull request, #40499: [SPARK-42876][SQL] DataType's physicalDataType should be private[sql]

2023-03-20 Thread via GitHub


amaliujia opened a new pull request, #40499:
URL: https://github.com/apache/spark/pull/40499

   
   
   ### What changes were proposed in this pull request?
   
   `physicalDataType` should not be a public API; it should be private[sql].
   
   ### Why are the changes needed?
   
   This limits the API surface so that unnecessary APIs are not exposed as public.
   
   ### Does this PR introduce _any_ user-facing change?
   
   No, since we have not released Spark 3.4.0 yet.
   
   ### How was this patch tested?
   
   N/A


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] StevenChenDatabricks commented on a diff in pull request #40385: [SPARK-42753] ReusedExchange refers to non-existent nodes

2023-03-20 Thread via GitHub


StevenChenDatabricks commented on code in PR #40385:
URL: https://github.com/apache/spark/pull/40385#discussion_r1142811318


##
sql/core/src/main/scala/org/apache/spark/sql/execution/ExplainUtils.scala:
##
@@ -73,14 +78,34 @@ object ExplainUtils extends AdaptiveSparkPlanHelper {
*/
   def processPlan[T <: QueryPlan[T]](plan: T, append: String => Unit): Unit = {
 try {
+  // Initialize a reference-unique set of Operators to avoid accdiental overwrites and to allow
+  // intentional overwriting of IDs generated in previous AQE iteration
+  val operators = newSetFromMap[QueryPlan[_]](new IdentityHashMap())
+  // Initialize an array of ReusedExchanges to help find Adaptively Optimized Out
+  // Exchanges as part of SPARK-42753
+  val reusedExchanges = ArrayBuffer.empty[ReusedExchangeExec]

Review Comment:
   This `ArrayBuffer` is guaranteed unique because we only insert 
`ReusedExchange` nodes in the `setOpId` function which checks against the 
`IdentityHashMap`.
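
   For illustration, a small self-contained sketch of the pattern described here:
appends happen only behind an identity-set check, so the buffer stays
reference-unique (the names are made up, not the actual `ExplainUtils` code):
   ```scala
   import java.util.{Collections, IdentityHashMap}
   import scala.collection.mutable.ArrayBuffer

   object IdentityGuardSketch extends App {
     val seen = Collections.newSetFromMap(new IdentityHashMap[AnyRef, java.lang.Boolean]())
     val collected = ArrayBuffer.empty[AnyRef]

     // Record a node only the first time this exact object reference is visited.
     def record(node: AnyRef): Unit = if (seen.add(node)) collected += node

     val node = new Object
     record(node)
     record(node) // ignored: same reference, so no duplicate ends up in the buffer
     assert(collected.size == 1)
   }
   ```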



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] StevenChenDatabricks commented on a diff in pull request #40385: [SPARK-42753] ReusedExchange refers to non-existent nodes

2023-03-20 Thread via GitHub


StevenChenDatabricks commented on code in PR #40385:
URL: https://github.com/apache/spark/pull/40385#discussion_r1142810230


##
sql/core/src/main/scala/org/apache/spark/sql/execution/ExplainUtils.scala:
##
@@ -73,14 +78,34 @@ object ExplainUtils extends AdaptiveSparkPlanHelper {
*/
   def processPlan[T <: QueryPlan[T]](plan: T, append: String => Unit): Unit = {
 try {
+  // Initialize a reference-unique set of Operators to avoid accdiental overwrites and to allow
+  // intentional overwriting of IDs generated in previous AQE iteration
+  val operators = newSetFromMap[QueryPlan[_]](new IdentityHashMap())

Review Comment:
   `SparkPlan` has an ID, but not all `QueryPlan`s have one, and this function
accepts any `QueryPlan`. I attempted creating a HashMap of IDs and casting the
`plan` argument to `SparkPlan`, but it would fail some test cases where the node
isn't a `SparkPlan`.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] Kimahriman commented on pull request #40473: [SPARK-42851][SQL] Guard EquivalentExpressions.addExpr() with supportedExpression()

2023-03-20 Thread via GitHub


Kimahriman commented on PR #40473:
URL: https://github.com/apache/spark/pull/40473#issuecomment-1477140427

   > @Kimahriman I'd love to see a good CSE implementation for higher-order 
functions too. But for backporting the fix (which is this PR's primary intent) 
that would have been too much. For this one (or the one @peter-toth forked off) 
we're just aiming for a narrow fix that allows the aggregate to work again.
   
   Yeah, I was just commenting on the related PR that broke CSE for anything using
a HOF. I had plans to try to do CSE inside a HOF, but that stalled when I didn't
get any traction on initially adding codegen support.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] rednaxelafx commented on pull request #40473: [SPARK-42851][SQL] Guard EquivalentExpressions.addExpr() with supportedExpression()

2023-03-20 Thread via GitHub


rednaxelafx commented on PR #40473:
URL: https://github.com/apache/spark/pull/40473#issuecomment-1477136150

   @Kimahriman I'd love to see a good CSE implementation for higher-order 
functions too. But for backporting the fix (which is this PR's primary intent) 
that would have been too much. For this one (or the one @peter-toth forked off) 
we're just aiming for a narrow fix that allows the aggregate to work again.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] rednaxelafx commented on pull request #40473: [SPARK-42851][SQL] Guard EquivalentExpressions.addExpr() with supportedExpression()

2023-03-20 Thread via GitHub


rednaxelafx commented on PR #40473:
URL: https://github.com/apache/spark/pull/40473#issuecomment-1477132202

   @peter-toth could you please clarify why `supportedExpression()` was needed 
in `getExprState()` in the first place? i.e. why isn't it sufficient to add it 
to `addExprTree()`?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] rednaxelafx commented on pull request #40488: [SPARK-42851][SQL] Replace EquivalentExpressions with mutable map in PhysicalAggregation

2023-03-20 Thread via GitHub


rednaxelafx commented on PR #40488:
URL: https://github.com/apache/spark/pull/40488#issuecomment-1477131544

   Before the recent rounds of changes to EquivalentExpressions, the old 
`addExprTree` used to call `addExpr` in its core:
   
https://github.com/apache/spark/blob/branch-2.1/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/EquivalentExpressions.scala#L90
   That was still the case when `PhysicalAggregation` started using this 
mechanism to deduplicate expressions. I guess it started becoming "detached" 
from the main path when the recent refactoring happened that allows updating a 
separate equivalence map instead of the "main" one.
   
   Your proposed PR here further orphans that function from any actual use, which
is okay for keeping binary compatibility as much as possible.
   The inlined dedup logic in `PhysicalAggregation` looks rather ugly, though. I
don't have a strong opinion about that as long as other reviewers are okay; I'd
still prefer retaining some sort of encapsulated collection for the dedup usage.
   
   BTW, I updated my PR's test case because it makes more sense to check the return
value from `addExpr` on the second invocation rather than on the first (both the
"not supported expression" case and an actually new expression would have gotten
a `false` return value if we added that guard to the `addExpr()` function).
   
https://github.com/apache/spark/pull/40473/commits/28d101ee6765c5453189fa62d6b8ade1568d99d2


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] github-actions[bot] commented on pull request #38534: [SPARK-38505][SQL] Make partial aggregation adaptive

2023-03-20 Thread via GitHub


github-actions[bot] commented on PR #38534:
URL: https://github.com/apache/spark/pull/38534#issuecomment-1477119915

   We're closing this PR because it hasn't been updated in a while. This isn't 
a judgement on the merit of the PR in any way. It's just a way of keeping the 
PR queue manageable.
   If you'd like to revive this PR, please reopen it and ask a committer to 
remove the Stale tag!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] github-actions[bot] closed pull request #38661: [SPARK-41085][SQL] Support Bit manipulation function COUNTSET

2023-03-20 Thread via GitHub


github-actions[bot] closed pull request #38661: [SPARK-41085][SQL] Support Bit 
manipulation function COUNTSET
URL: https://github.com/apache/spark/pull/38661


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] github-actions[bot] commented on pull request #38608: [SPARK-41080][SQL] Support Bit manipulation function SETBIT

2023-03-20 Thread via GitHub


github-actions[bot] commented on PR #38608:
URL: https://github.com/apache/spark/pull/38608#issuecomment-1477119894

   We're closing this PR because it hasn't been updated in a while. This isn't 
a judgement on the merit of the PR in any way. It's just a way of keeping the 
PR queue manageable.
   If you'd like to revive this PR, please reopen it and ask a committer to 
remove the Stale tag!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] WeichenXu123 commented on a diff in pull request #40479: [CONNECT][ML][WIP] Spark connect ML for scala client

2023-03-20 Thread via GitHub


WeichenXu123 commented on code in PR #40479:
URL: https://github.com/apache/spark/pull/40479#discussion_r1142784162


##
connector/connect/client/jvm/src/main/scala/org/apache/spark/ml/Pipeline.scala:
##
@@ -17,47 +17,13 @@
 
 package org.apache.spark.ml
 
-import org.apache.spark.annotation.DeveloperApi
 import org.apache.spark.internal.Logging
 import org.apache.spark.ml.param.{ParamMap, Params}
-import org.apache.spark.sql.types.StructType
 
 /**
  * A stage in a pipeline, either an [[Estimator]] or a [[Transformer]].
  */
 abstract class PipelineStage extends Params with Logging {
 
-  /**
-   * Check transform validity and derive the output schema from the input schema.
-   *
-   * We check validity for interactions between parameters during `transformSchema` and raise an
-   * exception if any parameter value is invalid. Parameter value checks which do not depend on
-   * other parameters are handled by `Param.validate()`.
-   *
-   * Typical implementation should first conduct verification on schema change and parameter
-   * validity, including complex parameter interaction checks.
-   */
-  def transformSchema(schema: StructType): StructType

Review Comment:
   Similarly, on the PySpark side, `Transformer`/`JavaTransformer` also does not do
any schema transformation.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] amaliujia commented on a diff in pull request #40498: [WIP] reader table API could also accept options

2023-03-20 Thread via GitHub


amaliujia commented on code in PR #40498:
URL: https://github.com/apache/spark/pull/40498#discussion_r1142782057


##
connector/connect/common/src/main/protobuf/spark/connect/relations.proto:
##
@@ -148,6 +143,13 @@ message Read {
 // This is only supported by the JDBC data source.
 repeated string predicates = 5;
   }
+
+  // Options for data sources and named table.
+  //
+  // When using for data sources, the context of this map varies based on the
+  // data source format. This options could be empty for valid data source format.
+  // The map key is case insensitive.
+  map options = 3;

Review Comment:
   Sounds good. I will probably go with this non-breaking approach that only adds
`options` to `NamedTable`.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] grundprinzip commented on a diff in pull request #40498: [WIP] reader table API could also accept options

2023-03-20 Thread via GitHub


grundprinzip commented on code in PR #40498:
URL: https://github.com/apache/spark/pull/40498#discussion_r1142781033


##
connector/connect/common/src/main/protobuf/spark/connect/relations.proto:
##
@@ -148,6 +143,13 @@ message Read {
 // This is only supported by the JDBC data source.
 repeated string predicates = 5;
   }
+
+  // Options for data sources and named table.
+  //
+  // When using for data sources, the context of this map varies based on the
+  // data source format. This options could be empty for valid data source format.
+  // The map key is case insensitive.
+  map options = 3;

Review Comment:
   I agree that adding the options to the named table is probably the better
approach, given that we have many relations.
   
   Alternatively, simply support both.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] grundprinzip commented on a diff in pull request #40498: [WIP] reader table API could also accept options

2023-03-20 Thread via GitHub


grundprinzip commented on code in PR #40498:
URL: https://github.com/apache/spark/pull/40498#discussion_r1142780483


##
connector/connect/common/src/main/protobuf/spark/connect/relations.proto:
##
@@ -148,6 +143,13 @@ message Read {
 // This is only supported by the JDBC data source.
 repeated string predicates = 5;
   }
+
+  // Options for data sources and named table.
+  //
+  // When using for data sources, the context of this map varies based on the
+  // data source format. This options could be empty for valid data source format.
+  // The map key is case insensitive.
+  map options = 3;

Review Comment:
   You must not break. 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] zhenlineo commented on a diff in pull request #40498: [WIP] reader table API could also accept options

2023-03-20 Thread via GitHub


zhenlineo commented on code in PR #40498:
URL: https://github.com/apache/spark/pull/40498#discussion_r1142775827


##
connector/connect/common/src/main/protobuf/spark/connect/relations.proto:
##
@@ -148,6 +143,13 @@ message Read {
 // This is only supported by the JDBC data source.
 repeated string predicates = 5;
   }
+
+  // Options for data sources and named table.
+  //
+  // When using for data sources, the context of this map varies based on the
+  // data source format. This options could be empty for valid data source format.
+  // The map key is case insensitive.
+  map options = 3;

Review Comment:
   What's the policy around "breaking changes"?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] zhenlineo commented on a diff in pull request #40498: [WIP] reader table API could also accept options

2023-03-20 Thread via GitHub


zhenlineo commented on code in PR #40498:
URL: https://github.com/apache/spark/pull/40498#discussion_r1142775413


##
connector/connect/client/jvm/src/main/scala/org/apache/spark/sql/DataFrameReader.scala:
##
@@ -183,7 +183,7 @@ class DataFrameReader private[sql] (sparkSession: SparkSession) extends Logging
   dataSourceBuilder.setFormat(source)
   userSpecifiedSchema.foreach(schema => dataSourceBuilder.setSchema(schema.toDDL))
   extraOptions.foreach { case (k, v) =>
-dataSourceBuilder.putOptions(k, v)
+builder.getReadBuilder.putOptions(k, v)

Review Comment:
   +1
   Can you also add a simple test `reader.options().table()`?
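
   For illustration, a rough sketch of what such a test body could look like; the
option keys and table name are placeholders, and it assumes the suite provides a
Connect `spark` session (a real test would assert on the generated
`Read.NamedTable` proto):
   ```scala
   test("reader options are propagated to table()") {
     // Sketch only: placeholder option keys and table name.
     val df = spark.read
       .option("p1", "v1")
       .option("p2", "v2")
       .table("my_table")
     // A real test would inspect df's plan and check that NamedTable carries p1/p2.
   }
   ```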



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] ueshin commented on a diff in pull request #40498: [WIP] reader table API could also accept options

2023-03-20 Thread via GitHub


ueshin commented on code in PR #40498:
URL: https://github.com/apache/spark/pull/40498#discussion_r1142758041


##
connector/connect/client/jvm/src/main/scala/org/apache/spark/sql/DataFrameReader.scala:
##
@@ -183,7 +183,7 @@ class DataFrameReader private[sql] (sparkSession: SparkSession) extends Logging
   dataSourceBuilder.setFormat(source)
   userSpecifiedSchema.foreach(schema => dataSourceBuilder.setSchema(schema.toDDL))
   extraOptions.foreach { case (k, v) =>
-dataSourceBuilder.putOptions(k, v)
+builder.getReadBuilder.putOptions(k, v)

Review Comment:
   should modify `table` method as well?



##
connector/connect/common/src/main/protobuf/spark/connect/relations.proto:
##
@@ -148,6 +143,13 @@ message Read {
 // This is only supported by the JDBC data source.
 repeated string predicates = 5;
   }
+
+  // Options for data sources and named table.
+  //
+  // When using for data sources, the context of this map varies based on the
+  // data source format. This options could be empty for valid data source format.
+  // The map key is case insensitive.
+  map options = 3;

Review Comment:
   I guess just adding the field to `NamedTable` is simpler and doesn't break 
anything?



##
python/pyspark/sql/connect/plan.py:
##
@@ -293,7 +293,7 @@ def plan(self, session: "SparkConnectClient") -> proto.Relation:
 plan.read.data_source.schema = self._schema
 if self._options is not None and len(self._options) > 0:
 for k, v in self._options.items():
-plan.read.data_source.options[k] = v
+plan.read.options[k] = v

Review Comment:
   should modify `Read` as well?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] amaliujia opened a new pull request, #40498: [WIP] reader table API could also accept options

2023-03-20 Thread via GitHub


amaliujia opened a new pull request, #40498:
URL: https://github.com/apache/spark/pull/40498

   
   
   ### What changes were proposed in this pull request?
   
   It turns out that `spark.read.option.table` is a valid call chain, and the
`table` API does accept options when opening a table.
   
   Existing Spark Connect implementation does not consider it.
   
   ### Why are the changes needed?
   
   Feature parity.
   
   ### Does this PR introduce _any_ user-facing change?
   
   NO
   
   ### How was this patch tested?
   
   UT


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] JohnTortugo commented on pull request #40225: [SPARK-42625][BUILD] Upgrade `zstd-jni` to 1.5.4-2

2023-03-20 Thread via GitHub


JohnTortugo commented on PR #40225:
URL: https://github.com/apache/spark/pull/40225#issuecomment-1477026864

   Thanks a lot!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] dongjoon-hyun commented on pull request #40225: [SPARK-42625][BUILD] Upgrade `zstd-jni` to 1.5.4-2

2023-03-20 Thread via GitHub


dongjoon-hyun commented on PR #40225:
URL: https://github.com/apache/spark/pull/40225#issuecomment-1477024392

   Here is the official documentation on how to run the benchmarks in your GitHub
Actions. Please see the `Running benchmarks in your forked repository` section.
   - https://spark.apache.org/developer-tools.html
   
   In addition, you can check the GitHub Actions benchmark job to see the logic.
   - https://github.com/apache/spark/blob/master/.github/workflows/benchmark.yml
   
   Lastly, each benchmark file has command-line directions in its header,
@JohnTortugo.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] JohnTortugo commented on pull request #40225: [SPARK-42625][BUILD] Upgrade `zstd-jni` to 1.5.4-2

2023-03-20 Thread via GitHub


JohnTortugo commented on PR #40225:
URL: https://github.com/apache/spark/pull/40225#issuecomment-1477003627

   Hey @dongjoon-hyun - Can you please point me to some documentation about how 
this benchmarking is done? I'd like to run the same benchmark locally.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] ueshin opened a new pull request, #40497: [SPARK-42875][CONNECT][PYTHON] Fix toPandas to handle timezone and map types properly

2023-03-20 Thread via GitHub


ueshin opened a new pull request, #40497:
URL: https://github.com/apache/spark/pull/40497

   ### What changes were proposed in this pull request?
   
   Fix `DataFrame.toPandas()` to handle timezone and map types properly.
   
   ### Why are the changes needed?
   
   Currently, `DataFrame.toPandas()` doesn't handle the timezone for timestamp
types, or map types, properly.
   
   For example:
   
   ```py
   >>> schema = StructType().add("ts", TimestampType())
   >>> spark.createDataFrame([(datetime(1969, 1, 1, 1, 1, 1),), (datetime(2012, 3, 3, 3, 3, 3),), (datetime(2100, 4, 4, 4, 4, 4),)], schema).toPandas()
                            ts
   0 1969-01-01 01:01:01-08:00
   1 2012-03-03 03:03:03-08:00
   2 2100-04-04 03:04:04-08:00
   ```
   
   which should be:
   
   ```py
                      ts
   0 1969-01-01 01:01:01
   1 2012-03-03 03:03:03
   2 2100-04-04 04:04:04
   ```
   
   ### Does this PR introduce _any_ user-facing change?
   
   The result of `DataFrame.toPandas()` with timestamp type and map type will 
be the same as PySpark.
   
   ### How was this patch tested?
   
   Enabled the related tests.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] HyukjinKwon commented on pull request #40496: [SPARK-42874][SQL] Enable new golden file test framework for analysis for all input files

2023-03-20 Thread via GitHub


HyukjinKwon commented on PR #40496:
URL: https://github.com/apache/spark/pull/40496#issuecomment-1476856830

   LGTM if tests pass


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] HyukjinKwon closed pull request #40485: [SPARK-42870][CONNECT] Move `toCatalystValue` to `connect-common`

2023-03-20 Thread via GitHub


HyukjinKwon closed pull request #40485: [SPARK-42870][CONNECT] Move 
`toCatalystValue` to `connect-common`
URL: https://github.com/apache/spark/pull/40485


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] HyukjinKwon commented on pull request #40485: [SPARK-42870][CONNECT] Move `toCatalystValue` to `connect-common`

2023-03-20 Thread via GitHub


HyukjinKwon commented on PR #40485:
URL: https://github.com/apache/spark/pull/40485#issuecomment-1476846845

   Merged to master.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] mridulm commented on a diff in pull request #40393: [SPARK-40082] Schedule mergeFinalize when push merge shuffleMapStage retry but no running tasks

2023-03-20 Thread via GitHub


mridulm commented on code in PR #40393:
URL: https://github.com/apache/spark/pull/40393#discussion_r1142572022


##
core/src/test/scala/org/apache/spark/scheduler/DAGSchedulerSuite.scala:
##
@@ -4595,6 +4595,184 @@ class DAGSchedulerSuite extends SparkFunSuite with TempLocalSparkContext with Ti
 }
   }
 
+  test("SPARK-40082: recomputation of shuffle map stage with no pending 
partitions" +
+"should not hang") {
+
+initPushBasedShuffleConfs(conf)
+conf.set(config.SHUFFLE_MERGER_LOCATIONS_MIN_STATIC_THRESHOLD, 3)
+DAGSchedulerSuite.clearMergerLocs()
+DAGSchedulerSuite.addMergerLocs(Seq("host1", "host2", "host3"))
+
+val rddA = new MyRDD(sc, 2, Nil)
+val shuffleDepA = new ShuffleDependency(rddA, new HashPartitioner(2))
+val shuffleIdA = shuffleDepA.shuffleId
+
+val rddB = new MyRDD(sc, 2, List(shuffleDepA), tracker = mapOutputTracker)
+val shuffleDepB = new ShuffleDependency(rddB, new HashPartitioner(2))
+
+val rddC = new MyRDD(sc, 2, List(shuffleDepB), tracker = mapOutputTracker)
+
+submit(rddC, Array(0, 1))
+
+completeShuffleMapStageSuccessfully(0, 0, 2, Seq("hostA", "hostA"))
+
+// Fetch failed
+runEvent(makeCompletionEvent(
+  taskSets(1).tasks(0),
+  FetchFailed(makeBlockManagerId("hostC"), shuffleIdA, 0L, 0, 0,
+"Fetch failure of task: stageId=1, stageAttempt=0, partitionId=0"),
+  result = null))
+
+// long running task complete
+completeShuffleMapStageSuccessfully(1, 0, 2, Seq("hostA", "hostA"))
+assert(!shuffleDepB.shuffleMergeFinalized)
+
+// stage1`s tasks have all completed
+val shuffleStage1 = scheduler.stageIdToStage(1).asInstanceOf[ShuffleMapStage]
+assert(shuffleStage1.pendingPartitions.isEmpty)
+
+// resubmit
+scheduler.resubmitFailedStages()
+
+// complete parentStage0
+completeShuffleMapStageSuccessfully(0, 0, 2, Seq("hostA", "hostA"))
+
+// stage1 should be shuffleMergeFinalized
+assert(shuffleDepB.shuffleMergeFinalized)
+  }
+
+  for (pushBasedShuffleEnabled <- Seq(true, false)) {
+test("SPARK-40082: recomputation of shuffle map stage with no pending 
partitions should not " +
+  s"hang. pushBasedShuffleEnabled = $pushBasedShuffleEnabled") {

Review Comment:
   Good question - I have not tried that :-) This is a pattern used for other 
tests as well when we want to do a config sweep.
   Does specifying pushBasedShuffleEnabled = true in that string work?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] tgravescs commented on pull request #39127: [SPARK-41585][YARN] The Spark exclude node functionality for YARN should work independently of dynamic allocation

2023-03-20 Thread via GitHub


tgravescs commented on PR #39127:
URL: https://github.com/apache/spark/pull/39127#issuecomment-1476782666

   merged to master


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] asfgit closed pull request #39127: [SPARK-41585][YARN] The Spark exclude node functionality for YARN should work independently of dynamic allocation

2023-03-20 Thread via GitHub


asfgit closed pull request #39127: [SPARK-41585][YARN] The Spark exclude node 
functionality for YARN should work independently of dynamic allocation
URL: https://github.com/apache/spark/pull/39127


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] MaxGekk closed pull request #40492: [SPARK-42791][SQL][FOLLOWUP] Re-generate golden files for `array_prepend`

2023-03-20 Thread via GitHub


MaxGekk closed pull request #40492: [SPARK-42791][SQL][FOLLOWUP] Re-generate 
golden files for `array_prepend`
URL: https://github.com/apache/spark/pull/40492


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] aokolnychyi commented on pull request #40478: [SPARK-42779][SQL][FOLLOWUP] Allow V2 writes to indicate advisory shuffle partition size

2023-03-20 Thread via GitHub


aokolnychyi commented on PR #40478:
URL: https://github.com/apache/spark/pull/40478#issuecomment-1476769291

   Thanks, @dongjoon-hyun @cloud-fan!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] MaxGekk commented on pull request #40492: [SPARK-42791][SQL][FOLLOWUP] Re-generate golden files for `array_prepend`

2023-03-20 Thread via GitHub


MaxGekk commented on PR #40492:
URL: https://github.com/apache/spark/pull/40492#issuecomment-1476767577

   +1, LGTM. Merging to master.
   Thank you, @LuciferYang and @dongjoon-hyun @gengliangwang @dtenedor for 
review.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] LuciferYang commented on pull request #40492: [SPARK-42791][SQL][FOLLOWUP] Re-generate golden files for `array_prepend`

2023-03-20 Thread via GitHub


LuciferYang commented on PR #40492:
URL: https://github.com/apache/spark/pull/40492#issuecomment-1476757882

   GA passed


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] ueshin commented on a diff in pull request #40467: [SPARK-42584][CONNECT] Improve output of `Column.explain`

2023-03-20 Thread via GitHub


ueshin commented on code in PR #40467:
URL: https://github.com/apache/spark/pull/40467#discussion_r1142537848


##
connector/connect/client/jvm/src/main/scala/org/apache/spark/sql/Column.scala:
##
@@ -1211,13 +1211,11 @@ class Column private[sql] (private[sql] val expr: 
proto.Expression) extends Logg
* @group df_ops
* @since 3.4.0
*/
-  def explain(extended: Boolean): Unit = {
+  def explain(extended: Boolean)(implicit spark: SparkSession): Unit = {

Review Comment:
   Who will provide the implicit `SparkSession` in the users' context?
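   For this to resolve, callers would need an implicit `SparkSession` already in scope. A minimal caller-side sketch (how the Connect session is obtained below is an assumption, not something this PR defines):

   ```scala
   import org.apache.spark.sql.SparkSession

   // Hypothetical setup: buildConnectSession() stands in for however the user's
   // application actually creates its Spark Connect session.
   implicit val spark: SparkSession = buildConnectSession()

   val df = spark.range(1).toDF("value")
   df.col("value").explain(extended = true) // the implicit `spark` is picked up from scope
   ```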



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] dtenedor opened a new pull request, #40496: [SPARK-42874][SQL] Enable new golden file test framework for analysis for all input files

2023-03-20 Thread via GitHub


dtenedor opened a new pull request, #40496:
URL: https://github.com/apache/spark/pull/40496

   ### What changes were proposed in this pull request?
   
   This PR enables the new golden file test framework for analysis for all 
input files.
   
   Background:
   * In https://github.com/apache/spark/pull/40449 we added the ability to 
exercise the analyzer on the SQL queries in existing golden files in the 
`sql/core/src/test/resources/sql-tests/inputs` directory, writing separate 
output test files in the new 
`sql/core/src/test/resources/sql-tests/analyzer-results` directory, in addition to the original output directory for full end-to-end query execution results.
   * That PR also added an allowlist of input files to include in this new 
dual-run mode.
   * In this PR, we remove that allowlist and exercise the new dual-run mode for all the input files. We also extend the analyzer testing to support separate test cases for ANSI mode, TimestampNTZ, and UDFs.
   
   ### Why are the changes needed?
   
   This improves test coverage and helps prevent accidental regressions in the future as we edit the code.
   
   ### Does this PR introduce _any_ user-facing change?
   
   No.
   
   ### How was this patch tested?
   
   This PR adds testing only.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] MaxGekk commented on a diff in pull request #40493: modified for SPARK-42839: Assign a name to the error class _LEGACY_ER…

2023-03-20 Thread via GitHub


MaxGekk commented on code in PR #40493:
URL: https://github.com/apache/spark/pull/40493#discussion_r1142511975


##
sql/core/src/test/scala/org/apache/spark/sql/LegacyErrorTempSuit.scala:
##
@@ -0,0 +1,79 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+/*
+ * Created by ruilibuaa
+ * 2023-3-19
+ */
+
+package org.apache.spark.sql
+
+import org.apache.spark.SparkException
+import org.apache.spark.sql.expressions.UserDefinedFunction
+import org.apache.spark.sql.functions._
+import org.apache.spark.sql.test.SharedSparkSession
+
+
+class LegacyErrorTempSuit extends SharedSparkSession {
+
+
+  test("CANNOT_ZIP_MAPS") {

Review Comment:
   Could you place the test in `QueryCompilationErrorsSuite`, please? Usually, we put tests for exceptions from:
   - QueryParsingErrors to QueryParsingErrorsSuite
   - QueryCompilationErrors to QueryCompilationErrorsSuite
   - QueryExecutionErrors to QueryExecutionErrorsSuite



##
core/src/main/resources/error/error-classes.json:
##
@@ -3684,7 +3684,7 @@
   ". If necessary set  to false to bypass this error."
 ]
   },
-  "_LEGACY_ERROR_TEMP_2003" : {
+  "CANNOT_ZIP_MAPS" : {

Review Comment:
   You will most likely need to reorder this error class, since the error classes in the file should be ordered alphabetically.



##
sql/core/src/test/scala/org/apache/spark/sql/LegacyErrorTempSuit.scala:
##
@@ -0,0 +1,79 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+/*
+ * Created by ruilibuaa
+ * 2023-3-19
+ */
+
+package org.apache.spark.sql
+
+import org.apache.spark.SparkException
+import org.apache.spark.sql.expressions.UserDefinedFunction
+import org.apache.spark.sql.functions._
+import org.apache.spark.sql.test.SharedSparkSession
+
+
+class LegacyErrorTempSuit extends SharedSparkSession {
+
+
+  test("CANNOT_ZIP_MAPS") {
+
+val spark = SparkSession.builder().getOrCreate()
+import spark.implicits._
+
+val maxRoundedArrayLength = 2
+
+val zipMapsbyUniqueKey: UserDefinedFunction = udf(
+  (mapA: Map[String, Int], mapB: Map[String, Int]) => {
+if (mapA.keySet.intersect(mapB.keySet).nonEmpty) {
+  throw new
+  IllegalArgumentException(s"Maps have overlapping keys")
+}
+val mergedMap = mapA ++ mapB
+val mergeMapSize = mergedMap.size
+// GlobalVariables.mergeMapSize = Some(mergeMapSize)
+if (mergeMapSize > maxRoundedArrayLength) {
+  throw new
+  SparkException(s"Unsuccessful try to zip maps with $mergeMapSize 
unique keys " +
+s"due to exceeding the array size limit 
$maxRoundedArrayLength.")
+}
+mergedMap
+  }
+)
+
+val data = Seq(
+  (1, Map("a" -> 1, "b" -> 2), Map("c" -> 3, "d" -> 4)),
+  (2, Map("e" -> 5, "f" -> 6), Map("g" -> 7, "h" -> 8))
+)
+
+val exception = intercept[SparkException] {
+  val df = data.toDF("id", "map1", "map2")
+  val zippedDf = df.withColumn("zipped_map", 
zipMapsbyUniqueKey(col("map1"), col("map2")))
+  zippedDf.collect()
+}
+
+val errorMessage = exception.getMessage
+if (errorMessage.contains("exceeding the array size limit")) {
+  assert(errorMessage.contains(s"$maxRoundedArrayLength"))

Review Comment:
   Please use `checkError` here instead.
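   A minimal sketch of what that could look like (the `parameters` keys and values below are placeholders, not the actual message parameters of this error class):

   ```scala
   checkError(
     exception = intercept[SparkException] {
       zippedDf.collect()
     },
     errorClass = "CANNOT_ZIP_MAPS",
     // Placeholder parameters: align these with the entry in error-classes.json.
     parameters = Map("size" -> "4", "maxRoundedArrayLength" -> "2"))
   ```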



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

[GitHub] [spark] otterc commented on a diff in pull request #40393: [SPARK-40082] Schedule mergeFinalize when push merge shuffleMapStage retry but no running tasks

2023-03-20 Thread via GitHub


otterc commented on code in PR #40393:
URL: https://github.com/apache/spark/pull/40393#discussion_r1142475865


##
core/src/test/scala/org/apache/spark/scheduler/DAGSchedulerSuite.scala:
##
@@ -4595,6 +4595,184 @@ class DAGSchedulerSuite extends SparkFunSuite with 
TempLocalSparkContext with Ti
 }
   }
 
+  test("SPARK-40082: recomputation of shuffle map stage with no pending 
partitions" +
+"should not hang") {
+
+initPushBasedShuffleConfs(conf)
+conf.set(config.SHUFFLE_MERGER_LOCATIONS_MIN_STATIC_THRESHOLD, 3)
+DAGSchedulerSuite.clearMergerLocs()
+DAGSchedulerSuite.addMergerLocs(Seq("host1", "host2", "host3"))
+
+val rddA = new MyRDD(sc, 2, Nil)
+val shuffleDepA = new ShuffleDependency(rddA, new HashPartitioner(2))
+val shuffleIdA = shuffleDepA.shuffleId
+
+val rddB = new MyRDD(sc, 2, List(shuffleDepA), tracker = mapOutputTracker)
+val shuffleDepB = new ShuffleDependency(rddB, new HashPartitioner(2))
+
+val rddC = new MyRDD(sc, 2, List(shuffleDepB), tracker = mapOutputTracker)
+
+submit(rddC, Array(0, 1))
+
+completeShuffleMapStageSuccessfully(0, 0, 2, Seq("hostA", "hostA"))
+
+// Fetch failed
+runEvent(makeCompletionEvent(
+  taskSets(1).tasks(0),
+  FetchFailed(makeBlockManagerId("hostC"), shuffleIdA, 0L, 0, 0,
+"Fetch failure of task: stageId=1, stageAttempt=0, partitionId=0"),
+  result = null))
+
+// long running task complete
+completeShuffleMapStageSuccessfully(1, 0, 2, Seq("hostA", "hostA"))
+assert(!shuffleDepB.shuffleMergeFinalized)
+
+// stage1`s tasks have all completed
+val shuffleStage1 = 
scheduler.stageIdToStage(1).asInstanceOf[ShuffleMapStage]
+assert(shuffleStage1.pendingPartitions.isEmpty)
+
+// resubmit
+scheduler.resubmitFailedStages()
+
+// complete parentStage0
+completeShuffleMapStageSuccessfully(0, 0, 2, Seq("hostA", "hostA"))
+
+// stage1 should be shuffleMergeFinalized
+assert(shuffleDepB.shuffleMergeFinalized)
+  }
+
+  for (pushBasedShuffleEnabled <- Seq(true, false)) {
+test("SPARK-40082: recomputation of shuffle map stage with no pending 
partitions should not " +
+  s"hang. pushBasedShuffleEnabled = $pushBasedShuffleEnabled") {

Review Comment:
   How do we run just the individual tests? 



##
core/src/test/scala/org/apache/spark/scheduler/DAGSchedulerSuite.scala:
##
@@ -4595,6 +4595,184 @@ class DAGSchedulerSuite extends SparkFunSuite with 
TempLocalSparkContext with Ti
 }
   }
 
+  test("SPARK-40082: recomputation of shuffle map stage with no pending 
partitions" +
+"should not hang") {

Review Comment:
   nit: missing ` ` between `partitions` and `should`.
   Since this test is checking that the stage gets finalized, we should change 
the name to "SPARK-40082: re-computation of shuffle map stage with no pending 
partitions should finalize the stage"



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] bjornjorgensen commented on pull request #40494: [MINOR][DOCS] Fix typos

2023-03-20 Thread via GitHub


bjornjorgensen commented on PR #40494:
URL: https://github.com/apache/spark/pull/40494#issuecomment-1476637999

   @srowen FYI 
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] StevenChenDatabricks commented on a diff in pull request #40385: [SPARK-42753] ReusedExchange refers to non-existent nodes

2023-03-20 Thread via GitHub


StevenChenDatabricks commented on code in PR #40385:
URL: https://github.com/apache/spark/pull/40385#discussion_r114291


##
sql/core/src/main/scala/org/apache/spark/sql/execution/ExplainUtils.scala:
##
@@ -119,17 +155,40 @@ object ExplainUtils extends AdaptiveSparkPlanHelper {
* @param plan Input query plan to process
* @param startOperatorID The start value of operation id. The subsequent 
operations will be
*assigned higher value.
+   * @param visited A unique set of operators visited by generateOperatorIds. 
The set is scoped
+   *at the callsite function processPlan. It serves two 
purpose: Firstly, it is
+   *used to avoid accidentally overwriting existing IDs that 
were generated in
+   *the same processPlan call. Secondly, it is used to allow 
for intentional ID
+   *overwriting as part of SPARK-42753 where an Adaptively 
Optimized Out Exchange
+   *and its subtree may contain IDs that were generated in a 
previous AQE
+   *iteration's processPlan call which would result in 
incorrect IDs.
+   * @param reusedExchanges A unique set of ReusedExchange nodes visited which 
will be used to
+   *idenitfy adaptively optimized out exchanges in 
SPARK-42753.
+   * @param addReusedExchanges Whether to add ReusedExchange nodes to 
reusedExchanges set. We set it
+   *   to false to avoid processing more nested 
ReusedExchanges nodes in the
+   *   subtree of an Adpatively Optimized Out Exchange.
* @return The last generated operation id for this input plan. This is to 
ensure we always
* assign incrementing unique id to each operator.
*/
-  private def generateOperatorIDs(plan: QueryPlan[_], startOperatorID: Int): 
Int = {
+  private def generateOperatorIDs(
+  plan: QueryPlan[_],
+  startOperatorID: Int,
+  visited: Set[QueryPlan[_]],
+  reusedExchanges: ArrayBuffer[ReusedExchangeExec],
+  addReusedExchanges: Boolean): Int = {
 var currentOperationID = startOperatorID
 // Skip the subqueries as they are not printed as part of main query block.
 if (plan.isInstanceOf[BaseSubqueryExec]) {
   return currentOperationID
 }
 
-def setOpId(plan: QueryPlan[_]): Unit = if 
(plan.getTagValue(QueryPlan.OP_ID_TAG).isEmpty) {

Review Comment:
   I'm removing the "if OP_ID_TAG is empty" check because we now allow overwriting OP_ID_TAG: the subtree of an "Optimized Out Exchange" may contain nodes whose OP_ID_TAG was generated by a `processPlan` call in a previous AQE iteration, so those IDs would be incorrect and need to be overwritten.
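   Roughly, the overwrite-allowing version would look like this (a sketch based on the new signature above, assuming `visited` is a mutable set; the exact body in the PR may differ):

   ```scala
   def setOpId(plan: QueryPlan[_]): Unit = if (!visited.contains(plan)) {
     // Track nodes visited in this processPlan call: IDs generated within the same
     // call are never overwritten, while stale IDs left over from a previous AQE
     // iteration can be.
     visited.add(plan)
     currentOperationID += 1
     plan.setTagValue(QueryPlan.OP_ID_TAG, currentOperationID)
   }
   ```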



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] amaliujia commented on pull request #40467: [SPARK-42584][CONNECT] Improve output of `Column.explain`

2023-03-20 Thread via GitHub


amaliujia commented on PR #40467:
URL: https://github.com/apache/spark/pull/40467#issuecomment-1476589894

   Do we need a compatible Python-side API as well?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] amaliujia commented on a diff in pull request #40467: [SPARK-42584][CONNECT] Improve output of `Column.explain`

2023-03-20 Thread via GitHub


amaliujia commented on code in PR #40467:
URL: https://github.com/apache/spark/pull/40467#discussion_r1142418545


##
connector/connect/common/src/main/protobuf/spark/connect/base.proto:
##
@@ -163,6 +164,14 @@ message AnalyzePlanRequest {
 // (Required) The logical plan to get a hashCode.
 Plan plan = 1;
   }
+
+  // Explains the expression based on extended is true or not.
+  message ExplainExpression {
+// (Required) The expression to be analyzed.
+Expression expr = 1;
+
+bool extended = 2;

Review Comment:
   Nit: please follow the style guide to indicate if this is required or 
optional and also document the semantics.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] mridulm commented on pull request #40393: [SPARK-40082] Schedule mergeFinalize when push merge shuffleMapStage retry but no running tasks

2023-03-20 Thread via GitHub


mridulm commented on PR #40393:
URL: https://github.com/apache/spark/pull/40393#issuecomment-1476585572

   The test failure is unrelated to this PR - once the changes above are made, the re-execution should pass.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] mridulm commented on a diff in pull request #40393: [SPARK-40082] Schedule mergeFinalize when push merge shuffleMapStage retry but no running tasks

2023-03-20 Thread via GitHub


mridulm commented on code in PR #40393:
URL: https://github.com/apache/spark/pull/40393#discussion_r1142362530


##
core/src/test/scala/org/apache/spark/scheduler/DAGSchedulerSuite.scala:
##
@@ -4595,6 +4595,184 @@ class DAGSchedulerSuite extends SparkFunSuite with 
TempLocalSparkContext with Ti
 }
   }
 
+  test("SPARK-40082: recomputation of shuffle map stage with no pending 
partitions" +
+"should not hang") {
+
+initPushBasedShuffleConfs(conf)
+conf.set(config.SHUFFLE_MERGER_LOCATIONS_MIN_STATIC_THRESHOLD, 3)
+DAGSchedulerSuite.clearMergerLocs()
+DAGSchedulerSuite.addMergerLocs(Seq("host1", "host2", "host3"))
+
+val rddA = new MyRDD(sc, 2, Nil)
+val shuffleDepA = new ShuffleDependency(rddA, new HashPartitioner(2))
+val shuffleIdA = shuffleDepA.shuffleId
+
+val rddB = new MyRDD(sc, 2, List(shuffleDepA), tracker = mapOutputTracker)
+val shuffleDepB = new ShuffleDependency(rddB, new HashPartitioner(2))
+
+val rddC = new MyRDD(sc, 2, List(shuffleDepB), tracker = mapOutputTracker)
+
+submit(rddC, Array(0, 1))
+
+completeShuffleMapStageSuccessfully(0, 0, 2, Seq("hostA", "hostA"))

Review Comment:
   nit: For consistency, let us make it `hostA` and `hostB` ... and have 
`hostA` (say) fail.



##
core/src/test/scala/org/apache/spark/scheduler/DAGSchedulerSuite.scala:
##
@@ -4595,6 +4595,184 @@ class DAGSchedulerSuite extends SparkFunSuite with 
TempLocalSparkContext with Ti
 }
   }
 
+  test("SPARK-40082: recomputation of shuffle map stage with no pending 
partitions" +
+"should not hang") {
+
+initPushBasedShuffleConfs(conf)
+conf.set(config.SHUFFLE_MERGER_LOCATIONS_MIN_STATIC_THRESHOLD, 3)
+DAGSchedulerSuite.clearMergerLocs()
+DAGSchedulerSuite.addMergerLocs(Seq("host1", "host2", "host3"))
+
+val rddA = new MyRDD(sc, 2, Nil)
+val shuffleDepA = new ShuffleDependency(rddA, new HashPartitioner(2))
+val shuffleIdA = shuffleDepA.shuffleId
+
+val rddB = new MyRDD(sc, 2, List(shuffleDepA), tracker = mapOutputTracker)
+val shuffleDepB = new ShuffleDependency(rddB, new HashPartitioner(2))
+
+val rddC = new MyRDD(sc, 2, List(shuffleDepB), tracker = mapOutputTracker)
+
+submit(rddC, Array(0, 1))
+
+completeShuffleMapStageSuccessfully(0, 0, 2, Seq("hostA", "hostA"))
+
+// Fetch failed
+runEvent(makeCompletionEvent(
+  taskSets(1).tasks(0),
+  FetchFailed(makeBlockManagerId("hostC"), shuffleIdA, 0L, 0, 0,

Review Comment:
   nit: Since map tasks did not run on `hostC`, let us change it to `hostA` as 
per comment above.



##
pom.xml:
##
@@ -114,7 +114,7 @@
 1.8
 ${java.version}
 ${java.version}
-3.8.7
+3.6.3

Review Comment:
   Revert this ?



##
core/src/test/scala/org/apache/spark/scheduler/DAGSchedulerSuite.scala:
##
@@ -4595,6 +4595,184 @@ class DAGSchedulerSuite extends SparkFunSuite with 
TempLocalSparkContext with Ti
 }
   }
 
+  test("SPARK-40082: recomputation of shuffle map stage with no pending 
partitions" +
+"should not hang") {
+
+initPushBasedShuffleConfs(conf)
+conf.set(config.SHUFFLE_MERGER_LOCATIONS_MIN_STATIC_THRESHOLD, 3)
+DAGSchedulerSuite.clearMergerLocs()
+DAGSchedulerSuite.addMergerLocs(Seq("host1", "host2", "host3"))
+
+val rddA = new MyRDD(sc, 2, Nil)
+val shuffleDepA = new ShuffleDependency(rddA, new HashPartitioner(2))
+val shuffleIdA = shuffleDepA.shuffleId
+
+val rddB = new MyRDD(sc, 2, List(shuffleDepA), tracker = mapOutputTracker)
+val shuffleDepB = new ShuffleDependency(rddB, new HashPartitioner(2))
+
+val rddC = new MyRDD(sc, 2, List(shuffleDepB), tracker = mapOutputTracker)
+
+submit(rddC, Array(0, 1))
+
+completeShuffleMapStageSuccessfully(0, 0, 2, Seq("hostA", "hostA"))
+
+// Fetch failed
+runEvent(makeCompletionEvent(
+  taskSets(1).tasks(0),
+  FetchFailed(makeBlockManagerId("hostC"), shuffleIdA, 0L, 0, 0,
+"Fetch failure of task: stageId=1, stageAttempt=0, partitionId=0"),
+  result = null))
+
+// long running task complete
+completeShuffleMapStageSuccessfully(1, 0, 2, Seq("hostA", "hostA"))

Review Comment:
   nit: let us change the host to `hostB` for successes



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] amaliujia commented on a diff in pull request #40466: [SPARK-42835][SQL][TESTS] Add test cases for `Column.explain`

2023-03-20 Thread via GitHub


amaliujia commented on code in PR #40466:
URL: https://github.com/apache/spark/pull/40466#discussion_r1142414407


##
sql/core/src/test/scala/org/apache/spark/sql/ColumnExpressionSuite.scala:
##
@@ -921,6 +922,132 @@ class ColumnExpressionSuite extends QueryTest with 
SharedSparkSession {
 }
   }
 
+  private def captureStdOut(block: => Unit): String = {
+val capturedOut = new ByteArrayOutputStream()
+Console.withOut(capturedOut)(block)
+capturedOut.toString()
+  }
+
+  test("explain") {

Review Comment:
   +1. If we still want to be aware when it changes, golden files are a better way.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org


