[GitHub] [spark] itholic commented on pull request #36509: [SPARK-38961][PYTHON][DOCS] Enhance to automatically generate the pandas API support list

2022-05-19 Thread GitBox


itholic commented on PR #36509:
URL: https://github.com/apache/spark/pull/36509#issuecomment-1131335722

   Thanks!





[GitHub] [spark] HyukjinKwon commented on pull request #36594: [SPARK-39223][PS] Implement skew and kurt in Rolling/RollingGroupby/Expanding/ExpandingGroupby

2022-05-19 Thread GitBox


HyukjinKwon commented on PR #36594:
URL: https://github.com/apache/spark/pull/36594#issuecomment-1131351089

   Merged to master.





[GitHub] [spark] HyukjinKwon closed pull request #36594: [SPARK-39223][PS] Implement skew and kurt in Rolling/RollingGroupby/Expanding/ExpandingGroupby

2022-05-19 Thread GitBox


HyukjinKwon closed pull request #36594: [SPARK-39223][PS] Implement skew and 
kurt in Rolling/RollingGroupby/Expanding/ExpandingGroupby
URL: https://github.com/apache/spark/pull/36594





[GitHub] [spark] Yikun commented on a diff in pull request #36569: [SPARK-39201][PYTHON][PS] Implement `ignore_index` of `DataFrame.explode` and `DataFrame.drop_duplicates`

2022-05-19 Thread GitBox


Yikun commented on code in PR #36569:
URL: https://github.com/apache/spark/pull/36569#discussion_r876716230


##
python/pyspark/pandas/frame.py:
##
@@ -12212,7 +12237,8 @@ def explode(self, column: Name) -> "DataFrame":
 data_fields[idx] = field.copy(dtype=dtype, spark_type=spark_type, nullable=True)
 
 internal = psdf._internal.with_new_sdf(sdf, data_fields=data_fields)
-return DataFrame(internal)
+result_df: DataFrame = DataFrame(internal)

Review Comment:
   Just curious: does this annotation help at all? I ask because I see many inconsistencies in the current PS codebase.
   
   ```python
   pdf = DataFrame(internal)
   pdf: DataFrame = DataFrame(internal)
   ```
   
   I thought annotations only help in public interfaces/parameters or schema inference, so the other ones should be cleaned up.
   
   cc @zero323 






[GitHub] [spark] MaxGekk closed pull request #36604: [SPARK-39229][SQL] Separate query contexts from error-classes.json

2022-05-19 Thread GitBox


MaxGekk closed pull request #36604: [SPARK-39229][SQL] Separate query contexts 
from error-classes.json
URL: https://github.com/apache/spark/pull/36604





[GitHub] [spark] MaxGekk commented on pull request #36604: [SPARK-39229][SQL] Separate query contexts from error-classes.json

2022-05-19 Thread GitBox


MaxGekk commented on PR #36604:
URL: https://github.com/apache/spark/pull/36604#issuecomment-1131372321

   @gengliangwang The changes cause conflicts in branch-3.3. Please open a separate PR to backport this (if it is needed).





[GitHub] [spark] mridulm commented on a diff in pull request #36512: [SPARK-39152][CORE] Deregistering disk persisted local RDD blocks in case of IO related errors

2022-05-19 Thread GitBox


mridulm commented on code in PR #36512:
URL: https://github.com/apache/spark/pull/36512#discussion_r876759477


##
core/src/main/scala/org/apache/spark/storage/BlockManager.scala:
##
@@ -933,10 +933,29 @@ private[spark] class BlockManager(
   })
   Some(new BlockResult(ci, DataReadMethod.Memory, info.size))
 } else if (level.useDisk && diskStore.contains(blockId)) {
-  try {
-val diskData = diskStore.getBytes(blockId)
-val iterToReturn: Iterator[Any] = {
-  if (level.deserialized) {
+  var retryCount = 0
+  val retryLimit = 3

Review Comment:
   That sounds reasonable to me.
   My only concern is that if this is some jitter/transient issue, we will end up recomputing stage(s)/blocks, though I have not checked this in our clusters. +CC @otterc: have you observed this behavior?
   
   +CC @Ngone51 as well for your thoughts.
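   
   For context, a minimal standalone sketch of the bounded-retry idea from the diff above (the `read` callback stands in for `diskStore.getBytes`; the names and the deregistration step are illustrative, not the PR's actual code):
   
   ```scala
   import java.io.IOException
   
   // Retry a disk read a bounded number of times before giving up, mirroring
   // the retryCount/retryLimit variables in the diff above.
   def readWithRetry[T](retryLimit: Int = 3)(read: () => T): T = {
     var retryCount = 0
     var result: Option[T] = None
     while (result.isEmpty) {
       try {
         result = Some(read()) // a transient jitter succeeds on a later attempt
       } catch {
         case e: IOException =>
           retryCount += 1
           if (retryCount >= retryLimit) {
             // Persistent IO error: rethrow so the block can be deregistered
             // and recomputed upstream instead of retrying forever.
             throw e
           }
       }
     }
     result.get
   }
   ```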
   
   






[GitHub] [spark] mridulm commented on pull request #36162: [SPARK-32170][CORE] Improve the speculation through the stage task metrics.

2022-05-19 Thread GitBox


mridulm commented on PR #36162:
URL: https://github.com/apache/spark/pull/36162#issuecomment-1131396197

   Thanks for the changes @weixiuli.
   I will try to take a look early next week - a bit swamped by some other work 
unfortunately.





[GitHub] [spark] MaxGekk closed pull request #36561: [SPARK-37939][SQL] Use error classes in the parsing errors of properties

2022-05-19 Thread GitBox


MaxGekk closed pull request #36561: [SPARK-37939][SQL] Use error classes in the 
parsing errors of properties
URL: https://github.com/apache/spark/pull/36561





[GitHub] [spark] MaxGekk commented on pull request #36561: [SPARK-37939][SQL] Use error classes in the parsing errors of properties

2022-05-19 Thread GitBox


MaxGekk commented on PR #36561:
URL: https://github.com/apache/spark/pull/36561#issuecomment-1131399297

   @panbingkun Could you backport this to branch-3.3, please?





[GitHub] [spark] gengliangwang commented on pull request #36600: [SPARK-39212][SQL][3.3] Use double quotes for values of SQL configs/DS options in error messages

2022-05-19 Thread GitBox


gengliangwang commented on PR #36600:
URL: https://github.com/apache/spark/pull/36600#issuecomment-1131420222

   Thanks, merging to 3.3





[GitHub] [spark] gengliangwang closed pull request #36600: [SPARK-39212][SQL][3.3] Use double quotes for values of SQL configs/DS options in error messages

2022-05-19 Thread GitBox


gengliangwang closed pull request #36600: [SPARK-39212][SQL][3.3] Use double 
quotes for values of SQL configs/DS options in error messages
URL: https://github.com/apache/spark/pull/36600





[GitHub] [spark] beliefer commented on a diff in pull request #36330: [SPARK-38897][SQL] DS V2 supports push down string functions

2022-05-19 Thread GitBox


beliefer commented on code in PR #36330:
URL: https://github.com/apache/spark/pull/36330#discussion_r876791321


##
sql/catalyst/src/main/java/org/apache/spark/sql/connector/util/V2ExpressionSQLBuilder.java:
##
@@ -228,4 +244,18 @@ protected String visitSQLFunction(String funcName, String[] inputs) {
   protected String visitUnexpectedExpr(Expression expr) throws IllegalArgumentException {
 throw new IllegalArgumentException("Unexpected V2 expression: " + expr);
   }
+
+  protected String visitOverlay(String[] inputs) {
+throw new UnsupportedOperationException("Function: OVERLAY does not support ");
+  }
+
+  protected String visitTrim(String direction, String[] inputs) {
+if (inputs.length == 1) {
+  return "TRIM(" + direction + " FROM " + inputs[0] + ")";
+} else if (inputs.length == 2) {
+  return "TRIM(" + direction + " " + inputs[1] + " FROM " + inputs[0] + ")";
+} else {
+  throw new IllegalStateException("Unexpected V2 function: TRIM");
+}

Review Comment:
   ```
if (inputs.length == 1) {
  return "TRIM(" + direction + " FROM " + inputs[0] + ")";
} else {
  return "TRIM(" + direction + " " + inputs[1] + " FROM " + inputs[0] + ")";
}
   ```
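   
   For reference, a standalone re-statement of the suggested logic, showing the SQL it generates (hypothetical inputs; this is not the class under review):
   
   ```scala
   // Same two-branch logic as the suggestion above, extracted so it can run
   // standalone; inputs(0) is the string, inputs(1) the optional trim character.
   def visitTrim(direction: String, inputs: Array[String]): String =
     if (inputs.length == 1) s"TRIM($direction FROM ${inputs(0)})"
     else s"TRIM($direction ${inputs(1)} FROM ${inputs(0)})"
   
   assert(visitTrim("BOTH", Array("name")) == "TRIM(BOTH FROM name)")
   assert(visitTrim("LEADING", Array("name", "' '")) == "TRIM(LEADING ' ' FROM name)")
   ```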



##
sql/core/src/main/scala/org/apache/spark/sql/jdbc/JdbcDialects.scala:
##
@@ -251,6 +251,30 @@ abstract class JdbcDialect extends Serializable with Logging{
   s"${this.getClass.getSimpleName} does not support function: $funcName")
   }
 }
+
+override def visitOverlay(inputs: Array[String]): String = {
+  if (isSupportedFunction("OVERLAY")) {
+if (inputs.length == 3) {
+  "OVERLAY(" + inputs(0) + " PLACING " + inputs(1) + " FROM " + inputs(2) + ")";
+} else if (inputs.length == 4) {
+  "OVERLAY(" + inputs(0) + " PLACING " + inputs(1) + " FROM " + inputs(2) +
+" FOR " + inputs(3) + ")";
+} else {
+  throw new IllegalStateException("Unexpected V2 function: OVERLAY");
+}

Review Comment:
   ```
if (inputs.length == 3) {
  "OVERLAY(" + inputs(0) + " PLACING " + inputs(1) + " FROM " + inputs(2) + ")";
} else {
  "OVERLAY(" + inputs(0) + " PLACING " + inputs(1) + " FROM " + inputs(2) +
    " FOR " + inputs(3) + ")";
}
   ```



##
sql/catalyst/src/main/java/org/apache/spark/sql/connector/util/V2ExpressionSQLBuilder.java:
##
@@ -228,4 +244,18 @@ protected String visitSQLFunction(String funcName, String[] inputs) {
   protected String visitUnexpectedExpr(Expression expr) throws IllegalArgumentException {
 throw new IllegalArgumentException("Unexpected V2 expression: " + expr);
   }
+
+  protected String visitOverlay(String[] inputs) {
+throw new UnsupportedOperationException("Function: OVERLAY does not support ");

Review Comment:
   ```suggestion
   throw new UnsupportedOperationException(s"${this.getClass.getSimpleName} does not support function: OVERLAY");
   ```






[GitHub] [spark] gengliangwang opened a new pull request, #36607: [SPARK-39229][SQL][3.3] Separate query contexts from error-classes.json

2022-05-19 Thread GitBox


gengliangwang opened a new pull request, #36607:
URL: https://github.com/apache/spark/pull/36607

   
   
   ### What changes were proposed in this pull request?
   
   Separate query contexts for runtime errors from error-classes.json.
   
   ### Why are the changes needed?
   
   The message in the JSON should only contain the parameters that are explicitly thrown. It is more elegant to separate query contexts from error-classes.json.
   
   ### Does this PR introduce _any_ user-facing change?
   
   No
   
   ### How was this patch tested?
   
   Existing UT





[GitHub] [spark] gengliangwang commented on pull request #36604: [SPARK-39229][SQL] Separate query contexts from error-classes.json

2022-05-19 Thread GitBox


gengliangwang commented on PR #36604:
URL: https://github.com/apache/spark/pull/36604#issuecomment-1131435266

   @MaxGekk I have opened a backport in 
https://github.com/apache/spark/pull/36607





[GitHub] [spark] beliefer opened a new pull request, #36608: [SPARK-39230][SQL] Support ANSI Aggregate Function: regr_slope

2022-05-19 Thread GitBox


beliefer opened a new pull request, #36608:
URL: https://github.com/apache/spark/pull/36608

   ### What changes were proposed in this pull request?
   `REGR_SLOPE` is an ANSI aggregate function.
   
   **Syntax**: REGR_SLOPE(y, x)
   **Arguments**: 
   - **y**: The dependent variable. This must be an expression that can be evaluated to a numeric type.
   - **x**: The independent variable. This must be an expression that can be evaluated to a numeric type.
   
   **Examples**:
   `select k, regr_slope(v, v2) from aggr group by k;`
   
   | k | regr_slope(v, v2) |
   | --- | --- |
   | 1 | [NULL] |
   | 2 | 0.831408776 |
   
   The algorithm follows https://en.wikipedia.org/wiki/Algorithms_for_calculating_variance
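   
   For intuition, the slope can also be computed with aggregates Spark already has, since REGR_SLOPE(y, x) = COVAR_POP(y, x) / VAR_POP(x); a spark-shell sketch using the table and columns from the example above:
   
   ```scala
   // Equivalent formulation with existing aggregates; `aggr`, `k`, `v`, `v2`
   // are the names from the example query above.
   spark.sql("""
     SELECT k, covar_pop(v, v2) / var_pop(v2) AS slope
     FROM aggr
     GROUP BY k
   """).show()
   // A group where v2 is constant has var_pop(v2) = 0, so the slope is NULL,
   // matching the [NULL] row in the table above.
   ```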
   
   Mainstream databases that support `regr_slope` are shown below:
   **Teradata**
   https://docs.teradata.com/r/kmuOwjp1zEYg98JsB8fu_A/I0~kqsq3f3uNmjUaZr8hDg
   **Snowflake**
   https://docs.snowflake.com/en/sql-reference/functions/regr_slope.html
   **Oracle**
   
https://docs.oracle.com/en/database/oracle/oracle-database/19/sqlrf/REGR_-Linear-Regression-Functions.html#GUID-A675B68F-2A88-4843-BE2C-FCDE9C65F9A9
   **DB2**
   
https://www.ibm.com/docs/en/db2/11.5?topic=af-regression-functions-regr-avgx-regr-avgy-regr-count
   **H2**
   http://www.h2database.com/html/functions-aggregate.html#regr_slope
   **Postgresql**
   https://www.postgresql.org/docs/8.4/functions-aggregate.html
   **Sybase**
   
https://infocenter.sybase.com/help/topic/com.sybase.help.sqlanywhere.12.0.0/dbreference/regr-slope-function.html
   **Presto**
   https://prestodb.io/docs/current/functions/aggregate.html
   **Exasol**
   
https://docs.exasol.com/sql_references/functions/alphabeticallistfunctions/regr_function.htm
   
   ### Why are the changes needed?
   `REGR_SLOPE` is very useful.
   
   
   ### Does this PR introduce _any_ user-facing change?
   'Yes'. New feature.
   
   
   ### How was this patch tested?
   New tests.
   





[GitHub] [spark] beliefer commented on pull request #36593: [SPARK-39139][SQL] DS V2 push-down framework supports DS V2 UDF

2022-05-19 Thread GitBox


beliefer commented on PR #36593:
URL: https://github.com/apache/spark/pull/36593#issuecomment-1131436131

   ping @huaxingao cc @cloud-fan 





[GitHub] [spark] andersonm-ibm commented on pull request #36523: [SPARK-37956][DOCS] Add Python and Java examples of Parquet encryption in Spark SQL to documentation

2022-05-19 Thread GitBox


andersonm-ibm commented on PR #36523:
URL: https://github.com/apache/spark/pull/36523#issuecomment-1131484387

   @HyukjinKwon  Thank you for your review. Your comment was addressed. Could 
you please have another look?





[GitHub] [spark] HyukjinKwon commented on pull request #36464: [SPARK-38947][PYTHON][PS] Supports groupby positional indexing

2022-05-19 Thread GitBox


HyukjinKwon commented on PR #36464:
URL: https://github.com/apache/spark/pull/36464#issuecomment-1131494802

   Merged to master.





[GitHub] [spark] HyukjinKwon closed pull request #36464: [SPARK-38947][PYTHON][PS] Supports groupby positional indexing

2022-05-19 Thread GitBox


HyukjinKwon closed pull request #36464: [SPARK-38947][PYTHON][PS] Supports 
groupby positional indexing
URL: https://github.com/apache/spark/pull/36464





[GitHub] [spark] eejbyfeldt commented on pull request #36004: [SPARK-38681][SQL] Support nested generic case classes

2022-05-19 Thread GitBox


eejbyfeldt commented on PR #36004:
URL: https://github.com/apache/spark/pull/36004#issuecomment-1131496353

   While testing the Spark 3.3.0 release candidate I noticed that this is actually a regression from 3.2 that was introduced in https://github.com/apache/spark/pull/33205, so this should also be fixed in the 3.3 branch.





[GitHub] [spark] HyukjinKwon commented on pull request #36569: [SPARK-39201][PYTHON][PS] Implement `ignore_index` of `DataFrame.explode` and `DataFrame.drop_duplicates`

2022-05-19 Thread GitBox


HyukjinKwon commented on PR #36569:
URL: https://github.com/apache/spark/pull/36569#issuecomment-1131510364

   Merged to master.





[GitHub] [spark] HyukjinKwon closed pull request #36569: [SPARK-39201][PYTHON][PS] Implement `ignore_index` of `DataFrame.explode` and `DataFrame.drop_duplicates`

2022-05-19 Thread GitBox


HyukjinKwon closed pull request #36569: [SPARK-39201][PYTHON][PS] Implement 
`ignore_index` of `DataFrame.explode` and `DataFrame.drop_duplicates`
URL: https://github.com/apache/spark/pull/36569





[GitHub] [spark] gengliangwang opened a new pull request, #36609: [SPARK-39233][SQL] Remove the check for TimestampNTZ output in Analyzer

2022-05-19 Thread GitBox


gengliangwang opened a new pull request, #36609:
URL: https://github.com/apache/spark/pull/36609

   
   
   ### What changes were proposed in this pull request?
   
   In [#36094](https://github.com/apache/spark/pull/36094), a check that fails analysis on TimestampNTZ output was added.
   
   However, the check can cause misleading error messages.
   
   In 3.3:
   ```
   > sql( "select date '2018-11-17' > 1").show()
   org.apache.spark.sql.AnalysisException: Invalid call to toAttribute on 
unresolved object;
   'Project [unresolvedalias((2018-11-17 > 1), None)]
   +- OneRowRelation   

 at 
org.apache.spark.sql.catalyst.analysis.UnresolvedAlias.toAttribute(unresolved.scala:510)
 at 
org.apache.spark.sql.catalyst.plans.logical.Project.$anonfun$output$1(basicLogicalOperators.scala:70)
 
   ```
   In master or 3.2
   ```
   > sql( "select date '2018-11-17' > 1").show()
   org.apache.spark.sql.AnalysisException: cannot resolve '(DATE '2018-11-17' > 
1)' due to data type mismatch: differing types in '(DATE '2018-11-17' > 1)' 
(date and int).; line 1 pos 7;
   'Project [unresolvedalias((2018-11-17 > 1), None)]
   +- OneRowRelation
   
 at 
org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)
 
   ```
   
   We should just remove the check to avoid such a regression. It's not necessary for disabling TimestampNTZ anyway.
   
   ### Why are the changes needed?
   
   Fix regression in the error output of analysis check.
   
   ### Does this PR introduce _any_ user-facing change?
   
   No, it is not released yet.
   
   ### How was this patch tested?
   
   Build and try on `spark-shell`





[GitHub] [spark] yaooqinn commented on pull request #36609: [SPARK-39233][SQL] Remove the check for TimestampNTZ output in Analyzer

2022-05-19 Thread GitBox


yaooqinn commented on PR #36609:
URL: https://github.com/apache/spark/pull/36609#issuecomment-1131517228

   LGTM





[GitHub] [spark] cloud-fan commented on a diff in pull request #36330: [SPARK-38897][SQL] DS V2 supports push down string functions

2022-05-19 Thread GitBox


cloud-fan commented on code in PR #36330:
URL: https://github.com/apache/spark/pull/36330#discussion_r876959009


##
sql/catalyst/src/main/java/org/apache/spark/sql/connector/util/V2ExpressionSQLBuilder.java:
##
@@ -228,4 +244,18 @@ protected String visitSQLFunction(String funcName, String[] inputs) {
   protected String visitUnexpectedExpr(Expression expr) throws IllegalArgumentException {
 throw new IllegalArgumentException("Unexpected V2 expression: " + expr);
   }
+
+  protected String visitOverlay(String[] inputs) {
+throw new UnsupportedOperationException("Function: OVERLAY does not support ");

Review Comment:
   why do we fail here?






[GitHub] [spark] gengliangwang commented on pull request #36607: [SPARK-39229][SQL][3.3] Separate query contexts from error-classes.json

2022-05-19 Thread GitBox


gengliangwang commented on PR #36607:
URL: https://github.com/apache/spark/pull/36607#issuecomment-1131600384

   Merging to 3.3





[GitHub] [spark] cloud-fan commented on pull request #36377: [SPARK-39043][SQL] Spark SQL Hive client should not gather statistic by default.

2022-05-19 Thread GitBox


cloud-fan commented on PR #36377:
URL: https://github.com/apache/spark/pull/36377#issuecomment-1131603731

   It seems like this also impacts Spark. First of all, Spark will read Hive table stats and turn them into Spark's own stats, see
   
https://github.com/apache/spark/blob/master/sql/hive/src/main/scala/org/apache/spark/sql/hive/client/HiveClientImpl.scala#L539
   
   Secondly, Spark also reads Hive partition stats, see
   
https://github.com/apache/spark/blob/master/sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/PruneHiveTablePartitions.scala#L68
   
   I'm reverting it. @AngersZhuuuu can you think of a different fix?
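   
   For reference, two standard commands that show whether the Hive-derived stats made it into Spark (the table name is illustrative):
   
   ```scala
   // Statistics Spark picked up from the Hive metastore, if any.
   spark.sql("DESCRIBE TABLE EXTENDED mydb.mytable").show(truncate = false)
   
   // EXPLAIN COST prints the optimizer's size/row-count estimates, which are
   // derived from those stats when they are available.
   spark.sql("EXPLAIN COST SELECT * FROM mydb.mytable").show(truncate = false)
   ```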





[GitHub] [spark] gengliangwang closed pull request #36607: [SPARK-39229][SQL][3.3] Separate query contexts from error-classes.json

2022-05-19 Thread GitBox


gengliangwang closed pull request #36607: [SPARK-39229][SQL][3.3] Separate 
query contexts from error-classes.json
URL: https://github.com/apache/spark/pull/36607





[GitHub] [spark] AngersZhuuuu commented on pull request #36377: [SPARK-39043][SQL] Spark SQL Hive client should not gather statistic by default.

2022-05-19 Thread GitBox


AngersZhuuuu commented on PR #36377:
URL: https://github.com/apache/spark/pull/36377#issuecomment-1131608162

   
   OK





[GitHub] [spark] AmplabJenkins commented on pull request #36606: [SPARK-39232][CORE] History Server Main Page App List Filtering

2022-05-19 Thread GitBox


AmplabJenkins commented on PR #36606:
URL: https://github.com/apache/spark/pull/36606#issuecomment-1131614614

   Can one of the admins verify this patch?





[GitHub] [spark] AmplabJenkins commented on pull request #36603: [SPARK-39163][SQL] Throw an exception w/ error class for an invalid bucket file

2022-05-19 Thread GitBox


AmplabJenkins commented on PR #36603:
URL: https://github.com/apache/spark/pull/36603#issuecomment-1131614649

   Can one of the admins verify this patch?





[GitHub] [spark] AmplabJenkins commented on pull request #36602: [MINOR][SQL] JDBCTableCatalog: "initialized" typo

2022-05-19 Thread GitBox


AmplabJenkins commented on PR #36602:
URL: https://github.com/apache/spark/pull/36602#issuecomment-1131614708

   Can one of the admins verify this patch?





[GitHub] [spark] AmplabJenkins commented on pull request #36601: [SPARK-38987][SHUFFLE] Throw FetchFailedException when merged shuffle blocks are corrupted and spark.shuffle.detectCorrupt is set to tr

2022-05-19 Thread GitBox


AmplabJenkins commented on PR #36601:
URL: https://github.com/apache/spark/pull/36601#issuecomment-1131614769

   Can one of the admins verify this patch?





[GitHub] [spark] LuciferYang opened a new pull request, #36610: [SPARK-39204][CORE][SQL] Replace `Utils.createTempDir` with `JavaUtils.createTempDir`

2022-05-19 Thread GitBox


LuciferYang opened a new pull request, #36610:
URL: https://github.com/apache/spark/pull/36610

   ### What changes were proposed in this pull request?
   The main change of this PR is to replace all uses of `Utils.createTempDir` with `JavaUtils.createTempDir`; the replacement rules are as follows:
   
   - `Utils.createTempDir()` -> `JavaUtils.createTempDir()`
   - `Utils.createTempDir(rootDir)` and `Utils.createTempDir(root = rootDir)` 
-> `JavaUtils.createTempDirWithRoot(rootDir)`
   - `Utils.createTempDir(namePrefix = prefix)` -> 
`JavaUtils.createTempDirWithPrefix(prefix)`
   - `Utils.createTempDir(rootDir, prefix)` -> 
`JavaUtils.createTempDir(rootDir, prefix)`
   
   Another change is to delete the `Utils.createTempDir()` method, keeping only one `createTempDir()` method. A before/after sketch is shown below.
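   
   ```scala
   // Before/after sketch of the proposed migration. The JavaUtils method names
   // are the ones listed in this PR description (proposed, not necessarily
   // existing API yet); paths and prefixes are illustrative.
   import org.apache.spark.network.util.JavaUtils
   
   val dir1 = JavaUtils.createTempDir()                    // was Utils.createTempDir()
   val dir2 = JavaUtils.createTempDirWithRoot("/tmp/root") // was Utils.createTempDir(root = "/tmp/root")
   val dir3 = JavaUtils.createTempDirWithPrefix("spark")   // was Utils.createTempDir(namePrefix = "spark")
   ```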
   
   ### Why are the changes needed?
   Keep only one `createTempDir()` method in Spark.
   
   
   ### Does this PR introduce _any_ user-facing change?
   No
   
   
   ### How was this patch tested?
   Pass GitHub Action.





[GitHub] [spark] LuciferYang closed pull request #36610: [SPARK-39204][CORE][SQL] Replace `Utils.createTempDir` with `JavaUtils.createTempDir`

2022-05-19 Thread GitBox


LuciferYang closed pull request #36610: [SPARK-39204][CORE][SQL] Replace 
`Utils.createTempDir` with `JavaUtils.createTempDir`
URL: https://github.com/apache/spark/pull/36610





[GitHub] [spark] LuciferYang opened a new pull request, #36611: [SPARK-39204][CORE][SQL] Replace `Utils.createTempDir` with `JavaUtils.createTempDir`

2022-05-19 Thread GitBox


LuciferYang opened a new pull request, #36611:
URL: https://github.com/apache/spark/pull/36611

   ### What changes were proposed in this pull request?
   The main change of this PR is to replace all uses of `Utils.createTempDir` with `JavaUtils.createTempDir`; the replacement rules are as follows:
   
   - `Utils.createTempDir()` -> `JavaUtils.createTempDir()`
   - `Utils.createTempDir(rootDir)` and `Utils.createTempDir(root = rootDir)` 
-> `JavaUtils.createTempDirWithRoot(rootDir)`
   - `Utils.createTempDir(namePrefix = prefix)` -> 
`JavaUtils.createTempDirWithPrefix(prefix)`
   - `Utils.createTempDir(rootDir, prefix)` -> 
`JavaUtils.createTempDir(rootDir, prefix)`
   
   Another change is to delete the `Utils.createTempDir()` method, keeping only one `createTempDir()` method.
   
   ### Why are the changes needed?
   Keep only one `createTempDir()` method in Spark.
   
   
   ### Does this PR introduce _any_ user-facing change?
   No
   
   
   ### How was this patch tested?
   Pass GitHub Action.





[GitHub] [spark] Yikun commented on a diff in pull request #36599: [SPARK-39228][PYTHON][PS] Implement `skipna` of `Series.argmax`

2022-05-19 Thread GitBox


Yikun commented on code in PR #36599:
URL: https://github.com/apache/spark/pull/36599#discussion_r876936498


##
python/pyspark/pandas/series.py:
##
@@ -6239,13 +6239,19 @@ def argsort(self) -> "Series":
 ps.concat([psser, self.loc[self.isnull()].spark.transform(lambda _: SF.lit(-1))]),
 )
 
-def argmax(self) -> int:
+def argmax(self, skipna: bool = True) -> int:

Review Comment:
   ```suggestion
   def argmax(self, axis: Axis = None, skipna: bool = True) -> int:
   ```
   
   We'd better keep the pandas behavior even if `axis` is a dummy (placeholder) parameter.
   
   
   [1] 
https://pandas.pydata.org/docs/reference/api/pandas.Series.argmax.html#pandas.Series.argmax






[GitHub] [spark] gengliangwang commented on pull request #36609: [SPARK-39233][SQL] Remove the check for TimestampNTZ output in Analyzer

2022-05-19 Thread GitBox


gengliangwang commented on PR #36609:
URL: https://github.com/apache/spark/pull/36609#issuecomment-1131629803

   Merging to 3.3





[GitHub] [spark] gengliangwang closed pull request #36609: [SPARK-39233][SQL] Remove the check for TimestampNTZ output in Analyzer

2022-05-19 Thread GitBox


gengliangwang closed pull request #36609: [SPARK-39233][SQL] Remove the check 
for TimestampNTZ output in Analyzer
URL: https://github.com/apache/spark/pull/36609





[GitHub] [spark] gengliangwang opened a new pull request, #36612: [SPARK-39234][SQL] Code clean up in SparkThrowableHelper.getMessage

2022-05-19 Thread GitBox


gengliangwang opened a new pull request, #36612:
URL: https://github.com/apache/spark/pull/36612

   
   
   ### What changes were proposed in this pull request?
   
   1. Remove the starting "\n" in `Origin.context`. The "\n" will be appended in the method `SparkThrowableHelper.getMessage` instead.
   2. Clean up the method `SparkThrowableHelper.getMessage` to eliminate redundant code.
   
   ### Why are the changes needed?
   
   
   Code clean up to eliminate redundant code.
   ### Does this PR introduce _any_ user-facing change?
   
   
   No
   ### How was this patch tested?
   
   Existing UT





[GitHub] [spark] gengliangwang commented on a diff in pull request #36612: [SPARK-39234][SQL] Code clean up in SparkThrowableHelper.getMessage

2022-05-19 Thread GitBox


gengliangwang commented on code in PR #36612:
URL: https://github.com/apache/spark/pull/36612#discussion_r877015458


##
core/src/main/scala/org/apache/spark/ErrorInfo.scala:
##
@@ -77,20 +77,25 @@ private[spark] object SparkThrowableHelper {
   queryContext: String = ""): String = {
 val errorInfo = errorClassToInfoMap.getOrElse(errorClass,
   throw new IllegalArgumentException(s"Cannot find error class '$errorClass'"))
-if (errorInfo.subClass.isDefined) {
+val (displayClass, displayMessageParameters, displayFormat) = if (errorInfo.subClass.isEmpty) {
+  (errorClass, messageParameters, errorInfo.messageFormat)
+} else {
   val subClass = errorInfo.subClass.get
   val subErrorClass = messageParameters.head
   val errorSubInfo = subClass.getOrElse(subErrorClass,
 throw new IllegalArgumentException(s"Cannot find sub error class '$subErrorClass'"))
-  val subMessageParameters = messageParameters.tail
-  "[" + errorClass + "." + subErrorClass + "] " + String.format((errorInfo.messageFormat +
-errorSubInfo.messageFormat).replaceAll("<[a-zA-Z0-9_-]+>", "%s"),
-subMessageParameters: _*) + queryContext
+  (errorClass + "." + subErrorClass, messageParameters.tail,
+errorInfo.messageFormat + errorSubInfo.messageFormat)
+}
+val displayMessage = String.format(
+  displayFormat.replaceAll("<[a-zA-Z0-9_-]+>", "%s"),

Review Comment:
   Here we avoid writing `.replaceAll("<[a-zA-Z0-9_-]+>", "%s")` twice 
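   
   A standalone illustration of the placeholder substitution this performs (the error class message text is invented for the example):
   
   ```scala
   // <param> placeholders become %s, then String.format fills them in order.
   val messageFormat = "Cannot parse <value> as <dataType>."
   val parameters = Seq("'abc'", "INT")
   
   val displayMessage = String.format(
     messageFormat.replaceAll("<[a-zA-Z0-9_-]+>", "%s"),
     parameters: _*)
   
   println(displayMessage) // Cannot parse 'abc' as INT.
   ```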






[GitHub] [spark] gengliangwang commented on a diff in pull request #36604: [SPARK-39229][SQL] Separate query contexts from error-classes.json

2022-05-19 Thread GitBox


gengliangwang commented on code in PR #36604:
URL: https://github.com/apache/spark/pull/36604#discussion_r877016968


##
core/src/main/scala/org/apache/spark/ErrorInfo.scala:
##
@@ -82,11 +85,11 @@ private[spark] object SparkThrowableHelper {
   val subMessageParameters = messageParameters.tail
   "[" + errorClass + "." + subErrorClass + "] " + String.format((errorInfo.messageFormat +
 errorSubInfo.messageFormat).replaceAll("<[a-zA-Z0-9_-]+>", "%s"),
-subMessageParameters: _*)
+subMessageParameters: _*) + queryContext

Review Comment:
   Opened https://github.com/apache/spark/pull/36612 for this.






[GitHub] [spark] LuciferYang commented on pull request #36611: [WIP][SPARK-39204][BUILD][CORE][SQL][DSTREAM][GRAPHX][K8S][ML][MLLIB][SS][YARN][EXAMPLES][SHELL] Replace `Utils.createTempDir` with `Java

2022-05-19 Thread GitBox


LuciferYang commented on PR #36611:
URL: https://github.com/apache/spark/pull/36611#issuecomment-1131660071

   cc @attilapiros As discussed in SPARK-39102 (https://github.com/apache/spark/pull/36529#discussion_r872838034), this PR replaces `Utils.createTempDir()` with `JavaUtils.createTempDir()`, keeping only the one in `JavaUtils`.
   





[GitHub] [spark] AmplabJenkins commented on pull request #36597: [SPARK-39225][CORE] Support `spark.history.fs.update.batchSize`

2022-05-19 Thread GitBox


AmplabJenkins commented on PR #36597:
URL: https://github.com/apache/spark/pull/36597#issuecomment-1131715634

   Can one of the admins verify this patch?





[GitHub] [spark] gengliangwang commented on pull request #36475: [SPARK-38869][SQL] Respect table capability ACCEPT_ANY_SCHEMA in DEFAULT column resolution

2022-05-19 Thread GitBox


gengliangwang commented on PR #36475:
URL: https://github.com/apache/spark/pull/36475#issuecomment-1131731396

   @dtenedor After a closer look, I think we can resolve this in a simpler way.
   I made a PR on your repo: https://github.com/dtenedor/spark/pull/4
   You can merge it on your repo if you think it is OK.





[GitHub] [spark] cloud-fan commented on a diff in pull request #36593: [SPARK-39139][SQL] DS V2 push-down framework supports DS V2 UDF

2022-05-19 Thread GitBox


cloud-fan commented on code in PR #36593:
URL: https://github.com/apache/spark/pull/36593#discussion_r877113355


##
sql/core/src/main/scala/org/apache/spark/sql/catalyst/util/V2ExpressionBuilder.scala:
##
@@ -201,6 +203,14 @@ class V2ExpressionBuilder(
 None
   }
 // TODO supports other expressions
+case ApplyFunctionExpression(function, children) =>
+  val childrenExpressions = children.flatMap(generateExpression(_))
+  if (childrenExpressions.length == children.length) {
+Some(new GeneralScalarExpression(function.name().toUpperCase(Locale.ROOT),

Review Comment:
   UDF is kind of special and I think it's better to have a new v2 API for it, 
instead of reusing `GeneralScalarExpression`.
   
   e.g., we can add a `ScalarUDFExpression`, which defines the function name, 
canonical name and inputs.
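   
   A hedged sketch of the proposed shape (the name `ScalarUDFExpression` comes from this comment; the fields are illustrative and this does not extend any existing Spark interface):
   
   ```scala
   // Sketch only: a data shape carrying the three pieces named above.
   case class ScalarUDFExpression(
       name: String,           // function name as registered, e.g. "strlen"
       canonicalName: String,  // stable catalog-qualified name, e.g. "h2.strlen"
       inputs: Seq[String]) {  // rendered SQL for each child expression
     def toSQL: String = s"$name(${inputs.mkString(", ")})"
   }
   ```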






[GitHub] [spark] cloud-fan commented on a diff in pull request #36593: [SPARK-39139][SQL] DS V2 push-down framework supports DS V2 UDF

2022-05-19 Thread GitBox


cloud-fan commented on code in PR #36593:
URL: https://github.com/apache/spark/pull/36593#discussion_r877115137


##
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSourceStrategy.scala:
##
@@ -744,6 +744,14 @@ object DataSourceStrategy
 PushableColumnWithoutNestedColumn(right), _) =>
   Some(new GeneralAggregateFunc("CORR", agg.isDistinct,
 Array(FieldReference.column(left), FieldReference.column(right
+case aggregate.V2Aggregator(aggrFunc, children, _, _) =>
+  val translatedExprs = children.flatMap(PushableExpression.unapply(_))
+  if (translatedExprs.length == children.length) {
+Some(new GeneralAggregateFunc(aggrFunc.name().toUpperCase(Locale.ROOT), agg.isDistinct,

Review Comment:
   ditto, let's create a dedicated v2 api, such as `UserDefinedAggregateFunction`








[GitHub] [spark] cloud-fan commented on a diff in pull request #36593: [SPARK-39139][SQL] DS V2 push-down framework supports DS V2 UDF

2022-05-19 Thread GitBox


cloud-fan commented on code in PR #36593:
URL: https://github.com/apache/spark/pull/36593#discussion_r877113355


##
sql/core/src/main/scala/org/apache/spark/sql/catalyst/util/V2ExpressionBuilder.scala:
##
@@ -201,6 +203,14 @@ class V2ExpressionBuilder(
 None
   }
 // TODO supports other expressions
+case ApplyFunctionExpression(function, children) =>
+  val childrenExpressions = children.flatMap(generateExpression(_))
+  if (childrenExpressions.length == children.length) {
+Some(new GeneralScalarExpression(function.name().toUpperCase(Locale.ROOT),

Review Comment:
   UDF is kind of special and I think it's better to have a new v2 API for it, 
instead of reusing `GeneralScalarExpression`.
   
   e.g., we can add a `UserDefinedScalaExpression`, which defines the function 
name, canonical name and inputs.






[GitHub] [spark] cloud-fan commented on a diff in pull request #36593: [SPARK-39139][SQL] DS V2 push-down framework supports DS V2 UDF

2022-05-19 Thread GitBox


cloud-fan commented on code in PR #36593:
URL: https://github.com/apache/spark/pull/36593#discussion_r877113355


##
sql/core/src/main/scala/org/apache/spark/sql/catalyst/util/V2ExpressionBuilder.scala:
##
@@ -201,6 +203,14 @@ class V2ExpressionBuilder(
 None
   }
 // TODO supports other expressions
+case ApplyFunctionExpression(function, children) =>
+  val childrenExpressions = children.flatMap(generateExpression(_))
+  if (childrenExpressions.length == children.length) {
+Some(new GeneralScalarExpression(function.name().toUpperCase(Locale.ROOT),

Review Comment:
   UDF is kind of special and I think it's better to have a new v2 API for it, 
instead of reusing `GeneralScalarExpression`.
   
   e.g., we can add a `UserDefinedScalarFunction`, which defines the function 
name, canonical name and inputs.






[GitHub] [spark] cloud-fan commented on a diff in pull request #36593: [SPARK-39139][SQL] DS V2 push-down framework supports DS V2 UDF

2022-05-19 Thread GitBox


cloud-fan commented on code in PR #36593:
URL: https://github.com/apache/spark/pull/36593#discussion_r877123725


##
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/jdbc/JDBCCatalog.scala:
##
@@ -32,11 +35,14 @@ import org.apache.spark.sql.jdbc.{JdbcDialect, JdbcDialects}
 import org.apache.spark.sql.types.StructType
 import org.apache.spark.sql.util.CaseInsensitiveStringMap
 
-class JDBCTableCatalog extends TableCatalog with SupportsNamespaces with Logging {
+class JDBCCatalog extends TableCatalog with SupportsNamespaces with FunctionCatalog with Logging {
   private var catalogName: String = null
   private var options: JDBCOptions = _
   private var dialect: JdbcDialect = _
 
+  private val functions: util.Map[Identifier, UnboundFunction] =
+new ConcurrentHashMap[Identifier, UnboundFunction]()

Review Comment:
   We should clearly define how this can be used. I thought each JDBC dialect should have APIs to register its own UDFs.
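   For example, a dialect-level registration hook might look roughly like this (the trait and method below are hypothetical, just to make the idea concrete):
   
   ```scala
   import org.apache.spark.sql.connector.catalog.functions.UnboundFunction
   
   // Hypothetical: each JdbcDialect exposes the UDFs it supports for pushdown,
   // and JDBCCatalog consults the dialect instead of keeping its own map.
   trait JdbcDialectFunctions {
     // (function name, implementation) pairs this dialect understands.
     def functions: Seq[(String, UnboundFunction)] = Nil
   }
   ```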



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] cloud-fan commented on a diff in pull request #36295: [SPARK-38978][SQL] Support push down OFFSET to JDBC data source V2

2022-05-19 Thread GitBox


cloud-fan commented on code in PR #36295:
URL: https://github.com/apache/spark/pull/36295#discussion_r877141680


##
sql/core/src/test/scala/org/apache/spark/sql/jdbc/JDBCV2Suite.scala:
##
@@ -203,6 +204,245 @@ class JDBCV2Suite extends QueryTest with SharedSparkSession with ExplainSuiteHel
 checkAnswer(df5, Seq(Row(1.00, 1000.0, "amy")))
   }
 
+  private def checkOffsetRemoved(df: DataFrame, removed: Boolean = true): Unit = {
+val offsets = df.queryExecution.optimizedPlan.collect {
+  case offset: Offset => offset
+}
+if (removed) {
+  assert(offsets.isEmpty)
+} else {
+  assert(offsets.nonEmpty)
+}
+  }
+
+  test("simple scan with OFFSET") {
+val df1 = spark.read
+  .table("h2.test.employee")
+  .where($"dept" === 1)
+  .offset(1)
+checkOffsetRemoved(df1)
+checkPushedInfo(df1,
+  "PushedFilters: [DEPT IS NOT NULL, DEPT = 1], PushedOffset: OFFSET 1,")
+checkAnswer(df1, Seq(Row(1, "cathy", 9000.00, 1200.0, false)))
+
+val df2 = spark.read
+  .option("pushDownOffset", "false")
+  .table("h2.test.employee")
+  .where($"dept" === 1)
+  .offset(1)
+checkOffsetRemoved(df2, false)
+checkPushedInfo(df2,
+  "PushedFilters: [DEPT IS NOT NULL, DEPT = 1], ReadSchema:")
+checkAnswer(df2, Seq(Row(1, "cathy", 9000.00, 1200.0, false)))
+
+val df3 = spark.read
+  .option("partitionColumn", "dept")
+  .option("lowerBound", "0")
+  .option("upperBound", "2")
+  .option("numPartitions", "2")
+  .table("h2.test.employee")
+  .filter($"dept" > 1)
+  .offset(1)
+checkOffsetRemoved(df3, false)
+checkPushedInfo(df3, "PushedFilters: [DEPT IS NOT NULL, DEPT > 1], ReadSchema:")
+checkAnswer(df3, Seq(Row(2, "david", 1, 1300, true), Row(6, "jen", 12000, 1200, true)))
+
+val df4 = spark.read
+  .table("h2.test.employee")
+  .groupBy("DEPT").sum("SALARY")
+  .offset(1)
+checkOffsetRemoved(df4, false)
+checkPushedInfo(df4,
+  "PushedAggregates: [SUM(SALARY)], PushedFilters: [], 
PushedGroupByExpressions: [DEPT], ")
+checkAnswer(df4, Seq(Row(2, 22000.00), Row(6, 12000.00)))
+
+val name = udf { (x: String) => x.matches("cat|dav|amy") }
+val sub = udf { (x: String) => x.substring(0, 3) }
+val df5 = spark.read
+  .table("h2.test.employee")
+  .select($"SALARY", $"BONUS", sub($"NAME").as("shortName"))
+  .filter(name($"shortName"))
+  .offset(1)
+checkOffsetRemoved(df5, false)
+// OFFSET is pushed down only if all the filters are pushed down
+checkPushedInfo(df5, "PushedFilters: [], ")
+checkAnswer(df5, Seq(Row(1.00, 1300.0, "dav"), Row(9000.00, 1200.0, "cat")))
+  }
+
+  test("simple scan with LIMIT and OFFSET") {
+val df1 = spark.read
+  .table("h2.test.employee")
+  .where($"dept" === 1)
+  .limit(2)
+  .offset(1)
+checkLimitRemoved(df1)
+checkOffsetRemoved(df1)
+checkPushedInfo(df1,
+  "PushedFilters: [DEPT IS NOT NULL, DEPT = 1], PushedLimit: LIMIT 1, 
PushedOffset: OFFSET 1,")

Review Comment:
   This does not match https://github.com/apache/spark/pull/36295/files#diff-85c754089fc8e0db142a16714e92b127001bab9e6433684d1e3a15af04cb219aR26
   
   Assume that I have a local array data source. According to the API doc, 
Spark pushes down LIMIT first. For this query, I'll do `array.take(1).drop(1)`. 
This is wrong and doesn't match the query `df.limit(2).offset(1)`.
   
   We should either fix the API doc or fix the pushdown logic.
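   A tiny Scala illustration of the mismatch, with a plain collection standing in for the hypothetical array source:
   
   ```scala
   val rows = Seq("a", "b", "c")
   
   // Intended semantics of df.limit(2).offset(1): take 2 rows, then skip 1.
   val expected = rows.take(2).drop(1)  // Seq("b")
   
   // A source that applies the pushed-down LIMIT 1 first, then OFFSET 1.
   val pushed = rows.take(1).drop(1)    // Seq() -- the row is lost
   ```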



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] cloud-fan commented on a diff in pull request #35850: [SPARK-38529][SQL] Prevent GeneratorNestedColumnAliasing to be applied to non-Explode generators

2022-05-19 Thread GitBox


cloud-fan commented on code in PR #35850:
URL: https://github.com/apache/spark/pull/35850#discussion_r877169574


##
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/NestedColumnAliasing.scala:
##
@@ -321,6 +321,38 @@ object GeneratorNestedColumnAliasing {
 // need to prune nested columns through Project and under Generate. The 
difference is
 // when `nestedSchemaPruningEnabled` is on, nested columns will be pruned 
further at
 // file format readers if it is supported.
+
+// There are [[ExtractValue]] expressions on or not on the output of the generator. Generator
+// can also have different types:
+// 1. For [[ExtractValue]]s not on the output of the generator, theoretically speaking, there
+//are lots of expressions that we can push down, including non ExtractValues and GetArrayItem
+//and GetMapValue. But to be safe, we only handle GetStructField and GetArrayStructFields.
+// 2. For [[ExtractValue]]s on the output of the generator, the situation depends on the type
+//of the generator expression. *For now, we only support Explode*.
+//   2.1 Inline
+//   Inline takes an input of ARRAY<STRUCT<field1, field2>>, and returns an output of
+//   STRUCT<field1, field2>, the output field can be directly accessed by name "field1".
+//   In this case, we should not try to push down the ExtractValue expressions to the
+//   input of the Inline. For example:
+//   Project[field1.x AS x]
+//   - Generate[ARRAY<STRUCT<field1: STRUCT<x: int>, field2: int>>, ..., field1, field2]
+//   It is incorrect to push down the .x to the input of the Inline.
+//   A valid field pruning would be to extract all the fields that are accessed by the
+//   Project, and manually reconstruct an expression using those fields.
+//   2.2 Explode
+//   Explode takes an input of ARRAY<STRUCT<...>> and returns an output of
+//   STRUCT<col: STRUCT<...>>. The default field name "col" can be overwritten.
+//   If the input is MAP<...>, it returns STRUCT<key, value>.
+//   For the array case, it is only valid to push down GetStructField. After push down,
+//   the GetStructField becomes a GetArrayStructFields. Note that we cannot push down
+//   GetArrayStructFields, since the pushed down expression will operate on an array of
+//   array which is invalid.
+//   2.3 Stack
+//   Stack takes a sequence of expressions, and returns an output of
+//   STRUCT<col0: ..., col1: ...>.
+//   The push down is doable but more complicated in this case, as the expression that
+//   operates on the col_i of the output needs to be pushed down to every (kn+i)-th input
+//   expression, where n is the total number of columns (or struct fields) of the output.

Review Comment:
   Actually, I find it useful to understand why we only support Explode today.
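   For a concrete illustration of the Explode rewrite described in the comment (a sketch, assuming a DataFrame `df` with a column `arr: ARRAY<STRUCT<field1: INT, field2: INT>>` and `spark.implicits._` in scope):
   
   ```scala
   import org.apache.spark.sql.functions.explode
   
   // Before pruning: explode the array, then read one field of each struct.
   val before = df.select(explode($"arr").as("c")).select($"c.field1")
   
   // After pruning (conceptually): the GetStructField is pushed below the
   // generator as a GetArrayStructFields on the input, so only field1 is read.
   val after = df.select(explode($"arr.field1"))
   ```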



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] cloud-fan commented on pull request #35850: [SPARK-38529][SQL] Prevent GeneratorNestedColumnAliasing to be applied to non-Explode generators

2022-05-19 Thread GitBox


cloud-fan commented on PR #35850:
URL: https://github.com/apache/spark/pull/35850#issuecomment-1131827928

   thanks, merging to master/3.3!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] cloud-fan closed pull request #35850: [SPARK-38529][SQL] Prevent GeneratorNestedColumnAliasing to be applied to non-Explode generators

2022-05-19 Thread GitBox


cloud-fan closed pull request #35850: [SPARK-38529][SQL] Prevent 
GeneratorNestedColumnAliasing to be applied to non-Explode generators
URL: https://github.com/apache/spark/pull/35850


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] dongjoon-hyun commented on pull request #36377: [SPARK-39043][SQL] Spark SQL Hive client should not gather statistic by default.

2022-05-19 Thread GitBox


dongjoon-hyun commented on PR #36377:
URL: https://github.com/apache/spark/pull/36377#issuecomment-1131851884

   Thank you for the reverting decision, @cloud-fan and @AngersZh .


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] tanvn commented on pull request #27590: [SPARK-30703][SQL][DOCS][FollowUp] Declare the ANSI SQL compliance options as experimental

2022-05-19 Thread GitBox


tanvn commented on PR #27590:
URL: https://github.com/apache/spark/pull/27590#issuecomment-1131855914

   @gengliangwang @dongjoon-hyun 
   Hi, I have a question.
   In Spark 3.2.1, are `spark.sql.ansi.enabled` and `spark.sql.storeAssignmentPolicy` still considered experimental options?
   I understand that there is an epic named `ANSI enhancements in Spark 3.3` (https://issues.apache.org/jira/browse/SPARK-38860), which means there will be new features for `spark.sql.ansi.enabled`.
   But as in https://spark.apache.org/releases/spark-release-3-2-0.html, ANSI 
SQL mode GA ([SPARK-35030](https://issues.apache.org/jira/browse/SPARK-35030)) 
is mentioned in the Highlights section, so I think it is not `experimental` 
anymore. Could you please give me your opinion?
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] mridulm commented on a diff in pull request #36601: [SPARK-38987][SHUFFLE] Throw FetchFailedException when merged shuffle blocks are corrupted and spark.shuffle.detectCorrupt is set t

2022-05-19 Thread GitBox


mridulm commented on code in PR #36601:
URL: https://github.com/apache/spark/pull/36601#discussion_r877251054


##
core/src/test/scala/org/apache/spark/scheduler/DAGSchedulerSuite.scala:
##
@@ -4342,6 +4342,56 @@ class DAGSchedulerSuite extends SparkFunSuite with TempLocalSparkContext with Ti
 assertDataStructuresEmpty()
   }
 
+  test("SPARK-38987: corrupted shuffle block FetchFailure should unregister 
merge result") {
+initPushBasedShuffleConfs(conf)
+DAGSchedulerSuite.clearMergerLocs()
+DAGSchedulerSuite.addMergerLocs(Seq("host1", "host2", "host3", "host4", 
"host5"))
+
+scheduler = new MyDAGScheduler(
+  sc,
+  taskScheduler,
+  sc.listenerBus,
+  mapOutputTracker,
+  blockManagerMaster,
+  sc.env,
+  shuffleMergeFinalize = false,
+  shuffleMergeRegister = false)
+dagEventProcessLoopTester = new 
DAGSchedulerEventProcessLoopTester(scheduler)
+
+val parts = 2
+val shuffleMapRdd = new MyRDD(sc, parts, Nil)
+val shuffleDep = new ShuffleDependency(shuffleMapRdd, new 
HashPartitioner(parts))
+val reduceRdd = new MyRDD(sc, parts, List(shuffleDep), tracker = 
mapOutputTracker)
+
+// Submit a reduce job that depends which will create a map stage
+submit(reduceRdd, (0 until parts).toArray)
+
+// Pass in custom bitmap so that the mergeStatus can store
+// the correct mapIndex.
+val bitmap = new RoaringBitmap()
+bitmap.add(-1)
+
+val shuffleMapStage = 
scheduler.stageIdToStage(0).asInstanceOf[ShuffleMapStage]
+scheduler.handleRegisterMergeStatuses(shuffleMapStage,
+  Seq((0, MergeStatus(makeBlockManagerId("hostA"), 
shuffleDep.shuffleMergeId, bitmap, 1000L
+scheduler.handleShuffleMergeFinalized(shuffleMapStage,
+  shuffleMapStage.shuffleDep.shuffleMergeId)
+scheduler.handleRegisterMergeStatuses(shuffleMapStage,
+  Seq((1, MergeStatus(makeBlockManagerId("hostA"), 
shuffleDep.shuffleMergeId, bitmap, 1000L
+

Review Comment:
   You are right, we need the bitmap to be valid, and not mocked.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] vli-databricks commented on pull request #36584: [SPARK-39213] Create ANY_VALUE aggregate function

2022-05-19 Thread GitBox


vli-databricks commented on PR #36584:
URL: https://github.com/apache/spark/pull/36584#issuecomment-1131915487

   @MaxGekk please review and help me merge this.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] MaxGekk commented on pull request #36612: [SPARK-39234][SQL] Code clean up in SparkThrowableHelper.getMessage

2022-05-19 Thread GitBox


MaxGekk commented on PR #36612:
URL: https://github.com/apache/spark/pull/36612#issuecomment-1131917484

   +1, LGTM. Merging to master.
   Thank you, @gengliangwang and @cloud-fan for review.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] MaxGekk closed pull request #36612: [SPARK-39234][SQL] Code clean up in SparkThrowableHelper.getMessage

2022-05-19 Thread GitBox


MaxGekk closed pull request #36612: [SPARK-39234][SQL] Code clean up in 
SparkThrowableHelper.getMessage
URL: https://github.com/apache/spark/pull/36612


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] MaxGekk commented on a diff in pull request #36580: [SPARK-39167][SQL] Throw an exception w/ an error class for multiple rows from a subquery used as an expression

2022-05-19 Thread GitBox


MaxGekk commented on code in PR #36580:
URL: https://github.com/apache/spark/pull/36580#discussion_r877269334


##
sql/catalyst/src/main/scala/org/apache/spark/sql/errors/QueryExecutionErrors.scala:
##
@@ -1971,4 +1971,10 @@ object QueryExecutionErrors extends QueryErrorsBase {
 s"add ${toSQLValue(amount, IntegerType)} $unit to " +
 s"${toSQLValue(DateTimeUtils.microsToInstant(micros), 
TimestampType)}"))
   }
+
+  def multipleRowSubqueryError(plan: String): Throwable = {
+new SparkIllegalStateException(

Review Comment:
   Please use SparkException or SparkRuntimeException. I removed the exception in https://github.com/apache/spark/pull/36550.
   
   BTW, we shouldn't throw `IllegalStateException` if the exception can be triggered by user code in regular cases.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] mridulm commented on a diff in pull request #36601: [SPARK-38987][SHUFFLE] Throw FetchFailedException when merged shuffle blocks are corrupted and spark.shuffle.detectCorrupt is set t

2022-05-19 Thread GitBox


mridulm commented on code in PR #36601:
URL: https://github.com/apache/spark/pull/36601#discussion_r877273610


##
core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala:
##
@@ -1885,6 +1885,14 @@ private[spark] class DAGScheduler(
   mapOutputTracker.
 unregisterMergeResult(shuffleId, reduceId, bmAddress, Option(mapIndex))
 }
+  } else {
+// Unregister the merge result of <shuffleId, reduceId> if
+// the FetchFailed event contains a mapIndex of -1, and is not a
+// MetaDataFetchException which is signified by bmAddress being null
+if (bmAddress != null && pushBasedShuffleEnabled) {

Review Comment:
   Why do we need the `pushBasedShuffleEnabled` check here?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] mridulm commented on a diff in pull request #36601: [SPARK-38987][SHUFFLE] Throw FetchFailedException when merged shuffle blocks are corrupted and spark.shuffle.detectCorrupt is set t

2022-05-19 Thread GitBox


mridulm commented on code in PR #36601:
URL: https://github.com/apache/spark/pull/36601#discussion_r877275914


##
core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala:
##
@@ -1885,6 +1885,14 @@ private[spark] class DAGScheduler(
   mapOutputTracker.
 unregisterMergeResult(shuffleId, reduceId, bmAddress, Option(mapIndex))
 }
+  } else {
+// Unregister the merge result of <shuffleId, reduceId> if
+// the FetchFailed event contains a mapIndex of -1, and is not a
+// MetaDataFetchException which is signified by bmAddress being null
+if (bmAddress != null && pushBasedShuffleEnabled) {
+  mapOutputTracker.
+unregisterMergeResult(shuffleId, reduceId, bmAddress, Option(mapIndex))

Review Comment:
   Given `mapIndex` is -1, should this not be `None` passed in to `unregisterMergeResult` instead of `Option(-1)`?
   The tests are passing because you explicitly added `-1` to the bitmap, which looks incorrect.
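   In code, the suggested fix would look roughly like this (a sketch based on the diff above):
   
   ```scala
   // When mapIndex is -1 there is no specific map output to blame, so pass
   // None instead of Option(-1) when unregistering the merge result.
   if (bmAddress != null) {
     mapOutputTracker.unregisterMergeResult(shuffleId, reduceId, bmAddress, None)
   }
   ```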



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] MaxGekk commented on pull request #36584: [SPARK-39213][SQL] Create ANY_VALUE aggregate function

2022-05-19 Thread GitBox


MaxGekk commented on PR #36584:
URL: https://github.com/apache/spark/pull/36584#issuecomment-1131939361

   How about adding the function to other APIs, like first(), in
   - PySpark: 
https://github.com/apache/spark/blob/b63674ea5f746306a96ab8c39c23a230a6cb9566/sql/core/src/main/scala/org/apache/spark/sql/functions.scala#L500
   - R: 
https://github.com/apache/spark/blob/16d1c68e8b185457ae86a248d0874e61c3bc6f3a/R/pkg/R/functions.R#L1178
   
   BTW, if the purpose of this new feature is to make migrations to Spark SQL 
from other systems easier, I would propose to add it to Spark SQL only (and not 
extend functions.scala). 
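   For reference, SQL-only usage would look roughly like this (a sketch; the table and column names are made up):
   
   ```scala
   // any_value(col) returns some value of col for each group, which eases
   // migrating queries from systems that already support ANY_VALUE.
   spark.sql(
     """SELECT dept, any_value(name) AS sample_name
       |FROM employee
       |GROUP BY dept""".stripMargin).show()
   ```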


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] vli-databricks commented on pull request #36584: [SPARK-39213][SQL] Create ANY_VALUE aggregate function

2022-05-19 Thread GitBox


vli-databricks commented on PR #36584:
URL: https://github.com/apache/spark/pull/36584#issuecomment-1131948634

   Yes, the purpose is ease of migration; I removed the change to `functions.scala` to limit the scope to Spark SQL only.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] nkronenfeld opened a new pull request, #36613: [WIP][SPARK-30983] Support typed select in Datasets up to the max tuple size

2022-05-19 Thread GitBox


nkronenfeld opened a new pull request, #36613:
URL: https://github.com/apache/spark/pull/36613

   ### What changes were proposed in this pull request?
   
   This PR simply adds typed select methods to Dataset up to the max Tuple size 
of 22.
   
   This has been bugging me for years, so I finally decided to get off my 
backside and do something about it :-).
   
   As noted in the JIRA issue, this is technically a breaking change - indeed, I had to remove an old test that specifically verified that Spark didn't support typed select for tuples larger than 5. However, it would only affect someone explicitly relying on select returning a DataFrame instead of a Dataset for large tuples of typed columns (though I guess that removed test exhibits one case where this may happen).
   
   I've set the PR as WIP because I've been unable to run all tests so far - 
not due to the fix, but rather due to not having things set up correctly on my 
computer.  Still working on that.
   
   ### Why are the changes needed?
   Arbitrarily supporting only up to 5-tuples is weird and unpredictable.
   
   ### Does this PR introduce _any_ user-facing change?
   Yes, select on tuples of all typed columns larger than 5 will now return a 
Dataset instead of a DataFrame
   
   ### How was this patch tested?
   I've run all SQL tests, and they all pass (though testing itself still fails on my machine, I think with a path-too-long error).
   I've added a test to make sure the typed select works on all sizes - mostly 
this is a compile issue, not a run-time issue, but I checked values too, just 
to double-check that I didn't miss anything (which is a big potential problem 
with long tuples and copy-paste errors).
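   For illustration, with the proposed overloads a six-column typed select would compile to a Dataset of a 6-tuple (a sketch, assuming a SparkSession named `spark`):
   
   ```scala
   import org.apache.spark.sql.Dataset
   import spark.implicits._
   
   val df = Seq((1, 2L, "a", 4.0, true, 6)).toDF("a", "b", "c", "d", "e", "f")
   
   // Today this resolves to the untyped select and returns a DataFrame;
   // with the new overloads it returns a typed Dataset.
   val typed: Dataset[(Int, Long, String, Double, Boolean, Int)] = df.select(
     $"a".as[Int], $"b".as[Long], $"c".as[String],
     $"d".as[Double], $"e".as[Boolean], $"f".as[Int])
   ```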
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] akpatnam25 commented on a diff in pull request #36601: [SPARK-38987][SHUFFLE] Throw FetchFailedException when merged shuffle blocks are corrupted and spark.shuffle.detectCorrupt is se

2022-05-19 Thread GitBox


akpatnam25 commented on code in PR #36601:
URL: https://github.com/apache/spark/pull/36601#discussion_r877316296


##
core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala:
##
@@ -1885,6 +1885,14 @@ private[spark] class DAGScheduler(
   mapOutputTracker.
 unregisterMergeResult(shuffleId, reduceId, bmAddress, Option(mapIndex))
 }
+  } else {
+// Unregister the merge result of <shuffleId, reduceId> if
+// the FetchFailed event contains a mapIndex of -1, and is not a
+// MetaDataFetchException which is signified by bmAddress being null
+if (bmAddress != null && pushBasedShuffleEnabled) {
+  mapOutputTracker.
+unregisterMergeResult(shuffleId, reduceId, bmAddress, Option(mapIndex))

Review Comment:
   Yep, passing in `None` now. This way we do not have to explicitly pass in -1 in the tests.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] akpatnam25 commented on a diff in pull request #36601: [SPARK-38987][SHUFFLE] Throw FetchFailedException when merged shuffle blocks are corrupted and spark.shuffle.detectCorrupt is se

2022-05-19 Thread GitBox


akpatnam25 commented on code in PR #36601:
URL: https://github.com/apache/spark/pull/36601#discussion_r877316473


##
core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala:
##
@@ -1885,6 +1885,14 @@ private[spark] class DAGScheduler(
   mapOutputTracker.
 unregisterMergeResult(shuffleId, reduceId, bmAddress, Option(mapIndex))
 }
+  } else {
+// Unregister the merge result of <shuffleId, reduceId> if
+// the FetchFailed event contains a mapIndex of -1, and is not a
+// MetaDataFetchException which is signified by bmAddress being null
+if (bmAddress != null && pushBasedShuffleEnabled) {

Review Comment:
   We don't need that check here; removed.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] LuciferYang commented on pull request #36611: [SPARK-39204][BUILD][CORE][SQL][DSTREAM][GRAPHX][K8S][ML][MLLIB][SS][YARN][EXAMPLES][SHELL] Replace `Utils.createTempDir` with `JavaUtils

2022-05-19 Thread GitBox


LuciferYang commented on PR #36611:
URL: https://github.com/apache/spark/pull/36611#issuecomment-1131979146

   It seems that this change is big. Another way to keep a single `createTempDir` is to let `Utils.createTempDir` call `JavaUtils.createTempDir`. Is this acceptable?
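   For illustration, that alternative could look roughly like this (a sketch; it assumes `JavaUtils` gains an equivalent `createTempDir` and that `Utils` keeps its shutdown-hook cleanup):
   
   ```scala
   import java.io.File
   import org.apache.spark.network.util.JavaUtils
   import org.apache.spark.util.ShutdownHookManager
   
   // Sketch: Utils.createTempDir becomes a thin wrapper over JavaUtils.
   def createTempDir(
       root: String = System.getProperty("java.io.tmpdir"),
       namePrefix: String = "spark"): File = {
     val dir = JavaUtils.createTempDir(root, namePrefix)  // assumed helper
     ShutdownHookManager.registerShutdownDeleteDir(dir)   // preserve cleanup
     dir
   }
   ```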


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] otterc commented on a diff in pull request #36601: [SPARK-38987][SHUFFLE] Throw FetchFailedException when merged shuffle blocks are corrupted and spark.shuffle.detectCorrupt is set to

2022-05-19 Thread GitBox


otterc commented on code in PR #36601:
URL: https://github.com/apache/spark/pull/36601#discussion_r877329983


##
core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala:
##
@@ -1885,6 +1885,14 @@ private[spark] class DAGScheduler(
   mapOutputTracker.
 unregisterMergeResult(shuffleId, reduceId, bmAddress, Option(mapIndex))
 }
+  } else {
+// Unregister the merge result of <shuffleId, reduceId> if
+// the FetchFailed event contains a mapIndex of -1, and is not a
+// MetaDataFetchException which is signified by bmAddress being null
+if (bmAddress != null && pushBasedShuffleEnabled) {

Review Comment:
   I think we should check for `pushBasedShuffleEnabled` here just for consistency with lines 1882-1886. There aren't going to be any merge results if push-based shuffle is not enabled, but since we do it in the other places, we should do it here as well.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] hai-tao-1 commented on pull request #36597: [SPARK-39225][CORE] Support `spark.history.fs.update.batchSize`

2022-05-19 Thread GitBox


hai-tao-1 commented on PR #36597:
URL: https://github.com/apache/spark/pull/36597#issuecomment-1131988355

   > Thank you for updates, @hai-tao-1 . Yes, the only remaining comment is the 
test case.
   > 
   > > We need a test case for the configuration. Please check the corner cases 
especially.
   
   @dongjoon-hyun Added the unit test `test("SPARK-39225: Support spark.history.fs.update.batchSize")`.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] otterc commented on a diff in pull request #36601: [SPARK-38987][SHUFFLE] Throw FetchFailedException when merged shuffle blocks are corrupted and spark.shuffle.detectCorrupt is set to

2022-05-19 Thread GitBox


otterc commented on code in PR #36601:
URL: https://github.com/apache/spark/pull/36601#discussion_r877340985


##
core/src/test/scala/org/apache/spark/storage/ShuffleBlockFetcherIteratorSuite.scala:
##
@@ -1786,4 +1786,32 @@ class ShuffleBlockFetcherIteratorSuite extends SparkFunSuite with PrivateMethodT
   ShuffleBlockId(0, 5, 2), ShuffleBlockId(0, 6, 2)))
   }
 
+  test("SPARK-38987: failure to fetch corrupted shuffle block chunk should " +

Review Comment:
   Nit: modify this name so it's clear that when corruption goes undetected, it should throw a fetch failure. Otherwise, it is confusing because one test says we should fall back on corruption and the other says it should throw a fetch failure.



##
core/src/main/scala/org/apache/spark/storage/ShuffleBlockFetcherIterator.scala:
##
@@ -1166,6 +1166,9 @@ final class ShuffleBlockFetcherIterator(
   case ShuffleBlockBatchId(shuffleId, mapId, startReduceId, _) =>
 throw SparkCoreErrors.fetchFailedError(address, shuffleId, mapId, mapIndex, startReduceId,
   msg, e)
+  case ShuffleBlockChunkId(shuffleId, _, reduceId, _) =>
+SparkCoreErrors.fetchFailedError(address, shuffleId,
+  -1L, SHUFFLE_PUSH_MAP_ID, reduceId, msg, e)

Review Comment:
   Nit: Can we use SHUFFLE_PUSH_MAP for mapId as well?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] MaxGekk closed pull request #36603: [SPARK-39163][SQL] Throw an exception w/ error class for an invalid bucket file

2022-05-19 Thread GitBox


MaxGekk closed pull request #36603: [SPARK-39163][SQL] Throw an exception w/ 
error class for an invalid bucket file
URL: https://github.com/apache/spark/pull/36603


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] MaxGekk commented on pull request #36603: [SPARK-39163][SQL] Throw an exception w/ error class for an invalid bucket file

2022-05-19 Thread GitBox


MaxGekk commented on PR #36603:
URL: https://github.com/apache/spark/pull/36603#issuecomment-1132003245

   @panbingkun Could you backport this to branch-3.3, please?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] mridulm commented on a diff in pull request #36601: [SPARK-38987][SHUFFLE] Throw FetchFailedException when merged shuffle blocks are corrupted and spark.shuffle.detectCorrupt is set t

2022-05-19 Thread GitBox


mridulm commented on code in PR #36601:
URL: https://github.com/apache/spark/pull/36601#discussion_r877361726


##
core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala:
##
@@ -1885,6 +1885,14 @@ private[spark] class DAGScheduler(
   mapOutputTracker.
 unregisterMergeResult(shuffleId, reduceId, bmAddress, Option(mapIndex))
 }
+  } else {
+// Unregister the merge result of <shuffleId, reduceId> if
+// the FetchFailed event contains a mapIndex of -1, and is not a
+// MetaDataFetchException which is signified by bmAddress being null
+if (bmAddress != null && pushBasedShuffleEnabled) {

Review Comment:
   I am trying to understand: can we have a case where `mapIndex == -1` and `bmAddress != null` for non-push-based shuffle? If not, we should add an assertion there (not drop the check entirely).
   
   Note, L1882 has the check because the earlier check is for `mapIndex != -1`, which can be satisfied for all shuffles.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] aokolnychyi commented on pull request #34785: [SPARK-37523][SQL] Support optimize skewed partitions in Distribution and Ordering if numPartitions is not specified

2022-05-19 Thread GitBox


aokolnychyi commented on PR #34785:
URL: https://github.com/apache/spark/pull/34785#issuecomment-1132014116

   Thanks for the PR, @huaxingao. I think it is a great feature and it would be 
awesome to get it done.
   
   I spent some time thinking about this and have a few questions/proposals.
   
   If I understand correctly, we currently hard-code the number of shuffle 
partitions in `RepartitionByExpression`, which prohibits both coalescing and 
skew split optimizations.
   
   It seems reasonable to support cases when the requested distribution is 
best-effort but I also think there are valid cases when the distribution is 
required for correctness and it is actually the current API contract. What 
about extending `RequiredDistributionAndOrdering` to indicate the distribution 
is not strictly required? We can add some boolean method and default it to keep 
the existing behavior. If the distribution is required, we can still benefit 
from coalescing as I think `CoalesceShufflePartitions` and `AQEShuffleReadExec` 
would keep the original distribution in coalesce cases. That’s already a huge 
win. We can avoid too small files while keeping the requested distribution.
   
   I also agree about using `RebalancePartitions` when the distribution is not 
strictly required. What about extending `RebalancePartitions` to also support 
range partitioning? It currently supports only hash and round-robin. If we make 
that change, we will be able to remove unnecessary shuffles in the optimizer 
and keep the original distribution as long as there is no skew and we only 
coalesce. If there is a skew, an extra shuffle and changed distribution seems 
like a reasonable overhead.
   
   What does everybody else think?
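   To make the first idea concrete, here is a rough sketch of the proposed extension (the method name and default are placeholders, not an agreed API):
   
   ```scala
   import org.apache.spark.sql.connector.distributions.Distribution
   import org.apache.spark.sql.connector.expressions.SortOrder
   
   // Hypothetical sketch of the proposal discussed above.
   trait RequiredDistributionAndOrdering {
     def requiredDistribution: Distribution
     def requiredOrdering: Array[SortOrder]
   
     // New: when false, the distribution is best-effort, so Spark may split
     // skewed partitions; defaults to true to keep the existing contract.
     def distributionStrictlyRequired: Boolean = true
   }
   ```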


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] otterc commented on a diff in pull request #36601: [SPARK-38987][SHUFFLE] Throw FetchFailedException when merged shuffle blocks are corrupted and spark.shuffle.detectCorrupt is set to

2022-05-19 Thread GitBox


otterc commented on code in PR #36601:
URL: https://github.com/apache/spark/pull/36601#discussion_r877366840


##
core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala:
##
@@ -1885,6 +1885,14 @@ private[spark] class DAGScheduler(
   mapOutputTracker.
 unregisterMergeResult(shuffleId, reduceId, bmAddress, Option(mapIndex))
 }
+  } else {
+// Unregister the merge result of <shuffleId, reduceId> if
+// there is a FetchFailed event and is not a
+// MetaDataFetchException which is signified by bmAddress being null
+if (bmAddress != null) {
+  mapOutputTracker.
+unregisterMergeResult(shuffleId, reduceId, bmAddress, None)
+}

Review Comment:
   Can you also check line 2031 where `removeExecutorAndUnregisterOutputs` is 
called. The `executorId` for ShuffleMergedChunk is `shuffle-push-merger`. Will 
that cause any issues? We should probably add a UT for this as well with the 
`unRegisterOutputOnHostOnFetchFailure` enabled.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] huaxingao opened a new pull request, #34785: [SPARK-37523][SQL] Support optimize skewed partitions in Distribution and Ordering if numPartitions is not specified

2022-05-19 Thread GitBox


huaxingao opened a new pull request, #34785:
URL: https://github.com/apache/spark/pull/34785

   
   ### What changes were proposed in this pull request?
   Support optimize skewed partitions in Distribution and Ordering if 
numPartitions is not specified
   
   ### Why are the changes needed?
   When doing repartition in distribution and sort, we will use the Rebalance operator instead of RepartitionByExpression to optimize skewed partitions when
   1. numPartitions is not specified by the data source, and
   2. sortOrder is specified. This is because the requested distribution needs 
to be guaranteed, which can only be achieved by using RangePartitioning, not 
HashPartitioning.
   
   ### Does this PR introduce _any_ user-facing change?
   No
   
   
   ### How was this patch tested?
   Existing and new tests
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] amaliujia commented on pull request #36586: [SPARK-39236][SQL] Make CreateTable and ListTables be compatible with 3 layer namespace

2022-05-19 Thread GitBox


amaliujia commented on PR #36586:
URL: https://github.com/apache/spark/pull/36586#issuecomment-1132121981

   R: @cloud-fan this PR is ready to review.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] dongjoon-hyun closed pull request #36597: [SPARK-39225][CORE] Support `spark.history.fs.update.batchSize`

2022-05-19 Thread GitBox


dongjoon-hyun closed pull request #36597: [SPARK-39225][CORE] Support 
`spark.history.fs.update.batchSize`
URL: https://github.com/apache/spark/pull/36597


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] dongjoon-hyun commented on pull request #36597: [SPARK-39225][CORE] Support `spark.history.fs.update.batchSize`

2022-05-19 Thread GitBox


dongjoon-hyun commented on PR #36597:
URL: https://github.com/apache/spark/pull/36597#issuecomment-1132159846

   Merged to master. I added you to the Apache Spark contributor group and assigned SPARK-39225 to you, @hai-tao-1.
   Welcome to the Apache Spark community.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] hai-tao-1 commented on pull request #36606: [SPARK-39232][CORE] History Server Main Page App List Filtering

2022-05-19 Thread GitBox


hai-tao-1 commented on PR #36606:
URL: https://github.com/apache/spark/pull/36606#issuecomment-1132271280

   The PR test fails with ```[error] spark-core: Failed binary compatibility check against org.apache.spark:spark-core_2.12:3.2.0! Found 9 potential problems (filtered 924)```. Could anyone advise what may be wrong? Thanks.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] dongjoon-hyun commented on pull request #36004: [SPARK-38681][SQL] Support nested generic case classes

2022-05-19 Thread GitBox


dongjoon-hyun commented on PR #36004:
URL: https://github.com/apache/spark/pull/36004#issuecomment-1132295581

   Thank you, @eejbyfeldt .
   
   cc @srowen 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] srowen closed pull request #36004: [SPARK-38681][SQL] Support nested generic case classes

2022-05-19 Thread GitBox


srowen closed pull request #36004: [SPARK-38681][SQL] Support nested generic 
case classes
URL: https://github.com/apache/spark/pull/36004


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] srowen commented on pull request #36004: [SPARK-38681][SQL] Support nested generic case classes

2022-05-19 Thread GitBox


srowen commented on PR #36004:
URL: https://github.com/apache/spark/pull/36004#issuecomment-1132316150

   Merged to master/3.3


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] github-actions[bot] commented on pull request #35424: [WIP][SPARK-38116] Add auto commit option to JDBC PostgreSQL driver and set the option false default

2022-05-19 Thread GitBox


github-actions[bot] commented on PR #35424:
URL: https://github.com/apache/spark/pull/35424#issuecomment-1132318749

   We're closing this PR because it hasn't been updated in a while. This isn't 
a judgement on the merit of the PR in any way. It's just a way of keeping the 
PR queue manageable.
   If you'd like to revive this PR, please reopen it and ask a committer to 
remove the Stale tag!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] dongjoon-hyun commented on pull request #36004: [SPARK-38681][SQL] Support nested generic case classes

2022-05-19 Thread GitBox


dongjoon-hyun commented on PR #36004:
URL: https://github.com/apache/spark/pull/36004#issuecomment-1132318738

   Thank you, @eejbyfeldt , @cloud-fan , @srowen !
   
   cc @MaxGekk 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] github-actions[bot] closed pull request #35402: [SPARK-37536][SQL] Allow for API user to disable Shuffle on Local Mode

2022-05-19 Thread GitBox


github-actions[bot] closed pull request #35402: [SPARK-37536][SQL] Allow for 
API user to disable Shuffle on Local Mode
URL: https://github.com/apache/spark/pull/35402


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] github-actions[bot] closed pull request #34785: [SPARK-37523][SQL] Support optimize skewed partitions in Distribution and Ordering if numPartitions is not specified

2022-05-19 Thread GitBox


github-actions[bot] closed pull request #34785: [SPARK-37523][SQL] Support 
optimize skewed partitions in Distribution and Ordering if numPartitions is not 
specified
URL: https://github.com/apache/spark/pull/34785


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] github-actions[bot] closed pull request #35049: [SPARK-37757][BUILD] Enable Spark test scheduled job on ARM runner

2022-05-19 Thread GitBox


github-actions[bot] closed pull request #35049: [SPARK-37757][BUILD] Enable 
Spark test scheduled job on ARM runner
URL: https://github.com/apache/spark/pull/35049


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] zsxwing commented on a diff in pull request #36589: [SPARK-39218][SS][PYTHON] Make foreachBatch streaming query stop gracefully

2022-05-19 Thread GitBox


zsxwing commented on code in PR #36589:
URL: https://github.com/apache/spark/pull/36589#discussion_r877642301


##
python/pyspark/sql/tests/test_streaming.py:
##
@@ -592,6 +592,18 @@ def collectBatch(df, id):
 if q:
 q.stop()
 
+def test_streaming_foreachBatch_graceful_stop(self):
+# SPARK-39218: Make foreachBatch streaming query stop gracefully
+def func(batch_df, _):
+time.sleep(10)
+batch_df.count()
+
+q = self.spark.readStream.format("rate").load().writeStream.foreachBatch(func).start()
+time.sleep(5)
+q.stop()
+time.sleep(15)  # Wait enough for the exception to be propagated if exists.

Review Comment:
   this is not needed. `q.stop()` will wait until the streaming thread is dead.



##
python/pyspark/sql/tests/test_streaming.py:
##
@@ -592,6 +592,18 @@ def collectBatch(df, id):
 if q:
 q.stop()
 
+def test_streaming_foreachBatch_graceful_stop(self):
+# SPARK-39218: Make foreachBatch streaming query stop gracefully
+def func(batch_df, _):
+time.sleep(10)

Review Comment:
   How does this func trigger an InterruptedException? I would expect code like 
`self.spark._jvm.java.lang.Thread.sleep(1)` instead.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] HyukjinKwon commented on pull request #36611: [SPARK-39204][BUILD][CORE][SQL][DSTREAM][GRAPHX][K8S][ML][MLLIB][SS][YARN][EXAMPLES][SHELL] Replace `Utils.createTempDir` with `JavaUtils

2022-05-19 Thread GitBox


HyukjinKwon commented on PR #36611:
URL: https://github.com/apache/spark/pull/36611#issuecomment-1132340311

   Yeah, I think it would be better to fix `Utils.createTempDir`. 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] HyukjinKwon commented on a diff in pull request #36589: [SPARK-39218][SS][PYTHON] Make foreachBatch streaming query stop gracefully

2022-05-19 Thread GitBox


HyukjinKwon commented on code in PR #36589:
URL: https://github.com/apache/spark/pull/36589#discussion_r877647489


##
python/pyspark/sql/tests/test_streaming.py:
##
@@ -592,6 +592,18 @@ def collectBatch(df, id):
 if q:
 q.stop()
 
+def test_streaming_foreachBatch_graceful_stop(self):
+# SPARK-39218: Make foreachBatch streaming query stop gracefully
+def func(batch_df, _):
+time.sleep(10)

Review Comment:
   Actually, `batch_df.count()` below triggers it by accessing the interrupted 
Java thread, if I'm not mistaken.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] zsxwing commented on a diff in pull request #36589: [SPARK-39218][SS][PYTHON] Make foreachBatch streaming query stop gracefully

2022-05-19 Thread GitBox


zsxwing commented on code in PR #36589:
URL: https://github.com/apache/spark/pull/36589#discussion_r877648746


##
python/pyspark/sql/tests/test_streaming.py:
##
@@ -592,6 +592,18 @@ def collectBatch(df, id):
 if q:
 q.stop()
 
+def test_streaming_foreachBatch_graceful_stop(self):
+# SPARK-39218: Make foreachBatch streaming query stop gracefully
+def func(batch_df, _):
+time.sleep(10)

Review Comment:
   I see. `batch_df.count()` will probably touch some code that can be 
interrupted. Can we use `self.spark._jvm.java.lang.Thread.sleep(1)` to save 
5 seconds in this test?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] HyukjinKwon commented on a diff in pull request #36589: [SPARK-39218][SS][PYTHON] Make foreachBatch streaming query stop gracefully

2022-05-19 Thread GitBox


HyukjinKwon commented on code in PR #36589:
URL: https://github.com/apache/spark/pull/36589#discussion_r877653141


##
python/pyspark/sql/tests/test_streaming.py:
##
@@ -592,6 +592,18 @@ def collectBatch(df, id):
 if q:
 q.stop()
 
+def test_streaming_foreachBatch_graceful_stop(self):
+# SPARK-39218: Make foreachBatch streaming query stop gracefully
+def func(batch_df, _):
+time.sleep(10)

Review Comment:
   Yeah, that should work too.



##
python/pyspark/sql/tests/test_streaming.py:
##
@@ -592,6 +592,18 @@ def collectBatch(df, id):
 if q:
 q.stop()
 
+def test_streaming_foreachBatch_graceful_stop(self):
+# SPARK-39218: Make foreachBatch streaming query stop gracefully
+def func(batch_df, _):
+time.sleep(10)

Review Comment:
   Let me fix 👍 


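   For readers following the thread: a minimal sketch of what the revised test
could look like after this suggestion. This is illustrative only; the exact
sleep durations and the final shape of the test are assumptions, not the code
that actually landed in the PR (the method is shown as it might sit inside the
test class):

   ```python
   import time

   def test_streaming_foreachBatch_graceful_stop(self):
       # SPARK-39218: a graceful stop should not surface an error from the
       # foreachBatch function.
       def func(batch_df, _):
           # Sleeping on the JVM side (instead of Python's time.sleep) leaves
           # the thread blocked in interruptible Java code, so q.stop() can
           # interrupt it right away. The 10-second duration is illustrative.
           self.spark._jvm.java.lang.Thread.sleep(10000)
           batch_df.count()

       q = (
           self.spark.readStream.format("rate")
           .load()
           .writeStream.foreachBatch(func)
           .start()
       )
       time.sleep(5)   # give at least one microbatch time to enter func
       q.stop()        # blocks until the streaming thread is dead
       self.assertIsNone(q.exception())  # graceful stop: no propagated error
   ```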

-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] beliefer commented on a diff in pull request #36330: [SPARK-38897][SQL] DS V2 supports push down string functions

2022-05-19 Thread GitBox


beliefer commented on code in PR #36330:
URL: https://github.com/apache/spark/pull/36330#discussion_r877653817


##
sql/catalyst/src/main/java/org/apache/spark/sql/connector/util/V2ExpressionSQLBuilder.java:
##
@@ -228,4 +244,18 @@ protected String visitSQLFunction(String funcName, String[] inputs) {
   protected String visitUnexpectedExpr(Expression expr) throws IllegalArgumentException {
     throw new IllegalArgumentException("Unexpected V2 expression: " + expr);
   }
+
+  protected String visitOverlay(String[] inputs) {
+    throw new UnsupportedOperationException("Function: OVERLAY does not support ");

Review Comment:
   @chenzhx The default implementation is just used for display. It should return the `OVERLAY` syntax.


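   For context, the standard SQL syntax being asked for here is
`OVERLAY(str PLACING replace FROM pos [FOR len])`. A rough sketch of the string
such a visitor could build, written in Python for brevity (the real method is
Java in `V2ExpressionSQLBuilder`, and the argument order is an assumption):

   ```python
   def visit_overlay(inputs):
       # `inputs` holds already-generated SQL fragments, assumed to be
       # (str, replace, pos) or (str, replace, pos, len).
       if len(inputs) == 4:
           return "OVERLAY({} PLACING {} FROM {} FOR {})".format(*inputs)
       return "OVERLAY({} PLACING {} FROM {})".format(*inputs)

   # e.g. visit_overlay(["col1", "'abc'", "2", "3"])
   # -> "OVERLAY(col1 PLACING 'abc' FROM 2 FOR 3)"
   ```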

-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] huaxingao commented on pull request #34785: [SPARK-37523][SQL] Support optimize skewed partitions in Distribution and Ordering if numPartitions is not specified

2022-05-19 Thread GitBox


huaxingao commented on PR #34785:
URL: https://github.com/apache/spark/pull/34785#issuecomment-1132363307

   Thanks @aokolnychyi for the proposal. I agree that we should support both 
strictly required distribution and best effort distribution. For best effort 
distribution, if the user doesn't request a specific number of partitions, we 
will split skewed partitions and coalesce small partitions. For strictly 
required distribution, if the user doesn't request a specific number of 
partitions, we will coalesce small partitions but we will NOT split skewed 
partitions since this changes the required distribution.
   
   In interface `RequiresDistributionAndOrdering`, I will add
   ```
   default boolean distributionStrictlyRequired() { return true; }
   ```
   Then in `DistributionAndOrderingUtils.prepareQuery`, I will change the 
code to something like this:
   ```
   val queryWithDistribution = if (distribution.nonEmpty) {
     if (!write.distributionStrictlyRequired() && numPartitions == 0) {
       RebalancePartitions(distribution, query)
     } else {
       if (numPartitions > 0) {
         RepartitionByExpression(distribution, query, numPartitions)
       } else {
         RepartitionByExpression(distribution, query, None)
       }
     }
   ...
   ``` 
   Basically, in the best effort case, if the requested numPartitions is 0, we 
will use `RebalancePartitions` so both `OptimizeSkewInRebalancePartitions` and 
`CoalesceShufflePartitions` will be applied. In the strictly required 
distribution case, if the requested numPartitions is 0, we will use 
`RepartitionByExpression(distribution, query, None)` so 
`CoalesceShufflePartitions` will be applied. 
   
   Does this sound correct for everyone?
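   To make the proposed dispatch easy to scan, here is the same branching as a
hedged Python sketch (the authoritative version is the Scala snippet above;
the function name and return values below are illustrative only):

   ```python
   def plan_partitioning(strictly_required: bool, num_partitions: int) -> str:
       # Assumes a non-empty required distribution.
       if not strictly_required and num_partitions == 0:
           # Best effort: rebalance, so skew splitting and coalescing both apply.
           return "RebalancePartitions"
       if num_partitions > 0:
           # An explicit partition count pins the shuffle exactly.
           return "RepartitionByExpression(numPartitions)"
       # Strictly required with no requested count: coalescing only,
       # never skew splitting.
       return "RepartitionByExpression(None)"
   ```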


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org


