[GitHub] [spark] dongjoon-hyun commented on pull request #42809: [SPARK-45074][PYTHON][CONNECT] `DataFrame.{sort, sortWithinPartitions}` support column ordinals

2023-09-04 Thread via GitHub


dongjoon-hyun commented on PR #42809:
URL: https://github.com/apache/spark/pull/42809#issuecomment-1706021217

   Thank you. Merged to master for Apache Spark 4.0.0. 





[GitHub] [spark] dongjoon-hyun closed pull request #42809: [SPARK-45074][PYTHON][CONNECT] `DataFrame.{sort, sortWithinPartitions}` support column ordinals

2023-09-04 Thread via GitHub


dongjoon-hyun closed pull request #42809: [SPARK-45074][PYTHON][CONNECT] 
`DataFrame.{sort, sortWithinPartitions}` support column ordinals
URL: https://github.com/apache/spark/pull/42809





[GitHub] [spark] zhengruifeng commented on pull request #42812: [SPARK-45076][PS] Switch to built-in `repeat` function

2023-09-04 Thread via GitHub


zhengruifeng commented on PR #42812:
URL: https://github.com/apache/spark/pull/42812#issuecomment-1706010985

   thanks, merged to master





[GitHub] [spark] zhengruifeng closed pull request #42812: [SPARK-45076][PS] Switch to built-in `repeat` function

2023-09-04 Thread via GitHub


zhengruifeng closed pull request #42812: [SPARK-45076][PS] Switch to built-in 
`repeat` function
URL: https://github.com/apache/spark/pull/42812





[GitHub] [spark] zhengruifeng commented on a diff in pull request #42798: [SPARK-43295][PS] Support string type columns for `DataFrameGroupBy.sum`

2023-09-04 Thread via GitHub


zhengruifeng commented on code in PR #42798:
URL: https://github.com/apache/spark/pull/42798#discussion_r1315417026


##
python/pyspark/pandas/groupby.py:
##
@@ -3534,7 +3534,12 @@ def _reduce_for_stat_function(
         for label in psdf._internal.column_labels:
             psser = psdf._psser_for(label)
             input_scol = psser._dtype_op.nan_to_null(psser).spark.column
-            output_scol = sfun(input_scol)
+            if sfun.__name__ == "sum" and isinstance(
+                psdf._internal.spark_type_for(label), StringType
+            ):
+                output_scol = F.concat_ws("", F.collect_list(input_scol))

Review Comment:
   Since string columns are computed together with numerical ones, I think we
have to compute the strings' sum in an aggregation way:
   
   ```
   F.concat_ws("", F.array_sort(
       F.collect_list(F.struct(NATURAL_ORDER_COLUMN_NAME, input_scol))
   ))
   ```
   
   For struct type, `array_sort` sorts elements by the first field, then by the
second field, IIRC.
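   A minimal runnable sketch of this idea on a plain Spark DataFrame, assuming an
explicit ordering column `ord` standing in for `NATURAL_ORDER_COLUMN_NAME`; note
the string field still has to be pulled back out of the sorted structs (here with
`transform`):
   
   ```
   from pyspark.sql import SparkSession, functions as F

   spark = SparkSession.builder.getOrCreate()
   df = spark.createDataFrame(
       [(1, 0, "a"), (1, 1, "b"), (1, 2, "c")], ["key", "ord", "s"]
   )

   # Collect (ord, s) structs, sort them (array_sort compares struct fields
   # left to right, so `ord` dominates), then concatenate just the string field.
   sorted_structs = F.array_sort(F.collect_list(F.struct("ord", "s")))
   summed = F.concat_ws("", F.transform(sorted_structs, lambda x: x["s"]))
   df.groupBy("key").agg(summed.alias("s_sum")).show()
   # +---+-----+
   # |key|s_sum|
   # +---+-----+
   # |  1|  abc|
   # +---+-----+
   ```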






[GitHub] [spark] HyukjinKwon closed pull request #42813: [MINOR][PYTHON][TESTS] Add init file to pyspark.ml.tests.connect

2023-09-04 Thread via GitHub


HyukjinKwon closed pull request #42813: [MINOR][PYTHON][TESTS] Add init file to 
pyspark.ml.tests.connect
URL: https://github.com/apache/spark/pull/42813





[GitHub] [spark] HyukjinKwon commented on pull request #42813: [MINOR][PYTHON][TESTS] Add init file to pyspark.ml.tests.connect

2023-09-04 Thread via GitHub


HyukjinKwon commented on PR #42813:
URL: https://github.com/apache/spark/pull/42813#issuecomment-1706007913

   I manually tested this.
   
   Merged to master.





[GitHub] [spark] panbingkun commented on pull request #42797: [SPARK-45068][SQL] Make function output column name consistent in case

2023-09-04 Thread via GitHub


panbingkun commented on PR #42797:
URL: https://github.com/apache/spark/pull/42797#issuecomment-1706004673

   cc @HyukjinKwon @zhengruifeng @cloud-fan 





[GitHub] [spark] cloud-fan commented on a diff in pull request #42810: [SPARK-45075][SQL] Fix alter table with invalid default value will not report error

2023-09-04 Thread via GitHub


cloud-fan commented on code in PR #42810:
URL: https://github.com/apache/spark/pull/42810#discussion_r1315406872


##
sql/catalyst/src/main/scala/org/apache/spark/sql/connector/catalog/CatalogV2Util.scala:
##
@@ -213,7 +213,9 @@ private[sql] object CatalogV2Util {
         // enforced by the parser). On the other hand, commands that drop the default value such
         // as "ALTER TABLE t ALTER COLUMN c DROP DEFAULT" will set this string to empty.
         if (update.newDefaultValue().nonEmpty) {
-          Some(field.withCurrentDefaultValue(update.newDefaultValue()))
+          val result = field.withCurrentDefaultValue(update.newDefaultValue())
+          ResolveDefaultColumns.analyze(result, "ALTER TABLE ALTER COLUMN")

Review Comment:
   can we check AlterTable in `DataSourceV2Strategy` as well?






[GitHub] [spark] cloud-fan commented on a diff in pull request #42109: [SPARK-44404][SQL] Assign names to the error class _LEGACY_ERROR_TEMP_[1009,1010,1013,1015,1016,1278]

2023-09-04 Thread via GitHub


cloud-fan commented on code in PR #42109:
URL: https://github.com/apache/spark/pull/42109#discussion_r1315376064


##
common/utils/src/main/resources/error/error-classes.json:
##
@@ -3219,12 +3219,63 @@
     },
     "sqlState" : "0A000"
   },
+  "UNSUPPORTED_TABLE_OPERATION" : {

Review Comment:
   This is too general, but what we actually report is a relation type mismatch.
How about
   ```
   "UNEXPECTED_RELATION_TYPE": "<relationName> is a <relationType>. '<operation>' expects a <expectedType>."
   "subClass": {
     "WITHOUT_SUGGESTION": "",
     "WITH_ALTER_COMMAND_SUGGESTION": "Please use ALTER <relationType> instead."
   }
   ```






[GitHub] [spark] cloud-fan commented on a diff in pull request #42109: [SPARK-44404][SQL] Assign names to the error class _LEGACY_ERROR_TEMP_[1009,1010,1013,1015,1016,1278]

2023-09-04 Thread via GitHub


cloud-fan commented on code in PR #42109:
URL: https://github.com/apache/spark/pull/42109#discussion_r1315365663


##
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/v2ResolutionPlans.scala:
##
@@ -44,7 +44,7 @@ case class UnresolvedNamespace(multipartIdentifier: Seq[String]) extends Unresol
 case class UnresolvedTable(
     multipartIdentifier: Seq[String],
     commandName: String,
-    relationTypeMismatchHint: Option[String]) extends UnresolvedLeafNode
+    hint: Boolean = false) extends UnresolvedLeafNode

Review Comment:
   This is not a good name, as `hint` is too vague. How about 
`suggestAlternative: Boolean`?






[GitHub] [spark] cloud-fan commented on a diff in pull request #42109: [SPARK-44404][SQL] Assign names to the error class _LEGACY_ERROR_TEMP_[1009,1010,1013,1015,1016,1278]

2023-09-04 Thread via GitHub


cloud-fan commented on code in PR #42109:
URL: https://github.com/apache/spark/pull/42109#discussion_r1315365663


##
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/v2ResolutionPlans.scala:
##
@@ -44,7 +44,7 @@ case class UnresolvedNamespace(multipartIdentifier: Seq[String]) extends Unresol
 case class UnresolvedTable(
     multipartIdentifier: Seq[String],
     commandName: String,
-    relationTypeMismatchHint: Option[String]) extends UnresolvedLeafNode
+    hint: Boolean = false) extends UnresolvedLeafNode

Review Comment:
   This is not a good name, as `hint` is too vague. How about 
`suggestAlterCommand: Boolean`?






[GitHub] [spark] Hisoka-X commented on a diff in pull request #42810: [SPARK-45075][SQL] Fix alter table with invalid default value will not report error

2023-09-04 Thread via GitHub


Hisoka-X commented on code in PR #42810:
URL: https://github.com/apache/spark/pull/42810#discussion_r1315353214


##
sql/catalyst/src/main/scala/org/apache/spark/sql/connector/catalog/CatalogV2Util.scala:
##
@@ -213,7 +213,9 @@ private[sql] object CatalogV2Util {
         // enforced by the parser). On the other hand, commands that drop the default value such
         // as "ALTER TABLE t ALTER COLUMN c DROP DEFAULT" will set this string to empty.
         if (update.newDefaultValue().nonEmpty) {
-          Some(field.withCurrentDefaultValue(update.newDefaultValue()))
+          val result = field.withCurrentDefaultValue(update.newDefaultValue())
+          ResolveDefaultColumns.analyze(result, "ALTER TABLE ALTER COLUMN")

Review Comment:
   DataSourceV2Strategy.scala:180 also uses `ResolveDefaultColumns.analyze`.






[GitHub] [spark] Hisoka-X commented on a diff in pull request #42810: [SPARK-45075][SQL] Fix alter table with invalid default value will not report error

2023-09-04 Thread via GitHub


Hisoka-X commented on code in PR #42810:
URL: https://github.com/apache/spark/pull/42810#discussion_r1315353214


##
sql/catalyst/src/main/scala/org/apache/spark/sql/connector/catalog/CatalogV2Util.scala:
##
@@ -213,7 +213,9 @@ private[sql] object CatalogV2Util {
         // enforced by the parser). On the other hand, commands that drop the default value such
         // as "ALTER TABLE t ALTER COLUMN c DROP DEFAULT" will set this string to empty.
         if (update.newDefaultValue().nonEmpty) {
-          Some(field.withCurrentDefaultValue(update.newDefaultValue()))
+          val result = field.withCurrentDefaultValue(update.newDefaultValue())
+          ResolveDefaultColumns.analyze(result, "ALTER TABLE ALTER COLUMN")

Review Comment:
   DataSourceV2Strategy.scala:180






[GitHub] [spark] itholic commented on a diff in pull request #42798: [SPARK-43295][PS] Support string type columns for `DataFrameGroupBy.sum`

2023-09-04 Thread via GitHub


itholic commented on code in PR #42798:
URL: https://github.com/apache/spark/pull/42798#discussion_r1315352479


##
python/pyspark/pandas/groupby.py:
##
@@ -3534,7 +3534,12 @@ def _reduce_for_stat_function(
         for label in psdf._internal.column_labels:
             psser = psdf._psser_for(label)
             input_scol = psser._dtype_op.nan_to_null(psser).spark.column
-            output_scol = sfun(input_scol)
+            if sfun.__name__ == "sum" and isinstance(
+                psdf._internal.spark_type_for(label), StringType
+            ):
+                output_scol = F.concat_ws("", F.collect_list(input_scol))

Review Comment:
   I think maybe this is a different case from `head`?
   
   In `head`, we do `sdf = sdf.orderBy(NATURAL_ORDER_COLUMN_NAME)` and then compute
`sdf.limit(n)`, so we can keep the order because `DataFrame.limit` doesn't
shuffle the data.
   
   But in this case, the data is shuffled again when computing the `collect_list`,
even after sorting the DataFrame by natural order in advance, so I think the
order would not be guaranteed.
   
   Please let me know if I missed something?






[GitHub] [spark] cloud-fan commented on a diff in pull request #42109: [SPARK-44404][SQL] Assign names to the error class _LEGACY_ERROR_TEMP_[1009,1010,1013,1015,1016,1278]

2023-09-04 Thread via GitHub


cloud-fan commented on code in PR #42109:
URL: https://github.com/apache/spark/pull/42109#discussion_r1315350627


##
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala:
##
@@ -1134,20 +1134,21 @@ class Analyzer(override val catalogManager: CatalogManager) extends RuleExecutor
           case v: ResolvedPersistentView =>
             val nameParts = v.catalog.name() +: v.identifier.asMultipartIdentifier
             throw QueryCompilationErrors.expectTableNotViewError(
-              nameParts, isTemp = false, cmd, relationTypeMismatchHint, u)
+              nameParts, isTemp = false, cmd, true, u)

Review Comment:
   why do we pass `true` directly? shall we use `UnresolvedTable#hint`?






[GitHub] [spark] HyukjinKwon opened a new pull request, #42813: [MINOR][PYTHON][TESTS] Add init file to pyspark.ml.tests.connect

2023-09-04 Thread via GitHub


HyukjinKwon opened a new pull request, #42813:
URL: https://github.com/apache/spark/pull/42813

   ### What changes were proposed in this pull request?
   
   This PR proposes to add an `__init__.py` file to `pyspark.ml.tests.connect`.
   
   ### Why are the changes needed?
   
   To make `pyspark.ml.tests.connect` a canonical Python package
   
   ### Does this PR introduce _any_ user-facing change?
   
   No.
   
   ### How was this patch tested?
   
   N/A
   
   ### Was this patch authored or co-authored using generative AI tooling?
   No.
   





[GitHub] [spark] cloud-fan commented on a diff in pull request #42109: [SPARK-44404][SQL] Assign names to the error class _LEGACY_ERROR_TEMP_[1009,1010,1013,1015,1016,1278]

2023-09-04 Thread via GitHub


cloud-fan commented on code in PR #42109:
URL: https://github.com/apache/spark/pull/42109#discussion_r1315349400


##
common/utils/src/main/resources/error/error-classes.json:
##
@@ -3389,12 +3389,63 @@
     },
     "sqlState" : "0A000"
   },
+  "UNSUPPORTED_TABLE_OPERATION" : {
+    "message" : [
+      "The table <tableName> does not support <operation>."
+    ],
+    "subClass" : {
+      "WITHOUT_SUGGESTION" : {
+        "message" : [
+          ""
+        ]
+      },
+      "WITH_SUGGESTION" : {
+        "message" : [
+          "Please use ALTER TABLE instead."
+        ]
+      }
+    }
+  },
+  "UNSUPPORTED_TEMP_VIEW_OPERATION" : {

Review Comment:
   Can we remove this? A temp view is also a view, and there is nothing wrong
with saying `The view abc does not ...` even if `abc` is a temp view.






[GitHub] [spark] zhengruifeng commented on pull request #42812: [SPARK-45076][PS] Switch to built-in `repeat` function

2023-09-04 Thread via GitHub


zhengruifeng commented on PR #42812:
URL: https://github.com/apache/spark/pull/42812#issuecomment-1705926359

   CI link: https://github.com/zhengruifeng/spark/actions/runs/6078712199





[GitHub] [spark] zhengruifeng commented on pull request #42812: [SPARK-45076][PS] Switch to built-in `repeat` function

2023-09-04 Thread via GitHub


zhengruifeng commented on PR #42812:
URL: https://github.com/apache/spark/pull/42812#issuecomment-1705925083

   cc @itholic @HyukjinKwon 





[GitHub] [spark] zhengruifeng opened a new pull request, #42812: [SPARK-45076][PS] Switch to built-in `repeat` function

2023-09-04 Thread via GitHub


zhengruifeng opened a new pull request, #42812:
URL: https://github.com/apache/spark/pull/42812

   ### What changes were proposed in this pull request?
   Switch to built-in `repeat` function
   
   
   ### Why are the changes needed?
   
   https://github.com/apache/spark/pull/42794 makes `repeat` support a
column-typed `n`, so we no longer need this PS-specific function
   
   ### Does this PR introduce _any_ user-facing change?
   NO
   
   
   ### How was this patch tested?
   CI
   
   
   ### Was this patch authored or co-authored using generative AI tooling?
   NO
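   A quick sketch of the built-in function with a column-typed count, assuming a
Spark build that includes https://github.com/apache/spark/pull/42794:
   
   ```
   from pyspark.sql import SparkSession, functions as F

   spark = SparkSession.builder.getOrCreate()
   df = spark.createDataFrame([("ab", 2), ("c", 3)], ["s", "n"])

   # `n` is a Column here, not a Python int.
   df.select(F.repeat(F.col("s"), F.col("n")).alias("repeated")).show()
   # +--------+
   # |repeated|
   # +--------+
   # |    abab|
   # |     ccc|
   # +--------+
   ```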





[GitHub] [spark] cloud-fan commented on a diff in pull request #42810: [SPARK-45075][SQL] Fix alter table with invalid default value will not report error

2023-09-04 Thread via GitHub


cloud-fan commented on code in PR #42810:
URL: https://github.com/apache/spark/pull/42810#discussion_r1315344801


##
sql/catalyst/src/main/scala/org/apache/spark/sql/connector/catalog/CatalogV2Util.scala:
##
@@ -213,7 +213,9 @@ private[sql] object CatalogV2Util {
         // enforced by the parser). On the other hand, commands that drop the default value such
         // as "ALTER TABLE t ALTER COLUMN c DROP DEFAULT" will set this string to empty.
         if (update.newDefaultValue().nonEmpty) {
-          Some(field.withCurrentDefaultValue(update.newDefaultValue()))
+          val result = field.withCurrentDefaultValue(update.newDefaultValue())
+          ResolveDefaultColumns.analyze(result, "ALTER TABLE ALTER COLUMN")

Review Comment:
   Where/How do we check the default value for CREATE TABLE?






[GitHub] [spark] itholic commented on a diff in pull request #42798: [SPARK-43295][PS] Support string type columns for `DataFrameGroupBy.sum`

2023-09-04 Thread via GitHub


itholic commented on code in PR #42798:
URL: https://github.com/apache/spark/pull/42798#discussion_r1315344811


##
python/pyspark/pandas/groupby.py:
##
@@ -910,7 +910,7 @@ def sum(self, numeric_only: Optional[bool] = True, min_count: int = 0) -> FrameL
 

Review Comment:
   Yeah, we should update. Thanks for catching this!






[GitHub] [spark] itholic opened a new pull request, #42811: [SPARK-43241][FOLLOWUP] Add migration guide for behavior change

2023-09-04 Thread via GitHub


itholic opened a new pull request, #42811:
URL: https://github.com/apache/spark/pull/42811

   
   
   ### What changes were proposed in this pull request?
   
   This is a follow-up to https://github.com/apache/spark/pull/42787 to update the
migration guide.
   
   ### Why are the changes needed?
   
   
   We should mention all the behavior changes in migration guide.
   
   ### Does this PR introduce _any_ user-facing change?
   
   No, it's a documentation update.
   
   ### How was this patch tested?
   
   The existing CI should pass
   
   ### Was this patch authored or co-authored using generative AI tooling?
   
   No.





[GitHub] [spark] Hisoka-X commented on a diff in pull request #42802: [SPARK-43752][SQL] Support default column value on DataSource V2

2023-09-04 Thread via GitHub


Hisoka-X commented on code in PR #42802:
URL: https://github.com/apache/spark/pull/42802#discussion_r1315341345


##
sql/core/src/test/scala/org/apache/spark/sql/sources/InsertSuite.scala:
##
@@ -997,669 +997,6 @@ class InsertSuite extends DataSourceTest with SharedSparkSession {
     }
   }
 
-  test("SPARK-38336 INSERT INTO statements with tables with default columns: positive tests") {
-    // When the INSERT INTO statement provides fewer values than expected, NULL values are appended
-    // in their place.
-    withTable("t") {
-      sql("create table t(i boolean, s bigint) using parquet")
-      sql("insert into t(i) values(true)")
-      checkAnswer(spark.table("t"), Row(true, null))
-    }
-    // The default value for the DEFAULT keyword is the NULL literal.
-    withTable("t") {
-      sql("create table t(i boolean, s bigint) using parquet")
-      sql("insert into t values(true, default)")
-      checkAnswer(spark.table("t"), Row(true, null))
-    }
-    // There is a complex expression in the default value.
-    withTable("t") {
-      sql("create table t(i boolean, s string default concat('abc', 'def')) using parquet")
-      sql("insert into t values(true, default)")
-      checkAnswer(spark.table("t"), Row(true, "abcdef"))
-    }
-    // The default value parses correctly and the provided value type is different but coercible.
-    withTable("t") {
-      sql("create table t(i boolean, s bigint default 42) using parquet")
-      sql("insert into t(i) values(false)")
-      checkAnswer(spark.table("t"), Row(false, 42L))
-    }
-    // There are two trailing default values referenced implicitly by the INSERT INTO statement.
-    withTable("t") {
-      sql("create table t(i int, s bigint default 42, x bigint default 43) using parquet")
-      sql("insert into t(i) values(1)")
-      checkAnswer(sql("select s + x from t where i = 1"), Seq(85L).map(i => Row(i)))
-    }
-    withTable("t") {
-      sql("create table t(i int, s bigint default 42, x bigint) using parquet")
-      sql("insert into t(i) values(1)")
-      checkAnswer(spark.table("t"), Row(1, 42L, null))
-    }
-    // The table has a partitioning column and a default value is injected.
-    withTable("t") {
-      sql("create table t(i boolean, s bigint, q int default 42) using parquet partitioned by (i)")
-      sql("insert into t partition(i='true') values(5, default)")
-      checkAnswer(spark.table("t"), Row(5, 42, true))
-    }
-    // The table has a partitioning column and a default value is added per an explicit reference.
-    withTable("t") {
-      sql("create table t(i boolean, s bigint default 42) using parquet partitioned by (i)")
-      sql("insert into t partition(i='true') (s) values(default)")
-      checkAnswer(spark.table("t"), Row(42L, true))
-    }
-    // The default value parses correctly as a constant but non-literal expression.
-    withTable("t") {
-      sql("create table t(i boolean, s bigint default 41 + 1) using parquet")
-      sql("insert into t values(false, default)")
-      checkAnswer(spark.table("t"), Row(false, 42L))
-    }
-    // Explicit defaults may appear in different positions within the inline table provided as input
-    // to the INSERT INTO statement.
-    withTable("t") {
-      sql("create table t(i boolean default false, s bigint default 42) using parquet")
-      sql("insert into t(i, s) values(false, default), (default, 42)")
-      checkAnswer(spark.table("t"), Seq(Row(false, 42L), Row(false, 42L)))
-    }
-    // There is an explicit default value provided in the INSERT INTO statement in the VALUES,
-    // with an alias over the VALUES.
-    withTable("t") {
-      sql("create table t(i boolean, s bigint default 42) using parquet")
-      sql("insert into t select * from values (false, default) as tab(col, other)")
-      checkAnswer(spark.table("t"), Row(false, 42L))
-    }
-    // The explicit default value arrives first before the other value.
-    withTable("t") {
-      sql("create table t(i boolean default false, s bigint) using parquet")
-      sql("insert into t values (default, 43)")
-      checkAnswer(spark.table("t"), Row(false, 43L))
-    }
-    // The 'create table' statement provides the default parameter first.
-    withTable("t") {
-      sql("create table t(i boolean default false, s bigint) using parquet")
-      sql("insert into t values (default, 43)")
-      checkAnswer(spark.table("t"), Row(false, 43L))
-    }
-    // The explicit default value is provided in the wrong order (first instead of second), but
-    // this is OK because the provided default value evaluates to literal NULL.
-    withTable("t") {
-      sql("create table t(i boolean, s bigint default 42) using parquet")
-      sql("insert into t values (default, 43)")
-      checkAnswer(spark.table("t"), Row(null, 43L))
-    }
-    // There is an explicit default value provided in the INSERT INTO statement as a SELECT.
-    // This is supported.
-    withTable

[GitHub] [spark] Hisoka-X commented on pull request #42810: [SPARK-45075][SQL] Fix alter table with invalid default value will not report error

2023-09-04 Thread via GitHub


Hisoka-X commented on PR #42810:
URL: https://github.com/apache/spark/pull/42810#issuecomment-1705918893

   cc @cloud-fan @gengliangwang @dtenedor





[GitHub] [spark] Hisoka-X commented on a diff in pull request #42810: [SPARK-45075][SQL] Fix alter table with invalid default value will not report error

2023-09-04 Thread via GitHub


Hisoka-X commented on code in PR #42810:
URL: https://github.com/apache/spark/pull/42810#discussion_r1315339168


##
sql/core/src/test/scala/org/apache/spark/sql/connector/AlterTableTests.scala:
##
@@ -363,6 +363,35 @@ trait AlterTableTests extends SharedSparkSession with QueryErrorsBase {
     }
   }
 
+  test("SPARK-45075: ALTER COLUMN with invalid default value") {
+    withSQLConf(SQLConf.DEFAULT_COLUMN_ALLOWED_PROVIDERS.key -> s"$v2Format, ") {
+      withTable("t") {
+        sql(s"create table t(i boolean) using $v2Format")
+        // The default value fails to analyze.
+        checkError(
+          exception = intercept[AnalysisException] {
+            sql("alter table t add column s bigint default badvalue")
+          },
+          errorClass = "INVALID_DEFAULT_VALUE.UNRESOLVED_EXPRESSION",
+          parameters = Map(
+            "statement" -> "ALTER TABLE",
+            "colName" -> "`s`",
+            "defaultValue" -> "badvalue"))

Review Comment:
   This test case already reported the error before this PR, but we didn't have
any test case to prove it.



##
sql/core/src/test/scala/org/apache/spark/sql/connector/AlterTableTests.scala:
##
@@ -363,6 +363,35 @@ trait AlterTableTests extends SharedSparkSession with QueryErrorsBase {
     }
   }
 
+  test("SPARK-45075: ALTER COLUMN with invalid default value") {
+    withSQLConf(SQLConf.DEFAULT_COLUMN_ALLOWED_PROVIDERS.key -> s"$v2Format, ") {
+      withTable("t") {
+        sql(s"create table t(i boolean) using $v2Format")
+        // The default value fails to analyze.
+        checkError(
+          exception = intercept[AnalysisException] {
+            sql("alter table t add column s bigint default badvalue")
+          },
+          errorClass = "INVALID_DEFAULT_VALUE.UNRESOLVED_EXPRESSION",
+          parameters = Map(
+            "statement" -> "ALTER TABLE",
+            "colName" -> "`s`",
+            "defaultValue" -> "badvalue"))
+
+        sql("alter table t add column s bigint default 3L")
+        checkError(
+          exception = intercept[AnalysisException] {
+            sql("alter table t alter column s set default badvalue")
+          },
+          errorClass = "INVALID_DEFAULT_VALUE.UNRESOLVED_EXPRESSION",
+          parameters = Map(
+            "statement" -> "ALTER TABLE ALTER COLUMN",
+            "colName" -> "`s`",
+            "defaultValue" -> "badvalue"))

Review Comment:
   Added this test case to prove the new change works. The more negative cases in
`sql/core/src/test/scala/org/apache/spark/sql/sources/InsertSuite.scala` only
work for V1; after this PR is merged, I will move them to
`sql/core/src/test/scala/org/apache/spark/sql/SQLInsertTestSuite.scala` in
#42802






[GitHub] [spark] Hisoka-X opened a new pull request, #42810: [SPARK-45075][SQL] Fix alter table with invalid default value will not report error

2023-09-04 Thread via GitHub


Hisoka-X opened a new pull request, #42810:
URL: https://github.com/apache/spark/pull/42810

   
   
   ### What changes were proposed in this pull request?
   This PR makes sure that ALTER TABLE ALTER COLUMN with an invalid default value
on DataSource V2 reports an error; before this PR, the ALTER would succeed
silently.
   
   
   
   ### Why are the changes needed?
   Fix the incorrect behavior of the ALTER TABLE statement on DataSource V2.
   
   
   
   ### Does this PR introduce _any_ user-facing change?
   Yes, an invalid default value now reports an error.
   
   
   
   ### How was this patch tested?
   Add new test.
   
   
   
   ### Was this patch authored or co-authored using generative AI tooling?
   No.
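   For illustration, a minimal reproduction adapted from the new test (the real
test runs against a DataSource V2 format allowed via
`spark.sql.defaultColumn.allowedProviders`; `parquet` is only a stand-in here):
   
   ```
   from pyspark.sql import SparkSession
   from pyspark.errors import AnalysisException

   spark = SparkSession.builder.getOrCreate()
   spark.sql("create table t(i boolean) using parquet")
   spark.sql("alter table t add column s bigint default 3L")  # valid default succeeds
   try:
       spark.sql("alter table t alter column s set default badvalue")
   except AnalysisException as e:
       print(e.getErrorClass())  # INVALID_DEFAULT_VALUE.UNRESOLVED_EXPRESSION
   ```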
   
   





[GitHub] [spark] itholic commented on a diff in pull request #42787: [SPARK-43241][PS] `MultiIndex.append` not checking names for equality

2023-09-04 Thread via GitHub


itholic commented on code in PR #42787:
URL: https://github.com/apache/spark/pull/42787#discussion_r1315337724


##
python/pyspark/pandas/indexes/base.py:
##
@@ -1917,18 +1917,12 @@ def append(self, other: "Index") -> "Index":
         sdf_other = other._internal.spark_frame.select(other._internal.index_spark_columns)
         sdf_appended = sdf_self.union(sdf_other)
 
-        # names should be kept when MultiIndex, but Index wouldn't keep its name.
-        if isinstance(self, MultiIndex):
-            index_names = self._internal.index_names
-        else:
-            index_names = None
-
         internal = InternalFrame(
             spark_frame=sdf_appended,
             index_spark_columns=[
                 scol_for(sdf_appended, col) for col in self._internal.index_spark_column_names
             ],
-            index_names=index_names,
+            index_names=None,

Review Comment:
   > shouldn't we also mention this in our migration doc?
   
   Hmm.. I didn't mention this as a behavior change since it's a bug fix, but on
second thought maybe we'd better mention it in the migration guide anyway.
   
   Let me create a follow-up for updating the migration guide.






[GitHub] [spark] zhengruifeng commented on pull request #42787: [SPARK-43241][PS] `MultiIndex.append` not checking names for equality

2023-09-04 Thread via GitHub


zhengruifeng commented on PR #42787:
URL: https://github.com/apache/spark/pull/42787#issuecomment-1705906007

   merged to master





[GitHub] [spark] zhengruifeng closed pull request #42787: [SPARK-43241][PS] `MultiIndex.append` not checking names for equality

2023-09-04 Thread via GitHub


zhengruifeng closed pull request #42787: [SPARK-43241][PS] `MultiIndex.append` 
not checking names for equality
URL: https://github.com/apache/spark/pull/42787





[GitHub] [spark] zhengruifeng commented on a diff in pull request #42787: [SPARK-43241][PS] `MultiIndex.append` not checking names for equality

2023-09-04 Thread via GitHub


zhengruifeng commented on code in PR #42787:
URL: https://github.com/apache/spark/pull/42787#discussion_r1315324075


##
python/pyspark/pandas/indexes/base.py:
##
@@ -1917,18 +1917,12 @@ def append(self, other: "Index") -> "Index":
         sdf_other = other._internal.spark_frame.select(other._internal.index_spark_columns)
         sdf_appended = sdf_self.union(sdf_other)
 
-        # names should be kept when MultiIndex, but Index wouldn't keep its name.
-        if isinstance(self, MultiIndex):
-            index_names = self._internal.index_names
-        else:
-            index_names = None
-
         internal = InternalFrame(
             spark_frame=sdf_appended,
             index_spark_columns=[
                 scol_for(sdf_appended, col) for col in self._internal.index_spark_column_names
             ],
-            index_names=index_names,
+            index_names=None,

Review Comment:
   ok, I misunderstood the Pandas PR. LGTM






[GitHub] [spark] LuciferYang commented on pull request #42789: [SPARK-45063][PYTHON][DOCS] Refine docstring of `max_by/min_by`

2023-09-04 Thread via GitHub


LuciferYang commented on PR #42789:
URL: https://github.com/apache/spark/pull/42789#issuecomment-1705894638

   Thanks @dongjoon-hyun @zhengruifeng 





[GitHub] [spark] cloud-fan commented on a diff in pull request #42778: [SPARK-45055] [SQL] Do not transpose windows if they conflict on ORDER BY / PROJECT clauses

2023-09-04 Thread via GitHub


cloud-fan commented on code in PR #42778:
URL: https://github.com/apache/spark/pull/42778#discussion_r1315322373


##
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala:
##
@@ -1305,11 +1305,20 @@ object TransposeWindow extends Rule[LogicalPlan] {
     }
   }
 
+  // Returns false if the parent 'w1' window function orders by any of the window
+  // expressions produced by 'w2'.
+  private def compatibleOrderBy(w1: Window, w2: Window): Boolean = {
+    val childWindowExprs = w2.windowExpressions.map(x => x.toAttribute)
+    val parentOrder = w1.orderSpec.flatMap(x => x.references)
+    childWindowExprs.intersect(parentOrder).isEmpty
+  }
+
   private def windowsCompatible(w1: Window, w2: Window): Boolean = {
     w1.references.intersect(w2.windowOutputSet).isEmpty &&

Review Comment:
   no, the `Window` operator won't exclude C, as it only excludes 
`windowOutputSet`






[GitHub] [spark] cloud-fan commented on a diff in pull request #42778: [SPARK-45055] [SQL] Do not transpose windows if they conflict on ORDER BY / PROJECT clauses

2023-09-04 Thread via GitHub


cloud-fan commented on code in PR #42778:
URL: https://github.com/apache/spark/pull/42778#discussion_r1315321463


##
sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/optimizer/TransposeWindowSuite.scala:
##
@@ -160,4 +160,18 @@ class TransposeWindowSuite extends PlanTest {
     comparePlans(optimized, correctAnswer.analyze)
   }
 
+  test("two windows with overlapping project/order by lists") {

Review Comment:
   This test passes with the latest master branch.









[GitHub] [spark] zhengruifeng commented on pull request #42808: [SPARK-45073][PS][CONNECT] Replace `LastNotNull` with `Last(ignoreNulls=True)`

2023-09-04 Thread via GitHub


zhengruifeng commented on PR #42808:
URL: https://github.com/apache/spark/pull/42808#issuecomment-1705889593

   thanks, merged to master





[GitHub] [spark] zhengruifeng closed pull request #42808: [SPARK-45073][PS][CONNECT] Replace `LastNotNull` with `Last(ignoreNulls=True)`

2023-09-04 Thread via GitHub


zhengruifeng closed pull request #42808: [SPARK-45073][PS][CONNECT] Replace 
`LastNotNull` with `Last(ignoreNulls=True)`
URL: https://github.com/apache/spark/pull/42808





[GitHub] [spark] HyukjinKwon commented on a diff in pull request #42798: [SPARK-43295][PS] Support string type columns for `DataFrameGroupBy.sum`

2023-09-04 Thread via GitHub


HyukjinKwon commented on code in PR #42798:
URL: https://github.com/apache/spark/pull/42798#discussion_r1315318972


##
python/pyspark/pandas/groupby.py:
##
@@ -3534,7 +3534,12 @@ def _reduce_for_stat_function(
         for label in psdf._internal.column_labels:
             psser = psdf._psser_for(label)
             input_scol = psser._dtype_op.nan_to_null(psser).spark.column
-            output_scol = sfun(input_scol)
+            if sfun.__name__ == "sum" and isinstance(
+                psdf._internal.spark_type_for(label), StringType
+            ):
+                output_scol = F.concat_ws("", F.collect_list(input_scol))

Review Comment:
   Can we sort by natural order? We have the `compute.ordered_head` config. We can
sort by natural order, and then perform `collect_list`.






[GitHub] [spark] zhengruifeng commented on pull request #42809: [SPARK-45074][PYTHON][CONNECT] `DataFrame.{sort, sortWithinPartitions}` support column ordinals

2023-09-04 Thread via GitHub


zhengruifeng commented on PR #42809:
URL: https://github.com/apache/spark/pull/42809#issuecomment-1705887362

   cc @HyukjinKwon @dongjoon-hyun 





[GitHub] [spark] zhengruifeng opened a new pull request, #42809: [SPARK-45074][PYTHON][CONNECT] `DataFrame.{sort, sortWithinPartitions}` support column ordinals

2023-09-04 Thread via GitHub


zhengruifeng opened a new pull request, #42809:
URL: https://github.com/apache/spark/pull/42809

   ### What changes were proposed in this pull request?
   `DataFrame.{sort, sortWithinPartitions}` support column ordinals
   
   
   ### Why are the changes needed?
   for feature parity:
   
   SQL:
   ```
   select a, 1, sum(b) from v group by 1, 2 order by 3, 1;
   ```
   
   DataFrame:
   ```
   df.select("a", sf.lit(1), "b").groupBy(1, 2).agg(sf.sum("b")).sort(3, 1)
   ```
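   A runnable sketch of the sort-by-ordinal behavior (assuming a build that
includes SPARK-45074; integer arguments are treated as 1-based positions in the
output columns):
   
   ```
   from pyspark.sql import SparkSession

   spark = SparkSession.builder.getOrCreate()
   df = spark.createDataFrame([("b", 2), ("a", 3), ("c", 1)], ["x", "y"])

   # sort(2) sorts by the 2nd column, i.e. "y".
   df.sort(2).show()
   # +---+---+
   # |  x|  y|
   # +---+---+
   # |  c|  1|
   # |  b|  2|
   # |  a|  3|
   # +---+---+
   ```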
   
   
   ### Does this PR introduce _any_ user-facing change?
   yes, new feature
   
   
   ### How was this patch tested?
   added tests
   
   
   ### Was this patch authored or co-authored using generative AI tooling?
   NO
   





[GitHub] [spark] HyukjinKwon commented on a diff in pull request #42798: [SPARK-43295][PS] Support string type columns for `DataFrameGroupBy.sum`

2023-09-04 Thread via GitHub


HyukjinKwon commented on code in PR #42798:
URL: https://github.com/apache/spark/pull/42798#discussion_r1315317140


##
python/pyspark/pandas/groupby.py:
##
@@ -910,7 +910,7 @@ def sum(self, numeric_only: Optional[bool] = True, min_count: int = 0) -> FrameL
 

Review Comment:
   I think you gotta fix the doc above too, since now we support strings too?






[GitHub] [spark] LuciferYang commented on pull request #42796: [SPARK-45067][BUILD] Upgrade slf4j to 2.0.9

2023-09-04 Thread via GitHub


LuciferYang commented on PR #42796:
URL: https://github.com/apache/spark/pull/42796#issuecomment-1705882880

   Thanks @srowen 





[GitHub] [spark] HyukjinKwon commented on a diff in pull request #42787: [SPARK-43241][PS] `MultiIndex.append` not checking names for equality

2023-09-04 Thread via GitHub


HyukjinKwon commented on code in PR #42787:
URL: https://github.com/apache/spark/pull/42787#discussion_r1315315397


##
python/pyspark/pandas/indexes/base.py:
##
@@ -1917,18 +1917,12 @@ def append(self, other: "Index") -> "Index":
         sdf_other = other._internal.spark_frame.select(other._internal.index_spark_columns)
         sdf_appended = sdf_self.union(sdf_other)
 
-        # names should be kept when MultiIndex, but Index wouldn't keep its name.
-        if isinstance(self, MultiIndex):
-            index_names = self._internal.index_names
-        else:
-            index_names = None
-
         internal = InternalFrame(
             spark_frame=sdf_appended,
             index_spark_columns=[
                 scol_for(sdf_appended, col) for col in self._internal.index_spark_column_names
             ],
-            index_names=index_names,
+            index_names=None,

Review Comment:
   shouldn't we also mention this in our migration doc?






[GitHub] [spark] ulysses-you commented on a diff in pull request #42804: [SPARK-45071][SQL] Optimize the processing speed of `BinaryArithmetic#dataType` when processing multi-column data

2023-09-04 Thread via GitHub


ulysses-you commented on code in PR #42804:
URL: https://github.com/apache/spark/pull/42804#discussion_r1315314032


##
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/arithmetic.scala:
##
@@ -234,10 +234,17 @@ abstract class BinaryArithmetic extends BinaryOperator
     case _ => super.checkInputDataTypes()
   }
 
-  override def dataType: DataType = (left.dataType, right.dataType) match {
-    case (DecimalType.Fixed(p1, s1), DecimalType.Fixed(p2, s2)) =>
-      resultDecimalType(p1, s1, p2, s2)
-    case _ => left.dataType
+  override def dataType: DataType = (left, right) match {
+    case (_, r: AttributeReference) if !r.dataType.isInstanceOf[DecimalType] =>
+      left.dataType
+    case (l: AttributeReference, _) if !l.dataType.isInstanceOf[DecimalType] =>
+      left.dataType
+    case (_, _) =>
+      (left.dataType, right.dataType) match {
+        case (DecimalType.Fixed(p1, s1), DecimalType.Fixed(p2, s2)) =>
+          resultDecimalType(p1, s1, p2, s2)
+        case _ => left.dataType
+      }

Review Comment:
   +1
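   For context, a rough PySpark sketch of the workload where
`BinaryArithmetic#dataType` becomes hot (column count and names are
illustrative): resolving the schema of a projection with many arithmetic
expressions calls `dataType` on every node.
   
   ```
   from pyspark.sql import SparkSession, functions as F

   spark = SparkSession.builder.getOrCreate()
   df = spark.range(10)

   # Each `id + i` is a BinaryArithmetic node; analyzing the projection
   # calls dataType on all of them.
   cols = [(F.col("id") + i).alias(f"c{i}") for i in range(2000)]
   print(len(df.select(*cols).schema))  # triggers analysis -> prints 2000
   ```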






[GitHub] [spark] itholic commented on a diff in pull request #42787: [SPARK-43241][PS] `MultiIndex.append` not checking names for equality

2023-09-04 Thread via GitHub


itholic commented on code in PR #42787:
URL: https://github.com/apache/spark/pull/42787#discussion_r1315309392


##
python/pyspark/pandas/indexes/base.py:
##
@@ -1917,18 +1917,12 @@ def append(self, other: "Index") -> "Index":
         sdf_other = other._internal.spark_frame.select(other._internal.index_spark_columns)
         sdf_appended = sdf_self.union(sdf_other)
 
-        # names should be kept when MultiIndex, but Index wouldn't keep its name.
-        if isinstance(self, MultiIndex):
-            index_names = self._internal.index_names
-        else:
-            index_names = None
-
         internal = InternalFrame(
             spark_frame=sdf_appended,
             index_spark_columns=[
                 scol_for(sdf_appended, col) for col in self._internal.index_spark_column_names
             ],
-            index_names=index_names,
+            index_names=None,

Review Comment:
   so I thought it was a bug in Pandas that was fixed in Pandas 2.0.0.






[GitHub] [spark] itholic commented on a diff in pull request #42787: [SPARK-43241][PS] `MultiIndex.append` not checking names for equality

2023-09-04 Thread via GitHub


itholic commented on code in PR #42787:
URL: https://github.com/apache/spark/pull/42787#discussion_r1315309089


##
python/pyspark/pandas/indexes/base.py:
##
@@ -1917,18 +1917,12 @@ def append(self, other: "Index") -> "Index":
         sdf_other = other._internal.spark_frame.select(other._internal.index_spark_columns)
         sdf_appended = sdf_self.union(sdf_other)
 
-        # names should be kept when MultiIndex, but Index wouldn't keep its name.
-        if isinstance(self, MultiIndex):
-            index_names = self._internal.index_names
-        else:
-            index_names = None
-
         internal = InternalFrame(
             spark_frame=sdf_appended,
             index_spark_columns=[
                 scol_for(sdf_appended, col) for col in self._internal.index_spark_column_names
             ],
-            index_names=index_names,
+            index_names=None,

Review Comment:
   I believe it's an intentional behavior, since they mentioned this in the ["Bug 
fixes"](https://pandas.pydata.org/docs/dev/whatsnew/v2.0.0.html#bug-fixes) 
section of their release notes?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] gengliangwang commented on pull request #42766: [SPARK-45046][BUILD] Set `shadeTestJar` of `core` module to `false`

2023-09-04 Thread via GitHub


gengliangwang commented on PR #42766:
URL: https://github.com/apache/spark/pull/42766#issuecomment-1705873204

   @LuciferYang Got it, thanks!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] LuciferYang commented on pull request #42766: [SPARK-45046][BUILD] Set `shadeTestJar` of `core` module to `false`

2023-09-04 Thread via GitHub


LuciferYang commented on PR #42766:
URL: https://github.com/apache/spark/pull/42766#issuecomment-1705871748

   @gengliangwang The `connector` module has already been set to false, and the 
`core` module is changed to false in the current PR. The `streaming` module is a 
historical legacy and seems to be intentionally retained 
(https://github.com/apache/spark/commit/37a5e272f898e946c09c2e7de5d1bda6f27a8f39).
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] gengliangwang commented on pull request #42766: [SPARK-45046][BUILD] Set `shadeTestJar` of `core` module to `false`

2023-09-04 Thread via GitHub


gengliangwang commented on PR #42766:
URL: https://github.com/apache/spark/pull/42766#issuecomment-1705864189

   BTW, there were multiple places setting `shadeTestJar` as true. @LuciferYang 
could you update the unnecessary ones as well?
   
   ```
   $ git grep shadeTestJar
   connector/protobuf/pom.xml:          <shadeTestJar>false</shadeTestJar>
   core/pom.xml:          <shadeTestJar>true</shadeTestJar>
   streaming/pom.xml:          <shadeTestJar>true</shadeTestJar>
   ```


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] gengliangwang commented on pull request #42766: [SPARK-45046][BUILD] Set `shadeTestJar` of `core` module to `false`

2023-09-04 Thread via GitHub


gengliangwang commented on PR #42766:
URL: https://github.com/apache/spark/pull/42766#issuecomment-1705863504

   > @gengliangwang Do you still remember why changed shadeTestJar to false?
   
   I guess it was a mistake... 
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] itholic commented on pull request #42798: [SPARK-43295][PS] Support string type columns for `DataFrameGroupBy.sum`

2023-09-04 Thread via GitHub


itholic commented on PR #42798:
URL: https://github.com/apache/spark/pull/42798#issuecomment-1705851281

   @zhengruifeng I think the problem is that Pandas computes the concat 
without sorting, so the result can be different when the index is not sorted, 
as shown below:
   ## Problem
   
   **Pandas**
   ```python
   >>> pdf
      A  B
   4  a  1
   3  b  2
   2  c  3
   >>> pdf.sum()
   A    abc
   B      6
   dtype: object
   ```
   
   **Pandas API on Spark**
   ```python
   >>> psdf
      A  B
   4  a  1
   3  b  2
   2  c  3
   >>> psdf.sum()
   A    cba  # we internally sorted the index, so the result is different from Pandas
   B      6
   dtype: object
   ```
   
   ## Solution
   I think for now we can pick one of the three ways below:
   1. We can document a warning note as below:
   ```
   The result for a string type column is non-deterministic since the 
implementation depends on the `collect_list` API from PySpark, which is 
non-deterministic as well.
   ```
   2. We can `collect_list` both value and index, and sort by the indices 
before `concat_ws` as you suggested (see the sketch after this comment), and 
document a warning note as below:
   ```
   The result for a string type column can differ from Pandas when the index 
is not sorted: we always sort the index before concatenating, because the 
underlying `collect_list` API from PySpark is non-deterministic.
   ```
   3. We keep not supporting string type columns, as so far, and add a note on 
why we don't support them, as below:
   ```
   String type columns are not supported for now, because they might yield 
non-deterministic results, unlike in Pandas.
   ```
   
   WDYT? Also cc @HyukjinKwon, @ueshin, @xinrong-meng, what strategy do we take 
for this situation? I believe the same rules should apply to similar cases 
that already exist or may arise in the future.
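   
   For option 2, a minimal sketch of the idea in plain Spark SQL terms (not the 
actual pandas-on-Spark internals; the column names `idx` and `A` and the 
`SparkSession` named `spark` are assumptions for illustration):
   ```scala
   import org.apache.spark.sql.functions._
   import spark.implicits._
   
   val df = Seq((4, "a"), (3, "b"), (2, "c")).toDF("idx", "A")
   
   // collect (index, value) pairs, sort the pairs by index, then concatenate
   // only the values, so the result no longer depends on collect_list's order
   df.agg(
     concat_ws("",
       transform(
         array_sort(collect_list(struct($"idx", $"A"))),
         s => s.getField("A"))).alias("A_sum"))
     .show()  // deterministically "cba" for this data
   ```
   This makes the result deterministic, at the cost of still differing from 
Pandas whenever the index is not already sorted.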


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] HyukjinKwon closed pull request #42770: [SPARK-45049][CONNECT][DOCS][TESTS] Refine docstrings of `coalesce/repartition/repartitionByRange`

2023-09-04 Thread via GitHub


HyukjinKwon closed pull request #42770: [SPARK-45049][CONNECT][DOCS][TESTS] 
Refine docstrings of `coalesce/repartition/repartitionByRange`
URL: https://github.com/apache/spark/pull/42770


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] HyukjinKwon commented on pull request #42770: [SPARK-45049][CONNECT][DOCS][TESTS] Refine docstrings of `coalesce/repartition/repartitionByRange`

2023-09-04 Thread via GitHub


HyukjinKwon commented on PR #42770:
URL: https://github.com/apache/spark/pull/42770#issuecomment-1705847362

   Merged to master.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] zekai-li commented on pull request #42529: [SPARK-44845][YARN][DEPLOY] Fix file system uri comparison function

2023-09-04 Thread via GitHub


zekai-li commented on PR #42529:
URL: https://github.com/apache/spark/pull/42529#issuecomment-1705837097

   > Hi, @zekai-li .
   > 
   > It seems that you didn't address the previous review comment yet. WDYT?
   > 
   > [#42529 
(comment)](https://github.com/apache/spark/pull/42529#discussion_r1306701075)
   Sorry, I didn't notice. Whether to keep using `equalsIgnoreCase` is 
something you would need to decide.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] zekai-li closed pull request #42529: [SPARK-44845][YARN][DEPLOY] Fix file system uri comparison function

2023-09-04 Thread via GitHub


zekai-li closed pull request #42529: [SPARK-44845][YARN][DEPLOY] Fix file 
system uri comparison function
URL: https://github.com/apache/spark/pull/42529


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] zekai-li commented on pull request #42529: [SPARK-44845][YARN][DEPLOY] Fix file system uri comparison function

2023-09-04 Thread via GitHub


zekai-li commented on PR #42529:
URL: https://github.com/apache/spark/pull/42529#issuecomment-1705835097

   > Hi, @zekai-li .
   > 
   > It seems that you didn't address the previous review comment yet. WDYT?
   > 
   > [#42529 
(comment)](https://github.com/apache/spark/pull/42529#discussion_r1306701075)
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] zekai-li commented on a diff in pull request #42529: [SPARK-44845][YARN][DEPLOY] Fix file system uri comparison function

2023-09-04 Thread via GitHub


zekai-li commented on code in PR #42529:
URL: https://github.com/apache/spark/pull/42529#discussion_r1315276443


##
resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/Client.scala:
##
@@ -1618,9 +1618,9 @@ private[spark] object Client extends Logging {
   return false
 }
 
-val srcAuthority = srcUri.getAuthority()
-val dstAuthority = dstUri.getAuthority()
-if (srcAuthority != null && !srcAuthority.equalsIgnoreCase(dstAuthority)) {
+val srcUserInfo = srcUri.getUserInfo()
+val dstUserInfo = dstUri.getUserInfo()
+if (srcUserInfo != null && !srcUserInfo.equalsIgnoreCase(dstUserInfo)) {

Review Comment:
   Last PR discussion: https://github.com/apache/spark/pull/19885#discussion_r154819072




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] zekai-li commented on a diff in pull request #42529: [SPARK-44845][YARN][DEPLOY] Fix file system uri comparison function

2023-09-04 Thread via GitHub


zekai-li commented on code in PR #42529:
URL: https://github.com/apache/spark/pull/42529#discussion_r1315274264


##
resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/Client.scala:
##
@@ -1618,9 +1618,9 @@ private[spark] object Client extends Logging {
   return false
 }
 
-val srcAuthority = srcUri.getAuthority()
-val dstAuthority = dstUri.getAuthority()
-if (srcAuthority != null && !srcAuthority.equalsIgnoreCase(dstAuthority)) {
+val srcUserInfo = srcUri.getUserInfo()
+val dstUserInfo = dstUri.getUserInfo()
+if (srcUserInfo != null && !srcUserInfo.equalsIgnoreCase(dstUserInfo)) {

Review Comment:
   We don't need to do this adaptation for the user!



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] zekai-li commented on a diff in pull request #42529: [SPARK-44845][YARN][DEPLOY] Fix file system uri comparison function

2023-09-04 Thread via GitHub


zekai-li commented on code in PR #42529:
URL: https://github.com/apache/spark/pull/42529#discussion_r1315273750


##
resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/Client.scala:
##
@@ -1618,9 +1618,9 @@ private[spark] object Client extends Logging {
   return false
 }
 
-val srcAuthority = srcUri.getAuthority()
-val dstAuthority = dstUri.getAuthority()
-if (srcAuthority != null && !srcAuthority.equalsIgnoreCase(dstAuthority)) {
+val srcUserInfo = srcUri.getUserInfo()
+val dstUserInfo = dstUri.getUserInfo()
+if (srcUserInfo != null && !srcUserInfo.equalsIgnoreCase(dstUserInfo)) {

Review Comment:
   srcUserInfo == null && dstUserInfo != null is fine in this case. As for 
`equalsIgnoreCase`, I'm not sure whether it is correct, but I'm following the 
same logic as before. Using `equals` would be more rigorous, though.
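   
   For what it's worth, a small sketch of what the patched check compares (the 
URIs here are made up):
   ```scala
   import java.net.URI
   
   val srcUri = new URI("hdfs://Alice@namenode:8020/tmp/a")
   val dstUri = new URI("hdfs://alice@namenode:8020/tmp/a")
   
   // getAuthority covers user info, host and port together; getUserInfo
   // isolates the part this patch compares case-insensitively
   val srcUserInfo = srcUri.getUserInfo  // "Alice"
   val dstUserInfo = dstUri.getUserInfo  // "alice"
   val sameUser = srcUserInfo != null && srcUserInfo.equalsIgnoreCase(dstUserInfo)  // true
   ```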






-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] zhengruifeng commented on a diff in pull request #42787: [SPARK-43241][PS] `MultiIndex.append` not checking names for equality

2023-09-04 Thread via GitHub


zhengruifeng commented on code in PR #42787:
URL: https://github.com/apache/spark/pull/42787#discussion_r1315255478


##
python/pyspark/pandas/indexes/base.py:
##
@@ -1917,18 +1917,12 @@ def append(self, other: "Index") -> "Index":
 sdf_other = 
other._internal.spark_frame.select(other._internal.index_spark_columns)
 sdf_appended = sdf_self.union(sdf_other)
 
-# names should be kept when MultiIndex, but Index wouldn't keep its 
name.
-if isinstance(self, MultiIndex):
-index_names = self._internal.index_names
-else:
-index_names = None
-
 internal = InternalFrame(
 spark_frame=sdf_appended,
 index_spark_columns=[
 scol_for(sdf_appended, col) for col in 
self._internal.index_spark_column_names
 ],
-index_names=index_names,
+index_names=None,

Review Comment:
   is it a bug in Pandas that might be fixed in the future?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] wangyum commented on a diff in pull request #42804: [SPARK-45071][SQL] Optimize the processing speed of `BinaryArithmetic#dataType` when processing multi-column data

2023-09-04 Thread via GitHub


wangyum commented on code in PR #42804:
URL: https://github.com/apache/spark/pull/42804#discussion_r1315255094


##
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/arithmetic.scala:
##
@@ -234,10 +234,17 @@ abstract class BinaryArithmetic extends BinaryOperator
 case _ => super.checkInputDataTypes()
   }
 
-  override def dataType: DataType = (left.dataType, right.dataType) match {
-case (DecimalType.Fixed(p1, s1), DecimalType.Fixed(p2, s2)) =>
-  resultDecimalType(p1, s1, p2, s2)
-case _ => left.dataType
+  override def dataType: DataType = (left, right) match {
+case (_, r: AttributeReference) if !r.dataType.isInstanceOf[DecimalType] =>
+  left.dataType
+case (l: AttributeReference, _) if !l.dataType.isInstanceOf[DecimalType] =>
+  left.dataType
+case (_, _) =>
+  (left.dataType, right.dataType) match {
+case (DecimalType.Fixed(p1, s1), DecimalType.Fixed(p2, s2)) =>
+  resultDecimalType(p1, s1, p2, s2)
+case _ => left.dataType
+  }

Review Comment:
   Can we just change it to lazy val? For example:
   ```scala
 private lazy val internalDataType: DataType = {
   (left.dataType, right.dataType) match {
 case (DecimalType.Fixed(p1, s1), DecimalType.Fixed(p2, s2)) =>
   resultDecimalType(p1, s1, p2, s2)
 case _ => left.dataType
   }
 }
   
 override def dataType: DataType = internalDataType
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] itholic commented on a diff in pull request #42787: [SPARK-43241][PS] `MultiIndex.append` not checking names for equality

2023-09-04 Thread via GitHub


itholic commented on code in PR #42787:
URL: https://github.com/apache/spark/pull/42787#discussion_r1315254128


##
python/pyspark/pandas/indexes/base.py:
##
@@ -1917,18 +1917,12 @@ def append(self, other: "Index") -> "Index":
 sdf_other = 
other._internal.spark_frame.select(other._internal.index_spark_columns)
 sdf_appended = sdf_self.union(sdf_other)
 
-# names should be kept when MultiIndex, but Index wouldn't keep its 
name.
-if isinstance(self, MultiIndex):
-index_names = self._internal.index_names
-else:
-index_names = None
-
 internal = InternalFrame(
 spark_frame=sdf_appended,
 index_spark_columns=[
 scol_for(sdf_appended, col) for col in 
self._internal.index_spark_column_names
 ],
-index_names=index_names,
+index_names=None,

Review Comment:
   We can simply set the `index_names` to `None` to follow the behavior of 
Pandas, since Pandas doesn't keep the names of a `MultiIndex` when computing 
`append` as of Pandas 2.0.0. (See 
https://github.com/pandas-dev/pandas/pull/48288 for more detail.)
   
   cc @zhengruifeng @HyukjinKwon as CI passed.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] itholic commented on pull request #40436: [SPARK-42619][PS] Add `show_counts` parameter for DataFrame.info

2023-09-04 Thread via GitHub


itholic commented on PR #40436:
URL: https://github.com/apache/spark/pull/40436#issuecomment-1705803526

   Thanks, @dzhigimont :-)


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] zzzzming95 commented on pull request #42804: [SPARK-45071][SQL] Optimize the processing speed of `BinaryArithmetic#dataType` when processing multi-column data

2023-09-04 Thread via GitHub


zzzzming95 commented on PR #42804:
URL: https://github.com/apache/spark/pull/42804#issuecomment-1705803286

   @ulysses-you  cc


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] zhengruifeng commented on pull request #40436: [SPARK-42619][PS] Add `show_counts` parameter for DataFrame.info

2023-09-04 Thread via GitHub


zhengruifeng commented on PR #40436:
URL: https://github.com/apache/spark/pull/40436#issuecomment-1705801677

   thanks, merged to master


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] zhengruifeng closed pull request #40436: [SPARK-42619][PS] Add `show_counts` parameter for DataFrame.info

2023-09-04 Thread via GitHub


zhengruifeng closed pull request #40436: [SPARK-42619][PS] Add `show_counts` 
parameter for DataFrame.info
URL: https://github.com/apache/spark/pull/40436


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] zhengruifeng commented on pull request #42808: [SPARK-45073][PS][CONNECT] Replace `LastNotNull` with `Last(ignoreNulls=True)`

2023-09-04 Thread via GitHub


zhengruifeng commented on PR #42808:
URL: https://github.com/apache/spark/pull/42808#issuecomment-1705793177

   CI link: https://github.com/zhengruifeng/spark/actions/runs/6072210495
   
   all tests passed


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] zhengruifeng commented on pull request #42808: [SPARK-45073][PS][CONNECT] Replace `LastNotNull` with `Last(ignoreNulls=True)`

2023-09-04 Thread via GitHub


zhengruifeng commented on PR #42808:
URL: https://github.com/apache/spark/pull/42808#issuecomment-1705789957

   cc @HyukjinKwon @itholic 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] zhengruifeng opened a new pull request, #42808: [SPARK-45073][PS][CONNECT] Replace `LastNotNull` with `Last(ignoreNulls=True)`

2023-09-04 Thread via GitHub


zhengruifeng opened a new pull request, #42808:
URL: https://github.com/apache/spark/pull/42808

   ### What changes were proposed in this pull request?
   Replace `LastNotNull` with `Last(ignoreNulls=True)`
   
   
   ### Why are the changes needed?
   
   https://github.com/apache/spark/pull/36127 introduced a PS-dedicated 
expression `LastNotNull`, which is actually not needed and can be replaced 
with the built-in `Last`.
   
   ### Does this PR introduce _any_ user-facing change?
   NO
   
   
   ### How was this patch tested?
   CI
   
   ### Was this patch authored or co-authored using generative AI tooling?
   NO
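   
   For reference, a minimal sketch of the user-facing equivalent (hypothetical 
data; `spark` is an existing SparkSession):
   ```scala
   import org.apache.spark.sql.functions._
   import spark.implicits._
   
   val df = Seq(("x", Some(1)), ("x", None), ("x", Some(3))).toDF("k", "v")
   
   // the built-in `last` with ignoreNulls = true skips nulls, which is the
   // behavior the dedicated LastNotNull expression provided
   df.groupBy($"k").agg(last($"v", ignoreNulls = true).alias("last_v")).show()
   ```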


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] Hisoka-X commented on pull request #42800: [SPARK-45059][CONNECT][FOLLOWUP] Remove `try_reflect` problem filter rule in connect

2023-09-04 Thread via GitHub


Hisoka-X commented on PR #42800:
URL: https://github.com/apache/spark/pull/42800#issuecomment-1705788141

   Thanks @LuciferYang @zhengruifeng @dongjoon-hyun 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] zhengruifeng commented on pull request #42794: [SPARK-45066][SQL][PYTHON][CONNECT] Make function `repeat` accept column-type `n`

2023-09-04 Thread via GitHub


zhengruifeng commented on PR #42794:
URL: https://github.com/apache/spark/pull/42794#issuecomment-1705777234

   @dongjoon-hyun thanks for review


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] zhengruifeng commented on pull request #42791: [SPARK-45064][PYTHON][CONNECT] Add the missing `scale` parameter in `ceil/ceiling`

2023-09-04 Thread via GitHub


zhengruifeng commented on PR #42791:
URL: https://github.com/apache/spark/pull/42791#issuecomment-1705777103

   @dongjoon-hyun thanks for review


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] zhengruifeng commented on pull request #42767: [SPARK-45047][PYTHON][CONNECT] `DataFrame.groupBy` support ordinals

2023-09-04 Thread via GitHub


zhengruifeng commented on PR #42767:
URL: https://github.com/apache/spark/pull/42767#issuecomment-1705776902

   @HyukjinKwon @dongjoon-hyun thanks for review


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] dongjoon-hyun commented on pull request #41808: [SPARK-44162][CORE] Support G1GC in spark metrics

2023-09-04 Thread via GitHub


dongjoon-hyun commented on PR #41808:
URL: https://github.com/apache/spark/pull/41808#issuecomment-1705759712

   Thank you, @Hisoka-X , @mridulm and all.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] dongjoon-hyun closed pull request #42791: [SPARK-45064][PYTHON][CONNECT] Add the missing `scale` parameter in `ceil/ceiling`

2023-09-04 Thread via GitHub


dongjoon-hyun closed pull request #42791: [SPARK-45064][PYTHON][CONNECT] Add 
the missing `scale` parameter in `ceil/ceiling`
URL: https://github.com/apache/spark/pull/42791


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] dongjoon-hyun closed pull request #42767: [SPARK-45047][PYTHON][CONNECT] `DataFrame.groupBy` support ordinals

2023-09-04 Thread via GitHub


dongjoon-hyun closed pull request #42767: [SPARK-45047][PYTHON][CONNECT] 
`DataFrame.groupBy` support ordinals
URL: https://github.com/apache/spark/pull/42767


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] dongjoon-hyun commented on pull request #42800: [SPARK-45059][CONNECT][FOLLOWUP] Remove `try_reflect` problem filter rule in connect

2023-09-04 Thread via GitHub


dongjoon-hyun commented on PR #42800:
URL: https://github.com/apache/spark/pull/42800#issuecomment-1705734102

   Merged to master for Apache Spark 4.0.0.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] dongjoon-hyun closed pull request #42800: [SPARK-45059][CONNECT][FOLLOWUP] Remove `try_reflect` problem filter rule in connect

2023-09-04 Thread via GitHub


dongjoon-hyun closed pull request #42800: [SPARK-45059][CONNECT][FOLLOWUP] 
Remove `try_reflect` problem filter rule in connect
URL: https://github.com/apache/spark/pull/42800


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] agubichev commented on a diff in pull request #42778: [SPARK-45055] [SQL] Do not transpose windows if they conflict on ORDER BY / PROJECT clauses

2023-09-04 Thread via GitHub


agubichev commented on code in PR #42778:
URL: https://github.com/apache/spark/pull/42778#discussion_r1315214615


##
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala:
##
@@ -1305,11 +1305,20 @@ object TransposeWindow extends Rule[LogicalPlan] {
 }
   }
 
+  // Returns false if the parent 'w1' window function order by any of the 
window
+  // expressions produced by 'w2'.
+  private def compatibleOrderBy(w1: Window, w2: Window): Boolean = {
+val childWindowExprs = w2.windowExpressions.map(x => x.toAttribute)
+val parentOrder = w1.orderSpec.flatMap(x => x.references)
+childWindowExprs.intersect(parentOrder).isEmpty
+  }
+
   private def windowsCompatible(w1: Window, w2: Window): Boolean = {
 w1.references.intersect(w2.windowOutputSet).isEmpty &&

Review Comment:
   It's a subtle bug.
   `references` by definition do not include output attributes (see 
https://github.com/apache/spark/blob/798fce3b571907ee52058004cc38c2e8dbc4b016/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/QueryPlan.scala#L91).
   So if the top Window function orders by C and projects (outputs) C, then C 
is not included in references.
   Therefore, we need that extra check.
   (And we can't just take all attributes in the Window operator either, since 
PARTITION BY clauses do need to intersect.)
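   
   A minimal shape of the conflict (table and column names are made up): the 
outer window orders by `rn`, which the inner window produces, so transposing 
the two would change the result:
   ```scala
   // assumes a SparkSession `spark` is in scope
   val q = spark.sql("""
     SELECT a, b,
            rank() OVER (PARTITION BY a ORDER BY rn) AS r
     FROM (
       SELECT a, b,
              row_number() OVER (PARTITION BY a, b ORDER BY b) AS rn
       FROM VALUES (1, 1), (1, 2), (2, 1) AS t(a, b))
   """)
   q.explain(true)  // TransposeWindow must leave these two Window nodes alone
   ```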
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] dongjoon-hyun closed pull request #42794: [SPARK-45066][SQL][PYTHON][CONNECT] Make function `repeat` accept column-type `n`

2023-09-04 Thread via GitHub


dongjoon-hyun closed pull request #42794: [SPARK-45066][SQL][PYTHON][CONNECT] 
Make function `repeat` accept column-type `n`
URL: https://github.com/apache/spark/pull/42794


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] dongjoon-hyun commented on a diff in pull request #42801: [SPARK-45070][SQL][DOCS] Describe the binary and datetime formats of `to_char`/`to_varchar`

2023-09-04 Thread via GitHub


dongjoon-hyun commented on code in PR #42801:
URL: https://github.com/apache/spark/pull/42801#discussion_r1315213044


##
connector/connect/client/jvm/src/main/scala/org/apache/spark/sql/functions.scala:
##
@@ -4284,12 +4285,22 @@ object functions {
 *   prints '+' for positive values but 'MI' prints a space. 'PR': Only allowed at the
 *   end of the format string; specifies that the result string will be wrapped by angle
 *   brackets if the input value is negative. 
+   *   If `e` is a datetime, `format` shall be a valid datetime pattern, see
+   *   <a href="https://spark.apache.org/docs/latest/sql-ref-datetime-pattern.html">Datetime Patterns</a>.
+   *   If `e` is a binary, it is converted to a string in one of the formats:
+   *   <ul>
+   *     <li>'base64': a base 64 string.</li>
+   *     <li>'hex': a string in the hexadecimal format.</li>
+   *     <li>'utf-8': the input binary is decoded to UTF-8 string.</li>
+   *   </ul>

Review Comment:
   To @MaxGekk , it seems our Linter CI doesn't allow this kind of 
human-friendly formatting in the comment.
   
   - https://github.com/MaxGekk/spark/actions/runs/6076537960/job/16484751098
   
   ```
   The scalafmt check failed on connector/connect at following occurrences:
   
   Requires formatting: functions.scala
   
   Before submitting your change, please make sure to format your code using 
the following command:
   ./build/mvn -Pscala-2.12 scalafmt:format -Dscalafmt.skip=false 
-Dscalafmt.validateOnly=false -Dscalafmt.changedOnly=false -pl 
connector/connect/common -pl connector/connect/server -pl 
connector/connect/client/jvm
   Error: Process completed with exit code 1.
   ```
   
   The recommended style is something like the following. Please fix the style 
by running the above command line.
   ```
   -   *   brackets if the input value is negative. 
   -   *   If `e` is a datetime, `format` shall be a valid datetime pattern, see
   -   *   <a href="https://spark.apache.org/docs/latest/sql-ref-datetime-pattern.html">Datetime Patterns</a>.
   -   *   If `e` is a binary, it is converted to a string in one of the formats:
   -   *   <ul>
   -   *     <li>'base64': a base 64 string.</li>
   -   *     <li>'hex': a string in the hexadecimal format.</li>
   -   *     <li>'utf-8': the input binary is decoded to UTF-8 string.</li>
   -   *   </ul>
   +   *   brackets if the input value is negative. If `e` is a datetime, `format` shall be
   +   *   a valid datetime pattern, see <a href="https://spark.apache.org/docs/latest/sql-ref-datetime-pattern.html">Datetime
   +   *   Patterns</a>. If `e` is a binary, it is converted to a string in one of the formats: <ul>
   +   *   <li>'base64': a base 64 string.</li> <li>'hex': a string in the hexadecimal format.</li>
   +   *   <li>'utf-8': the input binary is decoded to UTF-8 string.</li> </ul>
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org




[GitHub] [spark] dongjoon-hyun commented on pull request #42789: [SPARK-45063][PYTHON][DOCS] Refine docstring of `max_by/min_by`

2023-09-04 Thread via GitHub


dongjoon-hyun commented on PR #42789:
URL: https://github.com/apache/spark/pull/42789#issuecomment-1705721562

   Merged to master for Apache Spark 4.0.0. Thank you, @LuciferYang and 
@zhengruifeng .


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] dongjoon-hyun closed pull request #42789: [SPARK-45063][PYTHON][DOCS] Refine docstring of `max_by/min_by`

2023-09-04 Thread via GitHub


dongjoon-hyun closed pull request #42789: [SPARK-45063][PYTHON][DOCS] Refine 
docstring of `max_by/min_by`
URL: https://github.com/apache/spark/pull/42789


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] dongjoon-hyun closed pull request #42749: [SPARK-45031][INFRA] Choose the right merge code path and merge hash for reopened PRs

2023-09-04 Thread via GitHub


dongjoon-hyun closed pull request #42749: [SPARK-45031][INFRA] Choose the right 
merge code path and merge hash for reopened PRs
URL: https://github.com/apache/spark/pull/42749


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] sadikovi commented on pull request #42667: [SPARK-44940][SQL] Improve performance of JSON parsing when "spark.sql.json.enablePartialResults" is enabled

2023-09-04 Thread via GitHub


sadikovi commented on PR #42667:
URL: https://github.com/apache/spark/pull/42667#issuecomment-1705718755

   Great, thanks. I will follow up on your comment regarding the possible 
regression in the parsing benchmark.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] sadikovi commented on a diff in pull request #42667: [SPARK-44940][SQL] Improve performance of JSON parsing when "spark.sql.json.enablePartialResults" is enabled

2023-09-04 Thread via GitHub


sadikovi commented on code in PR #42667:
URL: https://github.com/apache/spark/pull/42667#discussion_r1315208847


##
sql/core/benchmarks/JsonBenchmark-results.txt:
##
@@ -3,121 +3,125 @@ Benchmark for performance of JSON parsing
 

 
 Preparing data for benchmarking ...
-OpenJDK 64-Bit Server VM 1.8.0_362-b09 on Linux 5.15.0-1037-azure
-Intel(R) Xeon(R) CPU E5-2673 v3 @ 2.40GHz
+OpenJDK 64-Bit Server VM 1.8.0_292-8u292-b10-0ubuntu1~18.04-b10 on Linux 5.4.0-1045-aws
+Intel(R) Xeon(R) Platinum 8375C CPU @ 2.90GHz
 JSON schema inferring:        Best Time(ms)   Avg Time(ms)   Stdev(ms)   Rate(M/s)   Per Row(ns)   Relative
 -----------------------------------------------------------------------------------------------------------
-No encoding                            3720           3843         121         1.3         743.9       1.0X
-UTF-8 is set                           5412           5455          45         0.9        1082.4       0.7X
+No encoding                            2084           2134          46         2.4         416.8       1.0X
+UTF-8 is set                           3077           3093          14         1.6         615.3       0.7X

 Preparing data for benchmarking ...
-OpenJDK 64-Bit Server VM 1.8.0_362-b09 on Linux 5.15.0-1037-azure
-Intel(R) Xeon(R) CPU E5-2673 v3 @ 2.40GHz
+OpenJDK 64-Bit Server VM 1.8.0_292-8u292-b10-0ubuntu1~18.04-b10 on Linux 5.4.0-1045-aws
+Intel(R) Xeon(R) Platinum 8375C CPU @ 2.90GHz
 count a short column:         Best Time(ms)   Avg Time(ms)   Stdev(ms)   Rate(M/s)   Per Row(ns)   Relative
 -----------------------------------------------------------------------------------------------------------
-No encoding                            3234           3254          33         1.5         646.7       1.0X
-UTF-8 is set                           4847           4868          21         1.0         969.5       0.7X
+No encoding                            2854           2863           8         1.8         570.8       1.0X
+UTF-8 is set                           4066           4066           1         1.2         813.1       0.7X

 Preparing data for benchmarking ...
-OpenJDK 64-Bit Server VM 1.8.0_362-b09 on Linux 5.15.0-1037-azure
-Intel(R) Xeon(R) CPU E5-2673 v3 @ 2.40GHz
+OpenJDK 64-Bit Server VM 1.8.0_292-8u292-b10-0ubuntu1~18.04-b10 on Linux 5.4.0-1045-aws
+Intel(R) Xeon(R) Platinum 8375C CPU @ 2.90GHz
 count a wide column:          Best Time(ms)   Avg Time(ms)   Stdev(ms)   Rate(M/s)   Per Row(ns)   Relative
 -----------------------------------------------------------------------------------------------------------
-No encoding                            5702           5794         101         0.2        5702.1       1.0X
-UTF-8 is set                           9526           9607          73         0.1        9526.1       0.6X
+No encoding                            3348           3368          26         0.3        3347.8       1.0X
+UTF-8 is set                           5215           5239          22         0.2        5214.7       0.6X

 Preparing data for benchmarking ...
-OpenJDK 64-Bit Server VM 1.8.0_362-b09 on Linux 5.15.0-1037-azure
-Intel(R) Xeon(R) CPU E5-2673 v3 @ 2.40GHz
+OpenJDK 64-Bit Server VM 1.8.0_292-8u292-b10-0ubuntu1~18.04-b10 on Linux 5.4.0-1045-aws
+Intel(R) Xeon(R) Platinum 8375C CPU @ 2.90GHz
 select wide row:              Best Time(ms)   Avg Time(ms)   Stdev(ms)   Rate(M/s)   Per Row(ns)   Relative
 -----------------------------------------------------------------------------------------------------------
-No encoding                           18318          18448         199         0.0      366367.7       1.0X
-UTF-8 is set                          19791          19887          99         0.0      395817.1       0.9X
+No encoding                           11046          11102          54         0.0      220928.4       1.0X
+UTF-8 is set                          12135          12181          54         0.0      242697.4       0.9X

 Preparing data for benchmarking ...
-OpenJDK 64-Bit Server VM 1.8.0_362-b09 on Linux 5.15.0-1037-azure
-Intel(R) Xeon(R) CPU E5-2673 v3 @ 2.40GHz
+OpenJDK 64-Bit Server VM 1.8.0_292-8u292-b10-0ubuntu1~18.04-b10 on Linux 5.4.0-1045-aws
+Intel(R) Xeon(R) Platinum 8375C CPU @ 2.90GHz
 Select a subset of 10 columns:   Best Time(ms)   Avg Time(ms)   Stdev(ms)   Rate(M/s)   Per Row(ns)   Relative

-Select 10 columns

[GitHub] [spark] dongjoon-hyun closed pull request #42792: [SPARK-44940][SQL][3.4] Improve performance of JSON parsing when "spark.sql.json.enablePartialResults" is enabled

2023-09-04 Thread via GitHub


dongjoon-hyun closed pull request #42792: [SPARK-44940][SQL][3.4] Improve 
performance of JSON parsing when "spark.sql.json.enablePartialResults" is 
enabled
URL: https://github.com/apache/spark/pull/42792


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] dongjoon-hyun closed pull request #42790: [SPARK-44940][SQL][3.5] Improve performance of JSON parsing when "spark.sql.json.enablePartialResults" is enabled

2023-09-04 Thread via GitHub


dongjoon-hyun closed pull request #42790: [SPARK-44940][SQL][3.5] Improve 
performance of JSON parsing when "spark.sql.json.enablePartialResults" is 
enabled
URL: https://github.com/apache/spark/pull/42790


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] dongjoon-hyun commented on a diff in pull request #42667: [SPARK-44940][SQL] Improve performance of JSON parsing when "spark.sql.json.enablePartialResults" is enabled

2023-09-04 Thread via GitHub


dongjoon-hyun commented on code in PR #42667:
URL: https://github.com/apache/spark/pull/42667#discussion_r1315206808


##
sql/core/benchmarks/JsonBenchmark-results.txt:
##
@@ -3,121 +3,125 @@ Benchmark for performance of JSON parsing
 

 
 Preparing data for benchmarking ...
-OpenJDK 64-Bit Server VM 1.8.0_362-b09 on Linux 5.15.0-1037-azure
-Intel(R) Xeon(R) CPU E5-2673 v3 @ 2.40GHz
+OpenJDK 64-Bit Server VM 1.8.0_292-8u292-b10-0ubuntu1~18.04-b10 on Linux 5.4.0-1045-aws
+Intel(R) Xeon(R) Platinum 8375C CPU @ 2.90GHz
 JSON schema inferring:        Best Time(ms)   Avg Time(ms)   Stdev(ms)   Rate(M/s)   Per Row(ns)   Relative
 -----------------------------------------------------------------------------------------------------------
-No encoding                            3720           3843         121         1.3         743.9       1.0X
-UTF-8 is set                           5412           5455          45         0.9        1082.4       0.7X
+No encoding                            2084           2134          46         2.4         416.8       1.0X
+UTF-8 is set                           3077           3093          14         1.6         615.3       0.7X

 Preparing data for benchmarking ...
-OpenJDK 64-Bit Server VM 1.8.0_362-b09 on Linux 5.15.0-1037-azure
-Intel(R) Xeon(R) CPU E5-2673 v3 @ 2.40GHz
+OpenJDK 64-Bit Server VM 1.8.0_292-8u292-b10-0ubuntu1~18.04-b10 on Linux 5.4.0-1045-aws
+Intel(R) Xeon(R) Platinum 8375C CPU @ 2.90GHz
 count a short column:         Best Time(ms)   Avg Time(ms)   Stdev(ms)   Rate(M/s)   Per Row(ns)   Relative
 -----------------------------------------------------------------------------------------------------------
-No encoding                            3234           3254          33         1.5         646.7       1.0X
-UTF-8 is set                           4847           4868          21         1.0         969.5       0.7X
+No encoding                            2854           2863           8         1.8         570.8       1.0X
+UTF-8 is set                           4066           4066           1         1.2         813.1       0.7X

 Preparing data for benchmarking ...
-OpenJDK 64-Bit Server VM 1.8.0_362-b09 on Linux 5.15.0-1037-azure
-Intel(R) Xeon(R) CPU E5-2673 v3 @ 2.40GHz
+OpenJDK 64-Bit Server VM 1.8.0_292-8u292-b10-0ubuntu1~18.04-b10 on Linux 5.4.0-1045-aws
+Intel(R) Xeon(R) Platinum 8375C CPU @ 2.90GHz
 count a wide column:          Best Time(ms)   Avg Time(ms)   Stdev(ms)   Rate(M/s)   Per Row(ns)   Relative
 -----------------------------------------------------------------------------------------------------------
-No encoding                            5702           5794         101         0.2        5702.1       1.0X
-UTF-8 is set                           9526           9607          73         0.1        9526.1       0.6X
+No encoding                            3348           3368          26         0.3        3347.8       1.0X
+UTF-8 is set                           5215           5239          22         0.2        5214.7       0.6X

 Preparing data for benchmarking ...
-OpenJDK 64-Bit Server VM 1.8.0_362-b09 on Linux 5.15.0-1037-azure
-Intel(R) Xeon(R) CPU E5-2673 v3 @ 2.40GHz
+OpenJDK 64-Bit Server VM 1.8.0_292-8u292-b10-0ubuntu1~18.04-b10 on Linux 5.4.0-1045-aws
+Intel(R) Xeon(R) Platinum 8375C CPU @ 2.90GHz
 select wide row:              Best Time(ms)   Avg Time(ms)   Stdev(ms)   Rate(M/s)   Per Row(ns)   Relative
 -----------------------------------------------------------------------------------------------------------
-No encoding                           18318          18448         199         0.0      366367.7       1.0X
-UTF-8 is set                          19791          19887          99         0.0      395817.1       0.9X
+No encoding                           11046          11102          54         0.0      220928.4       1.0X
+UTF-8 is set                          12135          12181          54         0.0      242697.4       0.9X

 Preparing data for benchmarking ...
-OpenJDK 64-Bit Server VM 1.8.0_362-b09 on Linux 5.15.0-1037-azure
-Intel(R) Xeon(R) CPU E5-2673 v3 @ 2.40GHz
+OpenJDK 64-Bit Server VM 1.8.0_292-8u292-b10-0ubuntu1~18.04-b10 on Linux 5.4.0-1045-aws
+Intel(R) Xeon(R) Platinum 8375C CPU @ 2.90GHz
 Select a subset of 10 columns:   Best Time(ms)   Avg Time(ms)   Stdev(ms)   Rate(M/s)   Per Row(ns)   Relative

-Select 10 columns

[GitHub] [spark] dongjoon-hyun commented on a diff in pull request #42667: [SPARK-44940][SQL] Improve performance of JSON parsing when "spark.sql.json.enablePartialResults" is enabled

2023-09-04 Thread via GitHub


dongjoon-hyun commented on code in PR #42667:
URL: https://github.com/apache/spark/pull/42667#discussion_r1315205770


##
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/json/JacksonParser.scala:
##
@@ -589,12 +605,25 @@ class JacksonParser(
 throw BadRecordException(
   record = () => recordLiteral(record),
   partialResults = () => Array(row),
-  cause)
+  convertCauseForPartialResult(cause))
   case PartialResultArrayException(rows, cause) =>
 throw BadRecordException(
   record = () => recordLiteral(record),
   partialResults = () => rows,
   cause)
+  // These exceptions should never be thrown outside of JacksonParser.
+  // They are used for the control flow in the parser. We add them here for completeness
+  // since they also indicate a bad record.

Review Comment:
   Thank you for adding this.
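
   For readers following along, a minimal, runnable sketch of the pattern the 
quoted comment describes: internal exceptions carry partial results through 
the parser's control flow and are converted into the externally visible error 
at the record boundary. This is NOT the actual JacksonParser code; the 
exception names are simplified stand-ins for Spark's internal ones.

   ```scala
   // Internal control-flow exception: never escapes the parser.
   case class PartialValueException(partial: Seq[Any], cause: Throwable)
     extends Exception(cause)

   // Externally visible "bad record" error, still carrying partial results.
   case class BadRecordError(record: String, partial: Seq[Any], cause: Throwable)
     extends Exception(cause)

   object PartialResultSketch {
     // Parses a comma-separated record of ints; on the first bad field it
     // throws the internal exception carrying what was parsed so far.
     private def parseRecord(record: String): Seq[Any] = {
       val parsed = scala.collection.mutable.ArrayBuffer.empty[Any]
       record.split(",").foreach { raw =>
         try parsed += raw.trim.toInt
         catch {
           case e: NumberFormatException =>
             throw PartialValueException(parsed.toSeq, e)
         }
       }
       parsed.toSeq
     }

     // Record boundary: convert the internal exception into the external
     // error -- the same conversion the quoted diff performs.
     def parse(record: String): Seq[Any] =
       try parseRecord(record)
       catch {
         case PartialValueException(partial, cause) =>
           throw BadRecordError(record, partial, cause)
       }
   }
   ```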



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] dongjoon-hyun closed pull request #42757: [SPARK-45036][SQL] SPJ: Simplify the logic to handle partially clustered distribution

2023-09-04 Thread via GitHub


dongjoon-hyun closed pull request #42757: [SPARK-45036][SQL] SPJ: Simplify the 
logic to handle partially clustered distribution
URL: https://github.com/apache/spark/pull/42757


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] dongjoon-hyun commented on pull request #42757: [SPARK-45036][SQL] SPJ: Simplify the logic to handle partially clustered distribution

2023-09-04 Thread via GitHub


dongjoon-hyun commented on PR #42757:
URL: https://github.com/apache/spark/pull/42757#issuecomment-1705698868

   Merged to master for Apache Spark 4.0.0.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] gbloisi-openaire commented on a diff in pull request #42634: [SPARK-44910][SQL] Encoders.bean does not support superclasses with generic type arguments

2023-09-04 Thread via GitHub


gbloisi-openaire commented on code in PR #42634:
URL: https://github.com/apache/spark/pull/42634#discussion_r1315180662


##
sql/api/src/main/scala/org/apache/spark/sql/catalyst/JavaTypeInference.scala:
##
@@ -156,4 +158,17 @@ object JavaTypeInference {
   .filterNot(_.getName == "declaringClass")
   .filter(_.getReadMethod != null)
   }
+
+  @tailrec
+  def getClassHierarchyTypeArguments(cls: Class[_],

Review Comment:
   On second thought, this recursive getClassHierarchyTypeArguments is not 
required: JavaTypeUtils.getTypeArguments(cls, classOf[Object]) already 
traverses the inheritance hierarchy up to the Object base class, collecting 
type variables along the way. So I just removed it.
   To be clear, what we are doing here is collecting type variable information 
about the base class (which in turn could be derived from another base class, 
forming the hierarchy that needs to be traversed).
   This is required both for the provided class and for nested beans that 
extend one or more generic base classes in their inheritance hierarchy (tests 
provided).
   Interfaces do not need to be processed because they declare abstract 
methods only, so the definition of the method (and its actual return type) 
comes from the class implementing the interface.
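
   A small illustration of that single call, assuming commons-lang3 on the 
classpath (JavaTypeInference.scala imports its TypeUtils as JavaTypeUtils); 
the demo classes below are hypothetical:

   ```scala
   import scala.jdk.CollectionConverters._
   import org.apache.commons.lang3.reflect.{TypeUtils => JavaTypeUtils}

   class Base[T]
   class Middle[A, B] extends Base[B]
   class Leaf extends Middle[String, Integer]

   object TypeArgumentsDemo {
     def main(args: Array[String]): Unit = {
       // Walks Leaf -> Middle -> Base toward Object, collecting every type
       // variable met on the way; expected to map A -> String, B -> Integer
       // and T -> Integer.
       val resolved =
         JavaTypeUtils.getTypeArguments(classOf[Leaf], classOf[Object])
       resolved.asScala.foreach { case (tv, tpe) => println(s"$tv -> $tpe") }
     }
   }
   ```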
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] hvanhovell commented on a diff in pull request #42807: [SPARK-45072][CONNECT] Fix outer scopes for ammonite classes

2023-09-04 Thread via GitHub


hvanhovell commented on code in PR #42807:
URL: https://github.com/apache/spark/pull/42807#discussion_r1315168784


##
connector/connect/client/jvm/src/test/scala/org/apache/spark/sql/application/ReplE2ESuite.scala:
##
@@ -263,6 +265,31 @@ class ReplE2ESuite extends RemoteSparkSession with BeforeAndAfterEach {
 assertContains("Array[org.apache.spark.sql.Row] = Array([id1,1], [id2,16], [id3,25])", output)
   }
 
+  test("Single Cell Compilation") {

Review Comment:
   This is more of a test to test the test :)



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] hvanhovell commented on pull request #42807: [SPARK-45072][CONNECT] Fix outer scopes for ammonite classes

2023-09-04 Thread via GitHub


hvanhovell commented on PR #42807:
URL: https://github.com/apache/spark/pull/42807#issuecomment-1705586484

   cc @vsevolodstep-db 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] hvanhovell opened a new pull request, #42807: [SPARK-45072][CONNECT] Fix outer scopes for ammonite classes

2023-09-04 Thread via GitHub


hvanhovell opened a new pull request, #42807:
URL: https://github.com/apache/spark/pull/42807

   ### What changes were proposed in this pull request?
   Ammonite places all user code inside Helper classes which are nested inside 
the class it creates for each command. This PR adds a custom code class wrapper 
for the Ammonite REPL. It makes sure the Helper classes generated by ammonite 
are always registered as an outer scope immediately. This way we can 
instantiate classes defined inside the Helper class, even when we execute Spark 
code as part of the Helper's constructor.
   
   ### Why are the changes needed?
   When you currently define a class and execute a Spark command using that 
class inside the same cell/line, this will fail with a NullPointerException. 
The reason is that we cannot resolve the outer scope needed to instantiate 
the class. This PR fixes that issue. The following code will now execute 
successfully (including the curly braces):
   ```scala
   {
 case class Thing(val value: String)
 val r = (0 to 10).map( value => Thing(value.toString) )
 spark.createDataFrame(r)
   }
   ```
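   
   For illustration, a rough sketch of the mechanism the wrapper relies on 
(not the actual code generated for the Ammonite REPL; `OuterScopes.addOuterScope` 
is the existing helper in `org.apache.spark.sql.catalyst.encoders`, everything 
else here is hypothetical):
   
   ```scala
   import org.apache.spark.sql.catalyst.encoders.OuterScopes

   // Stand-in for the Helper class Ammonite generates per command.
   class Helper extends Serializable {
     // Register `this` as an outer scope before any user code from the
     // cell runs, so encoders can instantiate nested classes even while
     // this constructor is still executing Spark code.
     OuterScopes.addOuterScope(this)

     // User code pasted into the cell ends up here. `Thing` is an inner
     // class, so reflective instantiation needs the enclosing Helper
     // instance -- exactly what the registration above provides.
     case class Thing(value: String)
   }
   ```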
   
   ### Does this PR introduce _any_ user-facing change?
   No.
   
   ### How was this patch tested?
   I added more tests to the `ReplE2ESuite`.
   
   ### Was this patch authored or co-authored using generative AI tooling?
   No.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] juliuszsompolski commented on pull request #42806: [SPARK-44833][CONNECT] Fix sending Reattach too fast after Execute

2023-09-04 Thread via GitHub


juliuszsompolski commented on PR #42806:
URL: https://github.com/apache/spark/pull/42806#issuecomment-1705547797

   @hvanhovell @HyukjinKwon 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org


