[GitHub] [spark] dongjoon-hyun commented on pull request #42809: [SPARK-45074][PYTHON][CONNECT] `DataFrame.{sort, sortWithinPartitions}` support column ordinals
dongjoon-hyun commented on PR #42809: URL: https://github.com/apache/spark/pull/42809#issuecomment-1706021217

Thank you. Merged to master for Apache Spark 4.0.0.

-- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] dongjoon-hyun closed pull request #42809: [SPARK-45074][PYTHON][CONNECT] `DataFrame.{sort, sortWithinPartitions}` support column ordinals
dongjoon-hyun closed pull request #42809: [SPARK-45074][PYTHON][CONNECT] `DataFrame.{sort, sortWithinPartitions}` support column ordinals URL: https://github.com/apache/spark/pull/42809
[GitHub] [spark] zhengruifeng commented on pull request #42812: [SPARK-45076][PS] Switch to built-in `repeat` function
zhengruifeng commented on PR #42812: URL: https://github.com/apache/spark/pull/42812#issuecomment-1706010985

Thanks, merged to master.
[GitHub] [spark] zhengruifeng closed pull request #42812: [SPARK-45076][PS] Switch to built-in `repeat` function
zhengruifeng closed pull request #42812: [SPARK-45076][PS] Switch to built-in `repeat` function URL: https://github.com/apache/spark/pull/42812
[GitHub] [spark] zhengruifeng commented on a diff in pull request #42798: [SPARK-43295][PS] Support string type columns for `DataFrameGroupBy.sum`
zhengruifeng commented on code in PR #42798: URL: https://github.com/apache/spark/pull/42798#discussion_r1315417026

## python/pyspark/pandas/groupby.py:
@@ -3534,7 +3534,12 @@ def _reduce_for_stat_function(
         for label in psdf._internal.column_labels:
             psser = psdf._psser_for(label)
             input_scol = psser._dtype_op.nan_to_null(psser).spark.column
-            output_scol = sfun(input_scol)
+            if sfun.__name__ == "sum" and isinstance(
+                psdf._internal.spark_type_for(label), StringType
+            ):
+                output_scol = F.concat_ws("", F.collect_list(input_scol))

Review Comment: Since string columns are computed together with numerical ones, I think we have to compute the strings' sum in an aggregation way:
```
F.concat_ws(
    "",
    F.array_sort(
        F.collect_list(F.struct(NATURAL_ORDER_COLUMN_NAME, input_scol))
    )
)
```
For struct type, `array_sort` sorts elements by the first field, then the second, IIRC.
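For intuition on the reviewer's `array_sort`-over-structs suggestion: Spark sorts an array of structs by the first field, then the second, which is exactly Python's lexicographic tuple ordering. A plain-Python sketch (hypothetical row data, not PySpark) of the order-preserving concatenation idea:

```python
# Hypothetical (natural_order, string_value) pairs, in the arbitrary order
# collect_list might return them after a shuffle.
rows = [(2, "b"), (0, "h"), (1, "e")]

# Sorting by the embedded natural-order key (the first tuple field) restores
# the original row order before concatenating -- the
# array_sort(collect_list(struct(...))) idea in miniature.
restored = "".join(value for _, value in sorted(rows))
print(restored)  # "heb"
```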
[GitHub] [spark] HyukjinKwon closed pull request #42813: [MINOR][PYTHON][TESTS] Add init file to pyspark.ml.tests.connect
HyukjinKwon closed pull request #42813: [MINOR][PYTHON][TESTS] Add init file to pyspark.ml.tests.connect URL: https://github.com/apache/spark/pull/42813
[GitHub] [spark] HyukjinKwon commented on pull request #42813: [MINOR][PYTHON][TESTS] Add init file to pyspark.ml.tests.connect
HyukjinKwon commented on PR #42813: URL: https://github.com/apache/spark/pull/42813#issuecomment-1706007913

I manually tested this. Merged to master.
[GitHub] [spark] panbingkun commented on pull request #42797: [SPARK-45068][SQL] Make function output column name consistent in case
panbingkun commented on PR #42797: URL: https://github.com/apache/spark/pull/42797#issuecomment-1706004673

cc @HyukjinKwon @zhengruifeng @cloud-fan
[GitHub] [spark] cloud-fan commented on a diff in pull request #42810: [SPARK-45075][SQL] Fix alter table with invalid default value will not report error
cloud-fan commented on code in PR #42810: URL: https://github.com/apache/spark/pull/42810#discussion_r1315406872

## sql/catalyst/src/main/scala/org/apache/spark/sql/connector/catalog/CatalogV2Util.scala:
@@ -213,7 +213,9 @@ private[sql] object CatalogV2Util {
       // enforced by the parser). On the other hand, commands that drop the default value such
       // as "ALTER TABLE t ALTER COLUMN c DROP DEFAULT" will set this string to empty.
       if (update.newDefaultValue().nonEmpty) {
-        Some(field.withCurrentDefaultValue(update.newDefaultValue()))
+        val result = field.withCurrentDefaultValue(update.newDefaultValue())
+        ResolveDefaultColumns.analyze(result, "ALTER TABLE ALTER COLUMN")

Review Comment: Can we check AlterTable in `DataSourceV2Strategy` as well?
[GitHub] [spark] cloud-fan commented on a diff in pull request #42109: [SPARK-44404][SQL] Assign names to the error class _LEGACY_ERROR_TEMP_[1009,1010,1013,1015,1016,1278]
cloud-fan commented on code in PR #42109: URL: https://github.com/apache/spark/pull/42109#discussion_r1315376064

## common/utils/src/main/resources/error/error-classes.json:
@@ -3219,12 +3219,63 @@
   },
   "sqlState" : "0A000"
 },
+  "UNSUPPORTED_TABLE_OPERATION" : {

Review Comment: This is too general, but what we actually report is a relation type mismatch. How about
```
"UNEXPECTED_RELATION_TYPE": " is a . '' expects a ."
"subClass": {
  "WITHOUT_SUGGESTION": "",
  "WITH_ALTER_COMMAND_SUGGESTION": "Please use ALTER instead."
}
```
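For readers unfamiliar with the error-class JSON: the main-class message and a sub-class message are concatenated, then angle-bracket placeholders are substituted from the reported parameters. A minimal Python sketch of that rendering, with hypothetical template text and placeholder names (the placeholders in the quoted snippet above were lost in archiving; the real ones in the PR may differ):

```python
# Hypothetical templates in the style of error-classes.json.
main = "The table <tableName> does not support <operation>."
sub = " Please use ALTER TABLE instead."

def render(template: str, params: dict) -> str:
    # Substitute each <placeholder> with its reported parameter value.
    for key, value in params.items():
        template = template.replace(f"<{key}>", value)
    return template

message = render(main + sub, {"tableName": "`t`", "operation": "DROP COLUMN"})
print(message)
```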
[GitHub] [spark] cloud-fan commented on a diff in pull request #42109: [SPARK-44404][SQL] Assign names to the error class _LEGACY_ERROR_TEMP_[1009,1010,1013,1015,1016,1278]
cloud-fan commented on code in PR #42109: URL: https://github.com/apache/spark/pull/42109#discussion_r1315365663

## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/v2ResolutionPlans.scala:
@@ -44,7 +44,7 @@ case class UnresolvedNamespace(multipartIdentifier: Seq[String]) extends Unresol
 case class UnresolvedTable(
     multipartIdentifier: Seq[String],
     commandName: String,
-    relationTypeMismatchHint: Option[String]) extends UnresolvedLeafNode
+    hint: Boolean = false) extends UnresolvedLeafNode

Review Comment: This is not a good name, as `hint` is too vague. How about `suggestAlternative: Boolean`?
[GitHub] [spark] cloud-fan commented on a diff in pull request #42109: [SPARK-44404][SQL] Assign names to the error class _LEGACY_ERROR_TEMP_[1009,1010,1013,1015,1016,1278]
cloud-fan commented on code in PR #42109: URL: https://github.com/apache/spark/pull/42109#discussion_r1315365663

## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/v2ResolutionPlans.scala:
@@ -44,7 +44,7 @@ case class UnresolvedNamespace(multipartIdentifier: Seq[String]) extends Unresol
 case class UnresolvedTable(
     multipartIdentifier: Seq[String],
     commandName: String,
-    relationTypeMismatchHint: Option[String]) extends UnresolvedLeafNode
+    hint: Boolean = false) extends UnresolvedLeafNode

Review Comment: This is not a good name, as `hint` is too vague. How about `suggestAlterCommand: Boolean`?
[GitHub] [spark] Hisoka-X commented on a diff in pull request #42810: [SPARK-45075][SQL] Fix alter table with invalid default value will not report error
Hisoka-X commented on code in PR #42810: URL: https://github.com/apache/spark/pull/42810#discussion_r1315353214

## sql/catalyst/src/main/scala/org/apache/spark/sql/connector/catalog/CatalogV2Util.scala:
@@ -213,7 +213,9 @@ private[sql] object CatalogV2Util {
       // enforced by the parser). On the other hand, commands that drop the default value such
       // as "ALTER TABLE t ALTER COLUMN c DROP DEFAULT" will set this string to empty.
       if (update.newDefaultValue().nonEmpty) {
-        Some(field.withCurrentDefaultValue(update.newDefaultValue()))
+        val result = field.withCurrentDefaultValue(update.newDefaultValue())
+        ResolveDefaultColumns.analyze(result, "ALTER TABLE ALTER COLUMN")

Review Comment: DataSourceV2Strategy.scala:180 also uses `ResolveDefaultColumns.analyze`.
[GitHub] [spark] itholic commented on a diff in pull request #42798: [SPARK-43295][PS] Support string type columns for `DataFrameGroupBy.sum`
itholic commented on code in PR #42798: URL: https://github.com/apache/spark/pull/42798#discussion_r1315352479

## python/pyspark/pandas/groupby.py:
@@ -3534,7 +3534,12 @@ def _reduce_for_stat_function(
         for label in psdf._internal.column_labels:
             psser = psdf._psser_for(label)
             input_scol = psser._dtype_op.nan_to_null(psser).spark.column
-            output_scol = sfun(input_scol)
+            if sfun.__name__ == "sum" and isinstance(
+                psdf._internal.spark_type_for(label), StringType
+            ):
+                output_scol = F.concat_ws("", F.collect_list(input_scol))

Review Comment: I think maybe this is a different case from `head`? In `head`, we do `sdf = sdf.orderBy(NATURAL_ORDER_COLUMN_NAME)` and compute `sdf.limit(n)`, so we can keep the order because `DataFrame.limit` doesn't shuffle the data. But in this case, the data is shuffled again when computing `collect_list`, even after sorting the DataFrame by natural order in advance, so I think the order would not be guaranteed. Please let me know if I missed something?
[GitHub] [spark] cloud-fan commented on a diff in pull request #42109: [SPARK-44404][SQL] Assign names to the error class _LEGACY_ERROR_TEMP_[1009,1010,1013,1015,1016,1278]
cloud-fan commented on code in PR #42109: URL: https://github.com/apache/spark/pull/42109#discussion_r1315350627

## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala:
@@ -1134,20 +1134,21 @@ class Analyzer(override val catalogManager: CatalogManager) extends RuleExecutor
       case v: ResolvedPersistentView =>
         val nameParts = v.catalog.name() +: v.identifier.asMultipartIdentifier
         throw QueryCompilationErrors.expectTableNotViewError(
-          nameParts, isTemp = false, cmd, relationTypeMismatchHint, u)
+          nameParts, isTemp = false, cmd, true, u)

Review Comment: Why do we pass `true` directly? Shall we use `UnresolvedTable#hint`?
[GitHub] [spark] HyukjinKwon opened a new pull request, #42813: [MINOR][PYTHON][TESTS] Add init file to pyspark.ml.tests.connect
HyukjinKwon opened a new pull request, #42813: URL: https://github.com/apache/spark/pull/42813

### What changes were proposed in this pull request?
This PR proposes to add an `__init__.py` file to `pyspark.ml.tests.connect`.

### Why are the changes needed?
To make `pyspark.ml.tests.connect` a canonical Python package.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
N/A

### Was this patch authored or co-authored using generative AI tooling?
No.
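As background on why the init file matters: a directory becomes a regular, reliably discoverable package only once it contains `__init__.py`. A minimal stdlib-only sketch (the package name `demo_pkg` is hypothetical):

```python
import importlib
import pathlib
import sys
import tempfile

with tempfile.TemporaryDirectory() as d:
    # A directory containing an (even empty) __init__.py is a regular package,
    # which test runners and import machinery discover reliably.
    pkg = pathlib.Path(d) / "demo_pkg"
    pkg.mkdir()
    (pkg / "__init__.py").write_text("")

    sys.path.insert(0, d)
    mod = importlib.import_module("demo_pkg")
    # A regular package's __file__ points at its __init__.py; a namespace
    # package (no init file) would have __file__ set to None instead.
    print(mod.__file__)
```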
[GitHub] [spark] cloud-fan commented on a diff in pull request #42109: [SPARK-44404][SQL] Assign names to the error class _LEGACY_ERROR_TEMP_[1009,1010,1013,1015,1016,1278]
cloud-fan commented on code in PR #42109: URL: https://github.com/apache/spark/pull/42109#discussion_r1315349400

## common/utils/src/main/resources/error/error-classes.json:
@@ -3389,12 +3389,63 @@
   },
   "sqlState" : "0A000"
 },
+  "UNSUPPORTED_TABLE_OPERATION" : {
+    "message" : [
+      "The table does not support ."
+    ],
+    "subClass" : {
+      "WITHOUT_SUGGESTION" : {
+        "message" : [
+          ""
+        ]
+      },
+      "WITH_SUGGESTION" : {
+        "message" : [
+          "Please use ALTER TABLE instead."
+        ]
+      }
+    }
+  },
+  "UNSUPPORTED_TEMP_VIEW_OPERATION" : {

Review Comment: Can we remove this? A temp view is also a view, and there is nothing wrong with saying `The view abc does not ...` even if `abc` is a temp view.
[GitHub] [spark] zhengruifeng commented on pull request #42812: [SPARK-45076][PS] Switch to built-in `repeat` function
zhengruifeng commented on PR #42812: URL: https://github.com/apache/spark/pull/42812#issuecomment-1705926359

CI link: https://github.com/zhengruifeng/spark/actions/runs/6078712199
[GitHub] [spark] zhengruifeng commented on pull request #42812: [SPARK-45076][PS] Switch to built-in `repeat` function
zhengruifeng commented on PR #42812: URL: https://github.com/apache/spark/pull/42812#issuecomment-1705925083

cc @itholic @HyukjinKwon
[GitHub] [spark] zhengruifeng opened a new pull request, #42812: [SPARK-45076][PS] Switch to built-in `repeat` function
zhengruifeng opened a new pull request, #42812: URL: https://github.com/apache/spark/pull/42812

### What changes were proposed in this pull request?
Switch to the built-in `repeat` function.

### Why are the changes needed?
https://github.com/apache/spark/pull/42794 made `repeat` support a column-typed `n`, so we don't need this PS-specific function any more.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
CI

### Was this patch authored or co-authored using generative AI tooling?
No.
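What a column-typed `n` buys, sketched in plain Python (hypothetical data, not the PySpark API): the repeat count can come from another column of the same row instead of a single global constant.

```python
# Hypothetical rows with a string column `s` and an integer column `n`.
rows = [{"s": "ab", "n": 2}, {"s": "x", "n": 3}]

# A constant-only repeat could only apply one fixed count to every row; a
# column-typed count repeats each string by its own row's value.
repeated = [r["s"] * r["n"] for r in rows]
print(repeated)  # ['abab', 'xxx']
```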
[GitHub] [spark] cloud-fan commented on a diff in pull request #42810: [SPARK-45075][SQL] Fix alter table with invalid default value will not report error
cloud-fan commented on code in PR #42810: URL: https://github.com/apache/spark/pull/42810#discussion_r1315344801

## sql/catalyst/src/main/scala/org/apache/spark/sql/connector/catalog/CatalogV2Util.scala:
@@ -213,7 +213,9 @@ private[sql] object CatalogV2Util {
       // enforced by the parser). On the other hand, commands that drop the default value such
       // as "ALTER TABLE t ALTER COLUMN c DROP DEFAULT" will set this string to empty.
       if (update.newDefaultValue().nonEmpty) {
-        Some(field.withCurrentDefaultValue(update.newDefaultValue()))
+        val result = field.withCurrentDefaultValue(update.newDefaultValue())
+        ResolveDefaultColumns.analyze(result, "ALTER TABLE ALTER COLUMN")

Review Comment: Where/How do we check the default value for CREATE TABLE?
[GitHub] [spark] itholic commented on a diff in pull request #42798: [SPARK-43295][PS] Support string type columns for `DataFrameGroupBy.sum`
itholic commented on code in PR #42798: URL: https://github.com/apache/spark/pull/42798#discussion_r1315344811

## python/pyspark/pandas/groupby.py:
@@ -910,7 +910,7 @@ def sum(self, numeric_only: Optional[bool] = True, min_count: int = 0) -> FrameL

Review Comment: Yeah, we should update. Thanks for catching this!
[GitHub] [spark] itholic opened a new pull request, #42811: [SPARK-43241][FOLLOWUP] Add migration guide for behavior change
itholic opened a new pull request, #42811: URL: https://github.com/apache/spark/pull/42811

### What changes were proposed in this pull request?
This is a follow-up for https://github.com/apache/spark/pull/42787 to update the migration guide.

### Why are the changes needed?
We should mention all behavior changes in the migration guide.

### Does this PR introduce _any_ user-facing change?
No, it's a documentation update.

### How was this patch tested?
The existing CI should pass.

### Was this patch authored or co-authored using generative AI tooling?
No.
[GitHub] [spark] Hisoka-X commented on a diff in pull request #42802: [SPARK-43752][SQL] Support default column value on DataSource V2
Hisoka-X commented on code in PR #42802: URL: https://github.com/apache/spark/pull/42802#discussion_r1315341345

## sql/core/src/test/scala/org/apache/spark/sql/sources/InsertSuite.scala:
@@ -997,669 +997,6 @@ class InsertSuite extends DataSourceTest with SharedSparkSession {
     }
   }

-  test("SPARK-38336 INSERT INTO statements with tables with default columns: positive tests") {
-    // When the INSERT INTO statement provides fewer values than expected, NULL values are appended
-    // in their place.
-    withTable("t") {
-      sql("create table t(i boolean, s bigint) using parquet")
-      sql("insert into t(i) values(true)")
-      checkAnswer(spark.table("t"), Row(true, null))
-    }
-    // The default value for the DEFAULT keyword is the NULL literal.
-    withTable("t") {
-      sql("create table t(i boolean, s bigint) using parquet")
-      sql("insert into t values(true, default)")
-      checkAnswer(spark.table("t"), Row(true, null))
-    }
-    // There is a complex expression in the default value.
-    withTable("t") {
-      sql("create table t(i boolean, s string default concat('abc', 'def')) using parquet")
-      sql("insert into t values(true, default)")
-      checkAnswer(spark.table("t"), Row(true, "abcdef"))
-    }
-    // The default value parses correctly and the provided value type is different but coercible.
-    withTable("t") {
-      sql("create table t(i boolean, s bigint default 42) using parquet")
-      sql("insert into t(i) values(false)")
-      checkAnswer(spark.table("t"), Row(false, 42L))
-    }
-    // There are two trailing default values referenced implicitly by the INSERT INTO statement.
-    withTable("t") {
-      sql("create table t(i int, s bigint default 42, x bigint default 43) using parquet")
-      sql("insert into t(i) values(1)")
-      checkAnswer(sql("select s + x from t where i = 1"), Seq(85L).map(i => Row(i)))
-    }
-    withTable("t") {
-      sql("create table t(i int, s bigint default 42, x bigint) using parquet")
-      sql("insert into t(i) values(1)")
-      checkAnswer(spark.table("t"), Row(1, 42L, null))
-    }
-    // The table has a partitioning column and a default value is injected.
-    withTable("t") {
-      sql("create table t(i boolean, s bigint, q int default 42) using parquet partitioned by (i)")
-      sql("insert into t partition(i='true') values(5, default)")
-      checkAnswer(spark.table("t"), Row(5, 42, true))
-    }
-    // The table has a partitioning column and a default value is added per an explicit reference.
-    withTable("t") {
-      sql("create table t(i boolean, s bigint default 42) using parquet partitioned by (i)")
-      sql("insert into t partition(i='true') (s) values(default)")
-      checkAnswer(spark.table("t"), Row(42L, true))
-    }
-    // The default value parses correctly as a constant but non-literal expression.
-    withTable("t") {
-      sql("create table t(i boolean, s bigint default 41 + 1) using parquet")
-      sql("insert into t values(false, default)")
-      checkAnswer(spark.table("t"), Row(false, 42L))
-    }
-    // Explicit defaults may appear in different positions within the inline table provided as input
-    // to the INSERT INTO statement.
-    withTable("t") {
-      sql("create table t(i boolean default false, s bigint default 42) using parquet")
-      sql("insert into t(i, s) values(false, default), (default, 42)")
-      checkAnswer(spark.table("t"), Seq(Row(false, 42L), Row(false, 42L)))
-    }
-    // There is an explicit default value provided in the INSERT INTO statement in the VALUES,
-    // with an alias over the VALUES.
-    withTable("t") {
-      sql("create table t(i boolean, s bigint default 42) using parquet")
-      sql("insert into t select * from values (false, default) as tab(col, other)")
-      checkAnswer(spark.table("t"), Row(false, 42L))
-    }
-    // The explicit default value arrives first before the other value.
-    withTable("t") {
-      sql("create table t(i boolean default false, s bigint) using parquet")
-      sql("insert into t values (default, 43)")
-      checkAnswer(spark.table("t"), Row(false, 43L))
-    }
-    // The 'create table' statement provides the default parameter first.
-    withTable("t") {
-      sql("create table t(i boolean default false, s bigint) using parquet")
-      sql("insert into t values (default, 43)")
-      checkAnswer(spark.table("t"), Row(false, 43L))
-    }
-    // The explicit default value is provided in the wrong order (first instead of second), but
-    // this is OK because the provided default value evaluates to literal NULL.
-    withTable("t") {
-      sql("create table t(i boolean, s bigint default 42) using parquet")
-      sql("insert into t values (default, 43)")
-      checkAnswer(spark.table("t"), Row(null, 43L))
-    }
-    // There is an explicit default value provided in the INSERT INTO statement as a SELECT.
-    // This is supported.
-    withTable
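The removed positive tests above all exercise one rule: missing or `DEFAULT` values fall back to the column's declared default (NULL when none was declared). A plain-Python sketch of that padding rule, under a hypothetical two-column schema:

```python
# Hypothetical (column_name, default_value) schema: `i` has no default
# (falls back to None/NULL), `s` defaults to 42.
schema = [("i", None), ("s", 42)]

def insert_row(values):
    # Pad the supplied values with the defaults of the trailing columns,
    # mirroring how "insert into t(i) values(true)" fills `s` with its default.
    return tuple(values) + tuple(default for _, default in schema[len(values):])

row = insert_row((True,))
print(row)  # (True, 42)
```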
[GitHub] [spark] Hisoka-X commented on pull request #42810: [SPARK-45075][SQL] Fix alter table with invalid default value will not report error
Hisoka-X commented on PR #42810: URL: https://github.com/apache/spark/pull/42810#issuecomment-1705918893

cc @cloud-fan @gengliangwang @dtenedor
[GitHub] [spark] Hisoka-X commented on a diff in pull request #42810: [SPARK-45075][SQL] Fix alter table with invalid default value will not report error
Hisoka-X commented on code in PR #42810: URL: https://github.com/apache/spark/pull/42810#discussion_r1315339168 ## sql/core/src/test/scala/org/apache/spark/sql/connector/AlterTableTests.scala: ## @@ -363,6 +363,35 @@ trait AlterTableTests extends SharedSparkSession with QueryErrorsBase { } } + test("SPARK-45075: ALTER COLUMN with invalid default value") { +withSQLConf(SQLConf.DEFAULT_COLUMN_ALLOWED_PROVIDERS.key -> s"$v2Format, ") { + withTable("t") { +sql(s"create table t(i boolean) using $v2Format") +// The default value fails to analyze. +checkError( + exception = intercept[AnalysisException] { +sql("alter table t add column s bigint default badvalue") + }, + errorClass = "INVALID_DEFAULT_VALUE.UNRESOLVED_EXPRESSION", + parameters = Map( +"statement" -> "ALTER TABLE", +"colName" -> "`s`", +"defaultValue" -> "badvalue")) Review Comment: This test case will report error normally before this PR, but we don't have any test case to prove it. ## sql/core/src/test/scala/org/apache/spark/sql/connector/AlterTableTests.scala: ## @@ -363,6 +363,35 @@ trait AlterTableTests extends SharedSparkSession with QueryErrorsBase { } } + test("SPARK-45075: ALTER COLUMN with invalid default value") { +withSQLConf(SQLConf.DEFAULT_COLUMN_ALLOWED_PROVIDERS.key -> s"$v2Format, ") { + withTable("t") { +sql(s"create table t(i boolean) using $v2Format") +// The default value fails to analyze. 
+checkError( + exception = intercept[AnalysisException] { +sql("alter table t add column s bigint default badvalue") + }, + errorClass = "INVALID_DEFAULT_VALUE.UNRESOLVED_EXPRESSION", + parameters = Map( +"statement" -> "ALTER TABLE", +"colName" -> "`s`", +"defaultValue" -> "badvalue")) + +sql("alter table t add column s bigint default 3L") +checkError( + exception = intercept[AnalysisException] { +sql("alter table t alter column s set default badvalue") + }, + errorClass = "INVALID_DEFAULT_VALUE.UNRESOLVED_EXPRESSION", + parameters = Map( +"statement" -> "ALTER TABLE ALTER COLUMN", +"colName" -> "`s`", +"defaultValue" -> "badvalue")) Review Comment: Added this test case to prove the new change works. The more negative cases live in `sql/core/src/test/scala/org/apache/spark/sql/sources/InsertSuite.scala` and only work for V1; after this PR is merged, I will move them to `sql/core/src/test/scala/org/apache/spark/sql/SQLInsertTestSuite.scala` in #42802
[GitHub] [spark] Hisoka-X opened a new pull request, #42810: [SPARK-45075][SQL] Fix alter table with invalid default value will not report error
Hisoka-X opened a new pull request, #42810: URL: https://github.com/apache/spark/pull/42810 ### What changes were proposed in this pull request? This PR makes sure that ALTER TABLE ALTER COLUMN with an invalid default value on DataSource V2 reports an error; before this PR, the alter would succeed. ### Why are the changes needed? Fixes the incorrect behavior of the ALTER TABLE statement on DataSource V2. ### Does this PR introduce _any_ user-facing change? Yes, an invalid default value now reports an error. ### How was this patch tested? Added a new test. ### Was this patch authored or co-authored using generative AI tooling? No.
[GitHub] [spark] itholic commented on a diff in pull request #42787: [SPARK-43241][PS] `MultiIndex.append` not checking names for equality
itholic commented on code in PR #42787: URL: https://github.com/apache/spark/pull/42787#discussion_r1315337724 ## python/pyspark/pandas/indexes/base.py: ## @@ -1917,18 +1917,12 @@ def append(self, other: "Index") -> "Index": sdf_other = other._internal.spark_frame.select(other._internal.index_spark_columns) sdf_appended = sdf_self.union(sdf_other) -# names should be kept when MultiIndex, but Index wouldn't keep its name. -if isinstance(self, MultiIndex): -index_names = self._internal.index_names -else: -index_names = None - internal = InternalFrame( spark_frame=sdf_appended, index_spark_columns=[ scol_for(sdf_appended, col) for col in self._internal.index_spark_column_names ], -index_names=index_names, +index_names=None, Review Comment: > shouldn't we also mention this in our migration doc? Hmm.. I didn't mention this as a behavior change since it's a bug fix, but on second thought maybe we'd better mention it in the migration guide anyway. Let me create a follow-up for updating the migration guide.
[GitHub] [spark] zhengruifeng commented on pull request #42787: [SPARK-43241][PS] `MultiIndex.append` not checking names for equality
zhengruifeng commented on PR #42787: URL: https://github.com/apache/spark/pull/42787#issuecomment-1705906007 merged to master
[GitHub] [spark] zhengruifeng closed pull request #42787: [SPARK-43241][PS] `MultiIndex.append` not checking names for equality
zhengruifeng closed pull request #42787: [SPARK-43241][PS] `MultiIndex.append` not checking names for equality URL: https://github.com/apache/spark/pull/42787
[GitHub] [spark] zhengruifeng commented on a diff in pull request #42787: [SPARK-43241][PS] `MultiIndex.append` not checking names for equality
zhengruifeng commented on code in PR #42787: URL: https://github.com/apache/spark/pull/42787#discussion_r1315324075 ## python/pyspark/pandas/indexes/base.py: ## @@ -1917,18 +1917,12 @@ def append(self, other: "Index") -> "Index": sdf_other = other._internal.spark_frame.select(other._internal.index_spark_columns) sdf_appended = sdf_self.union(sdf_other) -# names should be kept when MultiIndex, but Index wouldn't keep its name. -if isinstance(self, MultiIndex): -index_names = self._internal.index_names -else: -index_names = None - internal = InternalFrame( spark_frame=sdf_appended, index_spark_columns=[ scol_for(sdf_appended, col) for col in self._internal.index_spark_column_names ], -index_names=index_names, +index_names=None, Review Comment: ok, I misunderstood the Pandas PR. LGTM
[GitHub] [spark] LuciferYang commented on pull request #42789: [SPARK-45063][PYTHON][DOCS] Refine docstring of `max_by/min_by`
LuciferYang commented on PR #42789: URL: https://github.com/apache/spark/pull/42789#issuecomment-1705894638 Thanks @dongjoon-hyun @zhengruifeng
[GitHub] [spark] cloud-fan commented on a diff in pull request #42778: [SPARK-45055] [SQL] Do not transpose windows if they conflict on ORDER BY / PROJECT clauses
cloud-fan commented on code in PR #42778: URL: https://github.com/apache/spark/pull/42778#discussion_r1315322373 ## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala: ## @@ -1305,11 +1305,20 @@ object TransposeWindow extends Rule[LogicalPlan] { } } + // Returns false if the parent 'w1' window function order by any of the window + // expressions produced by 'w2'. + private def compatibleOrderBy(w1: Window, w2: Window): Boolean = { +val childWindowExprs = w2.windowExpressions.map(x => x.toAttribute) +val parentOrder = w1.orderSpec.flatMap(x => x.references) +childWindowExprs.intersect(parentOrder).isEmpty + } + private def windowsCompatible(w1: Window, w2: Window): Boolean = { w1.references.intersect(w2.windowOutputSet).isEmpty && Review Comment: no, the `Window` operator won't exclude C, as it only excludes `windowOutputSet`
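The `compatibleOrderBy` check in the diff reduces to a set intersection: two adjacent windows may be transposed only when the parent's ORDER BY references none of the attributes the child's window expressions produce. A minimal pure-Python sketch of that idea (simplified model, not Catalyst code — attribute names stand in for `AttributeReference`s):

```python
# Simplified model of TransposeWindow's compatibleOrderBy: the parent
# window w1 may be pushed below the child window w2 only if w1's
# ORDER BY does not reference any attribute produced by w2.

def compatible_order_by(parent_order_refs, child_window_outputs):
    # False when the parent orders by an attribute the child produces.
    return not (set(parent_order_refs) & set(child_window_outputs))

# w2 produces rank_col and w1 orders by it -> not transposable
assert compatible_order_by({"rank_col"}, {"rank_col"}) is False
# w1 orders by an unrelated column -> transposable (as far as ORDER BY goes)
assert compatible_order_by({"ts"}, {"rank_col"}) is True
```

The real rule combines this with the existing `windowsCompatible` conditions; the sketch only isolates the new ORDER BY conflict check.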
[GitHub] [spark] cloud-fan commented on a diff in pull request #42778: [SPARK-45055] [SQL] Do not transpose windows if they conflict on ORDER BY / PROJECT clauses
cloud-fan commented on code in PR #42778: URL: https://github.com/apache/spark/pull/42778#discussion_r1315321463 ## sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/optimizer/TransposeWindowSuite.scala: ## @@ -160,4 +160,18 @@ class TransposeWindowSuite extends PlanTest { comparePlans(optimized, correctAnswer.analyze) } + test("two windows with overlapping project/order by lists") { Review Comment: This test passes with the latest master branch.
[GitHub] [spark] zhengruifeng commented on pull request #42808: [SPARK-45073][PS][CONNECT] Replace `LastNotNull` with `Last(ignoreNulls=True)`
zhengruifeng commented on PR #42808: URL: https://github.com/apache/spark/pull/42808#issuecomment-1705889593 thanks, merged to master
[GitHub] [spark] zhengruifeng closed pull request #42808: [SPARK-45073][PS][CONNECT] Replace `LastNotNull` with `Last(ignoreNulls=True)`
zhengruifeng closed pull request #42808: [SPARK-45073][PS][CONNECT] Replace `LastNotNull` with `Last(ignoreNulls=True)` URL: https://github.com/apache/spark/pull/42808
[GitHub] [spark] HyukjinKwon commented on a diff in pull request #42798: [SPARK-43295][PS] Support string type columns for `DataFrameGroupBy.sum`
HyukjinKwon commented on code in PR #42798: URL: https://github.com/apache/spark/pull/42798#discussion_r1315318972 ## python/pyspark/pandas/groupby.py: ## @@ -3534,7 +3534,12 @@ def _reduce_for_stat_function( for label in psdf._internal.column_labels: psser = psdf._psser_for(label) input_scol = psser._dtype_op.nan_to_null(psser).spark.column -output_scol = sfun(input_scol) +if sfun.__name__ == "sum" and isinstance( +psdf._internal.spark_type_for(label), StringType +): +output_scol = F.concat_ws("", F.collect_list(input_scol)) Review Comment: can we sort by natural order? we have `compute.ordered_head` config. We can sort it by natural order, and perform `collect_list`.
[GitHub] [spark] zhengruifeng commented on pull request #42809: [SPARK-45074][PYTHON][CONNECT] `DataFrame.{sort, sortWithinPartitions}` support column ordinals
zhengruifeng commented on PR #42809: URL: https://github.com/apache/spark/pull/42809#issuecomment-1705887362 cc @HyukjinKwon @dongjoon-hyun
[GitHub] [spark] zhengruifeng opened a new pull request, #42809: [SPARK-45074][PYTHON][CONNECT] `DataFrame.{sort, sortWithinPartitions}` support column ordinals
zhengruifeng opened a new pull request, #42809: URL: https://github.com/apache/spark/pull/42809 ### What changes were proposed in this pull request? `DataFrame.{sort, sortWithinPartitions}` support column ordinals ### Why are the changes needed? for feature parity: SQL: ``` select a, 1, sum(b) from v group by 1, 2 order by 3, 1; ``` DataFrame: ``` df.select("a", sf.lit(1), "b").groupBy(1, 2).agg(sf.sum("b")).sort(3, 1) ``` ### Does this PR introduce _any_ user-facing change? yes, new feature ### How was this patch tested? added tests ### Was this patch authored or co-authored using generative AI tooling? NO
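The ordinal semantics the PR description shows can be illustrated without Spark: a 1-based integer key is resolved against the current output columns, exactly as SQL's `ORDER BY 3, 1` is. A rough sketch of that resolution (hypothetical helper, not the PR's implementation):

```python
# Sketch: resolve 1-based column ordinals (as in "order by 3, 1")
# against a projected column list -- the semantics DataFrame.sort /
# sortWithinPartitions gain in this PR.

def resolve_sort_keys(columns, keys):
    resolved = []
    for k in keys:
        if isinstance(k, int):  # 1-based ordinal, like SQL's ORDER BY 3
            if not 1 <= k <= len(columns):
                raise ValueError(f"ordinal {k} out of range for {len(columns)} columns")
            resolved.append(columns[k - 1])
        else:                    # already a column name
            resolved.append(k)
    return resolved

# df.select("a", sf.lit(1), "b") projects three columns; sort(3, 1)
# should therefore sort by the third, then the first.
cols = ["a", "lit_1", "sum_b"]
assert resolve_sort_keys(cols, [3, 1]) == ["sum_b", "a"]
```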
[GitHub] [spark] HyukjinKwon commented on a diff in pull request #42798: [SPARK-43295][PS] Support string type columns for `DataFrameGroupBy.sum`
HyukjinKwon commented on code in PR #42798: URL: https://github.com/apache/spark/pull/42798#discussion_r1315317140 ## python/pyspark/pandas/groupby.py: ## @@ -910,7 +910,7 @@ def sum(self, numeric_only: Optional[bool] = True, min_count: int = 0) -> FrameL Review Comment: I think you gotta fix the log above too since now we support strings too?
[GitHub] [spark] LuciferYang commented on pull request #42796: [SPARK-45067][BUILD] Upgrade slf4j to 2.0.9
LuciferYang commented on PR #42796: URL: https://github.com/apache/spark/pull/42796#issuecomment-1705882880 Thanks @srowen
[GitHub] [spark] HyukjinKwon commented on a diff in pull request #42787: [SPARK-43241][PS] `MultiIndex.append` not checking names for equality
HyukjinKwon commented on code in PR #42787: URL: https://github.com/apache/spark/pull/42787#discussion_r1315315397 ## python/pyspark/pandas/indexes/base.py: ## @@ -1917,18 +1917,12 @@ def append(self, other: "Index") -> "Index": sdf_other = other._internal.spark_frame.select(other._internal.index_spark_columns) sdf_appended = sdf_self.union(sdf_other) -# names should be kept when MultiIndex, but Index wouldn't keep its name. -if isinstance(self, MultiIndex): -index_names = self._internal.index_names -else: -index_names = None - internal = InternalFrame( spark_frame=sdf_appended, index_spark_columns=[ scol_for(sdf_appended, col) for col in self._internal.index_spark_column_names ], -index_names=index_names, +index_names=None, Review Comment: shouldn't we also mention this in our migration doc?
[GitHub] [spark] ulysses-you commented on a diff in pull request #42804: [SPARK-45071][SQL] Optimize the processing speed of `BinaryArithmetic#dataType` when processing multi-column data
ulysses-you commented on code in PR #42804: URL: https://github.com/apache/spark/pull/42804#discussion_r1315314032 ## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/arithmetic.scala: ## @@ -234,10 +234,17 @@ abstract class BinaryArithmetic extends BinaryOperator case _ => super.checkInputDataTypes() } - override def dataType: DataType = (left.dataType, right.dataType) match {
case (DecimalType.Fixed(p1, s1), DecimalType.Fixed(p2, s2)) => + resultDecimalType(p1, s1, p2, s2)
case _ => left.dataType + override def dataType: DataType = (left, right) match { +case (_, r: AttributeReference) if !r.dataType.isInstanceOf[DecimalType] => + left.dataType +case (l: AttributeReference, _) if !l.dataType.isInstanceOf[DecimalType] => + left.dataType +case (_, _) => + (left.dataType, right.dataType) match { +case (DecimalType.Fixed(p1, s1), DecimalType.Fixed(p2, s2)) => + resultDecimalType(p1, s1, p2, s2) +case _ => left.dataType + } Review Comment: +1
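The diff above adds a fast path to `BinaryArithmetic#dataType`: when either operand is a plain column reference with a non-decimal type, the result type is simply the left child's type, so the decimal precision/scale combination logic never runs. A toy Python model of the same shape (hypothetical class names, not Catalyst code; types reduced to strings):

```python
# Toy model of the BinaryArithmetic#dataType short-circuit: when either
# operand is a direct column reference (Attr) with a non-decimal type,
# return the left child's type directly and skip the decimal
# precision/scale combination logic entirely.

class Attr:  # stands in for AttributeReference
    def __init__(self, dtype):
        self.dtype = dtype
    def data_type(self):
        return self.dtype

class Add:  # stands in for a BinaryArithmetic node
    def __init__(self, left, right):
        self.left, self.right = left, right
    def data_type(self):
        for side in (self.left, self.right):
            if isinstance(side, Attr) and side.dtype != "decimal":
                # fast path: no decimal result-type computation needed
                return self.left.data_type()
        # slow path: inspect both child types (decimal precision rules etc.)
        lt, rt = self.left.data_type(), self.right.data_type()
        return "decimal" if (lt, rt) == ("decimal", "decimal") else lt

expr = Add(Add(Attr("long"), Attr("long")), Attr("long"))
assert expr.data_type() == "long"                       # fast path
assert Add(Attr("decimal"), Attr("decimal")).data_type() == "decimal"  # slow path
```

For deep expression trees over many non-decimal columns, the fast path turns a repeated recursive type computation into a cheap per-node check, which is the speedup the PR title refers to.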
[GitHub] [spark] itholic commented on a diff in pull request #42787: [SPARK-43241][PS] `MultiIndex.append` not checking names for equality
itholic commented on code in PR #42787: URL: https://github.com/apache/spark/pull/42787#discussion_r1315309392 ## python/pyspark/pandas/indexes/base.py: ## @@ -1917,18 +1917,12 @@ def append(self, other: "Index") -> "Index": sdf_other = other._internal.spark_frame.select(other._internal.index_spark_columns) sdf_appended = sdf_self.union(sdf_other) -# names should be kept when MultiIndex, but Index wouldn't keep its name. -if isinstance(self, MultiIndex): -index_names = self._internal.index_names -else: -index_names = None - internal = InternalFrame( spark_frame=sdf_appended, index_spark_columns=[ scol_for(sdf_appended, col) for col in self._internal.index_spark_column_names ], -index_names=index_names, +index_names=None, Review Comment: so I thought it was a bug in Pandas that is fixed in the Pandas 2.0.0.
[GitHub] [spark] itholic commented on a diff in pull request #42787: [SPARK-43241][PS] `MultiIndex.append` not checking names for equality
itholic commented on code in PR #42787: URL: https://github.com/apache/spark/pull/42787#discussion_r1315309089 ## python/pyspark/pandas/indexes/base.py: ## @@ -1917,18 +1917,12 @@ def append(self, other: "Index") -> "Index": sdf_other = other._internal.spark_frame.select(other._internal.index_spark_columns) sdf_appended = sdf_self.union(sdf_other) -# names should be kept when MultiIndex, but Index wouldn't keep its name. -if isinstance(self, MultiIndex): -index_names = self._internal.index_names -else: -index_names = None - internal = InternalFrame( spark_frame=sdf_appended, index_spark_columns=[ scol_for(sdf_appended, col) for col in self._internal.index_spark_column_names ], -index_names=index_names, +index_names=None, Review Comment: I believe it's an intentional behavior since they mentioned this in ["Bug fixes"](https://pandas.pydata.org/docs/dev/whatsnew/v2.0.0.html#bug-fixes) section in their release note?
[GitHub] [spark] gengliangwang commented on pull request #42766: [SPARK-45046][BUILD] Set `shadeTestJar` of `core` module to `false`
gengliangwang commented on PR #42766: URL: https://github.com/apache/spark/pull/42766#issuecomment-1705873204 @LuciferYang Got it, thanks!
[GitHub] [spark] LuciferYang commented on pull request #42766: [SPARK-45046][BUILD] Set `shadeTestJar` of `core` module to `false`
LuciferYang commented on PR #42766: URL: https://github.com/apache/spark/pull/42766#issuecomment-1705871748 @gengliangwang The `connector` module has already been set to false, and the `core` module is changed to false in the current PR. The streaming module is a historical legacy and it seems to be intentionally retained (https://github.com/apache/spark/commit/37a5e272f898e946c09c2e7de5d1bda6f27a8f39).
[GitHub] [spark] gengliangwang commented on pull request #42766: [SPARK-45046][BUILD] Set `shadeTestJar` of `core` module to `false`
gengliangwang commented on PR #42766: URL: https://github.com/apache/spark/pull/42766#issuecomment-1705864189 BTW, there were multiple places setting `shadeTestJar` as true. @LuciferYang could you update the unnecessary ones as well? ``` $ git grep shadeTestJar connector/protobuf/pom.xml: false core/pom.xml: true streaming/pom.xml: true ```
[GitHub] [spark] gengliangwang commented on pull request #42766: [SPARK-45046][BUILD] Set `shadeTestJar` of `core` module to `false`
gengliangwang commented on PR #42766: URL: https://github.com/apache/spark/pull/42766#issuecomment-1705863504 > @gengliangwang Do you still remember why changed shadeTestJar to false? I guess it was a mistake...
[GitHub] [spark] itholic commented on pull request #42798: [SPARK-43295][PS] Support string type columns for `DataFrameGroupBy.sum`
itholic commented on PR #42798: URL: https://github.com/apache/spark/pull/42798#issuecomment-1705851281 @zhengruifeng I think the problem is that Pandas computes the concat without sorting, so the result can be different when the index is not sorted, as below: ## Problem **Pandas** ```python >>> pdf A B 4 a 1 3 b 2 2 c 3 >>> pdf.sum() Aabc B 6 dtype: object ``` **Pandas API on Spark** ```python >>> psdf A B 4 a 1 3 b 2 2 c 3 >>> psdf.sum() Acba # we internally sorted the index, so the result is different from Pandas B 6 dtype: object ``` ## Solution I think for now we can pick one of the three ways below: 1. We can document the warning note as below: ``` The result for string type column is non-deterministic since the implementation depends on `collect_list` API from PySpark which is non-deterministic as well. ``` 2. We can `collect_list` both value and index, and sort by the indices before `concat_ws` as you suggested, and document the warning note as below: ``` The result for string type column can be different from Pandas when the index is not sorted, since we always sort the indexes before computing, because the implementation depends on `collect_list` API from PySpark which is non-deterministic. ``` 3. We don't support the string type column, as so far, and add a note explaining why we don't support it, as below: ``` String type column is not supported for now, because it might yield non-deterministic results unlike in Pandas. ``` WDYT? Also cc @HyukjinKwon, @ueshin @xinrong-meng , What strategy do we take for this situation? I believe that the same rules should apply to similar cases that already exist or may arise in the future.
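The ordering problem itholic describes is easy to reproduce without Spark: concatenating per-group string values is only deterministic once an order is fixed, which is what option 2 (sorting by index before concatenating) buys. A plain-Python sketch of the two behaviors:

```python
# Concatenating string values is order-sensitive: without a defined order
# the result depends on how rows happen to arrive (the non-determinism of
# collect_list). Sorting by the index first makes the result
# deterministic, though it can still differ from pandas, which keeps the
# original row order -- exactly the "abc" vs "cba" mismatch shown above.

rows = [(4, "a"), (3, "b"), (2, "c")]  # (index, value) pairs

arrival_order = "".join(v for _, v in rows)         # pandas-like result
index_sorted = "".join(v for _, v in sorted(rows))  # index-sorted result

assert arrival_order == "abc"
assert index_sorted == "cba"
```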
[GitHub] [spark] HyukjinKwon closed pull request #42770: [SPARK-45049][CONNECT][DOCS][TESTS] Refine docstrings of `coalesce/repartition/repartitionByRange`
HyukjinKwon closed pull request #42770: [SPARK-45049][CONNECT][DOCS][TESTS] Refine docstrings of `coalesce/repartition/repartitionByRange` URL: https://github.com/apache/spark/pull/42770
[GitHub] [spark] HyukjinKwon commented on pull request #42770: [SPARK-45049][CONNECT][DOCS][TESTS] Refine docstrings of `coalesce/repartition/repartitionByRange`
HyukjinKwon commented on PR #42770: URL: https://github.com/apache/spark/pull/42770#issuecomment-1705847362 Merged to master.
[GitHub] [spark] zekai-li commented on pull request #42529: [SPARK-44845][YARN][DEPLOY] Fix file system uri comparison function
zekai-li commented on PR #42529: URL: https://github.com/apache/spark/pull/42529#issuecomment-1705837097 > Hi, @zekai-li . > > It seems that you didn't address the previous review comment yet. WDYT? > > [#42529 (comment)](https://github.com/apache/spark/pull/42529#discussion_r1306701075) Sorry, I didn't notice. This is something you need to decide regarding `equalsIgnoreCase`.
[GitHub] [spark] zekai-li closed pull request #42529: [SPARK-44845][YARN][DEPLOY] Fix file system uri comparison function
zekai-li closed pull request #42529: [SPARK-44845][YARN][DEPLOY] Fix file system uri comparison function URL: https://github.com/apache/spark/pull/42529
[GitHub] [spark] zekai-li commented on pull request #42529: [SPARK-44845][YARN][DEPLOY] Fix file system uri comparison function
zekai-li commented on PR #42529: URL: https://github.com/apache/spark/pull/42529#issuecomment-1705835097 > Hi, @zekai-li . > > It seems that you didn't address the previous review comment yet. WDYT? > > [#42529 (comment)](https://github.com/apache/spark/pull/42529#discussion_r1306701075)
[GitHub] [spark] zekai-li commented on a diff in pull request #42529: [SPARK-44845][YARN][DEPLOY] Fix file system uri comparison function
zekai-li commented on code in PR #42529: URL: https://github.com/apache/spark/pull/42529#discussion_r1315276443 ## resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/Client.scala: ## @@ -1618,9 +1618,9 @@ private[spark] object Client extends Logging { return false } -val srcAuthority = srcUri.getAuthority() -val dstAuthority = dstUri.getAuthority() -if (srcAuthority != null && !srcAuthority.equalsIgnoreCase(dstAuthority)) { +val srcUserInfo = srcUri.getUserInfo() +val dstUserInfo = dstUri.getUserInfo() +if (srcUserInfo != null && !srcUserInfo.equalsIgnoreCase(dstUserInfo)) { Review Comment: Last PR discussion: https://github.com/apache/spark/pull/19885#discussion_r154819072
[GitHub] [spark] zekai-li commented on a diff in pull request #42529: [SPARK-44845][YARN][DEPLOY] Fix file system uri comparison function
zekai-li commented on code in PR #42529: URL: https://github.com/apache/spark/pull/42529#discussion_r1315274264 ## resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/Client.scala: ## @@ -1618,9 +1618,9 @@ private[spark] object Client extends Logging { return false } -val srcAuthority = srcUri.getAuthority() -val dstAuthority = dstUri.getAuthority() -if (srcAuthority != null && !srcAuthority.equalsIgnoreCase(dstAuthority)) { +val srcUserInfo = srcUri.getUserInfo() +val dstUserInfo = dstUri.getUserInfo() +if (srcUserInfo != null && !srcUserInfo.equalsIgnoreCase(dstUserInfo)) { Review Comment: We don't need to do this adaptation for the user!
[GitHub] [spark] zekai-li commented on a diff in pull request #42529: [SPARK-44845][YARN][DEPLOY] Fix file system uri comparison function
zekai-li commented on code in PR #42529: URL: https://github.com/apache/spark/pull/42529#discussion_r1315273750 ## resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/Client.scala: ## @@ -1618,9 +1618,9 @@ private[spark] object Client extends Logging { return false } -val srcAuthority = srcUri.getAuthority() -val dstAuthority = dstUri.getAuthority() -if (srcAuthority != null && !srcAuthority.equalsIgnoreCase(dstAuthority)) { +val srcUserInfo = srcUri.getUserInfo() +val dstUserInfo = dstUri.getUserInfo() +if (srcUserInfo != null && !srcUserInfo.equalsIgnoreCase(dstUserInfo)) { Review Comment: srcUserInfo == null && dstUserInfo != null is fine in this case. I'm not sure if equalsIgnoreCase is correct, but I'm going to follow the same logic as before. Using equals would be more rigorous, though.
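As background for the null-handling and case-sensitivity questions in this thread, here is a minimal Python sketch of comparing the userinfo components of two file-system URIs. It is not Spark code; the function name and example URIs are invented for illustration, and the null handling mirrors Java, where `equalsIgnoreCase(null)` returns false:

```python
from urllib.parse import urlsplit

def compatible_user_info(src: str, dst: str) -> bool:
    # If the source URI carries no userinfo, accept any destination,
    # matching the `srcUserInfo != null && ...` guard in the patch.
    src_user = urlsplit(src).username
    dst_user = urlsplit(dst).username
    if src_user is None:
        return True
    # Java's equalsIgnoreCase(null) is false, so a missing dst userinfo
    # makes the pair incompatible once the src userinfo is present.
    if dst_user is None:
        return False
    return src_user.lower() == dst_user.lower()

# Same user, different case: compatible under equalsIgnoreCase semantics.
print(compatible_user_info("hdfs://alice@nn:8020/a", "hdfs://ALICE@nn:8020/b"))  # prints True
```

Whether case-insensitive comparison is even appropriate for userinfo is exactly the open question in the review; switching `.lower() ==` to a plain `==` gives the stricter `equals` behavior zekai-li mentions.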
[GitHub] [spark] zhengruifeng commented on a diff in pull request #42787: [SPARK-43241][PS] `MultiIndex.append` not checking names for equality
zhengruifeng commented on code in PR #42787: URL: https://github.com/apache/spark/pull/42787#discussion_r1315255478 ## python/pyspark/pandas/indexes/base.py: ## @@ -1917,18 +1917,12 @@ def append(self, other: "Index") -> "Index": sdf_other = other._internal.spark_frame.select(other._internal.index_spark_columns) sdf_appended = sdf_self.union(sdf_other) -# names should be kept when MultiIndex, but Index wouldn't keep its name. -if isinstance(self, MultiIndex): -index_names = self._internal.index_names -else: -index_names = None - internal = InternalFrame( spark_frame=sdf_appended, index_spark_columns=[ scol_for(sdf_appended, col) for col in self._internal.index_spark_column_names ], -index_names=index_names, +index_names=None, Review Comment: is it a bug in Pandas that might be fixed in the future?
[GitHub] [spark] wangyum commented on a diff in pull request #42804: [SPARK-45071][SQL] Optimize the processing speed of `BinaryArithmetic#dataType` when processing multi-column data
wangyum commented on code in PR #42804: URL: https://github.com/apache/spark/pull/42804#discussion_r1315255094 ## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/arithmetic.scala: ## @@ -234,10 +234,17 @@ abstract class BinaryArithmetic extends BinaryOperator case _ => super.checkInputDataTypes() } - override def dataType: DataType = (left.dataType, right.dataType) match { -case (DecimalType.Fixed(p1, s1), DecimalType.Fixed(p2, s2)) => - resultDecimalType(p1, s1, p2, s2) -case _ => left.dataType + override def dataType: DataType = (left, right) match { +case (_, r: AttributeReference) if !r.dataType.isInstanceOf[DecimalType] => + left.dataType +case (l: AttributeReference, _) if !l.dataType.isInstanceOf[DecimalType] => + left.dataType +case (_, _) => + (left.dataType, right.dataType) match { +case (DecimalType.Fixed(p1, s1), DecimalType.Fixed(p2, s2)) => + resultDecimalType(p1, s1, p2, s2) +case _ => left.dataType + } Review Comment: Can we just change it to lazy val? For example: ```scala private lazy val internalDataType: DataType = { (left.dataType, right.dataType) match { case (DecimalType.Fixed(p1, s1), DecimalType.Fixed(p2, s2)) => resultDecimalType(p1, s1, p2, s2) case _ => left.dataType } } override def dataType: DataType = internalDataType ```
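The `lazy val` in the suggestion above computes the result type once and reuses it, which is what removes the repeated type resolution. A rough Python analogue of that caching pattern via `functools.cached_property` (the class and type names here are toy stand-ins, not Spark APIs):

```python
from functools import cached_property

class BinaryArithmeticSketch:
    """Toy stand-in for Catalyst's BinaryArithmetic expression."""

    def __init__(self, left_type: str, right_type: str):
        self.left_type = left_type
        self.right_type = right_type
        self.resolutions = 0  # counts how often the type is actually computed

    @cached_property
    def data_type(self) -> str:
        # Evaluated once and cached, like Scala's `lazy val`; later
        # accesses skip the (potentially costly) decimal-precision logic.
        self.resolutions += 1
        if self.left_type.startswith("decimal") and self.right_type.startswith("decimal"):
            return "decimal(widened)"  # stands in for resultDecimalType(p1, s1, p2, s2)
        return self.left_type

expr = BinaryArithmeticSketch("int", "int")
expr.data_type
expr.data_type
print(expr.resolutions)  # prints 1: the second access hit the cache
```

The same trade-off as in the review applies: caching is only safe if the children's types cannot change after the first access, which holds for resolved expressions.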
[GitHub] [spark] itholic commented on a diff in pull request #42787: [SPARK-43241][PS] `MultiIndex.append` not checking names for equality
itholic commented on code in PR #42787: URL: https://github.com/apache/spark/pull/42787#discussion_r1315254128 ## python/pyspark/pandas/indexes/base.py: ## @@ -1917,18 +1917,12 @@ def append(self, other: "Index") -> "Index": sdf_other = other._internal.spark_frame.select(other._internal.index_spark_columns) sdf_appended = sdf_self.union(sdf_other) -# names should be kept when MultiIndex, but Index wouldn't keep its name. -if isinstance(self, MultiIndex): -index_names = self._internal.index_names -else: -index_names = None - internal = InternalFrame( spark_frame=sdf_appended, index_spark_columns=[ scol_for(sdf_appended, col) for col in self._internal.index_spark_column_names ], -index_names=index_names, +index_names=None, Review Comment: We can simply set the `index_names` to `None` to follow the behavior of Pandas, since Pandas doesn't keep the name of `MultiIndex` when computing the `append` from Pandas 2.0.0. (See https://github.com/pandas-dev/pandas/pull/48288 for more detail) cc @zhengruifeng @HyukjinKwon as CI passed.
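A small pandas illustration of the behavior being matched; exact name propagation depends on the installed pandas version (per pandas-dev/pandas#48288, 2.0+ drops the names when they differ between operands), so only the shape is checked here:

```python
import pandas as pd

left = pd.MultiIndex.from_tuples([(1, "a"), (2, "b")], names=["x", "y"])
right = pd.MultiIndex.from_tuples([(3, "c")], names=["p", "q"])

# In pandas >= 2.0 the mismatched names are not kept on the result,
# which is the behavior pandas-on-Spark now follows by passing
# index_names=None in InternalFrame.
appended = left.append(right)
print(len(appended), list(appended.names))
```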
[GitHub] [spark] itholic commented on pull request #40436: [SPARK-42619][PS] Add `show_counts` parameter for DataFrame.info
itholic commented on PR #40436: URL: https://github.com/apache/spark/pull/40436#issuecomment-1705803526 Thanks, @dzhigimont :-)
[GitHub] [spark] zzzzming95 commented on pull request #42804: [SPARK-45071][SQL] Optimize the processing speed of `BinaryArithmetic#dataType` when processing multi-column data
zzzzming95 commented on PR #42804: URL: https://github.com/apache/spark/pull/42804#issuecomment-1705803286 @ulysses-you cc
[GitHub] [spark] zhengruifeng commented on pull request #40436: [SPARK-42619][PS] Add `show_counts` parameter for DataFrame.info
zhengruifeng commented on PR #40436: URL: https://github.com/apache/spark/pull/40436#issuecomment-1705801677 thanks, merged to master
[GitHub] [spark] zhengruifeng closed pull request #40436: [SPARK-42619][PS] Add `show_counts` parameter for DataFrame.info
zhengruifeng closed pull request #40436: [SPARK-42619][PS] Add `show_counts` parameter for DataFrame.info URL: https://github.com/apache/spark/pull/40436
[GitHub] [spark] zhengruifeng commented on pull request #42808: [SPARK-45073][PS][CONNECT] Replace `LastNotNull` with `Last(ignoreNulls=True)`
zhengruifeng commented on PR #42808: URL: https://github.com/apache/spark/pull/42808#issuecomment-1705793177 CI link: https://github.com/zhengruifeng/spark/actions/runs/6072210495 all tests passed
[GitHub] [spark] zhengruifeng commented on pull request #42808: [SPARK-45073][PS][CONNECT] Replace `LastNotNull` with `Last(ignoreNulls=True)`
zhengruifeng commented on PR #42808: URL: https://github.com/apache/spark/pull/42808#issuecomment-1705789957 cc @HyukjinKwon @itholic
[GitHub] [spark] zhengruifeng opened a new pull request, #42808: [SPARK-45073][PS][CONNECT] Replace `LastNotNull` with `Last(ignoreNulls=True)`
zhengruifeng opened a new pull request, #42808: URL: https://github.com/apache/spark/pull/42808 ### What changes were proposed in this pull request? Replace `LastNotNull` with `Last(ignoreNulls=True)` ### Why are the changes needed? https://github.com/apache/spark/pull/36127 introduced a PS dedicated expression `LastNotNull`, which was actually not needed and can be replaced with built-in `Last` ### Does this PR introduce _any_ user-facing change? NO ### How was this patch tested? CI ### Was this patch authored or co-authored using generative AI tooling? NO
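In reduced form, `Last(ignoreNulls=True)` keeps the most recent non-null value over an ordered sequence, which is also what the removed `LastNotNull` computed. A minimal Python sketch of that semantics (plain lists, not Spark columns):

```python
def last_ignore_nulls(values):
    # Walk the ordered values and remember the latest non-null one;
    # an all-null (or empty) input yields null, matching the aggregate's
    # behavior over a frame with no non-null rows.
    result = None
    for v in values:
        if v is not None:
            result = v
    return result

print(last_ignore_nulls([1, None, 3, None]))  # prints 3
```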
[GitHub] [spark] Hisoka-X commented on pull request #42800: [SPARK-45059][CONNECT][FOLLOWUP] Remove `try_reflect` problem filter rule in connect
Hisoka-X commented on PR #42800: URL: https://github.com/apache/spark/pull/42800#issuecomment-1705788141 Thanks @LuciferYang @zhengruifeng @dongjoon-hyun
[GitHub] [spark] zhengruifeng commented on pull request #42794: [SPARK-45066][SQL][PYTHON][CONNECT] Make function `repeat` accept column-type `n`
zhengruifeng commented on PR #42794: URL: https://github.com/apache/spark/pull/42794#issuecomment-1705777234 @dongjoon-hyun thanks for review
[GitHub] [spark] zhengruifeng commented on pull request #42791: [SPARK-45064][PYTHON][CONNECT] Add the missing `scale` parameter in `ceil/ceiling`
zhengruifeng commented on PR #42791: URL: https://github.com/apache/spark/pull/42791#issuecomment-1705777103 @dongjoon-hyun thanks for review
[GitHub] [spark] zhengruifeng commented on pull request #42767: [SPARK-45047][PYTHON][CONNECT] `DataFrame.groupBy` support ordinals
zhengruifeng commented on PR #42767: URL: https://github.com/apache/spark/pull/42767#issuecomment-1705776902 @HyukjinKwon @dongjoon-hyun thanks for review
[GitHub] [spark] dongjoon-hyun commented on pull request #41808: [SPARK-44162][CORE] Support G1GC in spark metrics
dongjoon-hyun commented on PR #41808: URL: https://github.com/apache/spark/pull/41808#issuecomment-1705759712 Thank you, @Hisoka-X , @mridulm and all.
[GitHub] [spark] dongjoon-hyun closed pull request #42791: [SPARK-45064][PYTHON][CONNECT] Add the missing `scale` parameter in `ceil/ceiling`
dongjoon-hyun closed pull request #42791: [SPARK-45064][PYTHON][CONNECT] Add the missing `scale` parameter in `ceil/ceiling` URL: https://github.com/apache/spark/pull/42791
[GitHub] [spark] dongjoon-hyun closed pull request #42767: [SPARK-45047][PYTHON][CONNECT] `DataFrame.groupBy` support ordinals
dongjoon-hyun closed pull request #42767: [SPARK-45047][PYTHON][CONNECT] `DataFrame.groupBy` support ordinals URL: https://github.com/apache/spark/pull/42767
[GitHub] [spark] dongjoon-hyun commented on pull request #42800: [SPARK-45059][CONNECT][FOLLOWUP] Remove `try_reflect` problem filter rule in connect
dongjoon-hyun commented on PR #42800: URL: https://github.com/apache/spark/pull/42800#issuecomment-1705734102 Merged to master for Apache Spark 4.0.0.
[GitHub] [spark] dongjoon-hyun closed pull request #42800: [SPARK-45059][CONNECT][FOLLOWUP] Remove `try_reflect` problem filter rule in connect
dongjoon-hyun closed pull request #42800: [SPARK-45059][CONNECT][FOLLOWUP] Remove `try_reflect` problem filter rule in connect URL: https://github.com/apache/spark/pull/42800
[GitHub] [spark] agubichev commented on a diff in pull request #42778: [SPARK-45055] [SQL] Do not transpose windows if they conflict on ORDER BY / PROJECT clauses
agubichev commented on code in PR #42778: URL: https://github.com/apache/spark/pull/42778#discussion_r1315214615 ## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala: ## @@ -1305,11 +1305,20 @@ object TransposeWindow extends Rule[LogicalPlan] { } } + // Returns false if the parent 'w1' window function orders by any of the window + // expressions produced by 'w2'. + private def compatibleOrderBy(w1: Window, w2: Window): Boolean = { +val childWindowExprs = w2.windowExpressions.map(x => x.toAttribute) +val parentOrder = w1.orderSpec.flatMap(x => x.references) +childWindowExprs.intersect(parentOrder).isEmpty + } + private def windowsCompatible(w1: Window, w2: Window): Boolean = { w1.references.intersect(w2.windowOutputSet).isEmpty && Review Comment: It's a subtle bug. `references` by definition does not include output attributes (see https://github.com/apache/spark/blob/798fce3b571907ee52058004cc38c2e8dbc4b016/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/QueryPlan.scala#L91). So if the top Window function orders by C and projects (outputs) C, then C is not included in references. Therefore, we need that extra check. (And we can't just take all attributes in the Window operator either, since PARTITION BY clauses do need to intersect.)
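The extra check amounts to an attribute-set intersection between the parent window's ORDER BY references and the child window's outputs. A toy Python rendering of the rule (attributes modeled as plain strings; the function name is illustrative, not a Catalyst API):

```python
def compatible_order_by(parent_order_refs, child_window_outputs):
    # Transposing two stacked windows is unsafe when the parent window
    # orders by an attribute the child window produces; `references`
    # misses this conflict because it excludes a plan's own output
    # attributes by definition.
    return not (set(parent_order_refs) & set(child_window_outputs))

# Top window orders by c, bottom window produces c: must not transpose.
print(compatible_order_by({"c"}, {"c"}))  # prints False
```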
[GitHub] [spark] dongjoon-hyun closed pull request #42794: [SPARK-45066][SQL][PYTHON][CONNECT] Make function `repeat` accept column-type `n`
dongjoon-hyun closed pull request #42794: [SPARK-45066][SQL][PYTHON][CONNECT] Make function `repeat` accept column-type `n` URL: https://github.com/apache/spark/pull/42794
[GitHub] [spark] dongjoon-hyun commented on a diff in pull request #42801: [SPARK-45070][SQL][DOCS] Describe the binary and datetime formats of `to_char`/`to_varchar`
dongjoon-hyun commented on code in PR #42801: URL: https://github.com/apache/spark/pull/42801#discussion_r1315213044 ## connector/connect/client/jvm/src/main/scala/org/apache/spark/sql/functions.scala: ## @@ -4284,12 +4285,22 @@ object functions { * prints '+' for positive values but 'MI' prints a space. 'PR': Only allowed at the * end of the format string; specifies that the result string will be wrapped by angle * brackets if the input value is negative. + * If `e` is a datetime, `format` shall be a valid datetime pattern, see + * https://spark.apache.org/docs/latest/sql-ref-datetime-pattern.html";>Datetime Patterns. + * If `e` is a binary, it is converted to a string in one of the formats: + * + * 'base64': a base 64 string. + * 'hex': a string in the hexadecimal format. + * 'utf-8': the input binary is decoded to UTF-8 string. + * Review Comment: To @MaxGekk , it seems our Linter CI doesn't allow this kind of human-friendly formatting in the comment. - https://github.com/MaxGekk/spark/actions/runs/6076537960/job/16484751098 ``` The scalafmt check failed on connector/connect at following occurrences: Requires formatting: functions.scala Before submitting your change, please make sure to format your code using the following command: ./build/mvn -Pscala-2.12 scalafmt:format -Dscalafmt.skip=false -Dscalafmt.validateOnly=false -Dscalafmt.changedOnly=false -pl connector/connect/common -pl connector/connect/server -pl connector/connect/client/jvm Error: Process completed with exit code 1. ``` The recommended style is something like the following. Please fix the style by running the above command. ``` - * brackets if the input value is negative. - * If `e` is a datetime, `format` shall be a valid datetime pattern, see - * https://spark.apache.org/docs/latest/sql-ref-datetime-pattern.html";>Datetime Patterns. - * If `e` is a binary, it is converted to a string in one of the formats: - * - * 'base64': a base 64 string. - * 'hex': a string in the hexadecimal format. - * 'utf-8': the input binary is decoded to UTF-8 string. - * + * brackets if the input value is negative. If `e` is a datetime, `format` shall be + * a valid datetime pattern, see https://spark.apache.org/docs/latest/sql-ref-datetime-pattern.html";>Datetime + * Patterns. If `e` is a binary, it is converted to a string in one of the formats: + * 'base64': a base 64 string. 'hex': a string in the hexadecimal format. + * 'utf-8': the input binary is decoded to UTF-8 string. ```
[GitHub] [spark] dongjoon-hyun commented on pull request #42789: [SPARK-45063][PYTHON][DOCS] Refine docstring of `max_by/min_by`
dongjoon-hyun commented on PR #42789: URL: https://github.com/apache/spark/pull/42789#issuecomment-1705721562 Merged to master for Apache Spark 4.0.0. Thank you, @LuciferYang and @zhengruifeng.
[GitHub] [spark] dongjoon-hyun closed pull request #42789: [SPARK-45063][PYTHON][DOCS] Refine docstring of `max_by/min_by`
dongjoon-hyun closed pull request #42789: [SPARK-45063][PYTHON][DOCS] Refine docstring of `max_by/min_by` URL: https://github.com/apache/spark/pull/42789
[GitHub] [spark] dongjoon-hyun closed pull request #42749: [SPARK-45031][INFRA] Choose the right merge code path and merge hash for reopened PRs
dongjoon-hyun closed pull request #42749: [SPARK-45031][INFRA] Choose the right merge code path and merge hash for reopened PRs URL: https://github.com/apache/spark/pull/42749
[GitHub] [spark] sadikovi commented on pull request #42667: [SPARK-44940][SQL] Improve performance of JSON parsing when "spark.sql.json.enablePartialResults" is enabled
sadikovi commented on PR #42667: URL: https://github.com/apache/spark/pull/42667#issuecomment-1705718755 Great, thanks. I will follow up on your comment regarding the possible regression in the parsing benchmark.
[GitHub] [spark] sadikovi commented on a diff in pull request #42667: [SPARK-44940][SQL] Improve performance of JSON parsing when "spark.sql.json.enablePartialResults" is enabled
sadikovi commented on code in PR #42667: URL: https://github.com/apache/spark/pull/42667#discussion_r1315208847

## sql/core/benchmarks/JsonBenchmark-results.txt:

@@ -3,121 +3,125 @@ Benchmark for performance of JSON parsing

Preparing data for benchmarking ...
-OpenJDK 64-Bit Server VM 1.8.0_362-b09 on Linux 5.15.0-1037-azure
-Intel(R) Xeon(R) CPU E5-2673 v3 @ 2.40GHz
+OpenJDK 64-Bit Server VM 1.8.0_292-8u292-b10-0ubuntu1~18.04-b10 on Linux 5.4.0-1045-aws
+Intel(R) Xeon(R) Platinum 8375C CPU @ 2.90GHz

JSON schema inferring:        Best Time(ms)  Avg Time(ms)  Stdev(ms)  Rate(M/s)  Per Row(ns)  Relative
-No encoding                            3720          3843        121        1.3        743.9      1.0X
-UTF-8 is set                           5412          5455         45        0.9       1082.4      0.7X
+No encoding                            2084          2134         46        2.4        416.8      1.0X
+UTF-8 is set                           3077          3093         14        1.6        615.3      0.7X

Preparing data for benchmarking ...
-OpenJDK 64-Bit Server VM 1.8.0_362-b09 on Linux 5.15.0-1037-azure
-Intel(R) Xeon(R) CPU E5-2673 v3 @ 2.40GHz
+OpenJDK 64-Bit Server VM 1.8.0_292-8u292-b10-0ubuntu1~18.04-b10 on Linux 5.4.0-1045-aws
+Intel(R) Xeon(R) Platinum 8375C CPU @ 2.90GHz

count a short column:         Best Time(ms)  Avg Time(ms)  Stdev(ms)  Rate(M/s)  Per Row(ns)  Relative
-No encoding                            3234          3254         33        1.5        646.7      1.0X
-UTF-8 is set                           4847          4868         21        1.0        969.5      0.7X
+No encoding                            2854          2863          8        1.8        570.8      1.0X
+UTF-8 is set                           4066          4066          1        1.2        813.1      0.7X

Preparing data for benchmarking ...
-OpenJDK 64-Bit Server VM 1.8.0_362-b09 on Linux 5.15.0-1037-azure
-Intel(R) Xeon(R) CPU E5-2673 v3 @ 2.40GHz
+OpenJDK 64-Bit Server VM 1.8.0_292-8u292-b10-0ubuntu1~18.04-b10 on Linux 5.4.0-1045-aws
+Intel(R) Xeon(R) Platinum 8375C CPU @ 2.90GHz

count a wide column:          Best Time(ms)  Avg Time(ms)  Stdev(ms)  Rate(M/s)  Per Row(ns)  Relative
-No encoding                            5702          5794        101        0.2       5702.1      1.0X
-UTF-8 is set                           9526          9607         73        0.1       9526.1      0.6X
+No encoding                            3348          3368         26        0.3       3347.8      1.0X
+UTF-8 is set                           5215          5239         22        0.2       5214.7      0.6X

Preparing data for benchmarking ...
-OpenJDK 64-Bit Server VM 1.8.0_362-b09 on Linux 5.15.0-1037-azure
-Intel(R) Xeon(R) CPU E5-2673 v3 @ 2.40GHz
+OpenJDK 64-Bit Server VM 1.8.0_292-8u292-b10-0ubuntu1~18.04-b10 on Linux 5.4.0-1045-aws
+Intel(R) Xeon(R) Platinum 8375C CPU @ 2.90GHz

select wide row:              Best Time(ms)  Avg Time(ms)  Stdev(ms)  Rate(M/s)  Per Row(ns)  Relative
-No encoding                           18318         18448        199        0.0     366367.7      1.0X
-UTF-8 is set                          19791         19887         99        0.0     395817.1      0.9X
+No encoding                           11046         11102         54        0.0     220928.4      1.0X
+UTF-8 is set                          12135         12181         54        0.0     242697.4      0.9X

Preparing data for benchmarking ...
-OpenJDK 64-Bit Server VM 1.8.0_362-b09 on Linux 5.15.0-1037-azure
-Intel(R) Xeon(R) CPU E5-2673 v3 @ 2.40GHz
+OpenJDK 64-Bit Server VM 1.8.0_292-8u292-b10-0ubuntu1~18.04-b10 on Linux 5.4.0-1045-aws
+Intel(R) Xeon(R) Platinum 8375C CPU @ 2.90GHz

Select a subset of 10 columns:  Best Time(ms)  Avg Time(ms)  Stdev(ms)  Rate(M/s)  Per Row(ns)  Relative
-Select 10 columns
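Reading these tables: the "Relative" column appears to be the baseline row's best time divided by the current row's best time, rounded to one decimal, so values below 1.0X mean "slower than the baseline". A minimal sketch of that arithmetic (class and method names are illustrative, not Spark's benchmark framework), checked against the "JSON schema inferring" numbers quoted above:

```java
public class RelativeCheck {
    // Relative = bestTime(baseline) / bestTime(row), rounded to one decimal place.
    static double relative(double baselineMs, double rowMs) {
        return Math.round(baselineMs / rowMs * 10.0) / 10.0;
    }

    public static void main(String[] args) {
        // Before the patch: "No encoding" 3720 ms (baseline) vs "UTF-8 is set" 5412 ms.
        System.out.println(relative(3720, 5412)); // matches the reported 0.7X
        // After the patch: 2084 ms vs 3077 ms.
        System.out.println(relative(2084, 3077)); // still 0.7X
    }
}
```

This also explains why both the old and new runs can report the same 0.7X despite very different absolute times: the ratio between the two encodings barely changed, even though both got faster.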
[GitHub] [spark] dongjoon-hyun closed pull request #42792: [SPARK-44940][SQL][3.4] Improve performance of JSON parsing when "spark.sql.json.enablePartialResults" is enabled
dongjoon-hyun closed pull request #42792: [SPARK-44940][SQL][3.4] Improve performance of JSON parsing when "spark.sql.json.enablePartialResults" is enabled URL: https://github.com/apache/spark/pull/42792
[GitHub] [spark] dongjoon-hyun closed pull request #42790: [SPARK-44940][SQL][3.5] Improve performance of JSON parsing when "spark.sql.json.enablePartialResults" is enabled
dongjoon-hyun closed pull request #42790: [SPARK-44940][SQL][3.5] Improve performance of JSON parsing when "spark.sql.json.enablePartialResults" is enabled URL: https://github.com/apache/spark/pull/42790
[GitHub] [spark] dongjoon-hyun commented on a diff in pull request #42667: [SPARK-44940][SQL] Improve performance of JSON parsing when "spark.sql.json.enablePartialResults" is enabled
dongjoon-hyun commented on code in PR #42667: URL: https://github.com/apache/spark/pull/42667#discussion_r1315206808

## sql/core/benchmarks/JsonBenchmark-results.txt:
[GitHub] [spark] dongjoon-hyun commented on a diff in pull request #42667: [SPARK-44940][SQL] Improve performance of JSON parsing when "spark.sql.json.enablePartialResults" is enabled
dongjoon-hyun commented on code in PR #42667: URL: https://github.com/apache/spark/pull/42667#discussion_r1315205770

## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/json/JacksonParser.scala:

@@ -589,12 +605,25 @@ class JacksonParser(
         throw BadRecordException(
           record = () => recordLiteral(record),
           partialResults = () => Array(row),
-          cause)
+          convertCauseForPartialResult(cause))
       case PartialResultArrayException(rows, cause) =>
         throw BadRecordException(
           record = () => recordLiteral(record),
           partialResults = () => rows,
           cause)
+      // These exceptions should never be thrown outside of JacksonParser.
+      // They are used for the control flow in the parser. We add them here for completeness
+      // since they also indicate a bad record.

Review Comment: Thank you for adding this.
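The pattern being reviewed above, internal exceptions used purely for control flow inside a parser, but still translated into the public "bad record" error at the parser boundary in case they ever surface, can be sketched outside of Spark roughly as follows. All class and method names here are illustrative stand-ins, not the actual Spark exception types:

```java
public class ControlFlowParse {
    // Internal signal: never meant to escape the parser. It carries the value
    // parsed so far, analogous to a partial result.
    static class PartialValueException extends RuntimeException {
        final Object partial;
        PartialValueException(Object partial) { this.partial = partial; }
    }

    // Public error: what callers are allowed to see for a malformed record.
    static class BadRecordException extends RuntimeException {
        final String record;
        BadRecordException(String record, Throwable cause) { super(cause); this.record = record; }
    }

    // Inner routine: throws the control-flow exception once it hits a non-digit
    // after having parsed at least one digit.
    static int parseDigitsPrefix(String record) {
        int value = 0;
        boolean any = false;
        for (char ch : record.toCharArray()) {
            if (ch < '0' || ch > '9') {
                if (any) throw new PartialValueException(value);
                throw new BadRecordException(record, null);
            }
            value = value * 10 + (ch - '0');
            any = true;
        }
        return value;
    }

    // Parser boundary: catch the internal exception and convert it, "for
    // completeness", into the public bad-record error with the cause attached.
    static int parse(String record) {
        try {
            return parseDigitsPrefix(record);
        } catch (PartialValueException e) {
            throw new BadRecordException(record, e);
        }
    }

    public static void main(String[] args) {
        System.out.println(parse("123")); // 123
        try {
            parse("12x");
        } catch (BadRecordException e) {
            System.out.println("bad record: " + e.record);
        }
    }
}
```

The point of listing the internal exceptions in the boundary match, as the diff does, is defensive: if the control-flow exception ever leaks, the caller still gets a well-formed bad-record error rather than an unexplained internal type.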
[GitHub] [spark] dongjoon-hyun closed pull request #42757: [SPARK-45036][SQL] SPJ: Simplify the logic to handle partially clustered distribution
dongjoon-hyun closed pull request #42757: [SPARK-45036][SQL] SPJ: Simplify the logic to handle partially clustered distribution URL: https://github.com/apache/spark/pull/42757
[GitHub] [spark] dongjoon-hyun commented on pull request #42757: [SPARK-45036][SQL] SPJ: Simplify the logic to handle partially clustered distribution
dongjoon-hyun commented on PR #42757: URL: https://github.com/apache/spark/pull/42757#issuecomment-1705698868 Merged to master for Apache Spark 4.0.0.
[GitHub] [spark] gbloisi-openaire commented on a diff in pull request #42634: [SPARK-44910][SQL] Encoders.bean does not support superclasses with generic type arguments
gbloisi-openaire commented on code in PR #42634: URL: https://github.com/apache/spark/pull/42634#discussion_r1315180662

## sql/api/src/main/scala/org/apache/spark/sql/catalyst/JavaTypeInference.scala:

@@ -156,4 +158,17 @@ object JavaTypeInference {
     .filterNot(_.getName == "declaringClass")
     .filter(_.getReadMethod != null)
 }
+
+  @tailrec
+  def getClassHierarchyTypeArguments(cls: Class[_],

Review Comment: On second thought, this recursive getClassHierarchyTypeArguments is not required, because JavaTypeUtils.getTypeArguments(cls, classOf[Object]) already traverses the inheritance hierarchy up to the Object base class while collecting type variables. So I just removed it.

To be clear, what we are doing here is collecting type-variable information about the base class (which in turn could be derived from another base class, forming a hierarchy that needs to be traversed). This is required both for the provided class and for nested beans that extend one or more generic base classes in their inheritance hierarchy (tests provided). Interfaces do not need to be processed because they declare abstract methods only, so the definition of a method (and its actual return type) comes from the class implementing the interface.
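The traversal described above, walking the superclass chain and recording what each type variable is bound to, so that a subclass's binding can resolve a variable used further up, can be illustrated with plain java.lang.reflect. This is a hedged sketch of what a single getTypeArguments-style call does internally (the Base/Mid/Bean hierarchy is hypothetical, and commons-lang's TypeUtils handles more cases, e.g. interfaces and wildcards):

```java
import java.lang.reflect.ParameterizedType;
import java.lang.reflect.Type;
import java.lang.reflect.TypeVariable;
import java.util.HashMap;
import java.util.Map;

public class TypeArgs {
    // Hypothetical generic bean hierarchy, for illustration only.
    static class Base<T> { }
    static class Mid<A, B> extends Base<B> { }
    static class Bean extends Mid<String, Integer> { }

    // Walk the superclass chain bottom-up, recording each type variable's binding.
    // Because subclass bindings are recorded first, a superclass argument that is
    // itself a type variable (Base<B> above) can be chased to a concrete type.
    static Map<TypeVariable<?>, Type> collect(Class<?> cls) {
        Map<TypeVariable<?>, Type> bindings = new HashMap<>();
        for (Class<?> c = cls; c != null && c != Object.class; c = c.getSuperclass()) {
            Type sup = c.getGenericSuperclass();
            if (sup instanceof ParameterizedType) {
                ParameterizedType pt = (ParameterizedType) sup;
                TypeVariable<?>[] vars = ((Class<?>) pt.getRawType()).getTypeParameters();
                Type[] args = pt.getActualTypeArguments();
                for (int i = 0; i < vars.length; i++) {
                    Type arg = args[i];
                    // Resolve an argument that is a variable already bound lower down.
                    while (arg instanceof TypeVariable && bindings.containsKey(arg)) {
                        arg = bindings.get(arg);
                    }
                    bindings.put(vars[i], arg);
                }
            }
        }
        return bindings;
    }

    public static void main(String[] args) {
        // For Bean: A -> String, B -> Integer, and Base's T resolves through B to Integer.
        collect(Bean.class).forEach((v, t) -> System.out.println(
            v.getGenericDeclaration() + "." + v.getName() + " -> " + t.getTypeName()));
    }
}
```

The one loop covers the whole hierarchy, which is why the extra tail-recursive helper in the diff turned out to be redundant.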
[GitHub] [spark] hvanhovell commented on a diff in pull request #42807: [SPARK-45072][CONNECT] Fix outer scopes for ammonite classes
hvanhovell commented on code in PR #42807: URL: https://github.com/apache/spark/pull/42807#discussion_r1315168784

## connector/connect/client/jvm/src/test/scala/org/apache/spark/sql/application/ReplE2ESuite.scala:

@@ -263,6 +265,31 @@ class ReplE2ESuite extends RemoteSparkSession with BeforeAndAfterEach {
     assertContains("Array[org.apache.spark.sql.Row] = Array([id1,1], [id2,16], [id3,25])", output)
   }
+
+  test("Single Cell Compilation") {

Review Comment: This is more of a test to test the test :)
[GitHub] [spark] hvanhovell commented on pull request #42807: [SPARK-45072][CONNECT] Fix outer scopes for ammonite classes
hvanhovell commented on PR #42807: URL: https://github.com/apache/spark/pull/42807#issuecomment-1705586484 cc @vsevolodstep-db
[GitHub] [spark] hvanhovell opened a new pull request, #42807: [SPARK-45072][CONNECT] Fix outer scopes for ammonite classes
hvanhovell opened a new pull request, #42807: URL: https://github.com/apache/spark/pull/42807

### What changes were proposed in this pull request?

Ammonite places all user code inside Helper classes which are nested inside the class it creates for each command. This PR adds a custom code class wrapper for the Ammonite REPL. It makes sure the Helper classes generated by Ammonite are always registered as an outer scope immediately. This way we can instantiate classes defined inside the Helper class, even when we execute Spark code as part of the Helper's constructor.

### Why are the changes needed?

When you currently define a class and execute a Spark command using that class inside the same cell/line, this fails with a NullPointerException. The reason is that we cannot resolve the outer scope needed to instantiate the class. This PR fixes that issue. The following code will now execute successfully (include the curly braces):

```scala
{
  case class Thing(val value: String)
  val r = (0 to 10).map( value => Thing(value.toString) )
  spark.createDataFrame(r)
}
```

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

I added more tests to the `ReplE2ESuite`.

### Was this patch authored or co-authored using generative AI tooling?

No.
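The reason an "outer scope" is needed at all: a non-static nested class cannot be instantiated without an instance of its enclosing class, and Ammonite's generated Helper classes put user-defined classes in exactly that position. The same constraint is easy to see in plain Java, where the compiler adds the enclosing instance as a hidden first constructor parameter (class names below are hypothetical stand-ins for Ammonite's generated classes):

```java
import java.lang.reflect.Constructor;

public class OuterScopeDemo {
    // Stand-in for Ammonite's generated per-command class / Helper.
    public static class Helper {
        public class Thing {           // inner class: holds a hidden reference to a Helper
            public final String value;
            public Thing(String value) { this.value = value; }
        }
    }

    public static void main(String[] args) throws Exception {
        // Reflective instantiation (what an encoder framework effectively does)
        // must supply the enclosing instance as the first constructor argument.
        Constructor<Helper.Thing> ctor =
            Helper.Thing.class.getDeclaredConstructor(Helper.class, String.class);
        Helper outer = new Helper();   // the "registered outer scope"
        Helper.Thing t = ctor.newInstance(outer, "7");
        System.out.println(t.value);   // 7
    }
}
```

Without a known `outer` instance the constructor simply cannot be invoked, which is why registering the Helper as an outer scope immediately, before any Spark code in the Helper's constructor runs, resolves the NullPointerException described above.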
[GitHub] [spark] juliuszsompolski commented on pull request #42806: [SPARK-44833][CONNECT] Fix sending Reattach too fast after Execute
juliuszsompolski commented on PR #42806: URL: https://github.com/apache/spark/pull/42806#issuecomment-1705547797 @hvanhovell @HyukjinKwon