[PR] [SPARK-48332][BUILD][TESTS] Upgrade `jdbc` related test dependencies [spark]
panbingkun opened a new pull request, #46653: URL: https://github.com/apache/spark/pull/46653 ### What changes were proposed in this pull request? ### Why are the changes needed? ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? ### Was this patch authored or co-authored using generative AI tooling? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-43829][CONNECT] Improve SparkConnectPlanner by reuse Dataset and avoid construct new Dataset [spark]
github-actions[bot] closed pull request #43473: [SPARK-43829][CONNECT] Improve SparkConnectPlanner by reuse Dataset and avoid construct new Dataset URL: https://github.com/apache/spark/pull/43473
Re: [PR] [SPARK-46617][SQL] Create-table-if-not-exists should not silently overwrite existing data-files [spark]
github-actions[bot] commented on PR #44622: URL: https://github.com/apache/spark/pull/44622#issuecomment-2119040251 We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable. If you'd like to revive this PR, please reopen it and ask a committer to remove the Stale tag!
Re: [PR] [SPARK-46971][SQL] When the `compression` is null, a `NullPointException` should not be thrown [spark]
github-actions[bot] commented on PR #45015: URL: https://github.com/apache/spark/pull/45015#issuecomment-2119040244 We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable. If you'd like to revive this PR, please reopen it and ask a committer to remove the Stale tag!
Re: [PR] assorted copy edits to migration instructions [spark]
github-actions[bot] closed pull request #45048: assorted copy edits to migration instructions URL: https://github.com/apache/spark/pull/45048
Re: [PR] [SPARK-48031] Decompose viewSchemaMode config, add SHOW CREATE TABLE support [spark]
srielau commented on PR #46652: URL: https://github.com/apache/spark/pull/46652#issuecomment-2119020457 @gengliangwang @cloud-fan
[PR] [SPARK-48031] Decompose viewSchemaMode config, add SHOW CREATE TABLE support [spark]
srielau opened a new pull request, #46652: URL: https://github.com/apache/spark/pull/46652

### What changes were proposed in this pull request?

We separate enablement of the WITH SCHEMA ... clause from the change in default from SCHEMA BINDING to SCHEMA COMPENSATION. This allows users to upgrade in three steps:
1. Enable the feature, and deal with DESCRIBE EXTENDED.
2. Get their affairs in order by ALTER VIEW to SCHEMA BINDING for those views they aim to keep in that mode.
3. Switch the default.

### Why are the changes needed?

It allows customers to upgrade more safely.

### Does this PR introduce _any_ user-facing change?

Yes

### How was this patch tested?

Added more tests

### Was this patch authored or co-authored using generative AI tooling?

No
Re: [PR] [SPARK-48220][PYTHON] Allow passing PyArrow Table to createDataFrame() [spark]
ianmcook commented on code in PR #46529: URL: https://github.com/apache/spark/pull/46529#discussion_r1605752402

## python/pyspark/sql/tests/connect/test_parity_arrow.py:
```diff
@@ -31,8 +31,17 @@ def test_createDataFrame_fallback_disabled(self):
     def test_createDataFrame_fallback_enabled(self):
         super().test_createDataFrame_fallback_enabled()

-    def test_createDataFrame_with_map_type(self):
-        self.check_createDataFrame_with_map_type(True)
+    def test_createDataFrame_pandas_with_map_type(self):
+        self.check_createDataFrame_pandas_with_map_type(True)
+
+    def test_createDataFrame_pandas_with_struct_type(self):
+        self.check_createDataFrame_pandas_with_struct_type(True)
+
+    def test_createDataFrame_arrow_with_struct_type(self):
```

Review Comment: Ah, right. Done in 90ea328. Thanks.
Re: [PR] [SPARK-47353][SQL] Enable collation support for the Mode expression using GroupMapReduce [V2] [spark]
GideonPotok commented on PR #46597: URL: https://github.com/apache/spark/pull/46597#issuecomment-2118994013

What I would really like to try is to move from this implementation to an approach that has the collation-support logic in the PartialAggregation stage, by moving logic into `Mode.merge` and `Mode.update`. I would use a modified open hash map for that, with hashing based on the collation key and a separate map from each collation key to one of the actual values observed that maps to that collation key (which experimentation has shown could work). But as it has already been a couple of weeks of development on this, I believe we should, for this PR, confine all the collation logic to the stage that can't be serialized and deserialized -- the `eval` stage. I should then try what I have described above in a PR raised after we have merged the approach that has already been tested (i.e. this PR).
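The two-map idea described in the comment (an open hash map keyed by the collation key, plus a side map from each collation key to one representative observed value) can be sketched in plain Python. This is an illustration only, not Spark code: `collation_key` here is a stand-in (simple lowercasing, mimicking a case-insensitive collation), and all names are hypothetical.

```python
from collections import Counter


def collation_key(s: str) -> str:
    # Stand-in for a real collation key; lowercasing approximates a
    # case-insensitive collation such as UTF8_BINARY_LCASE.
    return s.lower()


def mode_with_collation(values):
    counts = Counter()   # occurrence counts, keyed by collation key
    representative = {}  # collation key -> first raw value observed for it
    for v in values:
        k = collation_key(v)
        counts[k] += 1
        representative.setdefault(k, v)
    # Return an actual observed value, not the normalized key.
    best_key, _ = counts.most_common(1)[0]
    return representative[best_key]
```

For example, `mode_with_collation(["a", "A", "b"])` groups `"a"` and `"A"` under the same collation key and returns the first observed spelling, `"a"`.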
Re: [PR] [SPARK-48220][PYTHON] Allow passing PyArrow Table to createDataFrame() [spark]
ianmcook commented on code in PR #46529: URL: https://github.com/apache/spark/pull/46529#discussion_r1605752402

## python/pyspark/sql/tests/connect/test_parity_arrow.py:
```diff
@@ -31,8 +31,17 @@ def test_createDataFrame_fallback_disabled(self):
     def test_createDataFrame_fallback_enabled(self):
         super().test_createDataFrame_fallback_enabled()

-    def test_createDataFrame_with_map_type(self):
-        self.check_createDataFrame_with_map_type(True)
+    def test_createDataFrame_pandas_with_map_type(self):
+        self.check_createDataFrame_pandas_with_map_type(True)
+
+    def test_createDataFrame_pandas_with_struct_type(self):
+        self.check_createDataFrame_pandas_with_struct_type(True)
+
+    def test_createDataFrame_arrow_with_struct_type(self):
```

Review Comment: Ah, right. Done. Thanks.
Re: [PR] [SPARK-48220][PYTHON] Allow passing PyArrow Table to createDataFrame() [spark]
ianmcook commented on code in PR #46529: URL: https://github.com/apache/spark/pull/46529#discussion_r1605775356

## python/pyspark/sql/tests/typing/test_session.yml:
```diff
@@ -51,25 +51,6 @@
       spark.createDataFrame(["foo", "bar"], "string")

-- case: createDataFrameScalarsInvalid
```

Review Comment: I added tests in 40072d0 to ensure we still cover these error conditions somewhere in the tests.
Re: [PR] [SPARK-47353][SQL] Enable collation support for the Mode expression using GroupMapReduce [V2] [spark]
GideonPotok commented on PR #46597: URL: https://github.com/apache/spark/pull/46597#issuecomment-2118991669

> > since Mode expression works with any child expression, and you special-cased handling Strings, how do we handle Array(String) and Struct(String), etc.?
>
> In my local tests, I found that Mode performs a byte-by-byte comparison for structs, which does not consider collation. So that is still outstanding. Good catch!
>
> @uros-db There are several strategies we might adopt to handle structs with collation fields. I am looking into implementations. It is potentially straightforward, though it has some gotchas.
>
> Do you feel I should solve that in a separate PR or in this one? I assume you prefer that this gets solved in this PR and not a follow-up PR, right?

@uros-db Added an implementation for Mode to support structs with fields with the various collations. Performance is not great, so far:

```
[info] collation unit benchmarks - mode - 30105 elements:  Best Time(ms)  Avg Time(ms)  Stdev(ms)  Rate(M/s)  Per Row(ns)  Relative
[info] -----------------------------------------------------------------------------------------------------------------------------
[info] UTF8_BINARY_LCASE - mode - 30105 elements                      31            32          1        9.8        102.3       1.0X
[info] UNICODE - mode - 30105 elements                                 1             1          0      240.4          4.2      24.6X
[info] UTF8_BINARY - mode - 30105 elements                             1             1          0      239.1          4.2      24.5X
[info] UNICODE_CI - mode - 30105 elements                             57            59          2        5.3        189.9       0.5X
```

I will add the benchmark results from GHA once I get your feedback.
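The struct case discussed above (a byte-by-byte comparison ignores collation, so string fields need a collation-aware grouping key) can be sketched in plain Python. This is a hypothetical illustration, not the PR's Scala implementation: `collation_key` is a lowercasing stand-in for a real collation key, and structs are modeled as tuples.

```python
from collections import Counter


def collation_key(s: str) -> str:
    # Stand-in for a real collation key (case-insensitive collation).
    return s.lower()


def struct_key(row: tuple) -> tuple:
    # Normalize only the string fields; other fields compare as-is,
    # which is the byte/value comparison the plain case already gets right.
    return tuple(collation_key(f) if isinstance(f, str) else f for f in row)


def mode_of_structs(rows):
    counts = Counter()
    representative = {}
    for r in rows:
        k = struct_key(r)
        counts[k] += 1
        representative.setdefault(k, r)
    return representative[counts.most_common(1)[0][0]]
```

Under this scheme `("A", 1)` and `("a", 1)` share a grouping key, so `mode_of_structs([("A", 1), ("a", 1), ("b", 2)])` returns the first observed struct of the winning group, `("A", 1)`.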
[PR] [SPARK-48330][SS][PYTHON] Fix the python streaming data source timeout issue for large trigger interval [spark]
chaoqin-li1123 opened a new pull request, #46651: URL: https://github.com/apache/spark/pull/46651

### What changes were proposed in this pull request?

Fix the Python streaming data source timeout issue for large trigger intervals. For the Python streaming source, keep the long-running worker architecture but set the socket timeout to infinity to avoid timeout errors. For the Python streaming sink, since a StreamingWrite is also created per microbatch on the Scala side, a long-running worker cannot be attached to a StreamingWrite instance. Therefore we abandon the long-running worker architecture, simply call commit() or abort() and exit the worker, and allow Spark to reuse workers for us.

### Why are the changes needed?

Currently we run long-running Python worker processes for the Python streaming source and sink to perform planning, commit, and abort on the driver side. Testing indicates that the current implementation causes connection timeout errors when a streaming query has a large trigger interval.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Added an integration test

### Was this patch authored or co-authored using generative AI tooling?
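The "socket timeout set to infinity" behavior the description relies on can be illustrated with the Python standard library; this shows stdlib `socket` semantics only, not Spark's actual worker code. A timeout of `None` puts the socket in blocking mode, so operations wait indefinitely instead of raising a timeout error during a long trigger interval.

```python
import socket

# A finite timeout makes blocked reads raise socket.timeout once it elapses,
# which is the failure mode described for large trigger intervals.
s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
s.settimeout(30.0)
assert s.gettimeout() == 30.0

# Timeout of None = blocking mode: no timeout can fire, however long the
# worker sits idle between microbatches.
s.settimeout(None)
assert s.gettimeout() is None
s.close()
```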
Re: [PR] [SPARK-44838][SQL][FOLLOW-UP] Fix the test for raise_error by using default type for strings [spark]
HyukjinKwon commented on PR #46649: URL: https://github.com/apache/spark/pull/46649#issuecomment-2118804792 Sure, that sounds like a more localized fix
[PR] [SPARK-48329][SQL] Turn on `spark.sql.sources.v2.bucketing.pushPartValues.enabled` by default [spark]
superdiaodiao opened a new pull request, #46650: URL: https://github.com/apache/spark/pull/46650

### What changes were proposed in this pull request?

The SPJ (Storage-Partitioned Join) feature flags `spark.sql.sources.v2.bucketing.enabled` and `spark.sql.sources.v2.bucketing.pushPartValues.enabled` are set to `true`.

### Why are the changes needed?

The SPJ feature flag `spark.sql.sources.v2.bucketing.pushPartValues.enabled` has proven valuable for most use cases. We should take advantage of the 4.0 release and change the value to true.

### Does this PR introduce _any_ user-facing change?

No

### Was this patch authored or co-authored using generative AI tooling?

No
Re: [PR] assorted copy edits to migration instructions [spark]
elharo commented on PR #45048: URL: https://github.com/apache/spark/pull/45048#issuecomment-2118783045 Instead of having a bot autoclose PRs, perhaps someone should review them?
Re: [PR] [SPARK-48220][PYTHON] Allow passing PyArrow Table to createDataFrame() [spark]
ianmcook commented on code in PR #46529: URL: https://github.com/apache/spark/pull/46529#discussion_r1605752402

## python/pyspark/sql/tests/connect/test_parity_arrow.py:
```diff
@@ -31,8 +31,17 @@ def test_createDataFrame_fallback_disabled(self):
     def test_createDataFrame_fallback_enabled(self):
         super().test_createDataFrame_fallback_enabled()

-    def test_createDataFrame_with_map_type(self):
-        self.check_createDataFrame_with_map_type(True)
+    def test_createDataFrame_pandas_with_map_type(self):
+        self.check_createDataFrame_pandas_with_map_type(True)
+
+    def test_createDataFrame_pandas_with_struct_type(self):
+        self.check_createDataFrame_pandas_with_struct_type(True)
+
+    def test_createDataFrame_arrow_with_struct_type(self):
```

Review Comment: Ah, right. Done in 292d3c8. Thanks.
Re: [PR] [SPARK-44838][SQL][FOLLOW-UP] Fix the test for raise_error by using default type for strings [spark]
uros-db commented on PR #46649: URL: https://github.com/apache/spark/pull/46649#issuecomment-2118751764 @HyukjinKwon I believe you could also use: `AbstractMapType(StringTypeAnyCollation, StringTypeAnyCollation)` for `inputTypes` in `RaiseError` (misc.scala) instead of `MapType(StringType, StringType)`
Re: [PR] [SPARK-44838][SQL][FOLLOW-UP] Fix the test for raise_error by using default type for strings [spark]
uros-db commented on PR #46649: URL: https://github.com/apache/spark/pull/46649#issuecomment-2118751278 It was my understanding that this wouldn't be a problem, since this second parameter (MapType) is only used internally in Spark to raise errors
Re: [PR] [SPARK-44838][SQL][FOLLOW-UP] Fix the test for raise_error by using default type for strings [spark]
HyukjinKwon commented on PR #46649: URL: https://github.com/apache/spark/pull/46649#issuecomment-2118744355

For a bit more context, the test fails as below:

```
org.apache.spark.sql.AnalysisException: [DATATYPE_MISMATCH.UNEXPECTED_INPUT_TYPE] Cannot resolve "raise_error(USER_RAISED_EXCEPTION, map(errorMessage, 'aa' collate UTF8_BINARY_LCASE))" due to data type mismatch: The second parameter requires the "MAP" type, however "map(errorMessage, 'aa' collate UTF8_BINARY_LCASE)" has the type "MAP". SQLSTATE: 42K09; line 1 pos 7;
'Project [unresolvedalias(raise_error(cast(USER_RAISED_EXCEPTION as string collate UTF8_BINARY_LCASE), map(errorMessage, aa), NullType))]
+- OneRowRelation
    at org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.dataTypeMismatch(package.scala:73)
    at org.apache.spark.sql.catalyst.analysis.CheckAnalysis.$anonfun$checkAnalysis0$7(CheckAnalysis.scala:315)
    at org.apache.spark.sql.catalyst.analysis.CheckAnalysis.$anonfun$checkAnalysis0$7$adapted(CheckAnalysis.scala:302)
    at org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:244)
    at org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$foreachUp$1(TreeNode.scala:243)
    at org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$foreachUp$1$adapted(TreeNode.scala:243)
    at scala.collection.immutable.Vector.foreach(Vector.scala:2124)
    at org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:243)
    at org.apache.spark.sql.catalyst.analysis.CheckAnalysis.$anonfun$checkAnalysis0$6(CheckAnalysis.scala:302)
    at org.apache.spark.sql.catalyst.analysis.CheckAnalysis.$anonfun$checkAnalysis0$6$adapted(CheckAnalysis.scala:302)
    at scala.collection.immutable.List.foreach(List.scala:334)
    at org.apache.spark.sql.catalyst.analysis.CheckAnalysis.$anonfun$checkAnalysis0$2(CheckAnalysis.scala:302)
    at org.apache.spark.sql.catalyst.analysis.CheckAnalysis.$anonfun$checkAnalysis0$2$adapted(CheckAnalysis.scala:216)
    at org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:244)
    at org.apache.spark.sql.catalyst.analysis.CheckAnalysis.checkAnalysis0(CheckAnalysis.scala:216)
    at org.apache.spark.sql.catalyst.analysis.CheckAnalysis.checkAnalysis0$(CheckAnalysis.scala:198)
    at org.apache.spark.sql.catalyst.analysis.Analyzer.checkAnalysis0(Analyzer.scala:192)
    at org.apache.spark.sql.catalyst.analysis.CheckAnalysis.checkAnalysis(CheckAnalysis.scala:190)
    at org.apache.spark.sql.catalyst.analysis.CheckAnalysis.checkAnalysis$(CheckAnalysis.scala:161)
    at org.apache.spark.sql.catalyst.analysis.Analyzer.checkAnalysis(Analyzer.scala:192)
    at org.apache.spark.sql.catalyst.analysis.Analyzer.$anonfun$executeAndCheck$1(Analyzer.scala:214)
    at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$.markInAnalyzer(AnalysisHelper.scala:393)
    at org.apache.spark.sql.catalyst.analysis.Analyzer.executeAndCheck(Analyzer.scala:212)
    at org.apache.spark.sql.execution.QueryExecution.$anonfun$analyzed$1(QueryExecution.scala:92)
    at org.apache.spark.sql.catalyst.QueryPlanningTracker.measurePhase(QueryPlanningTracker.scala:138)
    at org.apache.spark.sql.execution.QueryExecution.$anonfun$executePhase$2(QueryExecution.scala:225)
    at org.apache.spark.sql.execution.QueryExecution$.withInternalError(QueryExecution.scala:599)
    at org.apache.spark.sql.execution.QueryExecution.$anonfun$executePhase$1(QueryExecution.scala:225)
    at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:923)
    at org.apache.spark.sql.execution.QueryExecution.executePhase(QueryExecution.scala:224)
    at org.apache.spark.sql.execution.QueryExecution.analyzed$lzycompute(QueryExecution.scala:92)
    at org.apache.spark.sql.execution.QueryExecution.analyzed(QueryExecution.scala:89)
    at org.apache.spark.sql.execution.QueryExecution.assertAnalyzed(QueryExecution.scala:73)
    at org.apache.spark.sql.Dataset$.$anonfun$ofRows$3(Dataset.scala:118)
    at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:923)
    at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:115)
    at org.apache.spark.sql.SparkSession.$anonfun$sql$1(SparkSession.scala:660)
    at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:923)
    at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:651)
    at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:681)
    at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:77)
    at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at
```
Re: [PR] [SPARK-44838][SQL][FOLLOW-UP] Fix the test for raise_error by using default type for strings [spark]
HyukjinKwon commented on PR #46649: URL: https://github.com/apache/spark/pull/46649#issuecomment-2118743892 cc @cloud-fan and @uros-db
Re: [PR] [SPARK-44838][SQL][FOLLOW-UP] Fix the test for raise_error by using default type for strings [spark]
HyukjinKwon commented on code in PR #46649: URL: https://github.com/apache/spark/pull/46649#discussion_r1605735539

## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/TypeCoercion.scala:
```diff
@@ -969,6 +969,8 @@ object TypeCoercion extends TypeCoercionBase {
     // Note that ret is nullable to avoid typing a lot of Some(...) in this local scope.
     // We wrap immediately an Option after this.
     @Nullable val ret: DataType = (inType, expectedType) match {
+      case (_: StringType, _: StringType) => expectedType.defaultConcreteType
```

Review Comment: This seems to be already working in ANSI.
[PR] [SPARK-44838][SQL][FOLLOW-UP] Fix the test for raise_error by using default type for strings [spark]
HyukjinKwon opened a new pull request, #46649: URL: https://github.com/apache/spark/pull/46649

### What changes were proposed in this pull request?

This PR is a followup of https://github.com/apache/spark/pull/46461 that fixes the CI failure when ANSI is off:

```
[info] - Support RaiseError misc expression with collation *** FAILED *** (21 milliseconds)
[info]   Expected exception org.apache.spark.SparkRuntimeException to be thrown, but org.apache.spark.sql.catalyst.ExtendedAnalysisException was thrown (CollationSQLExpressionsSuite.scala:991)
[info]   org.scalatest.exceptions.TestFailedException:
[info]   at org.scalatest.Assertions.newAssertionFailedException(Assertions.scala:472)
[info]   at org.scalatest.Assertions.newAssertionFailedException$(Assertions.scala:471)
[info]   at org.scalatest.funsuite.AnyFunSuite.newAssertionFailedException(AnyFunSuite.scala:1564)
[info]   at org.scalatest.Assertions.intercept(Assertions.scala:756)
[info]   at org.scalatest.Assertions.intercept$(Assertions.scala:746)
[info]   at org.scalatest.funsuite.AnyFunSuite.intercept(AnyFunSuite.scala:1564)
[info]   at org.apache.spark.sql.CollationSQLExpressionsSuite.$anonfun$new$124(CollationSQLExpressionsSuite.scala:991)
[info]   at org.apache.spark.sql.catalyst.SQLConfHelper.withSQLConf(SQLConfHelper.scala:56)
[info]   at org.apache.spark.sql.catalyst.SQLConfHelper.withSQLConf$(SQLConfHelper.scala:38)
[info]   at org.apache.spark.sql.CollationSQLExpressionsSuite.org$apache$spark$sql$test$SQLTestUtilsBase$$super$withSQLConf(CollationSQLExpressionsSuite.scala:30)
[info]   at org.apache.spark.sql.test.SQLTestUtilsBase.withSQLConf(SQLTestUtils.scala:248)
[info]   at org.apache.spark.sql.test.SQLTestUtilsBase.withSQLConf$(SQLTestUtils.scala:246)
[info]   at org.apache.spark.sql.CollationSQLExpressionsSuite.withSQLConf(CollationSQLExpressionsSuite.scala:30)
[info]   at org.apache.spark.sql.CollationSQLExpressionsSuite.$anonfun$new$123(CollationSQLExpressionsSuite.scala:988)
[info]   at scala.collection.immutable.List.foreach(List.scala:334)
[info]   at org.apache.spark.sql.CollationSQLExpressionsSuite.$anonfun$new$122(CollationSQLExpressionsSuite.scala:987)
```

### Why are the changes needed?

CI is broken: https://github.com/apache/spark/actions/runs/9136253329

### Does this PR introduce _any_ user-facing change?

Yes, it will implicitly cast collated strings.

### How was this patch tested?

Manually ran the test with ANSI disabled.

### Was this patch authored or co-authored using generative AI tooling?

No.
Re: [PR] [SPARK-48220][PYTHON] Allow passing PyArrow Table to createDataFrame() [spark]
HyukjinKwon commented on PR #46529: URL: https://github.com/apache/spark/pull/46529#issuecomment-2118683709 cc @zhengruifeng @ueshin @xinrong-meng FYI
Re: [PR] [SPARK-48220][PYTHON] Allow passing PyArrow Table to createDataFrame() [spark]
HyukjinKwon commented on code in PR #46529: URL: https://github.com/apache/spark/pull/46529#discussion_r1605696794

## python/pyspark/sql/tests/connect/test_parity_arrow.py:
```diff
@@ -31,8 +31,17 @@ def test_createDataFrame_fallback_disabled(self):
     def test_createDataFrame_fallback_enabled(self):
         super().test_createDataFrame_fallback_enabled()

-    def test_createDataFrame_with_map_type(self):
-        self.check_createDataFrame_with_map_type(True)
+    def test_createDataFrame_pandas_with_map_type(self):
+        self.check_createDataFrame_pandas_with_map_type(True)
+
+    def test_createDataFrame_pandas_with_struct_type(self):
+        self.check_createDataFrame_pandas_with_struct_type(True)
+
+    def test_createDataFrame_arrow_with_struct_type(self):
```

Review Comment: We can remove this if that takes no argument. It will be inherited, and run the tests
Re: [PR] [SPARK-48175][SQL][PYTHON] Store collation information in metadata and not in type for SER/DE [spark]
cloud-fan closed pull request #46280: [SPARK-48175][SQL][PYTHON] Store collation information in metadata and not in type for SER/DE URL: https://github.com/apache/spark/pull/46280
Re: [PR] [SPARK-48175][SQL][PYTHON] Store collation information in metadata and not in type for SER/DE [spark]
cloud-fan commented on PR #46280: URL: https://github.com/apache/spark/pull/46280#issuecomment-2118674278 thanks, merging to master!