[PR] [SPARK-48332][BUILD][TESTS] Upgrade `jdbc` related test dependencies [spark]

2024-05-18 Thread via GitHub


panbingkun opened a new pull request, #46653:
URL: https://github.com/apache/spark/pull/46653

   
   
   ### What changes were proposed in this pull request?
   
   
   
   ### Why are the changes needed?
   
   
   
   ### Does this PR introduce _any_ user-facing change?
   
   
   
   ### How was this patch tested?
   
   
   
   ### Was this patch authored or co-authored using generative AI tooling?
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-43829][CONNECT] Improve SparkConnectPlanner by reuse Dataset and avoid construct new Dataset [spark]

2024-05-18 Thread via GitHub


github-actions[bot] closed pull request #43473: [SPARK-43829][CONNECT] Improve 
SparkConnectPlanner by reuse Dataset and avoid construct new Dataset
URL: https://github.com/apache/spark/pull/43473





Re: [PR] [SPARK-46617][SQL] Create-table-if-not-exists should not silently overwrite existing data-files [spark]

2024-05-18 Thread via GitHub


github-actions[bot] commented on PR #44622:
URL: https://github.com/apache/spark/pull/44622#issuecomment-2119040251

   We're closing this PR because it hasn't been updated in a while. This isn't 
a judgement on the merit of the PR in any way. It's just a way of keeping the 
PR queue manageable.
   If you'd like to revive this PR, please reopen it and ask a committer to 
remove the Stale tag!





Re: [PR] [SPARK-46971][SQL] When the `compression` is null, a `NullPointException` should not be thrown [spark]

2024-05-18 Thread via GitHub


github-actions[bot] commented on PR #45015:
URL: https://github.com/apache/spark/pull/45015#issuecomment-2119040244

   We're closing this PR because it hasn't been updated in a while. This isn't 
a judgement on the merit of the PR in any way. It's just a way of keeping the 
PR queue manageable.
   If you'd like to revive this PR, please reopen it and ask a committer to 
remove the Stale tag!





Re: [PR] assorted copy edits to migration instructions [spark]

2024-05-18 Thread via GitHub


github-actions[bot] closed pull request #45048: assorted copy edits to 
migration instructions
URL: https://github.com/apache/spark/pull/45048





Re: [PR] [SPARK-48031] Decompose viewSchemaMode config, add SHOW CREATE TABLE support [spark]

2024-05-18 Thread via GitHub


srielau commented on PR #46652:
URL: https://github.com/apache/spark/pull/46652#issuecomment-2119020457

   @gengliangwang @cloud-fan 





[PR] [SPARK-48031] Decompose viewSchemaMode config, add SHOW CREATE TABLE support [spark]

2024-05-18 Thread via GitHub


srielau opened a new pull request, #46652:
URL: https://github.com/apache/spark/pull/46652

   
   
   ### What changes were proposed in this pull request?
   
   We separate enablement of the WITH SCHEMA ... clause from the change in default 
from SCHEMA BINDING to SCHEMA COMPENSATION.
   This allows users to upgrade in three steps:
   1. Enable the feature and deal with DESCRIBE EXTENDED.
   2. Get their affairs in order by using ALTER VIEW to set SCHEMA BINDING for those 
views they aim to keep in that mode.
   3. Switch the default.
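   Step 2 above can be expressed with the new clause; a hedged sketch (the view name is illustrative, and the exact syntax should be checked against the merged documentation):

   ```sql
   -- Pin a view you want to keep in binding mode before the default flips
   -- from SCHEMA BINDING to SCHEMA COMPENSATION:
   ALTER VIEW my_view WITH SCHEMA BINDING;

   -- DESCRIBE EXTENDED then reports the view's schema mode:
   DESCRIBE EXTENDED my_view;
   ```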
 
   
   
   ### Why are the changes needed?
   
   It allows customers to upgrade more safely.
   
   ### Does this PR introduce _any_ user-facing change?
   
   Yes
   
   ### How was this patch tested?
   
   Added more tests
   
   ### Was this patch authored or co-authored using generative AI tooling?
   
   No





Re: [PR] [SPARK-48220][PYTHON] Allow passing PyArrow Table to createDataFrame() [spark]

2024-05-18 Thread via GitHub


ianmcook commented on code in PR #46529:
URL: https://github.com/apache/spark/pull/46529#discussion_r1605752402


##
python/pyspark/sql/tests/connect/test_parity_arrow.py:
##
@@ -31,8 +31,17 @@ def test_createDataFrame_fallback_disabled(self):
 def test_createDataFrame_fallback_enabled(self):
 super().test_createDataFrame_fallback_enabled()
 
-def test_createDataFrame_with_map_type(self):
-self.check_createDataFrame_with_map_type(True)
+def test_createDataFrame_pandas_with_map_type(self):
+self.check_createDataFrame_pandas_with_map_type(True)
+
+def test_createDataFrame_pandas_with_struct_type(self):
+self.check_createDataFrame_pandas_with_struct_type(True)
+
+def test_createDataFrame_arrow_with_struct_type(self):

Review Comment:
   Ah, right. Done in 90ea328. Thanks.






Re: [PR] [SPARK-47353][SQL] Enable collation support for the Mode expression using GroupMapReduce [V2] [spark]

2024-05-18 Thread via GitHub


GideonPotok commented on PR #46597:
URL: https://github.com/apache/spark/pull/46597#issuecomment-2118994013

What I would really like to try is to move from this implementation to an 
approach that has the collation-support logic in the PartialAggregation stage, 
by moving logic into `Mode.merge` and `Mode.update`. I would use a modified open 
hash map keyed on the collation key, with a separate map from each collation key 
to one of the actual observed values that maps to that key (which 
experimentation has shown could work).
   
   But as this has already been a couple of weeks of development, I believe 
we should, for this PR, confine all the collation logic to the stage that can't 
be serialized and deserialized -- the `eval` stage. I should then try what I 
have described above in a follow-up PR, raised after we have merged the approach 
that has already been tested (i.e. this PR).
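The two-map idea described above — hashing on the collation key, plus a side map from each collation key back to an observed value — can be sketched in plain Python. Here `str.lower` stands in for a real collation key function, an assumption for illustration only:

```python
from collections import defaultdict

def mode_with_collation(values, collation_key):
    """Compute the mode of `values`, treating values with equal collation
    keys as equal. `collation_key` maps a value to its collation key
    (str.lower here is a toy case-insensitive key, not Spark's).
    Returns (representative observed value, count)."""
    counts = defaultdict(int)   # collation key -> occurrence count
    representative = {}         # collation key -> first observed value
    for v in values:
        k = collation_key(v)
        counts[k] += 1
        representative.setdefault(k, v)
    best_key = max(counts, key=counts.get)
    return representative[best_key], counts[best_key]
```

Because only the count map and one representative per key are kept, the result is always a value that actually occurred in the input, which matches the requirement described above.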






Re: [PR] [SPARK-48220][PYTHON] Allow passing PyArrow Table to createDataFrame() [spark]

2024-05-18 Thread via GitHub


ianmcook commented on code in PR #46529:
URL: https://github.com/apache/spark/pull/46529#discussion_r1605752402


##
python/pyspark/sql/tests/connect/test_parity_arrow.py:
##
@@ -31,8 +31,17 @@ def test_createDataFrame_fallback_disabled(self):
 def test_createDataFrame_fallback_enabled(self):
 super().test_createDataFrame_fallback_enabled()
 
-def test_createDataFrame_with_map_type(self):
-self.check_createDataFrame_with_map_type(True)
+def test_createDataFrame_pandas_with_map_type(self):
+self.check_createDataFrame_pandas_with_map_type(True)
+
+def test_createDataFrame_pandas_with_struct_type(self):
+self.check_createDataFrame_pandas_with_struct_type(True)
+
+def test_createDataFrame_arrow_with_struct_type(self):

Review Comment:
   Ah, right. Done. Thanks.






Re: [PR] [SPARK-48220][PYTHON] Allow passing PyArrow Table to createDataFrame() [spark]

2024-05-18 Thread via GitHub


ianmcook commented on code in PR #46529:
URL: https://github.com/apache/spark/pull/46529#discussion_r1605775356


##
python/pyspark/sql/tests/typing/test_session.yml:
##
@@ -51,25 +51,6 @@
 spark.createDataFrame(["foo", "bar"], "string")
 
 
-- case: createDataFrameScalarsInvalid

Review Comment:
   I added tests in 40072d0 to ensure we still cover these error conditions 
somewhere in the tests.






Re: [PR] [SPARK-47353][SQL] Enable collation support for the Mode expression using GroupMapReduce [V2] [spark]

2024-05-18 Thread via GitHub


GideonPotok commented on PR #46597:
URL: https://github.com/apache/spark/pull/46597#issuecomment-2118991669

   > > since Mode expression works with any child expression, and you 
special-cased handling Strings, how do we handle Array(String) and 
Struct(String), etc.?
   > 
   > In my local tests, I found that Mode performs a byte-by-byte comparison 
for structs, which does not consider collation. So that is still outstanding. 
Good catch!
   > 
   > @uros-db There are several strategies we might adopt to handle structs 
with collation fields. I am looking into implementations. It is potentially 
straightforward, though it has some gotchas.
   > 
   > Do you feel I should solve that in a separate PR or in this one? I 
assume you prefer that this gets solved in this PR and not a follow-up PR, right?
   
   @uros-db 
   
   Added an implementation for Mode to support structs with fields using the 
various collations. Performance is not great so far.
 
   ```
   [info] collation unit benchmarks - mode - 30105 elements:    Best Time(ms)   Avg Time(ms)   Stdev(ms)   Rate(M/s)   Per Row(ns)   Relative
   [info] --------------------------------------------------------------------------------------------------------------------------------
   [info] UTF8_BINARY_LCASE - mode - 30105 elements                        31             32           1        9.8         102.3       1.0X
   [info] UNICODE - mode - 30105 elements                                   1              1           0      240.4           4.2      24.6X
   [info] UTF8_BINARY - mode - 30105 elements                               1              1           0      239.1           4.2      24.5X
   [info] UNICODE_CI - mode - 30105 elements                               57             59           2        5.3         189.9       0.5X
   ```
   
   I will add the benchmark results from GHA once I get your feedback.
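Since a byte-by-byte comparison on structs ignores collation, one hedged sketch of the struct support described above is to derive the grouping key recursively, mapping each string field through the collation key. Tuples stand in for struct rows, and `str.lower` for a real collation key, both assumptions for illustration:

```python
def struct_collation_key(value, collation_key):
    """Recursively build a comparison key: strings are mapped through the
    collation key, struct-like values (tuples here) are keyed field by
    field, and all other values are used as-is."""
    if isinstance(value, str):
        return collation_key(value)
    if isinstance(value, tuple):
        return tuple(struct_collation_key(f, collation_key) for f in value)
    return value
```

Two struct values then compare equal for grouping exactly when every string field agrees under the collation, which is the behavior the benchmark above exercises.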






[PR] [SPARK-48330][SS][PYTHON] Fix the python streaming data source timeout issue for large trigger interval [spark]

2024-05-18 Thread via GitHub


chaoqin-li1123 opened a new pull request, #46651:
URL: https://github.com/apache/spark/pull/46651

   
   
   ### What changes were proposed in this pull request?
   Fix the Python streaming data source timeout issue for large trigger intervals.
   For the Python streaming source, keep the long-running worker architecture but 
set the socket timeout to infinity to avoid timeout errors.
   For the Python streaming sink, since the StreamingWrite is also created per 
microbatch on the Scala side, a long-running worker cannot be attached to a 
StreamingWrite instance. Therefore we abandon the long-running worker 
architecture: we simply call commit() or abort(), exit the worker, and let 
Spark reuse workers for us.
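The source-side fix described above — keeping the long-running worker but disabling the socket timeout — corresponds to Python's `settimeout(None)`. A minimal sketch (the function name is illustrative, not Spark's actual API):

```python
import socket

def make_worker_socket(sock):
    # A blocking socket with no timeout waits indefinitely for the next
    # control message, so a large trigger interval between microbatches
    # no longer raises socket.timeout on the long-running worker.
    sock.settimeout(None)  # None means "no timeout": block forever
    return sock
```

With a finite timeout, a trigger interval longer than the timeout would drop the connection between microbatches; `None` removes that ceiling entirely.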
   
   
   ### Why are the changes needed?
   Currently we run a long-running Python worker process for the Python streaming 
source and sink to perform planning, commit, and abort on the driver side. Testing 
indicates that the current implementation causes connection timeout errors when a 
streaming query has a large trigger interval.
   
   
   ### Does this PR introduce _any_ user-facing change?
   No
   
   
   ### How was this patch tested?
   Added an integration test.
   
   
   ### Was this patch authored or co-authored using generative AI tooling?
   
   





Re: [PR] [SPARK-44838][SQL][FOLLOW-UP] Fix the test for raise_error by using default type for strings [spark]

2024-05-18 Thread via GitHub


HyukjinKwon commented on PR #46649:
URL: https://github.com/apache/spark/pull/46649#issuecomment-2118804792

   Sure, that sounds like a more localized fix.








[PR] [SPARK-48329][SQL] Turn on `spark.sql.sources.v2.bucketing.pushPartValues.enabled` by default [spark]

2024-05-18 Thread via GitHub


superdiaodiao opened a new pull request, #46650:
URL: https://github.com/apache/spark/pull/46650

   
   
   
   ### What changes were proposed in this pull request?
   
   The SPJ (Storage-Partitioned Join) feature flags 
`spark.sql.sources.v2.bucketing.enabled` and 
`spark.sql.sources.v2.bucketing.pushPartValues.enabled` are set to `true` by default.
   
   ### Why are the changes needed?
   
   The SPJ feature flag `spark.sql.sources.v2.bucketing.pushPartValues.enabled` 
has proven valuable for most use cases. We should take advantage of the 4.0 
release and change the default to `true`.
   
   ### Does this PR introduce _any_ user-facing change?
   
   No
   
   ### Was this patch authored or co-authored using generative AI tooling?
   
   No
   





Re: [PR] assorted copy edits to migration instructions [spark]

2024-05-18 Thread via GitHub


elharo commented on PR #45048:
URL: https://github.com/apache/spark/pull/45048#issuecomment-2118783045

   Instead of having a bot autoclose PRs, perhaps one should review them? 





Re: [PR] [SPARK-48220][PYTHON] Allow passing PyArrow Table to createDataFrame() [spark]

2024-05-18 Thread via GitHub


ianmcook commented on code in PR #46529:
URL: https://github.com/apache/spark/pull/46529#discussion_r1605752402


##
python/pyspark/sql/tests/connect/test_parity_arrow.py:
##
@@ -31,8 +31,17 @@ def test_createDataFrame_fallback_disabled(self):
 def test_createDataFrame_fallback_enabled(self):
 super().test_createDataFrame_fallback_enabled()
 
-def test_createDataFrame_with_map_type(self):
-self.check_createDataFrame_with_map_type(True)
+def test_createDataFrame_pandas_with_map_type(self):
+self.check_createDataFrame_pandas_with_map_type(True)
+
+def test_createDataFrame_pandas_with_struct_type(self):
+self.check_createDataFrame_pandas_with_struct_type(True)
+
+def test_createDataFrame_arrow_with_struct_type(self):

Review Comment:
   Ah, right. Done in 292d3c8. Thanks.






Re: [PR] [SPARK-44838][SQL][FOLLOW-UP] Fix the test for raise_error by using default type for strings [spark]

2024-05-18 Thread via GitHub


uros-db commented on PR #46649:
URL: https://github.com/apache/spark/pull/46649#issuecomment-2118751764

   @HyukjinKwon I believe you could also use: 
`AbstractMapType(StringTypeAnyCollation, StringTypeAnyCollation)` for 
`inputTypes` in `RaiseError` (misc.scala) instead of `MapType(StringType, 
StringType)`





Re: [PR] [SPARK-44838][SQL][FOLLOW-UP] Fix the test for raise_error by using default type for strings [spark]

2024-05-18 Thread via GitHub


uros-db commented on PR #46649:
URL: https://github.com/apache/spark/pull/46649#issuecomment-2118751278

   It was my understanding that this wouldn't be a problem, since this second 
parameter (MapType) is only used internally in Spark to raise errors.





Re: [PR] [SPARK-44838][SQL][FOLLOW-UP] Fix the test for raise_error by using default type for strings [spark]

2024-05-18 Thread via GitHub


HyukjinKwon commented on PR #46649:
URL: https://github.com/apache/spark/pull/46649#issuecomment-2118744355

   For a bit more context, the test fails as below:
   
   ```
   org.apache.spark.sql.AnalysisException: [DATATYPE_MISMATCH.UNEXPECTED_INPUT_TYPE] Cannot resolve "raise_error(USER_RAISED_EXCEPTION, map(errorMessage, 'aa' collate UTF8_BINARY_LCASE))" due to data type mismatch: The second parameter requires the "MAP" type, however "map(errorMessage, 'aa' collate UTF8_BINARY_LCASE)" has the type "MAP". SQLSTATE: 42K09; line 1 pos 7;
   'Project [unresolvedalias(raise_error(cast(USER_RAISED_EXCEPTION as string collate UTF8_BINARY_LCASE), map(errorMessage, aa), NullType))]
   +- OneRowRelation
   
   at org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.dataTypeMismatch(package.scala:73)
   at org.apache.spark.sql.catalyst.analysis.CheckAnalysis.$anonfun$checkAnalysis0$7(CheckAnalysis.scala:315)
   at org.apache.spark.sql.catalyst.analysis.CheckAnalysis.$anonfun$checkAnalysis0$7$adapted(CheckAnalysis.scala:302)
   at org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:244)
   at org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$foreachUp$1(TreeNode.scala:243)
   at org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$foreachUp$1$adapted(TreeNode.scala:243)
   at scala.collection.immutable.Vector.foreach(Vector.scala:2124)
   at org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:243)
   at org.apache.spark.sql.catalyst.analysis.CheckAnalysis.$anonfun$checkAnalysis0$6(CheckAnalysis.scala:302)
   at org.apache.spark.sql.catalyst.analysis.CheckAnalysis.$anonfun$checkAnalysis0$6$adapted(CheckAnalysis.scala:302)
   at scala.collection.immutable.List.foreach(List.scala:334)
   at org.apache.spark.sql.catalyst.analysis.CheckAnalysis.$anonfun$checkAnalysis0$2(CheckAnalysis.scala:302)
   at org.apache.spark.sql.catalyst.analysis.CheckAnalysis.$anonfun$checkAnalysis0$2$adapted(CheckAnalysis.scala:216)
   at org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:244)
   at org.apache.spark.sql.catalyst.analysis.CheckAnalysis.checkAnalysis0(CheckAnalysis.scala:216)
   at org.apache.spark.sql.catalyst.analysis.CheckAnalysis.checkAnalysis0$(CheckAnalysis.scala:198)
   at org.apache.spark.sql.catalyst.analysis.Analyzer.checkAnalysis0(Analyzer.scala:192)
   at org.apache.spark.sql.catalyst.analysis.CheckAnalysis.checkAnalysis(CheckAnalysis.scala:190)
   at org.apache.spark.sql.catalyst.analysis.CheckAnalysis.checkAnalysis$(CheckAnalysis.scala:161)
   at org.apache.spark.sql.catalyst.analysis.Analyzer.checkAnalysis(Analyzer.scala:192)
   at org.apache.spark.sql.catalyst.analysis.Analyzer.$anonfun$executeAndCheck$1(Analyzer.scala:214)
   at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$.markInAnalyzer(AnalysisHelper.scala:393)
   at org.apache.spark.sql.catalyst.analysis.Analyzer.executeAndCheck(Analyzer.scala:212)
   at org.apache.spark.sql.execution.QueryExecution.$anonfun$analyzed$1(QueryExecution.scala:92)
   at org.apache.spark.sql.catalyst.QueryPlanningTracker.measurePhase(QueryPlanningTracker.scala:138)
   at org.apache.spark.sql.execution.QueryExecution.$anonfun$executePhase$2(QueryExecution.scala:225)
   at org.apache.spark.sql.execution.QueryExecution$.withInternalError(QueryExecution.scala:599)
   at org.apache.spark.sql.execution.QueryExecution.$anonfun$executePhase$1(QueryExecution.scala:225)
   at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:923)
   at org.apache.spark.sql.execution.QueryExecution.executePhase(QueryExecution.scala:224)
   at org.apache.spark.sql.execution.QueryExecution.analyzed$lzycompute(QueryExecution.scala:92)
   at org.apache.spark.sql.execution.QueryExecution.analyzed(QueryExecution.scala:89)
   at org.apache.spark.sql.execution.QueryExecution.assertAnalyzed(QueryExecution.scala:73)
   at org.apache.spark.sql.Dataset$.$anonfun$ofRows$3(Dataset.scala:118)
   at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:923)
   at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:115)
   at org.apache.spark.sql.SparkSession.$anonfun$sql$1(SparkSession.scala:660)
   at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:923)
   at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:651)
   at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:681)
   at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
   at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:77)
   at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
   at 
   ```

Re: [PR] [SPARK-44838][SQL][FOLLOW-UP] Fix the test for raise_error by using default type for strings [spark]

2024-05-18 Thread via GitHub


HyukjinKwon commented on PR #46649:
URL: https://github.com/apache/spark/pull/46649#issuecomment-2118743892

   cc @cloud-fan and @uros-db 





Re: [PR] [SPARK-44838][SQL][FOLLOW-UP] Fix the test for raise_error by using default type for strings [spark]

2024-05-18 Thread via GitHub


HyukjinKwon commented on code in PR #46649:
URL: https://github.com/apache/spark/pull/46649#discussion_r1605735539


##
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/TypeCoercion.scala:
##
@@ -969,6 +969,8 @@ object TypeCoercion extends TypeCoercionBase {
 // Note that ret is nullable to avoid typing a lot of Some(...) in this 
local scope.
 // We wrap immediately an Option after this.
 @Nullable val ret: DataType = (inType, expectedType) match {
+  case (_: StringType, _: StringType) => expectedType.defaultConcreteType

Review Comment:
   This seems to be already working in ANSI.
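The one-line rule in the diff — when both the input and expected types are string types, coerce to the expected type's default concrete type — can be modeled in a small Python sketch. The class and method names here are illustrative stand-ins, not Spark's actual API:

```python
from dataclasses import dataclass

DEFAULT_COLLATION = "UTF8_BINARY"

@dataclass(frozen=True)
class StringType:
    # Toy stand-in for Spark's string type family; the default
    # collation models the plain, uncollated StringType.
    collation: str = DEFAULT_COLLATION

    def default_concrete_type(self):
        # The "default concrete type" of any string type is the
        # default-collation string type.
        return StringType()

def implicit_cast(in_type, expected_type):
    """Sketch of the coercion rule quoted above: any string type
    (possibly collated) coerces to the expected string type's
    default concrete type; other pairs find no cast here."""
    if isinstance(in_type, StringType) and isinstance(expected_type, StringType):
        return expected_type.default_concrete_type()
    return None  # no implicit cast applies in this sketch
```

The effect is that a collated string argument no longer fails resolution against an expression whose `inputTypes` expects a plain string, which is exactly the test failure this PR fixes.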






[PR] [SPARK-44838][SQL][FOLLOW-UP] Fix the test for raise_error by using default type for strings [spark]

2024-05-18 Thread via GitHub


HyukjinKwon opened a new pull request, #46649:
URL: https://github.com/apache/spark/pull/46649

   ### What changes were proposed in this pull request?
   
   This PR is a followup of https://github.com/apache/spark/pull/46461 that 
fixes the CI failure when ANSI is off:
   
   ```
   [info] - Support RaiseError misc expression with collation *** FAILED *** (21 milliseconds)
   [info]   Expected exception org.apache.spark.SparkRuntimeException to be thrown, but org.apache.spark.sql.catalyst.ExtendedAnalysisException was thrown (CollationSQLExpressionsSuite.scala:991)
   [info]   org.scalatest.exceptions.TestFailedException:
   [info]   at org.scalatest.Assertions.newAssertionFailedException(Assertions.scala:472)
   [info]   at org.scalatest.Assertions.newAssertionFailedException$(Assertions.scala:471)
   [info]   at org.scalatest.funsuite.AnyFunSuite.newAssertionFailedException(AnyFunSuite.scala:1564)
   [info]   at org.scalatest.Assertions.intercept(Assertions.scala:756)
   [info]   at org.scalatest.Assertions.intercept$(Assertions.scala:746)
   [info]   at org.scalatest.funsuite.AnyFunSuite.intercept(AnyFunSuite.scala:1564)
   [info]   at org.apache.spark.sql.CollationSQLExpressionsSuite.$anonfun$new$124(CollationSQLExpressionsSuite.scala:991)
   [info]   at org.apache.spark.sql.catalyst.SQLConfHelper.withSQLConf(SQLConfHelper.scala:56)
   [info]   at org.apache.spark.sql.catalyst.SQLConfHelper.withSQLConf$(SQLConfHelper.scala:38)
   [info]   at org.apache.spark.sql.CollationSQLExpressionsSuite.org$apache$spark$sql$test$SQLTestUtilsBase$$super$withSQLConf(CollationSQLExpressionsSuite.scala:30)
   [info]   at org.apache.spark.sql.test.SQLTestUtilsBase.withSQLConf(SQLTestUtils.scala:248)
   [info]   at org.apache.spark.sql.test.SQLTestUtilsBase.withSQLConf$(SQLTestUtils.scala:246)
   [info]   at org.apache.spark.sql.CollationSQLExpressionsSuite.withSQLConf(CollationSQLExpressionsSuite.scala:30)
   [info]   at org.apache.spark.sql.CollationSQLExpressionsSuite.$anonfun$new$123(CollationSQLExpressionsSuite.scala:988)
   [info]   at scala.collection.immutable.List.foreach(List.scala:334)
   [info]   at org.apache.spark.sql.CollationSQLExpressionsSuite.$anonfun$new$122(CollationSQLExpressionsSuite.scala:987)
   
   ### Why are the changes needed?
   
   CI is broken https://github.com/apache/spark/actions/runs/9136253329
   
   ### Does this PR introduce _any_ user-facing change?
   
   Yeah, it will implicitly cast collated strings.
   
   ### How was this patch tested?
   
   Manually ran the test with ANSI disabled.
   
   ### Was this patch authored or co-authored using generative AI tooling?
   
   No.
   





Re: [PR] [SPARK-48220][PYTHON] Allow passing PyArrow Table to createDataFrame() [spark]

2024-05-18 Thread via GitHub


HyukjinKwon commented on PR #46529:
URL: https://github.com/apache/spark/pull/46529#issuecomment-2118683709

   cc @zhengruifeng @ueshin @xinrong-meng FYI





Re: [PR] [SPARK-48220][PYTHON] Allow passing PyArrow Table to createDataFrame() [spark]

2024-05-18 Thread via GitHub


HyukjinKwon commented on code in PR #46529:
URL: https://github.com/apache/spark/pull/46529#discussion_r1605696794


##
python/pyspark/sql/tests/connect/test_parity_arrow.py:
##
@@ -31,8 +31,17 @@ def test_createDataFrame_fallback_disabled(self):
 def test_createDataFrame_fallback_enabled(self):
 super().test_createDataFrame_fallback_enabled()
 
-def test_createDataFrame_with_map_type(self):
-self.check_createDataFrame_with_map_type(True)
+def test_createDataFrame_pandas_with_map_type(self):
+self.check_createDataFrame_pandas_with_map_type(True)
+
+def test_createDataFrame_pandas_with_struct_type(self):
+self.check_createDataFrame_pandas_with_struct_type(True)
+
+def test_createDataFrame_arrow_with_struct_type(self):

Review Comment:
   We can remove this override if it takes no argument: the method will be inherited from the parent suite, and the tests will still run.
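
   The reason the override can be dropped is standard Python test inheritance: a subclass picks up every `test_*` method from its parent unless it redefines one. A minimal sketch (class names here are hypothetical, not the actual PySpark suites):

   ```python
   import unittest

   class BaseSuite(unittest.TestCase):
       # A test defined once in the base class.
       def test_addition(self):
           self.assertEqual(1 + 1, 2)

   class ParitySuite(BaseSuite):
       # No re-declaration needed: test_addition is inherited
       # and discovered on this class as well.
       pass

   # The loader sees the inherited test on the subclass.
   print(unittest.TestLoader().getTestCaseNames(ParitySuite))  # ['test_addition']
   ```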






Re: [PR] [SPARK-48175][SQL][PYTHON] Store collation information in metadata and not in type for SER/DE [spark]

2024-05-18 Thread via GitHub


cloud-fan closed pull request #46280: [SPARK-48175][SQL][PYTHON] Store 
collation information in metadata and not in type for SER/DE
URL: https://github.com/apache/spark/pull/46280





Re: [PR] [SPARK-48175][SQL][PYTHON] Store collation information in metadata and not in type for SER/DE [spark]

2024-05-18 Thread via GitHub


cloud-fan commented on PR #46280:
URL: https://github.com/apache/spark/pull/46280#issuecomment-2118674278

   thanks, merging to master!

