Re: [PR] [SPARK-48317][PYTHON][CONNECT][TESTS] Enable `test_udtf_with_analyze_using_archive` and `test_udtf_with_analyze_using_file` [spark]

2024-05-16 Thread via GitHub


HyukjinKwon closed pull request #46632: [SPARK-48317][PYTHON][CONNECT][TESTS] 
Enable `test_udtf_with_analyze_using_archive` and 
`test_udtf_with_analyze_using_file`
URL: https://github.com/apache/spark/pull/46632





Re: [PR] [SPARK-48317][PYTHON][CONNECT][TESTS] Enable `test_udtf_with_analyze_using_archive` and `test_udtf_with_analyze_using_file` [spark]

2024-05-16 Thread via GitHub


HyukjinKwon commented on PR #46632:
URL: https://github.com/apache/spark/pull/46632#issuecomment-2116752726

   Merged to master.





[PR] [SPARK-48319][PYTHON][CONNECT][TESTS] Test `assert_true` and `raise_error` with the same error class as Spark Classic [spark]

2024-05-16 Thread via GitHub


zhengruifeng opened a new pull request, #46633:
URL: https://github.com/apache/spark/pull/46633

   ### What changes were proposed in this pull request?
   Test `assert_true` and `raise_error` with the same error class as Spark 
Classic
   
   
   ### Why are the changes needed?
   
   https://github.com/apache/spark/commit/578931678f5a6d6b33ebdae4bf866871e46fbb83 made `assert_true` and `raise_error` in Spark Connect throw `SparkRuntimeException`, so the error class is now the same as in Spark Classic.
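   As a rough illustration (not part of this PR), a parity-style check could look like the sketch below; the concrete error class name is an assumption and may differ by version:

   ```python
   from pyspark.errors import SparkRuntimeException
   from pyspark.sql import SparkSession, functions as F

   spark = SparkSession.builder.getOrCreate()  # the same code path applies to a Connect session

   df = spark.range(1)
   try:
       # assert_true fails because id < 100; raise_error behaves analogously
       df.select(F.assert_true(df.id > 100, "id is too small")).collect()
   except SparkRuntimeException as e:
       # Classic and Connect should now surface the same error class here
       # (e.g. USER_RAISED_EXCEPTION -- assumed name, check error-conditions.json)
       print(e.getErrorClass())
   ```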
   
   
   ### Does this PR introduce _any_ user-facing change?
   no, test only
   
   
   ### How was this patch tested?
   ci
   
   
   ### Was this patch authored or co-authored using generative AI tooling?
   no





Re: [PR] [WIP][SPARK-48000][SQL] Enable hash join support for all collations (StringType) [spark]

2024-05-16 Thread via GitHub


uros-db commented on code in PR #46599:
URL: https://github.com/apache/spark/pull/46599#discussion_r1604352238


##
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/RewriteCollationJoin.scala:
##
@@ -0,0 +1,62 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.catalyst.analysis
+
+import org.apache.spark.sql.catalyst.expressions.{AttributeReference, CollationKey, EqualNullSafe, EqualTo, Lower}
+import org.apache.spark.sql.catalyst.plans.logical.{Join, LogicalPlan}
+import org.apache.spark.sql.catalyst.rules.Rule
+import org.apache.spark.sql.catalyst.util.CollationFactory
+import org.apache.spark.sql.types.StringType
+
+object RewriteCollationJoin extends Rule[LogicalPlan] {
+  def apply(plan: LogicalPlan): LogicalPlan = plan transformUpWithNewOutput {
+    case j @ Join(_, _, _, Some(condition), _) =>
+      val newCondition = condition transform {
+        case EqualTo(l: AttributeReference, r: AttributeReference) =>
+          (l.dataType, r.dataType) match {
+            case (st: StringType, _: StringType) =>

Review Comment:
   edit: updated the PR title, description, and corresponding Jira ticket title to reflect this.
   also, created a new ticket to handle complex types (ArrayType, StructType, etc.): https://issues.apache.org/jira/browse/SPARK-48318






Re: [PR] [SPARK-48312][SQL] Improve Alias.removeNonInheritableMetadata performance [spark]

2024-05-16 Thread via GitHub


cloud-fan commented on code in PR #46622:
URL: https://github.com/apache/spark/pull/46622#discussion_r1604351546


##
sql/api/src/main/scala/org/apache/spark/sql/types/Metadata.scala:
##
@@ -49,6 +49,9 @@ sealed class Metadata private[types] (private[types] val map: 
Map[String, Any])
   /** Tests whether this Metadata contains a binding for a key. */
   def contains(key: String): Boolean = map.contains(key)
 
+  /** Tests whether this Metadata is empty. */
+  def isEmpty: Boolean = map.isEmpty

Review Comment:
   +1






Re: [PR] [WIP][SPARK-48000][SQL] Enable hash join support for all collations (StringType) [spark]

2024-05-16 Thread via GitHub


uros-db commented on code in PR #46599:
URL: https://github.com/apache/spark/pull/46599#discussion_r1604346112


##
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/RewriteCollationJoin.scala:
##
@@ -0,0 +1,62 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.catalyst.analysis
+
+import org.apache.spark.sql.catalyst.expressions.{AttributeReference, CollationKey, EqualNullSafe, EqualTo, Lower}
+import org.apache.spark.sql.catalyst.plans.logical.{Join, LogicalPlan}
+import org.apache.spark.sql.catalyst.rules.Rule
+import org.apache.spark.sql.catalyst.util.CollationFactory
+import org.apache.spark.sql.types.StringType
+
+object RewriteCollationJoin extends Rule[LogicalPlan] {
+  def apply(plan: LogicalPlan): LogicalPlan = plan transformUpWithNewOutput {
+    case j @ Join(_, _, _, Some(condition), _) =>
+      val newCondition = condition transform {
+        case EqualTo(l: AttributeReference, r: AttributeReference) =>
+          (l.dataType, r.dataType) match {
+            case (st: StringType, _: StringType) =>

Review Comment:
   will add support for complex types in a future PR; for now, only String should be supported






Re: [PR] [MINOR][PYTHON][TESTS] Call `test_apply_schema_to_dict_and_rows` in `test_apply_schema_to_row` [spark]

2024-05-16 Thread via GitHub


HyukjinKwon closed pull request #46631: [MINOR][PYTHON][TESTS] Call 
`test_apply_schema_to_dict_and_rows` in `test_apply_schema_to_row` 
URL: https://github.com/apache/spark/pull/46631





Re: [PR] [MINOR][PYTHON][TESTS] Call `test_apply_schema_to_dict_and_rows` in `test_apply_schema_to_row` [spark]

2024-05-16 Thread via GitHub


HyukjinKwon commented on PR #46631:
URL: https://github.com/apache/spark/pull/46631#issuecomment-2116583644

   Merged to master.





Re: [PR] [SPARK-41625][PYTHON][CONNECT][TESTS][FOLLOW-UP] Enable `DataFrameObservationParityTests.test_observe_str` [spark]

2024-05-16 Thread via GitHub


zhengruifeng commented on PR #46630:
URL: https://github.com/apache/spark/pull/46630#issuecomment-2116579906

   thanks, merged to master





Re: [PR] [SPARK-41625][PYTHON][CONNECT][TESTS][FOLLOW-UP] Enable `DataFrameObservationParityTests.test_observe_str` [spark]

2024-05-16 Thread via GitHub


zhengruifeng closed pull request #46630: 
[SPARK-41625][PYTHON][CONNECT][TESTS][FOLLOW-UP] Enable 
`DataFrameObservationParityTests.test_observe_str`
URL: https://github.com/apache/spark/pull/46630





Re: [PR] [SPARK-47952][CORE][CONNECT] Support retrieving the real SparkConnectService GRPC address and port programmatically when running on Yarn [spark]

2024-05-16 Thread via GitHub


TakawaAkirayo commented on PR #46182:
URL: https://github.com/apache/spark/pull/46182#issuecomment-2116573625

   Gently ping @grundprinzip if anything else needs to be provided from my side 
:)





Re: [PR] [SPARK-48238][BUILD][YARN] Replace AmIpFilter with re-implemented YarnAMIpFilter [spark]

2024-05-16 Thread via GitHub


pan3793 commented on PR #46611:
URL: https://github.com/apache/spark/pull/46611#issuecomment-2116553658

   > ... supposedly the workaround will be removed when the Yarn side upgrades 
their J2EE?
   
   I suppose not. According to the discussion in https://github.com/apache/spark/pull/31642, `org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter` should be part of `hadoop-client-api` but actually is not, and
   
   > Agree that in the long term we should either: 1) consider to re-implement 
the logic in Spark which allows us to get away from server-side dependency in 
Hadoop ...





Re: [PR] [SPARK-48238][BUILD][YARN] Replace AmIpFilter with re-implemented YarnAMIpFilter [spark]

2024-05-16 Thread via GitHub


HiuKwok commented on PR #46611:
URL: https://github.com/apache/spark/pull/46611#issuecomment-2116550519

   The patch makes sense to me, and supposedly the workaround will be removed 
when the Yarn side upgrades their J2EE?





Re: [PR] [SPARK-48316][PS][CONNECT][TESTS] Fix comments for SparkFrameMethodsParityTests.test_coalesce and test_repartition [spark]

2024-05-16 Thread via GitHub


HyukjinKwon closed pull request #46629: [SPARK-48316][PS][CONNECT][TESTS] Fix 
comments for SparkFrameMethodsParityTests.test_coalesce and test_repartition
URL: https://github.com/apache/spark/pull/46629





Re: [PR] [SPARK-48316][PS][CONNECT][TESTS] Fix comments for SparkFrameMethodsParityTests.test_coalesce and test_repartition [spark]

2024-05-16 Thread via GitHub


HyukjinKwon commented on PR #46629:
URL: https://github.com/apache/spark/pull/46629#issuecomment-2116548415

   Merged to master.





Re: [PR] [SPARK-48258][PYTHON][CONNECT] Checkpoint and localCheckpoint in Spark Connect [spark]

2024-05-16 Thread via GitHub


HyukjinKwon commented on code in PR #46570:
URL: https://github.com/apache/spark/pull/46570#discussion_r1604289643


##
python/pyspark/sql/connect/dataframe.py:
##
@@ -104,6 +107,8 @@
 
 
 class DataFrame(ParentDataFrame):
+    _release_thread_pool: Optional[ThreadPool] = ThreadPool(os.cpu_count() if os.cpu_count() else 8)

Review Comment:
   Oops






Re: [PR] [SPARK-48258][PYTHON][CONNECT] Checkpoint and localCheckpoint in Spark Connect [spark]

2024-05-16 Thread via GitHub


zhengruifeng commented on code in PR #46570:
URL: https://github.com/apache/spark/pull/46570#discussion_r1604267324


##
python/pyspark/sql/connect/dataframe.py:
##
@@ -104,6 +107,8 @@
 
 
 class DataFrame(ParentDataFrame):
+    _release_thread_pool: Optional[ThreadPool] = ThreadPool(os.cpu_count() if os.cpu_count() else 8)

Review Comment:
   where is this used?






[PR] [SPARK-48317][PYTHON][CONNECT][TESTS] Enable `test_udtf_with_analyze_using_archive` and `test_udtf_with_analyze_using_file` [spark]

2024-05-16 Thread via GitHub


HyukjinKwon opened a new pull request, #46632:
URL: https://github.com/apache/spark/pull/46632

   ### What changes were proposed in this pull request?
   
   This PR proposes to enable the tests `test_udtf_with_analyze_using_archive` 
and `test_udtf_with_analyze_using_file`.
   
   ### Why are the changes needed?
   
   To make sure of the test coverage.
   
   ### Does this PR introduce _any_ user-facing change?
   
   No, test-only.
   
   ### How was this patch tested?
   
   CI in this PR.
   
   ### Was this patch authored or co-authored using generative AI tooling?
   
   No.
   





Re: [PR] [SPARK-48213][SQL] Do not push down predicate if non-cheap expression exceed reused limit [spark]

2024-05-16 Thread via GitHub


zml1206 commented on PR #46499:
URL: https://github.com/apache/spark/pull/46499#issuecomment-2116500596

   `with` is a good idea, thank you very much @cloud-fan . Close it.





Re: [PR] [SPARK-48213][SQL] Do not push down predicate if non-cheap expression exceed reused limit [spark]

2024-05-16 Thread via GitHub


zml1206 closed pull request #46499: [SPARK-48213][SQL] Do not push down 
predicate if non-cheap expression exceed reused limit
URL: https://github.com/apache/spark/pull/46499





Re: [PR] [SPARK-48306][SQL] Improve UDT in error message [spark]

2024-05-16 Thread via GitHub


yaooqinn commented on PR #46616:
URL: https://github.com/apache/spark/pull/46616#issuecomment-2116498446

   Merged to master. Thank you @HyukjinKwon 





Re: [PR] [SPARK-48306][SQL] Improve UDT in error message [spark]

2024-05-16 Thread via GitHub


yaooqinn closed pull request #46616: [SPARK-48306][SQL] Improve UDT in error 
message
URL: https://github.com/apache/spark/pull/46616





Re: [PR] [SPARK-48113][CONNECT] Allow Plugins to integrate with Spark Connect [spark]

2024-05-16 Thread via GitHub


nchammas commented on PR #46364:
URL: https://github.com/apache/spark/pull/46364#issuecomment-2116496398

   Looks like the docs I am looking for are in #45340.





[PR] [MINOR][PYTHON][TESTS] Call `test_apply_schema_to_dict_and_rows` in `test_apply_schema_to_row` [spark]

2024-05-16 Thread via GitHub


HyukjinKwon opened a new pull request, #46631:
URL: https://github.com/apache/spark/pull/46631

   ### What changes were proposed in this pull request?
   
   This PR fixes the test `test_apply_schema_to_row` to call 
`test_apply_schema_to_row` instead of `test_apply_schema_to_dict_and_rows`. It 
was a mistake.
   
   ### Why are the changes needed?
   
   To avoid a mistake when it's enabled in the future.
   
   ### Does this PR introduce _any_ user-facing change?
   
   No, test-only
   
   ### How was this patch tested?
   
   CI in this PR.
   
   ### Was this patch authored or co-authored using generative AI tooling?
   
   No.





Re: [PR] [SPARK-48301][SQL][FOLLOWUP] Update the error message [spark]

2024-05-16 Thread via GitHub


zhengruifeng commented on PR #46628:
URL: https://github.com/apache/spark/pull/46628#issuecomment-2116489481

   merged to master





Re: [PR] [SPARK-48301][SQL][FOLLOWUP] Update the error message [spark]

2024-05-16 Thread via GitHub


zhengruifeng closed pull request #46628: [SPARK-48301][SQL][FOLLOWUP] Update 
the error message
URL: https://github.com/apache/spark/pull/46628





[PR] [SPARK-41625][PYTHON][CONNECT][TESTS] Enable `DataFrameObservationParityTests.test_observe_str` [spark]

2024-05-16 Thread via GitHub


HyukjinKwon opened a new pull request, #46630:
URL: https://github.com/apache/spark/pull/46630

   ### What changes were proposed in this pull request?
   
   This PR proposes to enable 
`DataFrameObservationParityTests.test_observe_str`.
   
   ### Why are the changes needed?
   
   To make sure of the test coverage.
   
   ### Does this PR introduce _any_ user-facing change?
   
   No, test-only.
   
   ### How was this patch tested?
   
   CI in this PR.
   
   ### Was this patch authored or co-authored using generative AI tooling?
   
   No.
   





Re: [PR] [SPARK-48316][PS][CONNECT][TESTS] Enable SparkFrameMethodsParityTests.test_coalesce and test_repartition [spark]

2024-05-16 Thread via GitHub


HyukjinKwon commented on PR #46629:
URL: https://github.com/apache/spark/pull/46629#issuecomment-2116480337

   cc @zhengruifeng 





[PR] [SPARK-48316][PS][CONNECT][TESTS] Enable SparkFrameMethodsParityTests.test_coalesce and test_repartition [spark]

2024-05-16 Thread via GitHub


HyukjinKwon opened a new pull request, #46629:
URL: https://github.com/apache/spark/pull/46629

   ### What changes were proposed in this pull request?
   
   This PR proposes to enable `SparkFrameMethodsParityTests.test_coalesce` and 
`SparkFrameMethodsParityTests.test_repartition` in Spark Connect by avoiding 
RDD usage in the test.
   
   ### Why are the changes needed?
   
   To make sure of the test coverage.
   
   ### Does this PR introduce _any_ user-facing change?
   
   No, test-only.
   
   ### How was this patch tested?
   
   CI in this PR.
   
   ### Was this patch authored or co-authored using generative AI tooling?
   
   No.
   





Re: [PR] [SPARK-48251][BUILD] Disable `maven local cache` on GA's step `MIMA test` [spark]

2024-05-16 Thread via GitHub


panbingkun commented on PR #46551:
URL: https://github.com/apache/spark/pull/46551#issuecomment-2116479316

   1. with `maven local cache`
   https://github.com/panbingkun/spark/actions/runs/9109019204
   (screenshot: https://github.com/apache/spark/assets/15246973/439b6397-21bb-427d-a740-481755b54a02)

   2. without `maven local cache`
   https://github.com/panbingkun/spark/actions/runs/9107211572
   (screenshot: https://github.com/apache/spark/assets/15246973/0238e450-5e1d-4405-afec-8fd086760b82)

   https://github.com/panbingkun/spark/actions/runs/9114668535
   (screenshot: https://github.com/apache/spark/assets/15246973/d0f5ae74-06df-44d4-8388-c5adfe02ce2a)
   





Re: [PR] [SPARK-48310][PYTHON][CONNECT] Cached properties must return copies [spark]

2024-05-16 Thread via GitHub


HyukjinKwon closed pull request #46621: [SPARK-48310][PYTHON][CONNECT] Cached 
properties must return copies
URL: https://github.com/apache/spark/pull/46621





Re: [PR] [SPARK-48310][PYTHON][CONNECT] Cached properties must return copies [spark]

2024-05-16 Thread via GitHub


HyukjinKwon commented on PR #46621:
URL: https://github.com/apache/spark/pull/46621#issuecomment-2116465949

   Merged to master.





Re: [PR] [SQL][SPARK-48312] Improve Alias.removeNonInheritableMetadata performance [spark]

2024-05-16 Thread via GitHub


HyukjinKwon commented on code in PR #46622:
URL: https://github.com/apache/spark/pull/46622#discussion_r1604233127


##
sql/api/src/main/scala/org/apache/spark/sql/types/Metadata.scala:
##
@@ -49,6 +49,9 @@ sealed class Metadata private[types] (private[types] val map: 
Map[String, Any])
   /** Tests whether this Metadata contains a binding for a key. */
   def contains(key: String): Boolean = map.contains(key)
 
+  /** Tests whether this Metadata is empty. */
+  def isEmpty: Boolean = map.isEmpty

Review Comment:
   Can you add `@since 4.0.0`






Re: [PR] [SPARK-43815][SQL] Wrap NPE with AnalysisException in CSV options [spark]

2024-05-16 Thread via GitHub


HyukjinKwon commented on code in PR #46626:
URL: https://github.com/apache/spark/pull/46626#discussion_r1604232424


##
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/CSVOptions.scala:
##
@@ -149,7 +149,12 @@ class CSVOptions(
 parameters.getOrElse(DateTimeUtils.TIMEZONE_OPTION, defaultTimeZoneId))
 
   // A language tag in IETF BCP 47 format
-  val locale: Locale = parameters.get(LOCALE).map(Locale.forLanguageTag).getOrElse(Locale.US)
+  val locale: Locale = try {
+    parameters.get(LOCALE).map(Locale.forLanguageTag).getOrElse(Locale.US)
+  } catch {
+    case _: NullPointerException =>

Review Comment:
   and this issue should exist in almost all other options. Would probably be 
best to fix them together






Re: [PR] [SPARK-43815][SQL] Wrap NPE with AnalysisException in CSV options [spark]

2024-05-16 Thread via GitHub


HyukjinKwon commented on code in PR #46626:
URL: https://github.com/apache/spark/pull/46626#discussion_r1604232182


##
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/CSVOptions.scala:
##
@@ -149,7 +149,12 @@ class CSVOptions(
 parameters.getOrElse(DateTimeUtils.TIMEZONE_OPTION, defaultTimeZoneId))
 
   // A language tag in IETF BCP 47 format
-  val locale: Locale = parameters.get(LOCALE).map(Locale.forLanguageTag).getOrElse(Locale.US)
+  val locale: Locale = try {
+    parameters.get(LOCALE).map(Locale.forLanguageTag).getOrElse(Locale.US)
+  } catch {
+    case _: NullPointerException =>

Review Comment:
   Can we check if the value is `null`, and throw an exception instead?






Re: [PR] [SPARK-48303][CORE] Reorganize `LogKeys` [spark]

2024-05-16 Thread via GitHub


panbingkun commented on PR #46612:
URL: https://github.com/apache/spark/pull/46612#issuecomment-2116460422

   cc @gengliangwang 





Re: [PR] [SPARK-48314] Don't double cache files for FileStreamSource using Trigger.AvailableNow [spark]

2024-05-16 Thread via GitHub


Kimahriman commented on PR #46627:
URL: https://github.com/apache/spark/pull/46627#issuecomment-2116427914

   @HeartSaVioR since you added the file caching originally back in the day





[PR] [SPARK-48314] Don't double cache files for FileStreamSource using Trigger.AvailableNow [spark]

2024-05-16 Thread via GitHub


Kimahriman opened a new pull request, #46627:
URL: https://github.com/apache/spark/pull/46627

   
   
   ### What changes were proposed in this pull request?
   
   Files don't need to be cached for reuse in `FileStreamSource` when using 
`Trigger.AvailableNow` because all files are already cached for the lifetime of 
the query in `allFilesForTriggerAvailableNow`.
   
   ### Why are the changes needed?
   
   As reported in https://issues.apache.org/jira/browse/SPARK-44924 (with a PR to address it: https://github.com/apache/spark/pull/45362), the hard-coded cap of 10k cached files can cause problems when `maxFilesPerTrigger` is greater than 10k: every other batch ends up with only 10k files, which can greatly limit the throughput of a new stream trying to catch up.
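   As a sketch of the affected setup (format, schema, and paths below are illustrative, not taken from this PR):

   ```python
   from pyspark.sql import SparkSession

   spark = SparkSession.builder.getOrCreate()

   query = (
       spark.readStream
       .format("json")
       .schema("id LONG, value STRING")
       .option("maxFilesPerTrigger", 50000)   # larger than the 10k cache cap discussed above
       .load("/data/landing")                 # illustrative input path
       .writeStream
       .format("parquet")
       .option("checkpointLocation", "/chk/landing")
       .trigger(availableNow=True)            # all files are tracked for the lifetime of the query
       .start("/data/output")
   )
   ```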
   
   
   ### Does this PR introduce _any_ user-facing change?
   
   Every other streaming batch won't be 10k files if using Trigger.AvailableNow 
and maxFilesPerTrigger greater than 10k.
   
   ### How was this patch tested?
   
   New UT
   
   ### Was this patch authored or co-authored using generative AI tooling?
   
   No





Re: [PR] [SPARK-48301][SQL] Rename `CREATE_FUNC_WITH_IF_NOT_EXISTS_AND_REPLACE` to `CREATE_ROUTINE_WITH_IF_NOT_EXISTS_AND_REPLACE` [spark]

2024-05-16 Thread via GitHub


zhengruifeng commented on code in PR #46608:
URL: https://github.com/apache/spark/pull/46608#discussion_r1604205064


##
common/utils/src/main/resources/error/error-conditions.json:
##
@@ -2675,9 +2675,9 @@
   "ANALYZE TABLE(S) ... COMPUTE STATISTICS ...  must be either 
NOSCAN or empty."
 ]
   },
-  "CREATE_FUNC_WITH_IF_NOT_EXISTS_AND_REPLACE" : {
+  "CREATE_ROUTINE_WITH_IF_NOT_EXISTS_AND_REPLACE" : {
 "message" : [
-  "CREATE FUNCTION with both IF NOT EXISTS and REPLACE is not allowed."
+  "CREATE PROCEDURE or CREATE FUNCTION with both IF NOT EXISTS and 
REPLACE is not allowed."

Review Comment:
   sounds good, we don't have PROCEDURE in OSS, will update in a followup PR






Re: [PR] [WIP][Spark 44646] Reduce usage of log4j core [spark]

2024-05-16 Thread via GitHub


github-actions[bot] closed pull request #45001: [WIP][Spark 44646] Reduce usage 
of log4j core
URL: https://github.com/apache/spark/pull/45001





Re: [PR] [SPARK-42789][SQL] Rewrite multiple GetJsonObject that consumes same JSON to single JsonTuple [spark]

2024-05-16 Thread via GitHub


github-actions[bot] closed pull request #45020: [SPARK-42789][SQL] Rewrite 
multiple GetJsonObject that consumes same JSON to single JsonTuple
URL: https://github.com/apache/spark/pull/45020





Re: [PR] [SPARK-48268][CORE] Add a configuration for SparkContext.setCheckpointDir [spark]

2024-05-16 Thread via GitHub


HyukjinKwon closed pull request #46571: [SPARK-48268][CORE] Add a configuration 
for SparkContext.setCheckpointDir
URL: https://github.com/apache/spark/pull/46571





Re: [PR] [SPARK-48268][CORE] Add a configuration for SparkContext.setCheckpointDir [spark]

2024-05-16 Thread via GitHub


HyukjinKwon commented on PR #46571:
URL: https://github.com/apache/spark/pull/46571#issuecomment-2116375998

   Merged to master.





Re: [PR] [SPARK-48031][SQL][FOLLOW-UP] Use ANSI-enabled cast in view lookup test [spark]

2024-05-16 Thread via GitHub


HyukjinKwon closed pull request #46614: [SPARK-48031][SQL][FOLLOW-UP] Use 
ANSI-enabled cast in view lookup test
URL: https://github.com/apache/spark/pull/46614





Re: [PR] [SPARK-48031][SQL][FOLLOW-UP] Use ANSI-enabled cast in view lookup test [spark]

2024-05-16 Thread via GitHub


HyukjinKwon commented on PR #46614:
URL: https://github.com/apache/spark/pull/46614#issuecomment-2116375642

   Merged to master.





Re: [PR] Bump rexml from 3.2.6 to 3.2.8 in /docs [spark]

2024-05-16 Thread via GitHub


dependabot[bot] commented on PR #46625:
URL: https://github.com/apache/spark/pull/46625#issuecomment-2116374205

   OK, I won't notify you again about this release, but will get in touch when 
a new version is available. If you'd rather skip all updates until the next 
major or minor version, let me know by commenting `@dependabot ignore this 
major version` or `@dependabot ignore this minor version`.
   
   If you change your mind, just re-open this PR and I'll resolve any conflicts 
on it.





Re: [PR] Bump rexml from 3.2.6 to 3.2.8 in /docs [spark]

2024-05-16 Thread via GitHub


HyukjinKwon closed pull request #46625: Bump rexml from 3.2.6 to 3.2.8 in /docs
URL: https://github.com/apache/spark/pull/46625





[PR] [SPARK-43815] Wrap NPE with AnalysisException in csv options [spark]

2024-05-16 Thread via GitHub


michaelzhan-db opened a new pull request, #46626:
URL: https://github.com/apache/spark/pull/46626

   
   
   ### What changes were proposed in this pull request?
   
   When the user sets `locale` to `null`, an NPE is raised. This change replaces the NPE with a more understandable, user-facing error message.
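   A rough sketch of the reported scenario (how a null value reaches the option map depends on the caller; passing None through the Python API is just one illustrative route, and the exception type follows the PR title):

   ```python
   from pyspark.errors import AnalysisException
   from pyspark.sql import SparkSession

   spark = SparkSession.builder.getOrCreate()

   try:
       # Before this change a bare NullPointerException from Locale.forLanguageTag could surface;
       # afterwards a descriptive, user-facing error is expected instead.
       spark.read.option("locale", None).option("header", True).csv("/tmp/example.csv").show()
   except AnalysisException as e:
       print(e)
   ```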
   
   ### Why are the changes needed?
   
   Improves usability of csv options.
   
   ### Does this PR introduce _any_ user-facing change?
   
   No.
   
   ### How was this patch tested?
   
   Added tests pass.
   
   ### Was this patch authored or co-authored using generative AI tooling?
   
   No
   





Re: [PR] [SPARK-48294][SQL] Handle lowercase in nestedTypeMissingElementTypeError [spark]

2024-05-16 Thread via GitHub


gengliangwang commented on PR #46623:
URL: https://github.com/apache/spark/pull/46623#issuecomment-2116274470

   @michaelzhan-db there are merge conflicts against branch-3.5. Please create 
a new PR for the backport.





Re: [PR] [SPARK-48294][SQL] Handle lowercase in nestedTypeMissingElementTypeError [spark]

2024-05-16 Thread via GitHub


gengliangwang closed pull request #46623: [SPARK-48294][SQL] Handle lowercase 
in nestedTypeMissingElementTypeError
URL: https://github.com/apache/spark/pull/46623





Re: [PR] [SPARK-48294][SQL] Handle lowercase in nestedTypeMissingElementTypeError [spark]

2024-05-16 Thread via GitHub


gengliangwang commented on PR #46623:
URL: https://github.com/apache/spark/pull/46623#issuecomment-2116272913

   Thanks, merging to master and branch 3.5





[PR] Bump rexml from 3.2.6 to 3.2.8 in /docs [spark]

2024-05-16 Thread via GitHub


dependabot[bot] opened a new pull request, #46625:
URL: https://github.com/apache/spark/pull/46625

   Bumps [rexml](https://github.com/ruby/rexml) from 3.2.6 to 3.2.8.

   Release notes (sourced from https://github.com/ruby/rexml/releases; the changelog at https://github.com/ruby/rexml/blob/master/NEWS.md repeats the same entries):

   REXML 3.2.8 - 2024-05-16
   Fixes
   * Suppressed a warning

   REXML 3.2.7 - 2024-05-16
   Improvements
   * Improve parse performance by using StringScanner.
     (GH-106, GH-107, GH-108, GH-109, GH-112, GH-113, GH-114, GH-115, GH-116, GH-117, GH-118, GH-119, GH-121; patch by NAITOH Jun)
   * Improved parse performance when an attribute has many […]s. (GH-124)
   Fixes
   * XPath: Fixed a bug of normalize_space(array). (GH-110, GH-111; patch by flatisland)
   * XPath: Fixed a bug that the wrong position is used with a nested path. (GH-110, GH-122; reported by jcavalieri, patch by NAITOH Jun)
   * Fixed a bug that an exception message can't be generated for invalid encoding XML.
   ... (truncated)

   Commits
   * 1cf37ba Add 3.2.8 entry (https://github.com/ruby/rexml/commit/1cf37bab79d61d6183bbda8bf525ed587012b718)
   * b67081c Remove an unused variable (#128) (https://github.com/ruby/rexml/commit/b67081caa807fad48d31983137b7ed8711e7f0df)
   * 94e180e Suppress a warning (https://github.com/ruby/rexml/commit/94e180e939baff8f7e328a287bb96ebbd99db6eb)
   * d574ba5 ci: install only gems required for running tests (#129) (https://github.com/ruby/rexml/commit/d574ba5fe1c40adbafbf16e47533f4eb32b43e60)
   * 4670f8f Add missing Thanks section (https://github.com/ruby/rexml/commit/4670f8fc187c89d0504d027ea997959287143453)

Re: [PR] [SPARK-47920][DOCS][SS][PYTHON] Add doc for python streaming data source API [spark]

2024-05-16 Thread via GitHub


chaoqin-li1123 commented on code in PR #46139:
URL: https://github.com/apache/spark/pull/46139#discussion_r1603979413


##
python/docs/source/user_guide/sql/python_data_source.rst:
##
@@ -59,8 +59,17 @@ Start by creating a new subclass of :class:`DataSource`. Define the source name,
     def reader(self, schema: StructType):
         return FakeDataSourceReader(schema, self.options)
 
+    def streamReader(self, schema: StructType):

Review Comment:
   I prefer not to duplicate the DataSource code. We already document that developers only need to implement the corresponding method for a certain capability.



##
python/docs/source/user_guide/sql/python_data_source.rst:
##
@@ -33,9 +33,15 @@ To create a custom Python data source, you'll need to subclass the :class:`DataS
 
 This example demonstrates creating a simple data source to generate synthetic data using the `faker` library. Ensure the `faker` library is installed and accessible in your Python environment.
 
-**Step 1: Define the Data Source**
+**Define the Data Source**
 
-Start by creating a new subclass of :class:`DataSource`. Define the source name, schema, and reader logic as follows:
+Start by creating a new subclass of :class:`DataSource` with the source name and schema.
+
+In order to read from the data source in a batch query, the reader() method needs to be defined.
+
+In order to read from the data source in a streaming query, the streamReader() or simpleStreamReader() method needs to be defined.
+
+In order to write to the data source in a streaming query, the streamWriter() method needs to be defined.

Review Comment:
   Table added.



##
python/docs/source/user_guide/sql/python_data_source.rst:
##
@@ -84,9 +101,158 @@ Define the reader logic to generate synthetic data. Use the `faker` library to p
             row.append(value)
         yield tuple(row)
 
+Implementing Streaming Reader and Writer for Python Data Source
+---------------------------------------------------------------
+**Implement the Stream Reader**
+
+This is a dummy streaming data reader that generates 2 rows in every microbatch. The streamReader instance has an integer offset that increases by 2 in every microbatch.
+
+.. code-block:: python
+
+    class RangePartition(InputPartition):
+        def __init__(self, start, end):
+            self.start = start
+            self.end = end
+
+    class FakeStreamReader(DataSourceStreamReader):
+        def __init__(self, schema, options):
+            self.current = 0
+
+        def initialOffset(self) -> dict:
+            """
+            Return the initial start offset of the reader.
+            """
+            return {"offset": 0}
+
+        def latestOffset(self) -> dict:
+            """
+            Return the current latest offset that the next microbatch will read to.
+            """
+            self.current += 2
+            return {"offset": self.current}
+
+        def partitions(self, start: dict, end: dict):
+            """
+            Plans the partitioning of the current microbatch defined by start and end offset,
+            it needs to return a sequence of :class:`InputPartition` object.
+            """
+            return [RangePartition(start["offset"], end["offset"])]
+
+        def commit(self, end: dict):
+            """
+            This is invoked when the query has finished processing data before end offset, this can be used to clean up resource.
+            """
+            pass
+
+        def read(self, partition) -> Iterator[Tuple]:
+            """
+            Takes a partition as an input and read an iterator of tuples from the data source.
+            """
+            start, end = partition.start, partition.end
+            for i in range(start, end):
+                yield (i, str(i))
+
+**Implement the Simple Stream Reader**
+
+If the data source has low throughput and doesn't require partitioning, you 
can implement SimpleDataSourceStreamReader instead of DataSourceStreamReader.
+
+One of simpleStreamReader() and streamReader() must be implemented for 
readable streaming data source. And simpleStreamReader() will only be invoked 
when streamReader() is not implemented.
+
+This is the same dummy streaming reader that generate 2 rows every batch 
implemented with SimpleDataSourceStreamReader interface.
+
+.. code-block:: python
+
+class SimpleStreamReader(SimpleDataSourceStreamReader):
+def initialOffset(self):
+"""
+Return the initial start offset of the reader.
+"""
+return {"offset": 0}
+
+def read(self, start: dict) -> (Iterator[Tuple], dict):
+"""
+Takes start offset as an input, return an iterator of tuples and 
the start offset of next read.
+"""
+start_idx = start["offset"]
+it = iter([(i,) for i in range(start_idx, start_idx + 2)])
+return (it, {"offset": 
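
   For completeness, here is a minimal sketch of how a source like the one discussed
   above could be registered and consumed from a streaming query. It assumes the
   Python data source API described in this thread; the `FakeDataSource` wrapper class
   and the "fake" short name are illustrative additions, not part of the patch itself.

```python
from pyspark.sql import SparkSession
from pyspark.sql.datasource import DataSource
from pyspark.sql.types import StructType

class FakeDataSource(DataSource):
    """Hypothetical data source that plugs in the FakeStreamReader shown above."""

    @classmethod
    def name(cls):
        return "fake"

    def schema(self):
        return "id INT, value STRING"

    def streamReader(self, schema: StructType):
        # FakeStreamReader is the reader class defined earlier in this thread.
        return FakeStreamReader(schema, self.options)

spark = SparkSession.builder.getOrCreate()
spark.dataSource.register(FakeDataSource)

query = (spark.readStream.format("fake").load()
         .writeStream.format("console")
         .start())
query.awaitTermination(timeout=10)
```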

Re: [PR] [SPARK-48268][CORE] Add a configuration for SparkContext.setCheckpointDir [spark]

2024-05-16 Thread via GitHub


mridulm commented on code in PR #46571:
URL: https://github.com/apache/spark/pull/46571#discussion_r1603972872


##
core/src/main/scala/org/apache/spark/internal/config/package.scala:
##
@@ -1317,6 +1317,16 @@ package object config {
   s" be less than or equal to 
${ByteArrayMethods.MAX_ROUNDED_ARRAY_LENGTH}.")
   .createWithDefault(64 * 1024 * 1024)
 
+  private[spark] val CHECKPOINT_DIR =
+ConfigBuilder("spark.checkpoint.dir")
+  .doc(
+"Equivalent with SparkContext.setCheckpointDir. If set, the path 
becomes" +
+  "the default directory for checkpointing. It can be overwritten by" +
+  "SparkContext.setCheckpointDir.")

Review Comment:
   nit:
```suggestion
   "Set the default directory for checkpointing. It can be overwritten 
by " +
 "SparkContext.setCheckpointDir.")
   ```
   
   (I missed this in my earlier pass, my bad)



##
docs/configuration.md:
##
@@ -1795,6 +1795,16 @@ Apart from these, the following properties are also 
available, and may be useful
   
   0.6.0
 
+
+  spark.checkpoint.dir
+  (none)
+  
+Equivalent with SparkContext.setCheckpointDir. If set, the path becomes
+the default directory for checkpointing. It can be overwritten by
+SparkContext.setCheckpointDir.

Review Comment:
   Same as above.
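
   For illustration, a small PySpark sketch of how the proposed setting interacts with
   the existing API. Note that `spark.checkpoint.dir` is the configuration introduced
   by this PR, so the snippet assumes a build that contains the change; the paths are
   made up.

```python
from pyspark.sql import SparkSession

# The configuration provides the default checkpoint directory for the application.
spark = (SparkSession.builder
         .config("spark.checkpoint.dir", "hdfs:///tmp/spark-checkpoints")
         .getOrCreate())

rdd = spark.sparkContext.parallelize(range(100))
rdd.checkpoint()   # uses the directory taken from spark.checkpoint.dir
rdd.count()        # materializes the checkpoint

# An explicit call still overrides the configured default.
spark.sparkContext.setCheckpointDir("hdfs:///tmp/other-checkpoints")
```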






Re: [PR] [SPARK-48238][BUILD][YARN] Replace AmIpFilter with re-implemented YarnAMIpFilter [spark]

2024-05-16 Thread via GitHub


mridulm commented on PR #46611:
URL: https://github.com/apache/spark/pull/46611#issuecomment-2116089785

   +CC @tgravescs 





Re: [PR] Python ds preview [spark]

2024-05-16 Thread via GitHub


chaoqin-li1123 closed pull request #46624: Python ds preview
URL: https://github.com/apache/spark/pull/46624





[PR] Python ds preview [spark]

2024-05-16 Thread via GitHub


chaoqin-li1123 opened a new pull request, #46624:
URL: https://github.com/apache/spark/pull/46624

   
   
   ### What changes were proposed in this pull request?
   
   
   
   ### Why are the changes needed?
   
   
   
   ### Does this PR introduce _any_ user-facing change?
   
   
   
   ### How was this patch tested?
   
   
   
   ### Was this patch authored or co-authored using generative AI tooling?
   
   





Re: [PR] [SPARK-48301][SQL] Rename `CREATE_FUNC_WITH_IF_NOT_EXISTS_AND_REPLACE` to `CREATE_ROUTINE_WITH_IF_NOT_EXISTS_AND_REPLACE` [spark]

2024-05-16 Thread via GitHub


srielau commented on code in PR #46608:
URL: https://github.com/apache/spark/pull/46608#discussion_r1603902920


##
common/utils/src/main/resources/error/error-conditions.json:
##
@@ -2675,9 +2675,9 @@
   "ANALYZE TABLE(S) ... COMPUTE STATISTICS ...  must be either 
NOSCAN or empty."
 ]
   },
-  "CREATE_FUNC_WITH_IF_NOT_EXISTS_AND_REPLACE" : {
+  "CREATE_ROUTINE_WITH_IF_NOT_EXISTS_AND_REPLACE" : {
 "message" : [
-  "CREATE FUNCTION with both IF NOT EXISTS and REPLACE is not allowed."
+  "CREATE PROCEDURE or CREATE FUNCTION with both IF NOT EXISTS and 
REPLACE is not allowed."

Review Comment:
   That works too






Re: [PR] [SPARK-48291][CORE][FOLLOWUP] Rename Java *LoggerSuite* as *SparkLoggerSuite* [spark]

2024-05-16 Thread via GitHub


gengliangwang closed pull request #46615: [SPARK-48291][CORE][FOLLOWUP] Rename 
Java *LoggerSuite* as *SparkLoggerSuite*
URL: https://github.com/apache/spark/pull/46615





Re: [PR] [SPARK-48291][CORE][FOLLOWUP] Rename Java *LoggerSuite* as *SparkLoggerSuite* [spark]

2024-05-16 Thread via GitHub


gengliangwang commented on PR #46615:
URL: https://github.com/apache/spark/pull/46615#issuecomment-2115976061

   Thanks, merging to master





Re: [PR] [SPARK-47920][DOCS][SS][PYTHON] Add doc for python streaming data source API [spark]

2024-05-16 Thread via GitHub


allisonwang-db commented on code in PR #46139:
URL: https://github.com/apache/spark/pull/46139#discussion_r1602361195


##
python/docs/source/user_guide/sql/python_data_source.rst:
##
@@ -33,9 +33,15 @@ To create a custom Python data source, you'll need to 
subclass the :class:`DataS
 
 This example demonstrates creating a simple data source to generate synthetic 
data using the `faker` library. Ensure the `faker` library is installed and 
accessible in your Python environment.
 
-**Step 1: Define the Data Source**
+**Define the Data Source**
 
-Start by creating a new subclass of :class:`DataSource`. Define the source 
name, schema, and reader logic as follows:
+Start by creating a new subclass of :class:`DataSource` with the source name, 
schema.
+
+In order to read from the data source in a batch query, reader() method need 
to be defined.
+
+In order to read from the data source in a streaming query, streamReader() or 
simpleStreamReader() method need to be defined.
+
+In order to write to the data source in a streaming query, streamWriter() 
method need to be defined.

Review Comment:
   Do you think it's more clear to have a markdown table here? streaming/batch 
x read/write.



##
python/docs/source/user_guide/sql/python_data_source.rst:
##
@@ -84,9 +101,158 @@ Define the reader logic to generate synthetic data. Use 
the `faker` library to p
 row.append(value)
 yield tuple(row)
 
+Implementing Streaming Reader and Writer for Python Data Source
+---
+**Implement the Stream Reader**
+
+This is a dummy streaming data reader that generate 2 rows in every 
microbatch. The streamReader instance has a integer offset that increase by 2 
in every microbatch.
+
+.. code-block:: python
+
+class RangePartition(InputPartition):
+def __init__(self, start, end):
+self.start = start
+self.end = end
+
+class FakeStreamReader(DataSourceStreamReader):
+def __init__(self, schema, options):
+self.current = 0
+
+def initialOffset(self) -> dict:
+"""
+Return the initial start offset of the reader.
+"""
+return {"offset": 0}
+
+def latestOffset(self) -> dict:
+"""
+Return the current latest offset that the next microbatch will 
read to.
+"""
+self.current += 2
+return {"offset": self.current}
+
+def partitions(self, start: dict, end: dict):
+"""
+Plans the partitioning of the current microbatch defined by start 
and end offset,
+it needs to return a sequence of :class:`InputPartition` object.
+"""
+return [RangePartition(start["offset"], end["offset"])]
+
+def commit(self, end: dict):
+"""
+This is invoked when the query has finished processing data before 
end offset, this can be used to clean up resource.
+"""
+pass
+
+def read(self, partition) -> Iterator[Tuple]:
+"""
+Takes a partition as an input and read an iterator of tuples from 
the data source.
+"""
+start, end = partition.start, partition.end
+for i in range(start, end):
+yield (i, str(i))
+
+**Implement the Simple Stream Reader**
+
+If the data source has low throughput and doesn't require partitioning, you 
can implement SimpleDataSourceStreamReader instead of DataSourceStreamReader.
+
+One of simpleStreamReader() and streamReader() must be implemented for 
readable streaming data source. And simpleStreamReader() will only be invoked 
when streamReader() is not implemented.
+
+This is the same dummy streaming reader that generate 2 rows every batch 
implemented with SimpleDataSourceStreamReader interface.
+
+.. code-block:: python
+
+class SimpleStreamReader(SimpleDataSourceStreamReader):
+def initialOffset(self):
+"""
+Return the initial start offset of the reader.
+"""
+return {"offset": 0}
+
+def read(self, start: dict) -> (Iterator[Tuple], dict):
+"""
+Takes start offset as an input, return an iterator of tuples and 
the start offset of next read.
+"""
+start_idx = start["offset"]
+it = iter([(i,) for i in range(start_idx, start_idx + 2)])
+return (it, {"offset": start_idx + 2})
+
+def readBetweenOffsets(self, start: dict, end: dict) -> 
Iterator[Tuple]:
+"""
+Takes start and end offset as input and read an iterator of data 
deterministically.
+This is called whe query replay batches during restart or after 
failure.
+"""
+start_idx = start["offset"]
+end_idx = end["offset"]
+return iter([(i,) for i in 

Re: [PR] [SPARK-48307][SQL] InlineCTE should keep not-inlined relations in the original WithCTE node [spark]

2024-05-16 Thread via GitHub


amaliujia commented on code in PR #46617:
URL: https://github.com/apache/spark/pull/46617#discussion_r1603865663


##
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/InlineCTE.scala:
##
@@ -74,34 +70,33 @@ case class InlineCTE(alwaysInline: Boolean = false) extends 
Rule[LogicalPlan] {
*
* @param plan The plan to collect the CTEs from
* @param cteMap A mutable map that accumulates the CTEs and their reference 
information by CTE
-   *   ids. The value of the map is tuple whose elements are:
-   *   - The CTE definition
-   *   - The number of incoming references to the CTE. This 
includes references from
-   * other CTEs and regular places.
-   *   - A mutable inner map that tracks outgoing references 
(counts) to other CTEs.
+   *   ids.
* @param outerCTEId While collecting the map we use this optional CTE id to 
identify the
*   current outer CTE.
*/
-  def buildCTEMap(
+  private def buildCTEMap(
   plan: LogicalPlan,
-  cteMap: mutable.Map[Long, (CTERelationDef, Int, mutable.Map[Long, Int])],
+  cteMap: mutable.Map[Long, CTEReferenceInfo],
   outerCTEId: Option[Long] = None): Unit = {
 plan match {
   case WithCTE(child, cteDefs) =>
 cteDefs.foreach { cteDef =>
-  cteMap(cteDef.id) = (cteDef, 0, 
mutable.Map.empty.withDefaultValue(0))
+  cteMap(cteDef.id) = CTEReferenceInfo(
+cteDef = cteDef,
+refCount = 0,
+outgoingRefs = mutable.Map.empty.withDefaultValue(0),
+shouldInline = true
+  )
 }
 cteDefs.foreach { cteDef =>
   buildCTEMap(cteDef, cteMap, Some(cteDef.id))
 }
 buildCTEMap(child, cteMap, outerCTEId)
 
   case ref: CTERelationRef =>
-val (cteDef, refCount, refMap) = cteMap(ref.cteId)
-cteMap(ref.cteId) = (cteDef, refCount + 1, refMap)
+cteMap(ref.cteId) = cteMap(ref.cteId).withRefCountIncreased(1)
 outerCTEId.foreach { cteId =>
-  val (_, _, outerRefMap) = cteMap(cteId)
-  outerRefMap(ref.cteId) += 1
+  cteMap(cteId).recordOutgoingReference(ref.cteId)

Review Comment:
   I personally prefer not to hide the `+1` implementation detail here, as I had 
trouble understanding it.
   
   Or the function name could be `IncreaseOutgoingReferenceByOne`?






[PR] [SPARK-48294] Handle lowercase in nestedTypeMissingElementTypeError [spark]

2024-05-16 Thread via GitHub


michaelzhan-db opened a new pull request, #46623:
URL: https://github.com/apache/spark/pull/46623

   
   
   ### What changes were proposed in this pull request?
   
   Handle lowercase values inside nestedTypeMissingElementTypeError to prevent 
match errors.
   
   ### Why are the changes needed?
   
   The previous match error was not user-friendly. Now it gives an actionable 
`INCOMPLETE_TYPE_DEFINITION` error.
   
   ### Does this PR introduce _any_ user-facing change?
   
   N/A
   
   ### How was this patch tested?
   
   Newly added tests pass.
   
   ### Was this patch authored or co-authored using generative AI tooling?
   
   No.
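
   As a rough illustration of the statement shape that exercises this code path (the
   table and column names are made up, and catching ParseException here is an
   assumption about where the parser raises the error), a lowercase element-less type
   such as `array` should now surface the INCOMPLETE_TYPE_DEFINITION error instead of
   an internal match error:

```python
from pyspark.sql import SparkSession
from pyspark.errors import ParseException

spark = SparkSession.builder.getOrCreate()

try:
    # "array" without an element type is an incomplete type definition.
    spark.sql("CREATE TABLE demo_tbl (col array) USING parquet")
except ParseException as e:
    # Expected to report INCOMPLETE_TYPE_DEFINITION rather than a MatchError.
    print(e.getErrorClass())
```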
   





Re: [PR] [WIP][SPARK-48281][SQL] Alter string search logic for UTF8_BINARY_LCASE collation (StringInStr, SubstringIndex) [spark]

2024-05-16 Thread via GitHub


mkaravel commented on code in PR #46589:
URL: https://github.com/apache/spark/pull/46589#discussion_r1603752983


##
common/unsafe/src/main/java/org/apache/spark/sql/catalyst/util/CollationAwareUTF8String.java:
##
@@ -278,47 +431,29 @@ public static UTF8String lowercaseSubStringIndex(final 
UTF8String string,
   return UTF8String.EMPTY_UTF8;
 }
 
-UTF8String lowercaseString = string.toLowerCase();
 UTF8String lowercaseDelimiter = delimiter.toLowerCase();
 
 if (count > 0) {
-  int idx = -1;
+  // search left to right (note: the start code point is inclusive)

Review Comment:
   ```suggestion
 // Search left to right (note: the start code point is inclusive).
   ```
   Please use regular sentences as comments.



##
common/unsafe/src/main/java/org/apache/spark/sql/catalyst/util/CollationAwareUTF8String.java:
##
@@ -278,47 +431,29 @@ public static UTF8String lowercaseSubStringIndex(final 
UTF8String string,
   return UTF8String.EMPTY_UTF8;
 }
 
-UTF8String lowercaseString = string.toLowerCase();
 UTF8String lowercaseDelimiter = delimiter.toLowerCase();
 
 if (count > 0) {
-  int idx = -1;
+  // search left to right (note: the start code point is inclusive)
+  int matchLength = -1;
   while (count > 0) {
-idx = lowercaseString.find(lowercaseDelimiter, idx + 1);
-if (idx >= 0) {
-  count--;
-} else {
-  // can not find enough delim
-  return string;
-}
-  }
-  if (idx == 0) {
-return UTF8String.EMPTY_UTF8;
+matchLength = lowercaseFind(string, lowercaseDelimiter, matchLength + 
1);
+if (matchLength > MATCH_NOT_FOUND) count--; // found a delimiter
+else return string; // cannot find enough delimiters in the string

Review Comment:
   ```suggestion
   else return string; // Cannot find enough delimiters in the string.
   ```



##
common/unsafe/src/main/java/org/apache/spark/sql/catalyst/util/CollationAwareUTF8String.java:
##
@@ -278,47 +431,29 @@ public static UTF8String lowercaseSubStringIndex(final 
UTF8String string,
   return UTF8String.EMPTY_UTF8;
 }
 
-UTF8String lowercaseString = string.toLowerCase();
 UTF8String lowercaseDelimiter = delimiter.toLowerCase();
 
 if (count > 0) {
-  int idx = -1;
+  // search left to right (note: the start code point is inclusive)
+  int matchLength = -1;
   while (count > 0) {
-idx = lowercaseString.find(lowercaseDelimiter, idx + 1);
-if (idx >= 0) {
-  count--;
-} else {
-  // can not find enough delim
-  return string;
-}
-  }
-  if (idx == 0) {
-return UTF8String.EMPTY_UTF8;
+matchLength = lowercaseFind(string, lowercaseDelimiter, matchLength + 
1);
+if (matchLength > MATCH_NOT_FOUND) count--; // found a delimiter
+else return string; // cannot find enough delimiters in the string
   }
-  byte[] bytes = new byte[idx];
-  copyMemory(string.getBaseObject(), string.getBaseOffset(), bytes, 
BYTE_ARRAY_OFFSET, idx);
-  return UTF8String.fromBytes(bytes);
-
+  if (matchLength == 0) return UTF8String.EMPTY_UTF8;
+  return string.substring(0, matchLength);
 } else {
-  int idx = string.numBytes() - delimiter.numBytes() + 1;
+  // search right to left (note: the end code point is exclusive)
+  int matchLength = string.numChars() + 1;
   count = -count;
   while (count > 0) {
-idx = lowercaseString.rfind(lowercaseDelimiter, idx - 1);
-if (idx >= 0) {
-  count--;
-} else {
-  // can not find enough delim
-  return string;
-}
+matchLength = lowercaseRFind(string, lowercaseDelimiter, matchLength - 
1);
+if (matchLength > MATCH_NOT_FOUND) count--; // found a delimiter
+else return string; // cannot find enough delimiters in the string
   }
-  if (idx + delimiter.numBytes() == string.numBytes()) {
-return UTF8String.EMPTY_UTF8;
-  }
-  int size = string.numBytes() - delimiter.numBytes() - idx;
-  byte[] bytes = new byte[size];
-  copyMemory(string.getBaseObject(), string.getBaseOffset() + idx + 
delimiter.numBytes(),
-bytes, BYTE_ARRAY_OFFSET, size);
-  return UTF8String.fromBytes(bytes);
+  if (matchLength == string.numChars()) return UTF8String.EMPTY_UTF8;
+  return string.substring(matchLength, string.numChars());

Review Comment:
   ```suggestion
 return string.substring(matchLength, string.numChars());
   ```
   Same here.



##
common/unsafe/src/main/java/org/apache/spark/sql/catalyst/util/CollationAwareUTF8String.java:
##
@@ -278,47 +431,29 @@ public static UTF8String lowercaseSubStringIndex(final 
UTF8String string,
   return UTF8String.EMPTY_UTF8;
 }
 
-UTF8String lowercaseString = string.toLowerCase();
 

Re: [PR] [WIP][SPARK-48221][SQL] Alter string search logic for UTF8_BINARY_LCASE collation (Contains, StartsWith, EndsWith, StringLocate) [spark]

2024-05-16 Thread via GitHub


mkaravel commented on code in PR #46511:
URL: https://github.com/apache/spark/pull/46511#discussion_r1603716669


##
common/unsafe/src/main/java/org/apache/spark/sql/catalyst/util/CollationAwareUTF8String.java:
##
@@ -34,6 +34,143 @@
  * Utility class for collation-aware UTF8String operations.
  */
 public class CollationAwareUTF8String {
+
+  /**
+   * The constant value to indicate that the match is not found
+   * when searching for a pattern string in a target string.
+   */
+  private static final int MATCH_NOT_FOUND = -1;
+
+  /**
+   * Returns whether the target string starts with the specified prefix,
+   * with respect to the UTF8_BINARY_LCASE collation. The method assumes
+   * that the prefix is already lowercased prior to method call to avoid the
+   * overhead of calling .toLowerCase() multiple times on the same prefix 
string.
+   *
+   * @param target the string to be searched in
+   * @param lowercasePattern the string to be searched for
+   * @param startPos the start position for searching (in the target string)

Review Comment:
   Let's specify that this is 0-based.



##
common/unsafe/src/main/java/org/apache/spark/sql/catalyst/util/CollationAwareUTF8String.java:
##
@@ -34,6 +34,143 @@
  * Utility class for collation-aware UTF8String operations.
  */
 public class CollationAwareUTF8String {
+
+  /**
+   * The constant value to indicate that the match is not found
+   * when searching for a pattern string in a target string.
+   */
+  private static final int MATCH_NOT_FOUND = -1;
+
+  /**
+   * Returns whether the target string starts with the specified prefix,
+   * with respect to the UTF8_BINARY_LCASE collation. The method assumes
+   * that the prefix is already lowercased prior to method call to avoid the
+   * overhead of calling .toLowerCase() multiple times on the same prefix 
string.
+   *
+   * @param target the string to be searched in
+   * @param lowercasePattern the string to be searched for
+   * @param startPos the start position for searching (in the target string)
+   * @return whether the target string starts with the specified prefix in 
UTF8_BINARY_LCASE
+   */
+  public static boolean lowercaseMatchFrom(
+  final UTF8String target,
+  final UTF8String lowercasePattern,
+  int startPos) {
+return lowercaseMatchLengthFrom(target, lowercasePattern, startPos) != 
MATCH_NOT_FOUND;
+  }
+
+  /**
+   * Returns the length of the substring of the target string that starts with
+   * the specified prefix, with respect to the UTF8_BINARY_LCASE collation.
+   * The method assumes that the prefix is already lowercased. The method only
+   * considers the part of target string that starts from the specified 
position.
+   *
+   * @param target the string to be searched in
+   * @param lowercasePattern the string to be searched for
+   * @param startPos the end position for searching (in the target string)
+   * @return length of the target substring that ends with the specified 
suffix in lowercase
+   */
+  public static int lowercaseMatchLengthFrom(
+  final UTF8String target,
+  final UTF8String lowercasePattern,
+  int startPos) {
+assert startPos >= 0;
+for (int len = 0; len <= target.numChars() - startPos; ++len) {
+  if (target.substring(startPos, startPos + 
len).toLowerCase().equals(lowercasePattern)) {
+return len;
+  }
+}
+return MATCH_NOT_FOUND;
+  }
+
+  /**
+   * Returns the position of the first occurrence of the pattern string
+   * in the target string from the specified position (0-based index),
+   * with respect to the UTF8_BINARY_LCASE collation. The method assumes
+   * that the pattern string is already lowercased prior to method call.
+   *
+   * @param target the string to be searched in
+   * @param lowercasePattern the string to be searched for
+   * @param startPos the start position for searching (in the target string)
+   * @return the position of the first occurrence of pattern in target, if not 
found, -1 returned.

Review Comment:
   ```suggestion
  * @return the position of the first occurrence of pattern in target, if 
not found, `MATCH_NOT_FOUND` returned.
   ```
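
   For intuition, a plain-Python rendering of the matching semantics these helpers
   implement (shortest prefix starting at a 0-based position whose lowercase form
   equals the already-lowercased pattern). This is only an illustration of the
   behaviour, not the actual Java implementation:

```python
MATCH_NOT_FOUND = -1

def lowercase_match_length_from(target: str, lowercase_pattern: str, start_pos: int) -> int:
    """Length of the shortest substring of target starting at start_pos whose
    lowercase form equals lowercase_pattern, or MATCH_NOT_FOUND if none exists."""
    assert start_pos >= 0
    for length in range(0, len(target) - start_pos + 1):
        if target[start_pos:start_pos + length].lower() == lowercase_pattern:
            return length
    return MATCH_NOT_FOUND

def lowercase_match_from(target: str, lowercase_pattern: str, start_pos: int) -> bool:
    return lowercase_match_length_from(target, lowercase_pattern, start_pos) != MATCH_NOT_FOUND

assert lowercase_match_from("aXbY", "xb", 1)              # "Xb" matches "xb" at position 1
assert lowercase_match_length_from("aXbY", "xb", 1) == 2
assert not lowercase_match_from("aXbY", "xb", 0)          # no match starting at position 0
```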



##
common/unsafe/src/main/java/org/apache/spark/sql/catalyst/util/CollationAwareUTF8String.java:
##
@@ -34,6 +34,143 @@
  * Utility class for collation-aware UTF8String operations.
  */
 public class CollationAwareUTF8String {
+
+  /**
+   * The constant value to indicate that the match is not found
+   * when searching for a pattern string in a target string.
+   */
+  private static final int MATCH_NOT_FOUND = -1;
+
+  /**
+   * Returns whether the target string starts with the specified prefix,
+   * with respect to the UTF8_BINARY_LCASE collation. The method assumes
+   * that the prefix is already lowercased prior to method call to avoid the
+   * overhead of calling .toLowerCase() multiple times on the same prefix 
string.
+   *
+   * @param target the string to 

Re: [PR] [SPARK-48175][SQL][PYTHON] Store collation information in metadata and not in type for SER/DE [spark]

2024-05-16 Thread via GitHub


stefankandic commented on code in PR #46280:
URL: https://github.com/apache/spark/pull/46280#discussion_r1603733508


##
sql/catalyst/src/test/scala/org/apache/spark/sql/types/DataTypeSuite.scala:
##
@@ -712,4 +714,181 @@ class DataTypeSuite extends SparkFunSuite {
 
 assert(result === expected)
   }
+
+  test("schema with collation should not change during ser/de") {
+val simpleStruct = StructType(
+  StructField("c1", StringType(UNICODE_COLLATION_ID)) :: Nil)
+
+val nestedStruct = StructType(
+  StructField("nested", simpleStruct) :: Nil)
+
+val caseInsensitiveNames = StructType(
+  StructField("c1", StringType(UNICODE_COLLATION_ID)) ::
+  StructField("C1", StringType(UNICODE_COLLATION_ID)) :: Nil)
+
+val specialCharsInName = StructType(
+  StructField("c1.*23?", StringType(UNICODE_COLLATION_ID)) :: Nil)
+
+val arrayInSchema = StructType(
+  StructField("arrayField", ArrayType(StringType(UNICODE_COLLATION_ID))) 
:: Nil)
+
+val mapInSchema = StructType(
+  StructField("mapField",
+MapType(StringType(UNICODE_COLLATION_ID), 
StringType(UNICODE_COLLATION_ID))) :: Nil)
+
+val mapWithKeyInNameInSchema = StructType(
+  StructField("name.key", StringType) ::
+  StructField("name",
+MapType(StringType(UNICODE_COLLATION_ID), 
StringType(UNICODE_COLLATION_ID))) :: Nil)
+
+val arrayInMapInNestedSchema = StructType(
+  StructField("arrInMap",
+MapType(StringType(UNICODE_COLLATION_ID),
+ArrayType(StringType(UNICODE_COLLATION_ID :: Nil)
+
+val nestedArrayInMap = StructType(
+  StructField("nestedArrayInMap",
+ArrayType(MapType(StringType(UNICODE_COLLATION_ID),
+  ArrayType(ArrayType(StringType(UNICODE_COLLATION_ID)) :: Nil)
+
+val schemaWithMultipleFields = StructType(
+  simpleStruct.fields ++ nestedStruct.fields ++ arrayInSchema.fields ++ 
mapInSchema.fields ++
+mapWithKeyInNameInSchema ++ arrayInMapInNestedSchema.fields ++ 
nestedArrayInMap.fields)
+
+Seq(
+  simpleStruct, caseInsensitiveNames, specialCharsInName, nestedStruct, 
arrayInSchema,
+  mapInSchema, mapWithKeyInNameInSchema, nestedArrayInMap, 
arrayInMapInNestedSchema,
+  schemaWithMultipleFields)
+  .foreach { schema =>
+val json = schema.json
+val parsed = DataType.fromJson(json)
+assert(parsed === schema)
+  }
+  }
+
+  test("non string field has collation metadata") {

Review Comment:
   `StructTypeSuite` is used just to validate the `toJson` value; deserialization 
is tested in `DataTypeSuite`.






Re: [PR] [SPARK-48175][SQL][PYTHON] Store collation information in metadata and not in type for SER/DE [spark]

2024-05-16 Thread via GitHub


stefankandic commented on code in PR #46280:
URL: https://github.com/apache/spark/pull/46280#discussion_r1603732173


##
sql/api/src/main/scala/org/apache/spark/sql/types/DataType.scala:
##
@@ -208,22 +206,35 @@ object DataType {
   }
 
   // NOTE: Map fields must be sorted in alphabetical order to keep consistent 
with the Python side.
-  private[sql] def parseDataType(json: JValue): DataType = json match {
+  private[sql] def parseDataType(
+  json: JValue,
+  fieldPath: String = "",
+  collationsMap: Map[String, String] = Map.empty): DataType = json match {
 case JString(name) =>
-  nameToType(name)
+  collationsMap.get(fieldPath) match {
+case Some(collation) =>
+  assertValidTypeForCollations(fieldPath, name, collationsMap)
+  stringTypeWithCollation(collation)
+case _ => nameToType(name)
+  }
 
 case JSortedObject(
 ("containsNull", JBool(n)),
 ("elementType", t: JValue),
 ("type", JString("array"))) =>
-  ArrayType(parseDataType(t), n)
+  assertValidTypeForCollations(fieldPath, "array", collationsMap)
+  val elementType = parseDataTypeWithCollation(t, fieldPath + ".element", 
collationsMap)
+  ArrayType(elementType, n)
 
 case JSortedObject(
 ("keyType", k: JValue),
 ("type", JString("map")),
 ("valueContainsNull", JBool(n)),
 ("valueType", v: JValue)) =>
-  MapType(parseDataType(k), parseDataType(v), n)
+  assertValidTypeForCollations(fieldPath, "map", collationsMap)
+  val keyType = parseDataTypeWithCollation(k, fieldPath + ".key", 
collationsMap)
+  val valueType = parseDataTypeWithCollation(v, fieldPath + ".value", 
collationsMap)

Review Comment:
   good catch! After some refactoring earlier this is no longer needed






Re: [PR] [SPARK-48301][SQL] Rename `CREATE_FUNC_WITH_IF_NOT_EXISTS_AND_REPLACE` to `CREATE_ROUTINE_WITH_IF_NOT_EXISTS_AND_REPLACE` [spark]

2024-05-16 Thread via GitHub


allisonwang-db commented on code in PR #46608:
URL: https://github.com/apache/spark/pull/46608#discussion_r1603731266


##
common/utils/src/main/resources/error/error-conditions.json:
##
@@ -2675,9 +2675,9 @@
   "ANALYZE TABLE(S) ... COMPUTE STATISTICS ...  must be either 
NOSCAN or empty."
 ]
   },
-  "CREATE_FUNC_WITH_IF_NOT_EXISTS_AND_REPLACE" : {
+  "CREATE_ROUTINE_WITH_IF_NOT_EXISTS_AND_REPLACE" : {
 "message" : [
-  "CREATE FUNCTION with both IF NOT EXISTS and REPLACE is not allowed."
+  "CREATE PROCEDURE or CREATE FUNCTION with both IF NOT EXISTS and 
REPLACE is not allowed."

Review Comment:
   How about: "Cannot create a routine with both IF NOT EXISTS and REPLACE 
specified."
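
   For reference, a hedged sketch of the statement shape this error condition guards
   against; the function name and implementation class are made up, and catching
   ParseException is an assumption about where the conflict is rejected:

```python
from pyspark.sql import SparkSession
from pyspark.errors import ParseException

spark = SparkSession.builder.getOrCreate()

try:
    # REPLACE and IF NOT EXISTS are mutually exclusive in CREATE FUNCTION.
    spark.sql(
        "CREATE OR REPLACE FUNCTION IF NOT EXISTS my_func AS 'com.example.MyFunc'")
except ParseException as e:
    print(e.getErrorClass())  # e.g. INVALID_SQL_SYNTAX.CREATE_ROUTINE_WITH_IF_NOT_EXISTS_AND_REPLACE
```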






Re: [PR] [SPARK-48175][SQL][PYTHON] Store collation information in metadata and not in type for SER/DE [spark]

2024-05-16 Thread via GitHub


stefankandic commented on code in PR #46280:
URL: https://github.com/apache/spark/pull/46280#discussion_r1603727528


##
common/unsafe/src/main/java/org/apache/spark/sql/catalyst/util/CollationFactory.java:
##
@@ -36,11 +36,45 @@
  * Provides functionality to the UTF8String object which respects defined 
collation settings.
  */
 public final class CollationFactory {
+
+  /**
+   * Identifier for single a collation.
+   */
+  public static class CollationIdentifier {
+public final String provider;
+public final String name;
+public final String version;

Review Comment:
   So you also think we should put Optional here? Even IntelliJ complains when 
it is used in a field.
   
   [screenshot] https://github.com/apache/spark/assets/154237371/ea7bd5c5-43cb-4826-aae5-bf76588b6198
   
   Also, you can read below that it's intended to be used for method return types






Re: [PR] [SQL][SPARK-48312] Improve Alias.removeNonInheritableMetadata performance [spark]

2024-05-16 Thread via GitHub


agubichev commented on PR #46622:
URL: https://github.com/apache/spark/pull/46622#issuecomment-2115745170

   @cloud-fan PTAL





Re: [PR] [SPARK-48175][SQL][PYTHON] Store collation information in metadata and not in type for SER/DE [spark]

2024-05-16 Thread via GitHub


stefankandic commented on code in PR #46280:
URL: https://github.com/apache/spark/pull/46280#discussion_r1603651070


##
sql/api/src/main/scala/org/apache/spark/sql/types/StructField.scala:
##
@@ -63,7 +66,61 @@ case class StructField(
 ("name" -> name) ~
   ("type" -> dataType.jsonValue) ~
   ("nullable" -> nullable) ~
-  ("metadata" -> metadata.jsonValue)
+  ("metadata" -> metadataJson)
+  }
+
+  private def metadataJson: JValue = {
+val metadataJsonValue = metadata.jsonValue
+metadataJsonValue match {
+  case JObject(fields) if collationMetadata.nonEmpty =>
+val collationFields = collationMetadata.map(kv => kv._1 -> 
JString(kv._2)).toList
+JObject(fields :+ (DataType.COLLATIONS_METADATA_KEY -> 
JObject(collationFields)))
+
+  case _ => metadataJsonValue
+}
+  }
+
+  /** Map of field path to collation name. */
+  private lazy val collationMetadata: Map[String, String] = {
+val fieldToCollationMap = mutable.Map[String, String]()
+
+def visitRecursively(dt: DataType, path: String): Unit = dt match {
+  case at: ArrayType =>
+processDataType(at.elementType, path + ".element")
+
+  case mt: MapType =>
+processDataType(mt.keyType, path + ".key")
+processDataType(mt.valueType, path + ".value")
+
+  case st: StringType if isCollatedString(st) =>
+fieldToCollationMap(path) = schemaCollationValue(st)
+
+  case _ =>
+}
+
+def processDataType(dt: DataType, path: String): Unit = {
+  if (isCollatedString(dt)) {
+fieldToCollationMap(path) = schemaCollationValue(dt)
+  } else {
+visitRecursively(dt, path)
+  }
+}
+
+visitRecursively(dataType, name)
+fieldToCollationMap.toMap
+  }
+
+  private def isCollatedString(dt: DataType): Boolean = dt match {
+case st: StringType => !st.isUTF8BinaryCollation
+case _ => false
+  }
+
+  private def schemaCollationValue(dt: DataType): String = dt match {
+case st: StringType =>
+  val collation = CollationFactory.fetchCollation(st.collationId)
+  collation.identifier().toStringWithoutVersion()

Review Comment:
   for writing statistics - version change invalidates them






Re: [PR] [SQL][SPARK-48312] Improve Alias.removeNonInheritableMetadata performance [spark]

2024-05-16 Thread via GitHub


vladimirg-db commented on PR #46622:
URL: https://github.com/apache/spark/pull/46622#issuecomment-2115620509

   @agubichev, hi! Here's the improvement fix for the ultra-wide views





Re: [PR] [SPARK-48159][SQL] Extending support for collated strings on datetime expressions [spark]

2024-05-16 Thread via GitHub


nebojsa-db commented on PR #46618:
URL: https://github.com/apache/spark/pull/46618#issuecomment-2115535153

   @cloud-fan  Please review :) 





[PR] [SPARK-48309][YARN]Stop am retry, in situations where some errors and… [spark]

2024-05-16 Thread via GitHub


guixiaowen opened a new pull request, #46620:
URL: https://github.com/apache/spark/pull/46620

   … retries may not be successful
   
   
   
   ### What changes were proposed in this pull request?
   In yarn cluster mode, spark.yarn.maxAppAttempts will be configured. In our 
production environment, it is configured as 2. If the first execution fails, the AM 
will retry. However, in some scenarios, even a second attempt may fail.
   
   For example:
   
   org.apache.spark.sql.AnalysisException: Table or view not found: test.test_x; line 1 pos 14;
   'Project
   +- 'UnresolvedRelation [bigdata_qa, testx_x], [], false
   
   
   
   Another example:
   Caused by: org.apache.hadoop.hdfs.protocol.NSQuotaExceededException: The 
   NameSpace quota (directories and files) of directory /tmp/xxx_file/ is 
   exceeded: quota=100 file count=101
   
   Would it be more appropriate to try capturing these exceptions and stopping 
retry?
   
   
   ### Why are the changes needed?
   In some scenarios, even attempting a second task may fail.
   
   
   ### Does this PR introduce _any_ user-facing change?
   The user can throw a SparkStopAMRetryException, and the Application Master 
will catch the exception and stop retrying.
   
   For example:
   
   val spark = SparkSession
 .builder()
 .appName("Spark SQL basic example")
 .enableHiveSupport()
 .config("spark.some.config.option", "some-value")
 .getOrCreate()
   
   try {
 spark.sql("select * from test.test_x;").show
   } catch {
 case e:AnalysisException => throw new SparkStopAMRetryException("this 
is a test", e)
   } finally {
 spark.stop()
   }
   

   
   
   ### How was this patch tested?
   
   
   
   ### Was this patch authored or co-authored using generative AI tooling?
   
   





Re: [PR] [SPARK-47424][SQL] Add getDatabaseCalendar method to the JdbcDialect [spark]

2024-05-16 Thread via GitHub


PetarVasiljevic-DB commented on code in PR #45537:
URL: https://github.com/apache/spark/pull/45537#discussion_r1603513988


##
connector/docker-integration-tests/src/test/scala/org/apache/spark/sql/jdbc/v2/OracleIntegrationSuite.scala:
##
@@ -125,6 +126,29 @@ class OracleIntegrationSuite extends 
DockerJDBCIntegrationV2Suite with V2JDBCTes
 
   override def caseConvert(tableName: String): String = 
tableName.toUpperCase(Locale.ROOT)
 
+  override protected val timestampNTZType: String = "TIMESTAMP WITH LOCAL TIME 
ZONE"

Review Comment:
   Yep, thanks






Re: [PR] [SPARK-48308][Core] Unify getting data schema without partition columns in FileSourceStrategy [spark]

2024-05-16 Thread via GitHub


cloud-fan closed pull request #46619: [SPARK-48308][Core] Unify getting data 
schema without partition columns in FileSourceStrategy
URL: https://github.com/apache/spark/pull/46619





Re: [PR] [SPARK-48308][Core] Unify getting data schema without partition columns in FileSourceStrategy [spark]

2024-05-16 Thread via GitHub


cloud-fan commented on PR #46619:
URL: https://github.com/apache/spark/pull/46619#issuecomment-2115430404

   thanks, merging to master!





Re: [PR] [SPARK-47424][SQL] Add getDatabaseCalendar method to the JdbcDialect [spark]

2024-05-16 Thread via GitHub


cloud-fan commented on code in PR #45537:
URL: https://github.com/apache/spark/pull/45537#discussion_r1603428926


##
connector/docker-integration-tests/src/test/scala/org/apache/spark/sql/jdbc/v2/OracleIntegrationSuite.scala:
##
@@ -125,6 +126,29 @@ class OracleIntegrationSuite extends 
DockerJDBCIntegrationV2Suite with V2JDBCTes
 
   override def caseConvert(tableName: String): String = 
tableName.toUpperCase(Locale.ROOT)
 
+  override protected val timestampNTZType: String = "TIMESTAMP WITH LOCAL TIME 
ZONE"

Review Comment:
   isn't it LTZ?






Re: [PR] [SPARK-47424][SQL] Add getDatabaseCalendar method to the JdbcDialect [spark]

2024-05-16 Thread via GitHub


cloud-fan commented on code in PR #45537:
URL: https://github.com/apache/spark/pull/45537#discussion_r1603426755


##
connector/docker-integration-tests/src/test/scala/org/apache/spark/sql/jdbc/v2/DB2IntegrationSuite.scala:
##
@@ -102,4 +105,7 @@ class DB2IntegrationSuite extends 
DockerJDBCIntegrationV2Suite with V2JDBCTest {
   }
 
   override def caseConvert(tableName: String): String = 
tableName.toUpperCase(Locale.ROOT)
+
+  override protected val timestampNTZType: String = "TIMESTAMP WITHOUT TIME 
ZONE"
+  override protected val timestampTZType: String = "TIMESTAMP"

Review Comment:
   do you mean NTZ? TZ usually means WITH TIME ZONE






Re: [PR] [WIP][SPARK-48000][SQL] Enable hash join support for all collations (StringType) [spark]

2024-05-16 Thread via GitHub


cloud-fan commented on code in PR #46599:
URL: https://github.com/apache/spark/pull/46599#discussion_r1603420522


##
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/RewriteCollationJoin.scala:
##
@@ -0,0 +1,62 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.catalyst.analysis
+
+import org.apache.spark.sql.catalyst.expressions.{AttributeReference, 
CollationKey, EqualNullSafe, EqualTo, Lower}
+import org.apache.spark.sql.catalyst.plans.logical.{Join, LogicalPlan}
+import org.apache.spark.sql.catalyst.rules.Rule
+import org.apache.spark.sql.catalyst.util.CollationFactory
+import org.apache.spark.sql.types.StringType
+
+object RewriteCollationJoin extends Rule[LogicalPlan] {
+  def apply(plan: LogicalPlan): LogicalPlan = plan transformUpWithNewOutput {
+case j @ Join(_, _, _, Some(condition), _) =>
+  val newCondition = condition transform {
+case EqualTo(l: AttributeReference, r: AttributeReference) =>
+  (l.dataType, r.dataType) match {
+case (st: StringType, _: StringType) =>

Review Comment:
   how about nested string type such as array of string, struct of string?
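
   For context, a minimal illustration of the flat case this rule targets: an equality
   join on keys with a non-binary collation. The `collate` calls and the
   UTF8_BINARY_LCASE collation name are illustrative; the nested cases raised above
   are not covered by this sketch.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Two relations whose join key uses a case-insensitive (non-binary) collation.
spark.sql(
    "CREATE OR REPLACE TEMP VIEW t1 AS "
    "SELECT collate(c, 'UTF8_BINARY_LCASE') AS k FROM VALUES ('aa'), ('BB') AS v(c)")
spark.sql(
    "CREATE OR REPLACE TEMP VIEW t2 AS "
    "SELECT collate(c, 'UTF8_BINARY_LCASE') AS k FROM VALUES ('AA'), ('bb') AS v(c)")

# Keys that are equal under the collation ('aa' vs 'AA') differ byte-wise, which is
# why a plain byte-based hash join needs the rewrite discussed in this PR.
spark.sql("SELECT * FROM t1 JOIN t2 ON t1.k = t2.k").explain()
```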






Re: [PR] [SPARK-48175][SQL][PYTHON] Store collation information in metadata and not in type for SER/DE [spark]

2024-05-16 Thread via GitHub


cloud-fan commented on code in PR #46280:
URL: https://github.com/apache/spark/pull/46280#discussion_r1603406813


##
sql/catalyst/src/test/scala/org/apache/spark/sql/types/DataTypeSuite.scala:
##
@@ -712,4 +714,181 @@ class DataTypeSuite extends SparkFunSuite {
 
 assert(result === expected)
   }
+
+  test("schema with collation should not change during ser/de") {
+val simpleStruct = StructType(
+  StructField("c1", StringType(UNICODE_COLLATION_ID)) :: Nil)
+
+val nestedStruct = StructType(
+  StructField("nested", simpleStruct) :: Nil)
+
+val caseInsensitiveNames = StructType(
+  StructField("c1", StringType(UNICODE_COLLATION_ID)) ::
+  StructField("C1", StringType(UNICODE_COLLATION_ID)) :: Nil)
+
+val specialCharsInName = StructType(
+  StructField("c1.*23?", StringType(UNICODE_COLLATION_ID)) :: Nil)
+
+val arrayInSchema = StructType(
+  StructField("arrayField", ArrayType(StringType(UNICODE_COLLATION_ID))) 
:: Nil)
+
+val mapInSchema = StructType(
+  StructField("mapField",
+MapType(StringType(UNICODE_COLLATION_ID), 
StringType(UNICODE_COLLATION_ID))) :: Nil)
+
+val mapWithKeyInNameInSchema = StructType(
+  StructField("name.key", StringType) ::
+  StructField("name",
+MapType(StringType(UNICODE_COLLATION_ID), 
StringType(UNICODE_COLLATION_ID))) :: Nil)
+
+val arrayInMapInNestedSchema = StructType(
+  StructField("arrInMap",
+MapType(StringType(UNICODE_COLLATION_ID),
+ArrayType(StringType(UNICODE_COLLATION_ID :: Nil)
+
+val nestedArrayInMap = StructType(
+  StructField("nestedArrayInMap",
+ArrayType(MapType(StringType(UNICODE_COLLATION_ID),
+  ArrayType(ArrayType(StringType(UNICODE_COLLATION_ID)) :: Nil)
+
+val schemaWithMultipleFields = StructType(
+  simpleStruct.fields ++ nestedStruct.fields ++ arrayInSchema.fields ++ 
mapInSchema.fields ++
+mapWithKeyInNameInSchema ++ arrayInMapInNestedSchema.fields ++ 
nestedArrayInMap.fields)
+
+Seq(
+  simpleStruct, caseInsensitiveNames, specialCharsInName, nestedStruct, 
arrayInSchema,
+  mapInSchema, mapWithKeyInNameInSchema, nestedArrayInMap, 
arrayInMapInNestedSchema,
+  schemaWithMultipleFields)
+  .foreach { schema =>
+val json = schema.json
+val parsed = DataType.fromJson(json)
+assert(parsed === schema)
+  }
+  }
+
+  test("non string field has collation metadata") {

Review Comment:
   how do we decide to put test in here or `StructTypeSuite`?






Re: [PR] [SPARK-48175][SQL][PYTHON] Store collation information in metadata and not in type for SER/DE [spark]

2024-05-16 Thread via GitHub


cloud-fan commented on code in PR #46280:
URL: https://github.com/apache/spark/pull/46280#discussion_r1603405105


##
sql/api/src/main/scala/org/apache/spark/sql/types/StructField.scala:
##
@@ -63,7 +66,61 @@ case class StructField(
 ("name" -> name) ~
   ("type" -> dataType.jsonValue) ~
   ("nullable" -> nullable) ~
-  ("metadata" -> metadata.jsonValue)
+  ("metadata" -> metadataJson)
+  }
+
+  private def metadataJson: JValue = {
+val metadataJsonValue = metadata.jsonValue
+metadataJsonValue match {
+  case JObject(fields) if collationMetadata.nonEmpty =>
+val collationFields = collationMetadata.map(kv => kv._1 -> 
JString(kv._2)).toList
+JObject(fields :+ (DataType.COLLATIONS_METADATA_KEY -> 
JObject(collationFields)))
+
+  case _ => metadataJsonValue
+}
+  }
+
+  /** Map of field path to collation name. */
+  private lazy val collationMetadata: Map[String, String] = {
+val fieldToCollationMap = mutable.Map[String, String]()
+
+def visitRecursively(dt: DataType, path: String): Unit = dt match {
+  case at: ArrayType =>
+processDataType(at.elementType, path + ".element")
+
+  case mt: MapType =>
+processDataType(mt.keyType, path + ".key")
+processDataType(mt.valueType, path + ".value")
+
+  case st: StringType if isCollatedString(st) =>
+fieldToCollationMap(path) = schemaCollationValue(st)
+
+  case _ =>
+}
+
+def processDataType(dt: DataType, path: String): Unit = {
+  if (isCollatedString(dt)) {
+fieldToCollationMap(path) = schemaCollationValue(dt)
+  } else {
+visitRecursively(dt, path)
+  }
+}
+
+visitRecursively(dataType, name)
+fieldToCollationMap.toMap
+  }
+
+  private def isCollatedString(dt: DataType): Boolean = dt match {
+case st: StringType => !st.isUTF8BinaryCollation
+case _ => false
+  }
+
+  private def schemaCollationValue(dt: DataType): String = dt match {
+case st: StringType =>
+  val collation = CollationFactory.fetchCollation(st.collationId)
+  collation.identifier().toStringWithoutVersion()

Review Comment:
   when will we use the version?






Re: [PR] [SPARK-48175][SQL][PYTHON] Store collation information in metadata and not in type for SER/DE [spark]

2024-05-16 Thread via GitHub


cloud-fan commented on code in PR #46280:
URL: https://github.com/apache/spark/pull/46280#discussion_r1603404296


##
sql/api/src/main/scala/org/apache/spark/sql/types/StructField.scala:
##
@@ -63,7 +66,61 @@ case class StructField(
 ("name" -> name) ~
   ("type" -> dataType.jsonValue) ~
   ("nullable" -> nullable) ~
-  ("metadata" -> metadata.jsonValue)
+  ("metadata" -> metadataJson)
+  }
+
+  private def metadataJson: JValue = {
+val metadataJsonValue = metadata.jsonValue
+metadataJsonValue match {
+  case JObject(fields) if collationMetadata.nonEmpty =>
+val collationFields = collationMetadata.map(kv => kv._1 -> 
JString(kv._2)).toList
+JObject(fields :+ (DataType.COLLATIONS_METADATA_KEY -> 
JObject(collationFields)))
+
+  case _ => metadataJsonValue
+}
+  }
+
+  /** Map of field path to collation name. */
+  private lazy val collationMetadata: Map[String, String] = {
+val fieldToCollationMap = mutable.Map[String, String]()
+
+def visitRecursively(dt: DataType, path: String): Unit = dt match {
+  case at: ArrayType =>
+processDataType(at.elementType, path + ".element")
+
+  case mt: MapType =>
+processDataType(mt.keyType, path + ".key")
+processDataType(mt.valueType, path + ".value")
+
+  case st: StringType if isCollatedString(st) =>
+fieldToCollationMap(path) = schemaCollationValue(st)
+
+  case _ =>
+}
+
+def processDataType(dt: DataType, path: String): Unit = {
+  if (isCollatedString(dt)) {
+fieldToCollationMap(path) = schemaCollationValue(dt)
+  } else {
+visitRecursively(dt, path)
+  }
+}
+
+visitRecursively(dataType, name)
+fieldToCollationMap.toMap
+  }
+
+  private def isCollatedString(dt: DataType): Boolean = dt match {
+case st: StringType => !st.isUTF8BinaryCollation
+case _ => false
+  }
+
+  private def schemaCollationValue(dt: DataType): String = dt match {
+case st: StringType =>
+  val collation = CollationFactory.fetchCollation(st.collationId)
+  collation.identifier().toStringWithoutVersion()
+case _ =>
+  throw new IllegalStateException(s"Unexpected data type $dt")

Review Comment:
   we should use `SparkException.internalError`






Re: [PR] [SPARK-48175][SQL][PYTHON] Store collation information in metadata and not in type for SER/DE [spark]

2024-05-16 Thread via GitHub


cloud-fan commented on code in PR #46280:
URL: https://github.com/apache/spark/pull/46280#discussion_r1603399289


##
sql/api/src/main/scala/org/apache/spark/sql/types/DataType.scala:
##
@@ -208,22 +206,35 @@ object DataType {
   }
 
   // NOTE: Map fields must be sorted in alphabetical order to keep consistent 
with the Python side.
-  private[sql] def parseDataType(json: JValue): DataType = json match {
+  private[sql] def parseDataType(
+  json: JValue,
+  fieldPath: String = "",
+  collationsMap: Map[String, String] = Map.empty): DataType = json match {
 case JString(name) =>
-  nameToType(name)
+  collationsMap.get(fieldPath) match {
+case Some(collation) =>
+  assertValidTypeForCollations(fieldPath, name, collationsMap)
+  stringTypeWithCollation(collation)
+case _ => nameToType(name)
+  }
 
 case JSortedObject(
 ("containsNull", JBool(n)),
 ("elementType", t: JValue),
 ("type", JString("array"))) =>
-  ArrayType(parseDataType(t), n)
+  assertValidTypeForCollations(fieldPath, "array", collationsMap)
+  val elementType = parseDataTypeWithCollation(t, fieldPath + ".element", 
collationsMap)
+  ArrayType(elementType, n)
 
 case JSortedObject(
 ("keyType", k: JValue),
 ("type", JString("map")),
 ("valueContainsNull", JBool(n)),
 ("valueType", v: JValue)) =>
-  MapType(parseDataType(k), parseDataType(v), n)
+  assertValidTypeForCollations(fieldPath, "map", collationsMap)
+  val keyType = parseDataTypeWithCollation(k, fieldPath + ".key", 
collationsMap)
+  val valueType = parseDataTypeWithCollation(v, fieldPath + ".value", 
collationsMap)

Review Comment:
   why do we need `parseDataTypeWithCollation`? It looks correct to just call 
`parseDataType`.
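   
   For illustration, since `parseDataType` already takes the field path and the collation map in this change, the map branch could presumably call it directly (sketch based on the signatures shown in the diff, not a tested change):
   
   ```scala
   // inside the map branch of parseDataType
   val keyType = parseDataType(k, fieldPath + ".key", collationsMap)
   val valueType = parseDataType(v, fieldPath + ".value", collationsMap)
   MapType(keyType, valueType, n)
   ```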






Re: [PR] [SPARK-48175][SQL][PYTHON] Store collation information in metadata and not in type for SER/DE [spark]

2024-05-16 Thread via GitHub


cloud-fan commented on code in PR #46280:
URL: https://github.com/apache/spark/pull/46280#discussion_r1603393035


##
sql/api/src/main/scala/org/apache/spark/sql/types/DataType.scala:
##
@@ -208,22 +206,35 @@ object DataType {
   }
 
   // NOTE: Map fields must be sorted in alphabetical order to keep consistent 
with the Python side.
-  private[sql] def parseDataType(json: JValue): DataType = json match {
+  private[sql] def parseDataType(
+  json: JValue,
+  fieldPath: String = "",
+  collationsMap: Map[String, String] = Map.empty): DataType = json match {
 case JString(name) =>
-  nameToType(name)
+  collationsMap.get(fieldPath) match {
+case Some(collation) =>
+  assertValidTypeForCollations(fieldPath, name, collationsMap)
+  stringTypeWithCollation(collation)
+case _ => nameToType(name)
+  }
 
 case JSortedObject(
 ("containsNull", JBool(n)),
 ("elementType", t: JValue),
 ("type", JString("array"))) =>
-  ArrayType(parseDataType(t), n)
+  assertValidTypeForCollations(fieldPath, "array", collationsMap)
+  val elementType = parseDataTypeWithCollation(t, fieldPath + ".element", 
collationsMap)
+  ArrayType(elementType, n)
 
 case JSortedObject(
 ("keyType", k: JValue),
 ("type", JString("map")),
 ("valueContainsNull", JBool(n)),
 ("valueType", v: JValue)) =>
-  MapType(parseDataType(k), parseDataType(v), n)
+  assertValidTypeForCollations(fieldPath, "map", collationsMap)

Review Comment:
   shall we do this assert in the StructType branch as well?






Re: [PR] [SPARK-48175][SQL][PYTHON] Store collation information in metadata and not in type for SER/DE [spark]

2024-05-16 Thread via GitHub


cloud-fan commented on code in PR #46280:
URL: https://github.com/apache/spark/pull/46280#discussion_r1603384345


##
common/unsafe/src/main/java/org/apache/spark/sql/catalyst/util/CollationFactory.java:
##
@@ -110,6 +158,8 @@ public Collation(
   // No Collation can simultaneously support binary equality and lowercase 
equality
   assert(!supportsBinaryEquality || !supportsLowercaseEquality);
 
+  assert(SUPPORTED_PROVIDERS.contains(provider) || provider == null);

Review Comment:
   when can `provider` be null?






Re: [PR] [SPARK-48175][SQL][PYTHON] Store collation information in metadata and not in type for SER/DE [spark]

2024-05-16 Thread via GitHub


cloud-fan commented on code in PR #46280:
URL: https://github.com/apache/spark/pull/46280#discussion_r1603382702


##
common/unsafe/src/main/java/org/apache/spark/sql/catalyst/util/CollationFactory.java:
##
@@ -36,11 +36,45 @@
  * Provides functionality to the UTF8String object which respects defined 
collation settings.
  */
 public final class CollationFactory {
+
+  /**
+   * Identifier for single a collation.
+   */
+  public static class CollationIdentifier {
+public final String provider;
+public final String name;
+public final String version;

Review Comment:
   attributes are not optional... It's a bad practice to leave attributes un-initialized, which means null for reference types.






Re: [PR] [WIP][SPARK-48000][SQL] Enable hash join support for all collations (StringType) [spark]

2024-05-16 Thread via GitHub


dbatomic commented on code in PR #46599:
URL: https://github.com/apache/spark/pull/46599#discussion_r1603381135


##
sql/core/src/test/scala/org/apache/spark/sql/CollationSuite.scala:
##
@@ -1051,6 +1052,153 @@ class CollationSuite extends DatasourceV2SQLBase with 
AdaptiveSparkPlanHelper {
 }
   }
 
+  test("CollationKey generates correct collation key for string") {

Review Comment:
   I think that these tests should be in `CollationExpressionSuite`.






Re: [PR] [WIP][SPARK-48000][SQL] Enable hash join support for all collations (StringType) [spark]

2024-05-16 Thread via GitHub


dbatomic commented on code in PR #46599:
URL: https://github.com/apache/spark/pull/46599#discussion_r1603377917


##
sql/core/src/test/scala/org/apache/spark/sql/CollationSuite.scala:
##
@@ -784,19 +785,19 @@ class CollationSuite extends DatasourceV2SQLBase with 
AdaptiveSparkPlanHelper {
   case _: BroadcastHashJoinExec => ()
 }.nonEmpty)
 
-// Even with hint broadcast, hash join is not used for non-binary collated 
strings.
+// Hash join is also used for non-binary collated strings.

Review Comment:
   We should assert that `CollationKey` was injected into the physical plan.






Re: [PR] [WIP][SPARK-48000][SQL] Enable hash join support for all collations (StringType) [spark]

2024-05-16 Thread via GitHub


dbatomic commented on code in PR #46599:
URL: https://github.com/apache/spark/pull/46599#discussion_r1603376505


##
sql/core/src/test/scala/org/apache/spark/sql/CollationSuite.scala:
##
@@ -784,19 +785,19 @@ class CollationSuite extends DatasourceV2SQLBase with 
AdaptiveSparkPlanHelper {
   case _: BroadcastHashJoinExec => ()
 }.nonEmpty)
 
-// Even with hint broadcast, hash join is not used for non-binary collated 
strings.
+// Hash join is also used for non-binary collated strings.

Review Comment:
   I think that you should rename the test. It's currently called `test("hash 
based joins not allowed for non-binary collated strings")` which is no longer 
true.






Re: [PR] [WIP][SPARK-48000][SQL] Enable hash join support for all collations (StringType) [spark]

2024-05-16 Thread via GitHub


dbatomic commented on code in PR #46599:
URL: https://github.com/apache/spark/pull/46599#discussion_r1603374449


##
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/joins.scala:
##
@@ -397,7 +398,11 @@ trait JoinSelectionHelper extends Logging {
 
   protected def hashJoinSupported
   (leftKeys: Seq[Expression], rightKeys: Seq[Expression]): Boolean = {
-val result = leftKeys.concat(rightKeys).forall(e => 
UnsafeRowUtils.isBinaryStable(e.dataType))
+def hasStableCollationKey(e: Expression): Boolean =
+  e.dataType.isInstanceOf[StringType] || 
e.dataType.isInstanceOf[BinaryType]
+def isHashJoinSupported(e: Expression): Boolean = 
UnsafeRowUtils.isBinaryStable(e.dataType) ||

Review Comment:
   Why do you need this `hasStableCollationKey`? `isBinaryStable` will return true for binary, so things should just work, right?
   
   






Re: [PR] [WIP][SPARK-48000][SQL] Enable hash join support for all collations (StringType) [spark]

2024-05-16 Thread via GitHub


dbatomic commented on code in PR #46599:
URL: https://github.com/apache/spark/pull/46599#discussion_r1603367540


##
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/CollationKey.scala:
##
@@ -0,0 +1,59 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.catalyst.expressions
+
+import org.apache.spark.sql.catalyst.expressions.codegen.{CodegenContext, 
ExprCode}
+import org.apache.spark.sql.catalyst.util.CollationFactory
+import org.apache.spark.sql.internal.types.StringTypeAnyCollation
+import org.apache.spark.sql.types._
+import org.apache.spark.unsafe.types.UTF8String
+
+case class CollationKey(expr: Expression) extends UnaryExpression with 
ExpectsInputTypes {
+  override def inputTypes: Seq[AbstractDataType] = Seq(StringTypeAnyCollation)
+  override def dataType: DataType = BinaryType
+
+  final lazy val collationId: Int = expr.dataType match {
+case st: StringType =>
+  st.collationId
+  }
+
+  override def nullSafeEval(input: Any): Any = input match {
+case str: UTF8String =>
+  CollationFactory.getCollationKeyBytes(str, collationId)
+case _ =>
+  None
+  }
+
+  override protected def doGenCode(ctx: CodegenContext, ev: ExprCode): 
ExprCode = {
+val collation = CollationFactory.fetchCollation(collationId)
+if (collation.supportsBinaryEquality) {
+  defineCodeGen(ctx, ev, c => s"$c.getBytes()")
+} else if (collation.supportsLowercaseEquality) {
+  defineCodeGen(ctx, ev, c => s"$c.toLowerCase().getBytes()")
+} else {
+  defineCodeGen(ctx, ev, c => s"CollationFactory.fetchCollation" +

Review Comment:
   You can use `getCollationKeyBytes` here?
   Also, IMO, you can just remove `getCollationKeyBytes` from `CollationFactory` and add the logic into `nullSafeEval`.
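   
   A rough sketch of what mirroring the `doGenCode` branching in `nullSafeEval` could look like (hypothetical and untested; the general-collation branch is left on the existing factory helper here rather than fully inlined):
   
   ```scala
   // sketch only -- same branching as doGenCode above, applied in the interpreted path
   override def nullSafeEval(input: Any): Any = {
     val str = input.asInstanceOf[UTF8String]
     val collation = CollationFactory.fetchCollation(collationId)
     if (collation.supportsBinaryEquality) {
       str.getBytes
     } else if (collation.supportsLowercaseEquality) {
       str.toLowerCase.getBytes
     } else {
       CollationFactory.getCollationKeyBytes(str, collationId)
     }
   }
   ```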






Re: [PR] [SPARK-48213][SQL] Do not push down predicate if non-cheap expression exceed reused limit [spark]

2024-05-16 Thread via GitHub


cloud-fan commented on PR #46499:
URL: https://github.com/apache/spark/pull/46499#issuecomment-2115258124

   I think https://github.com/apache/spark/pull/45802#issuecomment-2101762336 
is a better idea.





Re: [PR] [SPARK-48305][SQL] Add collation support for CurrentLike expressions [spark]

2024-05-16 Thread via GitHub


uros-db commented on PR #46613:
URL: https://github.com/apache/spark/pull/46613#issuecomment-2115241390

   @cloud-fan ready for review





Re: [PR] [SPARK-48301][SQL] Rename `CREATE_FUNC_WITH_IF_NOT_EXISTS_AND_REPLACE` to `CREATE_ROUTINE_WITH_IF_NOT_EXISTS_AND_REPLACE` [spark]

2024-05-16 Thread via GitHub


zhengruifeng commented on PR #46608:
URL: https://github.com/apache/spark/pull/46608#issuecomment-2115184716

   merged to master





Re: [PR] [SPARK-48301][SQL] Rename `CREATE_FUNC_WITH_IF_NOT_EXISTS_AND_REPLACE` to `CREATE_ROUTINE_WITH_IF_NOT_EXISTS_AND_REPLACE` [spark]

2024-05-16 Thread via GitHub


zhengruifeng closed pull request #46608: [SPARK-48301][SQL] Rename 
`CREATE_FUNC_WITH_IF_NOT_EXISTS_AND_REPLACE` to 
`CREATE_ROUTINE_WITH_IF_NOT_EXISTS_AND_REPLACE`
URL: https://github.com/apache/spark/pull/46608





[PR] [SPARK-48308][Core] Unify getting data schema without partition columns in FileSourceStrategy [spark]

2024-05-16 Thread via GitHub


johanl-db opened a new pull request, #46619:
URL: https://github.com/apache/spark/pull/46619

   
   ### What changes were proposed in this pull request?
   Compute the schema of the data without partition columns only once in 
FileSourceStrategy.
   
   ### Why are the changes needed?
   In FileSourceStrategy, the schema of the data excluding partition columns is computed twice, in slightly different ways: once through an AttributeSet (`partitionSet`) and once through the attributes directly (`partitionColumns`). These don't have exactly the same semantics: an AttributeSet compares only expression ids, while comparing the actual attributes also uses the name, type, nullability and metadata. We want to use the former here.
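   
   As a minimal standalone illustration of that difference (not code from this PR):
   
   ```scala
   import org.apache.spark.sql.catalyst.expressions.{AttributeReference, AttributeSet}
   import org.apache.spark.sql.types.StringType

   val a = AttributeReference("p", StringType)()   // nullable = true by default
   val b = a.withNullability(false)                // same exprId, different nullability

   AttributeSet(Seq(a)).contains(b)                // true  -- only expression ids are compared
   Seq(a).contains(b)                              // false -- full attribute equality is used
   ```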
   
   ### Does this PR introduce _any_ user-facing change?
   No
   
   
   ### How was this patch tested?
   Existing tests
   
   ### Was this patch authored or co-authored using generative AI tooling?
   No
   





Re: [PR] [SPARK-48300][SQL] Codegen Support for `from_xml` [spark]

2024-05-16 Thread via GitHub


panbingkun commented on code in PR #46609:
URL: https://github.com/apache/spark/pull/46609#discussion_r1603226466


##
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/xmlExpressions.scala:
##
@@ -104,7 +103,7 @@ case class XmlToStructs(
 if (mode != PermissiveMode && mode != FailFastMode) {
   throw QueryCompilationErrors.parseModeUnsupportedError("from_xml", mode)
 }
-val (parserSchema, actualSchema) = nullableSchema match {

Review Comment:
   The variable `actualSchema` is not used anywhere






Re: [PR] [SPARK-48301][SQL] Rename `CREATE_FUNC_WITH_IF_NOT_EXISTS_AND_REPLACE` to `CREATE_ROUTINE_WITH_IF_NOT_EXISTS_AND_REPLACE` [spark]

2024-05-16 Thread via GitHub


srielau commented on code in PR #46608:
URL: https://github.com/apache/spark/pull/46608#discussion_r1603214384


##
common/utils/src/main/resources/error/error-conditions.json:
##
@@ -2675,9 +2675,9 @@
   "ANALYZE TABLE(S) ... COMPUTE STATISTICS ...  must be either 
NOSCAN or empty."
 ]
   },
-  "CREATE_FUNC_WITH_IF_NOT_EXISTS_AND_REPLACE" : {
+  "CREATE_ROUTINE_WITH_IF_NOT_EXISTS_AND_REPLACE" : {
 "message" : [
-  "CREATE FUNCTION with both IF NOT EXISTS and REPLACE is not allowed."
+  "CREATE ROUTINE with both IF NOT EXISTS and REPLACE is not allowed."

Review Comment:
   ```suggestion
 "CREATE PROCEDURE or CREATE FUNCTION with both IF NOT EXISTS and 
REPLACE is not allowed."
   ```



##
common/utils/src/main/resources/error/error-conditions.json:
##
@@ -2675,9 +2675,9 @@
   "ANALYZE TABLE(S) ... COMPUTE STATISTICS ...  must be either 
NOSCAN or empty."
 ]
   },
-  "CREATE_FUNC_WITH_IF_NOT_EXISTS_AND_REPLACE" : {
+  "CREATE_ROUTINE_WITH_IF_NOT_EXISTS_AND_REPLACE" : {
 "message" : [
-  "CREATE FUNCTION with both IF NOT EXISTS and REPLACE is not allowed."
+  "CREATE ROUTINE with both IF NOT EXISTS and REPLACE is not allowed."

Review Comment:
   routine is a concept, like "relation". It's not a keyword/object






Re: [PR] [SPARK-48159][SQL] Extending support for collated strings on datetime expressions [spark]

2024-05-16 Thread via GitHub


mihailom-db commented on code in PR #46618:
URL: https://github.com/apache/spark/pull/46618#discussion_r1603211716


##
sql/core/src/test/scala/org/apache/spark/sql/CollationSQLExpressionsSuite.scala:
##
@@ -1584,6 +1584,240 @@ class CollationSQLExpressionsSuite
 })
   }
 
+  test("CurrentTimeZone expression with collation") {
+// Supported collations
+Seq("UTF8_BINARY", "UTF8_BINARY_LCASE", "UNICODE", 
"UNICODE_CI").foreach(collationName => {
+  val query =
+s"""
+  |select current_timezone()
+  |""".stripMargin

Review Comment:
   The practice is that if a string can fit on one line we should do it, and this one seems small enough.
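   
   i.e. something like:
   
   ```scala
   // the single-line form being suggested
   val query = "select current_timezone()"
   ```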





