Re: [PR] [WIP][SPARK-48221][SQL] Alter string search logic for UTF8_BINARY_LCASE collation (Contains, StartsWith, EndsWith, StringLocate) [spark]
uros-db commented on code in PR #46511: URL: https://github.com/apache/spark/pull/46511#discussion_r1604437739 ## common/unsafe/src/main/java/org/apache/spark/sql/catalyst/util/CollationAwareUTF8String.java: ## @@ -34,6 +34,143 @@ * Utility class for collation-aware UTF8String operations. */ public class CollationAwareUTF8String { + + /** + * The constant value to indicate that the match is not found + * when searching for a pattern string in a target string. + */ + private static final int MATCH_NOT_FOUND = -1; + + /** + * Returns whether the target string starts with the specified prefix, + * with respect to the UTF8_BINARY_LCASE collation. The method assumes + * that the prefix is already lowercased prior to method call to avoid the + * overhead of calling .toLowerCase() multiple times on the same prefix string. + * + * @param target the string to be searched in + * @param lowercasePattern the string to be searched for + * @param startPos the start position for searching (in the target string) + * @return whether the target string starts with the specified prefix in UTF8_BINARY_LCASE + */ + public static boolean lowercaseMatchFrom( + final UTF8String target, + final UTF8String lowercasePattern, + int startPos) { +return lowercaseMatchLengthFrom(target, lowercasePattern, startPos) != MATCH_NOT_FOUND; + } + + /** + * Returns the length of the substring of the target string that starts with + * the specified prefix, with respect to the UTF8_BINARY_LCASE collation. + * The method assumes that the prefix is already lowercased. The method only + * considers the part of target string that starts from the specified position. + * + * @param target the string to be searched in + * @param lowercasePattern the string to be searched for + * @param startPos the end position for searching (in the target string) + * @return length of the target substring that ends with the specified suffix in lowercase + */ + public static int lowercaseMatchLengthFrom( + final UTF8String target, + final UTF8String lowercasePattern, + int startPos) { +assert startPos >= 0; +for (int len = 0; len <= target.numChars() - startPos; ++len) { + if (target.substring(startPos, startPos + len).toLowerCase().equals(lowercasePattern)) { +return len; + } +} Review Comment: Yes, we'll address perf optimizations separately - so I'll create a ticket for that -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-46841][SQL] Add collation support for ICU locales and collation specifiers [spark]
nikolamand-db commented on PR #46180: URL: https://github.com/apache/spark/pull/46180#issuecomment-2116855540 @mkaravel @dbatomic please review again, thanks. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-48303][CORE] Reorganize `LogKeys` [spark]
gengliangwang closed pull request #46612: [SPARK-48303][CORE] Reorganize `LogKeys` URL: https://github.com/apache/spark/pull/46612 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-48303][CORE] Reorganize `LogKeys` [spark]
gengliangwang commented on PR #46612: URL: https://github.com/apache/spark/pull/46612#issuecomment-2116827152 Thanks for the improvement! Merging to master -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-48317][PYTHON][CONNECT][TESTS] Enable `test_udtf_with_analyze_using_archive` and `test_udtf_with_analyze_using_file` [spark]
HyukjinKwon closed pull request #46632: [SPARK-48317][PYTHON][CONNECT][TESTS] Enable `test_udtf_with_analyze_using_archive` and `test_udtf_with_analyze_using_file` URL: https://github.com/apache/spark/pull/46632 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-48317][PYTHON][CONNECT][TESTS] Enable `test_udtf_with_analyze_using_archive` and `test_udtf_with_analyze_using_file` [spark]
HyukjinKwon commented on PR #46632: URL: https://github.com/apache/spark/pull/46632#issuecomment-2116752726 Merged to master. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[PR] [SPARK-48319][PYTHON][CONNECT][TESTS] Test `assert_true` and `raise_error` with the same error class as Spark Classic [spark]
zhengruifeng opened a new pull request, #46633: URL: https://github.com/apache/spark/pull/46633 ### What changes were proposed in this pull request? Test `assert_true` and `raise_error` with the same error class as Spark Classic ### Why are the changes needed? https://github.com/apache/spark/commit/578931678f5a6d6b33ebdae4bf866871e46fbb83 made `assert_true` and `raise_error` in Spark Connect throw `SparkRuntimeException`, then the error is the same as Spark Classic ### Does this PR introduce _any_ user-facing change? no, test only ### How was this patch tested? ci ### Was this patch authored or co-authored using generative AI tooling? no -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [WIP][SPARK-48000][SQL] Enable hash join support for all collations (StringType) [spark]
uros-db commented on code in PR #46599: URL: https://github.com/apache/spark/pull/46599#discussion_r1604352238 ## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/RewriteCollationJoin.scala: ## @@ -0,0 +1,62 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.sql.catalyst.analysis + +import org.apache.spark.sql.catalyst.expressions.{AttributeReference, CollationKey, EqualNullSafe, EqualTo, Lower} +import org.apache.spark.sql.catalyst.plans.logical.{Join, LogicalPlan} +import org.apache.spark.sql.catalyst.rules.Rule +import org.apache.spark.sql.catalyst.util.CollationFactory +import org.apache.spark.sql.types.StringType + +object RewriteCollationJoin extends Rule[LogicalPlan] { + def apply(plan: LogicalPlan): LogicalPlan = plan transformUpWithNewOutput { +case j @ Join(_, _, _, Some(condition), _) => + val newCondition = condition transform { +case EqualTo(l: AttributeReference, r: AttributeReference) => + (l.dataType, r.dataType) match { +case (st: StringType, _: StringType) => Review Comment: edit: updated PR title, description, and corresponding Jira ticket title to reflect this also, created new ticket to handle complex types (ArrayType, StructType, etc.) https://issues.apache.org/jira/browse/SPARK-48318 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-48312][SQL] Improve Alias.removeNonInheritableMetadata performance [spark]
cloud-fan commented on code in PR #46622: URL: https://github.com/apache/spark/pull/46622#discussion_r1604351546 ## sql/api/src/main/scala/org/apache/spark/sql/types/Metadata.scala: ## @@ -49,6 +49,9 @@ sealed class Metadata private[types] (private[types] val map: Map[String, Any]) /** Tests whether this Metadata contains a binding for a key. */ def contains(key: String): Boolean = map.contains(key) + /** Tests whether this Metadata is empty. */ + def isEmpty: Boolean = map.isEmpty Review Comment: +1 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [WIP][SPARK-48000][SQL] Enable hash join support for all collations (StringType) [spark]
uros-db commented on code in PR #46599: URL: https://github.com/apache/spark/pull/46599#discussion_r1604346112 ## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/RewriteCollationJoin.scala: ## @@ -0,0 +1,62 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.sql.catalyst.analysis + +import org.apache.spark.sql.catalyst.expressions.{AttributeReference, CollationKey, EqualNullSafe, EqualTo, Lower} +import org.apache.spark.sql.catalyst.plans.logical.{Join, LogicalPlan} +import org.apache.spark.sql.catalyst.rules.Rule +import org.apache.spark.sql.catalyst.util.CollationFactory +import org.apache.spark.sql.types.StringType + +object RewriteCollationJoin extends Rule[LogicalPlan] { + def apply(plan: LogicalPlan): LogicalPlan = plan transformUpWithNewOutput { +case j @ Join(_, _, _, Some(condition), _) => + val newCondition = condition transform { +case EqualTo(l: AttributeReference, r: AttributeReference) => + (l.dataType, r.dataType) match { +case (st: StringType, _: StringType) => Review Comment: will add support for complex types in a future PR, for now - only String should be supported -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [MINOR][PYTHON][TESTS] Call `test_apply_schema_to_dict_and_rows` in `test_apply_schema_to_row` [spark]
HyukjinKwon closed pull request #46631: [MINOR][PYTHON][TESTS] Call `test_apply_schema_to_dict_and_rows` in `test_apply_schema_to_row` URL: https://github.com/apache/spark/pull/46631 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [MINOR][PYTHON][TESTS] Call `test_apply_schema_to_dict_and_rows` in `test_apply_schema_to_row` [spark]
HyukjinKwon commented on PR #46631: URL: https://github.com/apache/spark/pull/46631#issuecomment-2116583644 Merged to master. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-41625][PYTHON][CONNECT][TESTS][FOLLOW-UP] Enable `DataFrameObservationParityTests.test_observe_str` [spark]
zhengruifeng commented on PR #46630: URL: https://github.com/apache/spark/pull/46630#issuecomment-2116579906 thanks, merged to master -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-41625][PYTHON][CONNECT][TESTS][FOLLOW-UP] Enable `DataFrameObservationParityTests.test_observe_str` [spark]
zhengruifeng closed pull request #46630: [SPARK-41625][PYTHON][CONNECT][TESTS][FOLLOW-UP] Enable `DataFrameObservationParityTests.test_observe_str` URL: https://github.com/apache/spark/pull/46630 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-47952][CORE][CONNECT] Support retrieving the real SparkConnectService GRPC address and port programmatically when running on Yarn [spark]
TakawaAkirayo commented on PR #46182: URL: https://github.com/apache/spark/pull/46182#issuecomment-2116573625 Gently ping @grundprinzip if anything else needs to be provided from my side :) -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-48238][BUILD][YARN] Replace AmIpFilter with re-implemented YarnAMIpFilter [spark]
pan3793 commented on PR #46611: URL: https://github.com/apache/spark/pull/46611#issuecomment-2116553658 > ... supposedly the workaround will be removed when the Yarn side upgrades their J2EE? I suppose not. According to the discussion in https://github.com/apache/spark/pull/31642 `org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter` should be part of `hadoop-client-api` but actually not, and > Agree that in the long term we should either: 1) consider to re-implement the logic in Spark which allows us to get away from server-side dependency in Hadoop ... -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-48238][BUILD][YARN] Replace AmIpFilter with re-implemented YarnAMIpFilter [spark]
HiuKwok commented on PR #46611: URL: https://github.com/apache/spark/pull/46611#issuecomment-2116550519 The patch makes sense to me, and supposedly the workaround will be removed when the Yarn side upgrades their J2EE? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-48316][PS][CONNECT][TESTS] Fix comments for SparkFrameMethodsParityTests.test_coalesce and test_repartition [spark]
HyukjinKwon closed pull request #46629: [SPARK-48316][PS][CONNECT][TESTS] Fix comments for SparkFrameMethodsParityTests.test_coalesce and test_repartition URL: https://github.com/apache/spark/pull/46629 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-48316][PS][CONNECT][TESTS] Fix comments for SparkFrameMethodsParityTests.test_coalesce and test_repartition [spark]
HyukjinKwon commented on PR #46629: URL: https://github.com/apache/spark/pull/46629#issuecomment-2116548415 Merged to master. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-48258][PYTHON][CONNECT] Checkpoint and localCheckpoint in Spark Connect [spark]
HyukjinKwon commented on code in PR #46570: URL: https://github.com/apache/spark/pull/46570#discussion_r1604289643 ## python/pyspark/sql/connect/dataframe.py: ## @@ -104,6 +107,8 @@ class DataFrame(ParentDataFrame): +_release_thread_pool: Optional[ThreadPool] = ThreadPool(os.cpu_count() if os.cpu_count() else 8) Review Comment: Oops -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-48258][PYTHON][CONNECT] Checkpoint and localCheckpoint in Spark Connect [spark]
zhengruifeng commented on code in PR #46570: URL: https://github.com/apache/spark/pull/46570#discussion_r1604267324 ## python/pyspark/sql/connect/dataframe.py: ## @@ -104,6 +107,8 @@ class DataFrame(ParentDataFrame): +_release_thread_pool: Optional[ThreadPool] = ThreadPool(os.cpu_count() if os.cpu_count() else 8) Review Comment: where is this used? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[PR] [SPARK-48317][PYTHON][CONNECT][TESTS] Enable `test_udtf_with_analyze_using_archive` and `test_udtf_with_analyze_using_file` [spark]
HyukjinKwon opened a new pull request, #46632: URL: https://github.com/apache/spark/pull/46632 ### What changes were proposed in this pull request? This PR proposes to enable the tests `test_udtf_with_analyze_using_archive` and `test_udtf_with_analyze_using_file`. ### Why are the changes needed? To make sure on the test coverage ### Does this PR introduce _any_ user-facing change? No, test-only. ### How was this patch tested? CI in this PR. ### Was this patch authored or co-authored using generative AI tooling? No. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-48213][SQL] Do not push down predicate if non-cheap expression exceed reused limit [spark]
zml1206 commented on PR #46499: URL: https://github.com/apache/spark/pull/46499#issuecomment-2116500596 `with` is a good idea, thank you very much @cloud-fan . Close it. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-48213][SQL] Do not push down predicate if non-cheap expression exceed reused limit [spark]
zml1206 closed pull request #46499: [SPARK-48213][SQL] Do not push down predicate if non-cheap expression exceed reused limit URL: https://github.com/apache/spark/pull/46499 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-48306][SQL] Improve UDT in error message [spark]
yaooqinn commented on PR #46616: URL: https://github.com/apache/spark/pull/46616#issuecomment-2116498446 Merged to master. Thank you @HyukjinKwon -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-48306][SQL] Improve UDT in error message [spark]
yaooqinn closed pull request #46616: [SPARK-48306][SQL] Improve UDT in error message URL: https://github.com/apache/spark/pull/46616 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-48113][CONNECT] Allow Plugins to integrate with Spark Connect [spark]
nchammas commented on PR #46364: URL: https://github.com/apache/spark/pull/46364#issuecomment-2116496398 Looks like the docs I am looking for are in #45340. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[PR] [MINOR][PYTHON][TESTS] Call `test_apply_schema_to_dict_and_rows` in `test_apply_schema_to_row` [spark]
HyukjinKwon opened a new pull request, #46631: URL: https://github.com/apache/spark/pull/46631 ### What changes were proposed in this pull request? This PR fixes the test `test_apply_schema_to_row` to call `test_apply_schema_to_row` instead of `test_apply_schema_to_dict_and_rows`. It was a mistake. ### Why are the changes needed? To avoid a mistake when it's enabled in the future. ### Does this PR introduce _any_ user-facing change? No, test-only ### How was this patch tested? CI in this PR. ### Was this patch authored or co-authored using generative AI tooling? No. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-48301][SQL][FOLLOWUP] Update the error message [spark]
zhengruifeng commented on PR #46628: URL: https://github.com/apache/spark/pull/46628#issuecomment-2116489481 merged to master -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-48301][SQL][FOLLOWUP] Update the error message [spark]
zhengruifeng closed pull request #46628: [SPARK-48301][SQL][FOLLOWUP] Update the error message URL: https://github.com/apache/spark/pull/46628 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[PR] [SPARK-41625][PYTHON][CONNECT][TESTS] Enable `DataFrameObservationParityTests.test_observe_str` [spark]
HyukjinKwon opened a new pull request, #46630: URL: https://github.com/apache/spark/pull/46630 ### What changes were proposed in this pull request? This PR proposes to enable `DataFrameObservationParityTests.test_observe_str`. ### Why are the changes needed? To make sure on the test coverage ### Does this PR introduce _any_ user-facing change? No, test-only. ### How was this patch tested? CI in this PR. ### Was this patch authored or co-authored using generative AI tooling? No. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-48316][PS][CONNECT][TESTS] Enable SparkFrameMethodsParityTests.test_coalesce and test_repartition [spark]
HyukjinKwon commented on PR #46629: URL: https://github.com/apache/spark/pull/46629#issuecomment-2116480337 cc @zhengruifeng -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[PR] [SPARK-48316][PS][CONNECT][TESTS] Enable SparkFrameMethodsParityTests.test_coalesce and test_repartition [spark]
HyukjinKwon opened a new pull request, #46629: URL: https://github.com/apache/spark/pull/46629 ### What changes were proposed in this pull request? This PR proposes to enable `SparkFrameMethodsParityTests.test_coalesce` and `SparkFrameMethodsParityTests.test_repartition` in Spark Connect by avoiding RDD usage in the test. ### Why are the changes needed? To make sure on the test coverage ### Does this PR introduce _any_ user-facing change? No, test-only. ### How was this patch tested? CI in this PR. ### Was this patch authored or co-authored using generative AI tooling? No. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-48251][BUILD] Disable `maven local cache` on GA's step `MIMA test` [spark]
panbingkun commented on PR #46551: URL: https://github.com/apache/spark/pull/46551#issuecomment-2116479316 1.with `maven local cache` https://github.com/panbingkun/spark/actions/runs/9109019204 https://github.com/apache/spark/assets/15246973/439b6397-21bb-427d-a740-481755b54a02";> 2.without `maven local cache` https://github.com/panbingkun/spark/actions/runs/9107211572 https://github.com/apache/spark/assets/15246973/0238e450-5e1d-4405-afec-8fd086760b82";> https://github.com/panbingkun/spark/actions/runs/9114668535 https://github.com/apache/spark/assets/15246973/d0f5ae74-06df-44d4-8388-c5adfe02ce2a";> -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-48310][PYTHON][CONNECT] Cached properties must return copies [spark]
HyukjinKwon closed pull request #46621: [SPARK-48310][PYTHON][CONNECT] Cached properties must return copies URL: https://github.com/apache/spark/pull/46621 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-48310][PYTHON][CONNECT] Cached properties must return copies [spark]
HyukjinKwon commented on PR #46621: URL: https://github.com/apache/spark/pull/46621#issuecomment-2116465949 Merged to master. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SQL][SPARK-48312] Improve Alias.removeNonInheritableMetadata performance [spark]
HyukjinKwon commented on code in PR #46622: URL: https://github.com/apache/spark/pull/46622#discussion_r1604233127 ## sql/api/src/main/scala/org/apache/spark/sql/types/Metadata.scala: ## @@ -49,6 +49,9 @@ sealed class Metadata private[types] (private[types] val map: Map[String, Any]) /** Tests whether this Metadata contains a binding for a key. */ def contains(key: String): Boolean = map.contains(key) + /** Tests whether this Metadata is empty. */ + def isEmpty: Boolean = map.isEmpty Review Comment: Can you add `@since 4.0.0` -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-43815][SQL] Wrap NPE with AnalysisException in CSV options [spark]
HyukjinKwon commented on code in PR #46626: URL: https://github.com/apache/spark/pull/46626#discussion_r1604232424 ## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/CSVOptions.scala: ## @@ -149,7 +149,12 @@ class CSVOptions( parameters.getOrElse(DateTimeUtils.TIMEZONE_OPTION, defaultTimeZoneId)) // A language tag in IETF BCP 47 format - val locale: Locale = parameters.get(LOCALE).map(Locale.forLanguageTag).getOrElse(Locale.US) + val locale: Locale = try { +parameters.get(LOCALE).map(Locale.forLanguageTag).getOrElse(Locale.US) + } catch { +case _: NullPointerException => Review Comment: and this issue should exist in almost all other options. Would probably be best to fix them together -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-43815][SQL] Wrap NPE with AnalysisException in CSV options [spark]
HyukjinKwon commented on code in PR #46626: URL: https://github.com/apache/spark/pull/46626#discussion_r1604232182 ## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/CSVOptions.scala: ## @@ -149,7 +149,12 @@ class CSVOptions( parameters.getOrElse(DateTimeUtils.TIMEZONE_OPTION, defaultTimeZoneId)) // A language tag in IETF BCP 47 format - val locale: Locale = parameters.get(LOCALE).map(Locale.forLanguageTag).getOrElse(Locale.US) + val locale: Locale = try { +parameters.get(LOCALE).map(Locale.forLanguageTag).getOrElse(Locale.US) + } catch { +case _: NullPointerException => Review Comment: Can we check if the value is `null`, and throw an exception instead? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-48303][CORE] Reorganize `LogKeys` [spark]
panbingkun commented on PR #46612: URL: https://github.com/apache/spark/pull/46612#issuecomment-2116460422 cc @gengliangwang -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-48314] Don't double cache files for FileStreamSource using Trigger.AvailableNow [spark]
Kimahriman commented on PR #46627: URL: https://github.com/apache/spark/pull/46627#issuecomment-2116427914 @HeartSaVioR since you added the file caching originally back in the day -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[PR] [SPARK-48314] Don't double cache files for FileStreamSource using Trigger.AvailableNow [spark]
Kimahriman opened a new pull request, #46627: URL: https://github.com/apache/spark/pull/46627 ### What changes were proposed in this pull request? Files don't need to be cached for reuse in `FileStreamSource` when using `Trigger.AvailableNow` because all files are already cached for the lifetime of the query in `allFilesForTriggerAvailableNow`. ### Why are the changes needed? As reported in https://issues.apache.org/jira/browse/SPARK-44924 (with a PR to address https://github.com/apache/spark/pull/45362), the hard coded cap of 10k files being cached can cause problems when using a maxFilesPerTrigger > 10k. It causes every other batch to be 10k files, which can greatly limit the throughput of a new streaming trying to catch up. ### Does this PR introduce _any_ user-facing change? Every other streaming batch won't be 10k files if using Trigger.AvailableNow and maxFilesPerTrigger greater than 10k. ### How was this patch tested? New UT ### Was this patch authored or co-authored using generative AI tooling? No -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-48301][SQL] Rename `CREATE_FUNC_WITH_IF_NOT_EXISTS_AND_REPLACE` to `CREATE_ROUTINE_WITH_IF_NOT_EXISTS_AND_REPLACE` [spark]
zhengruifeng commented on code in PR #46608: URL: https://github.com/apache/spark/pull/46608#discussion_r1604205064 ## common/utils/src/main/resources/error/error-conditions.json: ## @@ -2675,9 +2675,9 @@ "ANALYZE TABLE(S) ... COMPUTE STATISTICS ... must be either NOSCAN or empty." ] }, - "CREATE_FUNC_WITH_IF_NOT_EXISTS_AND_REPLACE" : { + "CREATE_ROUTINE_WITH_IF_NOT_EXISTS_AND_REPLACE" : { "message" : [ - "CREATE FUNCTION with both IF NOT EXISTS and REPLACE is not allowed." + "CREATE PROCEDURE or CREATE FUNCTION with both IF NOT EXISTS and REPLACE is not allowed." Review Comment: sounds good, we don't have PROCEDURE in OSS, will update in a followup PR -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [WIP][Spark 44646] Reduce usage of log4j core [spark]
github-actions[bot] closed pull request #45001: [WIP][Spark 44646] Reduce usage of log4j core URL: https://github.com/apache/spark/pull/45001 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-42789][SQL] Rewrite multiple GetJsonObject that consumes same JSON to single JsonTuple [spark]
github-actions[bot] closed pull request #45020: [SPARK-42789][SQL] Rewrite multiple GetJsonObject that consumes same JSON to single JsonTuple URL: https://github.com/apache/spark/pull/45020 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-48268][CORE] Add a configuration for SparkContext.setCheckpointDir [spark]
HyukjinKwon closed pull request #46571: [SPARK-48268][CORE] Add a configuration for SparkContext.setCheckpointDir URL: https://github.com/apache/spark/pull/46571 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-48268][CORE] Add a configuration for SparkContext.setCheckpointDir [spark]
HyukjinKwon commented on PR #46571: URL: https://github.com/apache/spark/pull/46571#issuecomment-2116375998 Merged to master. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-48031][SQL][FOLLOW-UP] Use ANSI-enabled cast in view lookup test [spark]
HyukjinKwon closed pull request #46614: [SPARK-48031][SQL][FOLLOW-UP] Use ANSI-enabled cast in view lookup test URL: https://github.com/apache/spark/pull/46614 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-48031][SQL][FOLLOW-UP] Use ANSI-enabled cast in view lookup test [spark]
HyukjinKwon commented on PR #46614: URL: https://github.com/apache/spark/pull/46614#issuecomment-2116375642 Merged to master. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] Bump rexml from 3.2.6 to 3.2.8 in /docs [spark]
dependabot[bot] commented on PR #46625: URL: https://github.com/apache/spark/pull/46625#issuecomment-2116374205 OK, I won't notify you again about this release, but will get in touch when a new version is available. If you'd rather skip all updates until the next major or minor version, let me know by commenting `@dependabot ignore this major version` or `@dependabot ignore this minor version`. If you change your mind, just re-open this PR and I'll resolve any conflicts on it. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] Bump rexml from 3.2.6 to 3.2.8 in /docs [spark]
HyukjinKwon closed pull request #46625: Bump rexml from 3.2.6 to 3.2.8 in /docs URL: https://github.com/apache/spark/pull/46625 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[PR] [SPARK-43815] Wrap NPE with AnalysisException in csv options [spark]
michaelzhan-db opened a new pull request, #46626: URL: https://github.com/apache/spark/pull/46626 ### What changes were proposed in this pull request? When user sets `locale` to be `null`, a NPE is raised. Instead, replace the NPE with a more understandable user facing error message. ### Why are the changes needed? Improves usability of csv options. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Added tests pass. ### Was this patch authored or co-authored using generative AI tooling? No -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-48294][SQL] Handle lowercase in nestedTypeMissingElementTypeError [spark]
gengliangwang commented on PR #46623: URL: https://github.com/apache/spark/pull/46623#issuecomment-2116274470 @michaelzhan-db there are merge conflicts against branch-3.5. Please create a new PR for the backport. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-48294][SQL] Handle lowercase in nestedTypeMissingElementTypeError [spark]
gengliangwang closed pull request #46623: [SPARK-48294][SQL] Handle lowercase in nestedTypeMissingElementTypeError URL: https://github.com/apache/spark/pull/46623 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-48294][SQL] Handle lowercase in nestedTypeMissingElementTypeError [spark]
gengliangwang commented on PR #46623: URL: https://github.com/apache/spark/pull/46623#issuecomment-2116272913 Thanks, merging to master and branch 3.5 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[PR] Bump rexml from 3.2.6 to 3.2.8 in /docs [spark]
dependabot[bot] opened a new pull request, #46625: URL: https://github.com/apache/spark/pull/46625 Bumps [rexml](https://github.com/ruby/rexml) from 3.2.6 to 3.2.8. Release notes Sourced from https://github.com/ruby/rexml/releases";>rexml's releases. REXML 3.2.8 - 2024-05-16 Fixes Suppressed a warning REXML 3.2.7 - 2024-05-16 Improvements Improve parse performance by using StringScanner. https://redirect.github.com/ruby/rexml/issues/106";>GH-106 https://redirect.github.com/ruby/rexml/issues/107";>GH-107 https://redirect.github.com/ruby/rexml/issues/108";>GH-108 https://redirect.github.com/ruby/rexml/issues/109";>GH-109 https://redirect.github.com/ruby/rexml/issues/112";>GH-112 https://redirect.github.com/ruby/rexml/issues/113";>GH-113 https://redirect.github.com/ruby/rexml/issues/114";>GH-114 https://redirect.github.com/ruby/rexml/issues/115";>GH-115 https://redirect.github.com/ruby/rexml/issues/116";>GH-116 https://redirect.github.com/ruby/rexml/issues/117";>GH-117 https://redirect.github.com/ruby/rexml/issues/118";>GH-118 https://redirect.github.com/ruby/rexml/issues/119";>GH-119 https://redirect.github.com/ruby/rexml/issues/121";>GH-121 Patch by NAITOH Jun. Improved parse performance when an attribute has manyGH-124 Fixes XPath: Fixed a bug of normalize_space(array). https://redirect.github.com/ruby/rexml/issues/110";>GH-110 https://redirect.github.com/ruby/rexml/issues/111";>GH-111 Patch by flatisland. XPath: Fixed a bug that wrong position is used with nested path. https://redirect.github.com/ruby/rexml/issues/110";>GH-110 https://redirect.github.com/ruby/rexml/issues/122";>GH-122 Reported by jcavalieri. Patch by NAITOH Jun. Fixed a bug that an exception message can't be generated for invalid encoding XML. ... (truncated) Changelog Sourced from https://github.com/ruby/rexml/blob/master/NEWS.md";>rexml's changelog. 3.2.8 - 2024-05-16 {#version-3-2-8} Fixes Suppressed a warning 3.2.7 - 2024-05-16 {#version-3-2-7} Improvements Improve parse performance by using StringScanner. https://redirect.github.com/ruby/rexml/issues/106";>GH-106 https://redirect.github.com/ruby/rexml/issues/107";>GH-107 https://redirect.github.com/ruby/rexml/issues/108";>GH-108 https://redirect.github.com/ruby/rexml/issues/109";>GH-109 https://redirect.github.com/ruby/rexml/issues/112";>GH-112 https://redirect.github.com/ruby/rexml/issues/113";>GH-113 https://redirect.github.com/ruby/rexml/issues/114";>GH-114 https://redirect.github.com/ruby/rexml/issues/115";>GH-115 https://redirect.github.com/ruby/rexml/issues/116";>GH-116 https://redirect.github.com/ruby/rexml/issues/117";>GH-117 https://redirect.github.com/ruby/rexml/issues/118";>GH-118 https://redirect.github.com/ruby/rexml/issues/119";>GH-119 https://redirect.github.com/ruby/rexml/issues/121";>GH-121 Patch by NAITOH Jun. Improved parse performance when an attribute has many GH-124 Fixes XPath: Fixed a bug of normalize_space(array). https://redirect.github.com/ruby/rexml/issues/110";>GH-110 https://redirect.github.com/ruby/rexml/issues/111";>GH-111 Patch by flatisland. XPath: Fixed a bug that wrong position is used with nested path. https://redirect.github.com/ruby/rexml/issues/110";>GH-110 https://redirect.github.com/ruby/rexml/issues/122";>GH-122 Reported by jcavalieri. Patch by NAITOH Jun. Fixed a bug that an exception message can't be generated for ... (truncated) Commits https://github.com/ruby/rexml/commit/1cf37bab79d61d6183bbda8bf525ed587012b718";>1cf37ba Add 3.2.8 entry https://github.com/ruby/rexml/commit/b67081caa807fad48d31983137b7ed8711e7f0df";>b67081c Remove an unused variable (https://redirect.github.com/ruby/rexml/issues/128";>#128) https://github.com/ruby/rexml/commit/94e180e939baff8f7e328a287bb96ebbd99db6eb";>94e180e Suppress a warning https://github.com/ruby/rexml/commit/d574ba5fe1c40adbafbf16e47533f4eb32b43e60";>d574ba5 ci: install only gems required for running tests (https://redirect.github.com/ruby/rexml/issues/129";>#129) https://github.com/ruby/rexml/commit/4670f8fc187c89d0504d027ea997959287143453";>4670f8f Add missing Thanks secti
Re: [PR] [SPARK-47920][DOCS][SS][PYTHON] Add doc for python streaming data source API [spark]
chaoqin-li1123 commented on code in PR #46139: URL: https://github.com/apache/spark/pull/46139#discussion_r1603979413 ## python/docs/source/user_guide/sql/python_data_source.rst: ## @@ -59,8 +59,17 @@ Start by creating a new subclass of :class:`DataSource`. Define the source name, def reader(self, schema: StructType): return FakeDataSourceReader(schema, self.options) +def streamReader(self, schema: StructType): Review Comment: I prefer not to duplicate the DataSource code. We already document that developer only need to implement the corresponding method for a certain capacity. ## python/docs/source/user_guide/sql/python_data_source.rst: ## @@ -33,9 +33,15 @@ To create a custom Python data source, you'll need to subclass the :class:`DataS This example demonstrates creating a simple data source to generate synthetic data using the `faker` library. Ensure the `faker` library is installed and accessible in your Python environment. -**Step 1: Define the Data Source** +**Define the Data Source** -Start by creating a new subclass of :class:`DataSource`. Define the source name, schema, and reader logic as follows: +Start by creating a new subclass of :class:`DataSource` with the source name, schema. + +In order to read from the data source in a batch query, reader() method need to be defined. + +In order to read from the data source in a streaming query, streamReader() or simpleStreamReader() method need to be defined. + +In order to write to the data source in a streaming query, streamWriter() method need to be defined. Review Comment: Table added. ## python/docs/source/user_guide/sql/python_data_source.rst: ## @@ -84,9 +101,158 @@ Define the reader logic to generate synthetic data. Use the `faker` library to p row.append(value) yield tuple(row) +Implementing Streaming Reader and Writer for Python Data Source +--- +**Implement the Stream Reader** + +This is a dummy streaming data reader that generate 2 rows in every microbatch. The streamReader instance has a integer offset that increase by 2 in every microbatch. + +.. code-block:: python + +class RangePartition(InputPartition): +def __init__(self, start, end): +self.start = start +self.end = end + +class FakeStreamReader(DataSourceStreamReader): +def __init__(self, schema, options): +self.current = 0 + +def initialOffset(self) -> dict: +""" +Return the initial start offset of the reader. +""" +return {"offset": 0} + +def latestOffset(self) -> dict: +""" +Return the current latest offset that the next microbatch will read to. +""" +self.current += 2 +return {"offset": self.current} + +def partitions(self, start: dict, end: dict): +""" +Plans the partitioning of the current microbatch defined by start and end offset, +it needs to return a sequence of :class:`InputPartition` object. +""" +return [RangePartition(start["offset"], end["offset"])] + +def commit(self, end: dict): +""" +This is invoked when the query has finished processing data before end offset, this can be used to clean up resource. +""" +pass + +def read(self, partition) -> Iterator[Tuple]: +""" +Takes a partition as an input and read an iterator of tuples from the data source. +""" +start, end = partition.start, partition.end +for i in range(start, end): +yield (i, str(i)) + +**Implement the Simple Stream Reader** + +If the data source has low throughput and doesn't require partitioning, you can implement SimpleDataSourceStreamReader instead of DataSourceStreamReader. + +One of simpleStreamReader() and streamReader() must be implemented for readable streaming data source. And simpleStreamReader() will only be invoked when streamReader() is not implemented. + +This is the same dummy streaming reader that generate 2 rows every batch implemented with SimpleDataSourceStreamReader interface. + +.. code-block:: python + +class SimpleStreamReader(SimpleDataSourceStreamReader): +def initialOffset(self): +""" +Return the initial start offset of the reader. +""" +return {"offset": 0} + +def read(self, start: dict) -> (Iterator[Tuple], dict): +""" +Takes start offset as an input, return an iterator of tuples and the start offset of next read. +""" +start_idx = start["offset"] +it = iter([(i,) for i in range(start_idx, start_idx + 2)]) +return (it, {"offset": start
Re: [PR] [SPARK-48268][CORE] Add a configuration for SparkContext.setCheckpointDir [spark]
mridulm commented on code in PR #46571: URL: https://github.com/apache/spark/pull/46571#discussion_r1603972872 ## core/src/main/scala/org/apache/spark/internal/config/package.scala: ## @@ -1317,6 +1317,16 @@ package object config { s" be less than or equal to ${ByteArrayMethods.MAX_ROUNDED_ARRAY_LENGTH}.") .createWithDefault(64 * 1024 * 1024) + private[spark] val CHECKPOINT_DIR = +ConfigBuilder("spark.checkpoint.dir") + .doc( +"Equivalent with SparkContext.setCheckpointDir. If set, the path becomes" + + "the default directory for checkpointing. It can be overwritten by" + + "SparkContext.setCheckpointDir.") Review Comment: nit: ```suggestion "Set the default directory for checkpointing. It can be overwritten by " + "SparkContext.setCheckpointDir.") ``` (I missed this in my earlier pass, my bad) ## docs/configuration.md: ## @@ -1795,6 +1795,16 @@ Apart from these, the following properties are also available, and may be useful 0.6.0 + + spark.checkpoint.dir + (none) + +Equivalent with SparkContext.setCheckpointDir. If set, the path becomes +the default directory for checkpointing. It can be overwritten by +SparkContext.setCheckpointDir. Review Comment: Same as above. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-48238][BUILD][YARN] Replace AmIpFilter with re-implemented YarnAMIpFilter [spark]
mridulm commented on PR #46611: URL: https://github.com/apache/spark/pull/46611#issuecomment-2116089785 +CC @tgravescs -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] Python ds preview [spark]
chaoqin-li1123 closed pull request #46624: Python ds preview URL: https://github.com/apache/spark/pull/46624 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[PR] Python ds preview [spark]
chaoqin-li1123 opened a new pull request, #46624: URL: https://github.com/apache/spark/pull/46624 ### What changes were proposed in this pull request? ### Why are the changes needed? ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? ### Was this patch authored or co-authored using generative AI tooling? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-48301][SQL] Rename `CREATE_FUNC_WITH_IF_NOT_EXISTS_AND_REPLACE` to `CREATE_ROUTINE_WITH_IF_NOT_EXISTS_AND_REPLACE` [spark]
srielau commented on code in PR #46608: URL: https://github.com/apache/spark/pull/46608#discussion_r1603902920 ## common/utils/src/main/resources/error/error-conditions.json: ## @@ -2675,9 +2675,9 @@ "ANALYZE TABLE(S) ... COMPUTE STATISTICS ... must be either NOSCAN or empty." ] }, - "CREATE_FUNC_WITH_IF_NOT_EXISTS_AND_REPLACE" : { + "CREATE_ROUTINE_WITH_IF_NOT_EXISTS_AND_REPLACE" : { "message" : [ - "CREATE FUNCTION with both IF NOT EXISTS and REPLACE is not allowed." + "CREATE PROCEDURE or CREATE FUNCTION with both IF NOT EXISTS and REPLACE is not allowed." Review Comment: That works too -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-48291][CORE][FOLLOWUP] Rename Java *LoggerSuite* as *SparkLoggerSuite* [spark]
gengliangwang closed pull request #46615: [SPARK-48291][CORE][FOLLOWUP] Rename Java *LoggerSuite* as *SparkLoggerSuite* URL: https://github.com/apache/spark/pull/46615 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-48291][CORE][FOLLOWUP] Rename Java *LoggerSuite* as *SparkLoggerSuite* [spark]
gengliangwang commented on PR #46615: URL: https://github.com/apache/spark/pull/46615#issuecomment-2115976061 Thanks, merging to master -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-47920][DOCS][SS][PYTHON] Add doc for python streaming data source API [spark]
allisonwang-db commented on code in PR #46139: URL: https://github.com/apache/spark/pull/46139#discussion_r1602361195 ## python/docs/source/user_guide/sql/python_data_source.rst: ## @@ -33,9 +33,15 @@ To create a custom Python data source, you'll need to subclass the :class:`DataS This example demonstrates creating a simple data source to generate synthetic data using the `faker` library. Ensure the `faker` library is installed and accessible in your Python environment. -**Step 1: Define the Data Source** +**Define the Data Source** -Start by creating a new subclass of :class:`DataSource`. Define the source name, schema, and reader logic as follows: +Start by creating a new subclass of :class:`DataSource` with the source name, schema. + +In order to read from the data source in a batch query, reader() method need to be defined. + +In order to read from the data source in a streaming query, streamReader() or simpleStreamReader() method need to be defined. + +In order to write to the data source in a streaming query, streamWriter() method need to be defined. Review Comment: Do you think it's more clear to have a markdown table here? streaming/batch x read/write. ## python/docs/source/user_guide/sql/python_data_source.rst: ## @@ -84,9 +101,158 @@ Define the reader logic to generate synthetic data. Use the `faker` library to p row.append(value) yield tuple(row) +Implementing Streaming Reader and Writer for Python Data Source +--- +**Implement the Stream Reader** + +This is a dummy streaming data reader that generate 2 rows in every microbatch. The streamReader instance has a integer offset that increase by 2 in every microbatch. + +.. code-block:: python + +class RangePartition(InputPartition): +def __init__(self, start, end): +self.start = start +self.end = end + +class FakeStreamReader(DataSourceStreamReader): +def __init__(self, schema, options): +self.current = 0 + +def initialOffset(self) -> dict: +""" +Return the initial start offset of the reader. +""" +return {"offset": 0} + +def latestOffset(self) -> dict: +""" +Return the current latest offset that the next microbatch will read to. +""" +self.current += 2 +return {"offset": self.current} + +def partitions(self, start: dict, end: dict): +""" +Plans the partitioning of the current microbatch defined by start and end offset, +it needs to return a sequence of :class:`InputPartition` object. +""" +return [RangePartition(start["offset"], end["offset"])] + +def commit(self, end: dict): +""" +This is invoked when the query has finished processing data before end offset, this can be used to clean up resource. +""" +pass + +def read(self, partition) -> Iterator[Tuple]: +""" +Takes a partition as an input and read an iterator of tuples from the data source. +""" +start, end = partition.start, partition.end +for i in range(start, end): +yield (i, str(i)) + +**Implement the Simple Stream Reader** + +If the data source has low throughput and doesn't require partitioning, you can implement SimpleDataSourceStreamReader instead of DataSourceStreamReader. + +One of simpleStreamReader() and streamReader() must be implemented for readable streaming data source. And simpleStreamReader() will only be invoked when streamReader() is not implemented. + +This is the same dummy streaming reader that generate 2 rows every batch implemented with SimpleDataSourceStreamReader interface. + +.. code-block:: python + +class SimpleStreamReader(SimpleDataSourceStreamReader): +def initialOffset(self): +""" +Return the initial start offset of the reader. +""" +return {"offset": 0} + +def read(self, start: dict) -> (Iterator[Tuple], dict): +""" +Takes start offset as an input, return an iterator of tuples and the start offset of next read. +""" +start_idx = start["offset"] +it = iter([(i,) for i in range(start_idx, start_idx + 2)]) +return (it, {"offset": start_idx + 2}) + +def readBetweenOffsets(self, start: dict, end: dict) -> Iterator[Tuple]: +""" +Takes start and end offset as input and read an iterator of data deterministically. +This is called whe query replay batches during restart or after failure. +""" +start_idx = start["offset"] +end_idx = end["offset"] +return iter([(i,) for i in range(start_id
Re: [PR] [SPARK-48307][SQL] InlineCTE should keep not-inlined relations in the original WithCTE node [spark]
amaliujia commented on code in PR #46617: URL: https://github.com/apache/spark/pull/46617#discussion_r1603865663 ## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/InlineCTE.scala: ## @@ -74,34 +70,33 @@ case class InlineCTE(alwaysInline: Boolean = false) extends Rule[LogicalPlan] { * * @param plan The plan to collect the CTEs from * @param cteMap A mutable map that accumulates the CTEs and their reference information by CTE - * ids. The value of the map is tuple whose elements are: - * - The CTE definition - * - The number of incoming references to the CTE. This includes references from - * other CTEs and regular places. - * - A mutable inner map that tracks outgoing references (counts) to other CTEs. + * ids. * @param outerCTEId While collecting the map we use this optional CTE id to identify the * current outer CTE. */ - def buildCTEMap( + private def buildCTEMap( plan: LogicalPlan, - cteMap: mutable.Map[Long, (CTERelationDef, Int, mutable.Map[Long, Int])], + cteMap: mutable.Map[Long, CTEReferenceInfo], outerCTEId: Option[Long] = None): Unit = { plan match { case WithCTE(child, cteDefs) => cteDefs.foreach { cteDef => - cteMap(cteDef.id) = (cteDef, 0, mutable.Map.empty.withDefaultValue(0)) + cteMap(cteDef.id) = CTEReferenceInfo( +cteDef = cteDef, +refCount = 0, +outgoingRefs = mutable.Map.empty.withDefaultValue(0), +shouldInline = true + ) } cteDefs.foreach { cteDef => buildCTEMap(cteDef, cteMap, Some(cteDef.id)) } buildCTEMap(child, cteMap, outerCTEId) case ref: CTERelationRef => -val (cteDef, refCount, refMap) = cteMap(ref.cteId) -cteMap(ref.cteId) = (cteDef, refCount + 1, refMap) +cteMap(ref.cteId) = cteMap(ref.cteId).withRefCountIncreased(1) outerCTEId.foreach { cteId => - val (_, _, outerRefMap) = cteMap(cteId) - outerRefMap(ref.cteId) += 1 + cteMap(cteId).recordOutgoingReference(ref.cteId) Review Comment: I personally like to not hide the `+1` implementation details here as I had troubles to understand it. Or the function name can be `IncreaseOutgoingReferenceByOne`? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[PR] [SPARK-48294] Handle lowercase in nestedTypeMissingElementTypeError [spark]
michaelzhan-db opened a new pull request, #46623: URL: https://github.com/apache/spark/pull/46623 ### What changes were proposed in this pull request? Handle lowercase values inside of nestTypeMissingElementTypeError to prevent match errors. ### Why are the changes needed? The previous match error was not user-friendly. Now it gives an actionable `INCOMPLETE_TYPE_DEFINITION` error. ### Does this PR introduce _any_ user-facing change? N/A ### How was this patch tested? Newly added tests pass. ### Was this patch authored or co-authored using generative AI tooling? No. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [WIP][SPARK-48281][SQL] Alter string search logic for UTF8_BINARY_LCASE collation (StringInStr, SubstringIndex) [spark]
mkaravel commented on code in PR #46589: URL: https://github.com/apache/spark/pull/46589#discussion_r1603752983 ## common/unsafe/src/main/java/org/apache/spark/sql/catalyst/util/CollationAwareUTF8String.java: ## @@ -278,47 +431,29 @@ public static UTF8String lowercaseSubStringIndex(final UTF8String string, return UTF8String.EMPTY_UTF8; } -UTF8String lowercaseString = string.toLowerCase(); UTF8String lowercaseDelimiter = delimiter.toLowerCase(); if (count > 0) { - int idx = -1; + // search left to right (note: the start code point is inclusive) Review Comment: ```suggestion // Search left to right (note: the start code point is inclusive). ``` Please use regular sentences as comments. ## common/unsafe/src/main/java/org/apache/spark/sql/catalyst/util/CollationAwareUTF8String.java: ## @@ -278,47 +431,29 @@ public static UTF8String lowercaseSubStringIndex(final UTF8String string, return UTF8String.EMPTY_UTF8; } -UTF8String lowercaseString = string.toLowerCase(); UTF8String lowercaseDelimiter = delimiter.toLowerCase(); if (count > 0) { - int idx = -1; + // search left to right (note: the start code point is inclusive) + int matchLength = -1; while (count > 0) { -idx = lowercaseString.find(lowercaseDelimiter, idx + 1); -if (idx >= 0) { - count--; -} else { - // can not find enough delim - return string; -} - } - if (idx == 0) { -return UTF8String.EMPTY_UTF8; +matchLength = lowercaseFind(string, lowercaseDelimiter, matchLength + 1); +if (matchLength > MATCH_NOT_FOUND) count--; // found a delimiter +else return string; // cannot find enough delimiters in the string Review Comment: ```suggestion else return string; // Cannot find enough delimiters in the string. ``` ## common/unsafe/src/main/java/org/apache/spark/sql/catalyst/util/CollationAwareUTF8String.java: ## @@ -278,47 +431,29 @@ public static UTF8String lowercaseSubStringIndex(final UTF8String string, return UTF8String.EMPTY_UTF8; } -UTF8String lowercaseString = string.toLowerCase(); UTF8String lowercaseDelimiter = delimiter.toLowerCase(); if (count > 0) { - int idx = -1; + // search left to right (note: the start code point is inclusive) + int matchLength = -1; while (count > 0) { -idx = lowercaseString.find(lowercaseDelimiter, idx + 1); -if (idx >= 0) { - count--; -} else { - // can not find enough delim - return string; -} - } - if (idx == 0) { -return UTF8String.EMPTY_UTF8; +matchLength = lowercaseFind(string, lowercaseDelimiter, matchLength + 1); +if (matchLength > MATCH_NOT_FOUND) count--; // found a delimiter +else return string; // cannot find enough delimiters in the string } - byte[] bytes = new byte[idx]; - copyMemory(string.getBaseObject(), string.getBaseOffset(), bytes, BYTE_ARRAY_OFFSET, idx); - return UTF8String.fromBytes(bytes); - + if (matchLength == 0) return UTF8String.EMPTY_UTF8; + return string.substring(0, matchLength); } else { - int idx = string.numBytes() - delimiter.numBytes() + 1; + // search right to left (note: the end code point is exclusive) + int matchLength = string.numChars() + 1; count = -count; while (count > 0) { -idx = lowercaseString.rfind(lowercaseDelimiter, idx - 1); -if (idx >= 0) { - count--; -} else { - // can not find enough delim - return string; -} +matchLength = lowercaseRFind(string, lowercaseDelimiter, matchLength - 1); +if (matchLength > MATCH_NOT_FOUND) count--; // found a delimiter +else return string; // cannot find enough delimiters in the string } - if (idx + delimiter.numBytes() == string.numBytes()) { -return UTF8String.EMPTY_UTF8; - } - int size = string.numBytes() - delimiter.numBytes() - idx; - byte[] bytes = new byte[size]; - copyMemory(string.getBaseObject(), string.getBaseOffset() + idx + delimiter.numBytes(), -bytes, BYTE_ARRAY_OFFSET, size); - return UTF8String.fromBytes(bytes); + if (matchLength == string.numChars()) return UTF8String.EMPTY_UTF8; + return string.substring(matchLength, string.numChars()); Review Comment: ```suggestion return string.substring(matchLength, string.numChars()); ``` Same here. ## common/unsafe/src/main/java/org/apache/spark/sql/catalyst/util/CollationAwareUTF8String.java: ## @@ -278,47 +431,29 @@ public static UTF8String lowercaseSubStringIndex(final UTF8String string, return UTF8String.EMPTY_UTF8; } -UTF8String lowercaseString = string.toLowerCase(); UTF8
Re: [PR] [WIP][SPARK-48221][SQL] Alter string search logic for UTF8_BINARY_LCASE collation (Contains, StartsWith, EndsWith, StringLocate) [spark]
mkaravel commented on code in PR #46511: URL: https://github.com/apache/spark/pull/46511#discussion_r1603716669 ## common/unsafe/src/main/java/org/apache/spark/sql/catalyst/util/CollationAwareUTF8String.java: ## @@ -34,6 +34,143 @@ * Utility class for collation-aware UTF8String operations. */ public class CollationAwareUTF8String { + + /** + * The constant value to indicate that the match is not found + * when searching for a pattern string in a target string. + */ + private static final int MATCH_NOT_FOUND = -1; + + /** + * Returns whether the target string starts with the specified prefix, + * with respect to the UTF8_BINARY_LCASE collation. The method assumes + * that the prefix is already lowercased prior to method call to avoid the + * overhead of calling .toLowerCase() multiple times on the same prefix string. + * + * @param target the string to be searched in + * @param lowercasePattern the string to be searched for + * @param startPos the start position for searching (in the target string) Review Comment: Let's specify that this is 0-based. ## common/unsafe/src/main/java/org/apache/spark/sql/catalyst/util/CollationAwareUTF8String.java: ## @@ -34,6 +34,143 @@ * Utility class for collation-aware UTF8String operations. */ public class CollationAwareUTF8String { + + /** + * The constant value to indicate that the match is not found + * when searching for a pattern string in a target string. + */ + private static final int MATCH_NOT_FOUND = -1; + + /** + * Returns whether the target string starts with the specified prefix, + * with respect to the UTF8_BINARY_LCASE collation. The method assumes + * that the prefix is already lowercased prior to method call to avoid the + * overhead of calling .toLowerCase() multiple times on the same prefix string. + * + * @param target the string to be searched in + * @param lowercasePattern the string to be searched for + * @param startPos the start position for searching (in the target string) + * @return whether the target string starts with the specified prefix in UTF8_BINARY_LCASE + */ + public static boolean lowercaseMatchFrom( + final UTF8String target, + final UTF8String lowercasePattern, + int startPos) { +return lowercaseMatchLengthFrom(target, lowercasePattern, startPos) != MATCH_NOT_FOUND; + } + + /** + * Returns the length of the substring of the target string that starts with + * the specified prefix, with respect to the UTF8_BINARY_LCASE collation. + * The method assumes that the prefix is already lowercased. The method only + * considers the part of target string that starts from the specified position. + * + * @param target the string to be searched in + * @param lowercasePattern the string to be searched for + * @param startPos the end position for searching (in the target string) + * @return length of the target substring that ends with the specified suffix in lowercase + */ + public static int lowercaseMatchLengthFrom( + final UTF8String target, + final UTF8String lowercasePattern, + int startPos) { +assert startPos >= 0; +for (int len = 0; len <= target.numChars() - startPos; ++len) { + if (target.substring(startPos, startPos + len).toLowerCase().equals(lowercasePattern)) { +return len; + } +} +return MATCH_NOT_FOUND; + } + + /** + * Returns the position of the first occurrence of the pattern string + * in the target string from the specified position (0-based index), + * with respect to the UTF8_BINARY_LCASE collation. The method assumes + * that the pattern string is already lowercased prior to method call. + * + * @param target the string to be searched in + * @param lowercasePattern the string to be searched for + * @param startPos the start position for searching (in the target string) + * @return the position of the first occurrence of pattern in target, if not found, -1 returned. Review Comment: ```suggestion * @return the position of the first occurrence of pattern in target, if not found, `MATCH_NOT_FOUND` returned. ``` ## common/unsafe/src/main/java/org/apache/spark/sql/catalyst/util/CollationAwareUTF8String.java: ## @@ -34,6 +34,143 @@ * Utility class for collation-aware UTF8String operations. */ public class CollationAwareUTF8String { + + /** + * The constant value to indicate that the match is not found + * when searching for a pattern string in a target string. + */ + private static final int MATCH_NOT_FOUND = -1; + + /** + * Returns whether the target string starts with the specified prefix, + * with respect to the UTF8_BINARY_LCASE collation. The method assumes + * that the prefix is already lowercased prior to method call to avoid the + * overhead of calling .toLowerCase() multiple times on the same prefix string. + * + * @param target the string to b
Re: [PR] [SPARK-48175][SQL][PYTHON] Store collation information in metadata and not in type for SER/DE [spark]
stefankandic commented on code in PR #46280: URL: https://github.com/apache/spark/pull/46280#discussion_r1603733508 ## sql/catalyst/src/test/scala/org/apache/spark/sql/types/DataTypeSuite.scala: ## @@ -712,4 +714,181 @@ class DataTypeSuite extends SparkFunSuite { assert(result === expected) } + + test("schema with collation should not change during ser/de") { +val simpleStruct = StructType( + StructField("c1", StringType(UNICODE_COLLATION_ID)) :: Nil) + +val nestedStruct = StructType( + StructField("nested", simpleStruct) :: Nil) + +val caseInsensitiveNames = StructType( + StructField("c1", StringType(UNICODE_COLLATION_ID)) :: + StructField("C1", StringType(UNICODE_COLLATION_ID)) :: Nil) + +val specialCharsInName = StructType( + StructField("c1.*23?", StringType(UNICODE_COLLATION_ID)) :: Nil) + +val arrayInSchema = StructType( + StructField("arrayField", ArrayType(StringType(UNICODE_COLLATION_ID))) :: Nil) + +val mapInSchema = StructType( + StructField("mapField", +MapType(StringType(UNICODE_COLLATION_ID), StringType(UNICODE_COLLATION_ID))) :: Nil) + +val mapWithKeyInNameInSchema = StructType( + StructField("name.key", StringType) :: + StructField("name", +MapType(StringType(UNICODE_COLLATION_ID), StringType(UNICODE_COLLATION_ID))) :: Nil) + +val arrayInMapInNestedSchema = StructType( + StructField("arrInMap", +MapType(StringType(UNICODE_COLLATION_ID), +ArrayType(StringType(UNICODE_COLLATION_ID :: Nil) + +val nestedArrayInMap = StructType( + StructField("nestedArrayInMap", +ArrayType(MapType(StringType(UNICODE_COLLATION_ID), + ArrayType(ArrayType(StringType(UNICODE_COLLATION_ID)) :: Nil) + +val schemaWithMultipleFields = StructType( + simpleStruct.fields ++ nestedStruct.fields ++ arrayInSchema.fields ++ mapInSchema.fields ++ +mapWithKeyInNameInSchema ++ arrayInMapInNestedSchema.fields ++ nestedArrayInMap.fields) + +Seq( + simpleStruct, caseInsensitiveNames, specialCharsInName, nestedStruct, arrayInSchema, + mapInSchema, mapWithKeyInNameInSchema, nestedArrayInMap, arrayInMapInNestedSchema, + schemaWithMultipleFields) + .foreach { schema => +val json = schema.json +val parsed = DataType.fromJson(json) +assert(parsed === schema) + } + } + + test("non string field has collation metadata") { Review Comment: `StructTypeSuite` is used just to validate `toJson` value, deserialization is tested in `DataTypeSuite` -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-48175][SQL][PYTHON] Store collation information in metadata and not in type for SER/DE [spark]
stefankandic commented on code in PR #46280: URL: https://github.com/apache/spark/pull/46280#discussion_r1603732173 ## sql/api/src/main/scala/org/apache/spark/sql/types/DataType.scala: ## @@ -208,22 +206,35 @@ object DataType { } // NOTE: Map fields must be sorted in alphabetical order to keep consistent with the Python side. - private[sql] def parseDataType(json: JValue): DataType = json match { + private[sql] def parseDataType( + json: JValue, + fieldPath: String = "", + collationsMap: Map[String, String] = Map.empty): DataType = json match { case JString(name) => - nameToType(name) + collationsMap.get(fieldPath) match { +case Some(collation) => + assertValidTypeForCollations(fieldPath, name, collationsMap) + stringTypeWithCollation(collation) +case _ => nameToType(name) + } case JSortedObject( ("containsNull", JBool(n)), ("elementType", t: JValue), ("type", JString("array"))) => - ArrayType(parseDataType(t), n) + assertValidTypeForCollations(fieldPath, "array", collationsMap) + val elementType = parseDataTypeWithCollation(t, fieldPath + ".element", collationsMap) + ArrayType(elementType, n) case JSortedObject( ("keyType", k: JValue), ("type", JString("map")), ("valueContainsNull", JBool(n)), ("valueType", v: JValue)) => - MapType(parseDataType(k), parseDataType(v), n) + assertValidTypeForCollations(fieldPath, "map", collationsMap) + val keyType = parseDataTypeWithCollation(k, fieldPath + ".key", collationsMap) + val valueType = parseDataTypeWithCollation(v, fieldPath + ".value", collationsMap) Review Comment: good catch! After some refactoring earlier this is no longer needed -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-48301][SQL] Rename `CREATE_FUNC_WITH_IF_NOT_EXISTS_AND_REPLACE` to `CREATE_ROUTINE_WITH_IF_NOT_EXISTS_AND_REPLACE` [spark]
allisonwang-db commented on code in PR #46608: URL: https://github.com/apache/spark/pull/46608#discussion_r1603731266 ## common/utils/src/main/resources/error/error-conditions.json: ## @@ -2675,9 +2675,9 @@ "ANALYZE TABLE(S) ... COMPUTE STATISTICS ... must be either NOSCAN or empty." ] }, - "CREATE_FUNC_WITH_IF_NOT_EXISTS_AND_REPLACE" : { + "CREATE_ROUTINE_WITH_IF_NOT_EXISTS_AND_REPLACE" : { "message" : [ - "CREATE FUNCTION with both IF NOT EXISTS and REPLACE is not allowed." + "CREATE PROCEDURE or CREATE FUNCTION with both IF NOT EXISTS and REPLACE is not allowed." Review Comment: How about: "Cannot create a routine with both IF NOT EXISTS and REPLACE specified." -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-48175][SQL][PYTHON] Store collation information in metadata and not in type for SER/DE [spark]
stefankandic commented on code in PR #46280: URL: https://github.com/apache/spark/pull/46280#discussion_r1603727528 ## common/unsafe/src/main/java/org/apache/spark/sql/catalyst/util/CollationFactory.java: ## @@ -36,11 +36,45 @@ * Provides functionality to the UTF8String object which respects defined collation settings. */ public final class CollationFactory { + + /** + * Identifier for single a collation. + */ + public static class CollationIdentifier { +public final String provider; +public final String name; +public final String version; Review Comment: So you also think we should put optional here? Even intellij complains when it is used in a field. https://github.com/apache/spark/assets/154237371/ea7bd5c5-43cb-4826-aae5-bf76588b6198";> Also, you can read below that it's intended to be used for method return type -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SQL][SPARK-48312] Improve Alias.removeNonInheritableMetadata performance [spark]
agubichev commented on PR #46622: URL: https://github.com/apache/spark/pull/46622#issuecomment-2115745170 @cloud-fan PTAL -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-48175][SQL][PYTHON] Store collation information in metadata and not in type for SER/DE [spark]
stefankandic commented on code in PR #46280: URL: https://github.com/apache/spark/pull/46280#discussion_r1603651070 ## sql/api/src/main/scala/org/apache/spark/sql/types/StructField.scala: ## @@ -63,7 +66,61 @@ case class StructField( ("name" -> name) ~ ("type" -> dataType.jsonValue) ~ ("nullable" -> nullable) ~ - ("metadata" -> metadata.jsonValue) + ("metadata" -> metadataJson) + } + + private def metadataJson: JValue = { +val metadataJsonValue = metadata.jsonValue +metadataJsonValue match { + case JObject(fields) if collationMetadata.nonEmpty => +val collationFields = collationMetadata.map(kv => kv._1 -> JString(kv._2)).toList +JObject(fields :+ (DataType.COLLATIONS_METADATA_KEY -> JObject(collationFields))) + + case _ => metadataJsonValue +} + } + + /** Map of field path to collation name. */ + private lazy val collationMetadata: Map[String, String] = { +val fieldToCollationMap = mutable.Map[String, String]() + +def visitRecursively(dt: DataType, path: String): Unit = dt match { + case at: ArrayType => +processDataType(at.elementType, path + ".element") + + case mt: MapType => +processDataType(mt.keyType, path + ".key") +processDataType(mt.valueType, path + ".value") + + case st: StringType if isCollatedString(st) => +fieldToCollationMap(path) = schemaCollationValue(st) + + case _ => +} + +def processDataType(dt: DataType, path: String): Unit = { + if (isCollatedString(dt)) { +fieldToCollationMap(path) = schemaCollationValue(dt) + } else { +visitRecursively(dt, path) + } +} + +visitRecursively(dataType, name) +fieldToCollationMap.toMap + } + + private def isCollatedString(dt: DataType): Boolean = dt match { +case st: StringType => !st.isUTF8BinaryCollation +case _ => false + } + + private def schemaCollationValue(dt: DataType): String = dt match { +case st: StringType => + val collation = CollationFactory.fetchCollation(st.collationId) + collation.identifier().toStringWithoutVersion() Review Comment: for writing statistics - version change invalidates them -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SQL][SPARK-48312] Improve Alias.removeNonInheritableMetadata performance [spark]
vladimirg-db commented on PR #46622: URL: https://github.com/apache/spark/pull/46622#issuecomment-2115620509 @agubichev, hi! Here's the improvement fix for the ultra-wide views -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-48159][SQL] Extending support for collated strings on datetime expressions [spark]
nebojsa-db commented on PR #46618: URL: https://github.com/apache/spark/pull/46618#issuecomment-2115535153 @cloud-fan Please review :) -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[PR] [SPARK-48309][YARN]Stop am retry, in situations where some errors and… [spark]
guixiaowen opened a new pull request, #46620: URL: https://github.com/apache/spark/pull/46620 … retries may not be successful ### What changes were proposed in this pull request? In yarn cluster mode, spark.yarn.maxAppAttempts will be configured. In our production environment, it is configured as 2 If the first execution fails, AM will retry. However, in some scenarios, even attempting a second task may fail. For example: org. apache. park. SQL AnalysisException: Table or view not found: test.test_x; Line 1 pos 14; Project +-Unresolved Relationship [bigdata_qa, testx_x], [], false Other example: Caused by: org. apache. hadoop. hdfs. protocol NSQuotaExceededException: The NameSpace quota (directories and files) of directory/tmp/xxx_file/ is exceeded: quota=100 file count=101 Would it be more appropriate to try capturing these exceptions and stopping retry? ### Why are the changes needed? In some scenarios, even attempting a second task may fail. ### Does this PR introduce _any_ user-facing change? The user can throw a SparkStopAMRetryException, and the Application Master will catch the exception and stop retry For examle val spark = SparkSession .builder() .appName("Spark SQL basic example") .enableHiveSupport() .config("spark.some.config.option", "some-value") .getOrCreate() try { spark.sql("select * from test.test_x;").show } catch { case e:AnalysisException => throw new SparkStopAMRetryException("this is a test", e) } finally { spark.stop() } ### How was this patch tested? ### Was this patch authored or co-authored using generative AI tooling? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-47424][SQL] Add getDatabaseCalendar method to the JdbcDialect [spark]
PetarVasiljevic-DB commented on code in PR #45537: URL: https://github.com/apache/spark/pull/45537#discussion_r1603513988 ## connector/docker-integration-tests/src/test/scala/org/apache/spark/sql/jdbc/v2/OracleIntegrationSuite.scala: ## @@ -125,6 +126,29 @@ class OracleIntegrationSuite extends DockerJDBCIntegrationV2Suite with V2JDBCTes override def caseConvert(tableName: String): String = tableName.toUpperCase(Locale.ROOT) + override protected val timestampNTZType: String = "TIMESTAMP WITH LOCAL TIME ZONE" Review Comment: Yep, thanks -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-48308][Core] Unify getting data schema without partition columns in FileSourceStrategy [spark]
cloud-fan closed pull request #46619: [SPARK-48308][Core] Unify getting data schema without partition columns in FileSourceStrategy URL: https://github.com/apache/spark/pull/46619 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-48308][Core] Unify getting data schema without partition columns in FileSourceStrategy [spark]
cloud-fan commented on PR #46619: URL: https://github.com/apache/spark/pull/46619#issuecomment-2115430404 thanks, merging to master! -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-47424][SQL] Add getDatabaseCalendar method to the JdbcDialect [spark]
cloud-fan commented on code in PR #45537: URL: https://github.com/apache/spark/pull/45537#discussion_r1603428926 ## connector/docker-integration-tests/src/test/scala/org/apache/spark/sql/jdbc/v2/OracleIntegrationSuite.scala: ## @@ -125,6 +126,29 @@ class OracleIntegrationSuite extends DockerJDBCIntegrationV2Suite with V2JDBCTes override def caseConvert(tableName: String): String = tableName.toUpperCase(Locale.ROOT) + override protected val timestampNTZType: String = "TIMESTAMP WITH LOCAL TIME ZONE" Review Comment: isn't it LTZ? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-47424][SQL] Add getDatabaseCalendar method to the JdbcDialect [spark]
cloud-fan commented on code in PR #45537: URL: https://github.com/apache/spark/pull/45537#discussion_r1603426755 ## connector/docker-integration-tests/src/test/scala/org/apache/spark/sql/jdbc/v2/DB2IntegrationSuite.scala: ## @@ -102,4 +105,7 @@ class DB2IntegrationSuite extends DockerJDBCIntegrationV2Suite with V2JDBCTest { } override def caseConvert(tableName: String): String = tableName.toUpperCase(Locale.ROOT) + + override protected val timestampNTZType: String = "TIMESTAMP WITHOUT TIME ZONE" + override protected val timestampTZType: String = "TIMESTAMP" Review Comment: do you mean NTZ? TZ usually means WITH TIME ZONE -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [WIP][SPARK-48000][SQL] Enable hash join support for all collations (StringType) [spark]
cloud-fan commented on code in PR #46599: URL: https://github.com/apache/spark/pull/46599#discussion_r1603420522 ## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/RewriteCollationJoin.scala: ## @@ -0,0 +1,62 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.sql.catalyst.analysis + +import org.apache.spark.sql.catalyst.expressions.{AttributeReference, CollationKey, EqualNullSafe, EqualTo, Lower} +import org.apache.spark.sql.catalyst.plans.logical.{Join, LogicalPlan} +import org.apache.spark.sql.catalyst.rules.Rule +import org.apache.spark.sql.catalyst.util.CollationFactory +import org.apache.spark.sql.types.StringType + +object RewriteCollationJoin extends Rule[LogicalPlan] { + def apply(plan: LogicalPlan): LogicalPlan = plan transformUpWithNewOutput { +case j @ Join(_, _, _, Some(condition), _) => + val newCondition = condition transform { +case EqualTo(l: AttributeReference, r: AttributeReference) => + (l.dataType, r.dataType) match { +case (st: StringType, _: StringType) => Review Comment: how about nested string type such as array of string, struct of string? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-48175][SQL][PYTHON] Store collation information in metadata and not in type for SER/DE [spark]
cloud-fan commented on code in PR #46280: URL: https://github.com/apache/spark/pull/46280#discussion_r1603406813 ## sql/catalyst/src/test/scala/org/apache/spark/sql/types/DataTypeSuite.scala: ## @@ -712,4 +714,181 @@ class DataTypeSuite extends SparkFunSuite { assert(result === expected) } + + test("schema with collation should not change during ser/de") { +val simpleStruct = StructType( + StructField("c1", StringType(UNICODE_COLLATION_ID)) :: Nil) + +val nestedStruct = StructType( + StructField("nested", simpleStruct) :: Nil) + +val caseInsensitiveNames = StructType( + StructField("c1", StringType(UNICODE_COLLATION_ID)) :: + StructField("C1", StringType(UNICODE_COLLATION_ID)) :: Nil) + +val specialCharsInName = StructType( + StructField("c1.*23?", StringType(UNICODE_COLLATION_ID)) :: Nil) + +val arrayInSchema = StructType( + StructField("arrayField", ArrayType(StringType(UNICODE_COLLATION_ID))) :: Nil) + +val mapInSchema = StructType( + StructField("mapField", +MapType(StringType(UNICODE_COLLATION_ID), StringType(UNICODE_COLLATION_ID))) :: Nil) + +val mapWithKeyInNameInSchema = StructType( + StructField("name.key", StringType) :: + StructField("name", +MapType(StringType(UNICODE_COLLATION_ID), StringType(UNICODE_COLLATION_ID))) :: Nil) + +val arrayInMapInNestedSchema = StructType( + StructField("arrInMap", +MapType(StringType(UNICODE_COLLATION_ID), +ArrayType(StringType(UNICODE_COLLATION_ID :: Nil) + +val nestedArrayInMap = StructType( + StructField("nestedArrayInMap", +ArrayType(MapType(StringType(UNICODE_COLLATION_ID), + ArrayType(ArrayType(StringType(UNICODE_COLLATION_ID)) :: Nil) + +val schemaWithMultipleFields = StructType( + simpleStruct.fields ++ nestedStruct.fields ++ arrayInSchema.fields ++ mapInSchema.fields ++ +mapWithKeyInNameInSchema ++ arrayInMapInNestedSchema.fields ++ nestedArrayInMap.fields) + +Seq( + simpleStruct, caseInsensitiveNames, specialCharsInName, nestedStruct, arrayInSchema, + mapInSchema, mapWithKeyInNameInSchema, nestedArrayInMap, arrayInMapInNestedSchema, + schemaWithMultipleFields) + .foreach { schema => +val json = schema.json +val parsed = DataType.fromJson(json) +assert(parsed === schema) + } + } + + test("non string field has collation metadata") { Review Comment: how do we decide to put test in here or `StructTypeSuite`? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-48175][SQL][PYTHON] Store collation information in metadata and not in type for SER/DE [spark]
cloud-fan commented on code in PR #46280: URL: https://github.com/apache/spark/pull/46280#discussion_r1603405105 ## sql/api/src/main/scala/org/apache/spark/sql/types/StructField.scala: ## @@ -63,7 +66,61 @@ case class StructField( ("name" -> name) ~ ("type" -> dataType.jsonValue) ~ ("nullable" -> nullable) ~ - ("metadata" -> metadata.jsonValue) + ("metadata" -> metadataJson) + } + + private def metadataJson: JValue = { +val metadataJsonValue = metadata.jsonValue +metadataJsonValue match { + case JObject(fields) if collationMetadata.nonEmpty => +val collationFields = collationMetadata.map(kv => kv._1 -> JString(kv._2)).toList +JObject(fields :+ (DataType.COLLATIONS_METADATA_KEY -> JObject(collationFields))) + + case _ => metadataJsonValue +} + } + + /** Map of field path to collation name. */ + private lazy val collationMetadata: Map[String, String] = { +val fieldToCollationMap = mutable.Map[String, String]() + +def visitRecursively(dt: DataType, path: String): Unit = dt match { + case at: ArrayType => +processDataType(at.elementType, path + ".element") + + case mt: MapType => +processDataType(mt.keyType, path + ".key") +processDataType(mt.valueType, path + ".value") + + case st: StringType if isCollatedString(st) => +fieldToCollationMap(path) = schemaCollationValue(st) + + case _ => +} + +def processDataType(dt: DataType, path: String): Unit = { + if (isCollatedString(dt)) { +fieldToCollationMap(path) = schemaCollationValue(dt) + } else { +visitRecursively(dt, path) + } +} + +visitRecursively(dataType, name) +fieldToCollationMap.toMap + } + + private def isCollatedString(dt: DataType): Boolean = dt match { +case st: StringType => !st.isUTF8BinaryCollation +case _ => false + } + + private def schemaCollationValue(dt: DataType): String = dt match { +case st: StringType => + val collation = CollationFactory.fetchCollation(st.collationId) + collation.identifier().toStringWithoutVersion() Review Comment: when will we use the version? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-48175][SQL][PYTHON] Store collation information in metadata and not in type for SER/DE [spark]
cloud-fan commented on code in PR #46280: URL: https://github.com/apache/spark/pull/46280#discussion_r1603404296 ## sql/api/src/main/scala/org/apache/spark/sql/types/StructField.scala: ## @@ -63,7 +66,61 @@ case class StructField( ("name" -> name) ~ ("type" -> dataType.jsonValue) ~ ("nullable" -> nullable) ~ - ("metadata" -> metadata.jsonValue) + ("metadata" -> metadataJson) + } + + private def metadataJson: JValue = { +val metadataJsonValue = metadata.jsonValue +metadataJsonValue match { + case JObject(fields) if collationMetadata.nonEmpty => +val collationFields = collationMetadata.map(kv => kv._1 -> JString(kv._2)).toList +JObject(fields :+ (DataType.COLLATIONS_METADATA_KEY -> JObject(collationFields))) + + case _ => metadataJsonValue +} + } + + /** Map of field path to collation name. */ + private lazy val collationMetadata: Map[String, String] = { +val fieldToCollationMap = mutable.Map[String, String]() + +def visitRecursively(dt: DataType, path: String): Unit = dt match { + case at: ArrayType => +processDataType(at.elementType, path + ".element") + + case mt: MapType => +processDataType(mt.keyType, path + ".key") +processDataType(mt.valueType, path + ".value") + + case st: StringType if isCollatedString(st) => +fieldToCollationMap(path) = schemaCollationValue(st) + + case _ => +} + +def processDataType(dt: DataType, path: String): Unit = { + if (isCollatedString(dt)) { +fieldToCollationMap(path) = schemaCollationValue(dt) + } else { +visitRecursively(dt, path) + } +} + +visitRecursively(dataType, name) +fieldToCollationMap.toMap + } + + private def isCollatedString(dt: DataType): Boolean = dt match { +case st: StringType => !st.isUTF8BinaryCollation +case _ => false + } + + private def schemaCollationValue(dt: DataType): String = dt match { +case st: StringType => + val collation = CollationFactory.fetchCollation(st.collationId) + collation.identifier().toStringWithoutVersion() +case _ => + throw new IllegalStateException(s"Unexpected data type $dt") Review Comment: we should use `SparkException.internalError` -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-48175][SQL][PYTHON] Store collation information in metadata and not in type for SER/DE [spark]
cloud-fan commented on code in PR #46280: URL: https://github.com/apache/spark/pull/46280#discussion_r1603399289 ## sql/api/src/main/scala/org/apache/spark/sql/types/DataType.scala: ## @@ -208,22 +206,35 @@ object DataType { } // NOTE: Map fields must be sorted in alphabetical order to keep consistent with the Python side. - private[sql] def parseDataType(json: JValue): DataType = json match { + private[sql] def parseDataType( + json: JValue, + fieldPath: String = "", + collationsMap: Map[String, String] = Map.empty): DataType = json match { case JString(name) => - nameToType(name) + collationsMap.get(fieldPath) match { +case Some(collation) => + assertValidTypeForCollations(fieldPath, name, collationsMap) + stringTypeWithCollation(collation) +case _ => nameToType(name) + } case JSortedObject( ("containsNull", JBool(n)), ("elementType", t: JValue), ("type", JString("array"))) => - ArrayType(parseDataType(t), n) + assertValidTypeForCollations(fieldPath, "array", collationsMap) + val elementType = parseDataTypeWithCollation(t, fieldPath + ".element", collationsMap) + ArrayType(elementType, n) case JSortedObject( ("keyType", k: JValue), ("type", JString("map")), ("valueContainsNull", JBool(n)), ("valueType", v: JValue)) => - MapType(parseDataType(k), parseDataType(v), n) + assertValidTypeForCollations(fieldPath, "map", collationsMap) + val keyType = parseDataTypeWithCollation(k, fieldPath + ".key", collationsMap) + val valueType = parseDataTypeWithCollation(v, fieldPath + ".value", collationsMap) Review Comment: why do we need `parseDataTypeWithCollation`? It looks correct to just call `parseDataType`. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-48175][SQL][PYTHON] Store collation information in metadata and not in type for SER/DE [spark]
cloud-fan commented on code in PR #46280: URL: https://github.com/apache/spark/pull/46280#discussion_r1603393035 ## sql/api/src/main/scala/org/apache/spark/sql/types/DataType.scala: ## @@ -208,22 +206,35 @@ object DataType { } // NOTE: Map fields must be sorted in alphabetical order to keep consistent with the Python side. - private[sql] def parseDataType(json: JValue): DataType = json match { + private[sql] def parseDataType( + json: JValue, + fieldPath: String = "", + collationsMap: Map[String, String] = Map.empty): DataType = json match { case JString(name) => - nameToType(name) + collationsMap.get(fieldPath) match { +case Some(collation) => + assertValidTypeForCollations(fieldPath, name, collationsMap) + stringTypeWithCollation(collation) +case _ => nameToType(name) + } case JSortedObject( ("containsNull", JBool(n)), ("elementType", t: JValue), ("type", JString("array"))) => - ArrayType(parseDataType(t), n) + assertValidTypeForCollations(fieldPath, "array", collationsMap) + val elementType = parseDataTypeWithCollation(t, fieldPath + ".element", collationsMap) + ArrayType(elementType, n) case JSortedObject( ("keyType", k: JValue), ("type", JString("map")), ("valueContainsNull", JBool(n)), ("valueType", v: JValue)) => - MapType(parseDataType(k), parseDataType(v), n) + assertValidTypeForCollations(fieldPath, "map", collationsMap) Review Comment: shall we do this assert in the StructType branch as well? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-48175][SQL][PYTHON] Store collation information in metadata and not in type for SER/DE [spark]
cloud-fan commented on code in PR #46280: URL: https://github.com/apache/spark/pull/46280#discussion_r1603384345 ## common/unsafe/src/main/java/org/apache/spark/sql/catalyst/util/CollationFactory.java: ## @@ -110,6 +158,8 @@ public Collation( // No Collation can simultaneously support binary equality and lowercase equality assert(!supportsBinaryEquality || !supportsLowercaseEquality); + assert(SUPPORTED_PROVIDERS.contains(provider) || provider == null); Review Comment: when `provider` can be null? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-48175][SQL][PYTHON] Store collation information in metadata and not in type for SER/DE [spark]
cloud-fan commented on code in PR #46280: URL: https://github.com/apache/spark/pull/46280#discussion_r1603382702 ## common/unsafe/src/main/java/org/apache/spark/sql/catalyst/util/CollationFactory.java: ## @@ -36,11 +36,45 @@ * Provides functionality to the UTF8String object which respects defined collation settings. */ public final class CollationFactory { + + /** + * Identifier for single a collation. + */ + public static class CollationIdentifier { +public final String provider; +public final String name; +public final String version; Review Comment: attributes are not optional... It's a bad practice to leave attributes un-initialized, which is null for reference type. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [WIP][SPARK-48000][SQL] Enable hash join support for all collations (StringType) [spark]
dbatomic commented on code in PR #46599: URL: https://github.com/apache/spark/pull/46599#discussion_r1603381135 ## sql/core/src/test/scala/org/apache/spark/sql/CollationSuite.scala: ## @@ -1051,6 +1052,153 @@ class CollationSuite extends DatasourceV2SQLBase with AdaptiveSparkPlanHelper { } } + test("CollationKey generates correct collation key for string") { Review Comment: I think that these tests should be in `CollationExpressionSuite`. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [WIP][SPARK-48000][SQL] Enable hash join support for all collations (StringType) [spark]
dbatomic commented on code in PR #46599: URL: https://github.com/apache/spark/pull/46599#discussion_r1603377917 ## sql/core/src/test/scala/org/apache/spark/sql/CollationSuite.scala: ## @@ -784,19 +785,19 @@ class CollationSuite extends DatasourceV2SQLBase with AdaptiveSparkPlanHelper { case _: BroadcastHashJoinExec => () }.nonEmpty) -// Even with hint broadcast, hash join is not used for non-binary collated strings. +// Hash join is also used for non-binary collated strings. Review Comment: We should assert that collationkey was injected into physical plan. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [WIP][SPARK-48000][SQL] Enable hash join support for all collations (StringType) [spark]
dbatomic commented on code in PR #46599: URL: https://github.com/apache/spark/pull/46599#discussion_r1603376505 ## sql/core/src/test/scala/org/apache/spark/sql/CollationSuite.scala: ## @@ -784,19 +785,19 @@ class CollationSuite extends DatasourceV2SQLBase with AdaptiveSparkPlanHelper { case _: BroadcastHashJoinExec => () }.nonEmpty) -// Even with hint broadcast, hash join is not used for non-binary collated strings. +// Hash join is also used for non-binary collated strings. Review Comment: I think that you should rename the test. It's currently called `test("hash based joins not allowed for non-binary collated strings")` which is no longer true. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [WIP][SPARK-48000][SQL] Enable hash join support for all collations (StringType) [spark]
dbatomic commented on code in PR #46599: URL: https://github.com/apache/spark/pull/46599#discussion_r1603374449 ## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/joins.scala: ## @@ -397,7 +398,11 @@ trait JoinSelectionHelper extends Logging { protected def hashJoinSupported (leftKeys: Seq[Expression], rightKeys: Seq[Expression]): Boolean = { -val result = leftKeys.concat(rightKeys).forall(e => UnsafeRowUtils.isBinaryStable(e.dataType)) +def hasStableCollationKey(e: Expression): Boolean = + e.dataType.isInstanceOf[StringType] || e.dataType.isInstanceOf[BinaryType] +def isHashJoinSupported(e: Expression): Boolean = UnsafeRowUtils.isBinaryStable(e.dataType) || Review Comment: Why do you need this `hasStableCollationKey`? `isBinaryStable` will return true for binary, so things should work just work, right? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [WIP][SPARK-48000][SQL] Enable hash join support for all collations (StringType) [spark]
dbatomic commented on code in PR #46599: URL: https://github.com/apache/spark/pull/46599#discussion_r1603367540 ## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/CollationKey.scala: ## @@ -0,0 +1,59 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.sql.catalyst.expressions + +import org.apache.spark.sql.catalyst.expressions.codegen.{CodegenContext, ExprCode} +import org.apache.spark.sql.catalyst.util.CollationFactory +import org.apache.spark.sql.internal.types.StringTypeAnyCollation +import org.apache.spark.sql.types._ +import org.apache.spark.unsafe.types.UTF8String + +case class CollationKey(expr: Expression) extends UnaryExpression with ExpectsInputTypes { + override def inputTypes: Seq[AbstractDataType] = Seq(StringTypeAnyCollation) + override def dataType: DataType = BinaryType + + final lazy val collationId: Int = expr.dataType match { +case st: StringType => + st.collationId + } + + override def nullSafeEval(input: Any): Any = input match { +case str: UTF8String => + CollationFactory.getCollationKeyBytes(str, collationId) +case _ => + None + } + + override protected def doGenCode(ctx: CodegenContext, ev: ExprCode): ExprCode = { +val collation = CollationFactory.fetchCollation(collationId) +if (collation.supportsBinaryEquality) { + defineCodeGen(ctx, ev, c => s"$c.getBytes()") +} else if (collation.supportsLowercaseEquality) { + defineCodeGen(ctx, ev, c => s"$c.toLowerCase().getBytes()") +} else { + defineCodeGen(ctx, ev, c => s"CollationFactory.fetchCollation" + Review Comment: You can use `getCollationKeyBytes` here? Also, IMO, you can just remove `getCollationKeyBytes` from collation factory and add the logic into `nullSafeEval`. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-48213][SQL] Do not push down predicate if non-cheap expression exceed reused limit [spark]
cloud-fan commented on PR #46499: URL: https://github.com/apache/spark/pull/46499#issuecomment-2115258124 I think https://github.com/apache/spark/pull/45802#issuecomment-2101762336 is a better idea. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-48305][SQL] Add collation support for CurrentLike expressions [spark]
uros-db commented on PR #46613: URL: https://github.com/apache/spark/pull/46613#issuecomment-2115241390 @cloud-fan ready for review -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-48301][SQL] Rename `CREATE_FUNC_WITH_IF_NOT_EXISTS_AND_REPLACE` to `CREATE_ROUTINE_WITH_IF_NOT_EXISTS_AND_REPLACE` [spark]
zhengruifeng commented on PR #46608: URL: https://github.com/apache/spark/pull/46608#issuecomment-2115184716 merged to master -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-48301][SQL] Rename `CREATE_FUNC_WITH_IF_NOT_EXISTS_AND_REPLACE` to `CREATE_ROUTINE_WITH_IF_NOT_EXISTS_AND_REPLACE` [spark]
zhengruifeng closed pull request #46608: [SPARK-48301][SQL] Rename `CREATE_FUNC_WITH_IF_NOT_EXISTS_AND_REPLACE` to `CREATE_ROUTINE_WITH_IF_NOT_EXISTS_AND_REPLACE` URL: https://github.com/apache/spark/pull/46608 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org