date:20240516

Re: [PR] [WIP][SPARK-48221][SQL] Alter string search logic for UTF8_BINARY_LCASE collation (Contains, StartsWith, EndsWith, StringLocate) [spark]

2024-05-16 Thread via GitHub



uros-db commented on code in PR #46511:
URL: https://github.com/apache/spark/pull/46511#discussion_r1604437739


##
common/unsafe/src/main/java/org/apache/spark/sql/catalyst/util/CollationAwareUTF8String.java:
##
@@ -34,6 +34,143 @@
  * Utility class for collation-aware UTF8String operations.
  */
 public class CollationAwareUTF8String {
+
+  /**
+   * The constant value to indicate that the match is not found
+   * when searching for a pattern string in a target string.
+   */
+  private static final int MATCH_NOT_FOUND = -1;
+
+  /**
+   * Returns whether the target string starts with the specified prefix,
+   * with respect to the UTF8_BINARY_LCASE collation. The method assumes
+   * that the prefix is already lowercased prior to method call to avoid the
+   * overhead of calling .toLowerCase() multiple times on the same prefix 
string.
+   *
+   * @param target the string to be searched in
+   * @param lowercasePattern the string to be searched for
+   * @param startPos the start position for searching (in the target string)
+   * @return whether the target string starts with the specified prefix in 
UTF8_BINARY_LCASE
+   */
+  public static boolean lowercaseMatchFrom(
+  final UTF8String target,
+  final UTF8String lowercasePattern,
+  int startPos) {
+return lowercaseMatchLengthFrom(target, lowercasePattern, startPos) != 
MATCH_NOT_FOUND;
+  }
+
+  /**
+   * Returns the length of the substring of the target string that starts with
+   * the specified prefix, with respect to the UTF8_BINARY_LCASE collation.
+   * The method assumes that the prefix is already lowercased. The method only
+   * considers the part of target string that starts from the specified 
position.
+   *
+   * @param target the string to be searched in
+   * @param lowercasePattern the string to be searched for
+   * @param startPos the end position for searching (in the target string)
+   * @return length of the target substring that ends with the specified 
suffix in lowercase
+   */
+  public static int lowercaseMatchLengthFrom(
+  final UTF8String target,
+  final UTF8String lowercasePattern,
+  int startPos) {
+assert startPos >= 0;
+for (int len = 0; len <= target.numChars() - startPos; ++len) {
+  if (target.substring(startPos, startPos + 
len).toLowerCase().equals(lowercasePattern)) {
+return len;
+  }
+}

Review Comment:
   Yes, we'll address perf optimizations separately - so I'll create a ticket 
for that



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

Re: [PR] [SPARK-46841][SQL] Add collation support for ICU locales and collation specifiers [spark]

2024-05-16 Thread via GitHub



nikolamand-db commented on PR #46180:
URL: https://github.com/apache/spark/pull/46180#issuecomment-2116855540

   @mkaravel @dbatomic please review again, thanks.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

Re: [PR] [SPARK-48303][CORE] Reorganize `LogKeys` [spark]

2024-05-16 Thread via GitHub



gengliangwang closed pull request #46612: [SPARK-48303][CORE] Reorganize 
`LogKeys`
URL: https://github.com/apache/spark/pull/46612


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

Re: [PR] [SPARK-48303][CORE] Reorganize `LogKeys` [spark]

2024-05-16 Thread via GitHub



gengliangwang commented on PR #46612:
URL: https://github.com/apache/spark/pull/46612#issuecomment-2116827152

   Thanks for the improvement! Merging to master


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

Re: [PR] [SPARK-48317][PYTHON][CONNECT][TESTS] Enable `test_udtf_with_analyze_using_archive` and `test_udtf_with_analyze_using_file` [spark]

2024-05-16 Thread via GitHub



HyukjinKwon closed pull request #46632: [SPARK-48317][PYTHON][CONNECT][TESTS] 
Enable `test_udtf_with_analyze_using_archive` and 
`test_udtf_with_analyze_using_file`
URL: https://github.com/apache/spark/pull/46632


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

Re: [PR] [SPARK-48317][PYTHON][CONNECT][TESTS] Enable `test_udtf_with_analyze_using_archive` and `test_udtf_with_analyze_using_file` [spark]

2024-05-16 Thread via GitHub



HyukjinKwon commented on PR #46632:
URL: https://github.com/apache/spark/pull/46632#issuecomment-2116752726

   Merged to master.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[PR] [SPARK-48319][PYTHON][CONNECT][TESTS] Test `assert_true` and `raise_error` with the same error class as Spark Classic [spark]

2024-05-16 Thread via GitHub



zhengruifeng opened a new pull request, #46633:
URL: https://github.com/apache/spark/pull/46633

   ### What changes were proposed in this pull request?
   Test `assert_true` and `raise_error` with the same error class as Spark 
Classic
   
   
   ### Why are the changes needed?
   
https://github.com/apache/spark/commit/578931678f5a6d6b33ebdae4bf866871e46fbb83 
made `assert_true` and `raise_error` in Spark Connect throw 
`SparkRuntimeException`, then the error is the same as Spark Classic
   
   
   ### Does this PR introduce _any_ user-facing change?
   no, test only
   
   
   ### How was this patch tested?
   ci
   
   
   ### Was this patch authored or co-authored using generative AI tooling?
   no


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

Re: [PR] [WIP][SPARK-48000][SQL] Enable hash join support for all collations (StringType) [spark]

2024-05-16 Thread via GitHub



uros-db commented on code in PR #46599:
URL: https://github.com/apache/spark/pull/46599#discussion_r1604352238


##
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/RewriteCollationJoin.scala:
##
@@ -0,0 +1,62 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.catalyst.analysis
+
+import org.apache.spark.sql.catalyst.expressions.{AttributeReference, 
CollationKey, EqualNullSafe, EqualTo, Lower}
+import org.apache.spark.sql.catalyst.plans.logical.{Join, LogicalPlan}
+import org.apache.spark.sql.catalyst.rules.Rule
+import org.apache.spark.sql.catalyst.util.CollationFactory
+import org.apache.spark.sql.types.StringType
+
+object RewriteCollationJoin extends Rule[LogicalPlan] {
+  def apply(plan: LogicalPlan): LogicalPlan = plan transformUpWithNewOutput {
+case j @ Join(_, _, _, Some(condition), _) =>
+  val newCondition = condition transform {
+case EqualTo(l: AttributeReference, r: AttributeReference) =>
+  (l.dataType, r.dataType) match {
+case (st: StringType, _: StringType) =>

Review Comment:
   edit: updated PR title, description, and corresponding Jira ticket title to 
reflect this
   also, created new ticket to handle complex types (ArrayType, StructType, 
etc.) https://issues.apache.org/jira/browse/SPARK-48318



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

Re: [PR] [SPARK-48312][SQL] Improve Alias.removeNonInheritableMetadata performance [spark]

2024-05-16 Thread via GitHub



cloud-fan commented on code in PR #46622:
URL: https://github.com/apache/spark/pull/46622#discussion_r1604351546


##
sql/api/src/main/scala/org/apache/spark/sql/types/Metadata.scala:
##
@@ -49,6 +49,9 @@ sealed class Metadata private[types] (private[types] val map: 
Map[String, Any])
   /** Tests whether this Metadata contains a binding for a key. */
   def contains(key: String): Boolean = map.contains(key)
 
+  /** Tests whether this Metadata is empty. */
+  def isEmpty: Boolean = map.isEmpty

Review Comment:
   +1



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

Re: [PR] [WIP][SPARK-48000][SQL] Enable hash join support for all collations (StringType) [spark]

2024-05-16 Thread via GitHub



uros-db commented on code in PR #46599:
URL: https://github.com/apache/spark/pull/46599#discussion_r1604346112


##
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/RewriteCollationJoin.scala:
##
@@ -0,0 +1,62 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.catalyst.analysis
+
+import org.apache.spark.sql.catalyst.expressions.{AttributeReference, 
CollationKey, EqualNullSafe, EqualTo, Lower}
+import org.apache.spark.sql.catalyst.plans.logical.{Join, LogicalPlan}
+import org.apache.spark.sql.catalyst.rules.Rule
+import org.apache.spark.sql.catalyst.util.CollationFactory
+import org.apache.spark.sql.types.StringType
+
+object RewriteCollationJoin extends Rule[LogicalPlan] {
+  def apply(plan: LogicalPlan): LogicalPlan = plan transformUpWithNewOutput {
+case j @ Join(_, _, _, Some(condition), _) =>
+  val newCondition = condition transform {
+case EqualTo(l: AttributeReference, r: AttributeReference) =>
+  (l.dataType, r.dataType) match {
+case (st: StringType, _: StringType) =>

Review Comment:
   will add support for complex types in a future PR, for now - only String 
should be supported



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

Re: [PR] [MINOR][PYTHON][TESTS] Call `test_apply_schema_to_dict_and_rows` in `test_apply_schema_to_row` [spark]

2024-05-16 Thread via GitHub



HyukjinKwon closed pull request #46631: [MINOR][PYTHON][TESTS] Call 
`test_apply_schema_to_dict_and_rows` in `test_apply_schema_to_row` 
URL: https://github.com/apache/spark/pull/46631


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

Re: [PR] [MINOR][PYTHON][TESTS] Call `test_apply_schema_to_dict_and_rows` in `test_apply_schema_to_row` [spark]

2024-05-16 Thread via GitHub



HyukjinKwon commented on PR #46631:
URL: https://github.com/apache/spark/pull/46631#issuecomment-2116583644

   Merged to master.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

Re: [PR] [SPARK-41625][PYTHON][CONNECT][TESTS][FOLLOW-UP] Enable `DataFrameObservationParityTests.test_observe_str` [spark]

2024-05-16 Thread via GitHub



zhengruifeng commented on PR #46630:
URL: https://github.com/apache/spark/pull/46630#issuecomment-2116579906

   thanks, merged to master


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

Re: [PR] [SPARK-41625][PYTHON][CONNECT][TESTS][FOLLOW-UP] Enable `DataFrameObservationParityTests.test_observe_str` [spark]

2024-05-16 Thread via GitHub



zhengruifeng closed pull request #46630: 
[SPARK-41625][PYTHON][CONNECT][TESTS][FOLLOW-UP] Enable 
`DataFrameObservationParityTests.test_observe_str`
URL: https://github.com/apache/spark/pull/46630


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

Re: [PR] [SPARK-47952][CORE][CONNECT] Support retrieving the real SparkConnectService GRPC address and port programmatically when running on Yarn [spark]

2024-05-16 Thread via GitHub



TakawaAkirayo commented on PR #46182:
URL: https://github.com/apache/spark/pull/46182#issuecomment-2116573625

   Gently ping @grundprinzip if anything else needs to be provided from my side 
:)


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

Re: [PR] [SPARK-48238][BUILD][YARN] Replace AmIpFilter with re-implemented YarnAMIpFilter [spark]

2024-05-16 Thread via GitHub



pan3793 commented on PR #46611:
URL: https://github.com/apache/spark/pull/46611#issuecomment-2116553658

   > ... supposedly the workaround will be removed when the Yarn side upgrades 
their J2EE?
   
   I suppose not. According to the discussion in 
https://github.com/apache/spark/pull/31642 
`org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter` should be part of 
`hadoop-client-api` but actually not, and
   
   > Agree that in the long term we should either: 1) consider to re-implement 
the logic in Spark which allows us to get away from server-side dependency in 
Hadoop ...


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

Re: [PR] [SPARK-48238][BUILD][YARN] Replace AmIpFilter with re-implemented YarnAMIpFilter [spark]

2024-05-16 Thread via GitHub



HiuKwok commented on PR #46611:
URL: https://github.com/apache/spark/pull/46611#issuecomment-2116550519

   The patch makes sense to me, and supposedly the workaround will be removed 
when the Yarn side upgrades their J2EE?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

Re: [PR] [SPARK-48316][PS][CONNECT][TESTS] Fix comments for SparkFrameMethodsParityTests.test_coalesce and test_repartition [spark]

2024-05-16 Thread via GitHub



HyukjinKwon closed pull request #46629: [SPARK-48316][PS][CONNECT][TESTS] Fix 
comments for SparkFrameMethodsParityTests.test_coalesce and test_repartition
URL: https://github.com/apache/spark/pull/46629


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

Re: [PR] [SPARK-48316][PS][CONNECT][TESTS] Fix comments for SparkFrameMethodsParityTests.test_coalesce and test_repartition [spark]

2024-05-16 Thread via GitHub



HyukjinKwon commented on PR #46629:
URL: https://github.com/apache/spark/pull/46629#issuecomment-2116548415

   Merged to master.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

Re: [PR] [SPARK-48258][PYTHON][CONNECT] Checkpoint and localCheckpoint in Spark Connect [spark]

2024-05-16 Thread via GitHub



HyukjinKwon commented on code in PR #46570:
URL: https://github.com/apache/spark/pull/46570#discussion_r1604289643


##
python/pyspark/sql/connect/dataframe.py:
##
@@ -104,6 +107,8 @@
 
 
 class DataFrame(ParentDataFrame):
+_release_thread_pool: Optional[ThreadPool] = ThreadPool(os.cpu_count() if 
os.cpu_count() else 8)

Review Comment:
   Oops



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

Re: [PR] [SPARK-48258][PYTHON][CONNECT] Checkpoint and localCheckpoint in Spark Connect [spark]

2024-05-16 Thread via GitHub



zhengruifeng commented on code in PR #46570:
URL: https://github.com/apache/spark/pull/46570#discussion_r1604267324


##
python/pyspark/sql/connect/dataframe.py:
##
@@ -104,6 +107,8 @@
 
 
 class DataFrame(ParentDataFrame):
+_release_thread_pool: Optional[ThreadPool] = ThreadPool(os.cpu_count() if 
os.cpu_count() else 8)

Review Comment:
   where is this used?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[PR] [SPARK-48317][PYTHON][CONNECT][TESTS] Enable `test_udtf_with_analyze_using_archive` and `test_udtf_with_analyze_using_file` [spark]

2024-05-16 Thread via GitHub



HyukjinKwon opened a new pull request, #46632:
URL: https://github.com/apache/spark/pull/46632

   ### What changes were proposed in this pull request?
   
   This PR proposes to enable the tests `test_udtf_with_analyze_using_archive` 
and `test_udtf_with_analyze_using_file`.
   
   ### Why are the changes needed?
   
   To make sure on the test coverage
   
   ### Does this PR introduce _any_ user-facing change?
   
   No, test-only.
   
   ### How was this patch tested?
   
   CI in this PR.
   
   ### Was this patch authored or co-authored using generative AI tooling?
   
   No.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

Re: [PR] [SPARK-48213][SQL] Do not push down predicate if non-cheap expression exceed reused limit [spark]

2024-05-16 Thread via GitHub



zml1206 commented on PR #46499:
URL: https://github.com/apache/spark/pull/46499#issuecomment-2116500596

   `with` is a good idea, thank you very much @cloud-fan . Close it.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

Re: [PR] [SPARK-48213][SQL] Do not push down predicate if non-cheap expression exceed reused limit [spark]

2024-05-16 Thread via GitHub



zml1206 closed pull request #46499: [SPARK-48213][SQL] Do not push down 
predicate if non-cheap expression exceed reused limit
URL: https://github.com/apache/spark/pull/46499


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

Re: [PR] [SPARK-48306][SQL] Improve UDT in error message [spark]

2024-05-16 Thread via GitHub



yaooqinn commented on PR #46616:
URL: https://github.com/apache/spark/pull/46616#issuecomment-2116498446

   Merged to master. Thank you @HyukjinKwon 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

Re: [PR] [SPARK-48306][SQL] Improve UDT in error message [spark]

2024-05-16 Thread via GitHub



yaooqinn closed pull request #46616: [SPARK-48306][SQL] Improve UDT in error 
message
URL: https://github.com/apache/spark/pull/46616


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

Re: [PR] [SPARK-48113][CONNECT] Allow Plugins to integrate with Spark Connect [spark]

2024-05-16 Thread via GitHub



nchammas commented on PR #46364:
URL: https://github.com/apache/spark/pull/46364#issuecomment-2116496398

   Looks like the docs I am looking for are in #45340.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[PR] [MINOR][PYTHON][TESTS] Call `test_apply_schema_to_dict_and_rows` in `test_apply_schema_to_row` [spark]

2024-05-16 Thread via GitHub



HyukjinKwon opened a new pull request, #46631:
URL: https://github.com/apache/spark/pull/46631

   ### What changes were proposed in this pull request?
   
   This PR fixes the test `test_apply_schema_to_row` to call 
`test_apply_schema_to_row` instead of `test_apply_schema_to_dict_and_rows`. It 
was a mistake.
   
   ### Why are the changes needed?
   
   To avoid a mistake when it's enabled in the future.
   
   ### Does this PR introduce _any_ user-facing change?
   
   No, test-only
   
   ### How was this patch tested?
   
   CI in this PR.
   
   ### Was this patch authored or co-authored using generative AI tooling?
   
   No.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

Re: [PR] [SPARK-48301][SQL][FOLLOWUP] Update the error message [spark]

2024-05-16 Thread via GitHub



zhengruifeng commented on PR #46628:
URL: https://github.com/apache/spark/pull/46628#issuecomment-2116489481

   merged to master


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

Re: [PR] [SPARK-48301][SQL][FOLLOWUP] Update the error message [spark]

2024-05-16 Thread via GitHub



zhengruifeng closed pull request #46628: [SPARK-48301][SQL][FOLLOWUP] Update 
the error message
URL: https://github.com/apache/spark/pull/46628


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[PR] [SPARK-41625][PYTHON][CONNECT][TESTS] Enable `DataFrameObservationParityTests.test_observe_str` [spark]

2024-05-16 Thread via GitHub



HyukjinKwon opened a new pull request, #46630:
URL: https://github.com/apache/spark/pull/46630

   ### What changes were proposed in this pull request?
   
   This PR proposes to enable 
`DataFrameObservationParityTests.test_observe_str`.
   
   ### Why are the changes needed?
   
   To make sure on the test coverage
   
   ### Does this PR introduce _any_ user-facing change?
   
   No, test-only.
   
   ### How was this patch tested?
   
   CI in this PR.
   
   ### Was this patch authored or co-authored using generative AI tooling?
   
   No.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

Re: [PR] [SPARK-48316][PS][CONNECT][TESTS] Enable SparkFrameMethodsParityTests.test_coalesce and test_repartition [spark]

2024-05-16 Thread via GitHub



HyukjinKwon commented on PR #46629:
URL: https://github.com/apache/spark/pull/46629#issuecomment-2116480337

   cc @zhengruifeng 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[PR] [SPARK-48316][PS][CONNECT][TESTS] Enable SparkFrameMethodsParityTests.test_coalesce and test_repartition [spark]

2024-05-16 Thread via GitHub



HyukjinKwon opened a new pull request, #46629:
URL: https://github.com/apache/spark/pull/46629

   ### What changes were proposed in this pull request?
   
   This PR proposes to enable `SparkFrameMethodsParityTests.test_coalesce` and 
`SparkFrameMethodsParityTests.test_repartition` in Spark Connect by avoiding 
RDD usage in the test.
   
   ### Why are the changes needed?
   
   To make sure on the test coverage
   
   ### Does this PR introduce _any_ user-facing change?
   
   No, test-only.
   
   ### How was this patch tested?
   
   CI in this PR.
   
   ### Was this patch authored or co-authored using generative AI tooling?
   
   No.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

Re: [PR] [SPARK-48251][BUILD] Disable `maven local cache` on GA's step `MIMA test` [spark]

2024-05-16 Thread via GitHub



panbingkun commented on PR #46551:
URL: https://github.com/apache/spark/pull/46551#issuecomment-2116479316

   1.with `maven local cache`
   https://github.com/panbingkun/spark/actions/runs/9109019204
   https://github.com/apache/spark/assets/15246973/439b6397-21bb-427d-a740-481755b54a02";>
   
   2.without `maven local cache`
   https://github.com/panbingkun/spark/actions/runs/9107211572
   https://github.com/apache/spark/assets/15246973/0238e450-5e1d-4405-afec-8fd086760b82";>
   
   https://github.com/panbingkun/spark/actions/runs/9114668535
   https://github.com/apache/spark/assets/15246973/d0f5ae74-06df-44d4-8388-c5adfe02ce2a";>
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

Re: [PR] [SPARK-48310][PYTHON][CONNECT] Cached properties must return copies [spark]

2024-05-16 Thread via GitHub



HyukjinKwon closed pull request #46621: [SPARK-48310][PYTHON][CONNECT] Cached 
properties must return copies
URL: https://github.com/apache/spark/pull/46621


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

Re: [PR] [SPARK-48310][PYTHON][CONNECT] Cached properties must return copies [spark]

2024-05-16 Thread via GitHub



HyukjinKwon commented on PR #46621:
URL: https://github.com/apache/spark/pull/46621#issuecomment-2116465949

   Merged to master.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

Re: [PR] [SQL][SPARK-48312] Improve Alias.removeNonInheritableMetadata performance [spark]

2024-05-16 Thread via GitHub



HyukjinKwon commented on code in PR #46622:
URL: https://github.com/apache/spark/pull/46622#discussion_r1604233127


##
sql/api/src/main/scala/org/apache/spark/sql/types/Metadata.scala:
##
@@ -49,6 +49,9 @@ sealed class Metadata private[types] (private[types] val map: 
Map[String, Any])
   /** Tests whether this Metadata contains a binding for a key. */
   def contains(key: String): Boolean = map.contains(key)
 
+  /** Tests whether this Metadata is empty. */
+  def isEmpty: Boolean = map.isEmpty

Review Comment:
   Can you add `@since 4.0.0`



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

Re: [PR] [SPARK-43815][SQL] Wrap NPE with AnalysisException in CSV options [spark]

2024-05-16 Thread via GitHub



HyukjinKwon commented on code in PR #46626:
URL: https://github.com/apache/spark/pull/46626#discussion_r1604232424


##
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/CSVOptions.scala:
##
@@ -149,7 +149,12 @@ class CSVOptions(
 parameters.getOrElse(DateTimeUtils.TIMEZONE_OPTION, defaultTimeZoneId))
 
   // A language tag in IETF BCP 47 format
-  val locale: Locale = 
parameters.get(LOCALE).map(Locale.forLanguageTag).getOrElse(Locale.US)
+  val locale: Locale = try {
+parameters.get(LOCALE).map(Locale.forLanguageTag).getOrElse(Locale.US)
+  } catch {
+case _: NullPointerException =>

Review Comment:
   and this issue should exist in almost all other options. Would probably be 
best to fix them together



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

Re: [PR] [SPARK-43815][SQL] Wrap NPE with AnalysisException in CSV options [spark]

2024-05-16 Thread via GitHub



HyukjinKwon commented on code in PR #46626:
URL: https://github.com/apache/spark/pull/46626#discussion_r1604232182


##
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/CSVOptions.scala:
##
@@ -149,7 +149,12 @@ class CSVOptions(
 parameters.getOrElse(DateTimeUtils.TIMEZONE_OPTION, defaultTimeZoneId))
 
   // A language tag in IETF BCP 47 format
-  val locale: Locale = 
parameters.get(LOCALE).map(Locale.forLanguageTag).getOrElse(Locale.US)
+  val locale: Locale = try {
+parameters.get(LOCALE).map(Locale.forLanguageTag).getOrElse(Locale.US)
+  } catch {
+case _: NullPointerException =>

Review Comment:
   Can we check if the value is `null`, and throw an exception instead?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

Re: [PR] [SPARK-48303][CORE] Reorganize `LogKeys` [spark]

2024-05-16 Thread via GitHub



panbingkun commented on PR #46612:
URL: https://github.com/apache/spark/pull/46612#issuecomment-2116460422

   cc @gengliangwang 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

Re: [PR] [SPARK-48314] Don't double cache files for FileStreamSource using Trigger.AvailableNow [spark]

2024-05-16 Thread via GitHub



Kimahriman commented on PR #46627:
URL: https://github.com/apache/spark/pull/46627#issuecomment-2116427914

   @HeartSaVioR since you added the file caching originally back in the day


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[PR] [SPARK-48314] Don't double cache files for FileStreamSource using Trigger.AvailableNow [spark]

2024-05-16 Thread via GitHub



Kimahriman opened a new pull request, #46627:
URL: https://github.com/apache/spark/pull/46627

   
   
   ### What changes were proposed in this pull request?
   
   Files don't need to be cached for reuse in `FileStreamSource` when using 
`Trigger.AvailableNow` because all files are already cached for the lifetime of 
the query in `allFilesForTriggerAvailableNow`.
   
   ### Why are the changes needed?
   
   As reported in https://issues.apache.org/jira/browse/SPARK-44924 (with a PR 
to address https://github.com/apache/spark/pull/45362), the hard coded cap of 
10k files being cached can cause problems when using a maxFilesPerTrigger > 
10k. It causes every other batch to be 10k files, which can greatly limit the 
throughput of a new streaming trying to catch up.
   
   
   ### Does this PR introduce _any_ user-facing change?
   
   Every other streaming batch won't be 10k files if using Trigger.AvailableNow 
and maxFilesPerTrigger greater than 10k.
   
   ### How was this patch tested?
   
   New UT
   
   ### Was this patch authored or co-authored using generative AI tooling?
   
   No


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

Re: [PR] [SPARK-48301][SQL] Rename `CREATE_FUNC_WITH_IF_NOT_EXISTS_AND_REPLACE` to `CREATE_ROUTINE_WITH_IF_NOT_EXISTS_AND_REPLACE` [spark]

2024-05-16 Thread via GitHub



zhengruifeng commented on code in PR #46608:
URL: https://github.com/apache/spark/pull/46608#discussion_r1604205064


##
common/utils/src/main/resources/error/error-conditions.json:
##
@@ -2675,9 +2675,9 @@
   "ANALYZE TABLE(S) ... COMPUTE STATISTICS ...  must be either 
NOSCAN or empty."
 ]
   },
-  "CREATE_FUNC_WITH_IF_NOT_EXISTS_AND_REPLACE" : {
+  "CREATE_ROUTINE_WITH_IF_NOT_EXISTS_AND_REPLACE" : {
 "message" : [
-  "CREATE FUNCTION with both IF NOT EXISTS and REPLACE is not allowed."
+  "CREATE PROCEDURE or CREATE FUNCTION with both IF NOT EXISTS and 
REPLACE is not allowed."

Review Comment:
   sounds good, we don't have PROCEDURE in OSS, will update in a followup PR



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

Re: [PR] [WIP][Spark 44646] Reduce usage of log4j core [spark]

2024-05-16 Thread via GitHub



github-actions[bot] closed pull request #45001: [WIP][Spark 44646] Reduce usage 
of log4j core
URL: https://github.com/apache/spark/pull/45001


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

Re: [PR] [SPARK-42789][SQL] Rewrite multiple GetJsonObject that consumes same JSON to single JsonTuple [spark]

2024-05-16 Thread via GitHub



github-actions[bot] closed pull request #45020: [SPARK-42789][SQL] Rewrite 
multiple GetJsonObject that consumes same JSON to single JsonTuple
URL: https://github.com/apache/spark/pull/45020


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

Re: [PR] [SPARK-48268][CORE] Add a configuration for SparkContext.setCheckpointDir [spark]

2024-05-16 Thread via GitHub



HyukjinKwon closed pull request #46571: [SPARK-48268][CORE] Add a configuration 
for SparkContext.setCheckpointDir
URL: https://github.com/apache/spark/pull/46571


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

Re: [PR] [SPARK-48268][CORE] Add a configuration for SparkContext.setCheckpointDir [spark]

2024-05-16 Thread via GitHub



HyukjinKwon commented on PR #46571:
URL: https://github.com/apache/spark/pull/46571#issuecomment-2116375998

   Merged to master.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

Re: [PR] [SPARK-48031][SQL][FOLLOW-UP] Use ANSI-enabled cast in view lookup test [spark]

2024-05-16 Thread via GitHub



HyukjinKwon closed pull request #46614: [SPARK-48031][SQL][FOLLOW-UP] Use 
ANSI-enabled cast in view lookup test
URL: https://github.com/apache/spark/pull/46614


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

Re: [PR] [SPARK-48031][SQL][FOLLOW-UP] Use ANSI-enabled cast in view lookup test [spark]

2024-05-16 Thread via GitHub



HyukjinKwon commented on PR #46614:
URL: https://github.com/apache/spark/pull/46614#issuecomment-2116375642

   Merged to master.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

Re: [PR] Bump rexml from 3.2.6 to 3.2.8 in /docs [spark]

2024-05-16 Thread via GitHub



dependabot[bot] commented on PR #46625:
URL: https://github.com/apache/spark/pull/46625#issuecomment-2116374205

   OK, I won't notify you again about this release, but will get in touch when 
a new version is available. If you'd rather skip all updates until the next 
major or minor version, let me know by commenting `@dependabot ignore this 
major version` or `@dependabot ignore this minor version`.
   
   If you change your mind, just re-open this PR and I'll resolve any conflicts 
on it.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

Re: [PR] Bump rexml from 3.2.6 to 3.2.8 in /docs [spark]

2024-05-16 Thread via GitHub



HyukjinKwon closed pull request #46625: Bump rexml from 3.2.6 to 3.2.8 in /docs
URL: https://github.com/apache/spark/pull/46625


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[PR] [SPARK-43815] Wrap NPE with AnalysisException in csv options [spark]

2024-05-16 Thread via GitHub



michaelzhan-db opened a new pull request, #46626:
URL: https://github.com/apache/spark/pull/46626

   
   
   ### What changes were proposed in this pull request?
   
   When user sets `locale` to be `null`, a NPE is raised. Instead, replace the 
NPE with a more understandable user facing error message.
   
   ### Why are the changes needed?
   
   Improves usability of csv options.
   
   ### Does this PR introduce _any_ user-facing change?
   
   No.
   
   ### How was this patch tested?
   
   Added tests pass.
   
   ### Was this patch authored or co-authored using generative AI tooling?
   
   No
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

Re: [PR] [SPARK-48294][SQL] Handle lowercase in nestedTypeMissingElementTypeError [spark]

2024-05-16 Thread via GitHub



gengliangwang commented on PR #46623:
URL: https://github.com/apache/spark/pull/46623#issuecomment-2116274470

   @michaelzhan-db there are merge conflicts against branch-3.5. Please create 
a new PR for the backport.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

Re: [PR] [SPARK-48294][SQL] Handle lowercase in nestedTypeMissingElementTypeError [spark]

2024-05-16 Thread via GitHub



gengliangwang closed pull request #46623: [SPARK-48294][SQL] Handle lowercase 
in nestedTypeMissingElementTypeError
URL: https://github.com/apache/spark/pull/46623


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

Re: [PR] [SPARK-48294][SQL] Handle lowercase in nestedTypeMissingElementTypeError [spark]

2024-05-16 Thread via GitHub



gengliangwang commented on PR #46623:
URL: https://github.com/apache/spark/pull/46623#issuecomment-2116272913

   Thanks, merging to master and branch 3.5


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[PR] Bump rexml from 3.2.6 to 3.2.8 in /docs [spark]

2024-05-16 Thread via GitHub



dependabot[bot] opened a new pull request, #46625:
URL: https://github.com/apache/spark/pull/46625

   Bumps [rexml](https://github.com/ruby/rexml) from 3.2.6 to 3.2.8.
   
   Release notes
   Sourced from https://github.com/ruby/rexml/releases";>rexml's 
releases.
   
   REXML 3.2.8 - 2024-05-16
   Fixes
   
   Suppressed a warning
   
   REXML 3.2.7 - 2024-05-16
   Improvements
   
   
   Improve parse performance by using StringScanner.
   
   
   https://redirect.github.com/ruby/rexml/issues/106";>GH-106
   
   
   https://redirect.github.com/ruby/rexml/issues/107";>GH-107
   
   
   https://redirect.github.com/ruby/rexml/issues/108";>GH-108
   
   
   https://redirect.github.com/ruby/rexml/issues/109";>GH-109
   
   
   https://redirect.github.com/ruby/rexml/issues/112";>GH-112
   
   
   https://redirect.github.com/ruby/rexml/issues/113";>GH-113
   
   
   https://redirect.github.com/ruby/rexml/issues/114";>GH-114
   
   
   https://redirect.github.com/ruby/rexml/issues/115";>GH-115
   
   
   https://redirect.github.com/ruby/rexml/issues/116";>GH-116
   
   
   https://redirect.github.com/ruby/rexml/issues/117";>GH-117
   
   
   https://redirect.github.com/ruby/rexml/issues/118";>GH-118
   
   
   https://redirect.github.com/ruby/rexml/issues/119";>GH-119
   
   
   https://redirect.github.com/ruby/rexml/issues/121";>GH-121
   
   
   Patch by NAITOH Jun.
   
   
   
   
   Improved parse performance when an attribute has many 
GH-124
   
   
   
   Fixes
   
   
   XPath: Fixed a bug of normalize_space(array).
   
   
   https://redirect.github.com/ruby/rexml/issues/110";>GH-110
   
   
   https://redirect.github.com/ruby/rexml/issues/111";>GH-111
   
   
   Patch by flatisland.
   
   
   
   
   XPath: Fixed a bug that wrong position is used with nested path.
   
   
   https://redirect.github.com/ruby/rexml/issues/110";>GH-110
   
   
   https://redirect.github.com/ruby/rexml/issues/122";>GH-122
   
   
   Reported by jcavalieri.
   
   
   Patch by NAITOH Jun.
   
   
   
   
   Fixed a bug that an exception message can't be generated for
   invalid encoding XML.
   
   
   
   
   ... (truncated)
   
   
   Changelog
   Sourced from https://github.com/ruby/rexml/blob/master/NEWS.md";>rexml's 
changelog.
   
   3.2.8 - 2024-05-16 {#version-3-2-8}
   Fixes
   
   Suppressed a warning
   
   3.2.7 - 2024-05-16 {#version-3-2-7}
   Improvements
   
   
   Improve parse performance by using StringScanner.
   
   
   https://redirect.github.com/ruby/rexml/issues/106";>GH-106
   
   
   https://redirect.github.com/ruby/rexml/issues/107";>GH-107
   
   
   https://redirect.github.com/ruby/rexml/issues/108";>GH-108
   
   
   https://redirect.github.com/ruby/rexml/issues/109";>GH-109
   
   
   https://redirect.github.com/ruby/rexml/issues/112";>GH-112
   
   
   https://redirect.github.com/ruby/rexml/issues/113";>GH-113
   
   
   https://redirect.github.com/ruby/rexml/issues/114";>GH-114
   
   
   https://redirect.github.com/ruby/rexml/issues/115";>GH-115
   
   
   https://redirect.github.com/ruby/rexml/issues/116";>GH-116
   
   
   https://redirect.github.com/ruby/rexml/issues/117";>GH-117
   
   
   https://redirect.github.com/ruby/rexml/issues/118";>GH-118
   
   
   https://redirect.github.com/ruby/rexml/issues/119";>GH-119
   
   
   https://redirect.github.com/ruby/rexml/issues/121";>GH-121
   
   
   Patch by NAITOH Jun.
   
   
   
   
   Improved parse performance when an attribute has many 
GH-124
   
   
   
   Fixes
   
   
   XPath: Fixed a bug of normalize_space(array).
   
   
   https://redirect.github.com/ruby/rexml/issues/110";>GH-110
   
   
   https://redirect.github.com/ruby/rexml/issues/111";>GH-111
   
   
   Patch by flatisland.
   
   
   
   
   XPath: Fixed a bug that wrong position is used with nested path.
   
   
   https://redirect.github.com/ruby/rexml/issues/110";>GH-110
   
   
   https://redirect.github.com/ruby/rexml/issues/122";>GH-122
   
   
   Reported by jcavalieri.
   
   
   Patch by NAITOH Jun.
   
   
   
   
   Fixed a bug that an exception message can't be generated for
   
   
   
   
   ... (truncated)
   
   
   Commits
   
   https://github.com/ruby/rexml/commit/1cf37bab79d61d6183bbda8bf525ed587012b718";>1cf37ba
 Add 3.2.8 entry
   https://github.com/ruby/rexml/commit/b67081caa807fad48d31983137b7ed8711e7f0df";>b67081c
 Remove an unused variable (https://redirect.github.com/ruby/rexml/issues/128";>#128)
   https://github.com/ruby/rexml/commit/94e180e939baff8f7e328a287bb96ebbd99db6eb";>94e180e
 Suppress a warning
   https://github.com/ruby/rexml/commit/d574ba5fe1c40adbafbf16e47533f4eb32b43e60";>d574ba5
 ci: install only gems required for running tests (https://redirect.github.com/ruby/rexml/issues/129";>#129)
   https://github.com/ruby/rexml/commit/4670f8fc187c89d0504d027ea997959287143453";>4670f8f
 Add missing Thanks secti

Re: [PR] [SPARK-47920][DOCS][SS][PYTHON] Add doc for python streaming data source API [spark]

2024-05-16 Thread via GitHub



chaoqin-li1123 commented on code in PR #46139:
URL: https://github.com/apache/spark/pull/46139#discussion_r1603979413


##
python/docs/source/user_guide/sql/python_data_source.rst:
##
@@ -59,8 +59,17 @@ Start by creating a new subclass of :class:`DataSource`. 
Define the source name,
 def reader(self, schema: StructType):
 return FakeDataSourceReader(schema, self.options)
 
+def streamReader(self, schema: StructType):

Review Comment:
   I prefer not to duplicate the DataSource code. We already document that 
developer only need to implement the corresponding method for a certain 
capacity.



##
python/docs/source/user_guide/sql/python_data_source.rst:
##
@@ -33,9 +33,15 @@ To create a custom Python data source, you'll need to 
subclass the :class:`DataS
 
 This example demonstrates creating a simple data source to generate synthetic 
data using the `faker` library. Ensure the `faker` library is installed and 
accessible in your Python environment.
 
-**Step 1: Define the Data Source**
+**Define the Data Source**
 
-Start by creating a new subclass of :class:`DataSource`. Define the source 
name, schema, and reader logic as follows:
+Start by creating a new subclass of :class:`DataSource` with the source name, 
schema.
+
+In order to read from the data source in a batch query, reader() method need 
to be defined.
+
+In order to read from the data source in a streaming query, streamReader() or 
simpleStreamReader() method need to be defined.
+
+In order to write to the data source in a streaming query, streamWriter() 
method need to be defined.

Review Comment:
   Table added.



##
python/docs/source/user_guide/sql/python_data_source.rst:
##
@@ -84,9 +101,158 @@ Define the reader logic to generate synthetic data. Use 
the `faker` library to p
 row.append(value)
 yield tuple(row)
 
+Implementing Streaming Reader and Writer for Python Data Source
+---
+**Implement the Stream Reader**
+
+This is a dummy streaming data reader that generate 2 rows in every 
microbatch. The streamReader instance has a integer offset that increase by 2 
in every microbatch.
+
+.. code-block:: python
+
+class RangePartition(InputPartition):
+def __init__(self, start, end):
+self.start = start
+self.end = end
+
+class FakeStreamReader(DataSourceStreamReader):
+def __init__(self, schema, options):
+self.current = 0
+
+def initialOffset(self) -> dict:
+"""
+Return the initial start offset of the reader.
+"""
+return {"offset": 0}
+
+def latestOffset(self) -> dict:
+"""
+Return the current latest offset that the next microbatch will 
read to.
+"""
+self.current += 2
+return {"offset": self.current}
+
+def partitions(self, start: dict, end: dict):
+"""
+Plans the partitioning of the current microbatch defined by start 
and end offset,
+it needs to return a sequence of :class:`InputPartition` object.
+"""
+return [RangePartition(start["offset"], end["offset"])]
+
+def commit(self, end: dict):
+"""
+This is invoked when the query has finished processing data before 
end offset, this can be used to clean up resource.
+"""
+pass
+
+def read(self, partition) -> Iterator[Tuple]:
+"""
+Takes a partition as an input and read an iterator of tuples from 
the data source.
+"""
+start, end = partition.start, partition.end
+for i in range(start, end):
+yield (i, str(i))
+
+**Implement the Simple Stream Reader**
+
+If the data source has low throughput and doesn't require partitioning, you 
can implement SimpleDataSourceStreamReader instead of DataSourceStreamReader.
+
+One of simpleStreamReader() and streamReader() must be implemented for 
readable streaming data source. And simpleStreamReader() will only be invoked 
when streamReader() is not implemented.
+
+This is the same dummy streaming reader that generate 2 rows every batch 
implemented with SimpleDataSourceStreamReader interface.
+
+.. code-block:: python
+
+class SimpleStreamReader(SimpleDataSourceStreamReader):
+def initialOffset(self):
+"""
+Return the initial start offset of the reader.
+"""
+return {"offset": 0}
+
+def read(self, start: dict) -> (Iterator[Tuple], dict):
+"""
+Takes start offset as an input, return an iterator of tuples and 
the start offset of next read.
+"""
+start_idx = start["offset"]
+it = iter([(i,) for i in range(start_idx, start_idx + 2)])
+return (it, {"offset": start

Re: [PR] [SPARK-48268][CORE] Add a configuration for SparkContext.setCheckpointDir [spark]

2024-05-16 Thread via GitHub



mridulm commented on code in PR #46571:
URL: https://github.com/apache/spark/pull/46571#discussion_r1603972872


##
core/src/main/scala/org/apache/spark/internal/config/package.scala:
##
@@ -1317,6 +1317,16 @@ package object config {
   s" be less than or equal to 
${ByteArrayMethods.MAX_ROUNDED_ARRAY_LENGTH}.")
   .createWithDefault(64 * 1024 * 1024)
 
+  private[spark] val CHECKPOINT_DIR =
+ConfigBuilder("spark.checkpoint.dir")
+  .doc(
+"Equivalent with SparkContext.setCheckpointDir. If set, the path 
becomes" +
+  "the default directory for checkpointing. It can be overwritten by" +
+  "SparkContext.setCheckpointDir.")

Review Comment:
   nit:
```suggestion
   "Set the default directory for checkpointing. It can be overwritten 
by " +
 "SparkContext.setCheckpointDir.")
   ```
   
   (I missed this in my earlier pass, my bad)



##
docs/configuration.md:
##
@@ -1795,6 +1795,16 @@ Apart from these, the following properties are also 
available, and may be useful
   
   0.6.0
 
+
+  spark.checkpoint.dir
+  (none)
+  
+Equivalent with SparkContext.setCheckpointDir. If set, the path becomes
+the default directory for checkpointing. It can be overwritten by
+SparkContext.setCheckpointDir.

Review Comment:
   Same as above.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

Re: [PR] [SPARK-48238][BUILD][YARN] Replace AmIpFilter with re-implemented YarnAMIpFilter [spark]

2024-05-16 Thread via GitHub



mridulm commented on PR #46611:
URL: https://github.com/apache/spark/pull/46611#issuecomment-2116089785

   +CC @tgravescs 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

Re: [PR] Python ds preview [spark]

2024-05-16 Thread via GitHub



chaoqin-li1123 closed pull request #46624: Python ds preview
URL: https://github.com/apache/spark/pull/46624


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[PR] Python ds preview [spark]

2024-05-16 Thread via GitHub



chaoqin-li1123 opened a new pull request, #46624:
URL: https://github.com/apache/spark/pull/46624

   
   
   ### What changes were proposed in this pull request?
   
   
   
   ### Why are the changes needed?
   
   
   
   ### Does this PR introduce _any_ user-facing change?
   
   
   
   ### How was this patch tested?
   
   
   
   ### Was this patch authored or co-authored using generative AI tooling?
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

Re: [PR] [SPARK-48301][SQL] Rename `CREATE_FUNC_WITH_IF_NOT_EXISTS_AND_REPLACE` to `CREATE_ROUTINE_WITH_IF_NOT_EXISTS_AND_REPLACE` [spark]

2024-05-16 Thread via GitHub



srielau commented on code in PR #46608:
URL: https://github.com/apache/spark/pull/46608#discussion_r1603902920


##
common/utils/src/main/resources/error/error-conditions.json:
##
@@ -2675,9 +2675,9 @@
   "ANALYZE TABLE(S) ... COMPUTE STATISTICS ...  must be either 
NOSCAN or empty."
 ]
   },
-  "CREATE_FUNC_WITH_IF_NOT_EXISTS_AND_REPLACE" : {
+  "CREATE_ROUTINE_WITH_IF_NOT_EXISTS_AND_REPLACE" : {
 "message" : [
-  "CREATE FUNCTION with both IF NOT EXISTS and REPLACE is not allowed."
+  "CREATE PROCEDURE or CREATE FUNCTION with both IF NOT EXISTS and 
REPLACE is not allowed."

Review Comment:
   That works too



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

Re: [PR] [SPARK-48291][CORE][FOLLOWUP] Rename Java LoggerSuite as SparkLoggerSuite [spark]

2024-05-16 Thread via GitHub



gengliangwang closed pull request #46615: [SPARK-48291][CORE][FOLLOWUP] Rename 
Java *LoggerSuite* as *SparkLoggerSuite*
URL: https://github.com/apache/spark/pull/46615


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

Re: [PR] [SPARK-48291][CORE][FOLLOWUP] Rename Java LoggerSuite as SparkLoggerSuite [spark]

2024-05-16 Thread via GitHub



gengliangwang commented on PR #46615:
URL: https://github.com/apache/spark/pull/46615#issuecomment-2115976061

   Thanks, merging to master


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

Re: [PR] [SPARK-47920][DOCS][SS][PYTHON] Add doc for python streaming data source API [spark]

2024-05-16 Thread via GitHub



allisonwang-db commented on code in PR #46139:
URL: https://github.com/apache/spark/pull/46139#discussion_r1602361195


##
python/docs/source/user_guide/sql/python_data_source.rst:
##
@@ -33,9 +33,15 @@ To create a custom Python data source, you'll need to 
subclass the :class:`DataS
 
 This example demonstrates creating a simple data source to generate synthetic 
data using the `faker` library. Ensure the `faker` library is installed and 
accessible in your Python environment.
 
-**Step 1: Define the Data Source**
+**Define the Data Source**
 
-Start by creating a new subclass of :class:`DataSource`. Define the source 
name, schema, and reader logic as follows:
+Start by creating a new subclass of :class:`DataSource` with the source name, 
schema.
+
+In order to read from the data source in a batch query, reader() method need 
to be defined.
+
+In order to read from the data source in a streaming query, streamReader() or 
simpleStreamReader() method need to be defined.
+
+In order to write to the data source in a streaming query, streamWriter() 
method need to be defined.

Review Comment:
   Do you think it's more clear to have a markdown table here? streaming/batch 
x read/write.



##
python/docs/source/user_guide/sql/python_data_source.rst:
##
@@ -84,9 +101,158 @@ Define the reader logic to generate synthetic data. Use 
the `faker` library to p
 row.append(value)
 yield tuple(row)
 
+Implementing Streaming Reader and Writer for Python Data Source
+---
+**Implement the Stream Reader**
+
+This is a dummy streaming data reader that generate 2 rows in every 
microbatch. The streamReader instance has a integer offset that increase by 2 
in every microbatch.
+
+.. code-block:: python
+
+class RangePartition(InputPartition):
+def __init__(self, start, end):
+self.start = start
+self.end = end
+
+class FakeStreamReader(DataSourceStreamReader):
+def __init__(self, schema, options):
+self.current = 0
+
+def initialOffset(self) -> dict:
+"""
+Return the initial start offset of the reader.
+"""
+return {"offset": 0}
+
+def latestOffset(self) -> dict:
+"""
+Return the current latest offset that the next microbatch will 
read to.
+"""
+self.current += 2
+return {"offset": self.current}
+
+def partitions(self, start: dict, end: dict):
+"""
+Plans the partitioning of the current microbatch defined by start 
and end offset,
+it needs to return a sequence of :class:`InputPartition` object.
+"""
+return [RangePartition(start["offset"], end["offset"])]
+
+def commit(self, end: dict):
+"""
+This is invoked when the query has finished processing data before 
end offset, this can be used to clean up resource.
+"""
+pass
+
+def read(self, partition) -> Iterator[Tuple]:
+"""
+Takes a partition as an input and read an iterator of tuples from 
the data source.
+"""
+start, end = partition.start, partition.end
+for i in range(start, end):
+yield (i, str(i))
+
+**Implement the Simple Stream Reader**
+
+If the data source has low throughput and doesn't require partitioning, you 
can implement SimpleDataSourceStreamReader instead of DataSourceStreamReader.
+
+One of simpleStreamReader() and streamReader() must be implemented for 
readable streaming data source. And simpleStreamReader() will only be invoked 
when streamReader() is not implemented.
+
+This is the same dummy streaming reader that generate 2 rows every batch 
implemented with SimpleDataSourceStreamReader interface.
+
+.. code-block:: python
+
+class SimpleStreamReader(SimpleDataSourceStreamReader):
+def initialOffset(self):
+"""
+Return the initial start offset of the reader.
+"""
+return {"offset": 0}
+
+def read(self, start: dict) -> (Iterator[Tuple], dict):
+"""
+Takes start offset as an input, return an iterator of tuples and 
the start offset of next read.
+"""
+start_idx = start["offset"]
+it = iter([(i,) for i in range(start_idx, start_idx + 2)])
+return (it, {"offset": start_idx + 2})
+
+def readBetweenOffsets(self, start: dict, end: dict) -> 
Iterator[Tuple]:
+"""
+Takes start and end offset as input and read an iterator of data 
deterministically.
+This is called whe query replay batches during restart or after 
failure.
+"""
+start_idx = start["offset"]
+end_idx = end["offset"]
+return iter([(i,) for i in range(start_id

Re: [PR] [SPARK-48307][SQL] InlineCTE should keep not-inlined relations in the original WithCTE node [spark]

2024-05-16 Thread via GitHub



amaliujia commented on code in PR #46617:
URL: https://github.com/apache/spark/pull/46617#discussion_r1603865663


##
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/InlineCTE.scala:
##
@@ -74,34 +70,33 @@ case class InlineCTE(alwaysInline: Boolean = false) extends 
Rule[LogicalPlan] {
*
* @param plan The plan to collect the CTEs from
* @param cteMap A mutable map that accumulates the CTEs and their reference 
information by CTE
-   *   ids. The value of the map is tuple whose elements are:
-   *   - The CTE definition
-   *   - The number of incoming references to the CTE. This 
includes references from
-   * other CTEs and regular places.
-   *   - A mutable inner map that tracks outgoing references 
(counts) to other CTEs.
+   *   ids.
* @param outerCTEId While collecting the map we use this optional CTE id to 
identify the
*   current outer CTE.
*/
-  def buildCTEMap(
+  private def buildCTEMap(
   plan: LogicalPlan,
-  cteMap: mutable.Map[Long, (CTERelationDef, Int, mutable.Map[Long, Int])],
+  cteMap: mutable.Map[Long, CTEReferenceInfo],
   outerCTEId: Option[Long] = None): Unit = {
 plan match {
   case WithCTE(child, cteDefs) =>
 cteDefs.foreach { cteDef =>
-  cteMap(cteDef.id) = (cteDef, 0, 
mutable.Map.empty.withDefaultValue(0))
+  cteMap(cteDef.id) = CTEReferenceInfo(
+cteDef = cteDef,
+refCount = 0,
+outgoingRefs = mutable.Map.empty.withDefaultValue(0),
+shouldInline = true
+  )
 }
 cteDefs.foreach { cteDef =>
   buildCTEMap(cteDef, cteMap, Some(cteDef.id))
 }
 buildCTEMap(child, cteMap, outerCTEId)
 
   case ref: CTERelationRef =>
-val (cteDef, refCount, refMap) = cteMap(ref.cteId)
-cteMap(ref.cteId) = (cteDef, refCount + 1, refMap)
+cteMap(ref.cteId) = cteMap(ref.cteId).withRefCountIncreased(1)
 outerCTEId.foreach { cteId =>
-  val (_, _, outerRefMap) = cteMap(cteId)
-  outerRefMap(ref.cteId) += 1
+  cteMap(cteId).recordOutgoingReference(ref.cteId)

Review Comment:
   I personally like to not hide the `+1` implementation details here as I had 
troubles to understand it.
   
   Or the function name can be `IncreaseOutgoingReferenceByOne`?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[PR] [SPARK-48294] Handle lowercase in nestedTypeMissingElementTypeError [spark]

2024-05-16 Thread via GitHub



michaelzhan-db opened a new pull request, #46623:
URL: https://github.com/apache/spark/pull/46623

   
   
   ### What changes were proposed in this pull request?
   
   Handle lowercase values inside of nestTypeMissingElementTypeError to prevent 
match errors.
   
   ### Why are the changes needed?
   
   The previous match error was not user-friendly. Now it gives an actionable 
`INCOMPLETE_TYPE_DEFINITION` error.
   
   ### Does this PR introduce _any_ user-facing change?
   
   N/A
   
   ### How was this patch tested?
   
   Newly added tests pass.
   
   ### Was this patch authored or co-authored using generative AI tooling?
   
   No.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

Re: [PR] [WIP][SPARK-48281][SQL] Alter string search logic for UTF8_BINARY_LCASE collation (StringInStr, SubstringIndex) [spark]

2024-05-16 Thread via GitHub



mkaravel commented on code in PR #46589:
URL: https://github.com/apache/spark/pull/46589#discussion_r1603752983


##
common/unsafe/src/main/java/org/apache/spark/sql/catalyst/util/CollationAwareUTF8String.java:
##
@@ -278,47 +431,29 @@ public static UTF8String lowercaseSubStringIndex(final 
UTF8String string,
   return UTF8String.EMPTY_UTF8;
 }
 
-UTF8String lowercaseString = string.toLowerCase();
 UTF8String lowercaseDelimiter = delimiter.toLowerCase();
 
 if (count > 0) {
-  int idx = -1;
+  // search left to right (note: the start code point is inclusive)

Review Comment:
   ```suggestion
 // Search left to right (note: the start code point is inclusive).
   ```
   Please use regular sentences as comments.



##
common/unsafe/src/main/java/org/apache/spark/sql/catalyst/util/CollationAwareUTF8String.java:
##
@@ -278,47 +431,29 @@ public static UTF8String lowercaseSubStringIndex(final 
UTF8String string,
   return UTF8String.EMPTY_UTF8;
 }
 
-UTF8String lowercaseString = string.toLowerCase();
 UTF8String lowercaseDelimiter = delimiter.toLowerCase();
 
 if (count > 0) {
-  int idx = -1;
+  // search left to right (note: the start code point is inclusive)
+  int matchLength = -1;
   while (count > 0) {
-idx = lowercaseString.find(lowercaseDelimiter, idx + 1);
-if (idx >= 0) {
-  count--;
-} else {
-  // can not find enough delim
-  return string;
-}
-  }
-  if (idx == 0) {
-return UTF8String.EMPTY_UTF8;
+matchLength = lowercaseFind(string, lowercaseDelimiter, matchLength + 
1);
+if (matchLength > MATCH_NOT_FOUND) count--; // found a delimiter
+else return string; // cannot find enough delimiters in the string

Review Comment:
   ```suggestion
   else return string; // Cannot find enough delimiters in the string.
   ```



##
common/unsafe/src/main/java/org/apache/spark/sql/catalyst/util/CollationAwareUTF8String.java:
##
@@ -278,47 +431,29 @@ public static UTF8String lowercaseSubStringIndex(final 
UTF8String string,
   return UTF8String.EMPTY_UTF8;
 }
 
-UTF8String lowercaseString = string.toLowerCase();
 UTF8String lowercaseDelimiter = delimiter.toLowerCase();
 
 if (count > 0) {
-  int idx = -1;
+  // search left to right (note: the start code point is inclusive)
+  int matchLength = -1;
   while (count > 0) {
-idx = lowercaseString.find(lowercaseDelimiter, idx + 1);
-if (idx >= 0) {
-  count--;
-} else {
-  // can not find enough delim
-  return string;
-}
-  }
-  if (idx == 0) {
-return UTF8String.EMPTY_UTF8;
+matchLength = lowercaseFind(string, lowercaseDelimiter, matchLength + 
1);
+if (matchLength > MATCH_NOT_FOUND) count--; // found a delimiter
+else return string; // cannot find enough delimiters in the string
   }
-  byte[] bytes = new byte[idx];
-  copyMemory(string.getBaseObject(), string.getBaseOffset(), bytes, 
BYTE_ARRAY_OFFSET, idx);
-  return UTF8String.fromBytes(bytes);
-
+  if (matchLength == 0) return UTF8String.EMPTY_UTF8;
+  return string.substring(0, matchLength);
 } else {
-  int idx = string.numBytes() - delimiter.numBytes() + 1;
+  // search right to left (note: the end code point is exclusive)
+  int matchLength = string.numChars() + 1;
   count = -count;
   while (count > 0) {
-idx = lowercaseString.rfind(lowercaseDelimiter, idx - 1);
-if (idx >= 0) {
-  count--;
-} else {
-  // can not find enough delim
-  return string;
-}
+matchLength = lowercaseRFind(string, lowercaseDelimiter, matchLength - 
1);
+if (matchLength > MATCH_NOT_FOUND) count--; // found a delimiter
+else return string; // cannot find enough delimiters in the string
   }
-  if (idx + delimiter.numBytes() == string.numBytes()) {
-return UTF8String.EMPTY_UTF8;
-  }
-  int size = string.numBytes() - delimiter.numBytes() - idx;
-  byte[] bytes = new byte[size];
-  copyMemory(string.getBaseObject(), string.getBaseOffset() + idx + 
delimiter.numBytes(),
-bytes, BYTE_ARRAY_OFFSET, size);
-  return UTF8String.fromBytes(bytes);
+  if (matchLength == string.numChars()) return UTF8String.EMPTY_UTF8;
+  return string.substring(matchLength, string.numChars());

Review Comment:
   ```suggestion
 return string.substring(matchLength, string.numChars());
   ```
   Same here.



##
common/unsafe/src/main/java/org/apache/spark/sql/catalyst/util/CollationAwareUTF8String.java:
##
@@ -278,47 +431,29 @@ public static UTF8String lowercaseSubStringIndex(final 
UTF8String string,
   return UTF8String.EMPTY_UTF8;
 }
 
-UTF8String lowercaseString = string.toLowerCase();
 UTF8

Re: [PR] [WIP][SPARK-48221][SQL] Alter string search logic for UTF8_BINARY_LCASE collation (Contains, StartsWith, EndsWith, StringLocate) [spark]

2024-05-16 Thread via GitHub



mkaravel commented on code in PR #46511:
URL: https://github.com/apache/spark/pull/46511#discussion_r1603716669


##
common/unsafe/src/main/java/org/apache/spark/sql/catalyst/util/CollationAwareUTF8String.java:
##
@@ -34,6 +34,143 @@
  * Utility class for collation-aware UTF8String operations.
  */
 public class CollationAwareUTF8String {
+
+  /**
+   * The constant value to indicate that the match is not found
+   * when searching for a pattern string in a target string.
+   */
+  private static final int MATCH_NOT_FOUND = -1;
+
+  /**
+   * Returns whether the target string starts with the specified prefix,
+   * with respect to the UTF8_BINARY_LCASE collation. The method assumes
+   * that the prefix is already lowercased prior to method call to avoid the
+   * overhead of calling .toLowerCase() multiple times on the same prefix 
string.
+   *
+   * @param target the string to be searched in
+   * @param lowercasePattern the string to be searched for
+   * @param startPos the start position for searching (in the target string)

Review Comment:
   Let's specify that this is 0-based.



##
common/unsafe/src/main/java/org/apache/spark/sql/catalyst/util/CollationAwareUTF8String.java:
##
@@ -34,6 +34,143 @@
  * Utility class for collation-aware UTF8String operations.
  */
 public class CollationAwareUTF8String {
+
+  /**
+   * The constant value to indicate that the match is not found
+   * when searching for a pattern string in a target string.
+   */
+  private static final int MATCH_NOT_FOUND = -1;
+
+  /**
+   * Returns whether the target string starts with the specified prefix,
+   * with respect to the UTF8_BINARY_LCASE collation. The method assumes
+   * that the prefix is already lowercased prior to method call to avoid the
+   * overhead of calling .toLowerCase() multiple times on the same prefix 
string.
+   *
+   * @param target the string to be searched in
+   * @param lowercasePattern the string to be searched for
+   * @param startPos the start position for searching (in the target string)
+   * @return whether the target string starts with the specified prefix in 
UTF8_BINARY_LCASE
+   */
+  public static boolean lowercaseMatchFrom(
+  final UTF8String target,
+  final UTF8String lowercasePattern,
+  int startPos) {
+return lowercaseMatchLengthFrom(target, lowercasePattern, startPos) != 
MATCH_NOT_FOUND;
+  }
+
+  /**
+   * Returns the length of the substring of the target string that starts with
+   * the specified prefix, with respect to the UTF8_BINARY_LCASE collation.
+   * The method assumes that the prefix is already lowercased. The method only
+   * considers the part of target string that starts from the specified 
position.
+   *
+   * @param target the string to be searched in
+   * @param lowercasePattern the string to be searched for
+   * @param startPos the end position for searching (in the target string)
+   * @return length of the target substring that ends with the specified 
suffix in lowercase
+   */
+  public static int lowercaseMatchLengthFrom(
+  final UTF8String target,
+  final UTF8String lowercasePattern,
+  int startPos) {
+assert startPos >= 0;
+for (int len = 0; len <= target.numChars() - startPos; ++len) {
+  if (target.substring(startPos, startPos + 
len).toLowerCase().equals(lowercasePattern)) {
+return len;
+  }
+}
+return MATCH_NOT_FOUND;
+  }
+
+  /**
+   * Returns the position of the first occurrence of the pattern string
+   * in the target string from the specified position (0-based index),
+   * with respect to the UTF8_BINARY_LCASE collation. The method assumes
+   * that the pattern string is already lowercased prior to method call.
+   *
+   * @param target the string to be searched in
+   * @param lowercasePattern the string to be searched for
+   * @param startPos the start position for searching (in the target string)
+   * @return the position of the first occurrence of pattern in target, if not 
found, -1 returned.

Review Comment:
   ```suggestion
  * @return the position of the first occurrence of pattern in target, if 
not found, `MATCH_NOT_FOUND` returned.
   ```



##
common/unsafe/src/main/java/org/apache/spark/sql/catalyst/util/CollationAwareUTF8String.java:
##
@@ -34,6 +34,143 @@
  * Utility class for collation-aware UTF8String operations.
  */
 public class CollationAwareUTF8String {
+
+  /**
+   * The constant value to indicate that the match is not found
+   * when searching for a pattern string in a target string.
+   */
+  private static final int MATCH_NOT_FOUND = -1;
+
+  /**
+   * Returns whether the target string starts with the specified prefix,
+   * with respect to the UTF8_BINARY_LCASE collation. The method assumes
+   * that the prefix is already lowercased prior to method call to avoid the
+   * overhead of calling .toLowerCase() multiple times on the same prefix 
string.
+   *
+   * @param target the string to b

Re: [PR] [SPARK-48175][SQL][PYTHON] Store collation information in metadata and not in type for SER/DE [spark]

2024-05-16 Thread via GitHub



stefankandic commented on code in PR #46280:
URL: https://github.com/apache/spark/pull/46280#discussion_r1603733508


##
sql/catalyst/src/test/scala/org/apache/spark/sql/types/DataTypeSuite.scala:
##
@@ -712,4 +714,181 @@ class DataTypeSuite extends SparkFunSuite {
 
 assert(result === expected)
   }
+
+  test("schema with collation should not change during ser/de") {
+val simpleStruct = StructType(
+  StructField("c1", StringType(UNICODE_COLLATION_ID)) :: Nil)
+
+val nestedStruct = StructType(
+  StructField("nested", simpleStruct) :: Nil)
+
+val caseInsensitiveNames = StructType(
+  StructField("c1", StringType(UNICODE_COLLATION_ID)) ::
+  StructField("C1", StringType(UNICODE_COLLATION_ID)) :: Nil)
+
+val specialCharsInName = StructType(
+  StructField("c1.*23?", StringType(UNICODE_COLLATION_ID)) :: Nil)
+
+val arrayInSchema = StructType(
+  StructField("arrayField", ArrayType(StringType(UNICODE_COLLATION_ID))) 
:: Nil)
+
+val mapInSchema = StructType(
+  StructField("mapField",
+MapType(StringType(UNICODE_COLLATION_ID), 
StringType(UNICODE_COLLATION_ID))) :: Nil)
+
+val mapWithKeyInNameInSchema = StructType(
+  StructField("name.key", StringType) ::
+  StructField("name",
+MapType(StringType(UNICODE_COLLATION_ID), 
StringType(UNICODE_COLLATION_ID))) :: Nil)
+
+val arrayInMapInNestedSchema = StructType(
+  StructField("arrInMap",
+MapType(StringType(UNICODE_COLLATION_ID),
+ArrayType(StringType(UNICODE_COLLATION_ID :: Nil)
+
+val nestedArrayInMap = StructType(
+  StructField("nestedArrayInMap",
+ArrayType(MapType(StringType(UNICODE_COLLATION_ID),
+  ArrayType(ArrayType(StringType(UNICODE_COLLATION_ID)) :: Nil)
+
+val schemaWithMultipleFields = StructType(
+  simpleStruct.fields ++ nestedStruct.fields ++ arrayInSchema.fields ++ 
mapInSchema.fields ++
+mapWithKeyInNameInSchema ++ arrayInMapInNestedSchema.fields ++ 
nestedArrayInMap.fields)
+
+Seq(
+  simpleStruct, caseInsensitiveNames, specialCharsInName, nestedStruct, 
arrayInSchema,
+  mapInSchema, mapWithKeyInNameInSchema, nestedArrayInMap, 
arrayInMapInNestedSchema,
+  schemaWithMultipleFields)
+  .foreach { schema =>
+val json = schema.json
+val parsed = DataType.fromJson(json)
+assert(parsed === schema)
+  }
+  }
+
+  test("non string field has collation metadata") {

Review Comment:
   `StructTypeSuite` is used just to validate `toJson` value, deserialization 
is tested in `DataTypeSuite`



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

Re: [PR] [SPARK-48175][SQL][PYTHON] Store collation information in metadata and not in type for SER/DE [spark]

2024-05-16 Thread via GitHub



stefankandic commented on code in PR #46280:
URL: https://github.com/apache/spark/pull/46280#discussion_r1603732173


##
sql/api/src/main/scala/org/apache/spark/sql/types/DataType.scala:
##
@@ -208,22 +206,35 @@ object DataType {
   }
 
   // NOTE: Map fields must be sorted in alphabetical order to keep consistent 
with the Python side.
-  private[sql] def parseDataType(json: JValue): DataType = json match {
+  private[sql] def parseDataType(
+  json: JValue,
+  fieldPath: String = "",
+  collationsMap: Map[String, String] = Map.empty): DataType = json match {
 case JString(name) =>
-  nameToType(name)
+  collationsMap.get(fieldPath) match {
+case Some(collation) =>
+  assertValidTypeForCollations(fieldPath, name, collationsMap)
+  stringTypeWithCollation(collation)
+case _ => nameToType(name)
+  }
 
 case JSortedObject(
 ("containsNull", JBool(n)),
 ("elementType", t: JValue),
 ("type", JString("array"))) =>
-  ArrayType(parseDataType(t), n)
+  assertValidTypeForCollations(fieldPath, "array", collationsMap)
+  val elementType = parseDataTypeWithCollation(t, fieldPath + ".element", 
collationsMap)
+  ArrayType(elementType, n)
 
 case JSortedObject(
 ("keyType", k: JValue),
 ("type", JString("map")),
 ("valueContainsNull", JBool(n)),
 ("valueType", v: JValue)) =>
-  MapType(parseDataType(k), parseDataType(v), n)
+  assertValidTypeForCollations(fieldPath, "map", collationsMap)
+  val keyType = parseDataTypeWithCollation(k, fieldPath + ".key", 
collationsMap)
+  val valueType = parseDataTypeWithCollation(v, fieldPath + ".value", 
collationsMap)

Review Comment:
   good catch! After some refactoring earlier this is no longer needed



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

Re: [PR] [SPARK-48301][SQL] Rename `CREATE_FUNC_WITH_IF_NOT_EXISTS_AND_REPLACE` to `CREATE_ROUTINE_WITH_IF_NOT_EXISTS_AND_REPLACE` [spark]

2024-05-16 Thread via GitHub



allisonwang-db commented on code in PR #46608:
URL: https://github.com/apache/spark/pull/46608#discussion_r1603731266


##
common/utils/src/main/resources/error/error-conditions.json:
##
@@ -2675,9 +2675,9 @@
   "ANALYZE TABLE(S) ... COMPUTE STATISTICS ...  must be either 
NOSCAN or empty."
 ]
   },
-  "CREATE_FUNC_WITH_IF_NOT_EXISTS_AND_REPLACE" : {
+  "CREATE_ROUTINE_WITH_IF_NOT_EXISTS_AND_REPLACE" : {
 "message" : [
-  "CREATE FUNCTION with both IF NOT EXISTS and REPLACE is not allowed."
+  "CREATE PROCEDURE or CREATE FUNCTION with both IF NOT EXISTS and 
REPLACE is not allowed."

Review Comment:
   How about: "Cannot create a routine with both IF NOT EXISTS and REPLACE 
specified."



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

Re: [PR] [SPARK-48175][SQL][PYTHON] Store collation information in metadata and not in type for SER/DE [spark]

2024-05-16 Thread via GitHub



stefankandic commented on code in PR #46280:
URL: https://github.com/apache/spark/pull/46280#discussion_r1603727528


##
common/unsafe/src/main/java/org/apache/spark/sql/catalyst/util/CollationFactory.java:
##
@@ -36,11 +36,45 @@
  * Provides functionality to the UTF8String object which respects defined 
collation settings.
  */
 public final class CollationFactory {
+
+  /**
+   * Identifier for single a collation.
+   */
+  public static class CollationIdentifier {
+public final String provider;
+public final String name;
+public final String version;

Review Comment:
   So you also think we should put optional here? Even intellij complains when 
it is used in a field.
   
   https://github.com/apache/spark/assets/154237371/ea7bd5c5-43cb-4826-aae5-bf76588b6198";>
   
   Also, you can read below that it's intended to be used for method return type



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

Re: [PR] [SQL][SPARK-48312] Improve Alias.removeNonInheritableMetadata performance [spark]

2024-05-16 Thread via GitHub



agubichev commented on PR #46622:
URL: https://github.com/apache/spark/pull/46622#issuecomment-2115745170

   @cloud-fan PTAL


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

Re: [PR] [SPARK-48175][SQL][PYTHON] Store collation information in metadata and not in type for SER/DE [spark]

2024-05-16 Thread via GitHub



stefankandic commented on code in PR #46280:
URL: https://github.com/apache/spark/pull/46280#discussion_r1603651070


##
sql/api/src/main/scala/org/apache/spark/sql/types/StructField.scala:
##
@@ -63,7 +66,61 @@ case class StructField(
 ("name" -> name) ~
   ("type" -> dataType.jsonValue) ~
   ("nullable" -> nullable) ~
-  ("metadata" -> metadata.jsonValue)
+  ("metadata" -> metadataJson)
+  }
+
+  private def metadataJson: JValue = {
+val metadataJsonValue = metadata.jsonValue
+metadataJsonValue match {
+  case JObject(fields) if collationMetadata.nonEmpty =>
+val collationFields = collationMetadata.map(kv => kv._1 -> 
JString(kv._2)).toList
+JObject(fields :+ (DataType.COLLATIONS_METADATA_KEY -> 
JObject(collationFields)))
+
+  case _ => metadataJsonValue
+}
+  }
+
+  /** Map of field path to collation name. */
+  private lazy val collationMetadata: Map[String, String] = {
+val fieldToCollationMap = mutable.Map[String, String]()
+
+def visitRecursively(dt: DataType, path: String): Unit = dt match {
+  case at: ArrayType =>
+processDataType(at.elementType, path + ".element")
+
+  case mt: MapType =>
+processDataType(mt.keyType, path + ".key")
+processDataType(mt.valueType, path + ".value")
+
+  case st: StringType if isCollatedString(st) =>
+fieldToCollationMap(path) = schemaCollationValue(st)
+
+  case _ =>
+}
+
+def processDataType(dt: DataType, path: String): Unit = {
+  if (isCollatedString(dt)) {
+fieldToCollationMap(path) = schemaCollationValue(dt)
+  } else {
+visitRecursively(dt, path)
+  }
+}
+
+visitRecursively(dataType, name)
+fieldToCollationMap.toMap
+  }
+
+  private def isCollatedString(dt: DataType): Boolean = dt match {
+case st: StringType => !st.isUTF8BinaryCollation
+case _ => false
+  }
+
+  private def schemaCollationValue(dt: DataType): String = dt match {
+case st: StringType =>
+  val collation = CollationFactory.fetchCollation(st.collationId)
+  collation.identifier().toStringWithoutVersion()

Review Comment:
   for writing statistics - version change invalidates them



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

Re: [PR] [SQL][SPARK-48312] Improve Alias.removeNonInheritableMetadata performance [spark]

2024-05-16 Thread via GitHub



vladimirg-db commented on PR #46622:
URL: https://github.com/apache/spark/pull/46622#issuecomment-2115620509

   @agubichev, hi! Here's the improvement fix for the ultra-wide views


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

Re: [PR] [SPARK-48159][SQL] Extending support for collated strings on datetime expressions [spark]

2024-05-16 Thread via GitHub



nebojsa-db commented on PR #46618:
URL: https://github.com/apache/spark/pull/46618#issuecomment-2115535153

   @cloud-fan  Please review :) 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[PR] [SPARK-48309][YARN]Stop am retry, in situations where some errors and… [spark]

2024-05-16 Thread via GitHub



guixiaowen opened a new pull request, #46620:
URL: https://github.com/apache/spark/pull/46620

   … retries may not be successful
   
   
   
   ### What changes were proposed in this pull request?
   In yarn cluster mode, spark.yarn.maxAppAttempts will be configured. In our 
production environment, it is configured as 2 If the first execution fails, AM 
will retry. However, in some scenarios, even attempting a second task may fail.
   
   For example:
   
   org. apache. park. SQL AnalysisException: Table or view not found: 
test.test_x; Line 1 pos 14;
   Project
   +-Unresolved Relationship [bigdata_qa, testx_x], [], false
   
   
   
   Other example:
   Caused by: org. apache. hadoop. hdfs. protocol NSQuotaExceededException: The 
NameSpace quota (directories and files) of directory/tmp/xxx_file/ is 
exceeded: quota=100 file count=101
   
   Would it be more appropriate to try capturing these exceptions and stopping 
retry?
   
   
   ### Why are the changes needed?
   In some scenarios, even attempting a second task may fail.
   
   
   ### Does this PR introduce _any_ user-facing change?
   The user can throw a SparkStopAMRetryException, and the Application Master 
will catch the exception and stop retry
   
   For examle
   
   val spark = SparkSession
 .builder()
 .appName("Spark SQL basic example")
 .enableHiveSupport()
 .config("spark.some.config.option", "some-value")
 .getOrCreate()
   
   try {
 spark.sql("select * from test.test_x;").show
   } catch {
 case e:AnalysisException => throw new SparkStopAMRetryException("this 
is a test", e)
   } finally {
 spark.stop()
   }
   

   
   
   ### How was this patch tested?
   
   
   
   ### Was this patch authored or co-authored using generative AI tooling?
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

Re: [PR] [SPARK-47424][SQL] Add getDatabaseCalendar method to the JdbcDialect [spark]

2024-05-16 Thread via GitHub



PetarVasiljevic-DB commented on code in PR #45537:
URL: https://github.com/apache/spark/pull/45537#discussion_r1603513988


##
connector/docker-integration-tests/src/test/scala/org/apache/spark/sql/jdbc/v2/OracleIntegrationSuite.scala:
##
@@ -125,6 +126,29 @@ class OracleIntegrationSuite extends 
DockerJDBCIntegrationV2Suite with V2JDBCTes
 
   override def caseConvert(tableName: String): String = 
tableName.toUpperCase(Locale.ROOT)
 
+  override protected val timestampNTZType: String = "TIMESTAMP WITH LOCAL TIME 
ZONE"

Review Comment:
   Yep, thanks



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

Re: [PR] [SPARK-48308][Core] Unify getting data schema without partition columns in FileSourceStrategy [spark]

2024-05-16 Thread via GitHub



cloud-fan closed pull request #46619: [SPARK-48308][Core] Unify getting data 
schema without partition columns in FileSourceStrategy
URL: https://github.com/apache/spark/pull/46619


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

Re: [PR] [SPARK-48308][Core] Unify getting data schema without partition columns in FileSourceStrategy [spark]

2024-05-16 Thread via GitHub



cloud-fan commented on PR #46619:
URL: https://github.com/apache/spark/pull/46619#issuecomment-2115430404

   thanks, merging to master!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

Re: [PR] [SPARK-47424][SQL] Add getDatabaseCalendar method to the JdbcDialect [spark]

2024-05-16 Thread via GitHub



cloud-fan commented on code in PR #45537:
URL: https://github.com/apache/spark/pull/45537#discussion_r1603428926


##
connector/docker-integration-tests/src/test/scala/org/apache/spark/sql/jdbc/v2/OracleIntegrationSuite.scala:
##
@@ -125,6 +126,29 @@ class OracleIntegrationSuite extends 
DockerJDBCIntegrationV2Suite with V2JDBCTes
 
   override def caseConvert(tableName: String): String = 
tableName.toUpperCase(Locale.ROOT)
 
+  override protected val timestampNTZType: String = "TIMESTAMP WITH LOCAL TIME 
ZONE"

Review Comment:
   isn't it LTZ?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

Re: [PR] [SPARK-47424][SQL] Add getDatabaseCalendar method to the JdbcDialect [spark]

2024-05-16 Thread via GitHub



cloud-fan commented on code in PR #45537:
URL: https://github.com/apache/spark/pull/45537#discussion_r1603426755


##
connector/docker-integration-tests/src/test/scala/org/apache/spark/sql/jdbc/v2/DB2IntegrationSuite.scala:
##
@@ -102,4 +105,7 @@ class DB2IntegrationSuite extends 
DockerJDBCIntegrationV2Suite with V2JDBCTest {
   }
 
   override def caseConvert(tableName: String): String = 
tableName.toUpperCase(Locale.ROOT)
+
+  override protected val timestampNTZType: String = "TIMESTAMP WITHOUT TIME 
ZONE"
+  override protected val timestampTZType: String = "TIMESTAMP"

Review Comment:
   do you mean NTZ? TZ usually means WITH TIME ZONE



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

Re: [PR] [WIP][SPARK-48000][SQL] Enable hash join support for all collations (StringType) [spark]

2024-05-16 Thread via GitHub



cloud-fan commented on code in PR #46599:
URL: https://github.com/apache/spark/pull/46599#discussion_r1603420522


##
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/RewriteCollationJoin.scala:
##
@@ -0,0 +1,62 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.catalyst.analysis
+
+import org.apache.spark.sql.catalyst.expressions.{AttributeReference, 
CollationKey, EqualNullSafe, EqualTo, Lower}
+import org.apache.spark.sql.catalyst.plans.logical.{Join, LogicalPlan}
+import org.apache.spark.sql.catalyst.rules.Rule
+import org.apache.spark.sql.catalyst.util.CollationFactory
+import org.apache.spark.sql.types.StringType
+
+object RewriteCollationJoin extends Rule[LogicalPlan] {
+  def apply(plan: LogicalPlan): LogicalPlan = plan transformUpWithNewOutput {
+case j @ Join(_, _, _, Some(condition), _) =>
+  val newCondition = condition transform {
+case EqualTo(l: AttributeReference, r: AttributeReference) =>
+  (l.dataType, r.dataType) match {
+case (st: StringType, _: StringType) =>

Review Comment:
   how about nested string type such as array of string, struct of string?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

Re: [PR] [SPARK-48175][SQL][PYTHON] Store collation information in metadata and not in type for SER/DE [spark]

2024-05-16 Thread via GitHub



cloud-fan commented on code in PR #46280:
URL: https://github.com/apache/spark/pull/46280#discussion_r1603406813


##
sql/catalyst/src/test/scala/org/apache/spark/sql/types/DataTypeSuite.scala:
##
@@ -712,4 +714,181 @@ class DataTypeSuite extends SparkFunSuite {
 
 assert(result === expected)
   }
+
+  test("schema with collation should not change during ser/de") {
+val simpleStruct = StructType(
+  StructField("c1", StringType(UNICODE_COLLATION_ID)) :: Nil)
+
+val nestedStruct = StructType(
+  StructField("nested", simpleStruct) :: Nil)
+
+val caseInsensitiveNames = StructType(
+  StructField("c1", StringType(UNICODE_COLLATION_ID)) ::
+  StructField("C1", StringType(UNICODE_COLLATION_ID)) :: Nil)
+
+val specialCharsInName = StructType(
+  StructField("c1.*23?", StringType(UNICODE_COLLATION_ID)) :: Nil)
+
+val arrayInSchema = StructType(
+  StructField("arrayField", ArrayType(StringType(UNICODE_COLLATION_ID))) 
:: Nil)
+
+val mapInSchema = StructType(
+  StructField("mapField",
+MapType(StringType(UNICODE_COLLATION_ID), 
StringType(UNICODE_COLLATION_ID))) :: Nil)
+
+val mapWithKeyInNameInSchema = StructType(
+  StructField("name.key", StringType) ::
+  StructField("name",
+MapType(StringType(UNICODE_COLLATION_ID), 
StringType(UNICODE_COLLATION_ID))) :: Nil)
+
+val arrayInMapInNestedSchema = StructType(
+  StructField("arrInMap",
+MapType(StringType(UNICODE_COLLATION_ID),
+ArrayType(StringType(UNICODE_COLLATION_ID :: Nil)
+
+val nestedArrayInMap = StructType(
+  StructField("nestedArrayInMap",
+ArrayType(MapType(StringType(UNICODE_COLLATION_ID),
+  ArrayType(ArrayType(StringType(UNICODE_COLLATION_ID)) :: Nil)
+
+val schemaWithMultipleFields = StructType(
+  simpleStruct.fields ++ nestedStruct.fields ++ arrayInSchema.fields ++ 
mapInSchema.fields ++
+mapWithKeyInNameInSchema ++ arrayInMapInNestedSchema.fields ++ 
nestedArrayInMap.fields)
+
+Seq(
+  simpleStruct, caseInsensitiveNames, specialCharsInName, nestedStruct, 
arrayInSchema,
+  mapInSchema, mapWithKeyInNameInSchema, nestedArrayInMap, 
arrayInMapInNestedSchema,
+  schemaWithMultipleFields)
+  .foreach { schema =>
+val json = schema.json
+val parsed = DataType.fromJson(json)
+assert(parsed === schema)
+  }
+  }
+
+  test("non string field has collation metadata") {

Review Comment:
   how do we decide to put test in here or `StructTypeSuite`?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

Re: [PR] [SPARK-48175][SQL][PYTHON] Store collation information in metadata and not in type for SER/DE [spark]

2024-05-16 Thread via GitHub



cloud-fan commented on code in PR #46280:
URL: https://github.com/apache/spark/pull/46280#discussion_r1603405105


##
sql/api/src/main/scala/org/apache/spark/sql/types/StructField.scala:
##
@@ -63,7 +66,61 @@ case class StructField(
 ("name" -> name) ~
   ("type" -> dataType.jsonValue) ~
   ("nullable" -> nullable) ~
-  ("metadata" -> metadata.jsonValue)
+  ("metadata" -> metadataJson)
+  }
+
+  private def metadataJson: JValue = {
+val metadataJsonValue = metadata.jsonValue
+metadataJsonValue match {
+  case JObject(fields) if collationMetadata.nonEmpty =>
+val collationFields = collationMetadata.map(kv => kv._1 -> 
JString(kv._2)).toList
+JObject(fields :+ (DataType.COLLATIONS_METADATA_KEY -> 
JObject(collationFields)))
+
+  case _ => metadataJsonValue
+}
+  }
+
+  /** Map of field path to collation name. */
+  private lazy val collationMetadata: Map[String, String] = {
+val fieldToCollationMap = mutable.Map[String, String]()
+
+def visitRecursively(dt: DataType, path: String): Unit = dt match {
+  case at: ArrayType =>
+processDataType(at.elementType, path + ".element")
+
+  case mt: MapType =>
+processDataType(mt.keyType, path + ".key")
+processDataType(mt.valueType, path + ".value")
+
+  case st: StringType if isCollatedString(st) =>
+fieldToCollationMap(path) = schemaCollationValue(st)
+
+  case _ =>
+}
+
+def processDataType(dt: DataType, path: String): Unit = {
+  if (isCollatedString(dt)) {
+fieldToCollationMap(path) = schemaCollationValue(dt)
+  } else {
+visitRecursively(dt, path)
+  }
+}
+
+visitRecursively(dataType, name)
+fieldToCollationMap.toMap
+  }
+
+  private def isCollatedString(dt: DataType): Boolean = dt match {
+case st: StringType => !st.isUTF8BinaryCollation
+case _ => false
+  }
+
+  private def schemaCollationValue(dt: DataType): String = dt match {
+case st: StringType =>
+  val collation = CollationFactory.fetchCollation(st.collationId)
+  collation.identifier().toStringWithoutVersion()

Review Comment:
   when will we use the version?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

Re: [PR] [SPARK-48175][SQL][PYTHON] Store collation information in metadata and not in type for SER/DE [spark]

2024-05-16 Thread via GitHub



cloud-fan commented on code in PR #46280:
URL: https://github.com/apache/spark/pull/46280#discussion_r1603404296


##
sql/api/src/main/scala/org/apache/spark/sql/types/StructField.scala:
##
@@ -63,7 +66,61 @@ case class StructField(
 ("name" -> name) ~
   ("type" -> dataType.jsonValue) ~
   ("nullable" -> nullable) ~
-  ("metadata" -> metadata.jsonValue)
+  ("metadata" -> metadataJson)
+  }
+
+  private def metadataJson: JValue = {
+val metadataJsonValue = metadata.jsonValue
+metadataJsonValue match {
+  case JObject(fields) if collationMetadata.nonEmpty =>
+val collationFields = collationMetadata.map(kv => kv._1 -> 
JString(kv._2)).toList
+JObject(fields :+ (DataType.COLLATIONS_METADATA_KEY -> 
JObject(collationFields)))
+
+  case _ => metadataJsonValue
+}
+  }
+
+  /** Map of field path to collation name. */
+  private lazy val collationMetadata: Map[String, String] = {
+val fieldToCollationMap = mutable.Map[String, String]()
+
+def visitRecursively(dt: DataType, path: String): Unit = dt match {
+  case at: ArrayType =>
+processDataType(at.elementType, path + ".element")
+
+  case mt: MapType =>
+processDataType(mt.keyType, path + ".key")
+processDataType(mt.valueType, path + ".value")
+
+  case st: StringType if isCollatedString(st) =>
+fieldToCollationMap(path) = schemaCollationValue(st)
+
+  case _ =>
+}
+
+def processDataType(dt: DataType, path: String): Unit = {
+  if (isCollatedString(dt)) {
+fieldToCollationMap(path) = schemaCollationValue(dt)
+  } else {
+visitRecursively(dt, path)
+  }
+}
+
+visitRecursively(dataType, name)
+fieldToCollationMap.toMap
+  }
+
+  private def isCollatedString(dt: DataType): Boolean = dt match {
+case st: StringType => !st.isUTF8BinaryCollation
+case _ => false
+  }
+
+  private def schemaCollationValue(dt: DataType): String = dt match {
+case st: StringType =>
+  val collation = CollationFactory.fetchCollation(st.collationId)
+  collation.identifier().toStringWithoutVersion()
+case _ =>
+  throw new IllegalStateException(s"Unexpected data type $dt")

Review Comment:
   we should use `SparkException.internalError`



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

Re: [PR] [SPARK-48175][SQL][PYTHON] Store collation information in metadata and not in type for SER/DE [spark]

2024-05-16 Thread via GitHub



cloud-fan commented on code in PR #46280:
URL: https://github.com/apache/spark/pull/46280#discussion_r1603399289


##
sql/api/src/main/scala/org/apache/spark/sql/types/DataType.scala:
##
@@ -208,22 +206,35 @@ object DataType {
   }
 
   // NOTE: Map fields must be sorted in alphabetical order to keep consistent 
with the Python side.
-  private[sql] def parseDataType(json: JValue): DataType = json match {
+  private[sql] def parseDataType(
+  json: JValue,
+  fieldPath: String = "",
+  collationsMap: Map[String, String] = Map.empty): DataType = json match {
 case JString(name) =>
-  nameToType(name)
+  collationsMap.get(fieldPath) match {
+case Some(collation) =>
+  assertValidTypeForCollations(fieldPath, name, collationsMap)
+  stringTypeWithCollation(collation)
+case _ => nameToType(name)
+  }
 
 case JSortedObject(
 ("containsNull", JBool(n)),
 ("elementType", t: JValue),
 ("type", JString("array"))) =>
-  ArrayType(parseDataType(t), n)
+  assertValidTypeForCollations(fieldPath, "array", collationsMap)
+  val elementType = parseDataTypeWithCollation(t, fieldPath + ".element", 
collationsMap)
+  ArrayType(elementType, n)
 
 case JSortedObject(
 ("keyType", k: JValue),
 ("type", JString("map")),
 ("valueContainsNull", JBool(n)),
 ("valueType", v: JValue)) =>
-  MapType(parseDataType(k), parseDataType(v), n)
+  assertValidTypeForCollations(fieldPath, "map", collationsMap)
+  val keyType = parseDataTypeWithCollation(k, fieldPath + ".key", 
collationsMap)
+  val valueType = parseDataTypeWithCollation(v, fieldPath + ".value", 
collationsMap)

Review Comment:
   why do we need `parseDataTypeWithCollation`? It looks correct to just call 
`parseDataType`.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

Re: [PR] [SPARK-48175][SQL][PYTHON] Store collation information in metadata and not in type for SER/DE [spark]

2024-05-16 Thread via GitHub



cloud-fan commented on code in PR #46280:
URL: https://github.com/apache/spark/pull/46280#discussion_r1603393035


##
sql/api/src/main/scala/org/apache/spark/sql/types/DataType.scala:
##
@@ -208,22 +206,35 @@ object DataType {
   }
 
   // NOTE: Map fields must be sorted in alphabetical order to keep consistent 
with the Python side.
-  private[sql] def parseDataType(json: JValue): DataType = json match {
+  private[sql] def parseDataType(
+  json: JValue,
+  fieldPath: String = "",
+  collationsMap: Map[String, String] = Map.empty): DataType = json match {
 case JString(name) =>
-  nameToType(name)
+  collationsMap.get(fieldPath) match {
+case Some(collation) =>
+  assertValidTypeForCollations(fieldPath, name, collationsMap)
+  stringTypeWithCollation(collation)
+case _ => nameToType(name)
+  }
 
 case JSortedObject(
 ("containsNull", JBool(n)),
 ("elementType", t: JValue),
 ("type", JString("array"))) =>
-  ArrayType(parseDataType(t), n)
+  assertValidTypeForCollations(fieldPath, "array", collationsMap)
+  val elementType = parseDataTypeWithCollation(t, fieldPath + ".element", 
collationsMap)
+  ArrayType(elementType, n)
 
 case JSortedObject(
 ("keyType", k: JValue),
 ("type", JString("map")),
 ("valueContainsNull", JBool(n)),
 ("valueType", v: JValue)) =>
-  MapType(parseDataType(k), parseDataType(v), n)
+  assertValidTypeForCollations(fieldPath, "map", collationsMap)

Review Comment:
   shall we do this assert in the StructType branch as well?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

Re: [PR] [SPARK-48175][SQL][PYTHON] Store collation information in metadata and not in type for SER/DE [spark]

2024-05-16 Thread via GitHub



cloud-fan commented on code in PR #46280:
URL: https://github.com/apache/spark/pull/46280#discussion_r1603384345


##
common/unsafe/src/main/java/org/apache/spark/sql/catalyst/util/CollationFactory.java:
##
@@ -110,6 +158,8 @@ public Collation(
   // No Collation can simultaneously support binary equality and lowercase 
equality
   assert(!supportsBinaryEquality || !supportsLowercaseEquality);
 
+  assert(SUPPORTED_PROVIDERS.contains(provider) || provider == null);

Review Comment:
   when `provider` can be null?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

Re: [PR] [SPARK-48175][SQL][PYTHON] Store collation information in metadata and not in type for SER/DE [spark]

2024-05-16 Thread via GitHub



cloud-fan commented on code in PR #46280:
URL: https://github.com/apache/spark/pull/46280#discussion_r1603382702


##
common/unsafe/src/main/java/org/apache/spark/sql/catalyst/util/CollationFactory.java:
##
@@ -36,11 +36,45 @@
  * Provides functionality to the UTF8String object which respects defined 
collation settings.
  */
 public final class CollationFactory {
+
+  /**
+   * Identifier for single a collation.
+   */
+  public static class CollationIdentifier {
+public final String provider;
+public final String name;
+public final String version;

Review Comment:
   attributes are not optional... It's a bad practice to leave attributes 
un-initialized, which is null for reference type.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

Re: [PR] [WIP][SPARK-48000][SQL] Enable hash join support for all collations (StringType) [spark]

2024-05-16 Thread via GitHub



dbatomic commented on code in PR #46599:
URL: https://github.com/apache/spark/pull/46599#discussion_r1603381135


##
sql/core/src/test/scala/org/apache/spark/sql/CollationSuite.scala:
##
@@ -1051,6 +1052,153 @@ class CollationSuite extends DatasourceV2SQLBase with 
AdaptiveSparkPlanHelper {
 }
   }
 
+  test("CollationKey generates correct collation key for string") {

Review Comment:
   I think that these tests should be in `CollationExpressionSuite`.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

Re: [PR] [WIP][SPARK-48000][SQL] Enable hash join support for all collations (StringType) [spark]

2024-05-16 Thread via GitHub



dbatomic commented on code in PR #46599:
URL: https://github.com/apache/spark/pull/46599#discussion_r1603377917


##
sql/core/src/test/scala/org/apache/spark/sql/CollationSuite.scala:
##
@@ -784,19 +785,19 @@ class CollationSuite extends DatasourceV2SQLBase with 
AdaptiveSparkPlanHelper {
   case _: BroadcastHashJoinExec => ()
 }.nonEmpty)
 
-// Even with hint broadcast, hash join is not used for non-binary collated 
strings.
+// Hash join is also used for non-binary collated strings.

Review Comment:
   We should assert that collationkey was injected into physical plan.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

Re: [PR] [WIP][SPARK-48000][SQL] Enable hash join support for all collations (StringType) [spark]

2024-05-16 Thread via GitHub



dbatomic commented on code in PR #46599:
URL: https://github.com/apache/spark/pull/46599#discussion_r1603376505


##
sql/core/src/test/scala/org/apache/spark/sql/CollationSuite.scala:
##
@@ -784,19 +785,19 @@ class CollationSuite extends DatasourceV2SQLBase with 
AdaptiveSparkPlanHelper {
   case _: BroadcastHashJoinExec => ()
 }.nonEmpty)
 
-// Even with hint broadcast, hash join is not used for non-binary collated 
strings.
+// Hash join is also used for non-binary collated strings.

Review Comment:
   I think that you should rename the test. It's currently called `test("hash 
based joins not allowed for non-binary collated strings")` which is no longer 
true.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

Re: [PR] [WIP][SPARK-48000][SQL] Enable hash join support for all collations (StringType) [spark]

2024-05-16 Thread via GitHub



dbatomic commented on code in PR #46599:
URL: https://github.com/apache/spark/pull/46599#discussion_r1603374449


##
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/joins.scala:
##
@@ -397,7 +398,11 @@ trait JoinSelectionHelper extends Logging {
 
   protected def hashJoinSupported
   (leftKeys: Seq[Expression], rightKeys: Seq[Expression]): Boolean = {
-val result = leftKeys.concat(rightKeys).forall(e => 
UnsafeRowUtils.isBinaryStable(e.dataType))
+def hasStableCollationKey(e: Expression): Boolean =
+  e.dataType.isInstanceOf[StringType] || 
e.dataType.isInstanceOf[BinaryType]
+def isHashJoinSupported(e: Expression): Boolean = 
UnsafeRowUtils.isBinaryStable(e.dataType) ||

Review Comment:
   Why do you need this `hasStableCollationKey`? `isBinaryStable` will return 
true for binary, so things should work just work, right?
   
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

Re: [PR] [WIP][SPARK-48000][SQL] Enable hash join support for all collations (StringType) [spark]

2024-05-16 Thread via GitHub



dbatomic commented on code in PR #46599:
URL: https://github.com/apache/spark/pull/46599#discussion_r1603367540


##
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/CollationKey.scala:
##
@@ -0,0 +1,59 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.catalyst.expressions
+
+import org.apache.spark.sql.catalyst.expressions.codegen.{CodegenContext, 
ExprCode}
+import org.apache.spark.sql.catalyst.util.CollationFactory
+import org.apache.spark.sql.internal.types.StringTypeAnyCollation
+import org.apache.spark.sql.types._
+import org.apache.spark.unsafe.types.UTF8String
+
+case class CollationKey(expr: Expression) extends UnaryExpression with 
ExpectsInputTypes {
+  override def inputTypes: Seq[AbstractDataType] = Seq(StringTypeAnyCollation)
+  override def dataType: DataType = BinaryType
+
+  final lazy val collationId: Int = expr.dataType match {
+case st: StringType =>
+  st.collationId
+  }
+
+  override def nullSafeEval(input: Any): Any = input match {
+case str: UTF8String =>
+  CollationFactory.getCollationKeyBytes(str, collationId)
+case _ =>
+  None
+  }
+
+  override protected def doGenCode(ctx: CodegenContext, ev: ExprCode): 
ExprCode = {
+val collation = CollationFactory.fetchCollation(collationId)
+if (collation.supportsBinaryEquality) {
+  defineCodeGen(ctx, ev, c => s"$c.getBytes()")
+} else if (collation.supportsLowercaseEquality) {
+  defineCodeGen(ctx, ev, c => s"$c.toLowerCase().getBytes()")
+} else {
+  defineCodeGen(ctx, ev, c => s"CollationFactory.fetchCollation" +

Review Comment:
   You can use `getCollationKeyBytes` here?
   Also, IMO, you can just remove `getCollationKeyBytes` from collation factory 
and add the logic into `nullSafeEval`.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

Re: [PR] [SPARK-48213][SQL] Do not push down predicate if non-cheap expression exceed reused limit [spark]

2024-05-16 Thread via GitHub



cloud-fan commented on PR #46499:
URL: https://github.com/apache/spark/pull/46499#issuecomment-2115258124

   I think https://github.com/apache/spark/pull/45802#issuecomment-2101762336 
is a better idea.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

Re: [PR] [SPARK-48305][SQL] Add collation support for CurrentLike expressions [spark]

2024-05-16 Thread via GitHub



uros-db commented on PR #46613:
URL: https://github.com/apache/spark/pull/46613#issuecomment-2115241390

   @cloud-fan ready for review


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

Re: [PR] [SPARK-48301][SQL] Rename `CREATE_FUNC_WITH_IF_NOT_EXISTS_AND_REPLACE` to `CREATE_ROUTINE_WITH_IF_NOT_EXISTS_AND_REPLACE` [spark]

2024-05-16 Thread via GitHub



zhengruifeng commented on PR #46608:
URL: https://github.com/apache/spark/pull/46608#issuecomment-2115184716

   merged to master


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

Re: [PR] [SPARK-48301][SQL] Rename `CREATE_FUNC_WITH_IF_NOT_EXISTS_AND_REPLACE` to `CREATE_ROUTINE_WITH_IF_NOT_EXISTS_AND_REPLACE` [spark]

2024-05-16 Thread via GitHub



zhengruifeng closed pull request #46608: [SPARK-48301][SQL] Rename 
`CREATE_FUNC_WITH_IF_NOT_EXISTS_AND_REPLACE` to 
`CREATE_ROUTINE_WITH_IF_NOT_EXISTS_AND_REPLACE`
URL: https://github.com/apache/spark/pull/46608


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

1 2 >

1 - 100 of 135 matches

Mail list logo