Re: [PR] [SPARK-48906][SQL] Introduce `SHOW COLLATIONS LIKE ...` syntax to show all collations [spark]

via GitHub Mon, 22 Jul 2024 01:09:03 -0700


mihailom-db commented on code in PR #47364:
URL: https://github.com/apache/spark/pull/47364#discussion_r1686116407



##########
docs/sql-ref-ansi-compliance.md:
##########
@@ -442,6 +442,7 @@ Below is a list of all the keywords in Spark SQL.
 |CODEGEN|non-reserved|non-reserved|non-reserved|
 |COLLATE|reserved|non-reserved|reserved|
 |COLLATION|reserved|non-reserved|reserved|
+|COLLATIONS|reserved|non-reserved|reserved|

Review Comment:
   When we look at MySQL, they use `SHOW COLLATION`, PostgreSQL uses 
`pg_collation` catalog, so I would argue use of `COLLATION` keyword is 
sufficient, so let's keep this keyword free.



##########
sql/core/src/test/scala/org/apache/spark/sql/CollationSuite.scala:
##########
@@ -1465,4 +1465,24 @@ class CollationSuite extends DatasourceV2SQLBase with 
AdaptiveSparkPlanHelper {
       }
     }
   }
+
+  test("show collations") {
+    assert(sql("SHOW COLLATIONS").collect().length == 562)
+
+    checkAnswer(sql("SHOW COLLATIONS LIKE '*UTF8_BINARY*'"),
+      Row("UTF8_BINARY", "spark", "1.0", true, true, false))
+    checkAnswer(sql("SHOW COLLATIONS '*zh_Hant_HKG*'"),
+      Seq(Row("zh_Hant_HKG", "icu", "153.120.0.0", false, false, false),

Review Comment:
   This version seems weird, I would say it should contain 75.1 as the version 
of ICU library. @nikolamand-db do you have any more info on is this expected?



##########
common/unsafe/src/main/java/org/apache/spark/sql/catalyst/util/CollationFactory.java:
##########
@@ -91,7 +91,7 @@ public Optional<String> getVersion() {
   /**
    * Entry encapsulating all information about a collation.
    */
-  public static class Collation {
+  public static class Collation implements Comparable<Collation> {

Review Comment:
   Is this really needed? When do we want to order them? I would say list/table 
building should be deterministic and should always output collations in the 
same way. Also I would expect `UTF8_*` family collations to come in first as 
they represent OSS internal implementations.



##########
common/unsafe/src/main/java/org/apache/spark/sql/catalyst/util/CollationFactory.java:
##########
@@ -918,4 +967,8 @@ public static String getClosestSuggestionsOnInvalidName(
 
     return String.join(", ", suggestions);
   }
+
+  public static List<Collation> fetchAllCollations() {

Review Comment:
   This operation seems a bit too expensive. We always build the whole table 
and then do a like on it. We need to do something similar to show tables where 
we have the pattern, as we do not want to ask for collation information if like 
operator is not concerned with it. Also it feels pretty weird to have 
`fetchAllCollations` in many places like this. This is not as expensive as 
building the whole table, but we are also allocating ArrayList multiple times



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

Re: [PR] [SPARK-48906][SQL] Introduce `SHOW COLLATIONS LIKE ...` syntax to show all collations [spark]

Reply via email to