Re: [PR] [WIP][SPARK-47760][SPARK-47763][CONNECT][TESTS] Reenable Avro and Protobuf function doctests [spark]

2024-04-15 Thread via GitHub


HyukjinKwon commented on code in PR #46055:
URL: https://github.com/apache/spark/pull/46055#discussion_r1565268923


##
.github/workflows/build_python_connect.yml:
##
@@ -82,7 +82,11 @@ jobs:
   cp conf/log4j2.properties.template conf/log4j2.properties
   sed -i 's/rootLogger.level = info/rootLogger.level = warn/g' 
conf/log4j2.properties
   # Start a Spark Connect server
-  
PYTHONPATH="python/lib/pyspark.zip:python/lib/py4j-0.10.9.7-src.zip:$PYTHONPATH"
 ./sbin/start-connect-server.sh --driver-java-options 
"-Dlog4j.configurationFile=file:$GITHUB_WORKSPACE/conf/log4j2.properties" 
--jars `find connector/connect/server/target -name spark-connect*SNAPSHOT.jar`
+  
PYTHONPATH="python/lib/pyspark.zip:python/lib/py4j-0.10.9.7-src.zip:$PYTHONPATH"
 ./sbin/start-connect-server.sh \
+--driver-java-options 
"-Dlog4j.configurationFile=file:$GITHUB_WORKSPACE/conf/log4j2.properties" \
+--jars `find connector/connect/server/target -name 
spark-connect*SNAPSHOT.jar` \
+--jars `connector/protobuf/target -name 
spark-protobuf*SNAPSHOT.jar` \
+--jars `connector/avro/target -name spark-avro*SNAPSHOT.jar`
   # Make sure running Python workers that contains pyspark.core once. 
They will be reused.

Review Comment:
   ```suggestion
   --jars `find connector/connect/server/target -name 
spark-connect*SNAPSHOT.jar` \
   --jars `find connector/protobuf/target -name 
spark-protobuf*SNAPSHOT.jar` \
   --jars `find connector/avro/target -name spark-avro*SNAPSHOT.jar`
 # Make sure running Python workers that contains pyspark.core 
once. They will be reused.
   ```



Re: [PR] [SPARK-47851][CONNECT][DOCS] Document pyspark-connect package [spark]

2024-04-15 Thread via GitHub


HyukjinKwon commented on PR #46054:
URL: https://github.com/apache/spark/pull/46054#issuecomment-2055751081

   Locally tested.
   
   Merged to master.





Re: [PR] [SPARK-47851][CONNECT][DOCS] Document pyspark-connect package [spark]

2024-04-15 Thread via GitHub


HyukjinKwon closed pull request #46054: [SPARK-47851][CONNECT][DOCS] Document 
pyspark-connect package
URL: https://github.com/apache/spark/pull/46054





[PR] [SPARK-47855][CONNECT] Warn `spark.sql.execution.arrow.pyspark.fallback.enabled` in Connect [spark]

2024-04-15 Thread via GitHub


zhengruifeng opened a new pull request, #46056:
URL: https://github.com/apache/spark/pull/46056

   ### What changes were proposed in this pull request?
   Warn `spark.sql.execution.arrow.pyspark.fallback.enabled` in Connect
   
   
   ### Why are the changes needed?
   It doesn't take effect in Spark Connect.
   
   
   ### Does this PR introduce _any_ user-facing change?
   before:
   ```
   In [2]: spark.conf.set("spark.sql.execution.arrow.pyspark.fallback.enabled", 
True)
   
   In [3]:
   ```
   
   
   after:
   ```
   In [1]: spark.conf.set("spark.sql.execution.arrow.pyspark.fallback.enabled", 
True)
   /Users/ruifeng.zheng/Dev/spark/python/pyspark/sql/connect/conf.py:48: 
UserWarning: The SQL config 
'spark.sql.execution.arrow.pyspark.fallback.enabled' is NOT supported in Spark 
Connect
 warnings.warn(warn)
   
   In [2]:
   ```
   
   
   ### How was this patch tested?
   CI
   
   
   ### Was this patch authored or co-authored using generative AI tooling?
   no





Re: [PR] [SPARK-47359][SQL] Support TRANSLATE function to work with collated strings [spark]

2024-04-15 Thread via GitHub


mihailom-db commented on code in PR #45820:
URL: https://github.com/apache/spark/pull/45820#discussion_r1565304285


##
sql/core/src/test/scala/org/apache/spark/sql/CollationStringExpressionsSuite.scala:
##
@@ -120,6 +119,53 @@ class CollationStringExpressionsSuite
 }
 assert(collationMismatch.getErrorClass === "COLLATION_MISMATCH.EXPLICIT")
   }
+  test("TRANSLATE check result on explicitly collated string") {

Review Comment:
   Let's refactor this test to follow the style of other tests in the Suite.






Re: [PR] [SPARK-47567][SQL] Support LOCATE function to work with collated strings [spark]

2024-04-15 Thread via GitHub


mihailom-db commented on code in PR #45791:
URL: https://github.com/apache/spark/pull/45791#discussion_r1565315128


##
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CollationTypeCasts.scala:
##
@@ -47,7 +47,7 @@ object CollationTypeCasts extends TypeCoercionRule {
 
 case otherExpr @ (
   _: In | _: InSubquery | _: CreateArray | _: ArrayJoin | _: Concat | _: 
Greatest | _: Least |
-  _: Coalesce | _: BinaryExpression | _: ConcatWs) =>
+  _: Coalesce | _: BinaryExpression | _: ConcatWs | _: StringLocate) =>

Review Comment:
   We need to special-case this expression. Look at @nikolamand-db's PR (https://github.com/apache/spark/pull/45963/files) for an example test in CollationSuite. It is important that this resolution does not take into account parameters that should not be strings.
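   
   For illustration only, a minimal sketch of what such special-casing might look like; `collateToSingleType` stands in for the rule's existing collation-unification helper, and the helper name and surrounding structure are assumptions rather than the actual implementation:
   ```scala
   import org.apache.spark.sql.catalyst.expressions.{Expression, StringLocate}
   
   object StringLocateCollationCastSketch {
     // Assumed stand-in for the rule's helper that casts a set of string
     // expressions to one common collation.
     def collateToSingleType(exprs: Seq[Expression]): Seq[Expression] = exprs
   
     def apply(expr: Expression): Expression = expr match {
       // Only the two string-typed children take part in collation resolution;
       // the IntegerType `start` parameter is deliberately skipped.
       case locate @ StringLocate(substr, str, start) =>
         val Seq(newSubstr, newStr) = collateToSingleType(Seq(substr, str))
         locate.withNewChildren(Seq(newSubstr, newStr, start))
       case other => other
     }
   }
   ```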






Re: [PR] [SPARK-47566][SQL] Support SUBSTRING_INDEX function to work with collated strings [spark]

2024-04-15 Thread via GitHub


mihailom-db commented on code in PR #45725:
URL: https://github.com/apache/spark/pull/45725#discussion_r1565318876


##
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CollationTypeCasts.scala:
##
@@ -47,7 +47,7 @@ object CollationTypeCasts extends TypeCoercionRule {
 
 case otherExpr @ (
   _: In | _: InSubquery | _: CreateArray | _: ArrayJoin | _: Concat | _: 
Greatest | _: Least |
-  _: Coalesce | _: BinaryExpression | _: ConcatWs) =>
+  _: Coalesce | _: BinaryExpression | _: ConcatWs | _: SubstringIndex) =>

Review Comment:
   Please special-case it so that the IntegerType parameter is not taken into account.






Re: [PR] [SPARK-47409][SQL] Add support for collation for StringTrim type of functions/expressions [spark]

2024-04-15 Thread via GitHub


mihailom-db commented on PR #45749:
URL: https://github.com/apache/spark/pull/45749#issuecomment-2055969457

   Please add these expressions to the CollationTypeCasts rules.





Re: [PR] [SPARK-47850][SQL] Support `spark.sql.hive.convertInsertingUnpartitionedTable` [spark]

2024-04-15 Thread via GitHub


pan3793 commented on PR #46052:
URL: https://github.com/apache/spark/pull/46052#issuecomment-2056046628

   @LuciferYang @yaooqinn @dongjoon-hyun Thanks for reviewing and approving. Unfortunately, I found that the converted plan has the same pattern as the Hive table CTAS case, which results in "Once strategy's idempotence is broken for batch Post-Hoc Resolution" when `spark.sql.hive.convertInsertingUnpartitionedTable=false`. We need a flag to distinguish whether an `InsertIntoHiveTable` originates from conversion or from CTAS; https://github.com/apache/spark/pull/46052/commits/4dea7d0d9bcf4f4042900cce2958059641b2f36d introduces a TreeNodeTag to achieve that.
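   
   For readers unfamiliar with the mechanism, here is a hedged sketch of how a `TreeNodeTag` can carry such a flag; the tag name and surrounding rule code are illustrative assumptions, not the exact contents of the commit:
   ```scala
   import org.apache.spark.sql.catalyst.trees.TreeNodeTag
   
   object HiveInsertOriginSketch {
     // Records where an InsertIntoHiveTable node came from.
     val FROM_CTAS = TreeNodeTag[Unit]("from_ctas")
   }
   
   // At CTAS planning time (illustrative):
   //   insert.setTagValue(HiveInsertOriginSketch.FROM_CTAS, ())
   //
   // In the conversion rule, a tagged node is left alone so the Once batch
   // stays idempotent:
   //   case i: InsertIntoHiveTable if i.getTagValue(HiveInsertOriginSketch.FROM_CTAS).isEmpty =>
   //     convert(i)
   ```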





Re: [PR] [SPARK-47850][SQL] Support `spark.sql.hive.convertInsertingUnpartitionedTable` [spark]

2024-04-15 Thread via GitHub


pan3793 commented on PR #46052:
URL: https://github.com/apache/spark/pull/46052#issuecomment-2056061808

   Another approach I can imagine is changing `Batch("Post-Hoc Resolution", Once, ...)` to `Batch("Post-Hoc Resolution", FixedPoint(1), ...)` to skip the idempotence check.
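   
   For context, a hedged illustration of the difference (the rule-list name is an assumption): the rule executor re-runs a `Once` batch to verify idempotence, whereas `FixedPoint(1)` also performs a single pass but without that verification.
   ```scala
   // Current: single pass, plus the idempotence check that fails here.
   Batch("Post-Hoc Resolution", Once, postHocResolutionRules: _*)
   
   // Alternative: still a single pass, but the idempotence check is skipped.
   Batch("Post-Hoc Resolution", FixedPoint(1), postHocResolutionRules: _*)
   ```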





Re: [PR] [SPARK-47847][CORE] Deprecate spark.network.remoteReadNioBufferConversion [spark]

2024-04-15 Thread via GitHub


pan3793 commented on PR #46047:
URL: https://github.com/apache/spark/pull/46047#issuecomment-2056077002

   cc @dongjoon-hyun and @mridulm 





[PR] [WIP][SPARK-47584][SQL] SQL core: Migrate logWarn with variables to structured logging framework [spark]

2024-04-15 Thread via GitHub


panbingkun opened a new pull request, #46057:
URL: https://github.com/apache/spark/pull/46057

   
   
   ### What changes were proposed in this pull request?
   
   
   
   ### Why are the changes needed?
   
   
   
   ### Does this PR introduce _any_ user-facing change?
   
   
   
   ### How was this patch tested?
   
   
   
   ### Was this patch authored or co-authored using generative AI tooling?
   
   





Re: [PR] [SPARK-47413][SQL] - add support to substr/left/right for collations [Post-Refactor] [spark]

2024-04-15 Thread via GitHub


uros-db commented on code in PR #46040:
URL: https://github.com/apache/spark/pull/46040#discussion_r1565365506


##
sql/core/src/test/scala/org/apache/spark/sql/CollationStringExpressionsSuite.scala:
##
@@ -163,6 +163,169 @@ class CollationStringExpressionsSuite
 })
   }
 
+  test("substring check output type on explicitly collated string") {

Review Comment:
   Let's try to unify test naming
   
   e.g. "Support Substring string expression with collation"






Re: [PR] [SPARK-47413][SQL] - add support to substr/left/right for collations [Post-Refactor] [spark]

2024-04-15 Thread via GitHub


uros-db commented on code in PR #46040:
URL: https://github.com/apache/spark/pull/46040#discussion_r1565371395


##
sql/core/src/test/scala/org/apache/spark/sql/CollationStringExpressionsSuite.scala:
##
@@ -163,6 +163,169 @@ class CollationStringExpressionsSuite
 })
   }
 
+  test("substring check output type on explicitly collated string") {
+case class SubstringTestCase[R](args: Seq[String], collation: String, 
result: R)
+val checks = Seq(
+  SubstringTestCase(Seq("Spark", "2"), "UTF8_BINARY", "park"),
+  SubstringTestCase(Seq("Spark", "2"), "UTF8_BINARY_LCASE", "park")
+)
+checks.foreach(ct => {
+  val query = s"SELECT substr(collate('${ct.args.head}', 
'${ct.collation}')," +
+s" ${ct.args.tail.head})"
+  // Result & data type
+  checkAnswer(sql(query), Row(ct.result))
+  
assert(sql(query).schema.fields.head.dataType.sameType(StringType(ct.collation)))
+})
+  }
+
+  test("left/right/substr on implicitly collated string returns proper value 
and type") {
+case class QTestCase(query: String, collation: String, result: Row)
+val longString = "In the course of human events"
+val checks = Seq("utf8_binary_lcase", "utf8_binary", "unicode", 
"unicode_ci").flatMap(
+  c => Seq(
+QTestCase(s"select left(left('$longString' collate " + c + ", 5), 1)", 
c, Row("I")),
+QTestCase(s"select right(right('$longString' collate " + c + ", 5), 
1)", c, Row("s")),
+QTestCase(
+  s"select substr(substr('$longString' collate " + c + ", 4), 2)", c,
+  Row("he course of human events"))
+  )
+)
+
+checks.foreach { check =>
+  // Result & data type
+  checkAnswer(sql(check.query), check.result)
+  
assert(sql(check.query).schema.fields.head.dataType.sameType(StringType(check.collation)))
+}
+  }
+
+  test("left/right/substr on explicitly collated proper string returns proper 
value and type") {

Review Comment:
   With this, "substring check output type on explicitly collated string" seems 
unnecessary?






Re: [PR] [SPARK-47413][SQL] - add support to substr/left/right for collations [Post-Refactor] [spark]

2024-04-15 Thread via GitHub


uros-db commented on code in PR #46040:
URL: https://github.com/apache/spark/pull/46040#discussion_r1565372366


##
sql/core/src/test/scala/org/apache/spark/sql/CollationStringExpressionsSuite.scala:
##
@@ -163,6 +163,169 @@ class CollationStringExpressionsSuite
 })
   }
 
+  test("substring check output type on explicitly collated string") {
+case class SubstringTestCase[R](args: Seq[String], collation: String, 
result: R)
+val checks = Seq(
+  SubstringTestCase(Seq("Spark", "2"), "UTF8_BINARY", "park"),
+  SubstringTestCase(Seq("Spark", "2"), "UTF8_BINARY_LCASE", "park")
+)
+checks.foreach(ct => {
+  val query = s"SELECT substr(collate('${ct.args.head}', 
'${ct.collation}')," +
+s" ${ct.args.tail.head})"
+  // Result & data type
+  checkAnswer(sql(query), Row(ct.result))
+  
assert(sql(query).schema.fields.head.dataType.sameType(StringType(ct.collation)))
+})
+  }
+
+  test("left/right/substr on implicitly collated string returns proper value 
and type") {
+case class QTestCase(query: String, collation: String, result: Row)
+val longString = "In the course of human events"
+val checks = Seq("utf8_binary_lcase", "utf8_binary", "unicode", 
"unicode_ci").flatMap(
+  c => Seq(
+QTestCase(s"select left(left('$longString' collate " + c + ", 5), 1)", 
c, Row("I")),
+QTestCase(s"select right(right('$longString' collate " + c + ", 5), 
1)", c, Row("s")),
+QTestCase(
+  s"select substr(substr('$longString' collate " + c + ", 4), 2)", c,
+  Row("he course of human events"))
+  )
+)
+
+checks.foreach { check =>
+  // Result & data type
+  checkAnswer(sql(check.query), check.result)
+  
assert(sql(check.query).schema.fields.head.dataType.sameType(StringType(check.collation)))
+}
+  }
+
+  test("left/right/substr on explicitly collated proper string returns proper 
value and type") {
+case class QTestCase(query: String, collation: String, result: Row)

Review Comment:
   Instead of QTestCase, let's use meaningful names such as `SubstringTestCase`






Re: [PR] [SPARK-47845][SQL][PYTHON][CONNECT] Support Column type in split function for scala and python [spark]

2024-04-15 Thread via GitHub


CTCC1 commented on code in PR #46045:
URL: https://github.com/apache/spark/pull/46045#discussion_r1565134177


##
python/pyspark/sql/connect/functions/builtin.py:
##
@@ -2476,7 +2476,13 @@ def repeat(col: "ColumnOrName", n: Union["ColumnOrName", 
int]) -> Column:
 repeat.__doc__ = pysparkfuncs.repeat.__doc__
 
 
-def split(str: "ColumnOrName", pattern: str, limit: int = -1) -> Column:
+def split(
+str: "ColumnOrName",

Review Comment:
   We might want to consider renaming variables that shadow Python built-ins, such as `str`. It's annoying here that it breaks the usage of `isinstance(pattern, str)`, and I need to work around this by aliasing the Python builtin (see the workaround comment in the code). Given that renaming the variable would be a backwards-incompatible change (for user code that uses kwargs), I won't do that here.






Re: [PR] [SPARK-47274][PYTHON][SQL] Provide more useful context for PySpark DataFrame API errors [spark]

2024-04-15 Thread via GitHub


HyukjinKwon commented on PR #45377:
URL: https://github.com/apache/spark/pull/45377#issuecomment-2056191793

   Let's clarify why https://github.com/apache/spark/pull/45377#issuecomment-2041315119 happens before we move further. That shouldn't happen, from my understanding.
   
   If we go with the current approach, it would need more changes, e.g., `Column.substr`, because it takes a different set of arguments and types.





Re: [PR] [WIP][SPARK-47594] Connector module: Migrate logInfo with variables to structured logging framework [spark]

2024-04-15 Thread via GitHub


panbingkun commented on code in PR #46022:
URL: https://github.com/apache/spark/pull/46022#discussion_r1565393697


##
connector/connect/server/src/main/scala/org/apache/spark/sql/connect/service/SparkConnectServer.scala:
##
@@ -38,8 +39,8 @@ object SparkConnectServer extends Logging {
 SparkConnectService.server.getListenSockets.asScala.foreach { sa =>
   val isa = sa.asInstanceOf[InetSocketAddress]
   logInfo(
-s"Spark Connect server started at: " +
-  s"${isa.getAddress.getHostAddress}:${isa.getPort}")
+log"Spark Connect server started at: " +
+  log"${MDC(RPC_ADDRESS, 
isa.getAddress.getHostAddress)}:${MDC(PORT, isa.getPort)}")

Review Comment:
   I wonder whether calling it `HOST_PORT` would be more appropriate.
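   
   A hedged sketch of what that would look like; `HOST_PORT` is a proposed key here, not an existing `LogKey` entry:
   ```scala
   // Assuming LogKey.HOST_PORT were added, the address and port would render
   // as a single structured field instead of two separate MDC values.
   logInfo(
     log"Spark Connect server started at: " +
       log"${MDC(HOST_PORT, s"${isa.getAddress.getHostAddress}:${isa.getPort}")}")
   ```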






Re: [PR] [WIP][SPARK-47594] Connector module: Migrate logInfo with variables to structured logging framework [spark]

2024-04-15 Thread via GitHub


panbingkun commented on code in PR #46022:
URL: https://github.com/apache/spark/pull/46022#discussion_r1565404507


##
connector/kafka-0-10-sql/src/test/scala/org/apache/spark/sql/kafka010/KafkaTestUtils.scala:
##
@@ -70,7 +70,7 @@ class KafkaTestUtils(
   private val JAVA_AUTH_CONFIG = "java.security.auth.login.config"
 
   private val localHostNameForURI = Utils.localHostNameForURI()
-  logInfo(s"Local host name is $localHostNameForURI")
+  logInfo(log"Local host name is ${MDC(LogKey.URI, localHostNameForURI)}")

Review Comment:
   Because the LogKey `URI` clashes with the Java class of the same name, we will write it directly as `LogKey.XXX`.
   At the same time, we will delete the import:
   ```
   import org.apache.spark.internal.LogKey.ERROR
   ```






Re: [PR] [SPARK-47413][SQL] - add support to substr/left/right for collations [Post-Refactor] [spark]

2024-04-15 Thread via GitHub


GideonPotok commented on PR #46040:
URL: https://github.com/apache/spark/pull/46040#issuecomment-2056280116

   @uros-db I made all suggested changes. Please re-review. Thanks!





Re: [PR] [SPARK-47413][SQL] - add support to substr/left/right for collations [Post-Refactor] [spark]

2024-04-15 Thread via GitHub


uros-db commented on code in PR #46040:
URL: https://github.com/apache/spark/pull/46040#discussion_r1565412168


##
sql/core/src/test/scala/org/apache/spark/sql/CollationStringExpressionsSuite.scala:
##
@@ -163,6 +163,169 @@ class CollationStringExpressionsSuite
 })
   }
 
+  test("substring check output type on explicitly collated string") {
+case class SubstringTestCase[R](args: Seq[String], collation: String, 
result: R)
+val checks = Seq(
+  SubstringTestCase(Seq("Spark", "2"), "UTF8_BINARY", "park"),
+  SubstringTestCase(Seq("Spark", "2"), "UTF8_BINARY_LCASE", "park")
+)
+checks.foreach(ct => {
+  val query = s"SELECT substr(collate('${ct.args.head}', 
'${ct.collation}')," +
+s" ${ct.args.tail.head})"
+  // Result & data type
+  checkAnswer(sql(query), Row(ct.result))
+  
assert(sql(query).schema.fields.head.dataType.sameType(StringType(ct.collation)))
+})
+  }
+
+  test("left/right/substr on implicitly collated string returns proper value 
and type") {
+case class QTestCase(query: String, collation: String, result: Row)
+val longString = "In the course of human events"
+val checks = Seq("utf8_binary_lcase", "utf8_binary", "unicode", 
"unicode_ci").flatMap(
+  c => Seq(
+QTestCase(s"select left(left('$longString' collate " + c + ", 5), 1)", 
c, Row("I")),
+QTestCase(s"select right(right('$longString' collate " + c + ", 5), 
1)", c, Row("s")),
+QTestCase(
+  s"select substr(substr('$longString' collate " + c + ", 4), 2)", c,
+  Row("he course of human events"))
+  )
+)
+
+checks.foreach { check =>
+  // Result & data type
+  checkAnswer(sql(check.query), check.result)
+  
assert(sql(check.query).schema.fields.head.dataType.sameType(StringType(check.collation)))
+}
+  }
+
+  test("left/right/substr on explicitly collated proper string returns proper 
value and type") {
+case class QTestCase(query: String, collation: String, result: Row)
+val checks = Seq("utf8_binary_lcase", "utf8_binary", "unicode", 
"unicode_ci").flatMap(
+  c => Seq(
+QTestCase("select left('abc' collate " + c + ", 1)", c, Row("a")),
+QTestCase("select right('def' collate " + c + ", 1)", c, Row("f")),
+QTestCase("select substr('abc' collate " + c + ", 2)", c, Row("bc")),
+QTestCase("select substr('example' collate " + c + ", 0, 2)", c, 
Row("ex")),
+QTestCase("select substr('example' collate " + c + ", 1, 2)", c, 
Row("ex")),
+QTestCase("select substr('example' collate " + c + ", 0, 7)", c, 
Row("example")),
+QTestCase("select substr('example' collate " + c + ", 1, 7)", c, 
Row("example")),
+QTestCase("select substr('example' collate " + c + ", 0, 100)", c, 
Row("example")),
+QTestCase("select substr('example' collate " + c + ", 1, 100)", c, 
Row("example")),
+QTestCase("select substr('example' collate " + c + ", 2, 2)", c, 
Row("xa")),
+QTestCase("select substr('example' collate " + c + ", 1, 6)", c, 
Row("exampl")),
+QTestCase("select substr('example' collate " + c + ", 2, 100)", c, 
Row("xample")),
+QTestCase("select substr('example' collate " + c + ", 0, 0)", c, 
Row("")),
+QTestCase("select substr('example' collate " + c + ", 100, 4)", c, 
Row("")),
+QTestCase("select substr('example' collate " + c + ", 0, 100)", c, 
Row("example")),
+QTestCase("select substr('example' collate " + c + ", 1, 100)", c, 
Row("example")),
+QTestCase("select substr('example' collate " + c + ", 2, 100)", c, 
Row("xample")),
+QTestCase("select substr('example' collate " + c + ", -3, 2)", c, 
Row("pl")),
+QTestCase("select substr('example' collate " + c + ", -100, 4)", c, 
Row("")),
+QTestCase("select substr('example' collate " + c + ", -2147483648, 
6)", c, Row("")),
+QTestCase("select substr(' a世a ' collate " + c + ", 2, 3)", c, 
Row("a世a")), // scalastyle:ignore
+QTestCase("select left(' a世a ' collate " + c + ", 3)", c, Row(" a世")), 
// scalastyle:ignore
+QTestCase("select right(' a世a ' collate " + c + ", 3)", c, Row("世a 
")), // scalastyle:ignore
+QTestCase("select substr('AaAaAaAa00' collate " + c + ", 2, 3)", 
c, Row("aAa")),
+QTestCase("select left('AaAaAaAa00' collate " + c + ", 3)", c, 
Row("AaA")),
+QTestCase("select right('AaAaAaAa00' collate " + c + ", 3)", c, 
Row("000")),
+QTestCase("select substr('' collate " + c + ", 1, 1)", c, Row("")),
+QTestCase("select left('' collate " + c + ", 1)", c, Row("")),
+QTestCase("select right('' collate " + c + ", 1)", c, Row("")),
+QTestCase("select left('ghi' collate " + c + ", 1)", c, Row("g"))

Review Comment:
   I don't think we need this many test cases here: if you didn't modify the way Substring/Left/Right expressions behave when given collated strings (i.e. you didn't introduce any collation awareness to nullSafeEval/doCodeGen), then there should be no need to go this deep - a couple of test cases should do the trick just fine.

Re: [PR] [SPARK-47413][SQL] - add support to substr/left/right for collations [Post-Refactor] [spark]

2024-04-15 Thread via GitHub


uros-db commented on code in PR #46040:
URL: https://github.com/apache/spark/pull/46040#discussion_r1565417498


##
sql/core/src/test/scala/org/apache/spark/sql/CollationStringExpressionsSuite.scala:
##
@@ -163,6 +163,155 @@ class CollationStringExpressionsSuite
 })
   }
 
+  test("Support Left/Right/Substr with implicit collation") {
+case class SubstringTestCase(query: String, collation: String, result: Row)
+val longString = "In the course of human events"
+val checks = Seq("utf8_binary_lcase", "utf8_binary", "unicode", 
"unicode_ci").flatMap(
+  c => Seq(
+SubstringTestCase(s"select left(left('$longString' collate " + c + ", 
5), 1)", c, Row("I")),
+SubstringTestCase(
+  s"select right(right('$longString' collate " + c + ", 5), 1)", c, 
Row("s")),
+SubstringTestCase(
+  s"select substr(substr('$longString' collate " + c + ", 4), 2)", c,
+  Row("he course of human events"))
+  )
+)
+
+checks.foreach { check =>
+  // Result & data type
+  checkAnswer(sql(check.query), check.result)
+  
assert(sql(check.query).schema.fields.head.dataType.sameType(StringType(check.collation)))
+}
+  }
+
+  test("Support Left/Right/Substr with explicit proper collation") {
+case class SubstringTestCase(query: String, collation: String, result: Row)
+val checks = Seq("utf8_binary_lcase", "utf8_binary", "unicode", 
"unicode_ci").flatMap(
+  c => Seq(
+SubstringTestCase("select left('abc' collate " + c + ", 1)", c, 
Row("a")),
+SubstringTestCase("select right('def' collate " + c + ", 1)", c, 
Row("f")),
+SubstringTestCase("select substr('abc' collate " + c + ", 2)", c, 
Row("bc")),
+SubstringTestCase("select substr('example' collate " + c + ", 0, 2)", 
c, Row("ex")),
+SubstringTestCase("select substr('example' collate " + c + ", 1, 2)", 
c, Row("ex")),
+SubstringTestCase("select substr('example' collate " + c + ", 0, 7)", 
c, Row("example")),
+SubstringTestCase("select substr('example' collate " + c + ", 1, 7)", 
c, Row("example")),
+SubstringTestCase("select substr('example' collate " + c + ", 0, 
100)", c, Row("example")),
+SubstringTestCase("select substr('example' collate " + c + ", 1, 
100)", c, Row("example")),
+SubstringTestCase("select substr('example' collate " + c + ", 2, 2)", 
c, Row("xa")),
+SubstringTestCase("select substr('example' collate " + c + ", 1, 6)", 
c, Row("exampl")),
+SubstringTestCase("select substr('example' collate " + c + ", 2, 
100)", c, Row("xample")),
+SubstringTestCase("select substr('example' collate " + c + ", 0, 0)", 
c, Row("")),
+SubstringTestCase("select substr('example' collate " + c + ", 100, 
4)", c, Row("")),
+SubstringTestCase("select substr('example' collate " + c + ", 0, 
100)", c, Row("example")),
+SubstringTestCase("select substr('example' collate " + c + ", 1, 
100)", c, Row("example")),
+SubstringTestCase("select substr('example' collate " + c + ", 2, 
100)", c, Row("xample")),
+SubstringTestCase("select substr('example' collate " + c + ", -3, 2)", 
c, Row("pl")),
+SubstringTestCase("select substr('example' collate " + c + ", -100, 
4)", c, Row("")),
+SubstringTestCase("select substr('example' collate " + c + ", 
-2147483648, 6)", c, Row("")),
+SubstringTestCase("select substr(' a世a ' collate " + c + ", 2, 3)", c, 
Row("a世a")), // scalastyle:ignore
+SubstringTestCase("select left(' a世a ' collate " + c + ", 3)", c, 
Row(" a世")), // scalastyle:ignore
+SubstringTestCase("select right(' a世a ' collate " + c + ", 3)", c, 
Row("世a ")), // scalastyle:ignore
+SubstringTestCase("select substr('AaAaAaAa00' collate " + c + ", 
2, 3)", c, Row("aAa")),
+SubstringTestCase("select left('AaAaAaAa00' collate " + c + ", 
3)", c, Row("AaA")),
+SubstringTestCase("select right('AaAaAaAa00' collate " + c + ", 
3)", c, Row("000")),
+SubstringTestCase("select substr('' collate " + c + ", 1, 1)", c, 
Row("")),
+SubstringTestCase("select left('' collate " + c + ", 1)", c, Row("")),
+SubstringTestCase("select right('' collate " + c + ", 1)", c, Row("")),
+SubstringTestCase("select left('ghi' collate " + c + ", 1)", c, 
Row("g"))

Review Comment:
   I don't think we need this many test cases here: if you didn't modify the way Substring/Left/Right expressions behave when given collated strings (i.e. you didn't introduce any collation awareness to nullSafeEval/doCodeGen), then there should be no need to go this deep - a couple of test cases should do the trick just fine.
   
   Also, I think these tests can be combined with the one above to make:
   `test("Support Left/Right/Substr with collation") {`
   
   so that we could have something like:
   ```
   checks.foreach { check =>
   // Result & data type (explicit collation)
   ...
   ```

Re: [PR] [SPARK-47413][SQL] - add support to substr/left/right for collations [Post-Refactor] [spark]

2024-04-15 Thread via GitHub


uros-db commented on code in PR #46040:
URL: https://github.com/apache/spark/pull/46040#discussion_r1565421124


##
sql/core/src/test/scala/org/apache/spark/sql/CollationStringExpressionsSuite.scala:
##
@@ -163,6 +163,155 @@ class CollationStringExpressionsSuite
 })
   }
 
+  test("Support Left/Right/Substr with implicit collation") {
+case class SubstringTestCase(query: String, collation: String, result: Row)
+val longString = "In the course of human events"
+val checks = Seq("utf8_binary_lcase", "utf8_binary", "unicode", 
"unicode_ci").flatMap(
+  c => Seq(
+SubstringTestCase(s"select left(left('$longString' collate " + c + ", 
5), 1)", c, Row("I")),
+SubstringTestCase(
+  s"select right(right('$longString' collate " + c + ", 5), 1)", c, 
Row("s")),
+SubstringTestCase(
+  s"select substr(substr('$longString' collate " + c + ", 4), 2)", c,
+  Row("he course of human events"))
+  )
+)
+
+checks.foreach { check =>
+  // Result & data type
+  checkAnswer(sql(check.query), check.result)
+  
assert(sql(check.query).schema.fields.head.dataType.sameType(StringType(check.collation)))
+}
+  }
+
+  test("Support Left/Right/Substr with explicit proper collation") {
+case class SubstringTestCase(query: String, collation: String, result: Row)
+val checks = Seq("utf8_binary_lcase", "utf8_binary", "unicode", 
"unicode_ci").flatMap(
+  c => Seq(
+SubstringTestCase("select left('abc' collate " + c + ", 1)", c, 
Row("a")),
+SubstringTestCase("select right('def' collate " + c + ", 1)", c, 
Row("f")),
+SubstringTestCase("select substr('abc' collate " + c + ", 2)", c, 
Row("bc")),
+SubstringTestCase("select substr('example' collate " + c + ", 0, 2)", 
c, Row("ex")),
+SubstringTestCase("select substr('example' collate " + c + ", 1, 2)", 
c, Row("ex")),
+SubstringTestCase("select substr('example' collate " + c + ", 0, 7)", 
c, Row("example")),
+SubstringTestCase("select substr('example' collate " + c + ", 1, 7)", 
c, Row("example")),
+SubstringTestCase("select substr('example' collate " + c + ", 0, 
100)", c, Row("example")),
+SubstringTestCase("select substr('example' collate " + c + ", 1, 
100)", c, Row("example")),
+SubstringTestCase("select substr('example' collate " + c + ", 2, 2)", 
c, Row("xa")),
+SubstringTestCase("select substr('example' collate " + c + ", 1, 6)", 
c, Row("exampl")),
+SubstringTestCase("select substr('example' collate " + c + ", 2, 
100)", c, Row("xample")),
+SubstringTestCase("select substr('example' collate " + c + ", 0, 0)", 
c, Row("")),
+SubstringTestCase("select substr('example' collate " + c + ", 100, 
4)", c, Row("")),
+SubstringTestCase("select substr('example' collate " + c + ", 0, 
100)", c, Row("example")),
+SubstringTestCase("select substr('example' collate " + c + ", 1, 
100)", c, Row("example")),
+SubstringTestCase("select substr('example' collate " + c + ", 2, 
100)", c, Row("xample")),
+SubstringTestCase("select substr('example' collate " + c + ", -3, 2)", 
c, Row("pl")),
+SubstringTestCase("select substr('example' collate " + c + ", -100, 
4)", c, Row("")),
+SubstringTestCase("select substr('example' collate " + c + ", 
-2147483648, 6)", c, Row("")),
+SubstringTestCase("select substr(' a世a ' collate " + c + ", 2, 3)", c, 
Row("a世a")), // scalastyle:ignore
+SubstringTestCase("select left(' a世a ' collate " + c + ", 3)", c, 
Row(" a世")), // scalastyle:ignore
+SubstringTestCase("select right(' a世a ' collate " + c + ", 3)", c, 
Row("世a ")), // scalastyle:ignore
+SubstringTestCase("select substr('AaAaAaAa00' collate " + c + ", 
2, 3)", c, Row("aAa")),
+SubstringTestCase("select left('AaAaAaAa00' collate " + c + ", 
3)", c, Row("AaA")),
+SubstringTestCase("select right('AaAaAaAa00' collate " + c + ", 
3)", c, Row("000")),
+SubstringTestCase("select substr('' collate " + c + ", 1, 1)", c, 
Row("")),
+SubstringTestCase("select left('' collate " + c + ", 1)", c, Row("")),
+SubstringTestCase("select right('' collate " + c + ", 1)", c, Row("")),
+SubstringTestCase("select left('ghi' collate " + c + ", 1)", c, 
Row("g"))
+  )
+)
+
+checks.foreach { check =>
+  // Result & data type
+  checkAnswer(sql(check.query), check.result)
+  
assert(sql(check.query).schema.fields.head.dataType.sameType(StringType(check.collation)))
+}
+  }
+
+  test("Support Left/Right/Substr with explicit improper collation") {
+case class SubstringTestCase(query: String, collation: String, result: Row)
+val checks = Seq("utf8_binary_lcase", "utf8_binary", "unicode", 
"unicode_ci").flatMap(
+  c => Seq(
+SubstringTestCase("select left(null collate " + c + ", 1)", c, 
Row(null)),
+SubstringTestCase(

Re: [PR] [SPARK-47594] Connector module: Migrate logInfo with variables to structured logging framework [spark]

2024-04-15 Thread via GitHub


panbingkun commented on PR #46022:
URL: https://github.com/apache/spark/pull/46022#issuecomment-2056317368

   cc @gengliangwang 





Re: [PR] [SPARK-47413][SQL] - add support to substr/left/right for collations [Post-Refactor] [spark]

2024-04-15 Thread via GitHub


uros-db commented on code in PR #46040:
URL: https://github.com/apache/spark/pull/46040#discussion_r1565427397


##
sql/core/src/test/scala/org/apache/spark/sql/CollationStringExpressionsSuite.scala:
##
@@ -163,6 +163,155 @@ class CollationStringExpressionsSuite
 })
   }
 
+  test("Support Left/Right/Substr with implicit collation") {
+case class SubstringTestCase(query: String, collation: String, result: Row)
+val longString = "In the course of human events"
+val checks = Seq("utf8_binary_lcase", "utf8_binary", "unicode", 
"unicode_ci").flatMap(
+  c => Seq(
+SubstringTestCase(s"select left(left('$longString' collate " + c + ", 
5), 1)", c, Row("I")),
+SubstringTestCase(
+  s"select right(right('$longString' collate " + c + ", 5), 1)", c, 
Row("s")),
+SubstringTestCase(
+  s"select substr(substr('$longString' collate " + c + ", 4), 2)", c,
+  Row("he course of human events"))
+  )
+)
+
+checks.foreach { check =>
+  // Result & data type
+  checkAnswer(sql(check.query), check.result)
+  
assert(sql(check.query).schema.fields.head.dataType.sameType(StringType(check.collation)))
+}
+  }
+
+  test("Support Left/Right/Substr with explicit proper collation") {
+case class SubstringTestCase(query: String, collation: String, result: Row)
+val checks = Seq("utf8_binary_lcase", "utf8_binary", "unicode", 
"unicode_ci").flatMap(
+  c => Seq(
+SubstringTestCase("select left('abc' collate " + c + ", 1)", c, 
Row("a")),
+SubstringTestCase("select right('def' collate " + c + ", 1)", c, 
Row("f")),
+SubstringTestCase("select substr('abc' collate " + c + ", 2)", c, 
Row("bc")),
+SubstringTestCase("select substr('example' collate " + c + ", 0, 2)", 
c, Row("ex")),
+SubstringTestCase("select substr('example' collate " + c + ", 1, 2)", 
c, Row("ex")),
+SubstringTestCase("select substr('example' collate " + c + ", 0, 7)", 
c, Row("example")),
+SubstringTestCase("select substr('example' collate " + c + ", 1, 7)", 
c, Row("example")),
+SubstringTestCase("select substr('example' collate " + c + ", 0, 
100)", c, Row("example")),
+SubstringTestCase("select substr('example' collate " + c + ", 1, 
100)", c, Row("example")),
+SubstringTestCase("select substr('example' collate " + c + ", 2, 2)", 
c, Row("xa")),
+SubstringTestCase("select substr('example' collate " + c + ", 1, 6)", 
c, Row("exampl")),
+SubstringTestCase("select substr('example' collate " + c + ", 2, 
100)", c, Row("xample")),
+SubstringTestCase("select substr('example' collate " + c + ", 0, 0)", 
c, Row("")),
+SubstringTestCase("select substr('example' collate " + c + ", 100, 
4)", c, Row("")),
+SubstringTestCase("select substr('example' collate " + c + ", 0, 
100)", c, Row("example")),
+SubstringTestCase("select substr('example' collate " + c + ", 1, 
100)", c, Row("example")),
+SubstringTestCase("select substr('example' collate " + c + ", 2, 
100)", c, Row("xample")),
+SubstringTestCase("select substr('example' collate " + c + ", -3, 2)", 
c, Row("pl")),
+SubstringTestCase("select substr('example' collate " + c + ", -100, 
4)", c, Row("")),
+SubstringTestCase("select substr('example' collate " + c + ", 
-2147483648, 6)", c, Row("")),
+SubstringTestCase("select substr(' a世a ' collate " + c + ", 2, 3)", c, 
Row("a世a")), // scalastyle:ignore
+SubstringTestCase("select left(' a世a ' collate " + c + ", 3)", c, 
Row(" a世")), // scalastyle:ignore
+SubstringTestCase("select right(' a世a ' collate " + c + ", 3)", c, 
Row("世a ")), // scalastyle:ignore
+SubstringTestCase("select substr('AaAaAaAa00' collate " + c + ", 
2, 3)", c, Row("aAa")),
+SubstringTestCase("select left('AaAaAaAa00' collate " + c + ", 
3)", c, Row("AaA")),
+SubstringTestCase("select right('AaAaAaAa00' collate " + c + ", 
3)", c, Row("000")),
+SubstringTestCase("select substr('' collate " + c + ", 1, 1)", c, 
Row("")),
+SubstringTestCase("select left('' collate " + c + ", 1)", c, Row("")),
+SubstringTestCase("select right('' collate " + c + ", 1)", c, Row("")),
+SubstringTestCase("select left('ghi' collate " + c + ", 1)", c, 
Row("g"))
+  )
+)
+
+checks.foreach { check =>
+  // Result & data type
+  checkAnswer(sql(check.query), check.result)
+  
assert(sql(check.query).schema.fields.head.dataType.sameType(StringType(check.collation)))
+}
+  }
+
+  test("Support Left/Right/Substr with explicit improper collation") {
+case class SubstringTestCase(query: String, collation: String, result: Row)
+val checks = Seq("utf8_binary_lcase", "utf8_binary", "unicode", 
"unicode_ci").flatMap(
+  c => Seq(
+SubstringTestCase("select left(null collate " + c + ", 1)", c, 
Row(null)),
+SubstringTestCase(

Re: [PR] [SPARK-47411][SQL] Support INSTR & FIND_IN_SET functions to work with collated strings [spark]

2024-04-15 Thread via GitHub


uros-db commented on code in PR #45643:
URL: https://github.com/apache/spark/pull/45643#discussion_r1565482699


##
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/stringExpressions.scala:
##
@@ -974,15 +974,19 @@ case class StringTranslate(srcExpr: Expression, 
matchingExpr: Expression, replac
 case class FindInSet(left: Expression, right: Expression) extends 
BinaryExpression
 with ImplicitCastInputTypes with NullIntolerant {
 
-  override def inputTypes: Seq[AbstractDataType] = Seq(StringType, StringType)
+  final lazy val collationId: Int = 
left.dataType.asInstanceOf[StringType].collationId
+
+  override def inputTypes: Seq[AbstractDataType] =
+Seq(StringTypeAnyCollation, StringTypeAnyCollation)
 
-  override protected def nullSafeEval(word: Any, set: Any): Any =
-set.asInstanceOf[UTF8String].findInSet(word.asInstanceOf[UTF8String])
+  override protected def nullSafeEval(word: Any, set: Any): Any = {
+CollationSupport.FindInSet.
+  exec(word.asInstanceOf[UTF8String], set.asInstanceOf[UTF8String], 
collationId)
+  }
 
   override def doGenCode(ctx: CodegenContext, ev: ExprCode): ExprCode = {
-nullSafeCodeGen(ctx, ev, (word, set) =>
-  s"${ev.value} = $set.findInSet($word);"
-)
+nullSafeCodeGen(ctx, ev, (l, r) =>
+  s"${ev.value} = " + CollationSupport.FindInSet.genCode(l, r, 
collationId) + ";")

Review Comment:
   Did you try just using `defineCodeGen`?
   
   `defineCodeGen(ctx, ev, (word, set) => 
CollationSupport.FindInSet.genCode(word, set, collationId))`
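   
   For context, `defineCodeGen` takes a function that produces an expression-valued string and wraps it in the null-safe assignment itself, so the one-liner above should be equivalent to the manual form in the diff (a sketch based on how `BinaryExpression` defines these helpers):
   ```scala
   // Shorthand: defineCodeGen adds the `${ev.value} = ...;` assignment.
   defineCodeGen(ctx, ev, (word, set) =>
     CollationSupport.FindInSet.genCode(word, set, collationId))
   
   // Roughly expands to the statement-level form:
   nullSafeCodeGen(ctx, ev, (word, set) =>
     s"${ev.value} = ${CollationSupport.FindInSet.genCode(word, set, collationId)};")
   ```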






Re: [PR] [SPARK-47411][SQL] Support INSTR & FIND_IN_SET functions to work with collated strings [spark]

2024-04-15 Thread via GitHub


uros-db commented on code in PR #45643:
URL: https://github.com/apache/spark/pull/45643#discussion_r1565484403


##
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/stringExpressions.scala:
##
@@ -1346,20 +1350,24 @@ case class StringTrimRight(srcStr: Expression, trimStr: 
Option[Expression] = Non
 case class StringInstr(str: Expression, substr: Expression)
   extends BinaryExpression with ImplicitCastInputTypes with NullIntolerant {
 
+  final lazy val collationId: Int = 
left.dataType.asInstanceOf[StringType].collationId
+
   override def left: Expression = str
   override def right: Expression = substr
   override def dataType: DataType = IntegerType
-  override def inputTypes: Seq[DataType] = Seq(StringType, StringType)
+  override def inputTypes: Seq[AbstractDataType] =
+Seq(StringTypeAnyCollation, StringTypeAnyCollation)
 
   override def nullSafeEval(string: Any, sub: Any): Any = {
-string.asInstanceOf[UTF8String].indexOf(sub.asInstanceOf[UTF8String], 0) + 
1
+CollationSupport.IndexOf.
+  exec(string.asInstanceOf[UTF8String], sub.asInstanceOf[UTF8String], 0, 
collationId) + 1
   }
 
   override def prettyName: String = "instr"
 
   override def doGenCode(ctx: CodegenContext, ev: ExprCode): ExprCode = {
-defineCodeGen(ctx, ev, (l, r) =>
-  s"($l).indexOf($r, 0) + 1")
+  defineCodeGen(ctx, ev, (l, r) =>
+CollationSupport.IndexOf.genCode(l, r, 0, collationId) + " + 1")

Review Comment:
   perhaps we should use "string" & "sub" (instead of "l" & "r") to preserve 
the original naming






Re: [PR] [SPARK-47274][PYTHON][SQL] Provide more useful context for PySpark DataFrame API errors [spark]

2024-04-15 Thread via GitHub


cloud-fan commented on code in PR #45377:
URL: https://github.com/apache/spark/pull/45377#discussion_r1565491650


##
sql/core/src/main/scala/org/apache/spark/sql/Column.scala:
##
@@ -171,6 +171,29 @@ class Column(val expr: Expression) extends Logging {
 Column.fn(name, this, lit(other))
   }
 
+  /**
+   * A version of the `fn` method specifically designed for binary operations in PySpark
+   * that require logging information.
+   * This method is used when the operation involves another Column.
+   *
+   * @param name            The name of the operation to be performed.
+   * @param other           The value to be used in the operation, which will be converted to a
+   *                        Column if not already one.
+   * @param pysparkFragment A string representing the 'fragment' of the PySpark error context,
+   *                        typically indicates the name of PySpark function.
+   * @param pysparkCallSite A string representing the 'callSite' of the PySpark error context,
+   *                        providing the exact location within the PySpark code where the
+   *                        operation originated.
+   * @return A Column resulting from the operation.
+   */
+  private def fn(

Review Comment:
   @HyukjinKwon This probably can't cover all the cases, and we may need to add 
more overloads for certain functions that require non-expression parameters, 
but it shouldn't be many.
   
   I think it's better than using ThreadLocal which can be quite fragile to 
pass values between Python and JVM.
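
   For illustration only, such an overload could take a shape like the following; the parameter names come from the Javadoc above, while the body and exact signature are hypothetical (the diff truncates them):

   ```
   // Hypothetical sketch; the real signature is truncated in the diff above.
   private def fn(
       name: String,
       other: Any,
       pysparkFragment: String,
       pysparkCallSite: String): Column = {
     // A real implementation would record pysparkFragment/pysparkCallSite in
     // the error context; this sketch only delegates to the existing helper.
     Column.fn(name, this, lit(other))
   }
   ```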



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-47411][SQL] Support INSTR & FIND_IN_SET functions to work with collated strings [spark]

2024-04-15 Thread via GitHub


uros-db commented on code in PR #45643:
URL: https://github.com/apache/spark/pull/45643#discussion_r1565489630


##
sql/core/src/test/scala/org/apache/spark/sql/CollationStringExpressionsSuite.scala:
##
@@ -96,6 +95,107 @@ class CollationStringExpressionsSuite
 assert(collationMismatch.getErrorClass === "COLLATION_MISMATCH.EXPLICIT")
   }
 
+  test("INSTR check result on explicitly collated strings") {
+def testInStr(str: String, substr: String, collationId: Integer, expected: Integer): Unit = {
+  val string = Literal.create(str, StringType(collationId))
+  val substring = Literal.create(substr, StringType(collationId))
+
+  checkEvaluation(StringInstr(string, substring), expected)
+}

Review Comment:
   we shouldn't use unit tests like this, please take a look at the other tests 
in their appropriate suites:
   
   - CollationSupportSuite.java for unit tests related to newly introduced 
collation-aware support in CollationSupport.java
   - CollationStringExpressionsSuite.scala for e2e SQL tests that should be 
used in limited quantity
   
   read more in the refactor Jira ticket description notes:
   https://issues.apache.org/jira/browse/SPARK-47410
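
   For illustration, an e2e SQL test along those lines might look like this (the query and expected row are assumptions based on the cases above, not code from the PR):

   ```
   test("Support StringInstr string expression with collation") {
     checkAnswer(
       sql("SELECT instr(collate('aaads', 'UTF8_BINARY_LCASE'), collate('Aa', 'UTF8_BINARY_LCASE'))"),
       Row(1))
   }
   ```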



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-47411][SQL] Support INSTR & FIND_IN_SET functions to work with collated strings [spark]

2024-04-15 Thread via GitHub


uros-db commented on code in PR #45643:
URL: https://github.com/apache/spark/pull/45643#discussion_r1565490452


##
sql/core/src/test/scala/org/apache/spark/sql/CollationStringExpressionsSuite.scala:
##
@@ -96,6 +95,107 @@ class CollationStringExpressionsSuite
 assert(collationMismatch.getErrorClass === "COLLATION_MISMATCH.EXPLICIT")
   }
 
+  test("INSTR check result on explicitly collated strings") {
+def testInStr(str: String, substr: String, collationId: Integer, expected: Integer): Unit = {
+  val string = Literal.create(str, StringType(collationId))
+  val substring = Literal.create(substr, StringType(collationId))
+
+  checkEvaluation(StringInstr(string, substring), expected)
+}
+
+var collationId = CollationFactory.collationNameToId("UTF8_BINARY")
+testInStr("aaads", "Aa", collationId, 0)
+testInStr("aaaDs", "de", collationId, 0)
+testInStr("aaads", "ds", collationId, 4)
+testInStr("", "", collationId, 1)
+testInStr("", "", collationId, 0)
+// scalastyle:off
+testInStr("test大千世界X大千世界", "大千", collationId, 5)
+testInStr("test大千世界X大千世界", "界X", collationId, 8)
+// scalastyle:on
+
+collationId = CollationFactory.collationNameToId("UTF8_BINARY_LCASE")
+testInStr("aaads", "Aa", collationId, 1)
+testInStr("aaaDs", "de", collationId, 0)
+testInStr("aaaDs", "ds", collationId, 4)
+testInStr("", "", collationId, 1)
+testInStr("", "", collationId, 0)
+// scalastyle:off
+testInStr("test大千世界X大千世界", "大千", collationId, 5)
+testInStr("test大千世界X大千世界", "界x", collationId, 8)
+// scalastyle:on
+
+collationId = CollationFactory.collationNameToId("UNICODE")
+testInStr("aaads", "Aa", collationId, 0)
+testInStr("aaads", "aa", collationId, 1)
+testInStr("aaads", "de", collationId, 0)
+testInStr("", "", collationId, 1)
+testInStr("", "", collationId, 0)
+// scalastyle:off
+testInStr("test大千世界X大千世界", "界x", collationId, 0)
+testInStr("test大千世界X大千世界", "界X", collationId, 8)
+// scalastyle:on
+
+collationId = CollationFactory.collationNameToId("UNICODE_CI")
+testInStr("aaads", "AD", collationId, 3)
+testInStr("aaads", "dS", collationId, 4)
+// scalastyle:off
+testInStr("test大千世界X大千世界", "界x", collationId, 8)
+// scalastyle:on
+  }
+
+  test("FIND_IN_SET check result on explicitly collated strings") {
+def testFindInSet(word: String, set: String, collationId: Integer, expected: Integer): Unit = {
+  val w = Literal.create(word, StringType(collationId))
+  val s = Literal.create(set, StringType(collationId))
+
+  checkEvaluation(FindInSet(w, s), expected)
+}

Review Comment:
   we shouldn't use unit tests like this, please take a look at the other tests 
in their appropriate suites:
   
   - CollationSupportSuite.java for unit tests related to newly introduced 
collation-aware support in CollationSupport.java
   - CollationStringExpressionsSuite.scala for e2e SQL tests that should be 
used in limited quantity
   
   read more in the refactor Jira ticket description notes:
   https://issues.apache.org/jira/browse/SPARK-47410



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-47411][SQL] Support INSTR & FIND_IN_SET functions to work with collated strings [spark]

2024-04-15 Thread via GitHub


uros-db commented on code in PR #45643:
URL: https://github.com/apache/spark/pull/45643#discussion_r1565492245


##
sql/core/src/test/scala/org/apache/spark/sql/CollationStringExpressionsSuite.scala:
##
@@ -96,6 +95,107 @@ class CollationStringExpressionsSuite
 assert(collationMismatch.getErrorClass === "COLLATION_MISMATCH.EXPLICIT")
   }
 
+  test("INSTR check result on explicitly collated strings") {
+def testInStr(str: String, substr: String, collationId: Integer, expected: Integer): Unit = {
+  val string = Literal.create(str, StringType(collationId))
+  val substring = Literal.create(substr, StringType(collationId))
+
+  checkEvaluation(StringInstr(string, substring), expected)
+}
+
+var collationId = CollationFactory.collationNameToId("UTF8_BINARY")
+testInStr("aaads", "Aa", collationId, 0)
+testInStr("aaaDs", "de", collationId, 0)
+testInStr("aaads", "ds", collationId, 4)
+testInStr("", "", collationId, 1)
+testInStr("", "", collationId, 0)
+// scalastyle:off
+testInStr("test大千世界X大千世界", "大千", collationId, 5)
+testInStr("test大千世界X大千世界", "界X", collationId, 8)
+// scalastyle:on
+
+collationId = CollationFactory.collationNameToId("UTF8_BINARY_LCASE")
+testInStr("aaads", "Aa", collationId, 1)
+testInStr("aaaDs", "de", collationId, 0)
+testInStr("aaaDs", "ds", collationId, 4)
+testInStr("", "", collationId, 1)
+testInStr("", "", collationId, 0)
+// scalastyle:off
+testInStr("test大千世界X大千世界", "大千", collationId, 5)
+testInStr("test大千世界X大千世界", "界x", collationId, 8)
+// scalastyle:on
+
+collationId = CollationFactory.collationNameToId("UNICODE")
+testInStr("aaads", "Aa", collationId, 0)
+testInStr("aaads", "aa", collationId, 1)
+testInStr("aaads", "de", collationId, 0)
+testInStr("", "", collationId, 1)
+testInStr("", "", collationId, 0)
+// scalastyle:off
+testInStr("test大千世界X大千世界", "界x", collationId, 0)
+testInStr("test大千世界X大千世界", "界X", collationId, 8)
+// scalastyle:on
+
+collationId = CollationFactory.collationNameToId("UNICODE_CI")
+testInStr("aaads", "AD", collationId, 3)
+testInStr("aaads", "dS", collationId, 4)
+// scalastyle:off
+testInStr("test大千世界X大千世界", "界x", collationId, 8)
+// scalastyle:on

Review Comment:
   unit tests should be moved to CollationSupportSuite.java, following the 
appropriate format
   
   please adjust your tests
   
   also, add appropriate sql tests to CollationStringExpressionsSuite.scala



##
sql/core/src/test/scala/org/apache/spark/sql/CollationStringExpressionsSuite.scala:
##
@@ -96,6 +95,107 @@ class CollationStringExpressionsSuite
 assert(collationMismatch.getErrorClass === "COLLATION_MISMATCH.EXPLICIT")
   }
 
+  test("INSTR check result on explicitly collated strings") {
+def testInStr(str: String, substr: String, collationId: Integer, expected: Integer): Unit = {
+  val string = Literal.create(str, StringType(collationId))
+  val substring = Literal.create(substr, StringType(collationId))
+
+  checkEvaluation(StringInstr(string, substring), expected)
+}
+
+var collationId = CollationFactory.collationNameToId("UTF8_BINARY")
+testInStr("aaads", "Aa", collationId, 0)
+testInStr("aaaDs", "de", collationId, 0)
+testInStr("aaads", "ds", collationId, 4)
+testInStr("", "", collationId, 1)
+testInStr("", "", collationId, 0)
+// scalastyle:off
+testInStr("test大千世界X大千世界", "大千", collationId, 5)
+testInStr("test大千世界X大千世界", "界X", collationId, 8)
+// scalastyle:on
+
+collationId = CollationFactory.collationNameToId("UTF8_BINARY_LCASE")
+testInStr("aaads", "Aa", collationId, 1)
+testInStr("aaaDs", "de", collationId, 0)
+testInStr("aaaDs", "ds", collationId, 4)
+testInStr("", "", collationId, 1)
+testInStr("", "", collationId, 0)
+// scalastyle:off
+testInStr("test大千世界X大千世界", "大千", collationId, 5)
+testInStr("test大千世界X大千世界", "界x", collationId, 8)
+// scalastyle:on
+
+collationId = CollationFactory.collationNameToId("UNICODE")
+testInStr("aaads", "Aa", collationId, 0)
+testInStr("aaads", "aa", collationId, 1)
+testInStr("aaads", "de", collationId, 0)
+testInStr("", "", collationId, 1)
+testInStr("", "", collationId, 0)
+// scalastyle:off
+testInStr("test大千世界X大千世界", "界x", collationId, 0)
+testInStr("test大千世界X大千世界", "界X", collationId, 8)
+// scalastyle:on
+
+collationId = CollationFactory.collationNameToId("UNICODE_CI")
+testInStr("aaads", "AD", collationId, 3)
+testInStr("aaads", "dS", collationId, 4)
+// scalastyle:off
+testInStr("test大千世界X大千世界", "界x", collationId, 8)
+// scalastyle:on
+  }
+
+  test("FIND_IN_SET check result on explicitly collated strings") {
+def testFindInSet(word: String, set: String, collationId: Integer, expected: Integer): Unit = {
+  val w = Literal.create(word, StringType(collationId))

Re: [PR] [SPARK-47274][PYTHON][SQL] Provide more useful context for PySpark DataFrame API errors [spark]

2024-04-15 Thread via GitHub


cloud-fan commented on code in PR #45377:
URL: https://github.com/apache/spark/pull/45377#discussion_r1565491650


##
sql/core/src/main/scala/org/apache/spark/sql/Column.scala:
##
@@ -171,6 +171,29 @@ class Column(val expr: Expression) extends Logging {
 Column.fn(name, this, lit(other))
   }
 
+  /**
+   * A version of the `fn` method specifically designed for binary operations in PySpark
+   * that require logging information.
+   * This method is used when the operation involves another Column.
+   *
+   * @param name            The name of the operation to be performed.
+   * @param other           The value to be used in the operation, which will be converted to a
+   *                        Column if not already one.
+   * @param pysparkFragment A string representing the 'fragment' of the PySpark error context,
+   *                        typically indicates the name of PySpark function.
+   * @param pysparkCallSite A string representing the 'callSite' of the PySpark error context,
+   *                        providing the exact location within the PySpark code where the
+   *                        operation originated.
+   * @return A Column resulting from the operation.
+   */
+  private def fn(

Review Comment:
   @HyukjinKwon This probably can't cover all the cases, and we may need to add 
more overloads for certain functions that require non-expression parameters, 
but it shouldn't be many.
   
   I think it's better than using ThreadLocal which can be quite fragile to 
pass values between Python and JVM.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-47411][SQL] Support INSTR & FIND_IN_SET functions to work with collated strings [spark]

2024-04-15 Thread via GitHub


uros-db commented on code in PR #45643:
URL: https://github.com/apache/spark/pull/45643#discussion_r1565494692


##
sql/core/src/test/scala/org/apache/spark/sql/CollationStringExpressionsSuite.scala:
##
@@ -96,6 +95,107 @@ class CollationStringExpressionsSuite
 assert(collationMismatch.getErrorClass === "COLLATION_MISMATCH.EXPLICIT")
   }
 
+  test("INSTR check result on explicitly collated strings") {
+def testInStr(str: String, substr: String, collationId: Integer, expected: Integer): Unit = {
+  val string = Literal.create(str, StringType(collationId))
+  val substring = Literal.create(substr, StringType(collationId))
+
+  checkEvaluation(StringInstr(string, substring), expected)
+}
+
+var collationId = CollationFactory.collationNameToId("UTF8_BINARY")
+testInStr("aaads", "Aa", collationId, 0)
+testInStr("aaaDs", "de", collationId, 0)
+testInStr("aaads", "ds", collationId, 4)
+testInStr("", "", collationId, 1)
+testInStr("", "", collationId, 0)
+// scalastyle:off
+testInStr("test大千世界X大千世界", "大千", collationId, 5)
+testInStr("test大千世界X大千世界", "界X", collationId, 8)
+// scalastyle:on
+
+collationId = CollationFactory.collationNameToId("UTF8_BINARY_LCASE")
+testInStr("aaads", "Aa", collationId, 1)
+testInStr("aaaDs", "de", collationId, 0)
+testInStr("aaaDs", "ds", collationId, 4)
+testInStr("", "", collationId, 1)
+testInStr("", "", collationId, 0)
+// scalastyle:off
+testInStr("test大千世界X大千世界", "大千", collationId, 5)
+testInStr("test大千世界X大千世界", "界x", collationId, 8)
+// scalastyle:on
+
+collationId = CollationFactory.collationNameToId("UNICODE")
+testInStr("aaads", "Aa", collationId, 0)
+testInStr("aaads", "aa", collationId, 1)
+testInStr("aaads", "de", collationId, 0)
+testInStr("", "", collationId, 1)
+testInStr("", "", collationId, 0)
+// scalastyle:off
+testInStr("test大千世界X大千世界", "界x", collationId, 0)
+testInStr("test大千世界X大千世界", "界X", collationId, 8)
+// scalastyle:on
+
+collationId = CollationFactory.collationNameToId("UNICODE_CI")
+testInStr("aaads", "AD", collationId, 3)
+testInStr("aaads", "dS", collationId, 4)
+// scalastyle:off
+testInStr("test大千世界X大千世界", "界x", collationId, 8)
+// scalastyle:on
+  }
+
+  test("FIND_IN_SET check result on explicitly collated strings") {

Review Comment:
   Let's try to unify test naming
   
   e.g. `test("Support StringInstr string expression with collation")` in 
CollationStringExpressionsSuite.scala



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-47411][SQL] Support INSTR & FIND_IN_SET functions to work with collated strings [spark]

2024-04-15 Thread via GitHub


uros-db commented on code in PR #45643:
URL: https://github.com/apache/spark/pull/45643#discussion_r1565494692


##
sql/core/src/test/scala/org/apache/spark/sql/CollationStringExpressionsSuite.scala:
##
@@ -96,6 +95,107 @@ class CollationStringExpressionsSuite
 assert(collationMismatch.getErrorClass === "COLLATION_MISMATCH.EXPLICIT")
   }
 
+  test("INSTR check result on explicitly collated strings") {
+def testInStr(str: String, substr: String, collationId: Integer, expected: Integer): Unit = {
+  val string = Literal.create(str, StringType(collationId))
+  val substring = Literal.create(substr, StringType(collationId))
+
+  checkEvaluation(StringInstr(string, substring), expected)
+}
+
+var collationId = CollationFactory.collationNameToId("UTF8_BINARY")
+testInStr("aaads", "Aa", collationId, 0)
+testInStr("aaaDs", "de", collationId, 0)
+testInStr("aaads", "ds", collationId, 4)
+testInStr("", "", collationId, 1)
+testInStr("", "", collationId, 0)
+// scalastyle:off
+testInStr("test大千世界X大千世界", "大千", collationId, 5)
+testInStr("test大千世界X大千世界", "界X", collationId, 8)
+// scalastyle:on
+
+collationId = CollationFactory.collationNameToId("UTF8_BINARY_LCASE")
+testInStr("aaads", "Aa", collationId, 1)
+testInStr("aaaDs", "de", collationId, 0)
+testInStr("aaaDs", "ds", collationId, 4)
+testInStr("", "", collationId, 1)
+testInStr("", "", collationId, 0)
+// scalastyle:off
+testInStr("test大千世界X大千世界", "大千", collationId, 5)
+testInStr("test大千世界X大千世界", "界x", collationId, 8)
+// scalastyle:on
+
+collationId = CollationFactory.collationNameToId("UNICODE")
+testInStr("aaads", "Aa", collationId, 0)
+testInStr("aaads", "aa", collationId, 1)
+testInStr("aaads", "de", collationId, 0)
+testInStr("", "", collationId, 1)
+testInStr("", "", collationId, 0)
+// scalastyle:off
+testInStr("test大千世界X大千世界", "界x", collationId, 0)
+testInStr("test大千世界X大千世界", "界X", collationId, 8)
+// scalastyle:on
+
+collationId = CollationFactory.collationNameToId("UNICODE_CI")
+testInStr("aaads", "AD", collationId, 3)
+testInStr("aaads", "dS", collationId, 4)
+// scalastyle:off
+testInStr("test大千世界X大千世界", "界x", collationId, 8)
+// scalastyle:on
+  }
+
+  test("FIND_IN_SET check result on explicitly collated strings") {

Review Comment:
   Let's try to unify test naming
   
   e.g. `test("Support FindInSet string expression with collation")` in 
CollationStringExpressionsSuite.scala



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [WIP][SPARK-47818][CONNECT] Introduce plan cache in SparkConnectPlanner to improve performance of Analyze requests [spark]

2024-04-15 Thread via GitHub


vicennial commented on code in PR #46012:
URL: https://github.com/apache/spark/pull/46012#discussion_r1565497107


##
connector/connect/server/src/main/scala/org/apache/spark/sql/connect/service/SessionHolder.scala:
##
@@ -381,6 +405,53 @@ case class SessionHolder(userId: String, sessionId: String, session: SparkSessio
*/
   private[connect] val pythonAccumulator: Option[PythonAccumulator] =
 Try(session.sparkContext.collectionAccumulator[Array[Byte]]).toOption
+
+  /**
+   * Transform a relation into a logical plan, using the plan cache if enabled.
+   * The plan cache is enabled only if `spark.connect.session.planCache.maxSize` is greater than zero
+   * AND `spark.connect.session.planCache.enabled` is true.
+   * @param rel The relation to transform.
+   * @param cachePlan Whether to cache the result logical plan.
+   * @param transform Function to transform the relation into a logical plan.
+   * @return The logical plan.
+   */
+  private[connect] def usePlanCache(rel: proto.Relation, cachePlan: Boolean)(

Review Comment:
   @zhengruifeng Does point 3 extend to all DDL/DML commands?
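
   For context, a minimal sketch of the gating the Javadoc above describes (the config keys are quoted from the comment; the surrounding code is an assumption, not the PR's implementation):

   ```
   // Sketch only: the cache is consulted when both conditions hold.
   private def planCacheEnabled(session: SparkSession): Boolean = {
     val maxSize = session.conf.get("spark.connect.session.planCache.maxSize").toInt
     val enabled = session.conf.get("spark.connect.session.planCache.enabled").toBoolean
     maxSize > 0 && enabled
   }
   ```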



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-47411][SQL] Support INSTR & FIND_IN_SET functions to work with collated strings [spark]

2024-04-15 Thread via GitHub


uros-db commented on code in PR #45643:
URL: https://github.com/apache/spark/pull/45643#discussion_r1565494137


##
sql/core/src/test/scala/org/apache/spark/sql/CollationStringExpressionsSuite.scala:
##
@@ -96,6 +95,107 @@ class CollationStringExpressionsSuite
 assert(collationMismatch.getErrorClass === "COLLATION_MISMATCH.EXPLICIT")
   }
 
+  test("INSTR check result on explicitly collated strings") {

Review Comment:
   Let's try to unify test naming
   
   e.g. `test("Support StringInstr string expression with collation")`if the 
test is in CollationStringExpressionsSuite.scala



##
sql/core/src/test/scala/org/apache/spark/sql/CollationStringExpressionsSuite.scala:
##
@@ -96,6 +95,107 @@ class CollationStringExpressionsSuite
 assert(collationMismatch.getErrorClass === "COLLATION_MISMATCH.EXPLICIT")
   }
 
+  test("INSTR check result on explicitly collated strings") {
+def testInStr(str: String, substr: String, collationId: Integer, expected: Integer): Unit = {
+  val string = Literal.create(str, StringType(collationId))
+  val substring = Literal.create(substr, StringType(collationId))
+
+  checkEvaluation(StringInstr(string, substring), expected)
+}
+
+var collationId = CollationFactory.collationNameToId("UTF8_BINARY")
+testInStr("aaads", "Aa", collationId, 0)
+testInStr("aaaDs", "de", collationId, 0)
+testInStr("aaads", "ds", collationId, 4)
+testInStr("", "", collationId, 1)
+testInStr("", "", collationId, 0)
+// scalastyle:off
+testInStr("test大千世界X大千世界", "大千", collationId, 5)
+testInStr("test大千世界X大千世界", "界X", collationId, 8)
+// scalastyle:on
+
+collationId = CollationFactory.collationNameToId("UTF8_BINARY_LCASE")
+testInStr("aaads", "Aa", collationId, 1)
+testInStr("aaaDs", "de", collationId, 0)
+testInStr("aaaDs", "ds", collationId, 4)
+testInStr("", "", collationId, 1)
+testInStr("", "", collationId, 0)
+// scalastyle:off
+testInStr("test大千世界X大千世界", "大千", collationId, 5)
+testInStr("test大千世界X大千世界", "界x", collationId, 8)
+// scalastyle:on
+
+collationId = CollationFactory.collationNameToId("UNICODE")
+testInStr("aaads", "Aa", collationId, 0)
+testInStr("aaads", "aa", collationId, 1)
+testInStr("aaads", "de", collationId, 0)
+testInStr("", "", collationId, 1)
+testInStr("", "", collationId, 0)
+// scalastyle:off
+testInStr("test大千世界X大千世界", "界x", collationId, 0)
+testInStr("test大千世界X大千世界", "界X", collationId, 8)
+// scalastyle:on
+
+collationId = CollationFactory.collationNameToId("UNICODE_CI")
+testInStr("aaads", "AD", collationId, 3)
+testInStr("aaads", "dS", collationId, 4)
+// scalastyle:off
+testInStr("test大千世界X大千世界", "界x", collationId, 8)
+// scalastyle:on
+  }
+
+  test("FIND_IN_SET check result on explicitly collated strings") {

Review Comment:
   Let's try to unify test naming
   
   e.g. `test("Support FindInSet string expression with collation")` if the 
test is in CollationStringExpressionsSuite.scala



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[PR] [SPARK-47856][SQL] Oracle: Document Mapping Spark SQL Data Types from Oracle [spark]

2024-04-15 Thread via GitHub


yaooqinn opened a new pull request, #46059:
URL: https://github.com/apache/spark/pull/46059

   
   
   ### What changes were proposed in this pull request?
   
   
   This PR adds documentation for the mapping of Oracle data types to Spark SQL data types.
   
   ### Why are the changes needed?
   
   Documentation improvement.
   
   ### Does this PR introduce _any_ user-facing change?
   
   
   no
   ### How was this patch tested?
   
   doc build
   
   ### Was this patch authored or co-authored using generative AI tooling?
   no
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-47411][SQL] Support INSTR & FIND_IN_SET functions to work with collated strings [spark]

2024-04-15 Thread via GitHub


uros-db commented on code in PR #45643:
URL: https://github.com/apache/spark/pull/45643#discussion_r1565502312


##
common/unsafe/src/main/java/org/apache/spark/sql/catalyst/util/CollationSupport.java:
##
@@ -137,6 +137,76 @@ public static boolean execICU(final UTF8String l, final UTF8String r,
 }
   }
 
+  public static class FindInSet {
+    public static int exec(final UTF8String l, final UTF8String r, final int collationId) {
+      CollationFactory.Collation collation = CollationFactory.fetchCollation(collationId);
+      if (collation.supportsBinaryEquality) {
+        return execBinary(l, r);
+      } else if (collation.supportsLowercaseEquality) {
+        return execLowercase(l, r);
+      } else {
+        return execICU(l, r, collationId);
+      }
+    }
+    public static String genCode(final String l, final String r, final int collationId) {
+      CollationFactory.Collation collation = CollationFactory.fetchCollation(collationId);
+      String expr = "CollationSupport.FindInSet.exec";
+      if (collation.supportsBinaryEquality) {
+        return String.format(expr + "Binary(%s, %s)", l, r);
+      } else if (collation.supportsLowercaseEquality) {
+        return String.format(expr + "Lowercase(%s, %s)", l, r);
+      } else {
+        return String.format(expr + "ICU(%s, %s, %d)", l, r, collationId);
+      }
+    }
+    public static int execBinary(final UTF8String l, final UTF8String r) {
+      return r.findInSet(l);
+    }
+    public static int execLowercase(final UTF8String l, final UTF8String r) {
+      return r.toLowerCase().findInSet(l.toLowerCase());
+    }
+    public static int execICU(final UTF8String l, final UTF8String r,
+        final int collationId) {
+      return CollationAwareUTF8String.findInSet(l, r, collationId);
+    }
+  }
+
+  public static class IndexOf {

Review Comment:
   let's keep the naming uniform
   
   if the original expression is `case class StringInstr`, then this should 
probably be:
   `public static class StringInstr`



##
common/unsafe/src/main/java/org/apache/spark/sql/catalyst/util/CollationSupport.java:
##
@@ -137,6 +137,76 @@ public static boolean execICU(final UTF8String l, final UTF8String r,
 }
   }
 
+  public static class FindInSet {
+    public static int exec(final UTF8String l, final UTF8String r, final int collationId) {
+      CollationFactory.Collation collation = CollationFactory.fetchCollation(collationId);
+      if (collation.supportsBinaryEquality) {
+        return execBinary(l, r);
+      } else if (collation.supportsLowercaseEquality) {
+        return execLowercase(l, r);
+      } else {
+        return execICU(l, r, collationId);
+      }
+    }
+    public static String genCode(final String l, final String r, final int collationId) {
+      CollationFactory.Collation collation = CollationFactory.fetchCollation(collationId);
+      String expr = "CollationSupport.FindInSet.exec";
+      if (collation.supportsBinaryEquality) {
+        return String.format(expr + "Binary(%s, %s)", l, r);
+      } else if (collation.supportsLowercaseEquality) {
+        return String.format(expr + "Lowercase(%s, %s)", l, r);
+      } else {
+        return String.format(expr + "ICU(%s, %s, %d)", l, r, collationId);
+      }
+    }
+    public static int execBinary(final UTF8String l, final UTF8String r) {
+      return r.findInSet(l);
+    }
+    public static int execLowercase(final UTF8String l, final UTF8String r) {
+      return r.toLowerCase().findInSet(l.toLowerCase());
+    }
+    public static int execICU(final UTF8String l, final UTF8String r,
+        final int collationId) {
+      return CollationAwareUTF8String.findInSet(l, r, collationId);
+    }
+  }
+
+  public static class IndexOf {

Review Comment:
   let's keep the naming uniform
   
   if the original expression is `case class StringInstr`, then this should 
probably be:
   `public static class StringInstr`



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-47411][SQL] Support INSTR & FIND_IN_SET functions to work with collated strings [spark]

2024-04-15 Thread via GitHub


uros-db commented on code in PR #45643:
URL: https://github.com/apache/spark/pull/45643#discussion_r1565502312


##
common/unsafe/src/main/java/org/apache/spark/sql/catalyst/util/CollationSupport.java:
##
@@ -137,6 +137,76 @@ public static boolean execICU(final UTF8String l, final UTF8String r,
 }
   }
 
+  public static class FindInSet {
+    public static int exec(final UTF8String l, final UTF8String r, final int collationId) {
+      CollationFactory.Collation collation = CollationFactory.fetchCollation(collationId);
+      if (collation.supportsBinaryEquality) {
+        return execBinary(l, r);
+      } else if (collation.supportsLowercaseEquality) {
+        return execLowercase(l, r);
+      } else {
+        return execICU(l, r, collationId);
+      }
+    }
+    public static String genCode(final String l, final String r, final int collationId) {
+      CollationFactory.Collation collation = CollationFactory.fetchCollation(collationId);
+      String expr = "CollationSupport.FindInSet.exec";
+      if (collation.supportsBinaryEquality) {
+        return String.format(expr + "Binary(%s, %s)", l, r);
+      } else if (collation.supportsLowercaseEquality) {
+        return String.format(expr + "Lowercase(%s, %s)", l, r);
+      } else {
+        return String.format(expr + "ICU(%s, %s, %d)", l, r, collationId);
+      }
+    }
+    public static int execBinary(final UTF8String l, final UTF8String r) {
+      return r.findInSet(l);
+    }
+    public static int execLowercase(final UTF8String l, final UTF8String r) {
+      return r.toLowerCase().findInSet(l.toLowerCase());
+    }
+    public static int execICU(final UTF8String l, final UTF8String r,
+        final int collationId) {
+      return CollationAwareUTF8String.findInSet(l, r, collationId);
+    }
+  }
+
+  public static class IndexOf {

Review Comment:
   let's keep the naming uniform
   
   if the original expression is `case class StringInstr`, then this should 
probably be:
   `public static class StringInstr`



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-47411][SQL] Support INSTR & FIND_IN_SET functions to work with collated strings [spark]

2024-04-15 Thread via GitHub


uros-db commented on code in PR #45643:
URL: https://github.com/apache/spark/pull/45643#discussion_r1565508312


##
common/unsafe/src/main/java/org/apache/spark/sql/catalyst/util/CollationSupport.java:
##
@@ -137,6 +137,76 @@ public static boolean execICU(final UTF8String l, final UTF8String r,
 }
   }
 
+  public static class FindInSet {
+    public static int exec(final UTF8String l, final UTF8String r, final int collationId) {
+      CollationFactory.Collation collation = CollationFactory.fetchCollation(collationId);
+      if (collation.supportsBinaryEquality) {
+        return execBinary(l, r);
+      } else if (collation.supportsLowercaseEquality) {
+        return execLowercase(l, r);
+      } else {
+        return execICU(l, r, collationId);
+      }
+    }
+    public static String genCode(final String l, final String r, final int collationId) {
+      CollationFactory.Collation collation = CollationFactory.fetchCollation(collationId);
+      String expr = "CollationSupport.FindInSet.exec";
+      if (collation.supportsBinaryEquality) {
+        return String.format(expr + "Binary(%s, %s)", l, r);
+      } else if (collation.supportsLowercaseEquality) {
+        return String.format(expr + "Lowercase(%s, %s)", l, r);
+      } else {
+        return String.format(expr + "ICU(%s, %s, %d)", l, r, collationId);
+      }
+    }
+    public static int execBinary(final UTF8String l, final UTF8String r) {
+      return r.findInSet(l);
+    }
+    public static int execLowercase(final UTF8String l, final UTF8String r) {
+      return r.toLowerCase().findInSet(l.toLowerCase());
+    }
+    public static int execICU(final UTF8String l, final UTF8String r,
+        final int collationId) {
+      return CollationAwareUTF8String.findInSet(l, r, collationId);
+    }
+  }
+
+  public static class IndexOf {

Review Comment:
   `CollationAwareUTF8String.indexOf` should follow the `UTF8String.indexOf` 
naming, so that part is fine for example
   
   but `CollationSupport.StringInstr` should definitely follow 
`stringExpressions.StringInstr` naming



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-47411][SQL] Support INSTR & FIND_IN_SET functions to work with collated strings [spark]

2024-04-15 Thread via GitHub


uros-db commented on code in PR #45643:
URL: https://github.com/apache/spark/pull/45643#discussion_r1565510703


##
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/stringExpressions.scala:
##
@@ -974,15 +974,19 @@ case class StringTranslate(srcExpr: Expression, matchingExpr: Expression, replac
 case class FindInSet(left: Expression, right: Expression) extends BinaryExpression
     with ImplicitCastInputTypes with NullIntolerant {
 
-  override def inputTypes: Seq[AbstractDataType] = Seq(StringType, StringType)
+  final lazy val collationId: Int = left.dataType.asInstanceOf[StringType].collationId
+
+  override def inputTypes: Seq[AbstractDataType] =
+    Seq(StringTypeAnyCollation, StringTypeAnyCollation)
 
-  override protected def nullSafeEval(word: Any, set: Any): Any =
-    set.asInstanceOf[UTF8String].findInSet(word.asInstanceOf[UTF8String])
+  override protected def nullSafeEval(word: Any, set: Any): Any = {
+    CollationSupport.FindInSet.
+      exec(word.asInstanceOf[UTF8String], set.asInstanceOf[UTF8String], collationId)
+  }
 
   override def doGenCode(ctx: CodegenContext, ev: ExprCode): ExprCode = {
-    nullSafeCodeGen(ctx, ev, (word, set) =>
-      s"${ev.value} = $set.findInSet($word);"
-    )
+    nullSafeCodeGen(ctx, ev, (l, r) =>

Review Comment:
   perhaps we should use "word" & "set" (instead of "l" & "r") to preserve the 
original naming
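
   As a sketch, the rename would read (mirroring the diff above, not the merged code):

   ```
   override def doGenCode(ctx: CodegenContext, ev: ExprCode): ExprCode = {
     nullSafeCodeGen(ctx, ev, (word, set) =>
       s"${ev.value} = ${CollationSupport.FindInSet.genCode(word, set, collationId)};")
   }
   ```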



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-47411][SQL] Support INSTR & FIND_IN_SET functions to work with collated strings [spark]

2024-04-15 Thread via GitHub


uros-db commented on code in PR #45643:
URL: https://github.com/apache/spark/pull/45643#discussion_r1565515634


##
common/unsafe/src/main/java/org/apache/spark/sql/catalyst/util/CollationSupport.java:
##
@@ -169,6 +239,46 @@ private static boolean matchAt(final UTF8String target, final UTF8String pattern
 pos, pos + pattern.numChars()), pattern, collationId).last() == 0;
 }
 
+private static int findInSet(UTF8String match, UTF8String set, int collationId) {

Review Comment:
   these parameters should be `final`



##
common/unsafe/src/main/java/org/apache/spark/sql/catalyst/util/CollationSupport.java:
##
@@ -169,6 +239,46 @@ private static boolean matchAt(final UTF8String target, final UTF8String pattern
 pos, pos + pattern.numChars()), pattern, collationId).last() == 0;
 }
 
+private static int findInSet(UTF8String match, UTF8String set, int collationId) {
+  if (match.contains(UTF8String.fromString(","))) {
+    return 0;
+  }
+
+  StringSearch stringSearch = CollationFactory.getStringSearch(set, match, collationId);
+
+  String setString = set.toString();
+  int wordStart = 0;
+  while ((wordStart = stringSearch.next()) != StringSearch.DONE) {
+    boolean isValidStart = wordStart == 0 || setString.charAt(wordStart - 1) == ',';
+    boolean isValidEnd = wordStart + stringSearch.getMatchLength() == setString.length()
+        || setString.charAt(wordStart + stringSearch.getMatchLength()) == ',';
+
+    if (isValidStart && isValidEnd) {
+      int pos = 0;
+      for (int i = 0; i < setString.length() && i < wordStart; i++) {
+        if (setString.charAt(i) == ',') {
+          pos++;
+        }
+      }
+
+      return pos + 1;
+    }
+  }
+
+  return 0;
+}
+
+private static int indexOf(UTF8String target, UTF8String pattern, int start, int collationId) {

Review Comment:
   these parameters should be `final`
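
   For context, the behaviour the method above implements can be pinned down with a quick check (a sketch; values assumed, not from the PR's test suite):

   ```
   // 1-based index of the first comma-separated element matching under the
   // collation; 0 if there is no match or the word itself contains a comma.
   val collationId = CollationFactory.collationNameToId("UNICODE_CI")
   assert(CollationSupport.FindInSet.exec(
     UTF8String.fromString("B"), UTF8String.fromString("abc,b,ab,c,def"), collationId) == 2)
   ```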



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-47411][SQL] Support INSTR & FIND_IN_SET functions to work with collated strings [spark]

2024-04-15 Thread via GitHub


uros-db commented on code in PR #45643:
URL: https://github.com/apache/spark/pull/45643#discussion_r1565521994


##
common/unsafe/src/main/java/org/apache/spark/sql/catalyst/util/CollationSupport.java:
##
@@ -169,6 +239,46 @@ private static boolean matchAt(final UTF8String target, final UTF8String pattern
 pos, pos + pattern.numChars()), pattern, collationId).last() == 0;
 }
 
+private static int findInSet(UTF8String match, UTF8String set, int collationId) {
+  if (match.contains(UTF8String.fromString(","))) {
+    return 0;
+  }
+
+  StringSearch stringSearch = CollationFactory.getStringSearch(set, match, collationId);
+
+  String setString = set.toString();

Review Comment:
   I think `CollationFactory.getStringSearch(set, match, collationId)` already 
does set.toString()
   
   there's no need to do that twice, let's add:
   ```
 public static StringSearch getStringSearch(
 final String targetString,
 final String pattern,
 final int collationId) {
   CharacterIterator target = new StringCharacterIterator(targetString);
   Collator collator = CollationFactory.fetchCollation(collationId).collator;
   return new StringSearch(pattern, target, (RuleBasedCollator) collator);
 }
   ```
   to `CollationFactory`
   
   this way, we can get the `stringSearch` and `setString` instances here, 
without extra String allocations



##
common/unsafe/src/main/java/org/apache/spark/sql/catalyst/util/CollationSupport.java:
##
@@ -169,6 +239,46 @@ private static boolean matchAt(final UTF8String target, final UTF8String pattern
 pos, pos + pattern.numChars()), pattern, collationId).last() == 0;
 }
 
+private static int findInSet(UTF8String match, UTF8String set, int collationId) {
+  if (match.contains(UTF8String.fromString(","))) {
+    return 0;
+  }
+
+  StringSearch stringSearch = CollationFactory.getStringSearch(set, match, collationId);
+
+  String setString = set.toString();

Review Comment:
   I think `CollationFactory.getStringSearch(set, match, collationId)` already 
does set.toString()
   
   there's no need to do that twice, let's add:
   
   ```
 public static StringSearch getStringSearch(
 final String targetString,
 final String pattern,
 final int collationId) {
   CharacterIterator target = new StringCharacterIterator(targetString);
   Collator collator = CollationFactory.fetchCollation(collationId).collator;
   return new StringSearch(pattern, target, (RuleBasedCollator) collator);
 }
   ```
   
   to `CollationFactory`
   
   this way, we can get the `stringSearch` and `setString` instances here, 
without extra String allocations



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-47411][SQL] Support INSTR & FIND_IN_SET functions to work with collated strings [spark]

2024-04-15 Thread via GitHub


uros-db commented on code in PR #45643:
URL: https://github.com/apache/spark/pull/45643#discussion_r1565521994


##
common/unsafe/src/main/java/org/apache/spark/sql/catalyst/util/CollationSupport.java:
##
@@ -169,6 +239,46 @@ private static boolean matchAt(final UTF8String target, final UTF8String pattern
 pos, pos + pattern.numChars()), pattern, collationId).last() == 0;
 }
 
+private static int findInSet(UTF8String match, UTF8String set, int collationId) {
+  if (match.contains(UTF8String.fromString(","))) {
+    return 0;
+  }
+
+  StringSearch stringSearch = CollationFactory.getStringSearch(set, match, collationId);
+
+  String setString = set.toString();

Review Comment:
   I think `CollationFactory.getStringSearch(set, match, collationId)` already 
does set.toString()
   
   there's no need to do that twice, let's add:
   
   ```
 public static StringSearch getStringSearch(
 final String targetString,
 final String pattern,
 final int collationId) {
   CharacterIterator target = new StringCharacterIterator(targetString);
   Collator collator = CollationFactory.fetchCollation(collationId).collator;
   return new StringSearch(pattern, target, (RuleBasedCollator) collator);
 }
   ```
   
   to `CollationFactory`
   
   this way, we can get the `stringSearch` and `setString` instances here, 
without extra String allocations



##
common/unsafe/src/main/java/org/apache/spark/sql/catalyst/util/CollationSupport.java:
##
@@ -169,6 +239,46 @@ private static boolean matchAt(final UTF8String target, final UTF8String pattern
 pos, pos + pattern.numChars()), pattern, collationId).last() == 0;
 }
 
+private static int findInSet(UTF8String match, UTF8String set, int collationId) {
+  if (match.contains(UTF8String.fromString(","))) {
+    return 0;
+  }
+
+  StringSearch stringSearch = CollationFactory.getStringSearch(set, match, collationId);
+
+  String setString = set.toString();

Review Comment:
   I think `CollationFactory.getStringSearch(set, match, collationId)` already 
does set.toString()
   
   there's no need to do that twice, let's add:
   
   ```
 public static StringSearch getStringSearch(
 final String targetString,
 final String pattern,
 final int collationId) {
   CharacterIterator target = new StringCharacterIterator(targetString);
   Collator collator = CollationFactory.fetchCollation(collationId).collator;
   return new StringSearch(pattern, target, (RuleBasedCollator) collator);
 }
   ```
   
   to `CollationFactory`
   
   this way, we can get the `stringSearch` and `setString` instances here, 
without extra String allocations



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-47411][SQL] Support INSTR & FIND_IN_SET functions to work with collated strings [spark]

2024-04-15 Thread via GitHub


uros-db commented on code in PR #45643:
URL: https://github.com/apache/spark/pull/45643#discussion_r1565521994


##
common/unsafe/src/main/java/org/apache/spark/sql/catalyst/util/CollationSupport.java:
##
@@ -169,6 +239,46 @@ private static boolean matchAt(final UTF8String target, final UTF8String pattern
 pos, pos + pattern.numChars()), pattern, collationId).last() == 0;
 }
 
+private static int findInSet(UTF8String match, UTF8String set, int collationId) {
+  if (match.contains(UTF8String.fromString(","))) {
+    return 0;
+  }
+
+  StringSearch stringSearch = CollationFactory.getStringSearch(set, match, collationId);
+
+  String setString = set.toString();

Review Comment:
   I think `CollationFactory.getStringSearch(set, match, collationId)` already 
does set.toString()
   
   there's no need to do that twice, let's add:
   
   ```
 public static StringSearch getStringSearch(
 final String targetString,
 final String pattern,
 final int collationId) {
   CharacterIterator target = new StringCharacterIterator(targetString);
   Collator collator = CollationFactory.fetchCollation(collationId).collator;
   return new StringSearch(pattern, target, (RuleBasedCollator) collator);
 }
   ```
   
   to `CollationFactory`
   
   this way, we can get the `stringSearch` and `setString` instances here, 
without extra String allocations



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-47411][SQL] Support INSTR & FIND_IN_SET functions to work with collated strings [spark]

2024-04-15 Thread via GitHub


uros-db commented on code in PR #45643:
URL: https://github.com/apache/spark/pull/45643#discussion_r1565521994


##
common/unsafe/src/main/java/org/apache/spark/sql/catalyst/util/CollationSupport.java:
##
@@ -169,6 +239,46 @@ private static boolean matchAt(final UTF8String target, final UTF8String pattern
 pos, pos + pattern.numChars()), pattern, collationId).last() == 0;
 }
 
+private static int findInSet(UTF8String match, UTF8String set, int collationId) {
+  if (match.contains(UTF8String.fromString(","))) {
+    return 0;
+  }
+
+  StringSearch stringSearch = CollationFactory.getStringSearch(set, match, collationId);
+
+  String setString = set.toString();

Review Comment:
   I think `CollationFactory.getStringSearch(set, match, collationId)` already 
does set.toString()
   
   there's no need to do that twice, let's add:
   
   ```
 public static StringSearch getStringSearch(
 final String targetString,
 final String pattern,
 final int collationId) {
   CharacterIterator target = new StringCharacterIterator(targetString);
   Collator collator = CollationFactory.fetchCollation(collationId).collator;
   return new StringSearch(pattern, target, (RuleBasedCollator) collator);
 }
   ```
   
   to `CollationFactory`
   
   this way, we can get the `stringSearch` and `setString` instances here, 
without extra String allocations



##
common/unsafe/src/main/java/org/apache/spark/sql/catalyst/util/CollationSupport.java:
##
@@ -169,6 +239,46 @@ private static boolean matchAt(final UTF8String target, final UTF8String pattern
 pos, pos + pattern.numChars()), pattern, collationId).last() == 0;
 }
 
+private static int findInSet(UTF8String match, UTF8String set, int collationId) {
+  if (match.contains(UTF8String.fromString(","))) {
+    return 0;
+  }
+
+  StringSearch stringSearch = CollationFactory.getStringSearch(set, match, collationId);
+
+  String setString = set.toString();

Review Comment:
   I think `CollationFactory.getStringSearch(set, match, collationId)` already 
does `set.toString()`
   
   there's no need to do that twice, let's add:
   
   ```
 public static StringSearch getStringSearch(
 final String targetString,
 final String pattern,
 final int collationId) {
   CharacterIterator target = new StringCharacterIterator(targetString);
   Collator collator = CollationFactory.fetchCollation(collationId).collator;
   return new StringSearch(pattern, target, (RuleBasedCollator) collator);
 }
   ```
   
   to `CollationFactory`
   
   this way, we can get the `stringSearch` and `setString` instances here, without 
extra String allocations



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-47412][SQL] Add Collation Support for LPad/RPad. [spark]

2024-04-15 Thread via GitHub


GideonPotok commented on PR #46041:
URL: https://github.com/apache/spark/pull/46041#issuecomment-2056451989

   @uros-db  please review this one, too.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-47411][SQL] Support INSTR & FIND_IN_SET functions to work with collated strings [spark]

2024-04-15 Thread via GitHub


uros-db commented on code in PR #45643:
URL: https://github.com/apache/spark/pull/45643#discussion_r1565508312


##
common/unsafe/src/main/java/org/apache/spark/sql/catalyst/util/CollationSupport.java:
##
@@ -137,6 +137,76 @@ public static boolean execICU(final UTF8String l, final UTF8String r,
 }
   }
 
+  public static class FindInSet {
+    public static int exec(final UTF8String l, final UTF8String r, final int collationId) {
+      CollationFactory.Collation collation = CollationFactory.fetchCollation(collationId);
+      if (collation.supportsBinaryEquality) {
+        return execBinary(l, r);
+      } else if (collation.supportsLowercaseEquality) {
+        return execLowercase(l, r);
+      } else {
+        return execICU(l, r, collationId);
+      }
+    }
+    public static String genCode(final String l, final String r, final int collationId) {
+      CollationFactory.Collation collation = CollationFactory.fetchCollation(collationId);
+      String expr = "CollationSupport.FindInSet.exec";
+      if (collation.supportsBinaryEquality) {
+        return String.format(expr + "Binary(%s, %s)", l, r);
+      } else if (collation.supportsLowercaseEquality) {
+        return String.format(expr + "Lowercase(%s, %s)", l, r);
+      } else {
+        return String.format(expr + "ICU(%s, %s, %d)", l, r, collationId);
+      }
+    }
+    public static int execBinary(final UTF8String l, final UTF8String r) {
+      return r.findInSet(l);
+    }
+    public static int execLowercase(final UTF8String l, final UTF8String r) {
+      return r.toLowerCase().findInSet(l.toLowerCase());
+    }
+    public static int execICU(final UTF8String l, final UTF8String r,
+        final int collationId) {
+      return CollationAwareUTF8String.findInSet(l, r, collationId);
+    }
+  }
+
+  public static class IndexOf {

Review Comment:
   `CollationAwareUTF8String.indexOf` should follow the `UTF8String.indexOf` 
naming, so that part is fine for example
   
   but `CollationSupport.StringInstr` should definitely follow 
`stringExpressions.StringInstr` naming



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-47413][SQL] - add support to substr/left/right for collations [Post-Refactor] [spark]

2024-04-15 Thread via GitHub


GideonPotok closed pull request #46039: [SPARK-47413][SQL] - add support to substr/left/right for collations [Post-Refactor]
URL: https://github.com/apache/spark/pull/46039


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-47413][SQL] - add support to substr/left/right for collations [Post-Refactor] [spark]

2024-04-15 Thread via GitHub


GideonPotok commented on PR #46040:
URL: https://github.com/apache/spark/pull/46040#issuecomment-2056457118

   > What's the diff w/ #46039?
   
   @HyukjinKwon Sorry for the confusion. I closed the other one to make clear 
which one to review.
   
   (To answer your question, the other one did not have the unit test over a 
struct field, because in the past I have run into GHA test flakiness when 
using withTable expressions.)


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-47411][SQL] Support INSTR & FIND_IN_SET functions to work with collated strings [spark]

2024-04-15 Thread via GitHub


uros-db commented on code in PR #45643:
URL: https://github.com/apache/spark/pull/45643#discussion_r1565524362


##
common/unsafe/src/main/java/org/apache/spark/sql/catalyst/util/CollationSupport.java:
##
@@ -169,6 +239,46 @@ private static boolean matchAt(final UTF8String target, final UTF8String pattern
 pos, pos + pattern.numChars()), pattern, collationId).last() == 0;
 }
 
+private static int findInSet(UTF8String match, UTF8String set, int collationId) {
+  if (match.contains(UTF8String.fromString(","))) {
+    return 0;
+  }
+
+  StringSearch stringSearch = CollationFactory.getStringSearch(set, match, collationId);
+
+  String setString = set.toString();

Review Comment:
   I think `CollationFactory.getStringSearch(set, match, collationId)` already 
does `set.toString()`
   
   there's no need to do that twice, let's add:
   
   ```
 public static StringSearch getStringSearch(
 final String targetString,
 final String pattern,
 final int collationId) {
   CharacterIterator target = new StringCharacterIterator(targetString);
   Collator collator = CollationFactory.fetchCollation(collationId).collator;
   return new StringSearch(pattern, target, (RuleBasedCollator) collator);
 }
   ```
   
   to `CollationFactory`
   
   this way, we can get the stringSearch and setString instances here, without 
extra String allocations






Re: [PR] [SPARK-47411][SQL] Support INSTR & FIND_IN_SET functions to work with collated strings [spark]

2024-04-15 Thread via GitHub


uros-db commented on code in PR #45643:
URL: https://github.com/apache/spark/pull/45643#discussion_r1565531354


##
common/unsafe/src/main/java/org/apache/spark/sql/catalyst/util/CollationSupport.java:
##
@@ -169,6 +239,46 @@ private static boolean matchAt(final UTF8String target, 
final UTF8String pattern
 pos, pos + pattern.numChars()), pattern, collationId).last() == 0;
 }
 
+private static int findInSet(UTF8String match, UTF8String set, int 
collationId) {
+  if (match.contains(UTF8String.fromString(","))) {
+return 0;
+  }
+
+  StringSearch stringSearch = CollationFactory.getStringSearch(set, match, 
collationId);
+
+  String setString = set.toString();

Review Comment:
   and if you do that, I'd say that the UTF8String implementation for 
`getStringSearch` can amount to:
   
   ```
 public static StringSearch getStringSearch(
 final UTF8String left,
 final UTF8String right,
 final int collationId) {
   return getStringSearch(left.toString(), right.toString(), collationId);
 }
   ```
   
   this way, we can have a more complete API






Re: [PR] [SPARK-47412][SQL] Add Collation Support for LPad/RPad. [spark]

2024-04-15 Thread via GitHub


uros-db commented on code in PR #46041:
URL: https://github.com/apache/spark/pull/46041#discussion_r1565541109


##
sql/core/src/test/scala/org/apache/spark/sql/CollationStringExpressionsSuite.scala:
##
@@ -163,6 +163,48 @@ class CollationStringExpressionsSuite
 })
   }
 
+  test("Support Lpad string expressions with collation") {
+// Supported collations
+case class PadTestCase[R](s: String, len: Int, pad: String, c: String, 
result: R)
+val testCases = Seq(
+  PadTestCase("", 5, " ", "UTF8_BINARY", " "),
+  PadTestCase("abc", 5, " ", "UNICODE", "  abc"),
+  PadTestCase("abc", 5, "XY", "UTF8_BINARY_LCASE", "XYabc"),
+  PadTestCase("abc", 5, "123", "UNICODE_CI", "12abc"),
+  PadTestCase("abc", 2, " ", "UTF8_BINARY", "ab"),
+  PadTestCase("abc", 2, "XY", "UNICODE", "ab"),
+  PadTestCase("abc", 2, "123", "UTF8_BINARY_LCASE", "ab"),
+  PadTestCase("abc", 2, "123", "UNICODE_CI", "ab")

Review Comment:
   Since we're already using this compact subset of test cases (which is great), 
let's spice these up a bit.
   
   Mix in some case & accent variation, as well as variable byte-length 
characters here (but no need to add more test cases); a sketch follows below.
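   For example (an illustrative sketch only: these reuse the 
`PadTestCase(s, len, pad, collation, result)` shape from this PR, and the 
expected results are my assumption, not verified output):
   
   ```
  PadTestCase("aBc", 5, "Xy", "UTF8_BINARY_LCASE", "XyaBc"), // case variation
  PadTestCase("abc", 5, "ü", "UNICODE", "üüabc"), // accented pad character
  PadTestCase("a世b", 5, "世", "UNICODE_CI", "世世a世b") // variable byte-length
   ```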






Re: [PR] [SPARK-47412][SQL] Add Collation Support for LPad/RPad. [spark]

2024-04-15 Thread via GitHub


uros-db commented on code in PR #46041:
URL: https://github.com/apache/spark/pull/46041#discussion_r1565544619


##
sql/core/src/test/scala/org/apache/spark/sql/CollationStringExpressionsSuite.scala:
##
@@ -163,6 +163,48 @@ class CollationStringExpressionsSuite
 })
   }
 
+  test("Support Lpad string expressions with collation") {

Review Comment:
   ```suggestion
 test("Support StringLPad string expressions with collation") {
   ```






Re: [PR] [SPARK-47412][SQL] Add Collation Support for LPad/RPad. [spark]

2024-04-15 Thread via GitHub


uros-db commented on code in PR #46041:
URL: https://github.com/apache/spark/pull/46041#discussion_r1565545008


##
sql/core/src/test/scala/org/apache/spark/sql/CollationStringExpressionsSuite.scala:
##
@@ -163,6 +163,48 @@ class CollationStringExpressionsSuite
 })
   }
 
+  test("Support Lpad string expressions with collation") {
+// Supported collations
+case class PadTestCase[R](s: String, len: Int, pad: String, c: String, 
result: R)

Review Comment:
   ```suggestion
   case class StringLPadTestCase[R](s: String, len: Int, pad: String, c: 
String, result: R)
   ```






Re: [PR] [SPARK-47412][SQL] Add Collation Support for LPad/RPad. [spark]

2024-04-15 Thread via GitHub


uros-db commented on code in PR #46041:
URL: https://github.com/apache/spark/pull/46041#discussion_r1565545434


##
sql/core/src/test/scala/org/apache/spark/sql/CollationStringExpressionsSuite.scala:
##
@@ -163,6 +163,48 @@ class CollationStringExpressionsSuite
 })
   }
 
+  test("Support Lpad string expressions with collation") {
+// Supported collations
+case class PadTestCase[R](s: String, len: Int, pad: String, c: String, 
result: R)
+val testCases = Seq(
+  PadTestCase("", 5, " ", "UTF8_BINARY", " "),
+  PadTestCase("abc", 5, " ", "UNICODE", "  abc"),
+  PadTestCase("abc", 5, "XY", "UTF8_BINARY_LCASE", "XYabc"),
+  PadTestCase("abc", 5, "123", "UNICODE_CI", "12abc"),
+  PadTestCase("abc", 2, " ", "UTF8_BINARY", "ab"),
+  PadTestCase("abc", 2, "XY", "UNICODE", "ab"),
+  PadTestCase("abc", 2, "123", "UTF8_BINARY_LCASE", "ab"),
+  PadTestCase("abc", 2, "123", "UNICODE_CI", "ab")
+)
+testCases.foreach(t => {
+  val query = s"SELECT lpad(collate('${t.s}', '${t.c}'), ${t.len}, 
'${t.pad}')"
+  // Result & data type
+  checkAnswer(sql(query), Row(t.result))
+  assert(sql(query).schema.fields.head.dataType.sameType(StringType(t.c)))
+})
+  }
+
+  test("Support Rpad string expressions with collation") {

Review Comment:
   ```suggestion
 test("Support StringRPad string expressions with collation") {
   ```






Re: [PR] [SPARK-47412][SQL] Add Collation Support for LPad/RPad. [spark]

2024-04-15 Thread via GitHub


uros-db commented on code in PR #46041:
URL: https://github.com/apache/spark/pull/46041#discussion_r1565545698


##
sql/core/src/test/scala/org/apache/spark/sql/CollationStringExpressionsSuite.scala:
##
@@ -163,6 +163,48 @@ class CollationStringExpressionsSuite
 })
   }
 
+  test("Support Lpad string expressions with collation") {
+// Supported collations
+case class PadTestCase[R](s: String, len: Int, pad: String, c: String, 
result: R)
+val testCases = Seq(
+  PadTestCase("", 5, " ", "UTF8_BINARY", " "),
+  PadTestCase("abc", 5, " ", "UNICODE", "  abc"),
+  PadTestCase("abc", 5, "XY", "UTF8_BINARY_LCASE", "XYabc"),
+  PadTestCase("abc", 5, "123", "UNICODE_CI", "12abc"),
+  PadTestCase("abc", 2, " ", "UTF8_BINARY", "ab"),
+  PadTestCase("abc", 2, "XY", "UNICODE", "ab"),
+  PadTestCase("abc", 2, "123", "UTF8_BINARY_LCASE", "ab"),
+  PadTestCase("abc", 2, "123", "UNICODE_CI", "ab")
+)
+testCases.foreach(t => {
+  val query = s"SELECT lpad(collate('${t.s}', '${t.c}'), ${t.len}, 
'${t.pad}')"
+  // Result & data type
+  checkAnswer(sql(query), Row(t.result))
+  assert(sql(query).schema.fields.head.dataType.sameType(StringType(t.c)))
+})
+  }
+
+  test("Support Rpad string expressions with collation") {
+// Supported collations
+case class PadTestCase[R](s: String, len: Int, pad: String, c: String, 
result: R)

Review Comment:
   ```suggestion
   case class StringRPadTestCase[R](s: String, len: Int, pad: String, c: 
String, result: R)
   ```






[PR] [SPARK-47081][CONNECT][FOLLOW] Unflake Progress Execution [spark]

2024-04-15 Thread via GitHub


grundprinzip opened a new pull request, #46060:
URL: https://github.com/apache/spark/pull/46060

   ### Why are the changes needed?
   Running some of the tests repeatedly has shown that there is an edge case in 
which the progress reporting yields undesired behavior. This happens in 
particular when the response sender loop has already sent the last expected 
message in the stream, indicating that the stream is finished, but the response 
sender would still send at least one additional progress message.
   
   This patch removes this race condition and guarantees that the 
`ResultComplete` message is the last message on the stream.
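   A minimal sketch of how that guarantee can be enforced (hypothetical names; 
`observer` stands in for the gRPC stream observer, and the real change lives in 
the response sender internals):
   
   ```
  // Once the final ResultComplete has been sent, late progress messages
  // are dropped instead of being emitted after it.
  private var finalMessageSent = false

  def sendProgress(msg: proto.ExecutePlanResponse): Unit = synchronized {
    if (!finalMessageSent) observer.onNext(msg)
  }

  def sendResultComplete(msg: proto.ExecutePlanResponse): Unit = synchronized {
    if (!finalMessageSent) {
      finalMessageSent = true
      observer.onNext(msg)
    }
  }
   ```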
   
   ### Does this PR introduce _any_ user-facing change?
   No
   
   ### How was this patch tested?
   Added an additional UT, but existing tests cover the race condition.
   
   ### Was this patch authored or co-authored using generative AI tooling?
   No





Re: [PR] [SPARK-47412][SQL] Add Collation Support for LPad/RPad. [spark]

2024-04-15 Thread via GitHub


uros-db commented on code in PR #46041:
URL: https://github.com/apache/spark/pull/46041#discussion_r1565548686


##
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/stringExpressions.scala:
##
@@ -1574,7 +1574,10 @@ case class StringLPad(str: Expression, len: Expression, 
pad: Expression)
   override def third: Expression = pad
 
   override def dataType: DataType = str.dataType
-  override def inputTypes: Seq[AbstractDataType] = Seq(StringType, 
IntegerType, StringType)
+  override def inputTypes: Seq[AbstractDataType] = Seq(
+StringTypeAnyCollation,
+IntegerType,
+StringTypeAnyCollation)

Review Comment:
   ```suggestion
 override def inputTypes: Seq[AbstractDataType] =
   Seq(StringTypeAnyCollation, IntegerType, StringTypeAnyCollation)
   ```






Re: [PR] [SPARK-47412][SQL] Add Collation Support for LPad/RPad. [spark]

2024-04-15 Thread via GitHub


uros-db commented on code in PR #46041:
URL: https://github.com/apache/spark/pull/46041#discussion_r1565549001


##
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/stringExpressions.scala:
##
@@ -1653,7 +1656,10 @@ case class StringRPad(str: Expression, len: Expression, 
pad: Expression = Litera
   override def third: Expression = pad
 
   override def dataType: DataType = str.dataType
-  override def inputTypes: Seq[AbstractDataType] = Seq(StringType, 
IntegerType, StringType)
+  override def inputTypes: Seq[AbstractDataType] = Seq(
+StringTypeAnyCollation,
+IntegerType,
+StringTypeAnyCollation)

Review Comment:
   ```suggestion
 override def inputTypes: Seq[AbstractDataType] =
   Seq(StringTypeAnyCollation, IntegerType, StringTypeAnyCollation)
   ```






Re: [PR] [SPARK-47411][SQL] Support INSTR & FIND_IN_SET functions to work with collated strings [spark]

2024-04-15 Thread via GitHub


miland-db commented on code in PR #45643:
URL: https://github.com/apache/spark/pull/45643#discussion_r156472


##
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/stringExpressions.scala:
##
@@ -974,15 +974,19 @@ case class StringTranslate(srcExpr: Expression, 
matchingExpr: Expression, replac
 case class FindInSet(left: Expression, right: Expression) extends 
BinaryExpression
 with ImplicitCastInputTypes with NullIntolerant {
 
-  override def inputTypes: Seq[AbstractDataType] = Seq(StringType, StringType)
+  final lazy val collationId: Int = 
left.dataType.asInstanceOf[StringType].collationId
+
+  override def inputTypes: Seq[AbstractDataType] =
+Seq(StringTypeAnyCollation, StringTypeAnyCollation)
 
-  override protected def nullSafeEval(word: Any, set: Any): Any =
-set.asInstanceOf[UTF8String].findInSet(word.asInstanceOf[UTF8String])
+  override protected def nullSafeEval(word: Any, set: Any): Any = {
+CollationSupport.FindInSet.
+  exec(word.asInstanceOf[UTF8String], set.asInstanceOf[UTF8String], 
collationId)
+  }
 
   override def doGenCode(ctx: CodegenContext, ev: ExprCode): ExprCode = {
-nullSafeCodeGen(ctx, ev, (word, set) =>
-  s"${ev.value} = $set.findInSet($word);"
-)
+nullSafeCodeGen(ctx, ev, (l, r) =>
+  s"${ev.value} = " + CollationSupport.FindInSet.genCode(l, r, 
collationId) + ";")

Review Comment:
   It works, but I am not sure if we should go from `nullSafeCodeGen` to just 
`defineCodeGen`?






Re: [PR] [SPARK-47411][SQL] Support INSTR & FIND_IN_SET functions to work with collated strings [spark]

2024-04-15 Thread via GitHub


uros-db commented on code in PR #45643:
URL: https://github.com/apache/spark/pull/45643#discussion_r1565559007


##
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/stringExpressions.scala:
##
@@ -974,15 +974,19 @@ case class StringTranslate(srcExpr: Expression, 
matchingExpr: Expression, replac
 case class FindInSet(left: Expression, right: Expression) extends 
BinaryExpression
 with ImplicitCastInputTypes with NullIntolerant {
 
-  override def inputTypes: Seq[AbstractDataType] = Seq(StringType, StringType)
+  final lazy val collationId: Int = 
left.dataType.asInstanceOf[StringType].collationId
+
+  override def inputTypes: Seq[AbstractDataType] =
+Seq(StringTypeAnyCollation, StringTypeAnyCollation)
 
-  override protected def nullSafeEval(word: Any, set: Any): Any =
-set.asInstanceOf[UTF8String].findInSet(word.asInstanceOf[UTF8String])
+  override protected def nullSafeEval(word: Any, set: Any): Any = {
+CollationSupport.FindInSet.
+  exec(word.asInstanceOf[UTF8String], set.asInstanceOf[UTF8String], 
collationId)
+  }
 
   override def doGenCode(ctx: CodegenContext, ev: ExprCode): ExprCode = {
-nullSafeCodeGen(ctx, ev, (word, set) =>
-  s"${ev.value} = $set.findInSet($word);"
-)
+nullSafeCodeGen(ctx, ev, (l, r) =>
+  s"${ev.value} = " + CollationSupport.FindInSet.genCode(l, r, 
collationId) + ";")

Review Comment:
   of course, it's just shorthand (see defineCodeGen impl)
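   i.e. with `defineCodeGen`, the snippet above could shrink to something like 
this (a sketch, assuming the same `genCode` helper as in this PR):
   
   ```
  override def doGenCode(ctx: CodegenContext, ev: ExprCode): ExprCode = {
    // defineCodeGen emits `ev.value = <expr>;` for us, so the string
    // concatenation from nullSafeCodeGen is no longer needed.
    defineCodeGen(ctx, ev, (word, set) =>
      CollationSupport.FindInSet.genCode(word, set, collationId))
  }
   ```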






Re: [PR] [SPARK-47411][SQL] Support INSTR & FIND_IN_SET functions to work with collated strings [spark]

2024-04-15 Thread via GitHub


miland-db commented on code in PR #45643:
URL: https://github.com/apache/spark/pull/45643#discussion_r1565559690


##
common/unsafe/src/main/java/org/apache/spark/sql/catalyst/util/CollationSupport.java:
##
@@ -169,6 +239,46 @@ private static boolean matchAt(final UTF8String target, 
final UTF8String pattern
 pos, pos + pattern.numChars()), pattern, collationId).last() == 0;
 }
 
+private static int findInSet(UTF8String match, UTF8String set, int 
collationId) {
+  if (match.contains(UTF8String.fromString(","))) {
+return 0;
+  }
+
+  StringSearch stringSearch = CollationFactory.getStringSearch(set, match, 
collationId);
+
+  String setString = set.toString();

Review Comment:
   Added






Re: [PR] [SPARK-47420][SQL] Fix test output [spark]

2024-04-15 Thread via GitHub


uros-db commented on code in PR #46058:
URL: https://github.com/apache/spark/pull/46058#discussion_r1565572482


##
common/unsafe/src/test/java/org/apache/spark/unsafe/types/CollationSupportSuite.java:
##
@@ -176,12 +176,12 @@ public void testStartsWith() throws SparkException {
 assertStartsWith("äbćδe", "ÄBcΔÉ", "UNICODE_CI", false);
   }
 
-  private void assertEndsWith(String pattern, String suffix, String 
collationName, boolean value)
+  private void assertEndsWith(String pattern, String suffix, String 
collationName)

Review Comment:
   ```suggestion
 private void assertEndsWith(String pattern, String suffix, String 
collationName, boolean expected)
   ```






[PR] [SPARK-47356][SQL] Add support for ConcatWs & Elt (all collations) [spark]

2024-04-15 Thread via GitHub


mihailom-db opened a new pull request, #46061:
URL: https://github.com/apache/spark/pull/46061

   ### What changes were proposed in this pull request?
   Addition of support for ConcatWs and Elt expressions.
   
   
   ### Why are the changes needed?
   We need to enable collation support for these functions in order to cover 
all functions in scope.
   
   
   ### Does this PR introduce _any_ user-facing change?
   Yes, both expressions will no longer return an error when called with 
collated strings.
   
   
   ### How was this patch tested?
   Addition of tests to `CollationStringExpressionsSuite`
   
   
   ### Was this patch authored or co-authored using generative AI tooling?
   No.
   





Re: [PR] [SPARK-47413][SQL] - add support to substr/left/right for collations [Post-Refactor] [spark]

2024-04-15 Thread via GitHub


uros-db commented on code in PR #46040:
URL: https://github.com/apache/spark/pull/46040#discussion_r1565606851


##
sql/core/src/test/scala/org/apache/spark/sql/CollationStringExpressionsSuite.scala:
##
@@ -163,6 +163,155 @@ class CollationStringExpressionsSuite
 })
   }
 
+  test("Support Left/Right/Substr with implicit collation") {
+case class SubstringTestCase(query: String, collation: String, result: Row)
+val longString = "In the course of human events"
+val checks = Seq("utf8_binary_lcase", "utf8_binary", "unicode", 
"unicode_ci").flatMap(
+  c => Seq(
+SubstringTestCase(s"select left(left('$longString' collate " + c + ", 
5), 1)", c, Row("I")),
+SubstringTestCase(
+  s"select right(right('$longString' collate " + c + ", 5), 1)", c, 
Row("s")),
+SubstringTestCase(
+  s"select substr(substr('$longString' collate " + c + ", 4), 2)", c,
+  Row("he course of human events"))
+  )
+)
+
+checks.foreach { check =>
+  // Result & data type
+  checkAnswer(sql(check.query), check.result)
+  
assert(sql(check.query).schema.fields.head.dataType.sameType(StringType(check.collation)))
+}
+  }

Review Comment:
   I think this test is not needed
   
   for the scope of this ticket (adding collation awareness to a string 
expression), it doesn't matter where the collationId came from (explicit or 
implicit), what matters is can the expression handle it properly
   
   for example: if `left('In the course of human events' collate unicode, 5)` 
returns proper result value (`'In th'`) and type (`StringType(3)`), then 
calling left on that resulting string is no different than calling `left('In 
th' collate unicode, 1)`






Re: [PR] [SPARK-47413][SQL] - add support to substr/left/right for collations [Post-Refactor] [spark]

2024-04-15 Thread via GitHub


uros-db commented on code in PR #46040:
URL: https://github.com/apache/spark/pull/46040#discussion_r1565608369


##
sql/core/src/test/scala/org/apache/spark/sql/CollationStringExpressionsSuite.scala:
##
@@ -163,6 +163,155 @@ class CollationStringExpressionsSuite
 })
   }
 
+  test("Support Left/Right/Substr with implicit collation") {
+case class SubstringTestCase(query: String, collation: String, result: Row)
+val longString = "In the course of human events"
+val checks = Seq("utf8_binary_lcase", "utf8_binary", "unicode", 
"unicode_ci").flatMap(
+  c => Seq(
+SubstringTestCase(s"select left(left('$longString' collate " + c + ", 
5), 1)", c, Row("I")),
+SubstringTestCase(
+  s"select right(right('$longString' collate " + c + ", 5), 1)", c, 
Row("s")),
+SubstringTestCase(
+  s"select substr(substr('$longString' collate " + c + ", 4), 2)", c,
+  Row("he course of human events"))
+  )
+)
+
+checks.foreach { check =>
+  // Result & data type
+  checkAnswer(sql(check.query), check.result)
+  
assert(sql(check.query).schema.fields.head.dataType.sameType(StringType(check.collation)))
+}
+  }

Review Comment:
   In other words, let's just look at the string expression from the outside, 
as a black-box module, and focus on making it work with collations properly.
   
   Otherwise, we might end up with too many repetitive tests.






Re: [PR] [SPARK-47413][SQL] - add support to substr/left/right for collations [Post-Refactor] [spark]

2024-04-15 Thread via GitHub


uros-db commented on code in PR #46040:
URL: https://github.com/apache/spark/pull/46040#discussion_r1565610737


##
sql/core/src/test/scala/org/apache/spark/sql/CollationStringExpressionsSuite.scala:
##
@@ -163,6 +163,155 @@ class CollationStringExpressionsSuite
 })
   }
 
+  test("Support Left/Right/Substr with implicit collation") {
+case class SubstringTestCase(query: String, collation: String, result: Row)
+val longString = "In the course of human events"
+val checks = Seq("utf8_binary_lcase", "utf8_binary", "unicode", 
"unicode_ci").flatMap(
+  c => Seq(
+SubstringTestCase(s"select left(left('$longString' collate " + c + ", 
5), 1)", c, Row("I")),
+SubstringTestCase(
+  s"select right(right('$longString' collate " + c + ", 5), 1)", c, 
Row("s")),
+SubstringTestCase(
+  s"select substr(substr('$longString' collate " + c + ", 4), 2)", c,
+  Row("he course of human events"))
+  )
+)
+
+checks.foreach { check =>
+  // Result & data type
+  checkAnswer(sql(check.query), check.result)
+  
assert(sql(check.query).schema.fields.head.dataType.sameType(StringType(check.collation)))
+}
+  }
+
+  test("Support Left/Right/Substr with explicit proper collation") {
+case class SubstringTestCase(query: String, collation: String, result: Row)
+val checks = Seq("utf8_binary_lcase", "utf8_binary", "unicode", 
"unicode_ci").flatMap(
+  c => Seq(
+SubstringTestCase("select left('abc' collate " + c + ", 1)", c, 
Row("a")),
+SubstringTestCase("select right('def' collate " + c + ", 1)", c, 
Row("f")),
+SubstringTestCase("select substr('abc' collate " + c + ", 2)", c, 
Row("bc")),
+SubstringTestCase("select substr('example' collate " + c + ", 0, 2)", 
c, Row("ex")),
+SubstringTestCase("select substr('example' collate " + c + ", 1, 2)", 
c, Row("ex")),
+SubstringTestCase("select substr('example' collate " + c + ", 0, 7)", 
c, Row("example")),
+SubstringTestCase("select substr('example' collate " + c + ", 1, 7)", 
c, Row("example")),
+SubstringTestCase("select substr('example' collate " + c + ", 0, 
100)", c, Row("example")),
+SubstringTestCase("select substr('example' collate " + c + ", 1, 
100)", c, Row("example")),
+SubstringTestCase("select substr('example' collate " + c + ", 2, 2)", 
c, Row("xa")),
+SubstringTestCase("select substr('example' collate " + c + ", 1, 6)", 
c, Row("exampl")),
+SubstringTestCase("select substr('example' collate " + c + ", 2, 
100)", c, Row("xample")),
+SubstringTestCase("select substr('example' collate " + c + ", 0, 0)", 
c, Row("")),
+SubstringTestCase("select substr('example' collate " + c + ", 100, 
4)", c, Row("")),
+SubstringTestCase("select substr('example' collate " + c + ", 0, 
100)", c, Row("example")),
+SubstringTestCase("select substr('example' collate " + c + ", 1, 
100)", c, Row("example")),
+SubstringTestCase("select substr('example' collate " + c + ", 2, 
100)", c, Row("xample")),
+SubstringTestCase("select substr('example' collate " + c + ", -3, 2)", 
c, Row("pl")),
+SubstringTestCase("select substr('example' collate " + c + ", -100, 
4)", c, Row("")),
+SubstringTestCase("select substr('example' collate " + c + ", 
-2147483648, 6)", c, Row("")),
+SubstringTestCase("select substr(' a世a ' collate " + c + ", 2, 3)", c, 
Row("a世a")), // scalastyle:ignore
+SubstringTestCase("select left(' a世a ' collate " + c + ", 3)", c, 
Row(" a世")), // scalastyle:ignore
+SubstringTestCase("select right(' a世a ' collate " + c + ", 3)", c, 
Row("世a ")), // scalastyle:ignore
+SubstringTestCase("select substr('AaAaAaAa00' collate " + c + ", 
2, 3)", c, Row("aAa")),
+SubstringTestCase("select left('AaAaAaAa00' collate " + c + ", 
3)", c, Row("AaA")),
+SubstringTestCase("select right('AaAaAaAa00' collate " + c + ", 
3)", c, Row("000")),
+SubstringTestCase("select substr('' collate " + c + ", 1, 1)", c, 
Row("")),
+SubstringTestCase("select left('' collate " + c + ", 1)", c, Row("")),
+SubstringTestCase("select right('' collate " + c + ", 1)", c, Row("")),
+SubstringTestCase("select left('ghi' collate " + c + ", 1)", c, 
Row("g"))
+  )
+)
+
+checks.foreach { check =>
+  // Result & data type
+  checkAnswer(sql(check.query), check.result)
+  
assert(sql(check.query).schema.fields.head.dataType.sameType(StringType(check.collation)))
+}
+  }
+
+  test("Support Left/Right/Substr with explicit improper collation") {
+case class SubstringTestCase(query: String, collation: String, result: Row)
+val checks = Seq("utf8_binary_lcase", "utf8_binary", "unicode", 
"unicode_ci").flatMap(
+  c => Seq(
+SubstringTestCase("select left(null collate " + c + ", 1)", c, 
Row(null)),
+SubstringTestCase(

Re: [PR] [SPARK-47413][SQL] - add support to substr/left/right for collations [Post-Refactor] [spark]

2024-04-15 Thread via GitHub


uros-db commented on code in PR #46040:
URL: https://github.com/apache/spark/pull/46040#discussion_r1565611174


##
sql/core/src/test/scala/org/apache/spark/sql/CollationStringExpressionsSuite.scala:
##
@@ -163,6 +163,155 @@ class CollationStringExpressionsSuite
 })
   }
 
+  test("Support Left/Right/Substr with implicit collation") {
+case class SubstringTestCase(query: String, collation: String, result: Row)
+val longString = "In the course of human events"
+val checks = Seq("utf8_binary_lcase", "utf8_binary", "unicode", 
"unicode_ci").flatMap(
+  c => Seq(
+SubstringTestCase(s"select left(left('$longString' collate " + c + ", 
5), 1)", c, Row("I")),
+SubstringTestCase(
+  s"select right(right('$longString' collate " + c + ", 5), 1)", c, 
Row("s")),
+SubstringTestCase(
+  s"select substr(substr('$longString' collate " + c + ", 4), 2)", c,
+  Row("he course of human events"))
+  )
+)
+
+checks.foreach { check =>
+  // Result & data type
+  checkAnswer(sql(check.query), check.result)
+  
assert(sql(check.query).schema.fields.head.dataType.sameType(StringType(check.collation)))
+}
+  }
+
+  test("Support Left/Right/Substr with explicit proper collation") {
+case class SubstringTestCase(query: String, collation: String, result: Row)
+val checks = Seq("utf8_binary_lcase", "utf8_binary", "unicode", 
"unicode_ci").flatMap(
+  c => Seq(
+SubstringTestCase("select left('abc' collate " + c + ", 1)", c, 
Row("a")),
+SubstringTestCase("select right('def' collate " + c + ", 1)", c, 
Row("f")),
+SubstringTestCase("select substr('abc' collate " + c + ", 2)", c, 
Row("bc")),
+SubstringTestCase("select substr('example' collate " + c + ", 0, 2)", 
c, Row("ex")),
+SubstringTestCase("select substr('example' collate " + c + ", 1, 2)", 
c, Row("ex")),
+SubstringTestCase("select substr('example' collate " + c + ", 0, 7)", 
c, Row("example")),
+SubstringTestCase("select substr('example' collate " + c + ", 1, 7)", 
c, Row("example")),
+SubstringTestCase("select substr('example' collate " + c + ", 0, 
100)", c, Row("example")),
+SubstringTestCase("select substr('example' collate " + c + ", 1, 
100)", c, Row("example")),
+SubstringTestCase("select substr('example' collate " + c + ", 2, 2)", 
c, Row("xa")),
+SubstringTestCase("select substr('example' collate " + c + ", 1, 6)", 
c, Row("exampl")),
+SubstringTestCase("select substr('example' collate " + c + ", 2, 
100)", c, Row("xample")),
+SubstringTestCase("select substr('example' collate " + c + ", 0, 0)", 
c, Row("")),
+SubstringTestCase("select substr('example' collate " + c + ", 100, 
4)", c, Row("")),
+SubstringTestCase("select substr('example' collate " + c + ", 0, 
100)", c, Row("example")),
+SubstringTestCase("select substr('example' collate " + c + ", 1, 
100)", c, Row("example")),
+SubstringTestCase("select substr('example' collate " + c + ", 2, 
100)", c, Row("xample")),
+SubstringTestCase("select substr('example' collate " + c + ", -3, 2)", 
c, Row("pl")),
+SubstringTestCase("select substr('example' collate " + c + ", -100, 
4)", c, Row("")),
+SubstringTestCase("select substr('example' collate " + c + ", 
-2147483648, 6)", c, Row("")),
+SubstringTestCase("select substr(' a世a ' collate " + c + ", 2, 3)", c, 
Row("a世a")), // scalastyle:ignore
+SubstringTestCase("select left(' a世a ' collate " + c + ", 3)", c, 
Row(" a世")), // scalastyle:ignore
+SubstringTestCase("select right(' a世a ' collate " + c + ", 3)", c, 
Row("世a ")), // scalastyle:ignore
+SubstringTestCase("select substr('AaAaAaAa00' collate " + c + ", 
2, 3)", c, Row("aAa")),
+SubstringTestCase("select left('AaAaAaAa00' collate " + c + ", 
3)", c, Row("AaA")),
+SubstringTestCase("select right('AaAaAaAa00' collate " + c + ", 
3)", c, Row("000")),
+SubstringTestCase("select substr('' collate " + c + ", 1, 1)", c, 
Row("")),
+SubstringTestCase("select left('' collate " + c + ", 1)", c, Row("")),
+SubstringTestCase("select right('' collate " + c + ", 1)", c, Row("")),
+SubstringTestCase("select left('ghi' collate " + c + ", 1)", c, 
Row("g"))
+  )
+)
+
+checks.foreach { check =>
+  // Result & data type
+  checkAnswer(sql(check.query), check.result)
+  
assert(sql(check.query).schema.fields.head.dataType.sameType(StringType(check.collation)))
+}
+  }
+
+  test("Support Left/Right/Substr with explicit improper collation") {
+case class SubstringTestCase(query: String, collation: String, result: Row)
+val checks = Seq("utf8_binary_lcase", "utf8_binary", "unicode", 
"unicode_ci").flatMap(
+  c => Seq(
+SubstringTestCase("select left(null collate " + c + ", 1)", c, 
Row(null)),
+SubstringTestCase(

Re: [PR] [SPARK-47463][SQL] Use V2Predicate to wrap expression with return type of boolean [spark]

2024-04-15 Thread via GitHub


cloud-fan commented on code in PR #45589:
URL: https://github.com/apache/spark/pull/45589#discussion_r1565631145


##
sql/core/src/test/scala/org/apache/spark/sql/connector/DataSourceV2Suite.scala:
##
@@ -966,6 +966,22 @@ class DataSourceV2Suite extends QueryTest with 
SharedSparkSession with AdaptiveS
   )
 }
   }
+
+  test("SPARK-47463: Pushed down v2 filter with if expression") {
+withTempView("t1") {
+  
spark.read.format(classOf[AdvancedDataSourceV2WithV2Filter].getName).load()
+.createTempView("t1")
+  val df1 = sql(
+s"""
+   |select * from
+   |(select if(i = 1, i, 0) as c from t1) t
+   |where t.c > 0

Review Comment:
   I think `where if(i = 1, i, 0) > 0` is valid SQL? BTW, please upper-case the 
SQL keywords in the SQL statement; a possible rewrite is sketched below.
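   For reference, the test query above with upper-cased keywords would read:
   
   ```
  val df1 = sql(
    s"""
       |SELECT * FROM
       |(SELECT IF(i = 1, i, 0) AS c FROM t1) t
       |WHERE t.c > 0
       |""".stripMargin)
   ```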






Re: [PR] [SPARK-47274][PYTHON][SQL] Provide more useful context for PySpark DataFrame API errors [spark]

2024-04-15 Thread via GitHub


itholic commented on PR #45377:
URL: https://github.com/apache/spark/pull/45377#issuecomment-2056611480

   The difficulty with the previous method was that it was not easy to 
perfectly sync the data between two separately operating ThreadLocals, 
`CurrentOrigin` and `PySparkCurrentOrigin`.
   
   After taking a deeper look at the structure, I think we may be able to make 
`CurrentOrigin` flexible enough to support the PySpark error context instead of 
adding a separate ThreadLocal like `PySparkCurrentOrigin`.
   
   If it works, it seems possible to make the structure more flexible while 
maintaining the existing communication rules between Python and the JVM, 
without adding helper functions such as a PySpark-specific `fn`.
   
   Let me give it a try, create a PR to refactor the current structure, and 
ping you guys.





Re: [PR] [SPARK-47810][SQL] Replace equivalent expression to <=> in join condition [spark]

2024-04-15 Thread via GitHub


cloud-fan commented on code in PR #45999:
URL: https://github.com/apache/spark/pull/45999#discussion_r1565638065


##
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/OptimizeJoinCondition.scala:
##
@@ -0,0 +1,55 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.catalyst.optimizer
+
+import org.apache.spark.sql.catalyst.expressions.{And, EqualNullSafe, EqualTo, 
IsNull, Or, PredicateHelper}
+import org.apache.spark.sql.catalyst.plans.logical.{Join, LogicalPlan}
+import org.apache.spark.sql.catalyst.rules.Rule
+import org.apache.spark.sql.catalyst.trees.TreePattern.JOIN
+
+/**
+ * Replaces `t1.id is null and t2.id is null or t1.id = t2.id` to `t1.id <=> 
t2.id`
+ * in join condition for better performance.
+ */
+object OptimizeJoinCondition extends Rule[LogicalPlan] with PredicateHelper {
+  override def apply(plan: LogicalPlan): LogicalPlan = 
plan.transformWithPruning(
+_.containsPattern(JOIN), ruleId) {
+case join @ Join(left, right, joinType, condition, hint) if 
condition.nonEmpty =>
+  val predicates = condition.map(splitConjunctivePredicates).get
+  var replace = false
+  val newPredicates = predicates.map {
+case Or(EqualTo(l, r), And(IsNull(c1), IsNull(c2)))

Review Comment:
   Do we need to call `splitConjunctivePredicates` at the beginning? It seems 
we can just transform the join condition expression and match this pattern of 
`Or`.
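   A rough sketch of that direct transform (illustrative; a complete version 
would also need the commuted ordering with `And(IsNull, IsNull)` on the left):
   
   ```
  case j @ Join(_, _, _, Some(cond), _) =>
    val newCond = cond.transformUp {
      case Or(EqualTo(l, r), And(IsNull(c1), IsNull(c2)))
          if l.semanticEquals(c1) && r.semanticEquals(c2) =>
        EqualNullSafe(l, r)
    }
    j.copy(condition = Some(newCond))
   ```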






[PR] [SPARK-47857][SQL] Utilize java.sql.RowId.getBytes API directly for UTF8String [spark]

2024-04-15 Thread via GitHub


yaooqinn opened a new pull request, #46062:
URL: https://github.com/apache/spark/pull/46062

   
   
   ### What changes were proposed in this pull request?
   
   
   Utilize the java.sql.RowId.getBytes API directly, simplifying 
`RowId.toString.getBytes` to `RowId.getBytes`.
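   A hedged before/after sketch of the idea in a JDBC read path (`rs` and `pos` 
are illustrative names, and it assumes the driver's RowId bytes match its 
string form, as this PR does for Oracle's ROWID):
   
   ```
  // Before: materialize a String, then re-encode it into bytes.
  val before = UTF8String.fromString(rs.getRowId(pos).toString)
  // After: wrap the row id's bytes directly, skipping the String copy.
  val after = UTF8String.fromBytes(rs.getRowId(pos).getBytes)
   ```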
   
   ### Why are the changes needed?
   
   
   To simplify the code, and possibly gain some performance by accessing the 
underlying byte array directly instead of copying back and forth.
   
   ### Does this PR introduce _any_ user-facing change?
   
   no
   
   ### How was this patch tested?
   
   existing tests
   
   ```
   SPARK-32992: map Oracle's ROWID type to StringType (721 milliseconds)
   SPARK-44885: query row with ROWID type containing NULL value (821 
milliseconds)
   ```
   
   ### Was this patch authored or co-authored using generative AI tooling?
   
   no





Re: [PR] [SPARK-47840][SS] Disable foldable propagation across Streaming Aggregate/Join nodes [spark]

2024-04-15 Thread via GitHub


cloud-fan commented on PR #46035:
URL: https://github.com/apache/spark/pull/46035#issuecomment-2056640353

   Shall we revisit all the rules that match Aggregate/Join? It's a good point 
that streaming aggregate/join also reads data from state and we can't optimize 
it with the query plan of the current micro-batch.





[PR] [SPARK-47858][PYTHON][SQL] Refactoring the structure for DataFrame error context [spark]

2024-04-15 Thread via GitHub


itholic opened a new pull request, #46063:
URL: https://github.com/apache/spark/pull/46063

   ### What changes were proposed in this pull request?
   
   This PR proposes to refactoring the current structure for DataFrame error 
context.
   
   
   ### Why are the changes needed?
   
   To make future management and expansion more flexible
   
   
   ### Does this PR introduce _any_ user-facing change?
   
   No, it's internal code refactoring
   
   
   ### How was this patch tested?
   
   The existing `DataFrameTests.test_dataframe_error_context` should pass.
   
   
   ### Was this patch authored or co-authored using generative AI tooling?
   
   No.
   





Re: [PR] [SPARK-47858][PYTHON][SQL] Refactoring the structure for DataFrame error context [spark]

2024-04-15 Thread via GitHub


itholic commented on code in PR #46063:
URL: https://github.com/apache/spark/pull/46063#discussion_r1565663205


##
sql/api/src/main/scala/org/apache/spark/sql/catalyst/trees/origin.scala:
##
@@ -76,6 +76,12 @@ object CurrentOrigin {
   value.get.copy(line = Some(line), startPosition = Some(start)))
   }
 
+  def setPySparkErrorContext(pysparkFragment: String, pysparkCallSite: 
String): Unit = {
+val tupleInfo = (pysparkFragment, pysparkCallSite)
+value.set(
+  value.get.copy(pysparkErrorContext = Some(tupleInfo)))
+  }
+

Review Comment:
   cc @HyukjinKwon @cloud-fan 
   
   Try reusing the existing `CurrentOrigin` here so we don't need to add a 
separate ThreadLocal, and also don't need to add helper functions for each 
existing method.
   
   Some test cases still need to be addressed, but I believe the overall 
structure is ready to review.
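   For illustration, a JVM-side caller could then record the Python context via 
the new API shown above (the values here are made up):
   
   ```
  CurrentOrigin.setPySparkErrorContext(
    pysparkFragment = "divide", // e.g. the failing DataFrame API name
    pysparkCallSite = "/app/etl/job.py:42") // e.g. the user-code call site
   ```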






Re: [PR] [SPARK-47297][SQL] Add collations support to split regex expression [spark]

2024-04-15 Thread via GitHub


uros-db commented on code in PR #45856:
URL: https://github.com/apache/spark/pull/45856#discussion_r1565667394


##
common/unsafe/src/main/java/org/apache/spark/sql/catalyst/util/CollationSupport.java:
##
@@ -143,6 +145,36 @@ public static boolean execICU(final UTF8String l, final 
UTF8String r,
* Collation-aware regexp expressions.
*/
 
+  public static class StringSplit {
+public static UTF8String[] exec(final UTF8String l, final UTF8String r, 
final int limit,

Review Comment:
   let's be consistent with naming
   
   `string, regex, limit` instead of `l, r, limit`






Re: [PR] [SPARK-47297][SQL] Add collations support to split regex expression [spark]

2024-04-15 Thread via GitHub


uros-db commented on code in PR #45856:
URL: https://github.com/apache/spark/pull/45856#discussion_r1565669167


##
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/regexpExpressions.scala:
##
@@ -543,25 +544,29 @@ case class RLike(left: Expression, right: Expression) 
extends StringRegexExpress
 case class StringSplit(str: Expression, regex: Expression, limit: Expression)
   extends TernaryExpression with ImplicitCastInputTypes with NullIntolerant {
 
-  override def dataType: DataType = ArrayType(StringType, containsNull = false)
-  override def inputTypes: Seq[DataType] = Seq(StringType, StringType, 
IntegerType)
+  override def dataType: DataType = ArrayType(str.dataType, containsNull = 
false)
+  override def inputTypes: Seq[AbstractDataType] =
+Seq(StringTypeBinaryLcase, StringTypeAnyCollation, IntegerType)
   override def first: Expression = str
   override def second: Expression = regex
   override def third: Expression = limit
 
+  final lazy val collationId: Int = 
str.dataType.asInstanceOf[StringType].collationId
+
   def this(exp: Expression, regex: Expression) = this(exp, regex, Literal(-1))
 
   override def nullSafeEval(string: Any, regex: Any, limit: Any): Any = {
-val strings = string.asInstanceOf[UTF8String].split(
-  regex.asInstanceOf[UTF8String], limit.asInstanceOf[Int])
+val strings = 
CollationSupport.StringSplit.exec(string.asInstanceOf[UTF8String],
+  regex.asInstanceOf[UTF8String], limit.asInstanceOf[Int], collationId)
 new GenericArrayData(strings.asInstanceOf[Array[Any]])
   }
 
   override def doGenCode(ctx: CodegenContext, ev: ExprCode): ExprCode = {
 val arrayClass = classOf[GenericArrayData].getName
 nullSafeCodeGen(ctx, ev, (str, regex, limit) => {

Review Comment:
   for single function call (like here), use `defineCodeGen`






Re: [PR] [SPARK-47297][SQL] Add collations support to split regex expression [spark]

2024-04-15 Thread via GitHub


uros-db commented on code in PR #45856:
URL: https://github.com/apache/spark/pull/45856#discussion_r1565671721


##
sql/core/src/test/scala/org/apache/spark/sql/CollationRegexpExpressionsSuite.scala:
##
@@ -116,26 +116,37 @@ class CollationRegexpExpressionsSuite
 
   test("Support StringSplit string expression with collation") {
 // Supported collations
-case class StringSplitTestCase[R](l: String, r: String, c: String, result: 
R)
+case class StringSplitTestCase[R](l: String, r: String, c: String, result: 
R, limit: Int = -1)
 val testCases = Seq(
-  StringSplitTestCase("ABC", "[B]", "UTF8_BINARY", Seq("A", "C"))
+  StringSplitTestCase("ABC", "[B]", "UTF8_BINARY", Seq("A", "C")),
+  StringSplitTestCase("ABC", "[b]", "UTF8_BINARY", Seq("ABC")),
+  StringSplitTestCase("ABC", "[b]", "UTF8_BINARY_LCASE", Seq("A", "C")),
+  StringSplitTestCase("AAA", "[a]", "UTF8_BINARY_LCASE", Seq("", "", "", 
"")),
+  StringSplitTestCase("AAA", "[b]", "UTF8_BINARY_LCASE", Seq("AAA")),
+  StringSplitTestCase("aAbB", "[ab]", "UTF8_BINARY_LCASE", Seq("", "", "", 
"", "")),
+  StringSplitTestCase("", "", "UTF8_BINARY_LCASE", Seq("")),
+  StringSplitTestCase("", "[a]", "UTF8_BINARY_LCASE", Seq("")),
+  StringSplitTestCase("xAxBxaxbx", "[AB]", "UTF8_BINARY_LCASE", Seq("x", 
"x", "x", "x", "x")),
+  StringSplitTestCase("ABC", "", "UTF8_BINARY_LCASE", Seq("A", "B", "C")),
+  // test split with limit
+  StringSplitTestCase("ABC", "[b]", "UTF8_BINARY_LCASE", Seq("ABC"), 1),
+  StringSplitTestCase("ABC", "[b]", "UTF8_BINARY_LCASE", Seq("A", "C"), 2),
+  StringSplitTestCase("ABC", "[b]", "UTF8_BINARY_LCASE", Seq("A", "C"), 3),
+  StringSplitTestCase("ABC", "[B]", "UNICODE", Seq("A", "C")),
+  StringSplitTestCase("ABC", "[b]", "UNICODE", Seq("ABC"))
 )
 testCases.foreach(t => {
-  val query = s"SELECT split(collate('${t.l}', '${t.c}'), 
collate('${t.r}', '${t.c}'))"
+  val query = s"SELECT split(collate('${t.l}', '${t.c}'), '${t.r}', 
${t.limit})"
   // Result & data type
   checkAnswer(sql(query), Row(t.result))
   
assert(sql(query).schema.fields.head.dataType.sameType(ArrayType(StringType(t.c
   // TODO: Implicit casting (not currently supported)

Review Comment:
   don't leave any TODOs
   
   if there is no casting to be done for this expression, note that in the 
comment here and explain why that's the case






Re: [PR] [SPARK-47274][PYTHON][SQL] Provide more useful context for PySpark DataFrame API errors [spark]

2024-04-15 Thread via GitHub


itholic commented on PR #45377:
URL: https://github.com/apache/spark/pull/45377#issuecomment-2056680916

   > Let me give it a try, create a PR to refactor the current structure, and 
ping you guys.
   
   Created https://github.com/apache/spark/pull/46063.





Re: [PR] [SPARK-47741] Added stack overflow handling in parser [spark]

2024-04-15 Thread via GitHub


cloud-fan commented on code in PR #45896:
URL: https://github.com/apache/spark/pull/45896#discussion_r1565715402


##
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/parser/AbstractSqlParser.scala:
##
@@ -30,50 +32,74 @@ abstract class AbstractSqlParser extends AbstractParser 
with ParserInterface {
   override def astBuilder: AstBuilder
 
   /** Creates Expression for a given SQL string. */
-  override def parseExpression(sqlText: String): Expression = parse(sqlText) { 
parser =>
-val ctx = parser.singleExpression()
-withOrigin(ctx, Some(sqlText)) {
-  astBuilder.visitSingleExpression(ctx)
+  override def parseExpression(sqlText: String): Expression =
+parse(sqlText) { parser =>

Review Comment:
   shall we put the try-catch in `AbstractParser#parse` to avoid duplicated 
code here?
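   i.e. something like this sketch (hypothetical helper names; the error class 
is the `FAILED_TO_PARSE_TOO_COMPLEX` one added in this PR):
   
   ```
  protected def parse[T](sqlText: String)(toResult: SqlBaseParser => T): T = {
    try {
      // existing ANTLR setup + toResult(parser) invocation
      doParseInternal(sqlText)(toResult)
    } catch {
      case _: StackOverflowError =>
        throw tooComplexError(sqlText) // maps to FAILED_TO_PARSE_TOO_COMPLEX
    }
  }
   ```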






Re: [PR] [SPARK-47741] Added stack overflow handling in parser [spark]

2024-04-15 Thread via GitHub


cloud-fan commented on code in PR #45896:
URL: https://github.com/apache/spark/pull/45896#discussion_r1565716214


##
sql/core/src/test/scala/org/apache/spark/sql/errors/QueryParsingErrorsSuite.scala:
##
@@ -32,6 +32,22 @@ class QueryParsingErrorsSuite extends QueryTest with 
SharedSparkSession with SQL
 intercept[ParseException](sql(sqlText).collect())
   }
 
+  test("PARSE_STACK_OVERFLOW_ERROR: Stack overflow hit") {

Review Comment:
   how many seconds does this test run? 






Re: [PR] [SPARK-47741] Added stack overflow handling in parser [spark]

2024-04-15 Thread via GitHub


cloud-fan commented on code in PR #45896:
URL: https://github.com/apache/spark/pull/45896#discussion_r1565717129


##
common/utils/src/main/resources/error/error-classes.json:
##
@@ -1300,6 +1300,13 @@
 ],
 "sqlState" : "2203G"
   },
+  "FAILED_TO_PARSE_TOO_COMPLEX" : {
+"message" : [
+  "The statement, including potential SQL functions and referenced views,  
was too complex to parse.",

Review Comment:
   ```suggestion
 "The statement, including potential SQL functions and referenced 
views, was too complex to parse.",
   ```






[PR] [WIP][SPARK-47819][CONNECT][3.5] Use asynchronous callback for execution cleanup [spark]

2024-04-15 Thread via GitHub


xi-db opened a new pull request, #46064:
URL: https://github.com/apache/spark/pull/46064

   ([Original PR](https://github.com/apache/spark/pull/46027))
   
   ### What changes were proposed in this pull request?
   
   Expired sessions are regularly checked and cleaned up by a maintenance thread, but this process is currently synchronous. As a result, in rare cases, interrupting the execution thread of a query in a session can take hours, stalling the entire maintenance process and leaving a large amount of memory uncleared.
   
   We address this by introducing asynchronous callbacks for execution cleanup, 
avoiding synchronous joins of execution threads, and preventing the maintenance 
thread from stalling in the above scenarios. To be more specific, instead of 
calling `runner.join()` in `ExecutorHolder.close()`, we set a post-cleanup 
function as the callback through `runner.processOnCompletion`, which will be 
called asynchronously once the execution runner is completed or interrupted. In 
this way, the maintenance thread won't get blocked on joining an execution 
thread.
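   
   Below is a rough, self-contained sketch of this callback pattern. The names are illustrative stand-ins, not the actual Spark Connect classes: registration replaces the blocking join, and the callback fires exactly once whether the runner finishes before or after registration.
   
   ```scala
   import java.util.concurrent.atomic.AtomicReference
   
   class ExecutionRunnerSketch {
     private sealed trait State
     private case object Running extends State
     private final case class Registered(callback: () => Unit) extends State
     private case object Completed extends State
   
     private val state = new AtomicReference[State](Running)
   
     // Maintenance thread: never blocks. If the runner already finished,
     // run the post-cleanup callback immediately.
     def processOnCompletion(callback: () => Unit): Unit =
       state.getAndSet(Registered(callback)) match {
         case Completed => callback()
         case _ => () // the callback will fire in markCompleted()
       }
   
     // Execution thread: called once the query completes or is interrupted.
     def markCompleted(): Unit =
       state.getAndSet(Completed) match {
         case Registered(cb) => cb()
         case _ => ()
       }
   }
   ```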
   
   ### Why are the changes needed?
   
   In the rare cases mentioned above, performance can be severely affected.
   
   ### Does this PR introduce _any_ user-facing change?
   
   No.
   
   ### How was this patch tested?
   
   Existing tests and a new test `Async cleanup callback gets called after the 
execution is closed` in `SparkConnectServiceE2ESuite.scala`.
   
   ### Was this patch authored or co-authored using generative AI tooling?
   
   No.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-47839][SQL] Fix aggregate bug in RewriteWithExpression [spark]

2024-04-15 Thread via GitHub


cloud-fan commented on code in PR #46034:
URL: https://github.com/apache/spark/pull/46034#discussion_r1565727826


##
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/RewriteWithExpression.scala:
##
@@ -21,36 +21,57 @@ import scala.collection.mutable
 
 import org.apache.spark.SparkException
 import org.apache.spark.sql.catalyst.expressions._
-import org.apache.spark.sql.catalyst.plans.logical.{LogicalPlan, PlanHelper, 
Project}
+import org.apache.spark.sql.catalyst.planning.PhysicalAggregation
+import org.apache.spark.sql.catalyst.plans.logical.{Aggregate, LogicalPlan, 
PlanHelper, Project}
 import org.apache.spark.sql.catalyst.rules.Rule
 import org.apache.spark.sql.catalyst.trees.TreePattern.{COMMON_EXPR_REF, 
WITH_EXPRESSION}
 
 /**
  * Rewrites the `With` expressions by adding a `Project` to pre-evaluate the 
common expressions, or
  * just inline them if they are cheap.
  *
+ * Since this rule can introduce new `Project` operators, it is advised to run 
[[CollapseProject]]
+ * after this rule.
+ *
  * Note: For now we only use `With` in a few `RuntimeReplaceable` expressions. 
If we expand its
  *   usage, we should support aggregate/window functions as well.
  */
 object RewriteWithExpression extends Rule[LogicalPlan] {
   override def apply(plan: LogicalPlan): LogicalPlan = {
-
plan.transformDownWithSubqueriesAndPruning(_.containsPattern(WITH_EXPRESSION)) {
+
plan.transformUpWithSubqueriesAndPruning(_.containsPattern(WITH_EXPRESSION)) {
+  case p @ PhysicalAggregation(
+  groupingExpressions, aggregateExpressions, resultExpressions, child)
+  if p.expressions.exists(_.containsPattern(WITH_EXPRESSION)) =>
+// For aggregates, separate computation of the aggregations themselves 
from the final
+// result by moving the final result computation into a projection 
above. This prevents
+// this rule from producing an invalid Aggregate operator.
+// TODO: the names of these aliases will become outdated after the 
rewrite
+val aggExprs = aggregateExpressions.map(ae => Alias(ae, 
ae.toString)(ae.resultId))
+// Rewrite the projection and the aggregate separately and then piece 
them together.
+val agg = Aggregate(groupingExpressions, groupingExpressions ++ 
aggExprs, child)
+val rewrittenAgg = applyInternal(agg)
+val proj = Project(resultExpressions, rewrittenAgg)
+applyInternal(proj)
   case p if p.expressions.exists(_.containsPattern(WITH_EXPRESSION)) =>
-val inputPlans = p.children.toArray
-var newPlan: LogicalPlan = p.mapExpressions { expr =>
-  rewriteWithExprAndInputPlans(expr, inputPlans)
-}
-newPlan = newPlan.withNewChildren(inputPlans.toIndexedSeq)
-// Since we add extra Projects with extra columns to pre-evaluate the 
common expressions,
-// the current operator may have extra columns if it inherits the 
output columns from its
-// child, and we need to project away the extra columns to keep the 
plan schema unchanged.
-assert(p.output.length <= newPlan.output.length)
-if (p.output.length < newPlan.output.length) {
-  assert(p.outputSet.subsetOf(newPlan.outputSet))
-  Project(p.output, newPlan)
-} else {
-  newPlan
-}
+applyInternal(p)
+}
+  }
+
+  private def applyInternal(p: LogicalPlan): LogicalPlan = {
+val inputPlans = p.children.toArray
+var newPlan: LogicalPlan = p.mapExpressions { expr =>
+  rewriteWithExprAndInputPlans(expr, inputPlans)
+}
+newPlan = newPlan.withNewChildren(inputPlans)

Review Comment:
   the children do not change, so why call `withNewChildren`?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-47839][SQL] Fix aggregate bug in RewriteWithExpression [spark]

2024-04-15 Thread via GitHub


cloud-fan commented on code in PR #46034:
URL: https://github.com/apache/spark/pull/46034#discussion_r1565732326


##
sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/optimizer/RewriteWithExpressionSuite.scala:
##
@@ -29,25 +29,29 @@ import org.apache.spark.sql.types.IntegerType
 class RewriteWithExpressionSuite extends PlanTest {
 
   object Optimizer extends RuleExecutor[LogicalPlan] {
-val batches = Batch("Rewrite With expression", Once, 
RewriteWithExpression) :: Nil
+val batches = Batch("Rewrite With expression", Once,
+  PullOutGroupingExpressions,
+  RewriteWithExpression) :: Nil
   }
 
   private val testRelation = LocalRelation($"a".int, $"b".int)
   private val testRelation2 = LocalRelation($"x".int, $"y".int)
 
   test("simple common expression") {
 val a = testRelation.output.head
-val commonExprDef = CommonExpressionDef(a)
-val ref = new CommonExpressionRef(commonExprDef)
-val plan = testRelation.select(With(ref + ref, 
Seq(commonExprDef)).as("col"))
+val expr = With.create((a, 0)) { case Seq(ref) =>

Review Comment:
   does the id matter for this test?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-47839][SQL] Fix aggregate bug in RewriteWithExpression [spark]

2024-04-15 Thread via GitHub


cloud-fan commented on code in PR #46034:
URL: https://github.com/apache/spark/pull/46034#discussion_r1565739557


##
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/RewriteWithExpression.scala:
##
@@ -21,36 +21,57 @@ import scala.collection.mutable
 
 import org.apache.spark.SparkException
 import org.apache.spark.sql.catalyst.expressions._
-import org.apache.spark.sql.catalyst.plans.logical.{LogicalPlan, PlanHelper, 
Project}
+import org.apache.spark.sql.catalyst.planning.PhysicalAggregation
+import org.apache.spark.sql.catalyst.plans.logical.{Aggregate, LogicalPlan, 
PlanHelper, Project}
 import org.apache.spark.sql.catalyst.rules.Rule
 import org.apache.spark.sql.catalyst.trees.TreePattern.{COMMON_EXPR_REF, 
WITH_EXPRESSION}
 
 /**
  * Rewrites the `With` expressions by adding a `Project` to pre-evaluate the 
common expressions, or
  * just inline them if they are cheap.
  *
+ * Since this rule can introduce new `Project` operators, it is advised to run 
[[CollapseProject]]
+ * after this rule.
+ *
  * Note: For now we only use `With` in a few `RuntimeReplaceable` expressions. 
If we expand its
  *   usage, we should support aggregate/window functions as well.
  */
 object RewriteWithExpression extends Rule[LogicalPlan] {
   override def apply(plan: LogicalPlan): LogicalPlan = {
-
plan.transformDownWithSubqueriesAndPruning(_.containsPattern(WITH_EXPRESSION)) {
+
plan.transformUpWithSubqueriesAndPruning(_.containsPattern(WITH_EXPRESSION)) {
+  case p @ PhysicalAggregation(
+  groupingExpressions, aggregateExpressions, resultExpressions, child)
+  if p.expressions.exists(_.containsPattern(WITH_EXPRESSION)) =>
+// For aggregates, separate computation of the aggregations themselves 
from the final
+// result by moving the final result computation into a projection 
above. This prevents
+// this rule from producing an invalid Aggregate operator.
+// TODO: the names of these aliases will become outdated after the 
rewrite
+val aggExprs = aggregateExpressions.map(ae => Alias(ae, 
ae.toString)(ae.resultId))

Review Comment:
   The alias name doesn't matter, as it's only for internal bookkeeping, but `.toString` can be super long if the aggregate function input is a complex expression. Shall we follow `PullOutGroupingExpressions` and use consistent naming, such as `_aggregateExpression`?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-47741] Added stack overflow handling in parser [spark]

2024-04-15 Thread via GitHub


milastdbx commented on code in PR #45896:
URL: https://github.com/apache/spark/pull/45896#discussion_r1565726777


##
sql/core/src/test/scala/org/apache/spark/sql/errors/QueryParsingErrorsSuite.scala:
##
@@ -32,6 +32,22 @@ class QueryParsingErrorsSuite extends QueryTest with 
SharedSparkSession with SQL
 intercept[ParseException](sql(sqlText).collect())
   }
 
+  test("PARSE_STACK_OVERFLOW_ERROR: Stack overflow hit") {

Review Comment:
   [screenshot attachment did not upload]
   



##
sql/core/src/test/scala/org/apache/spark/sql/errors/QueryParsingErrorsSuite.scala:
##
@@ -32,6 +32,22 @@ class QueryParsingErrorsSuite extends QueryTest with 
SharedSparkSession with SQL
 intercept[ParseException](sql(sqlText).collect())
   }
 
+  test("PARSE_STACK_OVERFLOW_ERROR: Stack overflow hit") {

Review Comment:
   1sec 168ms



##
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/parser/AbstractSqlParser.scala:
##
@@ -30,50 +32,74 @@ abstract class AbstractSqlParser extends AbstractParser 
with ParserInterface {
   override def astBuilder: AstBuilder
 
   /** Creates Expression for a given SQL string. */
-  override def parseExpression(sqlText: String): Expression = parse(sqlText) { 
parser =>
-val ctx = parser.singleExpression()
-withOrigin(ctx, Some(sqlText)) {
-  astBuilder.visitSingleExpression(ctx)
+  override def parseExpression(sqlText: String): Expression =
+parse(sqlText) { parser =>

Review Comment:
   I don't think we can, as we need the context for proper error reporting, but the context can only be provided by the lambda executed by the `parse` method.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-47839][SQL] Fix aggregate bug in RewriteWithExpression [spark]

2024-04-15 Thread via GitHub


cloud-fan commented on code in PR #46034:
URL: https://github.com/apache/spark/pull/46034#discussion_r1565740751


##
sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/optimizer/RewriteWithExpressionSuite.scala:
##
@@ -229,4 +236,85 @@ class RewriteWithExpressionSuite extends PlanTest {
 .analyze
 )
   }
+
+  test("WITH expression in grouping exprs") {
+val a = testRelation.output.head
+val expr1 = With.create((a + 1, 0)) { case Seq(ref) =>
+  ref * ref
+}
+val expr2 = With.create((a + 1, 1)) { case Seq(ref) =>
+  ref * ref
+}
+val expr3 = With.create((a + 1, 2)) { case Seq(ref) =>
+  ref * ref
+}
+val plan = testRelation.groupBy(expr1)(
+  (expr2 + 2).as("col1"),
+  count(expr3 - 3).as("col2")
+)
+val commonExpr1Name = "_common_expr_0"
+// Note that _common_expr_1 gets deduplicated by 
PullOutGroupingExpressions.
+val commonExpr2Name = "_common_expr_2"
+val groupingExprName = "_groupingexpression"
+val countAlias = count(expr3 - 3).toString
+comparePlans(
+  Optimizer.execute(plan),
+  testRelation
+.select(testRelation.output :+ (a + 1).as(commonExpr1Name): _*)
+.select(testRelation.output :+
+  ($"$commonExpr1Name" * $"$commonExpr1Name").as(groupingExprName): _*)
+.select(testRelation.output ++ Seq($"$groupingExprName", (a + 
1).as(commonExpr2Name)): _*)
+.groupBy($"$groupingExprName")(
+  $"$groupingExprName",
+  count($"$commonExpr2Name" * $"$commonExpr2Name" - 3).as(countAlias)
+)
+.select(($"$groupingExprName" + 2).as("col1"), 
$"`$countAlias`".as("col2"))
+.analyze
+)
+// Running CollapseProject after the rule cleans up the unnecessary 
projections.
+comparePlans(
+  CollapseProject(Optimizer.execute(plan)),
+  testRelation
+.select(testRelation.output :+ (a + 1).as(commonExpr1Name): _*)
+.select(testRelation.output ++ Seq(
+  ($"$commonExpr1Name" * $"$commonExpr1Name").as(groupingExprName),
+  (a + 1).as(commonExpr2Name)): _*)
+.groupBy($"$groupingExprName")(
+  ($"$groupingExprName" + 2).as("col1"),
+  count($"$commonExpr2Name" * $"$commonExpr2Name" - 3).as("col2")
+)
+.analyze
+)
+  }
+
+  test("WITH expression in aggregate exprs") {

Review Comment:
   can we merge the two tests and have WITH in both grouping and aggregate 
expressions?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-47839][SQL] Fix aggregate bug in RewriteWithExpression [spark]

2024-04-15 Thread via GitHub


cloud-fan commented on code in PR #46034:
URL: https://github.com/apache/spark/pull/46034#discussion_r1565741611


##
sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/optimizer/RewriteWithExpressionSuite.scala:
##
@@ -229,4 +236,85 @@ class RewriteWithExpressionSuite extends PlanTest {
 .analyze
 )
   }
+
+  test("WITH expression in grouping exprs") {
+val a = testRelation.output.head
+val expr1 = With.create((a + 1, 0)) { case Seq(ref) =>
+  ref * ref
+}
+val expr2 = With.create((a + 1, 1)) { case Seq(ref) =>
+  ref * ref
+}
+val expr3 = With.create((a + 1, 2)) { case Seq(ref) =>
+  ref * ref
+}
+val plan = testRelation.groupBy(expr1)(
+  (expr2 + 2).as("col1"),
+  count(expr3 - 3).as("col2")
+)
+val commonExpr1Name = "_common_expr_0"
+// Note that _common_expr_1 gets deduplicated by 
PullOutGroupingExpressions.
+val commonExpr2Name = "_common_expr_2"
+val groupingExprName = "_groupingexpression"
+val countAlias = count(expr3 - 3).toString
+comparePlans(
+  Optimizer.execute(plan),
+  testRelation
+.select(testRelation.output :+ (a + 1).as(commonExpr1Name): _*)
+.select(testRelation.output :+
+  ($"$commonExpr1Name" * $"$commonExpr1Name").as(groupingExprName): _*)
+.select(testRelation.output ++ Seq($"$groupingExprName", (a + 
1).as(commonExpr2Name)): _*)
+.groupBy($"$groupingExprName")(
+  $"$groupingExprName",
+  count($"$commonExpr2Name" * $"$commonExpr2Name" - 3).as(countAlias)
+)
+.select(($"$groupingExprName" + 2).as("col1"), 
$"`$countAlias`".as("col2"))
+.analyze
+)
+// Running CollapseProject after the rule cleans up the unnecessary 
projections.
+comparePlans(
+  CollapseProject(Optimizer.execute(plan)),
+  testRelation
+.select(testRelation.output :+ (a + 1).as(commonExpr1Name): _*)
+.select(testRelation.output ++ Seq(
+  ($"$commonExpr1Name" * $"$commonExpr1Name").as(groupingExprName),
+  (a + 1).as(commonExpr2Name)): _*)
+.groupBy($"$groupingExprName")(
+  ($"$groupingExprName" + 2).as("col1"),
+  count($"$commonExpr2Name" * $"$commonExpr2Name" - 3).as("col2")
+)
+.analyze
+)
+  }
+
+  test("WITH expression in aggregate exprs") {
+val Seq(a, b) = testRelation.output
+val expr1 = With.create((a + 1, 0)) { case Seq(ref) =>
+  ref * ref

Review Comment:
   can we test an aggregate function as the common expression?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-47741] Added stack overflow handling in parser [spark]

2024-04-15 Thread via GitHub


cloud-fan commented on code in PR #45896:
URL: https://github.com/apache/spark/pull/45896#discussion_r1565750642


##
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/parser/AbstractSqlParser.scala:
##
@@ -30,50 +32,74 @@ abstract class AbstractSqlParser extends AbstractParser 
with ParserInterface {
   override def astBuilder: AstBuilder
 
   /** Creates Expression for a given SQL string. */
-  override def parseExpression(sqlText: String): Expression = parse(sqlText) { 
parser =>
-val ctx = parser.singleExpression()
-withOrigin(ctx, Some(sqlText)) {
-  astBuilder.visitSingleExpression(ctx)
+  override def parseExpression(sqlText: String): Expression =
+parse(sqlText) { parser =>
+  val ctx = parser.singleExpression()
+  withOrigin[Expression](ctx, Some(sqlText)) {
+withErrorHandling(ctx) {
+  astBuilder.visitSingleExpression(ctx)
+}
+  }
 }
-  }
 
   /** Creates TableIdentifier for a given SQL string. */
-  override def parseTableIdentifier(sqlText: String): TableIdentifier = 
parse(sqlText) { parser =>
-astBuilder.visitSingleTableIdentifier(parser.singleTableIdentifier())

Review Comment:
   Looking at the methods in this file, I think all of them should call `withOrigin(ctx, Some(sqlText))`, but some are missing it. Now is a good chance to fix that by doing it in the new `withErrorHandling` function.
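   
   A self-contained sketch of how such a wrapper could combine both concerns (illustrative stand-ins only, not Spark's actual internals):
   
   ```scala
   object ErrorHandlingSketch {
     final case class ParseException(errorClass: String) extends Exception(errorClass)
   
     // Stand-in for Spark's withOrigin: would attach the SQL text as context.
     def withOrigin[T](sqlText: Option[String])(body: => T): T = body
   
     // One wrapper applies origin tracking and overflow handling for every
     // parseXxx method, so none of them can forget either step.
     def withErrorHandling[T](sqlText: String)(body: => T): T =
       withOrigin(Some(sqlText)) {
         try body
         catch {
           case _: StackOverflowError =>
             throw ParseException("FAILED_TO_PARSE_TOO_COMPLEX")
         }
       }
   
     def main(args: Array[String]): Unit =
       println(withErrorHandling("SELECT 1")("parsed")) // prints: parsed
   }
   ```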



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-47741] Added stack overflow handling in parser [spark]

2024-04-15 Thread via GitHub


cloud-fan commented on code in PR #45896:
URL: https://github.com/apache/spark/pull/45896#discussion_r1565753990


##
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/parser/AbstractSqlParser.scala:
##
@@ -30,50 +32,74 @@ abstract class AbstractSqlParser extends AbstractParser 
with ParserInterface {
   override def astBuilder: AstBuilder
 
   /** Creates Expression for a given SQL string. */
-  override def parseExpression(sqlText: String): Expression = parse(sqlText) { 
parser =>
-val ctx = parser.singleExpression()
-withOrigin(ctx, Some(sqlText)) {
-  astBuilder.visitSingleExpression(ctx)
+  override def parseExpression(sqlText: String): Expression =
+parse(sqlText) { parser =>
+  val ctx = parser.singleExpression()
+  withOrigin[Expression](ctx, Some(sqlText)) {
+withErrorHandling(ctx) {
+  astBuilder.visitSingleExpression(ctx)
+}
+  }
 }
-  }
 
   /** Creates TableIdentifier for a given SQL string. */
-  override def parseTableIdentifier(sqlText: String): TableIdentifier = 
parse(sqlText) { parser =>
-astBuilder.visitSingleTableIdentifier(parser.singleTableIdentifier())

Review Comment:
   oh nvm, some do not have parser context



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-47411][SQL] Support INSTR & FIND_IN_SET functions to work with collated strings [spark]

2024-04-15 Thread via GitHub


uros-db commented on code in PR #45643:
URL: https://github.com/apache/spark/pull/45643#discussion_r1565766846


##
common/unsafe/src/main/java/org/apache/spark/sql/catalyst/util/CollationSupport.java:
##
@@ -138,72 +138,72 @@ public static boolean execICU(final UTF8String l, final 
UTF8String r,
   }
 
   public static class FindInSet {
-public static int exec(final UTF8String l, final UTF8String r, final int 
collationId) {
+public static int exec(final UTF8String word, final UTF8String set, final 
int collationId) {
   CollationFactory.Collation collation = 
CollationFactory.fetchCollation(collationId);
   if (collation.supportsBinaryEquality) {
-return execBinary(l, r);
+return execBinary(word, set);
   } else if (collation.supportsLowercaseEquality) {
-return execLowercase(l, r);
+return execLowercase(word, set);
   } else {
-return execICU(l, r, collationId);
+return execICU(word, set, collationId);
   }
 }
-public static String genCode(final String l, final String r, final int 
collationId) {
+public static String genCode(final String word, final String set, final 
int collationId) {
   CollationFactory.Collation collation = 
CollationFactory.fetchCollation(collationId);
   String expr = "CollationSupport.FindInSet.exec";
   if (collation.supportsBinaryEquality) {
-return String.format(expr + "Binary(%s, %s)", l, r);
+return String.format(expr + "Binary(%s, %s)", word, set);
   } else if (collation.supportsLowercaseEquality) {
-return String.format(expr + "Lowercase(%s, %s)", l, r);
+return String.format(expr + "Lowercase(%s, %s)", word, set);
   } else {
-return String.format(expr + "ICU(%s, %s, %d)", l, r, collationId);
+return String.format(expr + "ICU(%s, %s, %d)", word, set, collationId);
   }
 }
-public static int execBinary(final UTF8String l, final UTF8String r) {
-  return r.findInSet(l);
+public static int execBinary(final UTF8String word, final UTF8String set) {
+  return set.findInSet(word);
 }
-public static int execLowercase(final UTF8String l, final UTF8String r) {
-  return r.toLowerCase().findInSet(l.toLowerCase());
+public static int execLowercase(final UTF8String word, final UTF8String 
set) {
+  return set.toLowerCase().findInSet(word.toLowerCase());
 }
-public static int execICU(final UTF8String l, final UTF8String r,
+public static int execICU(final UTF8String word, final UTF8String set,
   final int collationId) {
-  return CollationAwareUTF8String.findInSet(l, r, collationId);
+  return CollationAwareUTF8String.findInSet(word, set, collationId);
 }
   }
 
-  public static class IndexOf {
-public static int exec(final UTF8String l, final UTF8String r, final int 
start,
+  public static class StringInstr {
+public static int exec(final UTF8String string, final UTF8String substring,
 final int collationId) {
   CollationFactory.Collation collation = 
CollationFactory.fetchCollation(collationId);
   if (collation.supportsBinaryEquality) {
-return execBinary(l, r, start);
+return execBinary(string, substring);
   } else if (collation.supportsLowercaseEquality) {
-return execLowercase(l, r, start);
+return execLowercase(string, substring);
   } else {
-return execICU(l, r, start, collationId);
+return execICU(string, substring, collationId);
   }
 }
-public static String genCode(final String l, final String r, final int 
start,
+public static String genCode(final String string, final String substring, 
final int start,
 final int collationId) {
   CollationFactory.Collation collation = 
CollationFactory.fetchCollation(collationId);
-  String expr = "CollationSupport.IndexOf.exec";
+  String expr = "CollationSupport.StringInstr.exec";
   if (collation.supportsBinaryEquality) {
-return String.format(expr + "Binary(%s, %s, %d)", l, r, start);
+return String.format(expr + "Binary(%s, %s, %d)", string, substring, 
start);
   } else if (collation.supportsLowercaseEquality) {
-return String.format(expr + "Lowercase(%s, %s, %d)", l, r, start);
+return String.format(expr + "Lowercase(%s, %s, %d)", string, 
substring, start);
   } else {
-return String.format(expr + "ICU(%s, %s, %d, %d)", l, r, start, 
collationId);
+return String.format(expr + "ICU(%s, %s, %d, %d)", string, substring, 
start, collationId);
   }
 }
-public static int execBinary(final UTF8String l, final UTF8String r, final 
int start) {
-  return l.indexOf(r, start);
+public static int execBinary(final UTF8String string, final UTF8String 
substring) {
+  return string.indexOf(substring, 0);
 }
-public static int execLowercase(final UTF8String l, final 

Re: [PR] [SPARK-47411][SQL] Support INSTR & FIND_IN_SET functions to work with collated strings [spark]

2024-04-15 Thread via GitHub


uros-db commented on code in PR #45643:
URL: https://github.com/apache/spark/pull/45643#discussion_r1565770706


##
common/unsafe/src/test/java/org/apache/spark/unsafe/types/CollationSupportSuite.java:
##
@@ -249,6 +249,86 @@ public void testEndsWith() throws SparkException {
 assertEndsWith("äbćδe", "ÄBcΔÉ", "UNICODE_CI", false);
   }
 
+  private void assertStringInstr(String string, String substring, String 
collationName,

Review Comment:
   This is important: Vladimir noticed that we've been using `assertEquals` incorrectly in `CollationSupportSuite.java`.
   
   Please fix this according to his PR changes: https://github.com/apache/spark/pull/46058
   
   Since all of the asserts pass here, we wouldn't notice it, but the correct order is: first the **expected** value, and then the **received** value.
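   
   A minimal illustration of that order, with a hypothetical helper standing in for the code under test:
   
   ```scala
   import org.junit.Assert.assertEquals
   
   class AssertOrderExample {
     private def computeIndex(): Int = 4 // stand-in for the code under test
   
     // Correct: expected value first. A failure reads "expected:<4> but was:<...>".
     def correctOrder(): Unit = assertEquals(4, computeIndex())
   
     // Swapped: still passes when the values match, but a failure message
     // would label the expected and received values backwards.
     def swappedOrder(): Unit = assertEquals(computeIndex(), 4)
   }
   ```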



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-47411][SQL] Support INSTR & FIND_IN_SET functions to work with collated strings [spark]

2024-04-15 Thread via GitHub


uros-db commented on code in PR #45643:
URL: https://github.com/apache/spark/pull/45643#discussion_r1565771738


##
common/unsafe/src/test/java/org/apache/spark/unsafe/types/CollationSupportSuite.java:
##
@@ -249,6 +249,86 @@ public void testEndsWith() throws SparkException {
 assertEndsWith("äbćδe", "ÄBcΔÉ", "UNICODE_CI", false);
   }
 
+  private void assertStringInstr(String string, String substring, String 
collationName,
+  Integer value) throws SparkException {
+UTF8String str = UTF8String.fromString(string);
+UTF8String substr = UTF8String.fromString(substring);
+int collationId = CollationFactory.collationNameToId(collationName);
+
+assertEquals(CollationSupport.StringInstr.exec(str, substr, collationId) + 
1, value);

Review Comment:
   ```suggestion
   assertEquals(expected, CollationSupport.StringInstr.exec(str, substr, 
collationId) + 1);
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-47411][SQL] Support INSTR & FIND_IN_SET functions to work with collated strings [spark]

2024-04-15 Thread via GitHub


uros-db commented on code in PR #45643:
URL: https://github.com/apache/spark/pull/45643#discussion_r1565772062


##
common/unsafe/src/test/java/org/apache/spark/unsafe/types/CollationSupportSuite.java:
##
@@ -249,6 +249,86 @@ public void testEndsWith() throws SparkException {
 assertEndsWith("äbćδe", "ÄBcΔÉ", "UNICODE_CI", false);
   }
 
+  private void assertStringInstr(String string, String substring, String 
collationName,
+  Integer value) throws SparkException {
+UTF8String str = UTF8String.fromString(string);
+UTF8String substr = UTF8String.fromString(substring);
+int collationId = CollationFactory.collationNameToId(collationName);
+
+assertEquals(CollationSupport.StringInstr.exec(str, substr, collationId) + 
1, value);
+  }
+  
+  @Test
+  public void testStringInstr() throws SparkException {
+assertStringInstr("aaads", "Aa", "UTF8_BINARY", 0);
+assertStringInstr("aaaDs", "de", "UTF8_BINARY", 0);
+assertStringInstr("aaads", "ds", "UTF8_BINARY", 4);
+assertStringInstr("", "", "UTF8_BINARY", 1);
+assertStringInstr("", "", "UTF8_BINARY", 0);
+assertStringInstr("test大千世界X大千世界", "大千", "UTF8_BINARY", 5);
+assertStringInstr("test大千世界X大千世界", "界X", "UTF8_BINARY", 8);
+assertStringInstr("aaads", "Aa", "UTF8_BINARY_LCASE", 1);
+assertStringInstr("aaaDs", "de", "UTF8_BINARY_LCASE", 0);
+assertStringInstr("aaaDs", "ds", "UTF8_BINARY_LCASE", 4);
+assertStringInstr("", "", "UTF8_BINARY_LCASE", 1);
+assertStringInstr("", "", "UTF8_BINARY_LCASE", 0);
+assertStringInstr("test大千世界X大千世界", "大千", "UTF8_BINARY_LCASE", 5);
+assertStringInstr("test大千世界X大千世界", "界x", "UTF8_BINARY_LCASE", 8);
+assertStringInstr("aaads", "Aa", "UNICODE", 0);
+assertStringInstr("aaads", "aa", "UNICODE", 1);
+assertStringInstr("aaads", "de", "UNICODE", 0);
+assertStringInstr("", "", "UNICODE", 1);
+assertStringInstr("", "", "UNICODE", 0);
+assertStringInstr("test大千世界X大千世界", "界x", "UNICODE", 0);
+assertStringInstr("test大千世界X大千世界", "界X", "UNICODE", 8);
+assertStringInstr("aaads", "AD", "UNICODE_CI", 3);
+assertStringInstr("aaads", "dS", "UNICODE_CI", 4);
+assertStringInstr("test大千世界X大千世界", "界x", "UNICODE_CI", 8);
+  }
+
+  //word: String, set: String, collationId: Integer, expected: Integer
+  private void assertFindInSet(String word, String set, String collationName,
+ Integer value) throws SparkException {
+UTF8String w = UTF8String.fromString(word);
+UTF8String s = UTF8String.fromString(set);
+int collationId = CollationFactory.collationNameToId(collationName);
+
+assertEquals(CollationSupport.FindInSet.exec(w, s, collationId), value);

Review Comment:
   ```suggestion
   assertEquals(expected, CollationSupport.FindInSet.exec(w, s, 
collationId));
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [WIP][SPARK-47818][CONNECT] Introduce plan cache in SparkConnectPlanner to improve performance of Analyze requests [spark]

2024-04-15 Thread via GitHub


vicennial commented on code in PR #46012:
URL: https://github.com/apache/spark/pull/46012#discussion_r1565802827


##
connector/connect/server/src/main/scala/org/apache/spark/sql/connect/service/SessionHolder.scala:
##
@@ -381,6 +405,53 @@ case class SessionHolder(userId: String, sessionId: 
String, session: SparkSessio
*/
   private[connect] val pythonAccumulator: Option[PythonAccumulator] =
 Try(session.sparkContext.collectionAccumulator[Array[Byte]]).toOption
+
+  /**
+   * Transform a relation into a logical plan, using the plan cache if enabled.
+   * The plan cache is enable only if 
`spark.connect.session.planCache.maxSize` is greater than zero
+   * AND `spark.connect.session.planCache.enabled` is true.
+   * @param rel The relation to transform.
+   * @param cachePlan Whether to cache the result logical plan.
+   * @param transform Function to transform the relation into a logical plan.
+   * @return The logical plan.
+   */
+  private[connect] def usePlanCache(rel: proto.Relation, cachePlan: Boolean)(

Review Comment:
   > 1, LOCAL_RELATION, its size might be large;
   > 2, CACHED_REMOTE_RELATION, it is already cached in dataFrameCache in the 
SessionHolder;
   
   We use a low default value for `CONNECT_SESSION_PLAN_CACHE_SIZE` (`5` entries, to be specific) to prevent the cache from becoming too large. We haven't been too surgical about which kinds of plans we cache; the intent is to keep this a simple wrapper that helps common scenarios which may generate repeated Analyze/Execute calls.
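   
   For intuition, a self-contained sketch of the bounded-cache idea behind that low default: a tiny access-ordered LRU map capped at `maxSize` entries. This only illustrates the concept; Spark Connect's actual plan cache implementation and configuration handling differ.
   
   ```scala
   import java.util.{Collections, LinkedHashMap => JLinkedHashMap, Map => JMap}
   
   class BoundedPlanCache[K, V](maxSize: Int) {
     // Access-ordered LinkedHashMap that evicts the eldest entry past maxSize.
     private val underlying = Collections.synchronizedMap(
       new JLinkedHashMap[K, V](16, 0.75f, /* accessOrder = */ true) {
         override protected def removeEldestEntry(eldest: JMap.Entry[K, V]): Boolean =
           size() > maxSize
       })
   
     def getOrElseUpdate(key: K, compute: => V): V =
       Option(underlying.get(key)).getOrElse {
         val value = compute
         underlying.put(key, value)
         value
       }
   }
   ```
   
   For example, `new BoundedPlanCache[String, AnyRef](5)` mirrors the five-entry default mentioned above.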



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-47297][SQL] Add collations support to split regex expression [spark]

2024-04-15 Thread via GitHub


nikolamand-db commented on code in PR #45856:
URL: https://github.com/apache/spark/pull/45856#discussion_r1565804085


##
sql/core/src/test/scala/org/apache/spark/sql/CollationRegexpExpressionsSuite.scala:
##
@@ -116,26 +116,37 @@ class CollationRegexpExpressionsSuite
 
   test("Support StringSplit string expression with collation") {
 // Supported collations
-case class StringSplitTestCase[R](l: String, r: String, c: String, result: 
R)
+case class StringSplitTestCase[R](l: String, r: String, c: String, result: 
R, limit: Int = -1)
 val testCases = Seq(
-  StringSplitTestCase("ABC", "[B]", "UTF8_BINARY", Seq("A", "C"))
+  StringSplitTestCase("ABC", "[B]", "UTF8_BINARY", Seq("A", "C")),
+  StringSplitTestCase("ABC", "[b]", "UTF8_BINARY", Seq("ABC")),
+  StringSplitTestCase("ABC", "[b]", "UTF8_BINARY_LCASE", Seq("A", "C")),
+  StringSplitTestCase("AAA", "[a]", "UTF8_BINARY_LCASE", Seq("", "", "", 
"")),
+  StringSplitTestCase("AAA", "[b]", "UTF8_BINARY_LCASE", Seq("AAA")),
+  StringSplitTestCase("aAbB", "[ab]", "UTF8_BINARY_LCASE", Seq("", "", "", 
"", "")),
+  StringSplitTestCase("", "", "UTF8_BINARY_LCASE", Seq("")),
+  StringSplitTestCase("", "[a]", "UTF8_BINARY_LCASE", Seq("")),
+  StringSplitTestCase("xAxBxaxbx", "[AB]", "UTF8_BINARY_LCASE", Seq("x", 
"x", "x", "x", "x")),
+  StringSplitTestCase("ABC", "", "UTF8_BINARY_LCASE", Seq("A", "B", "C")),
+  // test split with limit
+  StringSplitTestCase("ABC", "[b]", "UTF8_BINARY_LCASE", Seq("ABC"), 1),
+  StringSplitTestCase("ABC", "[b]", "UTF8_BINARY_LCASE", Seq("A", "C"), 2),
+  StringSplitTestCase("ABC", "[b]", "UTF8_BINARY_LCASE", Seq("A", "C"), 3),
+  StringSplitTestCase("ABC", "[B]", "UNICODE", Seq("A", "C")),
+  StringSplitTestCase("ABC", "[b]", "UNICODE", Seq("ABC"))
 )
 testCases.foreach(t => {
-  val query = s"SELECT split(collate('${t.l}', '${t.c}'), 
collate('${t.r}', '${t.c}'))"
+  val query = s"SELECT split(collate('${t.l}', '${t.c}'), '${t.r}', 
${t.limit})"
   // Result & data type
   checkAnswer(sql(query), Row(t.result))
   
assert(sql(query).schema.fields.head.dataType.sameType(ArrayType(StringType(t.c
   // TODO: Implicit casting (not currently supported)

Review Comment:
   Removed both TODOs since string split doesn't have any custom collation cast 
logic as regex parameter's collation is irrelevant.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org


