Re: [PR] [SPARK-47639] Support codegen for json_tuple. [spark]

2024-04-08 Thread via GitHub


leixm commented on PR #45765:
URL: https://github.com/apache/spark/pull/45765#issuecomment-2042017037

   @LuciferYang  @MaxGekk  PTAL.





Re: [PR] [SPARK-47680][SQL] Add variant_explode expression. [spark]

2024-04-08 Thread via GitHub


cloud-fan commented on code in PR #45805:
URL: https://github.com/apache/spark/pull/45805#discussion_r1555333257


##
sql/core/src/test/scala/org/apache/spark/sql/VariantSuite.scala:
##
@@ -298,4 +300,28 @@ class VariantSuite extends QueryTest with 
SharedSparkSession {
 }
 assert(ex.getErrorClass == "DATATYPE_MISMATCH.INVALID_ORDERING_TYPE")
   }
+
+  test("variant_explode") {
+def check(input: String, expected: Seq[Row]): Unit = {
+  withView("v") {
+Seq(input).toDF("json").createOrReplaceTempView("v")
+checkAnswer(sql("select pos, key, to_json(value) from v, " +
+  "lateral variant_explode(parse_json(json))"), expected)
+val expectedOuter = if (expected.isEmpty) Seq(Row(null, null, null)) 
else expected
+checkAnswer(sql("select pos, key, to_json(value) from v, " +
+  "lateral variant_explode_outer(parse_json(json))"), expectedOuter)
+  }
+}
+
+Seq("true", "false").foreach { codegenEnabled =>
+  withSQLConf(SQLConf.WHOLESTAGE_CODEGEN_ENABLED.key -> codegenEnabled) {
+check(null, Nil)
+check("1", Nil)

Review Comment:
   @srielau shall we fail if the input variant is not array or struct?






Re: [PR] [SPARK-47274][PYTHON][SQL] Provide more useful context for PySpark DataFrame API errors [spark]

2024-04-08 Thread via GitHub


itholic commented on code in PR #45377:
URL: https://github.com/apache/spark/pull/45377#discussion_r1555359203


##
sql/core/src/main/scala/org/apache/spark/sql/Column.scala:
##
@@ -699,6 +699,13 @@ class Column(val expr: Expression) extends Logging {
*/
   def plus(other: Any): Column = this + other
 
+  def plusWithPySparkLoggingInfo(
+  other: Any, loggingInfo: java.util.Map[String, String]): Column = {
+withOrigin(Some(loggingInfo)) {
+  this + other
+}
+  }

Review Comment:
   Hi, @cloud-fan I added these kinds of new `Column` expression methods to take the Python call site information properly, along with extending the implementation of the existing `Origin` and `WithOrigin`.
   
   Thanks for your suggestion; the call site correctness problem now seems to be resolved, but I'm wondering whether these `{name}PySparkLoggingInfo` methods should be added to every `Column` API.
   
   As this design can change after review, I added only four functions - `plus`, `divide`, `minus`, and `multiply` - as a prototype for now.






Re: [PR] [SPARK-47590][SQL] Hive-thriftserver: Migrate logWarn with variables to structured logging framework [spark]

2024-04-08 Thread via GitHub


itholic commented on code in PR #45923:
URL: https://github.com/apache/spark/pull/45923#discussion_r1555367380


##
sql/hive-thriftserver/src/main/scala/org/apache/spark/sql/hive/thriftserver/SparkSQLCLIDriver.scala:
##
@@ -231,14 +232,15 @@ private[hive] object SparkSQLCLIDriver extends Logging {
 val historyFile = historyDirectory + File.separator + ".hivehistory"
 reader.setHistory(new FileHistory(new File(historyFile)))
   } else {
-logWarning("WARNING: Directory for Hive history file: " + 
historyDirectory +
-   " does not exist.   History will not be available 
during this session.")
+logWarning(
+  log"WARNING: Directory for Hive history file: ${MDC(HISTORY_DIR, 
historyDirectory)}" +
+log" does not exist.   History will not be available during this 
session.")

Review Comment:
   Sure. I was just trying to keep the existing error message as it is, but it would be better to keep only one.
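   
   For reference, a minimal sketch of keeping only one prefix (same MDC key as in the diff above; only the redundant "WARNING:" text is dropped):
   
   ```scala
   // Same logWarning call as in the diff, minus the duplicated "WARNING:" prefix
   // that the log level already conveys.
   logWarning(
     log"Directory for Hive history file: ${MDC(HISTORY_DIR, historyDirectory)}" +
       log" does not exist. History will not be available during this session.")
   ```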






Re: [PR] [SPARK-45959][SQL] Improving performance when addition of 1 column at a time causes increase in the LogicalPlan tree depth [spark]

2024-04-08 Thread via GitHub


ahshahid commented on PR #43854:
URL: https://github.com/apache/spark/pull/43854#issuecomment-2042082750

   The caching issue is fixed in this PR.
   That was the complex part.
   It will not miss any cache.
   I have described the approach in the PR description.
   And, as I mentioned, it makes the cache lookup code much more robust, as described in
   the other bug I filed.
   
   
   On Mon, Apr 8, 2024, 12:22 AM Wenchen Fan ***@***.***> wrote:
   
   > This is a well-known issue. The suggested fix is to ask users to not chain
   > transformations too much, and use "batch" like APIs such as
   > Dataset#withColumns.
   >
   > How does this PR fix the issue without the problem mentioned in 23d9822
   > 

   > ?
   >
   > —
   > Reply to this email directly, view it on GitHub
   > , or
   > unsubscribe
   > 

   > .
   > You are receiving this because you authored the thread.Message ID:
   > ***@***.***>
   >
   





Re: [PR] [SPARK-45959][SQL] Improving performance when addition of 1 column at a time causes increase in the LogicalPlan tree depth [spark]

2024-04-08 Thread via GitHub


ahshahid commented on PR #43854:
URL: https://github.com/apache/spark/pull/43854#issuecomment-2042090754

   I understand the suggestion is to not use the API to add a single column, but I
   have come across many companies that generate DataFrames via some loop
   logic. In my previous work I have seen query plans containing 40 million
   plus Project nodes alone (not counting filters, joins, windows, etc.).
   
   There are other customers who are now seeing query compilation times
   increase from 3 minutes to 2-plus hours, due to the de-dup relation rule or plan
   cloning at every stage.
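   
   For context, a minimal sketch of the two patterns being discussed (a per-column loop versus the batch `Dataset#withColumns` API mentioned above; the DataFrame and column names are made up):
   
   ```scala
   import org.apache.spark.sql.functions.col
   
   // Per-column loop: every withColumn wraps the plan in another Project,
   // so the logical plan gets deeper with each generated column.
   var looped = df
   (1 to 1000).foreach { i =>
     looped = looped.withColumn(s"c$i", col("base") + i)
   }
   
   // Batch alternative suggested in the thread: one withColumns call, one Project.
   val batched = df.withColumns(
     (1 to 1000).map(i => s"c$i" -> (col("base") + i)).toMap)
   ```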
   
   On Mon, Apr 8, 2024, 12:48 AM Asif Shahid ***@***.***> wrote:
   
   > Caching issue is fixed in this PR.
   > That was the complex part.
   > It will not miss any cache.
   > I have described the approach in PR description.
   > And as I mentioned it makes cache lookup code much robust as described in
   > other bug I filed.
   >
   >
   > On Mon, Apr 8, 2024, 12:22 AM Wenchen Fan ***@***.***>
   > wrote:
   >
   >> This is a well-known issue. The suggested fix is to ask users to not
   >> chain transformations too much, and use "batch" like APIs such as
   >> Dataset#withColumns.
   >>
   >> How does this PR fix the issue without the problem mentioned in 23d9822
   >> 

   >> ?
   >>
   >> —
   >> Reply to this email directly, view it on GitHub
   >> , or
   >> unsubscribe
   >> 

   >> .
   >> You are receiving this because you authored the thread.Message ID:
   >> ***@***.***>
   >>
   >
   





Re: [PR] [SPARK-47411][SQL] Support INSTR & FIND_IN_SET functions to work with collated strings [spark]

2024-04-08 Thread via GitHub


dbatomic commented on code in PR #45643:
URL: https://github.com/apache/spark/pull/45643#discussion_r1555378573


##
common/unsafe/src/main/java/org/apache/spark/unsafe/types/UTF8String.java:
##
@@ -549,6 +549,51 @@ public int findInSet(UTF8String match) {
 return 0;
   }
 
+  public int findInSet(UTF8String match, int collationId) {
+if (CollationFactory.fetchCollation(collationId).supportsBinaryEquality) {
+  return this.findInSet(match);
+}
+if (collationId == CollationFactory.UTF8_BINARY_LCASE_COLLATION_ID) {

Review Comment:
   We don't expect any more special collations besides `UTF8_BINARY_LCASE`. Everything else will be based on ICU.






Re: [PR] [SPARK-47762][PYTHON][CONNECT] Add pyspark.sql.connect.protobuf into setup.py [spark]

2024-04-08 Thread via GitHub


HyukjinKwon commented on PR #45924:
URL: https://github.com/apache/spark/pull/45924#issuecomment-2042102698

   cc @zhengruifeng 





Re: [PR] [WIP][SPARK-41811][PYTHON][CONNECT] Implement `SQLStringFormatter` with `WithRelations` [spark]

2024-04-08 Thread via GitHub


HyukjinKwon commented on PR #45614:
URL: https://github.com/apache/spark/pull/45614#issuecomment-2042103298

   Let's just go ahead as all comments are addressed.





Re: [PR] [SPARK-47590][SQL] Hive-thriftserver: Migrate logWarn with variables to structured logging framework [spark]

2024-04-08 Thread via GitHub


itholic commented on code in PR #45923:
URL: https://github.com/apache/spark/pull/45923#discussion_r1555389485


##
sql/hive-thriftserver/src/main/scala/org/apache/spark/sql/hive/thriftserver/ui/HiveThriftServer2Listener.scala:
##
@@ -218,7 +232,9 @@ private[thriftserver] class HiveThriftServer2Listener(
 executionData.state = ExecutionState.CLOSED
 updateStoreWithTriggerEnabled(executionData)
 executionList.remove(e.id)
-  case None => logWarning(s"onOperationClosed called with unknown 
operation id: ${e.id}")
+  case None => logWarning(
+log"onOperationClosed called with unknown operation id: 
${MDC(OPERATION_ID, e.id)}"

Review Comment:
   Maybe we can keep the current behavior as it is?
   
   I'm a bit afraid that it would become even more complicated, because most of them are slightly different except for the "called with unknown" part.
   
   For example, we might need at least 3 parameters, and the common method would look like:
   
   ```scala
   private def logWarningWithUnknownId(
       eventType: String, id: String, additionalMessage: String = ""): Unit = {
     val idType = eventType match {
       case "operation" => OPERATION_ID
       case "session" => SESSION_ID
     }
     logWarning(
       log"$eventType called with unknown $idType: ${MDC(idType, id)}." + additionalMessage)
   }
   ```






Re: [PR] [SPARK-47411][SQL] Support INSTR & FIND_IN_SET functions to work with collated strings [spark]

2024-04-08 Thread via GitHub


dbatomic commented on code in PR #45643:
URL: https://github.com/apache/spark/pull/45643#discussion_r1555389826


##
common/unsafe/src/main/java/org/apache/spark/unsafe/types/UTF8String.java:
##
@@ -549,6 +549,51 @@ public int findInSet(UTF8String match) {
 return 0;
   }
 
+  public int findInSet(UTF8String match, int collationId) {
+if (CollationFactory.fetchCollation(collationId).supportsBinaryEquality) {
+  return this.findInSet(match);
+}
+if (collationId == CollationFactory.UTF8_BINARY_LCASE_COLLATION_ID) {

Review Comment:
   That being said, I also don't like the current direction of pushing everything into `UTF8String`. Let me see if we can come up with a cleaner approach.






Re: [PR] [SPARK-47762][PYTHON][CONNECT] Add pyspark.sql.connect.protobuf into setup.py [spark]

2024-04-08 Thread via GitHub


HyukjinKwon commented on PR #45924:
URL: https://github.com/apache/spark/pull/45924#issuecomment-2042116207

   Merged to master, and branch-3.5.





Re: [PR] [SPARK-47411][SQL] Support INSTR & FIND_IN_SET functions to work with collated strings [spark]

2024-04-08 Thread via GitHub


miland-db commented on code in PR #45643:
URL: https://github.com/apache/spark/pull/45643#discussion_r1555394471


##
common/unsafe/src/main/java/org/apache/spark/unsafe/types/UTF8String.java:
##
@@ -549,6 +549,51 @@ public int findInSet(UTF8String match) {
 return 0;
   }
 
+  public int findInSet(UTF8String match, int collationId) {
+if (CollationFactory.fetchCollation(collationId).supportsBinaryEquality) {
+  return this.findInSet(match);
+}
+if (collationId == CollationFactory.UTF8_BINARY_LCASE_COLLATION_ID) {

Review Comment:
   We can solve this in the same way we did it here: [STRING_LOCATE](https://github.com/apache/spark/pull/45791). In the next PRs we updated `CollationFactory` to be able to return a "lowercase collator" so we can unify the way we deal with collated strings. If you prefer, I can update this PR with that change.









Re: [PR] [SPARK-47762][PYTHON][CONNECT] Add pyspark.sql.connect.protobuf into setup.py [spark]

2024-04-08 Thread via GitHub


HyukjinKwon closed pull request #45924: [SPARK-47762][PYTHON][CONNECT] Add 
pyspark.sql.connect.protobuf into setup.py
URL: https://github.com/apache/spark/pull/45924





[PR] [SPARK-47761][SQL] Oracle: Support reading AnsiIntervalTypes [spark]

2024-04-08 Thread via GitHub


yaooqinn opened a new pull request, #45925:
URL: https://github.com/apache/spark/pull/45925

   
   
   ### What changes were proposed in this pull request?
   
   
   In this PR, I proposed to add support for reading well-defined ANSI interval 
types from Oracle databases.
   
   - INTERVAL YEAR [(year_precision)] TO MONTH
 - Stores a period of time in years and months, where year_precision is the 
number of digits in the YEAR datetime field. Accepted values are 0 to 9. The 
default is 2. The size is fixed at 5 bytes.
   - INTERVAL DAY [(day_precision)] TO SECOND [(fractional_seconds_precision)]
 - Stores a period of time in days, hours, minutes, and seconds, where
   - day_precision is the maximum number of digits in the DAY datetime 
field. Accepted values are 0 to 9. The default is 2.
   - fractional_seconds_precision is the number of digits in the fractional 
part of the SECOND field. Accepted values are 0 to 9. The default is 6. The 
size is fixed at 11 bytes.
   
   Both of them are mapped to the defaults of AnsiIntervalTypes.
   
   We also add two developer APIs that convert interval strings to the underlying representations of YearMonthIntervalType and DayTimeIntervalType. The default implementations assume that the inputs are compatible with the ANSI style; incompatible values will fail to be parsed. However, it should fail much earlier, at the data type mapping step, because of undefined mapping rules.
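   
   A hedged usage sketch of what this enables (the connection options and table name are hypothetical; the printed schema reflects the mapping described above):
   
   ```scala
   // Reading an Oracle table whose columns use the ANSI interval types; with this PR
   // they should map to Spark's default YearMonthIntervalType / DayTimeIntervalType.
   val df = spark.read
     .format("jdbc")
     .option("url", "jdbc:oracle:thin:@//dbhost:1521/service")  // hypothetical connection
     .option("dbtable", "interval_tab")                         // hypothetical table
     .option("user", "test_user")
     .option("password", "test_password")
     .load()
   
   df.printSchema()
   // expected, per this PR:
   //  |-- ym_col: interval year to month (nullable = true)
   //  |-- ds_col: interval day to second (nullable = true)
   ```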
   
   ### Why are the changes needed?
   
   
   Improve Oracle accessibility, as it's safe to read Oracle external intervals.
   
   
   ### Does this PR introduce _any_ user-facing change?
   
   
   Yes, reading Oracle intervals is available now.
   ### How was this patch tested?
   new tests
   
   
   ### Was this patch authored or co-authored using generative AI tooling?
   no
   





Re: [PR] [SPARK-47748][BUILD] Upgrade `zstd-jni` to 1.5.6-2 [spark]

2024-04-08 Thread via GitHub


yaooqinn closed pull request #45909: [SPARK-47748][BUILD] Upgrade `zstd-jni` to 
1.5.6-2
URL: https://github.com/apache/spark/pull/45909





Re: [PR] [SPARK-47748][BUILD] Upgrade `zstd-jni` to 1.5.6-2 [spark]

2024-04-08 Thread via GitHub


yaooqinn commented on PR #45909:
URL: https://github.com/apache/spark/pull/45909#issuecomment-2042137351

   Merged to master. thank you @panbingkun 





Re: [PR] [SPARK-47567][SQL] Support LOCATE function to work with collated strings [spark]

2024-04-08 Thread via GitHub


dbatomic commented on code in PR #45791:
URL: https://github.com/apache/spark/pull/45791#discussion_r1555413439


##
common/unsafe/src/main/java/org/apache/spark/sql/catalyst/util/CollationFactory.java:
##
@@ -176,15 +176,31 @@ public Collation(
*/
 
   public static StringSearch getStringSearch(
-  final UTF8String left,
-  final UTF8String right,
+  final UTF8String targetUTF8String,
+  final UTF8String patternUTF8String,
   final int collationId) {
-String pattern = right.toString();
-CharacterIterator target = new StringCharacterIterator(left.toString());
+
+if (collationId == UTF8_BINARY_COLLATION_ID) {
+  return getStringSearch(targetUTF8String, patternUTF8String);

Review Comment:
   Why would we ever do this for UTF8_BINARY? For UTF8_BINARY we should just stay at the binary level and avoid the string copy.






Re: [PR] [SPARK-47567][SQL] Support LOCATE function to work with collated strings [spark]

2024-04-08 Thread via GitHub


miland-db commented on code in PR #45791:
URL: https://github.com/apache/spark/pull/45791#discussion_r1555419316


##
common/unsafe/src/main/java/org/apache/spark/sql/catalyst/util/CollationFactory.java:
##
@@ -176,15 +176,31 @@ public Collation(
*/
 
   public static StringSearch getStringSearch(
-  final UTF8String left,
-  final UTF8String right,
+  final UTF8String targetUTF8String,
+  final UTF8String patternUTF8String,
   final int collationId) {
-String pattern = right.toString();
-CharacterIterator target = new StringCharacterIterator(left.toString());
+
+if (collationId == UTF8_BINARY_COLLATION_ID) {
+  return getStringSearch(targetUTF8String, patternUTF8String);

Review Comment:
   Following [comment1](https://github.com/apache/spark/pull/45791#discussion_r1549312876) and [comment2](https://github.com/apache/spark/pull/45791#discussion_r1551446372), we decided to do this just in case, because this is a public method and all collationId values should be handled. We never call this method with `collationId == 0`.






Re: [PR] [SPARK-47567][SQL] Support LOCATE function to work with collated strings [spark]

2024-04-08 Thread via GitHub


dbatomic commented on code in PR #45791:
URL: https://github.com/apache/spark/pull/45791#discussion_r1555419431


##
common/unsafe/src/main/java/org/apache/spark/sql/catalyst/util/CollationFactory.java:
##
@@ -176,15 +176,31 @@ public Collation(
*/
 
   public static StringSearch getStringSearch(
-  final UTF8String left,
-  final UTF8String right,
+  final UTF8String targetUTF8String,
+  final UTF8String patternUTF8String,
   final int collationId) {
-String pattern = right.toString();
-CharacterIterator target = new StringCharacterIterator(left.toString());
+
+if (collationId == UTF8_BINARY_COLLATION_ID) {
+  return getStringSearch(targetUTF8String, patternUTF8String);
+} else if (collationId == UTF8_BINARY_LCASE_COLLATION_ID) {
+  return getStringSearch(targetUTF8String.toLowerCase(), 
patternUTF8String.toLowerCase());
+}
+
+String pattern = patternUTF8String.toString();

Review Comment:
   The general principle is that we should minimize heap allocations in the string comparison path.
   In this snippet we are doing:
   1) 2 allocations for UTF8_BINARY (should be 0).
   2) For UTF8_BINARY_LCASE it gets kind of crazy: if the string has characters in the non-ASCII range, we first push it to String, then back to UTF8String, then back to String :)
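   
   (An annotated sketch of the chain described in point 2, using the names from the diff; the non-ASCII fallback inside `toLowerCase` is as the comment implies, not verified here.)
   
   ```scala
   import java.text.StringCharacterIterator
   import org.apache.spark.unsafe.types.UTF8String
   
   val targetUTF8String = UTF8String.fromString("Ünïcode data") // non-ASCII example input
   val lowered = targetUTF8String.toLowerCase  // UTF8String -> String -> UTF8String (per the comment)
   val target  = lowered.toString              // back to java.lang.String for the StringSearch
   val iter    = new StringCharacterIterator(target) // plus the CharacterIterator allocation
   ```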






Re: [PR] [SPARK-47567][SQL] Support LOCATE function to work with collated strings [spark]

2024-04-08 Thread via GitHub


dbatomic commented on code in PR #45791:
URL: https://github.com/apache/spark/pull/45791#discussion_r1555425477


##
common/unsafe/src/main/java/org/apache/spark/sql/catalyst/util/CollationFactory.java:
##
@@ -176,15 +176,31 @@ public Collation(
*/
 
   public static StringSearch getStringSearch(
-  final UTF8String left,
-  final UTF8String right,
+  final UTF8String targetUTF8String,
+  final UTF8String patternUTF8String,
   final int collationId) {
-String pattern = right.toString();
-CharacterIterator target = new StringCharacterIterator(left.toString());
+
+if (collationId == UTF8_BINARY_COLLATION_ID) {
+  return getStringSearch(targetUTF8String, patternUTF8String);
+} else if (collationId == UTF8_BINARY_LCASE_COLLATION_ID) {
+  return getStringSearch(targetUTF8String.toLowerCase(), 
patternUTF8String.toLowerCase());
+}
+
+String pattern = patternUTF8String.toString();

Review Comment:
   It should be fine to do all the routing at compile time (e.g. in the analysis/codegen phase). But in the execution phase we should be aware of the perf consequences.






[PR] [SPARK-47591][SQL] Hive-thriftserver: Migrate logInfo with variables to structured logging framework [spark]

2024-04-08 Thread via GitHub


itholic opened a new pull request, #45926:
URL: https://github.com/apache/spark/pull/45926

   ### What changes were proposed in this pull request?
   
   This PR proposes to migrate `logInfo` calls with variables in the Hive-thriftserver module to the structured logging framework.
   
   
   ### Why are the changes needed?
   
   To improve the existing logging system by migrating into structured logging.
   
   
   ### Does this PR introduce _any_ user-facing change?
   
   No API changes, but the Hive Thrift Server logs will contain MDC (Mapped Diagnostic Context) from now on.
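   
   (A minimal before/after sketch of the migration pattern this PR applies; the message text and the `STATEMENT` key are illustrative rather than copied from the actual diff. Assumes a class extending `Logging` with the same MDC/log-key imports used in the sibling migration PRs.)
   
   ```scala
   // Before: plain string interpolation.
   logInfo(s"Received query with operation id $operationId: $statement")
   
   // After: structured logging, with values tagged by MDC keys.
   logInfo(log"Received query with operation id ${MDC(OPERATION_ID, operationId)}: " +
     log"${MDC(STATEMENT, statement)}")
   ```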
   
   
   ### How was this patch tested?
   
   Run Scala auto formatting and style check. Also the existing CI should pass.
   
   ### Was this patch authored or co-authored using generative AI tooling?
   
   No.
   
   





[PR] [SPARK-47587][SQL] Hive module: Migrate logWarn with variables to structured logging framework [spark]

2024-04-08 Thread via GitHub


panbingkun opened a new pull request, #45927:
URL: https://github.com/apache/spark/pull/45927

   ### What changes were proposed in this pull request?
   
   
   
   ### Why are the changes needed?
   
   
   
   ### Does this PR introduce _any_ user-facing change?
   No.
   
   
   ### How was this patch tested?
   Pass GA.
   
   
   ### Was this patch authored or co-authored using generative AI tooling?
   No.
   





Re: [PR] [SPARK-47746] Implement ordinal-based range encoding in the RocksDBStateEncoder [spark]

2024-04-08 Thread via GitHub


HeartSaVioR commented on PR #45905:
URL: https://github.com/apache/spark/pull/45905#issuecomment-2042183035

   Thanks! Merging to master.





Re: [PR] [SPARK-47567][SQL] Support LOCATE function to work with collated strings [spark]

2024-04-08 Thread via GitHub


cloud-fan commented on code in PR #45791:
URL: https://github.com/apache/spark/pull/45791#discussion_r1555443148


##
common/unsafe/src/main/java/org/apache/spark/sql/catalyst/util/CollationFactory.java:
##
@@ -176,15 +176,31 @@ public Collation(
*/
 
   public static StringSearch getStringSearch(
-  final UTF8String left,
-  final UTF8String right,
+  final UTF8String targetUTF8String,
+  final UTF8String patternUTF8String,
   final int collationId) {
-String pattern = right.toString();
-CharacterIterator target = new StringCharacterIterator(left.toString());
+
+if (collationId == UTF8_BINARY_COLLATION_ID) {
+  return getStringSearch(targetUTF8String, patternUTF8String);

Review Comment:
   Then can we just add an assert? This is not public and only used by Spark 
internally.






Re: [PR] [SPARK-47590][SQL] Hive-thriftserver: Migrate logWarn with variables to structured logging framework [spark]

2024-04-08 Thread via GitHub


panbingkun commented on code in PR #45923:
URL: https://github.com/apache/spark/pull/45923#discussion_r1555443549


##
sql/hive-thriftserver/src/main/scala/org/apache/spark/sql/hive/thriftserver/ui/HiveThriftServer2Listener.scala:
##
@@ -218,7 +232,9 @@ private[thriftserver] class HiveThriftServer2Listener(
 executionData.state = ExecutionState.CLOSED
 updateStoreWithTriggerEnabled(executionData)
 executionList.remove(e.id)
-  case None => logWarning(s"onOperationClosed called with unknown 
operation id: ${e.id}")
+  case None => logWarning(
+log"onOperationClosed called with unknown operation id: 
${MDC(OPERATION_ID, e.id)}"

Review Comment:
   Okay






Re: [PR] [SPARK-47411][SQL] Support INSTR & FIND_IN_SET functions to work with collated strings [spark]

2024-04-08 Thread via GitHub


cloud-fan commented on code in PR #45643:
URL: https://github.com/apache/spark/pull/45643#discussion_r1555443962


##
common/unsafe/src/main/java/org/apache/spark/unsafe/types/UTF8String.java:
##
@@ -549,6 +549,51 @@ public int findInSet(UTF8String match) {
 return 0;
   }
 
+  public int findInSet(UTF8String match, int collationId) {
+if (CollationFactory.fetchCollation(collationId).supportsBinaryEquality) {
+  return this.findInSet(match);
+}
+if (collationId == CollationFactory.UTF8_BINARY_LCASE_COLLATION_ID) {

Review Comment:
   "lowercase collator" SGTM






Re: [PR] [SPARK-47746] Implement ordinal-based range encoding in the RocksDBStateEncoder [spark]

2024-04-08 Thread via GitHub


HeartSaVioR closed pull request #45905: [SPARK-47746] Implement ordinal-based 
range encoding in the RocksDBStateEncoder
URL: https://github.com/apache/spark/pull/45905





Re: [PR] [SPARK-47567][SQL] Support LOCATE function to work with collated strings [spark]

2024-04-08 Thread via GitHub


miland-db commented on code in PR #45791:
URL: https://github.com/apache/spark/pull/45791#discussion_r1555445740


##
common/unsafe/src/main/java/org/apache/spark/sql/catalyst/util/CollationFactory.java:
##
@@ -176,15 +176,31 @@ public Collation(
*/
 
   public static StringSearch getStringSearch(
-  final UTF8String left,
-  final UTF8String right,
+  final UTF8String targetUTF8String,
+  final UTF8String patternUTF8String,
   final int collationId) {
-String pattern = right.toString();
-CharacterIterator target = new StringCharacterIterator(left.toString());
+
+if (collationId == UTF8_BINARY_COLLATION_ID) {
+  return getStringSearch(targetUTF8String, patternUTF8String);
+} else if (collationId == UTF8_BINARY_LCASE_COLLATION_ID) {
+  return getStringSearch(targetUTF8String.toLowerCase(), 
patternUTF8String.toLowerCase());
+}
+
+String pattern = patternUTF8String.toString();

Review Comment:
   1. _2 allocations for UTF8_BINARY (should be 0)_ - this should never be called for UTF8_BINARY.
   2. For this one, I understand the consequences, but it's very similar to what we had to do in `UTF8String` to be able to work with the UTF8_BINARY_LCASE collation. This makes the code a lot cleaner than it was before, when we had separate methods with _almost identical_ code for UTF8_BINARY_LCASE and other collations.
   
   If this is not good/performant enough, we should think of some other way to solve it, because more and more PRs are coming with this change.
   






[PR] [MINOR][TESTS] Make `check_invalid_args` reusable in parity test [spark]

2024-04-08 Thread via GitHub


zhengruifeng opened a new pull request, #45928:
URL: https://github.com/apache/spark/pull/45928

   ### What changes were proposed in this pull request?
   Make `check_invalid_args` reusable in parity test
   
   
   ### Why are the changes needed?
   test coverage
   
   
   ### Does this PR introduce _any_ user-facing change?
   no, test only
   
   
   ### How was this patch tested?
   ci
   
   ### Was this patch authored or co-authored using generative AI tooling?
   no
   





Re: [PR] [MINOR][TESTS] Make `check_invalid_args` reusable in parity test [spark]

2024-04-08 Thread via GitHub


zhengruifeng commented on code in PR #45928:
URL: https://github.com/apache/spark/pull/45928#discussion_r1555451339


##
python/pyspark/sql/tests/pandas/test_pandas_udf_window.py:
##
@@ -292,7 +292,7 @@ def check_invalid_args(self):
 
 with self.assertRaisesRegex(AnalysisException, ".*not supported within 
a window function"):
 foo_udf = pandas_udf(lambda x: x, "v double", 
PandasUDFType.GROUPED_MAP)
-df.withColumn("v2", foo_udf(df["v"]).over(w))
+df.withColumn("v2", foo_udf(df["v"]).over(w)).schema

Review Comment:
   `withColumn` is lazy in Spark Connect; accessing `schema` triggers resolution so the expected error is raised.






Re: [PR] [SPARK-45959][SQL] Improving performance when addition of 1 column at a time causes increase in the LogicalPlan tree depth [spark]

2024-04-08 Thread via GitHub


cloud-fan commented on PR #43854:
URL: https://github.com/apache/spark/pull/43854#issuecomment-2042194892

   Oh, the idea to make cache lookup smarter looks promising. Shall we create 
an individual PR for it? It's useful for SQL queries as well, as we can hit the 
cache if one SELECT query only has a few more columns than another cached 
SELECT query.





Re: [PR] [SPARK-45959][SQL] Improving performance when addition of 1 column at a time causes increase in the LogicalPlan tree depth [spark]

2024-04-08 Thread via GitHub


ahshahid commented on PR #43854:
URL: https://github.com/apache/spark/pull/43854#issuecomment-2042199910

   Sure, but I cannot split the cache lookup code from this PR, as otherwise tests
   will fail left and right.
   The robustness of the cache lookup is just a side effect of this fix.
   Regards
   Asif
   
   On Mon, Apr 8, 2024, 1:45 AM Wenchen Fan ***@***.***> wrote:
   
   > Oh, the idea to make cache lookup smarter looks promising. Shall we create
   > an individual PR for it? It's useful for SQL queries as well, as we can hit
   > the cache if one SELECT query only has a few more columns than another
   > cached SELECT query.
   >
   > —
   > Reply to this email directly, view it on GitHub
   > , or
   > unsubscribe
   > 

   > .
   > You are receiving this because you authored the thread.Message ID:
   > ***@***.***>
   >
   





Re: [PR] [SPARK-45959][SQL] Improving performance when addition of 1 column at a time causes increase in the LogicalPlan tree depth [spark]

2024-04-08 Thread via GitHub


ahshahid commented on PR #43854:
URL: https://github.com/apache/spark/pull/43854#issuecomment-2042209960

   Oh I see, you mean the cache lookup PR goes in first, and then this PR goes in.
   
   On Mon, Apr 8, 2024, 1:47 AM Asif Shahid ***@***.***> wrote:
   
   > Sure..but I cannot split cache lookup code from this PR as otherwise tests
   > will fail right left.
   > The robustness of cache lookup is just a side effect of this fix.
   > Regards
   > Asif
   >
   > On Mon, Apr 8, 2024, 1:45 AM Wenchen Fan ***@***.***> wrote:
   >
   >> Oh, the idea to make cache lookup smarter looks promising. Shall we
   >> create an individual PR for it? It's useful for SQL queries as well, as we
   >> can hit the cache if one SELECT query only has a few more columns than
   >> another cached SELECT query.
   >>
   >> —
   >> Reply to this email directly, view it on GitHub
   >> , or
   >> unsubscribe
   >> 

   >> .
   >> You are receiving this because you authored the thread.Message ID:
   >> ***@***.***>
   >>
   >
   





Re: [PR] [SPARK-47274][PYTHON][SQL] Provide more useful context for PySpark DataFrame API errors [spark]

2024-04-08 Thread via GitHub


cloud-fan commented on code in PR #45377:
URL: https://github.com/apache/spark/pull/45377#discussion_r1555465767


##
sql/core/src/main/scala/org/apache/spark/sql/Column.scala:
##
@@ -699,6 +699,13 @@ class Column(val expr: Expression) extends Logging {
*/
   def plus(other: Any): Column = this + other
 
+  def plusWithPySparkLoggingInfo(
+  other: Any, loggingInfo: java.util.Map[String, String]): Column = {
+withOrigin(Some(loggingInfo)) {
+  this + other
+}
+  }

Review Comment:
   This is too much work... Ideally, all column creation APIs call `Column.fn` at the end; if we can change the PySpark code to only call the Scala `Column.fn`, then we don't need to change the Scala side too much.
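   
   (A hedged sketch of that direction rather than an existing Spark API: a single origin-aware entry point PySpark could call, instead of per-method `{name}PySparkLoggingInfo` variants. The name `fnWithPySparkLoggingInfo` is hypothetical; `withOrigin` is used as in the diff above.)
   
   ```scala
   def fnWithPySparkLoggingInfo(
       name: String,
       loggingInfo: java.util.Map[String, String],
       inputs: Column*): Column = {
     withOrigin(Some(loggingInfo)) {
       // Delegates to the internal Column.fn helper referenced in the comment above.
       Column.fn(name, inputs: _*)
     }
   }
   ```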






Re: [PR] [SPARK-47567][SQL] Support LOCATE function to work with collated strings [spark]

2024-04-08 Thread via GitHub


miland-db commented on code in PR #45791:
URL: https://github.com/apache/spark/pull/45791#discussion_r1555471879


##
common/unsafe/src/main/java/org/apache/spark/sql/catalyst/util/CollationFactory.java:
##
@@ -176,15 +176,31 @@ public Collation(
*/
 
   public static StringSearch getStringSearch(
-  final UTF8String left,
-  final UTF8String right,
+  final UTF8String targetUTF8String,
+  final UTF8String patternUTF8String,
   final int collationId) {
-String pattern = right.toString();
-CharacterIterator target = new StringCharacterIterator(left.toString());
+
+if (collationId == UTF8_BINARY_COLLATION_ID) {
+  return getStringSearch(targetUTF8String, patternUTF8String);

Review Comment:
   I think an assert would do the job for UTF8_BINARY. That way we know not to call this method for the UTF8_BINARY collation, and there will be no unnecessary/confusing code.






Re: [PR] [SPARK-47639] Support codegen for json_tuple. [spark]

2024-04-08 Thread via GitHub


cloud-fan commented on code in PR #45765:
URL: https://github.com/apache/spark/pull/45765#discussion_r1555471908


##
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/jsonExpressions.scala:
##
@@ -501,55 +503,156 @@ case class JsonTuple(children: Seq[Expression])
   return nullRow
 }
 
+val fieldNames = if (constantFields == fieldExpressions.length) {
+  // typically the user will provide the field names as foldable 
expressions
+  // so we can use the cached copy
+  foldableFieldNames.map(_.orNull)
+} else if (constantFields == 0) {
+  // none are foldable so all field names need to be evaluated from the 
input row
+  fieldExpressions.zipWithIndex.map {
+case (expr, index) =>
+  Option(expr.eval(input)).map {
+path =>
+  JsonPathIndex(index, path.asInstanceOf[UTF8String].toString)
+  }.orNull
+  }
+} else {
+  // if there is a mix of constant and non-constant expressions
+  // prefer the cached copy when available
+  foldableFieldNames.zip(fieldExpressions).zipWithIndex.map {
+case ((null, expr), index) =>
+  Option(expr.eval(input)).map {
+path =>
+  JsonPathIndex(index, path.asInstanceOf[UTF8String].toString)
+  }.orNull
+case ((fieldName, _), _) => fieldName.orNull
+  }
+}
+
+val evaluator = new JsonTupleEvaluator(
+  json,
+  fieldNames
+.filter{jsonPathIndex => jsonPathIndex != null && jsonPathIndex.path 
!= null }
+.asJava,
+  fieldExpressions.length)
+evaluator.evaluate()
+  }
+
+  override protected def withNewChildrenInternal(newChildren: 
IndexedSeq[Expression]): JsonTuple =
+copy(children = newChildren)
+
+  override protected def doGenCode(ctx: CodegenContext, ev: ExprCode): 
ExprCode = {

Review Comment:
   Haven't looked into it yet, but is it possible to make codegen simpler and 
write most of the code in Scala?






Re: [PR] [SPARK-47411][SQL] Support INSTR & FIND_IN_SET functions to work with collated strings [spark]

2024-04-08 Thread via GitHub


dbatomic commented on code in PR #45643:
URL: https://github.com/apache/spark/pull/45643#discussion_r1555475151


##
common/unsafe/src/main/java/org/apache/spark/unsafe/types/UTF8String.java:
##
@@ -549,6 +549,51 @@ public int findInSet(UTF8String match) {
 return 0;
   }
 
+  public int findInSet(UTF8String match, int collationId) {
+if (CollationFactory.fetchCollation(collationId).supportsBinaryEquality) {
+  return this.findInSet(match);
+}
+if (collationId == CollationFactory.UTF8_BINARY_LCASE_COLLATION_ID) {

Review Comment:
   @cloud-fan - Let's put string expression development on hold until we provide a proper design. @uros-db, @miland-db and I will follow up on this today.






Re: [PR] [SPARK-47639] Support codegen for json_tuple. [spark]

2024-04-08 Thread via GitHub


leixm commented on code in PR #45765:
URL: https://github.com/apache/spark/pull/45765#discussion_r1555489388


##
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/jsonExpressions.scala:
##
@@ -501,55 +503,156 @@ case class JsonTuple(children: Seq[Expression])
   return nullRow
 }
 
+val fieldNames = if (constantFields == fieldExpressions.length) {
+  // typically the user will provide the field names as foldable 
expressions
+  // so we can use the cached copy
+  foldableFieldNames.map(_.orNull)
+} else if (constantFields == 0) {
+  // none are foldable so all field names need to be evaluated from the 
input row
+  fieldExpressions.zipWithIndex.map {
+case (expr, index) =>
+  Option(expr.eval(input)).map {
+path =>
+  JsonPathIndex(index, path.asInstanceOf[UTF8String].toString)
+  }.orNull
+  }
+} else {
+  // if there is a mix of constant and non-constant expressions
+  // prefer the cached copy when available
+  foldableFieldNames.zip(fieldExpressions).zipWithIndex.map {
+case ((null, expr), index) =>
+  Option(expr.eval(input)).map {
+path =>
+  JsonPathIndex(index, path.asInstanceOf[UTF8String].toString)
+  }.orNull
+case ((fieldName, _), _) => fieldName.orNull
+  }
+}
+
+val evaluator = new JsonTupleEvaluator(
+  json,
+  fieldNames
+.filter{jsonPathIndex => jsonPathIndex != null && jsonPathIndex.path 
!= null }
+.asJava,
+  fieldExpressions.length)
+evaluator.evaluate()
+  }
+
+  override protected def withNewChildrenInternal(newChildren: 
IndexedSeq[Expression]): JsonTuple =
+copy(children = newChildren)
+
+  override protected def doGenCode(ctx: CodegenContext, ev: ExprCode): 
ExprCode = {

Review Comment:
   We have to consider evaluating the foldable expressions in advance, which is the reason doGenCode is bloated. I have tried to simplify the codegen code as much as possible. Do you have any good suggestions?






Re: [PR] [SPARK-47692][SQL] Addition of priority flag to StringType [spark]

2024-04-08 Thread via GitHub


mihailom-db commented on code in PR #45819:
URL: https://github.com/apache/spark/pull/45819#discussion_r142892


##
sql/core/src/test/scala/org/apache/spark/sql/CollationSuite.scala:
##
@@ -645,6 +646,34 @@ class CollationSuite extends DatasourceV2SQLBase with 
AdaptiveSparkPlanHelper {
 },
 errorClass = "COLLATION_MISMATCH.IMPLICIT"
   )
+
+  // check if substring passes through implicit collation
+  checkError(
+exception = intercept[AnalysisException] {
+  sql(s"SELECT substr('a' COLLATE UNICODE, 0, 1) == substr('b' COLLATE 
UNICODE_CI, 0, 1)")
+},
+errorClass = "COLLATION_MISMATCH.IMPLICIT"
+  )
+
+  checkAnswer(spark.sql("SELECT collation(:var1 || :var2)",
+Map(
+  "var1" -> Literal.create("a", StringType(1)),
+  "var2" -> Literal.create("b", StringType(2))
+  )
+  ),
+Seq(Row("UTF8_BINARY"))
+  )
+
+  withSQLConf(SqlApiConf.DEFAULT_COLLATION -> "UNICODE") {
+checkAnswer(spark.sql("SELECT collation(:var1 || :var2)",
+  Map(
+"var1" -> Literal.create("a", StringType(1)),
+"var2" -> Literal.create("b", StringType(2))
+  )
+),
+  Seq(Row("UNICODE"))
+)
+  }

Review Comment:
   @srielau Is this the expected behaviour? Apparently we can pass different collations to parameters. I understood that the behaviour should be: if a StringType has default priority, then it has to have the session-level default collation, as otherwise we might have different collations with the same default priority, which is not covered by the design.






Re: [PR] [SPARK-47761][SQL] Oracle: Support reading AnsiIntervalTypes [spark]

2024-04-08 Thread via GitHub


yaooqinn commented on PR #45925:
URL: https://github.com/apache/spark/pull/45925#issuecomment-2042341631

   cc @cloud-fan @dongjoon-hyun, thank you .





Re: [PR] Add support for AbstractArrayType(StringTypeCollated) [spark]

2024-04-08 Thread via GitHub


mihailom-db commented on PR #45891:
URL: https://github.com/apache/spark/pull/45891#issuecomment-2042405846

   @cloud-fan Do you think adding AbstractArrayType could be a good idea in general, and not only for StringTypeCollated? Otherwise this PR should be ready for review.





Re: [PR] [SPARK-47692][SQL] Addition of priority flag to StringType [spark]

2024-04-08 Thread via GitHub


mihailom-db commented on code in PR #45819:
URL: https://github.com/apache/spark/pull/45819#discussion_r1555608196


##
sql/core/src/test/scala/org/apache/spark/sql/CollationSuite.scala:
##
@@ -645,6 +646,34 @@ class CollationSuite extends DatasourceV2SQLBase with 
AdaptiveSparkPlanHelper {
 },
 errorClass = "COLLATION_MISMATCH.IMPLICIT"
   )
+
+  // check if substring passes through implicit collation
+  checkError(
+exception = intercept[AnalysisException] {
+  sql(s"SELECT substr('a' COLLATE UNICODE, 0, 1) == substr('b' COLLATE 
UNICODE_CI, 0, 1)")
+},
+errorClass = "COLLATION_MISMATCH.IMPLICIT"
+  )
+
+  checkAnswer(spark.sql("SELECT collation(:var1 || :var2)",
+Map(
+  "var1" -> Literal.create("a", StringType(1)),
+  "var2" -> Literal.create("b", StringType(2))
+  )
+  ),
+Seq(Row("UTF8_BINARY"))
+  )
+
+  withSQLConf(SqlApiConf.DEFAULT_COLLATION -> "UNICODE") {
+checkAnswer(spark.sql("SELECT collation(:var1 || :var2)",
+  Map(
+"var1" -> Literal.create("a", StringType(1)),
+"var2" -> Literal.create("b", StringType(2))
+  )
+),
+  Seq(Row("UNICODE"))
+)
+  }

Review Comment:
   I ask because the implementation uses Parameter, which binds an Expression to 
a name/position depending on the query type. One allowed expression is 
CreateArray, which produces an implicitly collated elementType; should we cast it 
to the default if the value comes from a parameter? The problem is that if I only 
cast the flag to DefaultST, I might get two arrays with different collationIds but 
the same DefaultST priority. The current implementation casts any StringType or 
ArrayType(StringType) to the session-level default StringType / ArrayType with 
that session-level default StringType.






Re: [PR] [SPARK-47692][SQL] Addition of priority flag to StringType [spark]

2024-04-08 Thread via GitHub


mihailom-db commented on PR #45819:
URL: https://github.com/apache/spark/pull/45819#issuecomment-2042432922

   @dbatomic @stefankandic @uros-db @stevomitric @nikolamand-db Could you take 
a look at this PR? It introduces a new priority flag for collations and makes sure 
that most functions which return StringType without any StringType inputs return 
the default collation.





Re: [PR] [SPARK-47725][INFRA] Set up the CI for pyspark-connect package [spark]

2024-04-08 Thread via GitHub


HyukjinKwon commented on PR #45870:
URL: https://github.com/apache/spark/pull/45870#issuecomment-2042534288

   Test results: https://github.com/HyukjinKwon/spark/actions/runs/8598881063
   
   Should be ready to go, cc @zhengruifeng @dongjoon-hyun @ueshin 





[PR] [DRAFT][SPARK-47692][SQL] Hash aggregate support for strings with collation [spark]

2024-04-08 Thread via GitHub


uros-db opened a new pull request, #45929:
URL: https://github.com/apache/spark/pull/45929

   
   
   ### What changes were proposed in this pull request?
   
   
   
   ### Why are the changes needed?
   
   
   
   ### Does this PR introduce _any_ user-facing change?
   
   
   
   ### How was this patch tested?
   
   
   
   ### Was this patch authored or co-authored using generative AI tooling?
   
   





Re: [PR] [SPARK-47692][SQL] Addition of priority flag to StringType [spark]

2024-04-08 Thread via GitHub


nikolamand-db commented on PR #45819:
URL: https://github.com/apache/spark/pull/45819#issuecomment-2042559655

   How do we want to handle cases like string 
[split](https://github.com/apache/spark/pull/45856/files#diff-8ce4728fc96a1a5f2a3a846470f03175ba00b61ff6bc51933883241e736d886aR548)
 (work in progress, used as an illustration) in general? The regex parameter 
accepts only `StringType`, and if I understand correctly, the plan would fail when 
we explicitly set the default collation to something other than `UTF8_BINARY` and 
run a query like `select split('abc', 'b')`. However, in this case the regex 
parameter's collation has no semantic meaning, and I don't think such queries 
should fail.





Re: [PR] [DRAFT][SPARK-47692][SQL] Hash aggregate support for strings with collation [spark]

2024-04-08 Thread via GitHub


cloud-fan commented on code in PR #45929:
URL: https://github.com/apache/spark/pull/45929#discussion_r1555757796


##
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/ExprUtils.scala:
##
@@ -168,7 +168,13 @@ object ExprUtils extends QueryErrorsBase {
 a.failAnalysis(
   errorClass = "MISSING_GROUP_BY",
   messageParameters = Map.empty)
-  case e: Attribute if !a.groupingExpressions.exists(_.semanticEquals(e)) 
=>
+  case e: Attribute if !a.groupingExpressions.exists(_.semanticEquals(e)) 
&&
+!a.groupingExpressions.exists {

Review Comment:
   why do we need this change?






Re: [PR] [DRAFT][SPARK-47692][SQL] Hash aggregate support for strings with collation [spark]

2024-04-08 Thread via GitHub


uros-db commented on code in PR #45929:
URL: https://github.com/apache/spark/pull/45929#discussion_r1555767969


##
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/ExprUtils.scala:
##
@@ -168,7 +168,13 @@ object ExprUtils extends QueryErrorsBase {
 a.failAnalysis(
   errorClass = "MISSING_GROUP_BY",
   messageParameters = Map.empty)
-  case e: Attribute if !a.groupingExpressions.exists(_.semanticEquals(e)) 
=>
+  case e: Attribute if !a.groupingExpressions.exists(_.semanticEquals(e)) 
&&
+!a.groupingExpressions.exists {

Review Comment:
   without it, it hits the 
`QueryCompilationErrors.columnNotInGroupByClauseError(e)`
   
   with query plan:
   ```
   Aggregate [collationkey(name#17)], [any_value(name#17, false) AS name#24, 
count(1) AS count(1)#20L]
   +- LogicalRDD [name#17], false
   ```
   
   so name#17 is not found in groupingExpressions






Re: [PR] [DRAFT][SPARK-47692][SQL] Hash aggregate support for strings with collation [spark]

2024-04-08 Thread via GitHub


uros-db commented on code in PR #45929:
URL: https://github.com/apache/spark/pull/45929#discussion_r1555767969


##
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/ExprUtils.scala:
##
@@ -168,7 +168,13 @@ object ExprUtils extends QueryErrorsBase {
 a.failAnalysis(
   errorClass = "MISSING_GROUP_BY",
   messageParameters = Map.empty)
-  case e: Attribute if !a.groupingExpressions.exists(_.semanticEquals(e)) 
=>
+  case e: Attribute if !a.groupingExpressions.exists(_.semanticEquals(e)) 
&&
+!a.groupingExpressions.exists {

Review Comment:
   without it, it hits the 
`QueryCompilationErrors.columnNotInGroupByClauseError(e)`
   with error: Job aborted due to stage failure: Task 0 in stage 2.0 failed 1 
times, most recent failure: Lost task 0.0 in stage 2.0 (TID 2) (192.168.15.30 
executor driver): org.apache.spark.SparkException: [INTERNAL_ERROR] Couldn't 
find name#17 in [collationkey(name#17)#30,count(1)#19L] SQLSTATE: XX000
   
   with query plan:
   ```
   Aggregate [collationkey(name#17)], [any_value(name#17, false) AS name#24, 
count(1) AS count(1)#20L]
   +- LogicalRDD [name#17], false
   ```
   
   so name#17 is not found in groupingExpressions






Re: [PR] [SPARK-47692][SQL] Addition of priority flag to StringType [spark]

2024-04-08 Thread via GitHub


mihailom-db commented on PR #45819:
URL: https://github.com/apache/spark/pull/45819#issuecomment-2042659387

   If the collation is not important to the function execution, we can create a 
separate pattern for it that excludes the parameters which do not need casting. 
For example, substring is handled that way: we only cast DefaultST or ExplicitST 
to ImplicitST, as we need the return type to have implicit collation.





[PR] [SPARK-47764][CORE][SQL] Cleanup shuffle dependencies based on ShuffleCleanupMode [spark]

2024-04-08 Thread via GitHub


bozhang2820 opened a new pull request, #45930:
URL: https://github.com/apache/spark/pull/45930

   ### What changes were proposed in this pull request?
   This change adds a new trait, `ShuffleCleanupMode` under `QueryExecution`, 
and two new configs, `spark.sql.shuffleDependency.skipMigration.enabled` and 
`spark.sql.shuffleDependency.fileCleanup.enabled`.
   
   For Spark Connect query executions, `ShuffleCleanupMode` is controlled by 
the two new configs, and shuffle dependency cleanup is performed accordingly. 
   
   When `spark.sql.shuffleDependency.fileCleanup.enabled` is `true`, shuffle 
dependency files will be cleaned up at the end of query executions. 
   
   When `spark.sql.shuffleDependency.skipMigration.enabled` is `true`, shuffle 
dependencies will be skipped during shuffle data migration for node 
decommission.
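
   A minimal usage sketch in Scala (the config keys come from this description; 
whether they can be toggled per session or only at startup is an assumption to verify):

   ```scala
   // Hedged sketch: opt in to both cleanup behaviors on a SparkSession `spark`.
   spark.conf.set("spark.sql.shuffleDependency.fileCleanup.enabled", "true")
   spark.conf.set("spark.sql.shuffleDependency.skipMigration.enabled", "true")
   ```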
   
   ### Why are the changes needed?
   This is to: 1. speed up shuffle data migration at decommissions and 2. 
possibly (when file cleanup mode is enabled) release disk space occupied by 
unused shuffle files.
   
   ### Does this PR introduce _any_ user-facing change?
   Yes. This change adds two new configs, 
`spark.sql.shuffleDependency.skipMigration.enabled` and 
`spark.sql.shuffleDependency.fileCleanup.enabled` to control the cleanup 
behaviors.
   
   ### How was this patch tested?
   Existing tests.
   
   
   ### Was this patch authored or co-authored using generative AI tooling?
   No





Re: [PR] [SPARK-47297][SQL] Add collations support to split regex expression [spark]

2024-04-08 Thread via GitHub


mihailom-db commented on code in PR #45856:
URL: https://github.com/apache/spark/pull/45856#discussion_r1555843730


##
common/unsafe/src/main/java/org/apache/spark/unsafe/types/UTF8String.java:
##
@@ -1078,7 +1078,22 @@ public UTF8String[] split(UTF8String pattern, int limit) 
{
   }
   return result;
 }
-return split(pattern.toString(), limit);
+return split(pattern.toString(), limit, regexFlags);
+  }
+
+  public UTF8String[] split(UTF8String pattern, int limit) {
+return split(pattern, limit, 0); // Pattern without regex flags
+  }
+
+  public UTF8String[] splitCollationAware(UTF8String pattern, int limit, int 
collationId) {
+if (CollationFactory.fetchCollation(collationId).supportsBinaryEquality) {
+  return split(pattern, limit);
+}
+if (collationId == CollationFactory.UTF8_BINARY_LCASE_COLLATION_ID) {
+  return split(pattern, limit, Pattern.UNICODE_CASE | 
Pattern.CASE_INSENSITIVE);
+}
+throw new UnsupportedOperationException("Unsupported collation " +
+  CollationFactory.fetchCollation(collationId).collationName);

Review Comment:
   +1, this should fail in inputTypeCheck of the related expression. I am not 
sure if we need this double guard.






Re: [PR] [SPARK-47504][SQL] Resolve AbstractDataType simpleStrings for StringTypeCollated [spark]

2024-04-08 Thread via GitHub


mihailom-db commented on PR #45694:
URL: https://github.com/apache/spark/pull/45694#issuecomment-2042815237

   @cloud-fan Could we merge this in, so that some new PR does not conflict with it 
again?





Re: [PR] [SPARK-47739][SQL] Register logical avro type [spark]

2024-04-08 Thread via GitHub


milastdbx commented on PR #45895:
URL: https://github.com/apache/spark/pull/45895#issuecomment-2042987166

   @dongjoon-hyun thanks for the review. You are absolutely right. Your tests 
are passing because you are running them in batch, and this regression only 
really happens if you don't initialize `LogicalTypes` prior to resolving the 
schema; hence I would assume that some test registered it, and as a side effect 
the tests pass.
   
   Given that `LogicalTypes` is an external static class that is part of `Apache Avro`, 
I am not sure how to resolve this reliably.
   One idea is to separate this into a different Suite, but that does not give 
complete protection either, as multiple suites can run within the same process, 
hitting the same issue as if the test were part of the suite.
   
   Do you have any suggestion for how to test this static object initialization when 
we do not have access to a `teardown` API for these classes?
   
   





Re: [PR] [SPARK-47318][CORE] Adds HKDF round to AuthEngine key derivation to follow standard KEX practices [spark]

2024-04-08 Thread via GitHub


sweisdb commented on code in PR #45425:
URL: https://github.com/apache/spark/pull/45425#discussion_r1556042923


##
docs/security.md:
##
@@ -149,30 +149,44 @@ secret file agrees with the executors' secret file.
 
 # Network Encryption
 
-Spark supports two mutually exclusive forms of encryption for RPC connections.
+Spark supports two mutually exclusive forms of encryption for RPC connections:
 
-The first is an AES-based encryption which relies on a shared secret, and thus 
requires
-RPC authentication to also be enabled.
+The **preferred method** uses TLS (aka SSL) encryption via Netty's support for 
SSL. Enabling SSL
+requires keys and certificates to be properly configured. SSL is standardized 
and considered more
+secure.
 
-The second is an SSL based encryption mechanism utilizing Netty's support for 
SSL. This requires
-keys and certificates to be properly configured. It can be used with or 
without the authentication
-mechanism discussed earlier.
-
-One may prefer to use the SSL based encryption in scenarios where compliance 
mandates the usage
-of specific protocols; or to leverage the security of a more standard 
encryption library. However,
-the AES based encryption is simpler to configure and may be preferred if the 
only requirement
-is that data be encrypted in transit.
+The legacy method is an AES-based encryption mechanism relying on a shared 
secret. This requires
+RPC authentication to also be enabled. This method uses a bespoke protocol and 
it is recommended
+to use SSL instead.

Review Comment:
   I will take this out. I think it was rationalizing why there is an ad hoc 
protocol here in the first place. I agree that we should make it clear 
that TLS is strongly preferred and make it easy to migrate.
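
   For context, a minimal sketch of the legacy AES-based setup discussed in the 
quoted text (illustrative only; `spark.authenticate` and `spark.network.crypto.enabled` 
are the standard keys, but verify them for your Spark version):

   ```scala
   // The legacy AES-based RPC encryption relies on a shared secret, so RPC
   // authentication must be enabled as well.
   import org.apache.spark.SparkConf

   val conf = new SparkConf()
     .set("spark.authenticate", "true")            // shared-secret RPC authentication
     .set("spark.network.crypto.enabled", "true")  // AES-based in-transit encryption
   ```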






Re: [PR] [SPARK-47681][SQL] Add schema_of_variant expression. [spark]

2024-04-08 Thread via GitHub


cloud-fan commented on PR #45806:
URL: https://github.com/apache/spark/pull/45806#issuecomment-2043133548

   thanks, merging to master!





Re: [PR] [SPARK-47681][SQL] Add schema_of_variant expression. [spark]

2024-04-08 Thread via GitHub


cloud-fan closed pull request #45806: [SPARK-47681][SQL] Add schema_of_variant 
expression.
URL: https://github.com/apache/spark/pull/45806





Re: [PR] [SPARK-47504][SQL] Resolve AbstractDataType simpleStrings for StringTypeCollated [spark]

2024-04-08 Thread via GitHub


cloud-fan commented on PR #45694:
URL: https://github.com/apache/spark/pull/45694#issuecomment-2043136877

   thanks, merging to master!





Re: [PR] [SPARK-47504][SQL] Resolve AbstractDataType simpleStrings for StringTypeCollated [spark]

2024-04-08 Thread via GitHub


cloud-fan closed pull request #45694: [SPARK-47504][SQL] Resolve 
AbstractDataType simpleStrings for StringTypeCollated
URL: https://github.com/apache/spark/pull/45694





Re: [PR] [SPARK-47318][CORE] Adds HKDF round to AuthEngine key derivation to follow standard KEX practices [spark]

2024-04-08 Thread via GitHub


sweisdb commented on code in PR #45425:
URL: https://github.com/apache/spark/pull/45425#discussion_r1556042923


##
docs/security.md:
##
@@ -149,30 +149,44 @@ secret file agrees with the executors' secret file.
 
 # Network Encryption
 
-Spark supports two mutually exclusive forms of encryption for RPC connections.
+Spark supports two mutually exclusive forms of encryption for RPC connections:
 
-The first is an AES-based encryption which relies on a shared secret, and thus 
requires
-RPC authentication to also be enabled.
+The **preferred method** uses TLS (aka SSL) encryption via Netty's support for 
SSL. Enabling SSL
+requires keys and certificates to be properly configured. SSL is standardized 
and considered more
+secure.
 
-The second is an SSL based encryption mechanism utilizing Netty's support for 
SSL. This requires
-keys and certificates to be properly configured. It can be used with or 
without the authentication
-mechanism discussed earlier.
-
-One may prefer to use the SSL based encryption in scenarios where compliance 
mandates the usage
-of specific protocols; or to leverage the security of a more standard 
encryption library. However,
-the AES based encryption is simpler to configure and may be preferred if the 
only requirement
-is that data be encrypted in transit.
+The legacy method is an AES-based encryption mechanism relying on a shared 
secret. This requires
+RPC authentication to also be enabled. This method uses a bespoke protocol and 
it is recommended
+to use SSL instead.

Review Comment:
   (Edit: I misread your comment as asking to remove text, not questioning why 
I removed text.) 
   
   I think the original text was rationalizing why there is an ad hoc protocol 
here in the first place. We should make it clear that TLS is strongly 
preferred and make it easy to migrate.
   






Re: [PR] [SPARK-47318][CORE] Adds HKDF round to AuthEngine key derivation to follow standard KEX practices [spark]

2024-04-08 Thread via GitHub


sweisdb commented on code in PR #45425:
URL: https://github.com/apache/spark/pull/45425#discussion_r1556106453


##
docs/security.md:
##
@@ -149,30 +149,44 @@ secret file agrees with the executors' secret file.
 
 # Network Encryption
 
-Spark supports two mutually exclusive forms of encryption for RPC connections.
+Spark supports two mutually exclusive forms of encryption for RPC connections:
 
-The first is an AES-based encryption which relies on a shared secret, and thus 
requires
-RPC authentication to also be enabled.
+The **preferred method** uses TLS (aka SSL) encryption via Netty's support for 
SSL. Enabling SSL
+requires keys and certificates to be properly configured. SSL is standardized 
and considered more
+secure.
 
-The second is an SSL based encryption mechanism utilizing Netty's support for 
SSL. This requires
-keys and certificates to be properly configured. It can be used with or 
without the authentication
-mechanism discussed earlier.
-
-One may prefer to use the SSL based encryption in scenarios where compliance 
mandates the usage
-of specific protocols; or to leverage the security of a more standard 
encryption library. However,
-the AES based encryption is simpler to configure and may be preferred if the 
only requirement
-is that data be encrypted in transit.
+The legacy method is an AES-based encryption mechanism relying on a shared 
secret. This requires
+RPC authentication to also be enabled. This method uses a bespoke protocol and 
it is recommended
+to use SSL instead.

Review Comment:
   I put the original text back in. 






Re: [PR] [SPARK-47682][SQL] Support cast from variant. [spark]

2024-04-08 Thread via GitHub


chenhao-db commented on PR #45807:
URL: https://github.com/apache/spark/pull/45807#issuecomment-2043169795

   
   @cloud-fan Could you help review this PR? Thanks!
   
   





Re: [PR] [SPARK-47741] Added stack overflow handling in parser [spark]

2024-04-08 Thread via GitHub


milastdbx commented on code in PR #45896:
URL: https://github.com/apache/spark/pull/45896#discussion_r1556129163


##
common/utils/src/main/resources/error/error-classes.json:
##
@@ -1083,6 +1083,12 @@
 ],
 "sqlState" : "42702"
   },
+  "EXECUTE_IMMEDIATE_FAILED_TO_PARSE_STACK_OVERFLOW" : {

Review Comment:
   updated






Re: [PR] [SPARK-47725][INFRA] Set up the CI for pyspark-connect package [spark]

2024-04-08 Thread via GitHub


dongjoon-hyun closed pull request #45870: [SPARK-47725][INFRA] Set up the CI 
for pyspark-connect package
URL: https://github.com/apache/spark/pull/45870





Re: [PR] [SPARK-47725][INFRA] Set up the CI for pyspark-connect package [spark]

2024-04-08 Thread via GitHub


dongjoon-hyun commented on PR #45870:
URL: https://github.com/apache/spark/pull/45870#issuecomment-2043203281

   Let me merge this as a starting point.





[PR] [SPARK-47767]Show offset value in TakeOrderedAndProjectExec [spark]

2024-04-08 Thread via GitHub


guixiaowen opened a new pull request, #45931:
URL: https://github.com/apache/spark/pull/45931

   
   
   ### What changes were proposed in this pull request?
   Show the offset value in TakeOrderedAndProjectExec.
   

   
   For example:
   

   ```
   explain select * from test_limit_offset order by a limit 2 offset 1;

   == Physical Plan ==
   TakeOrderedAndProject(limit=3, orderBy=[a#171 ASC NULLS FIRST], output=[a#171])
   +- Scan hive spark_catalog.bigdata_qa.test_limit_offset [a#171], HiveTableRelation [`spark_catalog`.`test`.`test_limit_offset`, org.apache.hadoop.hive.ql.io.orc.OrcSerde, Data Cols: [a#171], Partition Cols: []]
   ```
   
   
   
   Currently the offset is not shown in the plan; displaying it makes the explain output easier to read.
   
   
   ### Why are the changes needed?
   The offset is not displayed in TakeOrderedAndProject today. Showing it makes the explain output more user-friendly.
   
   
   
   ### Does this PR introduce _any_ user-facing change?
   
   After this PR, the offset in TakeOrderedAndProjectExec will be displayed.
   
   ```
   explain select * from test_limit_offset order by a limit 2 offset 1;

   == Physical Plan ==
   TakeOrderedAndProject(limit=3, offset=1, orderBy=[a#171 ASC NULLS FIRST], output=[a#171])
   +- Scan hive spark_catalog.bigdata_qa.test_limit_offset [a#171], HiveTableRelation [`spark_catalog`.`test`.`test_limit_offset`, org.apache.hadoop.hive.ql.io.orc.OrcSerde, Data Cols: [a#171], Partition Cols: []]
   ```
   
   
   ### How was this patch tested?
   Unit tests.
   
   
   ### Was this patch authored or co-authored using generative AI tooling?
   
   





Re: [PR] [SPARK-47737][PYTHON] Bump PyArrow to 10.0.0 [spark]

2024-04-08 Thread via GitHub


dongjoon-hyun closed pull request #45892: [SPARK-47737][PYTHON] Bump PyArrow to 
10.0.0
URL: https://github.com/apache/spark/pull/45892





Re: [PR] [SPARK-47741] Added stack overflow handling in parser [spark]

2024-04-08 Thread via GitHub


srielau commented on code in PR #45896:
URL: https://github.com/apache/spark/pull/45896#discussion_r1556139176


##
docs/sql-error-conditions.md:
##
@@ -756,6 +756,12 @@ Failed to rename temp file `` to `` as 
FileSystem.rename retur
 
 Failed to convert the row value `` of the class `` to the target 
SQL type `` in the JSON format.
 
+### FAILED_TO_PARSE_STACK_OVERFLOW
+
+[SQLSTATE: 
42907](sql-error-conditions-sqlstates.html#class-42-syntax-error-or-access-rule-violation)

Review Comment:
   ```suggestion
   [SQLSTATE: 
54001](sql-error-conditions-sqlstates.html#class-54-program-limit-exceeded)
   ```



##
common/utils/src/main/resources/error/error-classes.json:
##
@@ -1083,6 +1083,12 @@
 ],
 "sqlState" : "42702"
   },
+  "FAILED_TO_PARSE_STACK_OVERFLOW" : {
+"message" : [
+  "Stack overflow was hit while parsing the SQL query. "
+],
+"sqlState" : "42907"

Review Comment:
   ```suggestion
   "sqlState" : "54001"
   ```



##
docs/sql-error-conditions.md:
##
@@ -756,6 +756,12 @@ Failed to rename temp file `` to `` as 
FileSystem.rename retur
 
 Failed to convert the row value `` of the class `` to the target 
SQL type `` in the JSON format.
 
+### FAILED_TO_PARSE_STACK_OVERFLOW

Review Comment:
   ```suggestion
   ### FAILED_TO_PARSE_STACK_TOO_COMPLEX
   ```



##
common/utils/src/main/resources/error/error-classes.json:
##
@@ -1083,6 +1083,12 @@
 ],
 "sqlState" : "42702"
   },
+  "FAILED_TO_PARSE_STACK_OVERFLOW" : {

Review Comment:
   ```suggestion
 "FAILED_TO_PARSE_TOO_COMPLEX" : {
   ```



##
common/utils/src/main/resources/error/error-classes.json:
##
@@ -1083,6 +1083,12 @@
 ],
 "sqlState" : "42702"
   },
+  "FAILED_TO_PARSE_STACK_OVERFLOW" : {
+"message" : [
+  "Stack overflow was hit while parsing the SQL query. "

Review Comment:
   ```suggestion
 "The statement, including potential SQL functions and referenced 
views,  was too complex to parse.",
 "To mitigate this error divide the statement into multiple, less 
complex chunks."
   ```



##
sql/core/src/test/scala/org/apache/spark/sql/errors/QueryParsingErrorsSuite.scala:
##
@@ -32,6 +32,15 @@ class QueryParsingErrorsSuite extends QueryTest with 
SharedSparkSession with SQL
 intercept[ParseException](sql(sqlText).collect())
   }
 
+  test("PARSE_STACK_OVERFLOW_ERROR: Stack overflow hit") {
+val query = (1 to 2).map(x => "SELECT 1 as a").mkString(" UNION ALL ")
+val e = intercept[ParseException] {
+  spark.sql(query)
+}
+
+assert(e.errorClass.get == "FAILED_TO_PARSE_STACK_OVERFLOW")

Review Comment:
   Use the checkError() method instead of assert.
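
   A sketch of what that could look like (the error class name depends on the final 
naming chosen in this PR):

   ```scala
   // Illustrative only: replace the bare assert with checkError().
   checkError(
     exception = intercept[ParseException] {
       spark.sql(query)
     },
     errorClass = "FAILED_TO_PARSE_TOO_COMPLEX")
   ```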






Re: [PR] [WIP][SPARK-47758][BUILD] Upgrade commons-collections4 to 4.5.0-M1 [spark]

2024-04-08 Thread via GitHub


dongjoon-hyun commented on code in PR #45921:
URL: https://github.com/apache/spark/pull/45921#discussion_r1556146836


##
dev/deps/spark-deps-hadoop-3-hive-2.3:
##
@@ -38,7 +38,7 @@ chill_2.13/0.10.0//chill_2.13-0.10.0.jar
 commons-cli/1.6.0//commons-cli-1.6.0.jar
 commons-codec/1.16.1//commons-codec-1.16.1.jar
 commons-collections/3.2.2//commons-collections-3.2.2.jar
-commons-collections4/4.4//commons-collections4-4.4.jar
+commons-collections4/4.5.0-M1//commons-collections4-4.5.0-M1.jar

Review Comment:
   Is `M1` stable?






Re: [PR] [WIP][SPARK-47758][BUILD] Upgrade commons-collections4 to 4.5.0-M1 [spark]

2024-04-08 Thread via GitHub


dongjoon-hyun commented on code in PR #45921:
URL: https://github.com/apache/spark/pull/45921#discussion_r1556148469


##
dev/deps/spark-deps-hadoop-3-hive-2.3:
##
@@ -38,7 +38,7 @@ chill_2.13/0.10.0//chill_2.13-0.10.0.jar
 commons-cli/1.6.0//commons-cli-1.6.0.jar
 commons-codec/1.16.1//commons-codec-1.16.1.jar
 commons-collections/3.2.2//commons-collections-3.2.2.jar
-commons-collections4/4.4//commons-collections4-4.4.jar
+commons-collections4/4.5.0-M1//commons-collections4-4.5.0-M1.jar

Review Comment:
   I guess we can skip this `milestone` release. WDYT, @panbingkun ?
   > This milestone release requires Java 8 and adds the package 
`org.apache.commons.collections4.bloomfilter` for review.






Re: [PR] [SPARK-47706][BUILD] Bump json4s 4.0.7 [spark]

2024-04-08 Thread via GitHub


dongjoon-hyun commented on code in PR #45838:
URL: https://github.com/apache/spark/pull/45838#discussion_r1556151836


##
project/MimaExcludes.scala:
##
@@ -90,7 +90,21 @@ object MimaExcludes {
 
ProblemFilters.exclude[DirectMissingMethodProblem]("org.apache.spark.sql.jdbc.MySQLDialect#MySQLSQLBuilder.this"),
 
ProblemFilters.exclude[DirectMissingMethodProblem]("org.apache.spark.sql.jdbc.MySQLDialect#MySQLSQLQueryBuilder.this"),
 
ProblemFilters.exclude[DirectMissingMethodProblem]("org.apache.spark.sql.jdbc.OracleDialect#OracleSQLBuilder.this"),
-
ProblemFilters.exclude[DirectMissingMethodProblem]("org.apache.spark.sql.jdbc.OracleDialect#OracleSQLQueryBuilder.this")
+
ProblemFilters.exclude[DirectMissingMethodProblem]("org.apache.spark.sql.jdbc.OracleDialect#OracleSQLQueryBuilder.this"),
+// SPARK-47706: Bump json4s from 3.7.0-M11 to 4.0.7
+
ProblemFilters.exclude[IncompatibleMethTypeProblem]("org.apache.spark.sql.types.DataType#JSortedObject.unapplySeq"),
+
ProblemFilters.exclude[IncompatibleResultTypeProblem]("org.apache.spark.sql.expressions.MutableAggregationBuffer.jsonValue"),
+
ProblemFilters.exclude[IncompatibleResultTypeProblem]("org.apache.spark.ml.util.DefaultParamsReader#Metadata.params"),
+
ProblemFilters.exclude[IncompatibleResultTypeProblem]("org.apache.spark.ml.util.DefaultParamsReader#Metadata.defaultParams"),
+
ProblemFilters.exclude[IncompatibleResultTypeProblem]("org.apache.spark.ml.util.DefaultParamsReader#Metadata.metadata"),
+
ProblemFilters.exclude[IncompatibleResultTypeProblem]("org.apache.spark.ml.util.DefaultParamsReader#Metadata.getParamValue"),
+
ProblemFilters.exclude[IncompatibleMethTypeProblem]("org.apache.spark.ml.util.DefaultParamsReader#Metadata.copy"),
+
ProblemFilters.exclude[IncompatibleResultTypeProblem]("org.apache.spark.ml.util.DefaultParamsReader#Metadata.copy$default$5"),
+
ProblemFilters.exclude[IncompatibleResultTypeProblem]("org.apache.spark.ml.util.DefaultParamsReader#Metadata.copy$default$6"),
+
ProblemFilters.exclude[IncompatibleResultTypeProblem]("org.apache.spark.ml.util.DefaultParamsReader#Metadata.copy$default$7"),
+
ProblemFilters.exclude[IncompatibleMethTypeProblem]("org.apache.spark.ml.util.DefaultParamsReader#Metadata.this"),
+
ProblemFilters.exclude[IncompatibleMethTypeProblem]("org.apache.spark.ml.util.DefaultParamsReader#Metadata.apply"),

Review Comment:
   Could you file a JIRA issue for this, @pan3793 ? We had better fix that 
first.






Re: [PR] [SPARK-47745] Add License to Spark Operator repository [spark-kubernetes-operator]

2024-04-08 Thread via GitHub


dongjoon-hyun commented on PR #3:
URL: 
https://github.com/apache/spark-kubernetes-operator/pull/3#issuecomment-2043248305

   Gentle ping, @viirya .





Re: [PR] Revert: [SPARK-47040][CONNECT] Allow Spark Connect Server Script to wait [spark]

2024-04-08 Thread via GitHub


dongjoon-hyun commented on PR #45852:
URL: https://github.com/apache/spark/pull/45852#issuecomment-2043258249

   Hi, @pan3793 and all. Shall we conclude this thread?
   
   Given the discussion, I believe we can close this PR and move forward to 
generalize `--wait` for all services (if someone wants to implement that for 
consistency.)





Re: [PR] [SPARK-47745] Add License to Spark Operator repository [spark-kubernetes-operator]

2024-04-08 Thread via GitHub


viirya commented on PR #3:
URL: 
https://github.com/apache/spark-kubernetes-operator/pull/3#issuecomment-2043305384

   > It seems that we need to decide the versioning plan and policy before 
starting to merge.
   
   @dongjoon-hyun Sorry for missing the first ping. Can you describe it a bit more? 
I'm not sure which versioning plan and policy you mean. For the general versioning 
plan, will it be different from the Spark versioning policy?





Re: [PR] [SPARK-47094][SQL] SPJ : Dynamically rebalance number of buckets when they are not equal [spark]

2024-04-08 Thread via GitHub


szehon-ho commented on PR #45267:
URL: https://github.com/apache/spark/pull/45267#issuecomment-2043308510

   Thanks @sunchao and all for the review!





Re: [PR] [SPARK-47745] Add License to Spark Operator repository [spark-kubernetes-operator]

2024-04-08 Thread via GitHub


dongjoon-hyun commented on PR #3:
URL: 
https://github.com/apache/spark-kubernetes-operator/pull/3#issuecomment-2043308202

   Specifically, if you are going to merge this PR, how can you resolve this PR 
and the corresponding JIRA? What is the fixed version?
   - SPARK-47745





Re: [PR] [SPARK-47754][SQL] Postgres: Support reading multidimensional arrays [spark]

2024-04-08 Thread via GitHub


dongjoon-hyun commented on code in PR #45917:
URL: https://github.com/apache/spark/pull/45917#discussion_r1556203063


##
sql/core/src/main/scala/org/apache/spark/sql/jdbc/JdbcDialects.scala:
##
@@ -798,6 +798,9 @@ abstract class JdbcDialect extends Serializable with 
Logging {
   protected final def getTimestampType(md: Metadata): DataType = {
 JdbcUtils.getTimestampType(md.getBoolean("isTimestampNTZ"))
   }
+
+  @Since("4.0.0")
+  def getArrayDimension(conn: Connection, tableName: String, columnName: 
String): Option[Int] = None

Review Comment:
   Could you add a function description to describe the meaning of return 
value? For example, in the following format
   - Some(n): What is the definition of n
   - None: What is the meaning
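
   For example, the description could read roughly as follows (the wording is a 
sketch, not the final doc):

   ```scala
   /**
    * Returns the number of dimensions of an array column, if the dialect can
    * determine it from the remote catalog.
    *
    * @return Some(n) where n is the dimension count of `columnName` in `tableName`,
    *         or None when the dimension is unknown or not supported by the dialect.
    */
   @Since("4.0.0")
   def getArrayDimension(conn: Connection, tableName: String, columnName: String): Option[Int] = None
   ```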






Re: [PR] [SPARK-46894][PYTHON] Move PySpark error conditions into standalone JSON file [spark]

2024-04-08 Thread via GitHub


nchammas commented on PR #44920:
URL: https://github.com/apache/spark/pull/44920#issuecomment-204232

   @HyukjinKwon - Anything else you'd like to see done here?





[PR] [WIP] ListStateTTL implementation [spark]

2024-04-08 Thread via GitHub


ericm-db opened a new pull request, #45932:
URL: https://github.com/apache/spark/pull/45932

   
   
   ### What changes were proposed in this pull request?
   
   
   
   ### Why are the changes needed?
   
   
   
   ### Does this PR introduce _any_ user-facing change?
   
   
   
   ### How was this patch tested?
   
   
   
   ### Was this patch authored or co-authored using generative AI tooling?
   
   





[PR] [SPARK-47417][SPARK-47418][SPARk-47420] Add collation support to multiple string functions [spark]

2024-04-08 Thread via GitHub


nikolamand-db opened a new pull request, #45933:
URL: https://github.com/apache/spark/pull/45933

   
   
   ### What changes were proposed in this pull request?
   
   
   Combining multiple passthrough string functions:
   - [SPARK-47417](https://issues.apache.org/jira/browse/SPARK-47417): `Ascii`, 
`Chr`, `Base64`, `UnBase64`; `Chr` and `Base64` are skipped as they don't 
accept input string types and don't need to be updated
   - [SPARK-47418](https://issues.apache.org/jira/browse/SPARK-47418): 
`Decode`, `StringDecode`, `Encode`, `ToBinary`
   - [SPARK-47420](https://issues.apache.org/jira/browse/SPARK-47420): 
`FormatNumber`, `Sentences`
   
   ### Why are the changes needed?
   
   
   Add collations support in string functions.
   
   ### Does this PR introduce _any_ user-facing change?
   
   
   Yes, it changes behavior of string functions when string parameters have 
collation.
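
   For illustration, the intended passthrough behaviour could be exercised like this 
(a hedged sketch; exact results depend on this PR):

   ```scala
   // Collated string inputs should be accepted and passed through these expressions.
   spark.sql("SELECT ascii('a' COLLATE UNICODE)").show()
   spark.sql("SELECT sentences('Hi there! Good morning.' COLLATE UNICODE_CI)").show(truncate = false)
   ```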
   
   ### How was this patch tested?
   
   
   Add checks to `CollationStringExpressionsSuite`.
   
   ### Was this patch authored or co-authored using generative AI tooling?
   
   
   No.





Re: [PR] [SPARK-47586][SQL] Hive module: Migrate logError with variables to structured logging framework [spark]

2024-04-08 Thread via GitHub


gengliangwang commented on code in PR #45876:
URL: https://github.com/apache/spark/pull/45876#discussion_r1556267923


##
sql/hive/src/test/scala/org/apache/spark/sql/hive/HiveExternalCatalogVersionsSuite.scala:
##
@@ -200,8 +202,9 @@ class HiveExternalCatalogVersionsSuite extends 
SparkSubmitTestUtils {
   logInfo("Skip tests because old Spark versions don't support Java 
21.")
 }
   } else {
-logError(s"Python version <  
${TestUtils.minimumPythonSupportedVersion}, " +
-  "the running environment is unavailable.")
+logError(

Review Comment:
   We don't need to migrate the test changes.






Re: [PR] [SPARK-47586][SQL] Hive module: Migrate logError with variables to structured logging framework [spark]

2024-04-08 Thread via GitHub


gengliangwang commented on PR #45876:
URL: https://github.com/apache/spark/pull/45876#issuecomment-2043432284

   @itholic we need more migrations:
   ```
   $ grep -r logError\( *|grep -v test
   src/main/scala/org/apache/spark/sql/hive/TableReader.scala:
logError(s"Exception thrown in field <${fieldRefs(i).getFieldName}>")
   src/main/scala/org/apache/spark/sql/hive/client/HiveClientImpl.scala:
  logError(
   src/main/scala/org/apache/spark/sql/hive/client/HiveClientImpl.scala:
logError(
   ```
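
   For instance, the `TableReader` case could be migrated along these lines (a sketch; 
`FIELD_NAME` is a hypothetical log key, and the structured-logging imports used 
elsewhere in this PR are assumed):

   ```scala
   // Before: logError(s"Exception thrown in field <${fieldRefs(i).getFieldName}>")
   // After, using the log"..." interpolator with an MDC key:
   logError(log"Exception thrown in field <${MDC(FIELD_NAME, fieldRefs(i).getFieldName)}>")
   ```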





Re: [PR] [SPARK-47371] [SQL] XML: Ignore row tags found in CDATA [spark]

2024-04-08 Thread via GitHub


sandip-db commented on code in PR #45487:
URL: https://github.com/apache/spark/pull/45487#discussion_r1556266125


##
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/xml/StaxXmlParser.scala:
##
@@ -682,25 +684,25 @@ class XmlTokenizer(
 return false
   }
   val c = cOrEOF.toChar
-  if (c == commentEnd(i)) {
-if (i >= commentEnd.length - 1) {
-  // Found comment close.
+  if (c == end(i)) {
+i += 1
+if (i >= end.length) {

Review Comment:
   Please add a test with two scenarios:
   - CDATA ends at the end of the file,
   - CDATA never ends.
   
   The latter will be invalid XML. The goal is to make sure the parser doesn't crash 
and still returns the other valid records.
   
   Add the same two tests for comments as well.






Re: [PR] [SPARK-47593][CORE] Connector module: Migrate logWarn with variables to structured logging framework [spark]

2024-04-08 Thread via GitHub


gengliangwang commented on code in PR #45879:
URL: https://github.com/apache/spark/pull/45879#discussion_r1556331340


##
connector/connect/common/src/main/scala/org/apache/spark/sql/connect/client/GrpcRetryHandler.scala:
##
@@ -197,18 +200,20 @@ private[sql] object GrpcRetryHandler extends Logging {
 
 if (time.isDefined) {
   logWarning(
-s"Non-Fatal error during RPC execution: $lastException, retrying " 
+
-  s"(wait=${time.get.toMillis}, currentRetryNum=$currentRetryNum, 
" +
-  s"policy: ${policy.getName})")
+log"Retrying (wait=${MDC(WAIT_TIME, time.get.toMillis)}, " +

Review Comment:
   We can also use `ERROR` for lastException. Either way is ok.






Re: [PR] [SPARK-47593][CORE] Connector module: Migrate logWarn with variables to structured logging framework [spark]

2024-04-08 Thread via GitHub


gengliangwang commented on code in PR #45879:
URL: https://github.com/apache/spark/pull/45879#discussion_r1556331689


##
connector/connect/server/src/main/scala/org/apache/spark/sql/connect/execution/ExecuteGrpcResponseSender.scala:
##
@@ -268,9 +269,11 @@ private[connect] class ExecuteGrpcResponseSender[T <: Message](
   if (interrupted) {
 // This sender got interrupted. Kill this RPC.
 logWarning(
-  s"Got detached from opId=${executeHolder.operationId} at index ${nextIndex - 1}." +
-s"totalTime=${System.nanoTime - startTime}ns " +
-s"waitingForResults=${consumeSleep}ns waitingForSend=${sendSleep}ns")
+  log"Got detached from opId=${MDC(OP_ID, executeHolder.operationId)} " +
+log"at index ${MDC(INDEX, nextIndex - 1)}." +
+log"totalTime=${MDC(TOTAL_TIME, System.nanoTime - startTime)}ns " +

Review Comment:
   Please use ms



##
connector/connect/server/src/main/scala/org/apache/spark/sql/connect/execution/ExecuteGrpcResponseSender.scala:
##
@@ -268,9 +269,11 @@ private[connect] class ExecuteGrpcResponseSender[T <: Message](
   if (interrupted) {
 // This sender got interrupted. Kill this RPC.
 logWarning(
-  s"Got detached from opId=${executeHolder.operationId} at index ${nextIndex - 1}." +
-s"totalTime=${System.nanoTime - startTime}ns " +
-s"waitingForResults=${consumeSleep}ns waitingForSend=${sendSleep}ns")
+  log"Got detached from opId=${MDC(OP_ID, executeHolder.operationId)} " +
+log"at index ${MDC(INDEX, nextIndex - 1)}." +
+log"totalTime=${MDC(TOTAL_TIME, System.nanoTime - startTime)}ns " +
+log"waitingForResults=${MDC(WAIT_RESULT_TIME, consumeSleep)}ns " +

Review Comment:
   Please use ms



##
connector/connect/server/src/main/scala/org/apache/spark/sql/connect/execution/ExecuteGrpcResponseSender.scala:
##
@@ -268,9 +269,11 @@ private[connect] class ExecuteGrpcResponseSender[T <: Message](
   if (interrupted) {
 // This sender got interrupted. Kill this RPC.
 logWarning(
-  s"Got detached from opId=${executeHolder.operationId} at index ${nextIndex - 1}." +
-s"totalTime=${System.nanoTime - startTime}ns " +
-s"waitingForResults=${consumeSleep}ns waitingForSend=${sendSleep}ns")
+  log"Got detached from opId=${MDC(OP_ID, executeHolder.operationId)} " +
+log"at index ${MDC(INDEX, nextIndex - 1)}." +
+log"totalTime=${MDC(TOTAL_TIME, System.nanoTime - startTime)}ns " +
+log"waitingForResults=${MDC(WAIT_RESULT_TIME, consumeSleep)}ns " +
+log"waitingForSend=${MDC(WAIT_SEND_TIME, sendSleep)}ns")

Review Comment:
   Please use ms
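   For illustration, the conversion could look like this (not a verbatim suggestion; assumes `java.util.concurrent.TimeUnit` is imported):
   ```scala
   log"totalTime=${MDC(TOTAL_TIME, TimeUnit.NANOSECONDS.toMillis(System.nanoTime - startTime))}ms " +
   log"waitingForResults=${MDC(WAIT_RESULT_TIME, TimeUnit.NANOSECONDS.toMillis(consumeSleep))}ms " +
   log"waitingForSend=${MDC(WAIT_SEND_TIME, TimeUnit.NANOSECONDS.toMillis(sendSleep))}ms")
   ```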



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[PR] [SPARK-47769][SQL] Add schema_of_variant_agg expression. [spark]

2024-04-08 Thread via GitHub


chenhao-db opened a new pull request, #45934:
URL: https://github.com/apache/spark/pull/45934

   ### What changes were proposed in this pull request?
   
   This PR adds a new `schema_of_variant_agg` expression. It returns the merged 
schema of a variant column, in the SQL type format. Compared to `schema_of_variant`, 
which is a scalar expression and returns one schema per row, the 
`schema_of_variant_agg` expression merges the schemas of all rows.
   
   Usage examples:
   
   ```
   > SELECT schema_of_variant_agg(parse_json(j)) FROM VALUES ('1'), ('2'), ('3') AS tab(j);
BIGINT
   > SELECT schema_of_variant_agg(parse_json(j)) FROM VALUES ('{"a": 1}'), ('{"b": true}'), ('{"c": 1.23}') AS tab(j);
STRUCT<a: BIGINT, b: BOOLEAN, c: DECIMAL(3,2)>
   ```
   
   ### Why are the changes needed?
   
   This expression can help the user explore the content of variant values.
   
   ### Does this PR introduce _any_ user-facing change?
   
   Yes.  A new SQL expression is added.
   
   ### How was this patch tested?
   
   Unit tests.
   
   ### Was this patch authored or co-authored using generative AI tooling?
   
   No.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-47593][CORE] Connector module: Migrate logWarn with variables to structured logging framework [spark]

2024-04-08 Thread via GitHub


gengliangwang commented on code in PR #45879:
URL: https://github.com/apache/spark/pull/45879#discussion_r1556386456


##
connector/kafka-0-10-sql/src/test/scala/org/apache/spark/sql/kafka010/KafkaTestUtils.scala:
##
@@ -332,7 +333,7 @@ class KafkaTestUtils(
 Utils.deleteRecursively(new File(f))
   } catch {
 case e: IOException if Utils.isWindows =>
-  logWarning(e.getMessage)
+  logWarning(log"${MDC(ERROR, e.getMessage)}", e)

Review Comment:
   ```suggestion
 logWarning(log"${MDC(ERROR, e.getMessage)}")
   ```



##
connector/kafka-0-10-sql/src/test/scala/org/apache/spark/sql/kafka010/KafkaTestUtils.scala:
##
@@ -653,13 +654,13 @@ class KafkaTestUtils(
 Utils.deleteRecursively(snapshotDir)
   } catch {
 case e: IOException if Utils.isWindows =>
-  logWarning(e.getMessage)
+  logWarning(log"${MDC(ERROR, e.getMessage)}", e)

Review Comment:
   ditto



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-47593][CORE] Connector module: Migrate logWarn with variables to structured logging framework [spark]

2024-04-08 Thread via GitHub


gengliangwang commented on code in PR #45879:
URL: https://github.com/apache/spark/pull/45879#discussion_r1556386948


##
connector/kafka-0-10-sql/src/test/scala/org/apache/spark/sql/kafka010/KafkaTestUtils.scala:
##
@@ -653,13 +654,13 @@ class KafkaTestUtils(
 Utils.deleteRecursively(snapshotDir)
   } catch {
 case e: IOException if Utils.isWindows =>
-  logWarning(e.getMessage)
+  logWarning(log"${MDC(ERROR, e.getMessage)}", e)
   }
   try {
 Utils.deleteRecursively(logDir)
   } catch {
 case e: IOException if Utils.isWindows =>
-  logWarning(e.getMessage)
+  logWarning(log"${MDC(ERROR, e.getMessage)}", e)

Review Comment:
   ditto



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-47593][CORE] Connector module: Migrate logWarn with variables to structured logging framework [spark]

2024-04-08 Thread via GitHub


gengliangwang commented on PR #45879:
URL: https://github.com/apache/spark/pull/45879#issuecomment-2043577634

   @panbingkun Thanks for the work! LGTM except for some minor comments.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-47581][CORE] SQL catalyst: Migrate logWarning with variables to structured logging framework [spark]

2024-04-08 Thread via GitHub


gengliangwang commented on code in PR #45904:
URL: https://github.com/apache/spark/pull/45904#discussion_r1556415217


##
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/StreamingJoinHelper.scala:
##
@@ -284,7 +287,8 @@ object StreamingJoinHelper extends PredicateHelper with Logging {
   Seq(negateIfNeeded(castedLit, negate))
 case a @ _ =>
   logWarning(
-s"Failed to extract state value watermark from condition $exprToCollectFrom due to $a")
+log"Failed to extract state value watermark from condition " +
+  log"${MDC(JOIN_CONDITION, exprToCollectFrom)} due to ${MDC(JOIN_CONDITION, a)}")

Review Comment:
   What is `a` here? It seems that you are using a same key for two variables
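   For illustration, one way to keep the two values under separate keys (the second key name is hypothetical and would have to be added to `LogKeys`):
   ```scala
   log"Failed to extract state value watermark from condition " +
     log"${MDC(JOIN_CONDITION, exprToCollectFrom)} due to ${MDC(JOIN_CONDITION_SUB_EXPRESSION, a)}")
   ```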



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-47581][CORE] SQL catalyst: Migrate logWarning with variables to structured logging framework [spark]

2024-04-08 Thread via GitHub


gengliangwang commented on code in PR #45904:
URL: https://github.com/apache/spark/pull/45904#discussion_r1556417726


##
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/CSVHeaderChecker.scala:
##
@@ -75,19 +75,21 @@ class CSVHeaderChecker(
   }
   if (nameInHeader != nameInSchema) {
 errorMessage = Some(
-  s"""|CSV header does not conform to the schema.
-  | Header: ${columnNames.mkString(", ")}
-  | Schema: ${fieldNames.mkString(", ")}
-  |Expected: ${fieldNames(i)} but found: ${columnNames(i)}
-  |$source""".stripMargin)
+  log"""|CSV header does not conform to the schema.
+| Header: ${MDC(COLUMN_NAME, columnNames.mkString(", "))}

Review Comment:
   We can't use COLUMN_NAME for 4 different variables...
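   A sketch of one possible split (all key names here are hypothetical and would need to exist in `LogKeys`):
   ```scala
   log"""|CSV header does not conform to the schema.
         | Header: ${MDC(CSV_HEADER_NAMES, columnNames.mkString(", "))}
         | Schema: ${MDC(CSV_SCHEMA_FIELD_NAMES, fieldNames.mkString(", "))}
         |Expected: ${MDC(EXPECTED_COLUMN_NAME, fieldNames(i))} but found: ${MDC(FOUND_COLUMN_NAME, columnNames(i))}
         |${MDC(CSV_SOURCE, source)}""".stripMargin
   ```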



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-47581][CORE] SQL catalyst: Migrate logWarning with variables to structured logging framework [spark]

2024-04-08 Thread via GitHub


gengliangwang commented on code in PR #45904:
URL: https://github.com/apache/spark/pull/45904#discussion_r1556421508


##
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/CharVarcharUtils.scala:
##
@@ -74,10 +75,10 @@ object CharVarcharUtils extends Logging with SparkCharVarcharUtils {
 if (SQLConf.get.charVarcharAsString) {
   replaceCharVarcharWithString(dt)
 } else if (hasCharVarchar(dt)) {
-  logWarning("The Spark cast operator does not support char/varchar type and simply treats" +
-" them as string type. Please use string type directly to avoid confusion. Otherwise," +
-s" you can set ${SQLConf.LEGACY_CHAR_VARCHAR_AS_STRING.key} to true, so that Spark treat" +
-s" them as string type as same as Spark 3.0 and earlier")
+  logWarning(log"The Spark cast operator does not support char/varchar type and simply treats" +
+log" them as string type. Please use string type directly to avoid confusion. Otherwise," +
+log" you can set ${MDC(SQL_CONF_KEY, SQLConf.LEGACY_CHAR_VARCHAR_AS_STRING.key)} " +

Review Comment:
   ```suggestion
   log" you can set ${MDC(CONFIG, 
SQLConf.LEGACY_CHAR_VARCHAR_AS_STRING.key)} " +
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-47581][CORE] SQL catalyst: Migrate logWarning with variables to structured logging framework [spark]

2024-04-08 Thread via GitHub


gengliangwang commented on code in PR #45904:
URL: https://github.com/apache/spark/pull/45904#discussion_r1556422516


##
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/ResolveDefaultColumnsUtil.scala:
##
@@ -321,8 +322,11 @@ object ResolveDefaultColumns extends QueryErrorsBase
   Option(casted.eval(EmptyRow)).map(Literal(_, targetType))
 } catch {
   case e @ ( _: SparkThrowable | _: RuntimeException) =>
-logWarning(s"Failed to cast default value '$l' for column $colName 
from " +
-  s"${l.dataType} to $targetType due to ${e.getMessage}")
+logWarning(log"Failed to cast default value 
'${MDC(COLUMN_DEFAULT_VALUE, l)}' " +
+  log"for column ${MDC(COLUMN_NAME, colName)} " +
+  log"from ${MDC(COLUMN_DATA_TYPE, l.dataType)} " +

Review Comment:
   we can't use COLUMN_DATA_TYPE for two variables here.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-47581][CORE] SQL catalyst: Migrate logWarning with variables to structured logging framework [spark]

2024-04-08 Thread via GitHub


gengliangwang commented on code in PR #45904:
URL: https://github.com/apache/spark/pull/45904#discussion_r1556423196


##
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/StringUtils.scala:
##
@@ -136,9 +137,11 @@ object StringUtils extends Logging {
 override def toString: String = {
   if (atLimit) {
 logWarning(
-  "Truncated the string representation of a plan since it was too 
long. The " +
-s"plan had length ${length} and the maximum is ${maxLength}. This 
behavior " +
-s"can be adjusted by setting 
'${SQLConf.MAX_PLAN_STRING_LENGTH.key}'.")
+  log"Truncated the string representation of a plan since it was too 
long. The " +
+log"plan had length ${MDC(QUERY_PLAN_LENGTH, length)} " +
+log"and the maximum is ${MDC(QUERY_PLAN_LENGTH, maxLength)}. This 
behavior " +
+log"can be adjusted by setting " +
+log"'${MDC(QUERY_PLAN_LENGTH, 
SQLConf.MAX_PLAN_STRING_LENGTH.key)}'.")

Review Comment:
   ```suggestion
   log"'${MDC(CONFIG, SQLConf.MAX_PLAN_STRING_LENGTH.key)}'.")
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-47734][PYTHON][TESTS][3.4] Fix flaky DataFrame.writeStream doctest by stopping streaming query [spark]

2024-04-08 Thread via GitHub


dongjoon-hyun commented on PR #45908:
URL: https://github.com/apache/spark/pull/45908#issuecomment-2043670805

   Thank you, @JoshRosen and all!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-47366][PYTHON] Add VariantVal for PySpark [spark]

2024-04-08 Thread via GitHub


gene-db commented on code in PR #45826:
URL: https://github.com/apache/spark/pull/45826#discussion_r1556456624


##
python/pyspark/sql/types.py:
##
@@ -1468,6 +1475,36 @@ def __eq__(self, other: Any) -> bool:
 return type(self) == type(other)
 
 
+class VariantVal:

Review Comment:
   Added to the documentation.
   
   I think for now, keeping the lightweight `VariantVal` here, and keeping all 
the binary encoding details in `variant_utils.py` seems fine. We can refactor 
later if needed.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-47366][PYTHON] Add VariantVal for PySpark [spark]

2024-04-08 Thread via GitHub


gene-db commented on code in PR #45826:
URL: https://github.com/apache/spark/pull/45826#discussion_r1556457254


##
python/pyspark/sql/tests/test_types.py:
##
@@ -1406,6 +1407,68 @@ def test_calendar_interval_type_with_sf(self):
 schema1 = self.spark.range(1).select(F.make_interval(F.lit(1))).schema
 self.assertEqual(schema1.fields[0].dataType, CalendarIntervalType())
 
+def test_variant_type(self):
+from decimal import Decimal
+self.assertEqual(VariantType().simpleString(), "variant")
+
+# Holds a tuple of (key, json string value, python value)
+expected_values = [
+("str", '"%s"' % ("0123456789" * 10), "0123456789" * 10),
+("short_str", '"abc"', "abc"),
+("null", "null", None),
+("true", "true", True),
+("false", "false", False),
+("int1", "1", 1),
+("-int1", "-5", -5),
+("int2", "257", 257),
+("-int2", "-124", -124),
+("int4", "65793", 65793),
+("-int4", "-69633", -69633),
+("int8", "4295033089", 4295033089),
+("-int8", "-4294967297", -4294967297),
+("float4", "1.23456789e-30", 1.23456789e-30),
+("-float4", "-4.56789e+29", -4.56789e+29),
+("dec4", "123.456", Decimal("123.456")),
+("-dec4", "-321.654", Decimal("-321.654")),
+("dec8", "429.4967297", Decimal("429.4967297")),
+("-dec8", "-5.678373902", Decimal("-5.678373902")),
+("dec16", "467440737095.51617", Decimal("467440737095.51617")),
+("-dec16", "-67.849438003827263", Decimal("-67.849438003827263")),
+("arr", '[1.1,"2",[3],{"4":5}]', [Decimal("1.1"), "2", [3], {"4": 
5}]),
+("obj", '{"a":["123",{"b":2}],"c":3}', {"a": ["123", {"b": 2}], 
"c": 3}),
+]
+json_str = "{%s}" % ",".join(['"%s": %s' % (t[0], t[1]) for t in 
expected_values])
+
+df = self.spark.createDataFrame([({"json": json_str})])

Review Comment:
   @zhengruifeng Would it be ok to handle that implementation in a separate PR? 
This one is already getting pretty large.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[PR] [SPARK-47609][SQL] Making CacheLookup more optimal to minimize cache miss [spark]

2024-04-08 Thread via GitHub


ahshahid opened a new pull request, #45935:
URL: https://github.com/apache/spark/pull/45935

   ### What changes were proposed in this pull request?
   Currently the cache lookup relies on a fragment (a sub-plan of the incoming
query) matching exactly the analyzed logical plan in CacheManager for which an
InMemoryRelation exists.
   This limits the efficiency of the lookup. For example, consider a simple case:
   we have an InMemoryRelation (IMR) available for a plan such as

       Project(a, b, c, d)   ->   cached IMR
         |
       Project(x as a, b, b + a as c, b - c as d)

   and the incoming plan is, say,

       Project(a, b, c)
         |
       Project(x as a, b, b + a as c, b - c as d)

   Currently the incoming lookup will not be able to use the cached IMR, as the
top-level Projects do not match.
   But it is possible to use the IMR by putting a projection of (a, b, c) on top
of it, i.e.

       Project(a, b, c)
         |
       cached IMR
   
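   For concreteness, these plan shapes can be reproduced with a few lines of
DataFrame code (column names follow the example above; the projection is written
against the input columns so it runs as-is, and whether the cached IMR is reused
shows up as an InMemoryTableScan node in the `explain()` output):

   ```scala
   val x = spark.range(100).selectExpr("id AS x", "id * 2 AS b", "id % 7 AS c")

   // Cached plan: a Project over X, materialized as an InMemoryRelation.
   val df1 = x.selectExpr("x AS a", "b", "b + x AS c", "b - c AS d")
   df1.cache()
   df1.count()

   // Incoming plan: a narrower Project over the same child.
   val df2 = df1.selectExpr("a", "b", "c")
   df2.explain()
   ```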
   Below describes how more complex plans can use a cached IMR, with Filters
present in between the Projects.
   **The main idea is: for each fragment of the incoming sub-plan, check whether
there is an exact-match plan in the cached data (this is the current code). If it
matches, use it; otherwise look for a partial match.**

   **A partial match, among other requirements, also mandates that
incoming_plan.child is canonically the same as cachedPlan.child.**

   It can be argued that checking only 1 level of depth equality is too strict, as
the cached plan may still be usable when the children do not match but the
grandchildren do,

       i.e. incoming_plan.child.child == cachedPlan.child.child,

   and in that sense the argument can be extended all the way to the leaves.

   We still limit the partial-match check to 1 child level, for the following reasons:
   1) It reduces the complexity.
   2) It keeps the lookup time within a reasonable limit.
   3) Hopefully the next PR in line,
[PR-SPARK-45959](https://github.com/apache/spark/pull/43854), which aggressively
collapses two adjacent Projects, or 2 Projects interspersed with Filters, into a
single Project in the analyzer phase where possible, means that in most cases the
child match will be good enough for the partial-match requirement.
   
   
   
   Case 1: using an InMemoryRelation in a plan having 2 consecutive Projects.

   We start with a DataFrame df1 with plan = Project2 -> X
   and then we cache df1, so that the CachedRepresentation holds the IMR and the
logical plan Project2 -> X.

   Now we create a new DataFrame Df2 = Some Project -> X.
   Clearly, Project may no longer be the same as Project2, so a direct check with
CacheManager will not find a matching IMR.
   But clearly the X are the same.
   So the criteria are: the IMR can be used IFF the following conditions are met:
   1. X is the same for both (i.e. the incoming Project's child and the cached
plan's Project's child are the same).
   2. All the NamedExpressions of the incoming Project are expressible in terms of
the output of Project2 (which is what the IMR's output is).

   To do the check for point 2 above, we use the following logic.
   Given that X is the same for both, their outputs are equivalent, so we remap the
cached plan's Project2 in terms of the output attributes (expr IDs) of X of the
incoming Project plan.
   This lets us classify the NamedExpressions of the incoming Project into:
   1. those which are directly the same as NamedExpressions of Project2,
   2. those which are some function of the output of Project2,
   3. those which are literal constants, independent of the output of Project2,
   4. those which are functions of attributes that are unavailable in the output of
Project2.
   So long as the 4th kind of NamedExpression is absent, the InMemoryRelation of
the cached plan is usable.
   This logic is coded in CacheManager. It involves rewriting the NamedExpressions
of the incoming Project in terms of the Seq[Attributes] that will be forced on the
IMR.
   
   Case 2: using an InMemoryRelation in a plan resulting from the collapse of
Projects interspersed with Filters.

   We start with a DataFrame df1 with plan = Filter3 -> Filter4 -> Project2 -> X
   and then we cache df1, so that the CachedRepresentation holds the IMR and the
logical plan
   Filter3 -> Filter4 -> Project2 -> X.

   Now we create a new DataFrame Df2 = Project -> Filter1 -> Filter2 -> Filter3 ->
Filter4 -> X.

   Clearly the cached plan chain
   Filter3 -> Filter4 -> Project2 -> X
   is no longer directly similar to
   Project -> Filter1 -> Filter2 -> Filter3 -> Filter4 -> X,
   but it is still possible to use the IMR, as the cached plan's LogicalPlan can be
used as a subtree of the incoming plan.

   The logic for this check is partly the same as above for 2 consecutive Projects,
with some extra handling for the Filters.
   The algo for this is as fo

Re: [PR] [SPARK-47761][SQL] Oracle: Support reading AnsiIntervalTypes [spark]

2024-04-08 Thread via GitHub


dongjoon-hyun commented on code in PR #45925:
URL: https://github.com/apache/spark/pull/45925#discussion_r1556481273


##
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JdbcUtils.scala:
##
@@ -526,6 +526,16 @@ object JdbcUtils extends Logging with SQLConfHelper {
   (rs: ResultSet, row: InternalRow, pos: Int) =>
 row.update(pos, rs.getBytes(pos + 1))
 
+case _: YearMonthIntervalType =>

Review Comment:
   Is there any side-effect for the other JDBC dialects?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org


