[PR] [Only Test] Test without local maven cache [spark]

2024-05-15 Thread via GitHub


panbingkun opened a new pull request, #46606:
URL: https://github.com/apache/spark/pull/46606

   
   
   ### What changes were proposed in this pull request?
   
   
   
   ### Why are the changes needed?
   
   
   
   ### Does this PR introduce _any_ user-facing change?
   
   
   
   ### How was this patch tested?
   
   
   
   ### Was this patch authored or co-authored using generative AI tooling?
   
   





Re: [PR] [SPARK-48296][SQL] Codegen Support for `to_xml` [spark]

2024-05-15 Thread via GitHub


yaooqinn commented on code in PR #46591:
URL: https://github.com/apache/spark/pull/46591#discussion_r1602657406


##
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/xmlExpressions.scala:
##
@@ -298,40 +294,40 @@ case class StructsToXml(
   }
 
   @transient
-  lazy val writer = new CharArrayWriter()
-
-  @transient
-  lazy val inputSchema: StructType = child.dataType.asInstanceOf[StructType]
-
-  @transient
-  lazy val gen = new StaxXmlGenerator(
-    inputSchema, writer, new XmlOptions(options, timeZoneId.get), false)
-
-  // This converts rows to the XML output according to the given schema.
-  @transient
-  lazy val converter: Any => UTF8String = {
-    def getAndReset(): UTF8String = {
-      gen.flush()
-      val xmlString = writer.toString
-      writer.reset()
-      UTF8String.fromString(xmlString)
-    }
-    (row: Any) =>
-      gen.write(row.asInstanceOf[InternalRow])
-      getAndReset()
-  }
+  private lazy val evaluator = new StructsToXmlEvaluator(
+    child.dataType.asInstanceOf[StructType], options, timeZoneId)
 
   override def dataType: DataType = SQLConf.get.defaultStringType
 
   override def withTimeZone(timeZoneId: String): TimeZoneAwareExpression =
     copy(timeZoneId = Option(timeZoneId))
 
-  override def nullSafeEval(value: Any): Any = converter(value)
+  override def nullSafeEval(value: Any): Any = {
+    UTF8String.fromString(evaluator.evaluate(value))
+  }
 
   override def inputTypes: Seq[AbstractDataType] = StructType :: Nil
 
   override def prettyName: String = "to_xml"
 
   override protected def withNewChildInternal(newChild: Expression): StructsToXml =
     copy(child = newChild)
+
+  override protected def doGenCode(ctx: CodegenContext, ev: ExprCode): ExprCode = {

Review Comment:
   Could we also revert the other refactoring?






Re: [PR] [SPARK-48296][SQL] Codegen Support for `to_xml` [spark]

2024-05-15 Thread via GitHub


panbingkun commented on code in PR #46591:
URL: https://github.com/apache/spark/pull/46591#discussion_r1602655851


##
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/xmlExpressions.scala:
##
@@ -298,40 +294,40 @@ case class StructsToXml(
   }
 
   @transient
-  lazy val writer = new CharArrayWriter()
-
-  @transient
-  lazy val inputSchema: StructType = child.dataType.asInstanceOf[StructType]
-
-  @transient
-  lazy val gen = new StaxXmlGenerator(
-    inputSchema, writer, new XmlOptions(options, timeZoneId.get), false)
-
-  // This converts rows to the XML output according to the given schema.
-  @transient
-  lazy val converter: Any => UTF8String = {
-    def getAndReset(): UTF8String = {
-      gen.flush()
-      val xmlString = writer.toString
-      writer.reset()
-      UTF8String.fromString(xmlString)
-    }
-    (row: Any) =>
-      gen.write(row.asInstanceOf[InternalRow])
-      getAndReset()
-  }
+  private lazy val evaluator = new StructsToXmlEvaluator(
+    child.dataType.asInstanceOf[StructType], options, timeZoneId)
 
   override def dataType: DataType = SQLConf.get.defaultStringType
 
   override def withTimeZone(timeZoneId: String): TimeZoneAwareExpression =
     copy(timeZoneId = Option(timeZoneId))
 
-  override def nullSafeEval(value: Any): Any = converter(value)
+  override def nullSafeEval(value: Any): Any = {
+    UTF8String.fromString(evaluator.evaluate(value))
+  }
 
   override def inputTypes: Seq[AbstractDataType] = StructType :: Nil
 
   override def prettyName: String = "to_xml"
 
   override protected def withNewChildInternal(newChild: Expression): StructsToXml =
     copy(child = newChild)
+
+  override protected def doGenCode(ctx: CodegenContext, ev: ExprCode): ExprCode = {

Review Comment:
   yeah, updated.
   thanks!






Re: [PR] [SPARK-48299][BUILD] Upgrade `scala-maven-plugin` to 4.9.1 [spark]

2024-05-15 Thread via GitHub


yaooqinn commented on PR #46593:
URL: https://github.com/apache/spark/pull/46593#issuecomment-2114079862

   Merged to master. Thank you @LuciferYang 





Re: [PR] [SPARK-48299][BUILD] Upgrade `scala-maven-plugin` to 4.9.1 [spark]

2024-05-15 Thread via GitHub


LuciferYang commented on PR #46593:
URL: https://github.com/apache/spark/pull/46593#issuecomment-2114077454

   Thanks @yaooqinn 





Re: [PR] [SPARK-48299][BUILD] Upgrade `scala-maven-plugin` to 4.9.1 [spark]

2024-05-15 Thread via GitHub


yaooqinn closed pull request #46593: [SPARK-48299][BUILD] Upgrade 
`scala-maven-plugin` to 4.9.1
URL: https://github.com/apache/spark/pull/46593





Re: [PR] [SPARK-48251][BUILD] Disable `maven local cache` on GA's step `MIMA test` [spark]

2024-05-15 Thread via GitHub


panbingkun commented on PR #46551:
URL: https://github.com/apache/spark/pull/46551#issuecomment-2114068973

   > How much of a performance gap will there be if we completely abandon `Resolver.mavenLocal`, compared with the current approach?
   
   Let me test it.





Re: [PR] [SPARK-47607] Add documentation for Structured logging framework [spark]

2024-05-15 Thread via GitHub


gengliangwang commented on PR #46605:
URL: https://github.com/apache/spark/pull/46605#issuecomment-2114056759

   cc @panbingkun 





[PR] [SPARK-47607] Add documentation for Structured logging framework [spark]

2024-05-15 Thread via GitHub


gengliangwang opened a new pull request, #46605:
URL: https://github.com/apache/spark/pull/46605

   
   
   ### What changes were proposed in this pull request?
   
   Add documentation for Structured logging framework
   
   ### Why are the changes needed?
   
   Provide documentation for Spark developers
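   
   To give a concrete flavor of what the new docs cover, here is a hedged sketch of structured logging usage. The `log` interpolator, `MDC` wrapper, and the `LogKeys.EXECUTOR_ID` key below are assumptions based on the framework being documented, not excerpts from this PR:
   
   ```scala
   // A minimal sketch, assuming the MDC-based API of the structured logging
   // framework; the exact key names are assumed to live in
   // org.apache.spark.internal.LogKeys.
   import org.apache.spark.internal.{Logging, LogKeys, MDC}
   
   class MyComponent extends Logging {
     def onExecutorAdded(executorId: String): Unit = {
       // MDC tags the value with a structured key, so it is emitted as a
       // separate JSON field when structured logging is enabled.
       logInfo(log"Registered executor ${MDC(LogKeys.EXECUTOR_ID, executorId)}")
     }
   }
   ```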
   
   ### Does this PR introduce _any_ user-facing change?
   
   No
   
   ### How was this patch tested?
   
   Doc preview:
   https://github.com/apache/spark/assets/1097932/d3c4fcdc-57e4-4af2-8b05-6b4f6731a8c0
   
   
   ### Was this patch authored or co-authored using generative AI tooling?
   
   No





Re: [PR] [SPARK-48251][BUILD] Disable `maven local cache` on GA's step `MIMA test` [spark]

2024-05-15 Thread via GitHub


LuciferYang commented on PR #46551:
URL: https://github.com/apache/spark/pull/46551#issuecomment-2114044184

   @panbingkun How much of a performance gap will there be if we completely abandon `Resolver.mavenLocal`, compared with the current approach?
   
   





Re: [PR] [SPARK-36783][SQL] ScanOperation should not push Filter through nondeterministic Project [spark]

2024-05-15 Thread via GitHub


wForget commented on PR #34023:
URL: https://github.com/apache/spark/pull/34023#issuecomment-2114043502

   > Yes, because after pruning, less data is scanned and the non-deterministic function in the SELECT list may return different results if we don't prune ahead.
   
   Makes sense, thank you!





Re: [PR] [SPARK-36783][SQL] ScanOperation should not push Filter through nondeterministic Project [spark]

2024-05-15 Thread via GitHub


cloud-fan commented on PR #34023:
URL: https://github.com/apache/spark/pull/34023#issuecomment-2114040674

   Yes, because after pruning, less data is scanned and the non-deterministic function in the SELECT list may return different results if we don't prune ahead.
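   
   To make this concrete, here is a minimal sketch (hypothetical view name and data) of why filters are no longer pushed below a non-deterministic projection:
   
   ```scala
   import org.apache.spark.sql.SparkSession
   
   // A sketch, assuming a local session; "test" is a made-up view.
   val spark = SparkSession.builder().master("local[*]").getOrCreate()
   spark.range(0, 100).createOrReplaceTempView("test")
   
   // Because the inner projection contains rand(), Spark will not push
   // `id < 10` below it: all 100 rows reach the projection and rand() is
   // evaluated for each of them. Pushing the filter down would change the
   // set of rows rand() sees, and therefore the observable output.
   spark.sql(
     """SELECT *
       |FROM (SELECT id, rand() AS r FROM test) t
       |WHERE id < 10
       |""".stripMargin).show()
   ```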





Re: [PR] [SPARK-48214][INFRA] Ban import `org.slf4j.Logger` & `org.slf4j.LoggerFactory` [spark]

2024-05-15 Thread via GitHub


gengliangwang closed pull request #46502: [SPARK-48214][INFRA] Ban import 
`org.slf4j.Logger` & `org.slf4j.LoggerFactory`
URL: https://github.com/apache/spark/pull/46502





Re: [PR] [SPARK-48214][INFRA] Ban import `org.slf4j.Logger` & `org.slf4j.LoggerFactory` [spark]

2024-05-15 Thread via GitHub


gengliangwang commented on PR #46502:
URL: https://github.com/apache/spark/pull/46502#issuecomment-2114022656

   Thanks, merging to master





Re: [PR] [SPARK-48214][INFRA] Ban import `org.slf4j.Logger` & `org.slf4j.LoggerFactory` [spark]

2024-05-15 Thread via GitHub


gengliangwang commented on code in PR #46502:
URL: https://github.com/apache/spark/pull/46502#discussion_r1602614245


##
common/utils/src/main/java/org/apache/spark/internal/SparkLoggerFactory.java:
##
@@ -17,15 +17,17 @@
 
 package org.apache.spark.internal;
 
+// checkstyle.off: RegexpSinglelineJava
+import org.slf4j.LoggerFactory;
+// checkstyle.on: RegexpSinglelineJava
+
 public class SparkLoggerFactory {
 
   public static SparkLogger getLogger(String name) {
-    org.slf4j.Logger slf4jLogger = org.slf4j.LoggerFactory.getLogger(name);
-    return new SparkLogger(slf4jLogger);
+    return new SparkLogger(LoggerFactory.getLogger(name));

Review Comment:
   yeah this looks better






Re: [PR] [SPARK-48298][Core] Add TCP mode to StatsD sink [spark]

2024-05-15 Thread via GitHub


jiwen624 commented on PR #46604:
URL: https://github.com/apache/spark/pull/46604#issuecomment-2114019176

   Hi @cloud-fan @dongjoon-hyun @yaooqinn Could you take a look when you get a 
chance or perhaps pull other people in for review? Thank you 





Re: [PR] [SPARK-48296][SQL] Codegen Support for `to_xml` [spark]

2024-05-15 Thread via GitHub


yaooqinn commented on code in PR #46591:
URL: https://github.com/apache/spark/pull/46591#discussion_r1602584809


##
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/xmlExpressions.scala:
##
@@ -298,40 +294,40 @@ case class StructsToXml(
   }
 
   @transient
-  lazy val writer = new CharArrayWriter()
-
-  @transient
-  lazy val inputSchema: StructType = child.dataType.asInstanceOf[StructType]
-
-  @transient
-  lazy val gen = new StaxXmlGenerator(
-    inputSchema, writer, new XmlOptions(options, timeZoneId.get), false)
-
-  // This converts rows to the XML output according to the given schema.
-  @transient
-  lazy val converter: Any => UTF8String = {
-    def getAndReset(): UTF8String = {
-      gen.flush()
-      val xmlString = writer.toString
-      writer.reset()
-      UTF8String.fromString(xmlString)
-    }
-    (row: Any) =>
-      gen.write(row.asInstanceOf[InternalRow])
-      getAndReset()
-  }
+  private lazy val evaluator = new StructsToXmlEvaluator(
+    child.dataType.asInstanceOf[StructType], options, timeZoneId)
 
   override def dataType: DataType = SQLConf.get.defaultStringType
 
   override def withTimeZone(timeZoneId: String): TimeZoneAwareExpression =
     copy(timeZoneId = Option(timeZoneId))
 
-  override def nullSafeEval(value: Any): Any = converter(value)
+  override def nullSafeEval(value: Any): Any = {
+    UTF8String.fromString(evaluator.evaluate(value))
+  }
 
   override def inputTypes: Seq[AbstractDataType] = StructType :: Nil
 
   override def prettyName: String = "to_xml"
 
   override protected def withNewChildInternal(newChild: Expression): StructsToXml =
     copy(child = newChild)
+
+  override protected def doGenCode(ctx: CodegenContext, ev: ExprCode): ExprCode = {

Review Comment:
   Simply
   ```scala
  override protected def doGenCode(ctx: CodegenContext, ev: ExprCode): ExprCode = {
    val expr = ctx.addReferenceObj("this", this)
    defineCodeGen(ctx, ev, input => s"(UTF8String) $expr.nullSafeEval($input)")
  }
   ```
   
   






Re: [PR] [SPARK-36783][SQL] ScanOperation should not push Filter through nondeterministic Project [spark]

2024-05-15 Thread via GitHub


wForget commented on PR #34023:
URL: https://github.com/apache/spark/pull/34023#issuecomment-2113995703

   @cloud-fan This change also affects `PhysicalOperation`, causing Hive partition pruning to not take effect when there is a non-deterministic project. Is this expected behavior?
   
   For example:
   ```
   select *
   from (
     select
       c1, reflect('java.net.URLDecoder', 'decode', c2, 'UTF-8') as c2, dt
     from test) t1
   where dt='2024-05-15' limit 10;
   ```





Re: [PR] [SPARK-48287][PS][CONNECT] Apply the builtin `timestamp_diff` method [spark]

2024-05-15 Thread via GitHub


zhengruifeng commented on PR #46595:
URL: https://github.com/apache/spark/pull/46595#issuecomment-2113992843

   thanks, merged to master





Re: [PR] [SPARK-48287][PS][CONNECT] Apply the builtin `timestamp_diff` method [spark]

2024-05-15 Thread via GitHub


zhengruifeng closed pull request #46595: [SPARK-48287][PS][CONNECT] Apply the 
builtin `timestamp_diff` method
URL: https://github.com/apache/spark/pull/46595





Re: [PR] [SPARK-48219][CORE] StreamReader Charset fix with UTF8 [spark]

2024-05-15 Thread via GitHub


yaooqinn commented on PR #46509:
URL: https://github.com/apache/spark/pull/46509#issuecomment-2113991332

   Thank you @xuzifu666.
   
   Merged to master





Re: [PR] [SPARK-48219][CORE] StreamReader Charset fix with UTF8 [spark]

2024-05-15 Thread via GitHub


yaooqinn closed pull request #46509: [SPARK-48219][CORE] StreamReader Charset 
fix with UTF8
URL: https://github.com/apache/spark/pull/46509





Re: [PR] [SPARK-48219][CORE] StreamReader Charset fix with UTF8 [spark]

2024-05-15 Thread via GitHub


xuzifu666 commented on PR #46509:
URL: https://github.com/apache/spark/pull/46509#issuecomment-2113986338

   XSDToSchema would not modify it, and the Hive implementation has also been changed; see the recent PR https://github.com/apache/hive/pull/5243. So I think it is necessary to change it? @yaooqinn @HyukjinKwon 





Re: [PR] [SPARK-48219][CORE] StreamReader Charset fix with UTF8 [spark]

2024-05-15 Thread via GitHub


xuzifu666 commented on code in PR #46509:
URL: https://github.com/apache/spark/pull/46509#discussion_r1602578323


##
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/xml/XSDToSchema.scala:
##
@@ -48,7 +49,7 @@ object XSDToSchema extends Logging{
     val in = ValidatorUtil.openSchemaFile(xsdPath)
     val xmlSchemaCollection = new XmlSchemaCollection()
     xmlSchemaCollection.setBaseUri(xsdPath.toString)
-    val xmlSchema = xmlSchemaCollection.read(new InputStreamReader(in))
+    val xmlSchema = xmlSchemaCollection.read(new InputStreamReader(in, StandardCharsets.UTF_8))

Review Comment:
   XSDToSchema would not modify it, and the Hive implementation has also been changed; see the recent PR https://github.com/apache/hive/pull/5243








Re: [PR] [SPARK-48288] Add source data type for connector cast expression [spark]

2024-05-15 Thread via GitHub


cloud-fan commented on PR #46596:
URL: https://github.com/apache/spark/pull/46596#issuecomment-2113965545

   the change LGTM. Please mention that users who override 
`V2ExpressionSQLBuilder.visitCast` should update their code, in `Does this PR 
introduce any user-facing change?` section.





Re: [PR] [SPARK-48296][SQL] Codegen Support for `to_xml` [spark]

2024-05-15 Thread via GitHub


panbingkun commented on PR #46591:
URL: https://github.com/apache/spark/pull/46591#issuecomment-2113962446

   cc @yaooqinn @cloud-fan 





[PR] [WIP][SPARK-48298][Core] Add TCP mode to StatsD sink [spark]

2024-05-15 Thread via GitHub


jiwen624 opened a new pull request, #46604:
URL: https://github.com/apache/spark/pull/46604

   ### What changes were proposed in this pull request?
   Working on it...
   
   ### Why are the changes needed?
   As mentioned in the Jira ticket: 
https://issues.apache.org/jira/browse/SPARK-48298
   Currently, the StatsdSink in Spark supports UDP mode only, which is the default mode of StatsD. However, in real production environments we often find that more reliable transmission of metrics is needed to avoid metric loss in high-traffic systems.
   
   TCP mode is already supported by StatsD (https://github.com/statsd/statsd/blob/master/docs/server.md), by Prometheus' statsd_exporter (https://github.com/prometheus/statsd_exporter), and by many other StatsD-based metrics proxies/receivers.
   
   ### Does this PR introduce _any_ user-facing change?
   Yes.
   The following new config options are added to `conf/metrics.properties.template`:
   `*.sink.statsd.protocol`
   `*.sink.statsd.connTimeoutMs`
   A new error condition is defined in error-conditions.json for protocol configuration errors.
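   
   A hypothetical `metrics.properties` snippet showing how the new options would sit next to the existing StatsD sink settings (host/port values are made up; only `protocol` and `connTimeoutMs` come from this PR):
   
   ```
   # Existing StatsdSink settings
   *.sink.statsd.class=org.apache.spark.metrics.sink.StatsdSink
   *.sink.statsd.host=127.0.0.1
   *.sink.statsd.port=8125
   # New options proposed here: reliable TCP instead of the default UDP
   *.sink.statsd.protocol=tcp
   *.sink.statsd.connTimeoutMs=5000
   ```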
   
   ### How was this patch tested?
   Unit tests.
   Manually tests with metric configurations sending metrics to a Netcat 
TCP/UDP server
   
   ### Was this patch authored or co-authored using generative AI tooling?
   No.





Re: [PR] [SPARK-44811][BUILD] Upgrade Guava to 33.1.0-jre [spark]

2024-05-15 Thread via GitHub


LuciferYang commented on PR #42493:
URL: https://github.com/apache/spark/pull/42493#issuecomment-2113953900

   > > should we test 33.2.0?
   > 
   > `33.1.0-jre` is chosen because Spark uses `33.1.0-jre` in other places, 
and after this change, we can unify all Guava versions.
   
   OK





Re: [PR] [SPARK-48251][BUILD] Disable `maven local cache` on GA's step `MIMA test` [spark]

2024-05-15 Thread via GitHub


yaooqinn commented on PR #46551:
URL: https://github.com/apache/spark/pull/46551#issuecomment-2113936531

   If this might increase the $ cost of ASF INFRA, let's put it on hold until 
the upstream bug is fixed. I second @dongjoon-hyun's concern.





Re: [PR] [SPARK-48295][PS] Turn on `compute.ops_on_diff_frames` by default [spark]

2024-05-15 Thread via GitHub


HyukjinKwon commented on PR #46602:
URL: https://github.com/apache/spark/pull/46602#issuecomment-2113931789

   Merged to master.





Re: [PR] [SPARK-48295][PS] Turn on `compute.ops_on_diff_frames` by default [spark]

2024-05-15 Thread via GitHub


HyukjinKwon closed pull request #46602: [SPARK-48295][PS] Turn on 
`compute.ops_on_diff_frames` by default
URL: https://github.com/apache/spark/pull/46602





[PR] [SPARK-48297][SQL] Fix a regression TRANSFORM clause with char/varchar [spark]

2024-05-15 Thread via GitHub


yaooqinn opened a new pull request, #46603:
URL: https://github.com/apache/spark/pull/46603

   
   
   ### What changes were proposed in this pull request?
   
   TRANSFORM with char/varchar has been accidentally broken since 3.1, failing with a scala.MatchError; this PR fixes it.
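   
   A minimal reproduction sketch (hypothetical table name; the exact failing query shape is inferred from the description above):
   
   ```scala
   // Assumes a session where script transformation is available
   // (no-serde mode or Hive support).
   spark.sql("CREATE TABLE t (c CHAR(5)) USING hive")
   spark.sql("INSERT INTO t VALUES ('spark')")
   // Before this fix, TRANSFORM over a char/varchar column failed with
   // scala.MatchError; with it, the query runs like any other column type.
   spark.sql("SELECT TRANSFORM(c) USING 'cat' AS (c) FROM t").show()
   ```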
   
   ### Why are the changes needed?
   
   
   bugfix
   ### Does this PR introduce _any_ user-facing change?
   
   no
   
   ### How was this patch tested?
   
   new tests
   
   ### Was this patch authored or co-authored using generative AI tooling?
   
   no





Re: [PR] [SPARK-48296][SQL] Codegen Support for `to_xml` [spark]

2024-05-15 Thread via GitHub


panbingkun commented on code in PR #46591:
URL: https://github.com/apache/spark/pull/46591#discussion_r1602509705


##
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/xmlExpressions.scala:
##
@@ -186,9 +186,6 @@ case class SchemaOfXml(
   @transient
   private lazy val xmlOptions = new XmlOptions(options, "UTC")
 
-  @transient
-  private lazy val xmlFactory = xmlOptions.buildXmlFactory()

Review Comment:
   This variable `xmlFactory` is not used in the expression `SchemaOfXml`, so it is deleted here. It is not strictly related to this PR.






Re: [PR] [SPARK-48289][DOCKER][TEST] Clean up Oracle JDBC tests by skipping redundant SYSTEM password reset [spark]

2024-05-15 Thread via GitHub


yaooqinn commented on PR #46598:
URL: https://github.com/apache/spark/pull/46598#issuecomment-2113835891

   Thanks @LucaCanali @dongjoon-hyun 
   
   Merged to master





Re: [PR] [SPARK-48289][DOCKER][TEST] Clean up Oracle JDBC tests by skipping redundant SYSTEM password reset [spark]

2024-05-15 Thread via GitHub


yaooqinn closed pull request #46598: [SPARK-48289][DOCKER][TEST] Clean up 
Oracle JDBC tests by skipping redundant SYSTEM password reset
URL: https://github.com/apache/spark/pull/46598





Re: [PR] [SPARK-48289][DOCKER][TEST] Clean up Oracle JDBC tests by skipping redundant SYSTEM password reset [spark]

2024-05-15 Thread via GitHub


yaooqinn commented on code in PR #46598:
URL: https://github.com/apache/spark/pull/46598#discussion_r1602483027


##
connector/docker-integration-tests/src/test/scala/org/apache/spark/sql/jdbc/OracleIntegrationSuite.scala:
##
@@ -54,10 +54,10 @@ import org.apache.spark.tags.DockerTest
  * A sequence of commands to build the Oracle Database Free container image:
  *  $ git clone https://github.com/oracle/docker-images.git
  *  $ cd docker-images/OracleDatabase/SingleInstance/dockerfiles
- *  $ ./buildContainerImage.sh -v 23.2.0 -f
- *  $ export ORACLE_DOCKER_IMAGE_NAME=oracle/database:23.2.0-free
+ *  $ ./buildContainerImage.sh -v 23.4.0 -f
+ *  $ export ORACLE_DOCKER_IMAGE_NAME=oracle/database:23.4.0-free
  *
- * This procedure has been validated with Oracle Database Free version 23.2.0,
+ * This procedure has been validated with Oracle Database Free version 23.4.0,
  * and with Oracle Express Edition versions 18.4.0 and 21.3.0

Review Comment:
   ```suggestion
 * and with Oracle Express Edition versions 18.4.0 and 21.4.0
   ```






Re: [PR] [SPARK-48289][DOCKER][TEST] Clean up Oracle JDBC tests by skipping redundant SYSTEM password reset [spark]

2024-05-15 Thread via GitHub


yaooqinn commented on code in PR #46598:
URL: https://github.com/apache/spark/pull/46598#discussion_r1602482135


##
connector/docker-integration-tests/src/test/scala/org/apache/spark/sql/jdbc/v2/OracleIntegrationSuite.scala:
##
@@ -48,11 +48,11 @@ import org.apache.spark.tags.DockerTest
  *
  * A sequence of commands to build the Oracle Database Free container image:
  *  $ git clone https://github.com/oracle/docker-images.git
- *  $ cd docker-images/OracleDatabase/SingleInstance/dockerfiles
- *  $ ./buildContainerImage.sh -v 23.2.0 -f
- *  $ export ORACLE_DOCKER_IMAGE_NAME=oracle/database:23.2.0-free
+ *  $ cd docker-images/OracleDatabase/SingleInstance/dockerfiles0
+ *  $ ./buildContainerImage.sh -v 23.4.0 -f
+ *  $ export ORACLE_DOCKER_IMAGE_NAME=oracle/database:23.4.0-free
  *
- * This procedure has been validated with Oracle Database Free version 23.2.0,
+ * This procedure has been validated with Oracle Database Free version 23.4.0,
  * and with Oracle Express Edition versions 18.4.0 and 21.3.0

Review Comment:
   ```suggestion
 * and with Oracle Express Edition versions 18.4.0 and 21.4.0
   ```






Re: [PR] [SPARK-48289][DOCKER][TEST] Clean up Oracle JDBC tests by skipping redundant SYSTEM password reset [spark]

2024-05-15 Thread via GitHub


yaooqinn commented on code in PR #46598:
URL: https://github.com/apache/spark/pull/46598#discussion_r1602481532


##
connector/docker-integration-tests/src/test/scala/org/apache/spark/sql/jdbc/v2/OracleNamespaceSuite.scala:
##
@@ -46,10 +46,10 @@ import org.apache.spark.tags.DockerTest
  * A sequence of commands to build the Oracle Database Free container image:
  *  $ git clone https://github.com/oracle/docker-images.git
  *  $ cd docker-images/OracleDatabase/SingleInstance/dockerfiles
- *  $ ./buildContainerImage.sh -v 23.2.0 -f
- *  $ export ORACLE_DOCKER_IMAGE_NAME=oracle/database:23.2.0-free
+ *  $ ./buildContainerImage.sh -v 23.4.0 -f
+ *  $ export ORACLE_DOCKER_IMAGE_NAME=oracle/database:23.4.0-free
  *
- * This procedure has been validated with Oracle Database Free version 23.2.0,
+ * This procedure has been validated with Oracle Database Free version 23.4.0,
  * and with Oracle Express Edition versions 18.4.0 and 21.3.0

Review Comment:
   ```suggestion
 * and with Oracle Express Edition versions 18.4.0 and 21.4.0
   ```






Re: [PR] [SPARK-48268][CORE] Add a configuration for SparkContext.setCheckpointDir [spark]

2024-05-15 Thread via GitHub


HyukjinKwon commented on code in PR #46571:
URL: https://github.com/apache/spark/pull/46571#discussion_r1602450604


##
core/src/main/scala/org/apache/spark/SparkContext.scala:
##
@@ -374,6 +374,7 @@ class SparkContext(config: SparkConf) extends Logging {
   private[spark] def cleaner: Option[ContextCleaner] = _cleaner
 
   private[spark] var checkpointDir: Option[String] = None
+  config.getOption(CHECKPOINT_DIR.key).foreach(setCheckpointDir)

Review Comment:
   and adding a test too






Re: [PR] [SPARK-48268][CORE] Add a configuration for SparkContext.setCheckpointDir [spark]

2024-05-15 Thread via GitHub


HyukjinKwon commented on code in PR #46571:
URL: https://github.com/apache/spark/pull/46571#discussion_r1602449365


##
core/src/main/scala/org/apache/spark/SparkContext.scala:
##
@@ -374,6 +374,7 @@ class SparkContext(config: SparkConf) extends Logging {
   private[spark] def cleaner: Option[ContextCleaner] = _cleaner
 
   private[spark] var checkpointDir: Option[String] = None
+  config.getOption(CHECKPOINT_DIR.key).foreach(setCheckpointDir)

Review Comment:
   Let me just put it there together.






Re: [PR] [SPARK-48252][SQL] Update CommonExpressionRef when necessary [spark]

2024-05-15 Thread via GitHub


cloud-fan closed pull request #46552: [SPARK-48252][SQL] Update 
CommonExpressionRef when necessary
URL: https://github.com/apache/spark/pull/46552





Re: [PR] [SPARK-48252][SQL] Update CommonExpressionRef when necessary [spark]

2024-05-15 Thread via GitHub


cloud-fan commented on PR #46552:
URL: https://github.com/apache/spark/pull/46552#issuecomment-2113761762

   The last commit just updates the test name, so we don't need to re-test. I'm merging it to master. Thanks for the review!





Re: [PR] [SPARK-48252][SQL] Update CommonExpressionRef when necessary [spark]

2024-05-15 Thread via GitHub


cloud-fan commented on PR #46552:
URL: https://github.com/apache/spark/pull/46552#issuecomment-2113758865

   oh, I accepted this code-change comment on GitHub, but later I pushed a new commit from local which discarded this change...





Re: [PR] [SPARK-48268][CORE] Add a configuration for SparkContext.setCheckpointDir [spark]

2024-05-15 Thread via GitHub


zhengruifeng commented on code in PR #46571:
URL: https://github.com/apache/spark/pull/46571#discussion_r1602443672


##
core/src/main/scala/org/apache/spark/SparkContext.scala:
##
@@ -374,6 +374,7 @@ class SparkContext(config: SparkConf) extends Logging {
   private[spark] def cleaner: Option[ContextCleaner] = _cleaner
 
   private[spark] var checkpointDir: Option[String] = None
+  config.getOption(CHECKPOINT_DIR.key).foreach(setCheckpointDir)

Review Comment:
   or probably make `checkpointDir` lazy?






Re: [PR] [DO-NOT-MERGE][SPARK-48258][PYTHON][CONNECT] Checkpoint and localCheckpoint in Spark Connect [spark]

2024-05-15 Thread via GitHub


HyukjinKwon commented on code in PR #46570:
URL: https://github.com/apache/spark/pull/46570#discussion_r1602437890


##
python/pyspark/sql/tests/test_dataframe.py:
##
@@ -844,6 +844,11 @@ def test_union_classmethod_usage(self):
     def test_isinstance_dataframe(self):
         self.assertIsInstance(self.spark.range(1), DataFrame)
 
+    def test_checkpoint_dataframe(self):

Review Comment:
   I enabled some more tests that were skipped






Re: [PR] [SPARK-48295][PS] Turn on `compute.ops_on_diff_frames` by default [spark]

2024-05-15 Thread via GitHub


zhengruifeng commented on code in PR #46602:
URL: https://github.com/apache/spark/pull/46602#discussion_r1602430705


##
python/pyspark/pandas/frame.py:
##
@@ -491,7 +491,8 @@ class DataFrame(Frame, Generic[T]):
 
     >>> import pandas as pd
     >>> sdf = spark.createDataFrame([("Data", 1), ("Bricks", 2)], ["x", "y"])
-    >>> ps.DataFrame(data=sdf, index=pd.Index([0, 1, 2]))
+    >>> with ps.option_context("compute.ops_on_diff_frames", False):
+    ...     ps.DataFrame(data=sdf, index=pd.Index([0, 1, 2]))

Review Comment:
   I am not sure, @HyukjinKwon wdyt?






Re: [PR] [DO-NOT-MERGE][SPARK-48258][PYTHON][CONNECT] Checkpoint and localCheckpoint in Spark Connect [spark]

2024-05-15 Thread via GitHub


HyukjinKwon commented on code in PR #46570:
URL: https://github.com/apache/spark/pull/46570#discussion_r1602427305


##
connector/connect/common/src/main/protobuf/spark/connect/base.proto:
##
@@ -199,6 +200,17 @@ message AnalyzePlanRequest {
 // (Required) The logical plan to get the storage level.
 Relation relation = 1;
   }
+
+  message Checkpoint {

Review Comment:
   Actually, taking a look at `ExecutePlanRequest`/`ExecutePlanResponse`, I think it's better to keep it in `AnalyzePlanRequest`. `ExecutePlanResponse` assumes more that the output is actual output values, such as Arrow batches, maps, etc., after the actual execution.
   
   I think it's better to make `RemoveCachedRemoteRelationCommand` a separate command that is not dedicated to `checkpoint`. It can be used in other places if we happen to reuse `CachedRemoteRelationCommand` somewhere.






Re: [PR] [DO-NOT-MERGE][SPARK-48258][PYTHON][CONNECT] Checkpoint and localCheckpoint in Spark Connect [spark]

2024-05-15 Thread via GitHub


HyukjinKwon commented on code in PR #46570:
URL: https://github.com/apache/spark/pull/46570#discussion_r1602427305


##
connector/connect/common/src/main/protobuf/spark/connect/base.proto:
##
@@ -199,6 +200,17 @@ message AnalyzePlanRequest {
 // (Required) The logical plan to get the storage level.
 Relation relation = 1;
   }
+
+  message Checkpoint {

Review Comment:
   Actually, taking a look at `ExecutePlanRequest`/`ExecutePlanResponse`, I think it's better to keep it in `AnalyzePlanRequest`. `ExecutePlanResponse` assumes that the output is actual output values such as Arrow batches, maps, etc., whereas `AnalyzePlanResponse` can assume it's a DataFrame. This is also consistent with `persist`/`cache`.
   
   I think it's better to make `RemoveCachedRemoteRelationCommand` a separate command that is not dedicated to `checkpoint`. It can be used in other places if we happen to reuse `CachedRemoteRelationCommand` somewhere.






Re: [PR] [SPARK-46090][SQL] Support plan fragment level SQL configs in AQE [spark]

2024-05-15 Thread via GitHub


ulysses-you commented on PR #44013:
URL: https://github.com/apache/spark/pull/44013#issuecomment-2113733794

   @cloud-fan any concerns with landing this in 4.0?





Re: [PR] [SPARK-48268][CORE] Add a configuration for SparkContext.setCheckpointDir [spark]

2024-05-15 Thread via GitHub


mridulm commented on code in PR #46571:
URL: https://github.com/apache/spark/pull/46571#discussion_r1602415826


##
core/src/main/scala/org/apache/spark/SparkContext.scala:
##
@@ -374,6 +374,7 @@ class SparkContext(config: SparkConf) extends Logging {
   private[spark] def cleaner: Option[ContextCleaner] = _cleaner
 
   private[spark] var checkpointDir: Option[String] = None
+  config.getOption(CHECKPOINT_DIR.key).foreach(setCheckpointDir)

Review Comment:
   We should add a test - this NPE would have surfaced there.
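   
   For illustration, a sketch of the kind of test being asked for (the `spark.checkpoint.dir` key and the test scaffolding are assumed from this PR's context):
   
   ```scala
   test("SPARK-48268: spark.checkpoint.dir configures the checkpoint directory") {
     val conf = new SparkConf()
       .setMaster("local")
       .setAppName("checkpoint-dir-conf")
       .set("spark.checkpoint.dir", Utils.createTempDir().getCanonicalPath)
     // Constructing the context would have hit the NPE before the reordering.
     sc = new SparkContext(conf)
     assert(sc.getCheckpointDir.isDefined)
   }
   ```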






Re: [PR] [SPARK-48268][CORE] Add a configuration for SparkContext.setCheckpointDir [spark]

2024-05-15 Thread via GitHub


mridulm commented on code in PR #46571:
URL: https://github.com/apache/spark/pull/46571#discussion_r1602414776


##
core/src/main/scala/org/apache/spark/SparkContext.scala:
##
@@ -374,6 +374,7 @@ class SparkContext(config: SparkConf) extends Logging {
   private[spark] def cleaner: Option[ContextCleaner] = _cleaner
 
   private[spark] var checkpointDir: Option[String] = None
+  config.getOption(CHECKPOINT_DIR.key).foreach(setCheckpointDir)

Review Comment:
   We should do this only after `_hadoopConfiguration` has been initialized.
   Move it to after line [535 
here](https://github.com/apache/spark/blob/f40ffe0f97a2e753878af8f8f47038272a1e1657/core/src/main/scala/org/apache/spark/SparkContext.scala#L535)
 ?






Re: [PR] [SPARK-48268][CORE] Add a configuration for SparkContext.setCheckpointDir [spark]

2024-05-15 Thread via GitHub


mridulm commented on code in PR #46571:
URL: https://github.com/apache/spark/pull/46571#discussion_r1602414644


##
core/src/main/scala/org/apache/spark/SparkContext.scala:
##
@@ -374,6 +374,7 @@ class SparkContext(config: SparkConf) extends Logging {
   private[spark] def cleaner: Option[ContextCleaner] = _cleaner
 
   private[spark] var checkpointDir: Option[String] = None
+  config.getOption(CHECKPOINT_DIR.key).foreach(setCheckpointDir)

Review Comment:
   We should do this only after `_hadoopConfiguration` has been initialized.
   Move it to after line [535 
here](https://github.com/apache/spark/blob/f40ffe0f97a2e753878af8f8f47038272a1e1657/core/src/main/scala/org/apache/spark/SparkContext.scala#L535)
 ?








Re: [PR] [SPARK-48264][BUILD] Upgrade `datasketches-java` to 6.0.0 [spark]

2024-05-15 Thread via GitHub


panbingkun commented on PR #46563:
URL: https://github.com/apache/spark/pull/46563#issuecomment-2113689996

   cc @dongjoon-hyun 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-46956][SQL] Improve the error prompt when `SaveMode` is null in API `DataFrameWriter.mode` [spark]

2024-05-15 Thread via GitHub


github-actions[bot] closed pull request #44999: [SPARK-46956][SQL] Improve the 
error prompt when `SaveMode` is null in API `DataFrameWriter.mode`
URL: https://github.com/apache/spark/pull/44999


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [WIP][Spark 44646] Reduce usage of log4j core [spark]

2024-05-15 Thread via GitHub


github-actions[bot] commented on PR #45001:
URL: https://github.com/apache/spark/pull/45001#issuecomment-2113687236

   We're closing this PR because it hasn't been updated in a while. This isn't 
a judgement on the merit of the PR in any way. It's just a way of keeping the 
PR queue manageable.
   If you'd like to revive this PR, please reopen it and ask a committer to 
remove the Stale tag!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-42789][SQL] Rewrite multiple GetJsonObject that consumes same JSON to single JsonTuple [spark]

2024-05-15 Thread via GitHub


github-actions[bot] commented on PR #45020:
URL: https://github.com/apache/spark/pull/45020#issuecomment-2113687220

   We're closing this PR because it hasn't been updated in a while. This isn't 
a judgement on the merit of the PR in any way. It's just a way of keeping the 
PR queue manageable.
   If you'd like to revive this PR, please reopen it and ask a committer to 
remove the Stale tag!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-48214][INFRA] Ban import `org.slf4j.Logger` & `org.slf4j.LoggerFactory` [spark]

2024-05-15 Thread via GitHub


panbingkun commented on PR #46502:
URL: https://github.com/apache/spark/pull/46502#issuecomment-2113686449

   > @panbingkun let's hold on this until we have a conclusion in #46600
   
   @gengliangwang 
   After #46600 was merged, this PR has been updated.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-48295][PS] Turn on `compute.ops_on_diff_frames` by default [spark]

2024-05-15 Thread via GitHub


itholic commented on code in PR #46602:
URL: https://github.com/apache/spark/pull/46602#discussion_r1602388729


##
python/pyspark/pandas/frame.py:
##
@@ -491,7 +491,8 @@ class DataFrame(Frame, Generic[T]):
 
 >>> import pandas as pd
 >>> sdf = spark.createDataFrame([("Data", 1), ("Bricks", 2)], ["x", "y"])
->>> ps.DataFrame(data=sdf, index=pd.Index([0, 1, 2]))
+>>> with ps.option_context("compute.ops_on_diff_frames", False):
+... ps.DataFrame(data=sdf, index=pd.Index([0, 1, 2]))

Review Comment:
   I think maybe now we can just remove the negative cases from the doctests 
and keep only UTs??



##
python/pyspark/pandas/frame.py:
##
@@ -509,7 +510,8 @@ class DataFrame(Frame, Generic[T]):
 
 >>> import pandas as pd
 >>> sdf = spark.createDataFrame([("Data", 1), ("Bricks", 2)], ["x", "y"])
->>> ps.DataFrame(data=sdf, index=ps.Index([0, 1, 2]))
+>>> with ps.option_context("compute.ops_on_diff_frames", False):
+... ps.DataFrame(data=sdf, index=ps.Index([0, 1, 2]))

Review Comment:
   ditto and all belows?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-48295][PS] Turn on `compute.ops_on_diff_frames` by default [spark]

2024-05-15 Thread via GitHub


zhengruifeng commented on PR #46602:
URL: https://github.com/apache/spark/pull/46602#issuecomment-2113677192

   will add it to release note soon


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[PR] [SPARK-48295][PS] Turn on `compute.ops_on_diff_frames` by default [spark]

2024-05-15 Thread via GitHub


zhengruifeng opened a new pull request, #46602:
URL: https://github.com/apache/spark/pull/46602

   ### What changes were proposed in this pull request?
   Turn on `compute.ops_on_diff_frames` by default
   
   
   ### Why are the changes needed?
   1. In most cases, this config needs to be turned on to enable computation 
across different DataFrames (see the sketch below);
   2. Enabling `compute.ops_on_diff_frames` should not break any workloads; it 
only enables more of them.
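   
   For illustration, a minimal sketch of the kind of cross-frame operation this 
config gates (the data is made up; `ps.option_context` mirrors the doctests 
elsewhere in this thread):
   
   ```python
   import pyspark.pandas as ps

   psdf1 = ps.DataFrame({"a": [1, 2, 3]})
   psdf2 = ps.DataFrame({"a": [10, 20, 30]})

   # Combining columns from two different DataFrames is gated by
   # compute.ops_on_diff_frames; with this PR it works by default.
   with ps.option_context("compute.ops_on_diff_frames", True):
       print((psdf1.a + psdf2.a).sort_index())
   ```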
   
   
   ### Does this PR introduce _any_ user-facing change?
   yes, this config is turned on by default
   
   
   ### How was this patch tested?
   updated tests
   
   
   ### Was this patch authored or co-authored using generative AI tooling?
   no


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-48291][Core] Rename Java Logger as SparkLogger [spark]

2024-05-15 Thread via GitHub


gengliangwang closed pull request #46600: [SPARK-48291][Core] Rename Java 
Logger as SparkLogger
URL: https://github.com/apache/spark/pull/46600


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-48291][Core] Rename Java Logger as SparkLogger [spark]

2024-05-15 Thread via GitHub


gengliangwang commented on PR #46600:
URL: https://github.com/apache/spark/pull/46600#issuecomment-2113658297

   @dongjoon-hyun @panbingkun Thanks for the review. Merging this one to master


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-48268][CORE] Add a configuration for SparkContext.setCheckpointDir [spark]

2024-05-15 Thread via GitHub


HyukjinKwon commented on PR #46571:
URL: https://github.com/apache/spark/pull/46571#issuecomment-2113647596

   > Also, I'm wondering if we want to show a proper warning or even to raise 
exceptions because this is a critical mistake.
   
   This one, I think it's fine. We already have similar configurations such as 
`spark.log.level` vs `setLogLevel`.
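   
   For reference, a minimal sketch of the two equivalent ways to set the 
checkpoint directory once this lands (the key `spark.checkpoint.dir` is taken 
from this PR; the path is illustrative):
   
   ```python
   from pyspark.sql import SparkSession

   spark = (
       SparkSession.builder
       .master("local[*]")
       # New: set the checkpoint directory declaratively, like spark.log.level.
       .config("spark.checkpoint.dir", "/tmp/spark-checkpoints")
       .getOrCreate()
   )

   # Existing programmatic equivalent:
   spark.sparkContext.setCheckpointDir("/tmp/spark-checkpoints")
   ```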
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-48291][Core] Rename Java Logger as SparkLogger [spark]

2024-05-15 Thread via GitHub


gengliangwang commented on PR #46600:
URL: https://github.com/apache/spark/pull/46600#issuecomment-2113647415

   > Of course, if we need to be very explicit about this difference
   
   Yes that's the purpose of this PR. I want the class names to be clear.
   
   > Maybe we need to add some instructions for developers in 
org/apache/spark/internal/README.md
   
   Good point. I will have a follow-up for both scala and java logging.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-48291][Core] Rename Java Logger as SparkLogger [spark]

2024-05-15 Thread via GitHub


panbingkun commented on PR #46600:
URL: https://github.com/apache/spark/pull/46600#issuecomment-2113630518

   The intent behind the original name was to make the change feel smaller and 
more natural to developers. Of course, if we need to be very explicit about 
this difference, I'm fine with that. Maybe we should add some instructions for 
developers in `org/apache/spark/internal/README.md`?
   
https://github.com/apache/spark/blob/a2d93d104a6c3307f50dfd91f5cc51a97d4c5837/common/utils/src/main/scala/org/apache/spark/internal/README.md?plain=1#L13


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-48220][PYTHON] Allow passing PyArrow Table to createDataFrame() [spark]

2024-05-15 Thread via GitHub


alippai commented on PR #46529:
URL: https://github.com/apache/spark/pull/46529#issuecomment-2113428373

   I wish we had ns support in Spark 4.0, since Java, Parquet, Arrow, etc. use 
it natively now, but that's certainly a different discussion and a different 
set of MRs :)
   
   Thanks for the new API!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-48220][PYTHON] Allow passing PyArrow Table to createDataFrame() [spark]

2024-05-15 Thread via GitHub


ianmcook commented on PR #46529:
URL: https://github.com/apache/spark/pull/46529#issuecomment-2113363928

   > What will happen with the nanosecond timestamps? Truncated to milliseconds?
   
   Yes, truncated to milliseconds for now.
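   
   A minimal sketch of the new API under discussion, assuming `createDataFrame()` 
accepts a `pyarrow.Table` directly as this PR proposes:
   
   ```python
   import pyarrow as pa
   from pyspark.sql import SparkSession

   spark = SparkSession.builder.master("local[*]").getOrCreate()

   # Build a small Arrow table in memory.
   table = pa.table({"id": [1, 2, 3], "name": ["a", "b", "c"]})

   # With SPARK-48220, the table can be passed straight to createDataFrame().
   df = spark.createDataFrame(table)
   df.show()
   ```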


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-48017] Add Spark application submission worker for operator [spark-kubernetes-operator]

2024-05-15 Thread via GitHub


jiangzho commented on code in PR #10:
URL: 
https://github.com/apache/spark-kubernetes-operator/pull/10#discussion_r1602137497


##
gradle.properties:
##
@@ -18,17 +18,23 @@
 group=org.apache.spark.k8s.operator
 version=0.1.0
 
+# Caution: fabric8 version should be aligned with Spark dependency
 fabric8Version=6.12.1
 commonsLang3Version=3.14.0
 commonsIOVersion=2.16.1
 lombokVersion=1.18.32
 
+#Spark
+scalaVersion=2.13
+sparkVersion=4.0.0-preview1

Review Comment:
   @dongjoon-hyun - can we use RC1 for the submission worker for now, in order 
to unblock the operator module?
   
   I'll be more than happy to upgrade and fix any incompatibilities when RC2 is 
ready.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[PR] [SPARK-48293][SS] Add test for when ForeachBatchUserFuncException wraps interrupted exception due to query stop [spark]

2024-05-15 Thread via GitHub


micheal-o opened a new pull request, #46601:
URL: https://github.com/apache/spark/pull/46601

   
   
   ### What changes were proposed in this pull request?
   Add test for when ForeachBatchUserFuncException wraps interrupted exception 
due to query stop
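   
   For context, a minimal PySpark sketch of the `foreachBatch` pattern this test 
exercises (illustrative only; the rate source and noop sink are assumptions, 
not the actual test code):
   
   ```python
   from pyspark.sql import SparkSession

   spark = SparkSession.builder.master("local[*]").getOrCreate()

   def handle_batch(batch_df, batch_id):
       # User code here can be interrupted when the query is stopped; the
       # resulting error is wrapped in ForeachBatchUserFuncException.
       batch_df.write.format("noop").mode("append").save()

   query = (
       spark.readStream.format("rate").load()
       .writeStream.foreachBatch(handle_batch)
       .start()
   )
   query.stop()  # stopping mid-batch may interrupt the user function
   ```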
   
   
   ### Why are the changes needed?
   test
   
   
   ### Does this PR introduce _any_ user-facing change?
   No
   
   
   ### How was this patch tested?
   test added
   
   
   ### Was this patch authored or co-authored using generative AI tooling?
   No
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-48289][DOCKER][TEST] Clean up Oracle JDBC tests by skipping redundant SYSTEM password reset [spark]

2024-05-15 Thread via GitHub


dongjoon-hyun commented on PR #46598:
URL: https://github.com/apache/spark/pull/46598#issuecomment-2113214338

   Feel free to merge, @yaooqinn .


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-48214][INFRA] Ban import `org.slf4j.Logger` & `org.slf4j.LoggerFactory` [spark]

2024-05-15 Thread via GitHub


gengliangwang commented on PR #46502:
URL: https://github.com/apache/spark/pull/46502#issuecomment-2113168819

   @panbingkun let's hold on this until we have a conclusion in 
https://github.com/apache/spark/pull/46600


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-48291][Core] Rename Java Logger as SparkLogger [spark]

2024-05-15 Thread via GitHub


gengliangwang commented on PR #46600:
URL: https://github.com/apache/spark/pull/46600#issuecomment-2113167651

   cc @panbingkun 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[PR] [SPARK-48291][Core] Rename Java Logger as SparkLogger [spark]

2024-05-15 Thread via GitHub


gengliangwang opened a new pull request, #46600:
URL: https://github.com/apache/spark/pull/46600

   
   
   ### What changes were proposed in this pull request?
   
   Two new classes `org.apache.spark.internal.Logger` and 
`org.apache.spark.internal.LoggerFactory` were introduced from 
https://github.com/apache/spark/pull/46301.
   Given that `Logger` is a widely recognized **interface** in Log4j, having a 
class with the same name may cause confusion. To avoid this and to clarify its 
purpose within the Spark framework, I propose renaming 
`org.apache.spark.internal.Logger` to `org.apache.spark.internal.SparkLogger`. 
Similarly, to maintain consistency, `org.apache.spark.internal.LoggerFactory` 
should be renamed to `org.apache.spark.internal.SparkLoggerFactory`.
   
   
   ### Why are the changes needed?
   
   
   To avoid naming confusion and to clarify the purpose of the Java Spark 
logger within the logging framework
   
   
   ### Does this PR introduce _any_ user-facing change?
   
   No
   
   ### How was this patch tested?
   
   
   GA tests
   ### Was this patch authored or co-authored using generative AI tooling?
   
   No


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-39195][SQL] Spark OutputCommitCoordinator should abort stage when committed file not consistent with task status [spark]

2024-05-15 Thread via GitHub


viirya commented on PR #36564:
URL: https://github.com/apache/spark/pull/36564#issuecomment-2113160108

   I created https://issues.apache.org/jira/browse/SPARK-48292


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-39195][SQL] Spark OutputCommitCoordinator should abort stage when committed file not consistent with task status [spark]

2024-05-15 Thread via GitHub


viirya commented on PR #36564:
URL: https://github.com/apache/spark/pull/36564#issuecomment-2113149367

   > To @AngersZh , @viirya , @cloud-fan . Since this PR is too old, shall 
we file a JIRA issue for that comment, [#36564 
(comment)](https://github.com/apache/spark/pull/36564#discussion_r1598660630) ?
   
   Yea, I think so.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-48260][SQL] Disable output committer coordination in one test of ParquetIOSuite [spark]

2024-05-15 Thread via GitHub


gengliangwang commented on PR #46562:
URL: https://github.com/apache/spark/pull/46562#issuecomment-2113122398

   @dongjoon-hyun From the recent tests of 
branch-3.5(https://github.com/apache/spark/commits/branch-3.5/), the 
ParquetIOSuite looks fine for now. I am ok with either backporting or not 
backporting.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-39195][SQL] Spark OutputCommitCoordinator should abort stage when committed file not consistent with task status [spark]

2024-05-15 Thread via GitHub


dongjoon-hyun commented on PR #36564:
URL: https://github.com/apache/spark/pull/36564#issuecomment-2113120262

   To @AngersZh , @viirya , @cloud-fan . 
   Since this PR is too old, shall we file a JIRA issue for that comment, 
https://github.com/apache/spark/pull/36564#discussion_r1598660630 ?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-48260][SQL] Disable output committer coordination in one test of ParquetIOSuite [spark]

2024-05-15 Thread via GitHub


dongjoon-hyun commented on PR #46562:
URL: https://github.com/apache/spark/pull/46562#issuecomment-2113116084

   Hi, @gengliangwang and all.
   
   Do you think we can have this at release branches too?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-48289][DOCKER][TEST] Clean up Oracle JDBC tests by skipping redundant SYSTEM password reset [spark]

2024-05-15 Thread via GitHub


dongjoon-hyun commented on PR #46598:
URL: https://github.com/apache/spark/pull/46598#issuecomment-2113028851

   cc @yaooqinn 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-48268][CORE] Add a configuration for SparkContext.setCheckpointDir [spark]

2024-05-15 Thread via GitHub


dongjoon-hyun commented on code in PR #46571:
URL: https://github.com/apache/spark/pull/46571#discussion_r1601964897


##
core/src/main/scala/org/apache/spark/internal/config/package.scala:
##
@@ -1317,6 +1317,14 @@ package object config {
   s" be less than or equal to 
${ByteArrayMethods.MAX_ROUNDED_ARRAY_LENGTH}.")
   .createWithDefault(64 * 1024 * 1024)
 
+  private[spark] val CHECKPOINT_DIR =
+ConfigBuilder("spark.checkpoint.dir")
+  .doc("Equivalent with SparkContext.setCheckpointDir. If set, the path 
becomes " +
+"the directory for checkpointing.")

Review Comment:
   Could you add this to `docs/configuration.md`?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-48256][BUILD] Add a rule to check file headers for the java side, and fix inconsistent files [spark]

2024-05-15 Thread via GitHub


dongjoon-hyun closed pull request #46557: [SPARK-48256][BUILD] Add a rule to 
check file headers for the java side, and fix inconsistent files
URL: https://github.com/apache/spark/pull/46557


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-48288] Add source data type for connector cast expression [spark]

2024-05-15 Thread via GitHub


urosstan-db commented on code in PR #46596:
URL: https://github.com/apache/spark/pull/46596#discussion_r1601958458


##
sql/catalyst/src/main/java/org/apache/spark/sql/connector/expressions/Cast.java:
##
@@ -29,14 +29,31 @@
 @Evolving
 public class Cast extends ExpressionWithToString {
   private Expression expression;
+
+  /**
+   * Original data type of given expression
+   */
+  private DataType expressionDataType;
+
+  /**
+   * Target data type, i.e. data type in which expression will be cast
+   */
   private DataType dataType;
 
+  @Deprecated

Review Comment:
   I agree with you



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-48288] Add source data type for connector cast expression [spark]

2024-05-15 Thread via GitHub


urosstan-db commented on code in PR #46596:
URL: https://github.com/apache/spark/pull/46596#discussion_r1601958035


##
sql/catalyst/src/main/java/org/apache/spark/sql/connector/util/V2ExpressionSQLBuilder.java:
##
@@ -231,10 +231,15 @@ protected String visitBinaryArithmetic(String name, 
String l, String r) {
 return l + " " + name + " " + r;
   }
 
+  @Deprecated
   protected String visitCast(String l, DataType dataType) {

Review Comment:
   Thanks, will be done.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-48252][SQL] Update CommonExpressionRef when necessary [spark]

2024-05-15 Thread via GitHub


dongjoon-hyun commented on PR #46552:
URL: https://github.com/apache/spark/pull/46552#issuecomment-2113011265

   Hi, @cloud-fan .
   
   It seems that @viirya 's comment is closed without addressing it. Could you 
double-check it?
   - https://github.com/apache/spark/pull/46552#discussion_r1598626620


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-48218][CORE] TransportClientFactory.createClient may NPE cause FetchFailedException [spark]

2024-05-15 Thread via GitHub


dongjoon-hyun commented on PR #46506:
URL: https://github.com/apache/spark/pull/46506#issuecomment-2112993768

   Merged to master for Apache Spark 4.0.0-preview2-rc2.
   
   Thank you, @cxzl25 , @mridulm , @yaooqinn .


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-48218][CORE] TransportClientFactory.createClient may NPE cause FetchFailedException [spark]

2024-05-15 Thread via GitHub


dongjoon-hyun closed pull request #46506: [SPARK-48218][CORE] 
TransportClientFactory.createClient may NPE cause FetchFailedException
URL: https://github.com/apache/spark/pull/46506


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-48049][BUILD] Upgrade Scala to 2.13.14 [spark]

2024-05-15 Thread via GitHub


dongjoon-hyun closed pull request #46288: [SPARK-48049][BUILD] Upgrade Scala to 
2.13.14
URL: https://github.com/apache/spark/pull/46288


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-48285][SQL][DOCS] Update docs for size function and sizeOfNull configuration [spark]

2024-05-15 Thread via GitHub


dongjoon-hyun commented on PR #46592:
URL: https://github.com/apache/spark/pull/46592#issuecomment-2112983873

   Merged to master for Apache Spark 4.0.0-preview1-rc2.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-48285][SQL][DOCS] Update docs for size function and sizeOfNull configuration [spark]

2024-05-15 Thread via GitHub


dongjoon-hyun closed pull request #46592: [SPARK-48285][SQL][DOCS] Update docs 
for size function and sizeOfNull configuration
URL: https://github.com/apache/spark/pull/46592


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-47429][SQL] Rename error class to condition [spark]

2024-05-15 Thread via GitHub


nchammas commented on PR #46543:
URL: https://github.com/apache/spark/pull/46543#issuecomment-2112900349

   @MaxGekk @HyukjinKwon @cloud-fan - Before I finish up this PR, I'd like to 
check in with you briefly to make sure you like where this is going.
   
   The main idea in this PR is to deprecate public classes and parameters that 
use the old "error class" terminology and instead direct users to use "error 
condition". I'm trying wherever possible to do this in a backwards-compatible 
way.
   
   [Here's an example][ex] of the approach. In addition to being a bit gentler 
on users migrating to Spark 4.0, this will also help us avoid having to rename 
every instance of the `errorClass` or `error_class` parameters all across the 
codebase in one shot.
   
   Does this look good to you so far?
   
   In a similar vein, shall I keep incorrectly-named classes around but turn 
them into deprecated sub-classes of their correctly-named replacements? For 
example, `ErrorClassesJsonReader` would become a sub-class of 
`ErrorConditionsJsonReader`.
   
   [ex]: 
https://github.com/apache/spark/pull/46543/files#diff-d316cd1a4b3d10d5c8136adc63d0d56c74005a600bb966a1fd485347be1cd7e1R38-R49
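   
   To make the deprecation approach concrete, a hypothetical Python sketch of a 
backwards-compatible parameter alias (the function name and signature are 
invented for illustration and are not Spark's actual API):
   
   ```python
   import warnings

   def check_error(condition=None, *, errorClass=None):
       # Accept the old 'errorClass' keyword but steer callers to 'condition'.
       if errorClass is not None:
           warnings.warn(
               "'errorClass' is deprecated; use 'condition' instead.",
               FutureWarning,
           )
           if condition is None:
               condition = errorClass
       return condition
   ```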


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-48175][SQL][PYTHON] Store collation information in metadata and not in type for SER/DE [spark]

2024-05-15 Thread via GitHub


stefankandic commented on code in PR #46280:
URL: https://github.com/apache/spark/pull/46280#discussion_r1601881695


##
sql/api/src/main/scala/org/apache/spark/sql/types/StructField.scala:
##
@@ -63,7 +66,60 @@ case class StructField(
 ("name" -> name) ~
   ("type" -> dataType.jsonValue) ~
   ("nullable" -> nullable) ~
-  ("metadata" -> metadata.jsonValue)
+  ("metadata" -> metadataJson)
+  }
+
+  private def metadataJson: JValue = {
+val metadataJsonValue = metadata.jsonValue
+metadataJsonValue match {
+  case JObject(fields) if collationMetadata.nonEmpty =>
+val collationFields = collationMetadata.map(kv => kv._1 -> 
JString(kv._2)).toList
+JObject(fields :+ (DataType.COLLATIONS_METADATA_KEY -> 
JObject(collationFields)))
+
+  case _ => metadataJsonValue
+}
+  }
+
+  /** Map of field path to collation name. */
+  private lazy val collationMetadata: mutable.Map[String, String] = {
+val fieldToCollationMap = mutable.Map[String, String]()
+
+def visitRecursively(dt: DataType, path: String): Unit = dt match {
+  case at: ArrayType =>
+processDataType(at.elementType, path + ".element")
+
+  case mt: MapType =>
+processDataType(mt.keyType, path + ".key")
+processDataType(mt.valueType, path + ".value")
+
+  case st: StringType if isCollatedString(st) =>
+fieldToCollationMap(path) = collationName(st)
+
+  case _ =>
+}
+
+def processDataType(dt: DataType, path: String): Unit = {
+  if (isCollatedString(dt)) {
+fieldToCollationMap(path) = collationName(dt)
+  } else {
+visitRecursively(dt, path)
+  }
+}
+
+visitRecursively(dataType, name)

Review Comment:
   Each column has separate metadata, so map and string can never clash. Here 
is how it would look in JSON:
   
   ```json
   {
 "type": "struct",
 "fields": [
   {
 "name": "c1.key",
 "type": "string",
 "nullable": true,
 "metadata": {
   "__COLLATIONS": {
 "c1.key": "ICU.UNICODE_CI"
   }
 }
   },
   {
 "name": "c1",
 "type": {
   "type": "map",
   "keyType": "string",
   "valueType": "string",
   "valueContainsNull": true
 },
 "nullable": true,
 "metadata": {
   "__COLLATIONS": {
 "c1.key": "ICU.UNICODE"
   }
 }
   }
 ]
   }
   ```
   
   "c1.key" will mean different things in each column - for the first it's just 
the entire column and for the second it is the key of the map



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-48175][SQL][PYTHON] Store collation information in metadata and not in type for SER/DE [spark]

2024-05-15 Thread via GitHub


stefankandic commented on code in PR #46280:
URL: https://github.com/apache/spark/pull/46280#discussion_r1601878212


##
python/pyspark/sql/types.py:
##
@@ -1702,9 +1791,16 @@ def _parse_datatype_json_string(json_string: str) -> 
DataType:
 return _parse_datatype_json_value(json.loads(json_string))
 
 
-def _parse_datatype_json_value(json_value: Union[dict, str]) -> DataType:
+def _parse_datatype_json_value(
+json_value: Union[dict, str],
+fieldPath: str = "",
+collationsMap: Optional[Dict[str, str]] = None,
+) -> DataType:
 if not isinstance(json_value, dict):
 if json_value in _all_atomic_types.keys():
+if collationsMap is not None and fieldPath in collationsMap:
+collationName = collationsMap[fieldPath].split(".")[1]

Review Comment:
   sure!



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-48175][SQL][PYTHON] Store collation information in metadata and not in type for SER/DE [spark]

2024-05-15 Thread via GitHub


stefankandic commented on code in PR #46280:
URL: https://github.com/apache/spark/pull/46280#discussion_r1601839139


##
python/pyspark/sql/types.py:
##
@@ -876,30 +894,86 @@ def __init__(
 self.dataType = dataType
 self.nullable = nullable
 self.metadata = metadata or {}
+self._collationMetadata: Optional[Dict[str, str]] = None
 
 def simpleString(self) -> str:
 return "%s:%s" % (self.name, self.dataType.simpleString())
 
 def __repr__(self) -> str:
 return "StructField('%s', %s, %s)" % (self.name, self.dataType, 
str(self.nullable))
 
+def __eq__(self, other: Any) -> bool:
+# since collationMetadata is lazy evaluated we should not use it in 
equality check

Review Comment:
   I agree; recomputing this for every JSON call is annoying, but since we 
don't have the safety of immutable data as in Scala, perhaps we should avoid 
lazy evaluation.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-48175][SQL][PYTHON] Store collation information in metadata and not in type for SER/DE [spark]

2024-05-15 Thread via GitHub


stefankandic commented on code in PR #46280:
URL: https://github.com/apache/spark/pull/46280#discussion_r1601793860


##
python/pyspark/sql/tests/test_types.py:
##
@@ -549,6 +549,129 @@ def test_convert_list_to_str(self):
 self.assertEqual(df.count(), 1)
 self.assertEqual(df.head(), Row(name="[123]", income=120))
 
+def test_schema_with_collations_json_ser_de(self):
+from pyspark.sql.types import _parse_datatype_json_string
+
+unicode_collation = "UNICODE"
+
+simple_struct = StructType([StructField("c1", 
StringType(unicode_collation))])
+
+nested_struct = StructType([StructField("nested", simple_struct)])
+
+array_in_schema = StructType(
+[StructField("array", ArrayType(StringType(unicode_collation)))]
+)
+
+map_in_schema = StructType(
+[
+StructField(
+"map", MapType(StringType(unicode_collation), 
StringType(unicode_collation))
+)
+]
+)
+
+array_in_map = StructType(
+[
+StructField(
+"arrInMap",
+MapType(
+StringType(unicode_collation), 
ArrayType(StringType(unicode_collation))
+),
+)
+]
+)
+
+nested_array_in_map = StructType(
+[
+StructField(
+"nestedArrayInMap",
+ArrayType(
+MapType(
+StringType(unicode_collation),
+
ArrayType(ArrayType(StringType(unicode_collation))),
+)
+),
+)
+]
+)
+
+schema_with_multiple_fields = StructType(
+simple_struct.fields
++ nested_struct.fields
++ array_in_schema.fields
++ map_in_schema.fields
++ array_in_map.fields
++ nested_array_in_map.fields
+)
+
+schemas = [
+simple_struct,
+nested_struct,
+array_in_schema,
+map_in_schema,
+nested_array_in_map,
+array_in_map,
+schema_with_multiple_fields,
+]
+
+for schema in schemas:
+scala_datatype = 
self.spark._jsparkSession.parseDataType(schema.json())
+python_datatype = 
_parse_datatype_json_string(scala_datatype.json())
+assert schema == python_datatype
+assert schema == _parse_datatype_json_string(schema.json())
+
+def test_schema_with_collations_on_non_string_types(self):
+from pyspark.sql.types import _parse_datatype_json_string, 
_COLLATIONS_METADATA_KEY
+
+collations_on_int_col_json = f"""
+{{
+  "type": "struct",
+  "fields": [
+{{
+  "name": "c1",
+  "type": "integer",
+  "nullable": true,
+  "metadata": {{
+"{_COLLATIONS_METADATA_KEY}": {{
+  "c1": "icu.UNICODE"
+}}
+  }}
+}}
+  ]
+}}
+"""
+
+collations_in_map_json = f"""
+{{
+  "type": "struct",
+  "fields": [
+{{
+  "name": "mapField",
+  "type": {{
+"type": "map",
+"keyType": "string",
+"valueType": "integer",
+"valueContainsNull": true
+  }},
+  "nullable": true,
+  "metadata": {{
+"{_COLLATIONS_METADATA_KEY}": {{
+  "mapField.value": "icu.UNICODE"

Review Comment:
   We talked about this one a bit offline, but I would rather tackle this as a 
separate issue than treat it as just a collation protocol error. Currently, 
neither the Python nor the Scala code fails on duplicate keys; Python will 
just pick one to put in the dictionary, and Scala will keep both in the 
`JObject` (see the snippet below). What do you think @cloud-fan?
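   
   For reference, a quick demonstration of the Python behavior described above: 
`json.loads` silently keeps only the last occurrence of a duplicate key.
   
   ```python
   import json

   payload = (
       '{"__COLLATIONS": {"mapField.value": "icu.UNICODE", '
       '"mapField.value": "icu.UNICODE_CI"}}'
   )
   parsed = json.loads(payload)
   # The earlier value is dropped; only the last duplicate survives.
   print(parsed)  # {'__COLLATIONS': {'mapField.value': 'icu.UNICODE_CI'}}
   ```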



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[PR] [WIP][SPARK-48000][SQL] Enable hash join support for all collations (StringType) [spark]

2024-05-15 Thread via GitHub


uros-db opened a new pull request, #46599:
URL: https://github.com/apache/spark/pull/46599

   
   
   ### What changes were proposed in this pull request?
   
   
   
   ### Why are the changes needed?
   
   
   
   ### Does this PR introduce _any_ user-facing change?
   
   
   
   ### How was this patch tested?
   
   
   
   ### Was this patch authored or co-authored using generative AI tooling?
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-48175][SQL][PYTHON] Store collation information in metadata and not in type for SER/DE [spark]

2024-05-15 Thread via GitHub


stefankandic commented on code in PR #46280:
URL: https://github.com/apache/spark/pull/46280#discussion_r1601784552


##
common/unsafe/src/main/java/org/apache/spark/sql/catalyst/util/CollationFactory.java:
##
@@ -36,11 +36,45 @@
  * Provides functionality to the UTF8String object which respects defined 
collation settings.
  */
 public final class CollationFactory {
+
+  /**
+   * Identifier for single a collation.
+   */
+  public static class CollationIdentifier {
+public final String provider;
+public final String name;
+public final String version;

Review Comment:
   I thought that `Optional` is used in Java for method return types, and not 
for attributes, since attributes are "optional" anyway?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org


