[PR] [Only Test] Test without local maven cache [spark]
panbingkun opened a new pull request, #46606: URL: https://github.com/apache/spark/pull/46606

### What changes were proposed in this pull request?
### Why are the changes needed?
### Does this PR introduce _any_ user-facing change?
### How was this patch tested?
### Was this patch authored or co-authored using generative AI tooling?

-- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [SPARK-48296][SQL] Codegen Support for `to_xml` [spark]
yaooqinn commented on code in PR #46591: URL: https://github.com/apache/spark/pull/46591#discussion_r1602657406

## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/xmlExpressions.scala:

```diff
@@ -298,40 +294,40 @@ case class StructsToXml(
   }

   @transient
-  lazy val writer = new CharArrayWriter()
-
-  @transient
-  lazy val inputSchema: StructType = child.dataType.asInstanceOf[StructType]
-
-  @transient
-  lazy val gen = new StaxXmlGenerator(
-    inputSchema, writer, new XmlOptions(options, timeZoneId.get), false)
-
-  // This converts rows to the XML output according to the given schema.
-  @transient
-  lazy val converter: Any => UTF8String = {
-    def getAndReset(): UTF8String = {
-      gen.flush()
-      val xmlString = writer.toString
-      writer.reset()
-      UTF8String.fromString(xmlString)
-    }
-    (row: Any) =>
-      gen.write(row.asInstanceOf[InternalRow])
-      getAndReset()
-  }
+  private lazy val evaluator = new StructsToXmlEvaluator(
+    child.dataType.asInstanceOf[StructType], options, timeZoneId)

   override def dataType: DataType = SQLConf.get.defaultStringType

   override def withTimeZone(timeZoneId: String): TimeZoneAwareExpression =
     copy(timeZoneId = Option(timeZoneId))

-  override def nullSafeEval(value: Any): Any = converter(value)
+  override def nullSafeEval(value: Any): Any = {
+    UTF8String.fromString(evaluator.evaluate(value))
+  }

   override def inputTypes: Seq[AbstractDataType] = StructType :: Nil

   override def prettyName: String = "to_xml"

   override protected def withNewChildInternal(newChild: Expression): StructsToXml =
     copy(child = newChild)
+
+  override protected def doGenCode(ctx: CodegenContext, ev: ExprCode): ExprCode = {
```

Review Comment: Can we also revert the other refactoring?
Re: [PR] [SPARK-48296][SQL] Codegen Support for `to_xml` [spark]
panbingkun commented on code in PR #46591: URL: https://github.com/apache/spark/pull/46591#discussion_r1602655851

## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/xmlExpressions.scala:

(quotes the same `StructsToXml` diff shown earlier in this thread)

Review Comment: yeah, updated. thanks!
Re: [PR] [SPARK-48299][BUILD] Upgrade `scala-maven-plugin` to 4.9.1 [spark]
yaooqinn commented on PR #46593: URL: https://github.com/apache/spark/pull/46593#issuecomment-2114079862

Merged to master. Thank you @LuciferYang
Re: [PR] [SPARK-48299][BUILD] Upgrade `scala-maven-plugin` to 4.9.1 [spark]
LuciferYang commented on PR #46593: URL: https://github.com/apache/spark/pull/46593#issuecomment-2114077454

Thanks @yaooqinn
Re: [PR] [SPARK-48299][BUILD] Upgrade `scala-maven-plugin` to 4.9.1 [spark]
yaooqinn closed pull request #46593: [SPARK-48299][BUILD] Upgrade `scala-maven-plugin` to 4.9.1 URL: https://github.com/apache/spark/pull/46593
Re: [PR] [SPARK-48251][BUILD] Disable `maven local cache` on GA's step `MIMA test` [spark]
panbingkun commented on PR #46551: URL: https://github.com/apache/spark/pull/46551#issuecomment-2114068973

> How much of a performance gap will there be between completely abandoning `Resolver.mavenLocal` and the current approach?

Let me test it.
Re: [PR] [SPARK-47607] Add documentation for Structured logging framework [spark]
gengliangwang commented on PR #46605: URL: https://github.com/apache/spark/pull/46605#issuecomment-2114056759

cc @panbingkun
[PR] [SPARK-47607] Add documentation for Structured logging framework [spark]
gengliangwang opened a new pull request, #46605: URL: https://github.com/apache/spark/pull/46605

### What changes were proposed in this pull request?
Add documentation for the Structured logging framework.

### Why are the changes needed?
Provides documentation for Spark developers.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Doc preview: https://github.com/apache/spark/assets/1097932/d3c4fcdc-57e4-4af2-8b05-6b4f6731a8c0

### Was this patch authored or co-authored using generative AI tooling?
No
Re: [PR] [SPARK-48251][BUILD] Disable `maven local cache` on GA's step `MIMA test` [spark]
LuciferYang commented on PR #46551: URL: https://github.com/apache/spark/pull/46551#issuecomment-2114044184

@panbingkun How much of a performance gap will there be between completely abandoning `Resolver.mavenLocal` and the current approach?
Re: [PR] [SPARK-36783][SQL] ScanOperation should not push Filter through nondeterministic Project [spark]
wForget commented on PR #34023: URL: https://github.com/apache/spark/pull/34023#issuecomment-2114043502

> Yes, because after pruning, less data is scanned, and the non-deterministic function in the SELECT list may return different results if we don't prune ahead.

Makes sense, thank you!
Re: [PR] [SPARK-36783][SQL] ScanOperation should not push Filter through nondeterministic Project [spark]
cloud-fan commented on PR #34023: URL: https://github.com/apache/spark/pull/34023#issuecomment-2114040674

Yes, because after pruning, less data is scanned, and the non-deterministic function in the SELECT list may return different results if we don't prune ahead.
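The behavior cloud-fan describes can be illustrated with a hedged sketch (the table, partitioning, and column names here are hypothetical, not taken from the PR):

```sql
-- Suppose t is partitioned by dt. rand() is non-deterministic and is
-- evaluated once per row that reaches the Project.
SELECT * FROM (
  SELECT c1, rand() AS r, dt FROM t
) WHERE dt = '2024-05-15';

-- If the filter is pushed below the Project (partition pruning), rand()
-- runs only over rows of partition dt='2024-05-15'. If pruning does not
-- happen, rand() runs over every scanned row before filtering, so the
-- random values each surviving row receives can differ between the two
-- plans -- which is why pruning ahead matters for result stability.
```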
Re: [PR] [SPARK-48214][INFRA] Ban import `org.slf4j.Logger` & `org.slf4j.LoggerFactory` [spark]
gengliangwang closed pull request #46502: [SPARK-48214][INFRA] Ban import `org.slf4j.Logger` & `org.slf4j.LoggerFactory` URL: https://github.com/apache/spark/pull/46502
Re: [PR] [SPARK-48214][INFRA] Ban import `org.slf4j.Logger` & `org.slf4j.LoggerFactory` [spark]
gengliangwang commented on PR #46502: URL: https://github.com/apache/spark/pull/46502#issuecomment-2114022656

Thanks, merging to master
Re: [PR] [SPARK-48214][INFRA] Ban import `org.slf4j.Logger` & `org.slf4j.LoggerFactory` [spark]
gengliangwang commented on code in PR #46502: URL: https://github.com/apache/spark/pull/46502#discussion_r1602614245

## common/utils/src/main/java/org/apache/spark/internal/SparkLoggerFactory.java:

```diff
@@ -17,15 +17,17 @@
 package org.apache.spark.internal;

+// checkstyle.off: RegexpSinglelineJava
+import org.slf4j.LoggerFactory;
+// checkstyle.on: RegexpSinglelineJava
+
 public class SparkLoggerFactory {

   public static SparkLogger getLogger(String name) {
-    org.slf4j.Logger slf4jLogger = org.slf4j.LoggerFactory.getLogger(name);
-    return new SparkLogger(slf4jLogger);
+    return new SparkLogger(LoggerFactory.getLogger(name));
```

Review Comment: yeah this looks better
Re: [PR] [SPARK-48298][Core] Add TCP mode to StatsD sink [spark]
jiwen624 commented on PR #46604: URL: https://github.com/apache/spark/pull/46604#issuecomment-2114019176

Hi @cloud-fan @dongjoon-hyun @yaooqinn, could you take a look when you get a chance, or perhaps pull other people in for review? Thank you.
Re: [PR] [SPARK-48296][SQL] Codegen Support for `to_xml` [spark]
yaooqinn commented on code in PR #46591: URL: https://github.com/apache/spark/pull/46591#discussion_r1602584809

## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/xmlExpressions.scala:

(quotes the same `StructsToXml` diff shown earlier in this thread)

Review Comment: Simply
```scala
override protected def doGenCode(ctx: CodegenContext, ev: ExprCode): ExprCode = {
  val expr = ctx.addReferenceObj("this", this)
  defineCodeGen(ctx, ev, input => s"(UTF8String) $expr.nullSafeEval($input)")
}
```
Re: [PR] [SPARK-36783][SQL] ScanOperation should not push Filter through nondeterministic Project [spark]
wForget commented on PR #34023: URL: https://github.com/apache/spark/pull/34023#issuecomment-2113995703

@cloud-fan This change also affects `PhysicalOperation`, causing Hive partition pruning to not take effect when there is a non-deterministic project. Is this expected behavior? For example:

```sql
select * from (
  select c1, reflect('java.net.URLDecoder', 'decode', c2, 'UTF-8') as c2, dt from test
) t1
where dt = '2024-05-15' limit 10;
```
Re: [PR] [SPARK-48287][PS][CONNECT] Apply the builtin `timestamp_diff` method [spark]
zhengruifeng commented on PR #46595: URL: https://github.com/apache/spark/pull/46595#issuecomment-2113992843

thanks, merged to master
Re: [PR] [SPARK-48287][PS][CONNECT] Apply the builtin `timestamp_diff` method [spark]
zhengruifeng closed pull request #46595: [SPARK-48287][PS][CONNECT] Apply the builtin `timestamp_diff` method URL: https://github.com/apache/spark/pull/46595
Re: [PR] [SPARK-48219][CORE] StreamReader Charset fix with UTF8 [spark]
yaooqinn commented on PR #46509: URL: https://github.com/apache/spark/pull/46509#issuecomment-2113991332

Thank you @xuzifu666. Merged to master
Re: [PR] [SPARK-48219][CORE] StreamReader Charset fix with UTF8 [spark]
yaooqinn closed pull request #46509: [SPARK-48219][CORE] StreamReader Charset fix with UTF8 URL: https://github.com/apache/spark/pull/46509
Re: [PR] [SPARK-48219][CORE] StreamReader Charset fix with UTF8 [spark]
xuzifu666 commented on PR #46509: URL: https://github.com/apache/spark/pull/46509#issuecomment-2113986338

`XSDToSchema` would not modify it; Hive's implementation has also been changed, see the recent PR: https://github.com/apache/hive/pull/5243. So I think it is necessary to change it? @yaooqinn @HyukjinKwon
Re: [PR] [SPARK-48219][CORE] StreamReader Charset fix with UTF8 [spark]
xuzifu666 commented on code in PR #46509: URL: https://github.com/apache/spark/pull/46509#discussion_r1602578323

## sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/xml/XSDToSchema.scala:

```diff
@@ -48,7 +49,7 @@ object XSDToSchema extends Logging {
     val in = ValidatorUtil.openSchemaFile(xsdPath)
     val xmlSchemaCollection = new XmlSchemaCollection()
     xmlSchemaCollection.setBaseUri(xsdPath.toString)
-    val xmlSchema = xmlSchemaCollection.read(new InputStreamReader(in))
+    val xmlSchema = xmlSchemaCollection.read(new InputStreamReader(in, StandardCharsets.UTF_8))
```

Review Comment: `XSDToSchema` would not modify it; Hive's implementation has also been changed, see the recent PR: https://github.com/apache/hive/pull/5243
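The one-line fix above works because `new InputStreamReader(in)` decodes with the JVM's platform default charset (`file.encoding`), while the patched call pins decoding to UTF-8 so the schema file reads the same on every machine. A minimal standalone sketch of the difference (not Spark code; the class and helper below are illustrative):

```java
import java.io.ByteArrayInputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.StandardCharsets;

public class CharsetDemo {
    // Decode bytes the way the patched XSDToSchema does: explicitly as UTF-8,
    // so the result does not depend on the platform's default charset.
    static String decodeUtf8(byte[] bytes) throws Exception {
        Reader reader = new InputStreamReader(
            new ByteArrayInputStream(bytes), StandardCharsets.UTF_8);
        StringBuilder sb = new StringBuilder();
        int c;
        while ((c = reader.read()) != -1) {
            sb.append((char) c);
        }
        return sb.toString();
    }

    public static void main(String[] args) throws Exception {
        byte[] utf8Bytes = "schéma".getBytes(StandardCharsets.UTF_8);

        // Unpinned reader (the pre-fix code path): which charset is used
        // depends on the JVM's file.encoding, so the same bytes can decode
        // differently on a machine defaulting to ISO-8859-1 vs. UTF-8.
        InputStreamReader unpinned =
            new InputStreamReader(new ByteArrayInputStream(utf8Bytes));
        System.out.println("platform-default reader charset: " + unpinned.getEncoding());

        // Pinned reader (the fixed code path): always decodes correctly.
        System.out.println("UTF-8 reader result: " + decodeUtf8(utf8Bytes));
    }
}
```

The same pattern applies to any `InputStreamReader` construction: always pass a `Charset` explicitly when the byte format is known.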
Re: [PR] [SPARK-48288] Add source data type for connector cast expression [spark]
cloud-fan commented on PR #46596: URL: https://github.com/apache/spark/pull/46596#issuecomment-2113965545

the change LGTM. Please mention that users who override `V2ExpressionSQLBuilder.visitCast` should update their code in the `Does this PR introduce any user-facing change?` section.
Re: [PR] [SPARK-48296][SQL] Codegen Support for `to_xml` [spark]
panbingkun commented on PR #46591: URL: https://github.com/apache/spark/pull/46591#issuecomment-2113962446

cc @yaooqinn @cloud-fan
[PR] [WIP][SPARK-48298][Core] Add TCP mode to StatsD sink [spark]
jiwen624 opened a new pull request, #46604: URL: https://github.com/apache/spark/pull/46604

### What changes were proposed in this pull request?
Working on it...

### Why are the changes needed?
As mentioned in the Jira ticket: https://issues.apache.org/jira/browse/SPARK-48298

Currently, the StatsdSink in Spark supports UDP mode only, which is the default mode of StatsD. However, in real production environments, we often find that more reliable transmission of metrics is needed to avoid metrics loss in high-traffic systems. TCP mode is already supported by StatsD (https://github.com/statsd/statsd/blob/master/docs/server.md), by Prometheus' statsd_exporter (https://github.com/prometheus/statsd_exporter), and by many other StatsD-based metrics proxies/receivers.

### Does this PR introduce _any_ user-facing change?
Yes. The following new config options are added to `conf/metrics.properties.template`:
- `*.sink.statsd.protocol`
- `*.sink.statsd.connTimeoutMs`

A new error condition is defined in error-conditions.json for protocol configuration errors.

### How was this patch tested?
Unit tests. Manual tests with metric configurations sending metrics to a Netcat TCP/UDP server.

### Was this patch authored or co-authored using generative AI tooling?
No.
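Based only on the option names listed in the PR description, a `metrics.properties` fragment enabling the proposed TCP mode might look like this (the values and defaults shown are illustrative assumptions, not taken from the PR):

```properties
# Illustrative sketch -- values are assumptions, not from the PR.
*.sink.statsd.class=org.apache.spark.metrics.sink.StatsdSink
*.sink.statsd.host=127.0.0.1
*.sink.statsd.port=8125
# Proposed in this PR: select "tcp" for reliable delivery (UDP is StatsD's default).
*.sink.statsd.protocol=tcp
# Proposed in this PR: TCP connection timeout in milliseconds.
*.sink.statsd.connTimeoutMs=5000
```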
Re: [PR] [SPARK-44811][BUILD] Upgrade Guava to 33.1.0-jre [spark]
LuciferYang commented on PR #42493: URL: https://github.com/apache/spark/pull/42493#issuecomment-2113953900

> > should we test 33.2.0?
>
> `33.1.0-jre` is chosen because Spark uses `33.1.0-jre` in other places, and after this change, we can unify all Guava versions.

OK
Re: [PR] [SPARK-48251][BUILD] Disable `maven local cache` on GA's step `MIMA test` [spark]
yaooqinn commented on PR #46551: URL: https://github.com/apache/spark/pull/46551#issuecomment-2113936531

If this might increase the $ cost of ASF INFRA, let's put it on hold until the upstream bug is fixed. I second @dongjoon-hyun's concern.
Re: [PR] [SPARK-48295][PS] Turn on `compute.ops_on_diff_frames` by default [spark]
HyukjinKwon commented on PR #46602: URL: https://github.com/apache/spark/pull/46602#issuecomment-2113931789

Merged to master.
Re: [PR] [SPARK-48295][PS] Turn on `compute.ops_on_diff_frames` by default [spark]
HyukjinKwon closed pull request #46602: [SPARK-48295][PS] Turn on `compute.ops_on_diff_frames` by default URL: https://github.com/apache/spark/pull/46602
[PR] [SPARK-48297][SQL] Fix a regression TRANSFORM clause with char/varchar [spark]
yaooqinn opened a new pull request, #46603: URL: https://github.com/apache/spark/pull/46603

### What changes were proposed in this pull request?
TRANSFORM with char/varchar has been accidentally invalidated since 3.1 with a scala.MatchError; this PR fixes it.

### Why are the changes needed?
Bugfix.

### Does this PR introduce _any_ user-facing change?
no

### How was this patch tested?
new tests

### Was this patch authored or co-authored using generative AI tooling?
no
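A hedged sketch of the kind of query affected (the table, columns, and script below are hypothetical; the PR description does not include a repro):

```sql
-- A TRANSFORM clause reading CHAR/VARCHAR columns; per the PR description,
-- since Spark 3.1 this class of query failed with a scala.MatchError
-- instead of piping the rows through the script.
CREATE TABLE t (c CHAR(10), v VARCHAR(10));
SELECT TRANSFORM(c, v) USING 'cat' AS (c1 STRING, c2 STRING) FROM t;
```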
Re: [PR] [SPARK-48296][SQL] Codegen Support for `to_xml` [spark]
panbingkun commented on code in PR #46591: URL: https://github.com/apache/spark/pull/46591#discussion_r1602509705

## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/xmlExpressions.scala:

```diff
@@ -186,9 +186,6 @@ case class SchemaOfXml(
   @transient
   private lazy val xmlOptions = new XmlOptions(options, "UTC")

-  @transient
-  private lazy val xmlFactory = xmlOptions.buildXmlFactory()
```

Review Comment: The variable `xmlFactory` is not used in the expression `SchemaOfXml`, so it is deleted here. It has no strong relationship with this PR.
Re: [PR] [SPARK-48289][DOCKER][TEST] Clean up Oracle JDBC tests by skipping redundant SYSTEM password reset [spark]
yaooqinn commented on PR #46598: URL: https://github.com/apache/spark/pull/46598#issuecomment-2113835891

Thank @LucaCanali @dongjoon-hyun Merged to master
Re: [PR] [SPARK-48289][DOCKER][TEST] Clean up Oracle JDBC tests by skipping redundant SYSTEM password reset [spark]
yaooqinn closed pull request #46598: [SPARK-48289][DOCKER][TEST] Clean up Oracle JDBC tests by skipping redundant SYSTEM password reset URL: https://github.com/apache/spark/pull/46598
Re: [PR] [SPARK-48289][DOCKER][TEST] Clean up Oracle JDBC tests by skipping redundant SYSTEM password reset [spark]
yaooqinn commented on code in PR #46598: URL: https://github.com/apache/spark/pull/46598#discussion_r1602483027

## connector/docker-integration-tests/src/test/scala/org/apache/spark/sql/jdbc/OracleIntegrationSuite.scala:

    @@ -54,10 +54,10 @@ import org.apache.spark.tags.DockerTest
     * A sequence of commands to build the Oracle Database Free container image:
     * $ git clone https://github.com/oracle/docker-images.git
     * $ cd docker-images/OracleDatabase/SingleInstance/dockerfiles
    - * $ ./buildContainerImage.sh -v 23.2.0 -f
    - * $ export ORACLE_DOCKER_IMAGE_NAME=oracle/database:23.2.0-free
    + * $ ./buildContainerImage.sh -v 23.4.0 -f
    + * $ export ORACLE_DOCKER_IMAGE_NAME=oracle/database:23.4.0-free
     *
    - * This procedure has been validated with Oracle Database Free version 23.2.0,
    + * This procedure has been validated with Oracle Database Free version 23.4.0,
     * and with Oracle Express Edition versions 18.4.0 and 21.3.0

Review Comment:

```suggestion
 * and with Oracle Express Edition versions 18.4.0 and 21.4.0
```
Re: [PR] [SPARK-48289][DOCKER][TEST] Clean up Oracle JDBC tests by skipping redundant SYSTEM password reset [spark]
yaooqinn commented on code in PR #46598: URL: https://github.com/apache/spark/pull/46598#discussion_r1602482135

## connector/docker-integration-tests/src/test/scala/org/apache/spark/sql/jdbc/v2/OracleIntegrationSuite.scala:

    @@ -48,11 +48,11 @@ import org.apache.spark.tags.DockerTest
     *
     * A sequence of commands to build the Oracle Database Free container image:
     * $ git clone https://github.com/oracle/docker-images.git
    - * $ cd docker-images/OracleDatabase/SingleInstance/dockerfiles
    - * $ ./buildContainerImage.sh -v 23.2.0 -f
    - * $ export ORACLE_DOCKER_IMAGE_NAME=oracle/database:23.2.0-free
    + * $ cd docker-images/OracleDatabase/SingleInstance/dockerfiles0
    + * $ ./buildContainerImage.sh -v 23.4.0 -f
    + * $ export ORACLE_DOCKER_IMAGE_NAME=oracle/database:23.4.0-free
     *
    - * This procedure has been validated with Oracle Database Free version 23.2.0,
    + * This procedure has been validated with Oracle Database Free version 23.4.0,
     * and with Oracle Express Edition versions 18.4.0 and 21.3.0

Review Comment:

```suggestion
 * and with Oracle Express Edition versions 18.4.0 and 21.4.0
```
Re: [PR] [SPARK-48289][DOCKER][TEST] Clean up Oracle JDBC tests by skipping redundant SYSTEM password reset [spark]
yaooqinn commented on code in PR #46598: URL: https://github.com/apache/spark/pull/46598#discussion_r1602481532

## connector/docker-integration-tests/src/test/scala/org/apache/spark/sql/jdbc/v2/OracleNamespaceSuite.scala:

    @@ -46,10 +46,10 @@ import org.apache.spark.tags.DockerTest
     * A sequence of commands to build the Oracle Database Free container image:
     * $ git clone https://github.com/oracle/docker-images.git
     * $ cd docker-images/OracleDatabase/SingleInstance/dockerfiles
    - * $ ./buildContainerImage.sh -v 23.2.0 -f
    - * $ export ORACLE_DOCKER_IMAGE_NAME=oracle/database:23.2.0-free
    + * $ ./buildContainerImage.sh -v 23.4.0 -f
    + * $ export ORACLE_DOCKER_IMAGE_NAME=oracle/database:23.4.0-free
     *
    - * This procedure has been validated with Oracle Database Free version 23.2.0,
    + * This procedure has been validated with Oracle Database Free version 23.4.0,
     * and with Oracle Express Edition versions 18.4.0 and 21.3.0

Review Comment:

```suggestion
 * and with Oracle Express Edition versions 18.4.0 and 21.4.0
```
Re: [PR] [SPARK-48268][CORE] Add a configuration for SparkContext.setCheckpointDir [spark]
HyukjinKwon commented on code in PR #46571: URL: https://github.com/apache/spark/pull/46571#discussion_r1602450604

## core/src/main/scala/org/apache/spark/SparkContext.scala:

    @@ -374,6 +374,7 @@ class SparkContext(config: SparkConf) extends Logging {
       private[spark] def cleaner: Option[ContextCleaner] = _cleaner

       private[spark] var checkpointDir: Option[String] = None
    +  config.getOption(CHECKPOINT_DIR.key).foreach(setCheckpointDir)

Review Comment: and adding a test too
Re: [PR] [SPARK-48268][CORE] Add a configuration for SparkContext.setCheckpointDir [spark]
HyukjinKwon commented on code in PR #46571: URL: https://github.com/apache/spark/pull/46571#discussion_r1602449365

## core/src/main/scala/org/apache/spark/SparkContext.scala:

    @@ -374,6 +374,7 @@ class SparkContext(config: SparkConf) extends Logging {
       private[spark] def cleaner: Option[ContextCleaner] = _cleaner

       private[spark] var checkpointDir: Option[String] = None
    +  config.getOption(CHECKPOINT_DIR.key).foreach(setCheckpointDir)

Review Comment: Let me just put it there together.
Re: [PR] [SPARK-48252][SQL] Update CommonExpressionRef when necessary [spark]
cloud-fan closed pull request #46552: [SPARK-48252][SQL] Update CommonExpressionRef when necessary URL: https://github.com/apache/spark/pull/46552
Re: [PR] [SPARK-48252][SQL] Update CommonExpressionRef when necessary [spark]
cloud-fan commented on PR #46552: URL: https://github.com/apache/spark/pull/46552#issuecomment-2113761762

The last commit just updates the test name, we don't need to re-test. I'm merging it to master, thanks for the review!
Re: [PR] [SPARK-48252][SQL] Update CommonExpressionRef when necessary [spark]
cloud-fan commented on PR #46552: URL: https://github.com/apache/spark/pull/46552#issuecomment-2113758865

oh, I accepted this code-change comment on Github but later on I pushed a new commit from local which discarded this change...
Re: [PR] [SPARK-48268][CORE] Add a configuration for SparkContext.setCheckpointDir [spark]
zhengruifeng commented on code in PR #46571: URL: https://github.com/apache/spark/pull/46571#discussion_r1602443672

## core/src/main/scala/org/apache/spark/SparkContext.scala:

    @@ -374,6 +374,7 @@ class SparkContext(config: SparkConf) extends Logging {
       private[spark] def cleaner: Option[ContextCleaner] = _cleaner

       private[spark] var checkpointDir: Option[String] = None
    +  config.getOption(CHECKPOINT_DIR.key).foreach(setCheckpointDir)

Review Comment: or probably make `checkpointDir` lazy?
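The lazy alternative floated in this review thread is a general pattern: defer reading config-driven state until first access, so the field no longer depends on where it sits in the constructor. A minimal stdlib-Python sketch of the idea (illustrative only — a hypothetical `Context` class, not Spark's actual code):

```python
import functools


class Context:
    """Toy stand-in for a context object whose fields initialize in order."""

    def __init__(self, conf):
        self._conf = conf
        # An eager assignment here would read conf-driven state before
        # later fields (e.g. a Hadoop configuration) have been set up.

    @functools.cached_property
    def checkpoint_dir(self):
        # Lazy: resolved on first access, after __init__ has fully run,
        # and cached so repeated reads return the same value.
        return self._conf.get("spark.checkpoint.dir")


ctx = Context({"spark.checkpoint.dir": "/tmp/ckpt"})
print(ctx.checkpoint_dir)  # -> /tmp/ckpt
```

The trade-off versus moving the eager call later in the constructor is that laziness silently defers any error (e.g. an invalid path) to first use instead of failing at startup.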
Re: [PR] [DO-NOT-MERGE][SPARK-48258][PYTHON][CONNECT] Checkpoint and localCheckpoint in Spark Connect [spark]
HyukjinKwon commented on code in PR #46570: URL: https://github.com/apache/spark/pull/46570#discussion_r1602437890

## python/pyspark/sql/tests/test_dataframe.py:

    @@ -844,6 +844,11 @@ def test_union_classmethod_usage(self):
        def test_isinstance_dataframe(self):
            self.assertIsInstance(self.spark.range(1), DataFrame)

    +    def test_checkpoint_dataframe(self):

Review Comment: I enabled some more tests that were skipped
Re: [PR] [SPARK-48295][PS] Turn on `compute.ops_on_diff_frames` by default [spark]
zhengruifeng commented on code in PR #46602: URL: https://github.com/apache/spark/pull/46602#discussion_r1602430705

## python/pyspark/pandas/frame.py:

    @@ -491,7 +491,8 @@ class DataFrame(Frame, Generic[T]):
        >>> import pandas as pd
        >>> sdf = spark.createDataFrame([("Data", 1), ("Bricks", 2)], ["x", "y"])
    -    >>> ps.DataFrame(data=sdf, index=pd.Index([0, 1, 2]))
    +    >>> with ps.option_context("compute.ops_on_diff_frames", False):
    +    ...     ps.DataFrame(data=sdf, index=pd.Index([0, 1, 2]))

Review Comment: I am not sure, @HyukjinKwon wdyt?
Re: [PR] [DO-NOT-MERGE][SPARK-48258][PYTHON][CONNECT] Checkpoint and localCheckpoint in Spark Connect [spark]
HyukjinKwon commented on code in PR #46570: URL: https://github.com/apache/spark/pull/46570#discussion_r1602427305

## connector/connect/common/src/main/protobuf/spark/connect/base.proto:

    @@ -199,6 +200,17 @@ message AnalyzePlanRequest {
        // (Required) The logical plan to get the storage level.
        Relation relation = 1;
      }
    +
    +  message Checkpoint {

Review Comment: Actually taking a look at `ExecutePlanRequest`/`ExecutePlanResponse`, I think it's better to keep it in `AnalyzePlanRequest`. `ExecutePlanResponse` more assumes that the output is actual output values such as Arrow batch, map, etc. after the actual execution. I think it's better to take `RemoveCachedRemoteRelationCommand` as a separate command that is not dedicated to `checkpoint`. It can be used in other places if we happen to reuse `CachedRemoteRelationCommand` somewhere.
Re: [PR] [DO-NOT-MERGE][SPARK-48258][PYTHON][CONNECT] Checkpoint and localCheckpoint in Spark Connect [spark]
HyukjinKwon commented on code in PR #46570: URL: https://github.com/apache/spark/pull/46570#discussion_r1602427305

## connector/connect/common/src/main/protobuf/spark/connect/base.proto:

    @@ -199,6 +200,17 @@ message AnalyzePlanRequest {
        // (Required) The logical plan to get the storage level.
        Relation relation = 1;
      }
    +
    +  message Checkpoint {

Review Comment: Actually taking a look at `ExecutePlanRequest`/`ExecutePlanResponse`, I think it's better to keep it in `AnalyzePlanRequest`. `ExecutePlanResponse` assumes that the output is actual output values such as Arrow batch, map, etc. vs `AnalyzePlanResponse` can assume it's a DataFrame. This is also consistent with `persist`/`cache`. I think it's better to take `RemoveCachedRemoteRelationCommand` as a separate command that is not dedicated to `checkpoint`. It can be used in other places if we happen to reuse `CachedRemoteRelationCommand` somewhere.
Re: [PR] [SPARK-46090][SQL] Support plan fragment level SQL configs in AQE [spark]
ulysses-you commented on PR #44013: URL: https://github.com/apache/spark/pull/44013#issuecomment-2113733794

@cloud-fan any concern to land this at 4.0 ?
Re: [PR] [SPARK-48268][CORE] Add a configuration for SparkContext.setCheckpointDir [spark]
mridulm commented on code in PR #46571: URL: https://github.com/apache/spark/pull/46571#discussion_r1602415826

## core/src/main/scala/org/apache/spark/SparkContext.scala:

    @@ -374,6 +374,7 @@ class SparkContext(config: SparkConf) extends Logging {
       private[spark] def cleaner: Option[ContextCleaner] = _cleaner

       private[spark] var checkpointDir: Option[String] = None
    +  config.getOption(CHECKPOINT_DIR.key).foreach(setCheckpointDir)

Review Comment: We should add a test - this NPE would have surfaced there.
Re: [PR] [SPARK-48268][CORE] Add a configuration for SparkContext.setCheckpointDir [spark]
mridulm commented on code in PR #46571: URL: https://github.com/apache/spark/pull/46571#discussion_r1602414776

## core/src/main/scala/org/apache/spark/SparkContext.scala:

    @@ -374,6 +374,7 @@ class SparkContext(config: SparkConf) extends Logging {
       private[spark] def cleaner: Option[ContextCleaner] = _cleaner

       private[spark] var checkpointDir: Option[String] = None
    +  config.getOption(CHECKPOINT_DIR.key).foreach(setCheckpointDir)

Review Comment: We should do this only after `_hadoopConfiguration` has been initialized. Move it to after line [535 here](https://github.com/apache/spark/blob/f40ffe0f97a2e753878af8f8f47038272a1e1657/core/src/main/scala/org/apache/spark/SparkContext.scala#L535) ?
Re: [PR] [SPARK-48264][BUILD] Upgrade `datasketches-java` to 6.0.0 [spark]
panbingkun commented on PR #46563: URL: https://github.com/apache/spark/pull/46563#issuecomment-2113689996

cc @dongjoon-hyun
Re: [PR] [SPARK-46956][SQL] Improve the error prompt when `SaveMode` is null in API `DataFrameWriter.mode` [spark]
github-actions[bot] closed pull request #44999: [SPARK-46956][SQL] Improve the error prompt when `SaveMode` is null in API `DataFrameWriter.mode` URL: https://github.com/apache/spark/pull/44999
Re: [PR] [WIP][Spark 44646] Reduce usage of log4j core [spark]
github-actions[bot] commented on PR #45001: URL: https://github.com/apache/spark/pull/45001#issuecomment-2113687236

We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable. If you'd like to revive this PR, please reopen it and ask a committer to remove the Stale tag!
Re: [PR] [SPARK-42789][SQL] Rewrite multiple GetJsonObject that consumes same JSON to single JsonTuple [spark]
github-actions[bot] commented on PR #45020: URL: https://github.com/apache/spark/pull/45020#issuecomment-2113687220

We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable. If you'd like to revive this PR, please reopen it and ask a committer to remove the Stale tag!
Re: [PR] [SPARK-48214][INFRA] Ban import `org.slf4j.Logger` & `org.slf4j.LoggerFactory` [spark]
panbingkun commented on PR #46502: URL: https://github.com/apache/spark/pull/46502#issuecomment-2113686449

> @panbingkun let's hold on this until we have a conclusion in #46600

@gengliangwang After #46600, the PR has been updated.
Re: [PR] [SPARK-48295][PS] Turn on `compute.ops_on_diff_frames` by default [spark]
itholic commented on code in PR #46602: URL: https://github.com/apache/spark/pull/46602#discussion_r1602388729

## python/pyspark/pandas/frame.py:

    @@ -491,7 +491,8 @@ class DataFrame(Frame, Generic[T]):
        >>> import pandas as pd
        >>> sdf = spark.createDataFrame([("Data", 1), ("Bricks", 2)], ["x", "y"])
    -    >>> ps.DataFrame(data=sdf, index=pd.Index([0, 1, 2]))
    +    >>> with ps.option_context("compute.ops_on_diff_frames", False):
    +    ...     ps.DataFrame(data=sdf, index=pd.Index([0, 1, 2]))

Review Comment: I think maybe now we can just remove the negative cases from the doctests and keep only UTs??

## python/pyspark/pandas/frame.py:

    @@ -509,7 +510,8 @@ class DataFrame(Frame, Generic[T]):
        >>> import pandas as pd
        >>> sdf = spark.createDataFrame([("Data", 1), ("Bricks", 2)], ["x", "y"])
    -    >>> ps.DataFrame(data=sdf, index=ps.Index([0, 1, 2]))
    +    >>> with ps.option_context("compute.ops_on_diff_frames", False):
    +    ...     ps.DataFrame(data=sdf, index=ps.Index([0, 1, 2]))

Review Comment: ditto and all belows?
Re: [PR] [SPARK-48295][PS] Turn on `compute.ops_on_diff_frames` by default [spark]
zhengruifeng commented on PR #46602: URL: https://github.com/apache/spark/pull/46602#issuecomment-2113677192

will add it to release note soon
[PR] [SPARK-48295][PS] Turn on `compute.ops_on_diff_frames` by default [spark]
zhengruifeng opened a new pull request, #46602: URL: https://github.com/apache/spark/pull/46602

### What changes were proposed in this pull request?

Turn on `compute.ops_on_diff_frames` by default

### Why are the changes needed?

1, in most cases, this config need to be turned on to enable computation with different dataframes;
2, enable `compute.ops_on_diff_frames` should not break any workloads, it should only enable more;

### Does this PR introduce _any_ user-facing change?

yes, this config is turned on by default

### How was this patch tested?

updated tests

### Was this patch authored or co-authored using generative AI tooling?

no
Re: [PR] [SPARK-48291][Core] Rename Java Logger as SparkLogger [spark]
gengliangwang closed pull request #46600: [SPARK-48291][Core] Rename Java Logger as SparkLogger URL: https://github.com/apache/spark/pull/46600
Re: [PR] [SPARK-48291][Core] Rename Java Logger as SparkLogger [spark]
gengliangwang commented on PR #46600: URL: https://github.com/apache/spark/pull/46600#issuecomment-2113658297

@dongjoon-hyun @panbingkun Thanks for the review. Merging this one to master
Re: [PR] [SPARK-48268][CORE] Add a configuration for SparkContext.setCheckpointDir [spark]
HyukjinKwon commented on PR #46571: URL: https://github.com/apache/spark/pull/46571#issuecomment-2113647596

> Also, I'm wondering if we want to show a proper warning or even to raise exceptions because this is a critical mistake.

This one, I think it's fine. We already have similar configurations such as `spark.log.level` vs `setLogLevel`.
Re: [PR] [SPARK-48291][Core] Rename Java Logger as SparkLogger [spark]
gengliangwang commented on PR #46600: URL: https://github.com/apache/spark/pull/46600#issuecomment-2113647415

> Of course, if we need to be very explicit about this difference

Yes that's the purpose of this PR. I want the class names to be clear.

> Maybe we need to add some instructions for developers in org/apache/spark/internal/README.md

Good point. I will have a follow-up for both scala and java logging.
Re: [PR] [SPARK-48291][Core] Rename Java Logger as SparkLogger [spark]
panbingkun commented on PR #46600: URL: https://github.com/apache/spark/pull/46600#issuecomment-2113630518 The purpose of the name at that time was to make developers feel `less change`, more `natural` and `smooth`. Of course, if we need to be `very explicit` about this difference, I'm fine. Maybe we need to add some instructions for developers in `org/apache/spark/internal/README.md`? https://github.com/apache/spark/blob/a2d93d104a6c3307f50dfd91f5cc51a97d4c5837/common/utils/src/main/scala/org/apache/spark/internal/README.md?plain=1#L13
Re: [PR] [SPARK-48220][PYTHON] Allow passing PyArrow Table to createDataFrame() [spark]
alippai commented on PR #46529: URL: https://github.com/apache/spark/pull/46529#issuecomment-2113428373 I wish we had ns support in Spark 4.0, as Java, Parquet, Arrow, etc. use it natively now, but that's certainly a different discussion and a different set of MRs :) Thanks for the new API!
Re: [PR] [SPARK-48220][PYTHON] Allow passing PyArrow Table to createDataFrame() [spark]
ianmcook commented on PR #46529: URL: https://github.com/apache/spark/pull/46529#issuecomment-2113363928 > What will happen with the nanosecond timestamps? Truncated to microseconds? Yes, truncated to microseconds for now.
Re: [PR] [SPARK-48017] Add Spark application submission worker for operator [spark-kubernetes-operator]
jiangzho commented on code in PR #10: URL: https://github.com/apache/spark-kubernetes-operator/pull/10#discussion_r1602137497 ## gradle.properties: ## @@ -18,17 +18,23 @@ group=org.apache.spark.k8s.operator version=0.1.0 +# Caution: fabric8 version should be aligned with Spark dependency fabric8Version=6.12.1 commonsLang3Version=3.14.0 commonsIOVersion=2.16.1 lombokVersion=1.18.32 +#Spark +scalaVersion=2.13 +sparkVersion=4.0.0-preview1 Review Comment: @dongjoon-hyun - can we use rc1 for the submission worker for now, in order to unblock the operator module? I'll be more than happy to upgrade & fix any incompatibilities when RC2 is ready.
[PR] [SPARK-48293][SS] Add test for when ForeachBatchUserFuncException wraps interrupted exception due to query stop [spark]
micheal-o opened a new pull request, #46601: URL: https://github.com/apache/spark/pull/46601 ### What changes were proposed in this pull request? Add test for when ForeachBatchUserFuncException wraps interrupted exception due to query stop ### Why are the changes needed? test ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? test added ### Was this patch authored or co-authored using generative AI tooling? No
Re: [PR] [SPARK-48289][DOCKER][TEST] Clean up Oracle JDBC tests by skipping redundant SYSTEM password reset [spark]
dongjoon-hyun commented on PR #46598: URL: https://github.com/apache/spark/pull/46598#issuecomment-2113214338 Feel free to merge, @yaooqinn.
Re: [PR] [SPARK-48214][INFRA] Ban import `org.slf4j.Logger` & `org.slf4j.LoggerFactory` [spark]
gengliangwang commented on PR #46502: URL: https://github.com/apache/spark/pull/46502#issuecomment-2113168819 @panbingkun let's hold on this until we have a conclusion in https://github.com/apache/spark/pull/46600
Re: [PR] [SPARK-48291][Core] Rename Java Logger as SparkLogger [spark]
gengliangwang commented on PR #46600: URL: https://github.com/apache/spark/pull/46600#issuecomment-2113167651 cc @panbingkun
[PR] [SPARK-48291][Core] Rename Java Logger as SparkLogger [spark]
gengliangwang opened a new pull request, #46600: URL: https://github.com/apache/spark/pull/46600 ### What changes were proposed in this pull request? Two new classes `org.apache.spark.internal.Logger` and `org.apache.spark.internal.LoggerFactory` were introduced from https://github.com/apache/spark/pull/46301. Given that Logger is a widely recognized **interface** in Log4j, having a class with the same name may lead to confusion. To avoid this and clarify its purpose within the Spark framework, I propose renaming `org.apache.spark.internal.Logger` to `org.apache.spark.internal.SparkLogger`. Similarly, to maintain consistency, `org.apache.spark.internal.LoggerFactory` should be renamed to `org.apache.spark.internal.SparkLoggerFactory`. ### Why are the changes needed? To avoid naming confusion and to clarify the purpose of the Java Spark logger within the logging framework ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? GA tests ### Was this patch authored or co-authored using generative AI tooling? No
Re: [PR] [SPARK-39195][SQL] Spark OutputCommitCoordinator should abort stage when committed file not consistent with task status [spark]
viirya commented on PR #36564: URL: https://github.com/apache/spark/pull/36564#issuecomment-2113160108 I created https://issues.apache.org/jira/browse/SPARK-48292
Re: [PR] [SPARK-39195][SQL] Spark OutputCommitCoordinator should abort stage when committed file not consistent with task status [spark]
viirya commented on PR #36564: URL: https://github.com/apache/spark/pull/36564#issuecomment-2113149367 > To @AngersZh , @viirya , @cloud-fan . Since this PR is too old, shall we file a JIRA issue for that comment, [#36564 (comment)](https://github.com/apache/spark/pull/36564#discussion_r1598660630) ? Yea, I think so.
Re: [PR] [SPARK-48260][SQL] Disable output committer coordination in one test of ParquetIOSuite [spark]
gengliangwang commented on PR #46562: URL: https://github.com/apache/spark/pull/46562#issuecomment-2113122398 @dongjoon-hyun From the recent tests of branch-3.5(https://github.com/apache/spark/commits/branch-3.5/), the ParquetIOSuite looks fine for now. I am ok with either backporting or not backporting.
Re: [PR] [SPARK-39195][SQL] Spark OutputCommitCoordinator should abort stage when committed file not consistent with task status [spark]
dongjoon-hyun commented on PR #36564: URL: https://github.com/apache/spark/pull/36564#issuecomment-2113120262 To @AngersZh , @viirya , @cloud-fan . Since this PR is too old, shall we file a JIRA issue for that comment, https://github.com/apache/spark/pull/36564#discussion_r1598660630 ?
Re: [PR] [SPARK-48260][SQL] Disable output committer coordination in one test of ParquetIOSuite [spark]
dongjoon-hyun commented on PR #46562: URL: https://github.com/apache/spark/pull/46562#issuecomment-2113116084 Hi, @gengliangwang and all. Do you think we can have this at release branches too?
Re: [PR] [SPARK-48289][DOCKER][TEST] Clean up Oracle JDBC tests by skipping redundant SYSTEM password reset [spark]
dongjoon-hyun commented on PR #46598: URL: https://github.com/apache/spark/pull/46598#issuecomment-2113028851 cc @yaooqinn
Re: [PR] [SPARK-48268][CORE] Add a configuration for SparkContext.setCheckpointDir [spark]
dongjoon-hyun commented on code in PR #46571: URL: https://github.com/apache/spark/pull/46571#discussion_r1601964897 ## core/src/main/scala/org/apache/spark/internal/config/package.scala: ## @@ -1317,6 +1317,14 @@ package object config { s" be less than or equal to ${ByteArrayMethods.MAX_ROUNDED_ARRAY_LENGTH}.") .createWithDefault(64 * 1024 * 1024) + private[spark] val CHECKPOINT_DIR = +ConfigBuilder("spark.checkpoint.dir") + .doc("Equivalent with SparkContext.setCheckpointDir. If set, the path becomes " + +"the directory for checkpointing.") Review Comment: Could you add this to `docs/configuration.md`?
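If the config in the diff above lands as written, a deployment could set the checkpoint directory declaratively instead of calling the API. A hypothetical sketch (the path is illustrative only):

```
# conf/spark-defaults.conf (proposed)
# Equivalent to calling SparkContext.setCheckpointDir("hdfs:///tmp/checkpoints")
spark.checkpoint.dir    hdfs:///tmp/checkpoints
```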
Re: [PR] [SPARK-48256][BUILD] Add a rule to check file headers for the java side, and fix inconsistent files [spark]
dongjoon-hyun closed pull request #46557: [SPARK-48256][BUILD] Add a rule to check file headers for the java side, and fix inconsistent files URL: https://github.com/apache/spark/pull/46557
Re: [PR] [SPARK-48288] Add source data type for connector cast expression [spark]
urosstan-db commented on code in PR #46596: URL: https://github.com/apache/spark/pull/46596#discussion_r1601958458 ## sql/catalyst/src/main/java/org/apache/spark/sql/connector/expressions/Cast.java: ## @@ -29,14 +29,31 @@ @Evolving public class Cast extends ExpressionWithToString { private Expression expression; + + /** + * Original data type of given expression + */ + private DataType expressionDataType; + + /** + * Target data type, i.e. data type in which expression will be cast + */ private DataType dataType; + @Deprecated Review Comment: I agree with you
Re: [PR] [SPARK-48288] Add source data type for connector cast expression [spark]
urosstan-db commented on code in PR #46596: URL: https://github.com/apache/spark/pull/46596#discussion_r1601958035 ## sql/catalyst/src/main/java/org/apache/spark/sql/connector/util/V2ExpressionSQLBuilder.java: ## @@ -231,10 +231,15 @@ protected String visitBinaryArithmetic(String name, String l, String r) { return l + " " + name + " " + r; } + @Deprecated protected String visitCast(String l, DataType dataType) { Review Comment: Thanks, will be done.
Re: [PR] [SPARK-48252][SQL] Update CommonExpressionRef when necessary [spark]
dongjoon-hyun commented on PR #46552: URL: https://github.com/apache/spark/pull/46552#issuecomment-2113011265 Hi, @cloud-fan . It seems that @viirya 's comment is closed without addressing it. Could you double-check it? - https://github.com/apache/spark/pull/46552#discussion_r1598626620
Re: [PR] [SPARK-48218][CORE] TransportClientFactory.createClient may NPE cause FetchFailedException [spark]
dongjoon-hyun commented on PR #46506: URL: https://github.com/apache/spark/pull/46506#issuecomment-2112993768 Merged to master for Apache Spark 4.0.0-preview2-rc2. Thank you, @cxzl25 , @mridulm , @yaooqinn .
Re: [PR] [SPARK-48218][CORE] TransportClientFactory.createClient may NPE cause FetchFailedException [spark]
dongjoon-hyun closed pull request #46506: [SPARK-48218][CORE] TransportClientFactory.createClient may NPE cause FetchFailedException URL: https://github.com/apache/spark/pull/46506
Re: [PR] [SPARK-48049][BUILD] Upgrade Scala to 2.13.14 [spark]
dongjoon-hyun closed pull request #46288: [SPARK-48049][BUILD] Upgrade Scala to 2.13.14 URL: https://github.com/apache/spark/pull/46288
Re: [PR] [SPARK-48285][SQL][DOCS] Update docs for size function and sizeOfNull configuration [spark]
dongjoon-hyun commented on PR #46592: URL: https://github.com/apache/spark/pull/46592#issuecomment-2112983873 Merged to master for Apache Spark 4.0.0-preview1-rc2.
Re: [PR] [SPARK-48285][SQL][DOCS] Update docs for size function and sizeOfNull configuration [spark]
dongjoon-hyun closed pull request #46592: [SPARK-48285][SQL][DOCS] Update docs for size function and sizeOfNull configuration URL: https://github.com/apache/spark/pull/46592
Re: [PR] [SPARK-47429][SQL] Rename error class to condition [spark]
nchammas commented on PR #46543: URL: https://github.com/apache/spark/pull/46543#issuecomment-2112900349 @MaxGekk @HyukjinKwon @cloud-fan - Before I finish up this PR, I'd like to check in with you briefly to make sure you like where this is going. The main idea in this PR is to deprecate public classes and parameters that use the old "error class" terminology and instead direct users to use "error condition". I'm trying wherever possible to do this in a backwards-compatible way. [Here's an example][ex] of the approach. In addition to being a bit gentler on users migrating to Spark 4.0, this will also help us avoid having to rename every instance of the `errorClass` or `error_class` parameters all across the codebase in one shot. Does this look good to you so far? In a similar vein, shall I keep incorrectly-named classes around but turn them into deprecated sub-classes of their correctly-named replacements? For example, `ErrorClassesJsonReader` would become a sub-class of `ErrorConditionsJsonReader`. [ex]: https://github.com/apache/spark/pull/46543/files#diff-d316cd1a4b3d10d5c8136adc63d0d56c74005a600bb966a1fd485347be1cd7e1R38-R49
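The deprecated-subclass idea can be sketched in Python (the actual Spark readers are JVM/Scala classes, and the constructor parameter here is a hypothetical stand-in; this only illustrates the backwards-compatibility pattern being proposed):

```python
import warnings


class ErrorConditionsJsonReader:
    """Correctly named replacement (sketch)."""

    def __init__(self, json_file_paths):
        self.json_file_paths = list(json_file_paths)


class ErrorClassesJsonReader(ErrorConditionsJsonReader):
    """Deprecated: kept only so existing call sites keep working."""

    def __init__(self, json_file_paths):
        warnings.warn(
            "ErrorClassesJsonReader is deprecated; use ErrorConditionsJsonReader.",
            DeprecationWarning,
            stacklevel=2,
        )
        super().__init__(json_file_paths)
```

Existing code constructing `ErrorClassesJsonReader` still runs unchanged, but surfaces a `DeprecationWarning` pointing users at the new name.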
Re: [PR] [SPARK-48175][SQL][PYTHON] Store collation information in metadata and not in type for SER/DE [spark]
stefankandic commented on code in PR #46280: URL: https://github.com/apache/spark/pull/46280#discussion_r1601881695 ## sql/api/src/main/scala/org/apache/spark/sql/types/StructField.scala: ## @@ -63,7 +66,60 @@ case class StructField( ("name" -> name) ~ ("type" -> dataType.jsonValue) ~ ("nullable" -> nullable) ~ - ("metadata" -> metadata.jsonValue) + ("metadata" -> metadataJson) + } + + private def metadataJson: JValue = { +val metadataJsonValue = metadata.jsonValue +metadataJsonValue match { + case JObject(fields) if collationMetadata.nonEmpty => +val collationFields = collationMetadata.map(kv => kv._1 -> JString(kv._2)).toList +JObject(fields :+ (DataType.COLLATIONS_METADATA_KEY -> JObject(collationFields))) + + case _ => metadataJsonValue +} + } + + /** Map of field path to collation name. */ + private lazy val collationMetadata: mutable.Map[String, String] = { +val fieldToCollationMap = mutable.Map[String, String]() + +def visitRecursively(dt: DataType, path: String): Unit = dt match { + case at: ArrayType => +processDataType(at.elementType, path + ".element") + + case mt: MapType => +processDataType(mt.keyType, path + ".key") +processDataType(mt.valueType, path + ".value") + + case st: StringType if isCollatedString(st) => +fieldToCollationMap(path) = collationName(st) + + case _ => +} + +def processDataType(dt: DataType, path: String): Unit = { + if (isCollatedString(dt)) { +fieldToCollationMap(path) = collationName(dt) + } else { +visitRecursively(dt, path) + } +} + +visitRecursively(dataType, name) Review Comment: Each column has separate metadata so map and string can never clash -> here is how it would look like in json: ```json { "type": "struct", "fields": [ { "name": "c1.key", "type": "string", "nullable": true, "metadata": { "__COLLATIONS": { "c1.key": "ICU.UNICODE_CI" } } }, { "name": "c1", "type": { "type": "map", "keyType": "string", "valueType": "string", "valueContainsNull": true }, "nullable": true, "metadata": { "__COLLATIONS": { "c1.key": 
"ICU.UNICODE" } } } ] } ``` "c1.key" will mean different things in each column - for the first it's just the entire column and for the second it is the key of the map
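The recursive path-building in `visitRecursively` can be mirrored with a tiny Python sketch. The tuple-based type model below is invented for illustration; only the `.key`/`.value`/`.element` path suffixes and the path-to-collation map come from the snippet above:

```python
def collect_collations(dtype, path, out=None):
    """Walk a simplified type tree, recording field-path -> collation name.

    A type is either the plain string "string" or a tuple:
      ("collated_string", <collation>), ("array", <elem>), ("map", <key>, <val>)
    """
    out = {} if out is None else out
    if isinstance(dtype, tuple):
        kind = dtype[0]
        if kind == "collated_string":
            out[path] = dtype[1]
        elif kind == "array":
            collect_collations(dtype[1], path + ".element", out)
        elif kind == "map":
            collect_collations(dtype[1], path + ".key", out)
            collect_collations(dtype[2], path + ".value", out)
    return out

# The map column from the JSON above: map<string collate UNICODE, string>
m = collect_collations(("map", ("collated_string", "ICU.UNICODE"), "string"), "c1")
```

For the map column this yields `{"c1.key": "ICU.UNICODE"}`, matching the `__COLLATIONS` entry in the JSON above.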
Re: [PR] [SPARK-48175][SQL][PYTHON] Store collation information in metadata and not in type for SER/DE [spark]
stefankandic commented on code in PR #46280: URL: https://github.com/apache/spark/pull/46280#discussion_r1601881695 ## sql/api/src/main/scala/org/apache/spark/sql/types/StructField.scala: ## @@ -63,7 +66,60 @@ case class StructField( ("name" -> name) ~ ("type" -> dataType.jsonValue) ~ ("nullable" -> nullable) ~ - ("metadata" -> metadata.jsonValue) + ("metadata" -> metadataJson) + } + + private def metadataJson: JValue = { +val metadataJsonValue = metadata.jsonValue +metadataJsonValue match { + case JObject(fields) if collationMetadata.nonEmpty => +val collationFields = collationMetadata.map(kv => kv._1 -> JString(kv._2)).toList +JObject(fields :+ (DataType.COLLATIONS_METADATA_KEY -> JObject(collationFields))) + + case _ => metadataJsonValue +} + } + + /** Map of field path to collation name. */ + private lazy val collationMetadata: mutable.Map[String, String] = { +val fieldToCollationMap = mutable.Map[String, String]() + +def visitRecursively(dt: DataType, path: String): Unit = dt match { + case at: ArrayType => +processDataType(at.elementType, path + ".element") + + case mt: MapType => +processDataType(mt.keyType, path + ".key") +processDataType(mt.valueType, path + ".value") + + case st: StringType if isCollatedString(st) => +fieldToCollationMap(path) = collationName(st) + + case _ => +} + +def processDataType(dt: DataType, path: String): Unit = { + if (isCollatedString(dt)) { +fieldToCollationMap(path) = collationName(dt) + } else { +visitRecursively(dt, path) + } +} + +visitRecursively(dataType, name) Review Comment: Each column has separate metadata so map and string can never clash -> here is how it would look like in json: ```json { "type": "struct", "fields": [ { "name": "c1.key", "type": "string", "nullable": true, "metadata": { "__COLLATIONS": { "c1.key": "ICU.UNICODE_CI" } } }, { "name": "c1", "type": { "type": "map", "keyType": "string", "valueType": "string", "valueContainsNull": true }, "nullable": true, "metadata": { "__COLLATIONS": { "c1.value": 
"ICU.UNICODE", "c1.key": "ICU.UNICODE" } } } ] } ``` "c1.key" will mean different things in each column - for the first it's just the entire column and for the second it is the key of the map
Re: [PR] [SPARK-48175][SQL][PYTHON] Store collation information in metadata and not in type for SER/DE [spark]
stefankandic commented on code in PR #46280: URL: https://github.com/apache/spark/pull/46280#discussion_r1601878212 ## python/pyspark/sql/types.py: ## @@ -1702,9 +1791,16 @@ def _parse_datatype_json_string(json_string: str) -> DataType: return _parse_datatype_json_value(json.loads(json_string)) -def _parse_datatype_json_value(json_value: Union[dict, str]) -> DataType: +def _parse_datatype_json_value( +json_value: Union[dict, str], +fieldPath: str = "", +collationsMap: Optional[Dict[str, str]] = None, +) -> DataType: if not isinstance(json_value, dict): if json_value in _all_atomic_types.keys(): +if collationsMap is not None and fieldPath in collationsMap: +collationName = collationsMap[fieldPath].split(".")[1] Review Comment: sure!
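For reference, the metadata value being split in the diff above is a `provider.name` pair (e.g. `"ICU.UNICODE_CI"` as seen elsewhere in this thread), so index `[1]` picks out the collation name:

```python
# Collation metadata values are stored as "<provider>.<name>".
value = "ICU.UNICODE_CI"
provider, collation_name = value.split(".", 1)
```

Using `maxsplit=1` here makes the intent explicit: everything after the first dot is the collation name.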
Re: [PR] [SPARK-48175][SQL][PYTHON] Store collation information in metadata and not in type for SER/DE [spark]
stefankandic commented on code in PR #46280: URL: https://github.com/apache/spark/pull/46280#discussion_r1601839139 ## python/pyspark/sql/types.py: ## @@ -876,30 +894,86 @@ def __init__( self.dataType = dataType self.nullable = nullable self.metadata = metadata or {} +self._collationMetadata: Optional[Dict[str, str]] = None def simpleString(self) -> str: return "%s:%s" % (self.name, self.dataType.simpleString()) def __repr__(self) -> str: return "StructField('%s', %s, %s)" % (self.name, self.dataType, str(self.nullable)) +def __eq__(self, other: Any) -> bool: +# since collationMetadata is lazy evaluated we should not use it in equality check Review Comment: I agree, doing this for every json call is annoying but since we don't have the security of immutable data like in scala perhaps we should avoid lazy eval -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-48175][SQL][PYTHON] Store collation information in metadata and not in type for SER/DE [spark]
Re: [PR] [SPARK-48175][SQL][PYTHON] Store collation information in metadata and not in type for SER/DE [spark]
stefankandic commented on code in PR #46280:
URL: https://github.com/apache/spark/pull/46280#discussion_r1601793860

## python/pyspark/sql/tests/test_types.py:
@@ -549,6 +549,129 @@ def test_convert_list_to_str(self):
         self.assertEqual(df.count(), 1)
         self.assertEqual(df.head(), Row(name="[123]", income=120))
 
+    def test_schema_with_collations_json_ser_de(self):
+        from pyspark.sql.types import _parse_datatype_json_string
+
+        unicode_collation = "UNICODE"
+
+        simple_struct = StructType([StructField("c1", StringType(unicode_collation))])
+
+        nested_struct = StructType([StructField("nested", simple_struct)])
+
+        array_in_schema = StructType(
+            [StructField("array", ArrayType(StringType(unicode_collation)))]
+        )
+
+        map_in_schema = StructType(
+            [
+                StructField(
+                    "map", MapType(StringType(unicode_collation), StringType(unicode_collation))
+                )
+            ]
+        )
+
+        array_in_map = StructType(
+            [
+                StructField(
+                    "arrInMap",
+                    MapType(
+                        StringType(unicode_collation), ArrayType(StringType(unicode_collation))
+                    ),
+                )
+            ]
+        )
+
+        nested_array_in_map = StructType(
+            [
+                StructField(
+                    "nestedArrayInMap",
+                    ArrayType(
+                        MapType(
+                            StringType(unicode_collation),
+                            ArrayType(ArrayType(StringType(unicode_collation))),
+                        )
+                    ),
+                )
+            ]
+        )
+
+        schema_with_multiple_fields = StructType(
+            simple_struct.fields
+            + nested_struct.fields
+            + array_in_schema.fields
+            + map_in_schema.fields
+            + array_in_map.fields
+            + nested_array_in_map.fields
+        )
+
+        schemas = [
+            simple_struct,
+            nested_struct,
+            array_in_schema,
+            map_in_schema,
+            nested_array_in_map,
+            array_in_map,
+            schema_with_multiple_fields,
+        ]
+
+        for schema in schemas:
+            scala_datatype = self.spark._jsparkSession.parseDataType(schema.json())
+            python_datatype = _parse_datatype_json_string(scala_datatype.json())
+            assert schema == python_datatype
+            assert schema == _parse_datatype_json_string(schema.json())
+
+    def test_schema_with_collations_on_non_string_types(self):
+        from pyspark.sql.types import _parse_datatype_json_string, _COLLATIONS_METADATA_KEY
+
+        collations_on_int_col_json = f"""
+        {{
+          "type": "struct",
+          "fields": [
+            {{
+              "name": "c1",
+              "type": "integer",
+              "nullable": true,
+              "metadata": {{
+                "{_COLLATIONS_METADATA_KEY}": {{
+                  "c1": "icu.UNICODE"
+                }}
+              }}
+            }}
+          ]
+        }}
+        """
+
+        collations_in_map_json = f"""
+        {{
+          "type": "struct",
+          "fields": [
+            {{
+              "name": "mapField",
+              "type": {{
+                "type": "map",
+                "keyType": "string",
+                "valueType": "integer",
+                "valueContainsNull": true
+              }},
+              "nullable": true,
+              "metadata": {{
+                "{_COLLATIONS_METADATA_KEY}": {{
+                  "mapField.value": "icu.UNICODE"

Review Comment:
   We talked about this one a bit offline, but I would rather tackle this as a separate issue than just a collation protocol error. Currently, both the Python and Scala code will not fail when encountering duplicate keys; Python will just pick one to put in the dictionary and Scala will keep both in the `JObject`. What do you think @cloud-fan ?

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org
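The duplicate-key behavior the review comment describes — Python silently keeping only one value per key — is easy to reproduce with the standard-library `json` module. This is a minimal sketch independent of PySpark; the payload below is a hypothetical duplicate collation entry, not one from the PR:

```python
import json

# Hypothetical JSON metadata with a duplicate key for the same field path.
payload = '{"mapField.value": "icu.UNICODE", "mapField.value": "icu.UTF8_BINARY"}'

# json.loads does not fail on duplicate keys; it keeps the *last* value seen.
print(json.loads(payload))  # {'mapField.value': 'icu.UTF8_BINARY'}

# If duplicates should be a protocol error, they can be rejected explicitly
# via object_pairs_hook, which receives every (key, value) pair in order.
def reject_duplicates(pairs):
    seen = {}
    for key, value in pairs:
        if key in seen:
            raise ValueError(f"duplicate key: {key}")
        seen[key] = value
    return seen

try:
    json.loads(payload, object_pairs_hook=reject_duplicates)
except ValueError as err:
    print(err)  # duplicate key: mapField.value
```

The `object_pairs_hook` approach is one way a parser could surface the duplicate instead of masking it, which is the divergence from Scala (where both pairs survive in the `JObject`) that the comment points out.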
[PR] [WIP][SPARK-48000][SQL] Enable hash join support for all collations (StringType) [spark]
uros-db opened a new pull request, #46599:
URL: https://github.com/apache/spark/pull/46599

   ### What changes were proposed in this pull request?

   ### Why are the changes needed?

   ### Does this PR introduce _any_ user-facing change?

   ### How was this patch tested?

   ### Was this patch authored or co-authored using generative AI tooling?

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-48175][SQL][PYTHON] Store collation information in metadata and not in type for SER/DE [spark]
stefankandic commented on code in PR #46280:
URL: https://github.com/apache/spark/pull/46280#discussion_r1601784552

## common/unsafe/src/main/java/org/apache/spark/sql/catalyst/util/CollationFactory.java:
@@ -36,11 +36,45 @@
  * Provides functionality to the UTF8String object which respects defined collation settings.
  */
 public final class CollationFactory {
+
+  /**
+   * Identifier for a single collation.
+   */
+  public static class CollationIdentifier {
+    public final String provider;
+    public final String name;
+    public final String version;

Review Comment:
   I thought that `Optional` is used in Java for method return types, and not for attributes, since those are "optional" anyway?

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org