[jira] [Commented] (SPARK-40422) Upgrade hive to 4.0.0
[ https://issues.apache.org/jira/browse/SPARK-40422?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17620774#comment-17620774 ] Bilna commented on SPARK-40422: --- [~srowen] In the mvn dependency tree I can see google-gson is coming through Apache Hive. That is the reason I requested the Hive version upgrade. Can you please tell me which JIRA fixed the GSON version? > Upgrade hive to 4.0.0 > - > > Key: SPARK-40422 > URL: https://issues.apache.org/jira/browse/SPARK-40422 > Project: Spark > Issue Type: Dependency upgrade > Components: SQL >Affects Versions: 3.3.0 >Reporter: Bilna >Priority: Major > > Upgrade hive to 4.0.0 to avoid security vulnerability CVE-2022-25647 through > google-gson:2.2.4. In hive:4.0.0, google-gson is upgraded to 2.8.9, for which > no CVE has been reported yet. > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-40457) upgrade jackson data mapper to latest
[ https://issues.apache.org/jira/browse/SPARK-40457?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17620766#comment-17620766 ] Bilna commented on SPARK-40457: --- [~hyukjin.kwon] Understood. So I think I can mark this as a false positive. Thanks for the link. > upgrade jackson data mapper to latest > -- > > Key: SPARK-40457 > URL: https://issues.apache.org/jira/browse/SPARK-40457 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.3.0 >Reporter: Bilna >Priority: Major > > Upgrade jackson-mapper-asl to the latest to resolve CVE-2019-10172
[jira] [Commented] (SPARK-40758) Upgrade Apache zookeeper to get rid of CVE-2020-10663
[ https://issues.apache.org/jira/browse/SPARK-40758?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17620765#comment-17620765 ] Bilna commented on SPARK-40758: --- https://issues.apache.org/jira/browse/ZOOKEEPER-3933 This link says the reported CVE is a false positive. So I think we can close this. > Upgrade Apache zookeeper to get rid of CVE-2020-10663 > - > > Key: SPARK-40758 > URL: https://issues.apache.org/jira/browse/SPARK-40758 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.3.0 >Reporter: Bilna >Priority: Major > > In order to resolve security vulnerability CVE-2020-10663, upgrade Apache > zookeeper to 3.8.0
[jira] [Assigned] (SPARK-40852) Implement `DataFrame.summary`
[ https://issues.apache.org/jira/browse/SPARK-40852?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ruifeng Zheng reassigned SPARK-40852: - Assignee: Ruifeng Zheng > Implement `DataFrame.summary` > - > > Key: SPARK-40852 > URL: https://issues.apache.org/jira/browse/SPARK-40852 > Project: Spark > Issue Type: Sub-task > Components: Connect, PySpark >Affects Versions: 3.4.0 >Reporter: Ruifeng Zheng >Assignee: Ruifeng Zheng >Priority: Major >
[jira] [Created] (SPARK-40852) Implement `DataFrame.summary`
Ruifeng Zheng created SPARK-40852: - Summary: Implement `DataFrame.summary` Key: SPARK-40852 URL: https://issues.apache.org/jira/browse/SPARK-40852 Project: Spark Issue Type: Sub-task Components: Connect, PySpark Affects Versions: 3.4.0 Reporter: Ruifeng Zheng
[jira] [Commented] (SPARK-40768) Migrate type check failures of bloom_filter_agg() onto error classes
[ https://issues.apache.org/jira/browse/SPARK-40768?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17620741#comment-17620741 ] Apache Spark commented on SPARK-40768: -- User 'lvshaokang' has created a pull request for this issue: https://github.com/apache/spark/pull/38315 > Migrate type check failures of bloom_filter_agg() onto error classes > > > Key: SPARK-40768 > URL: https://issues.apache.org/jira/browse/SPARK-40768 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.4.0 >Reporter: Max Gekk >Priority: Major > > Replace TypeCheckFailure by DataTypeMismatch in type checks in > bloom_filter_agg(): > https://github.com/apache/spark/blob/1f4e4c812a9dc6d7e35631c1663c1ba6f6d9b721/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/BloomFilterAggregate.scala#L66-L76
[jira] [Assigned] (SPARK-40768) Migrate type check failures of bloom_filter_agg() onto error classes
[ https://issues.apache.org/jira/browse/SPARK-40768?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-40768: Assignee: Apache Spark > Migrate type check failures of bloom_filter_agg() onto error classes > > > Key: SPARK-40768 > URL: https://issues.apache.org/jira/browse/SPARK-40768 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.4.0 >Reporter: Max Gekk >Assignee: Apache Spark >Priority: Major > > Replace TypeCheckFailure by DataTypeMismatch in type checks in > bloom_filter_agg(): > https://github.com/apache/spark/blob/1f4e4c812a9dc6d7e35631c1663c1ba6f6d9b721/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/BloomFilterAggregate.scala#L66-L76
[jira] [Assigned] (SPARK-40768) Migrate type check failures of bloom_filter_agg() onto error classes
[ https://issues.apache.org/jira/browse/SPARK-40768?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-40768: Assignee: (was: Apache Spark) > Migrate type check failures of bloom_filter_agg() onto error classes > > > Key: SPARK-40768 > URL: https://issues.apache.org/jira/browse/SPARK-40768 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.4.0 >Reporter: Max Gekk >Priority: Major > > Replace TypeCheckFailure by DataTypeMismatch in type checks in > bloom_filter_agg(): > https://github.com/apache/spark/blob/1f4e4c812a9dc6d7e35631c1663c1ba6f6d9b721/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/BloomFilterAggregate.scala#L66-L76
[jira] [Commented] (SPARK-40768) Migrate type check failures of bloom_filter_agg() onto error classes
[ https://issues.apache.org/jira/browse/SPARK-40768?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17620739#comment-17620739 ] Apache Spark commented on SPARK-40768: -- User 'lvshaokang' has created a pull request for this issue: https://github.com/apache/spark/pull/38315 > Migrate type check failures of bloom_filter_agg() onto error classes > > > Key: SPARK-40768 > URL: https://issues.apache.org/jira/browse/SPARK-40768 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.4.0 >Reporter: Max Gekk >Priority: Major > > Replace TypeCheckFailure by DataTypeMismatch in type checks in > bloom_filter_agg(): > https://github.com/apache/spark/blob/1f4e4c812a9dc6d7e35631c1663c1ba6f6d9b721/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/BloomFilterAggregate.scala#L66-L76
[jira] [Commented] (SPARK-40813) Add limit and offset to Connect DSL
[ https://issues.apache.org/jira/browse/SPARK-40813?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17620712#comment-17620712 ] Apache Spark commented on SPARK-40813: -- User 'amaliujia' has created a pull request for this issue: https://github.com/apache/spark/pull/38314 > Add limit and offset to Connect DSL > --- > > Key: SPARK-40813 > URL: https://issues.apache.org/jira/browse/SPARK-40813 > Project: Spark > Issue Type: Sub-task > Components: Connect >Affects Versions: 3.4.0 >Reporter: Rui Wang >Assignee: Rui Wang >Priority: Major > Fix For: 3.4.0 > >
[jira] [Commented] (SPARK-40813) Add limit and offset to Connect DSL
[ https://issues.apache.org/jira/browse/SPARK-40813?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17620711#comment-17620711 ] Apache Spark commented on SPARK-40813: -- User 'amaliujia' has created a pull request for this issue: https://github.com/apache/spark/pull/38314 > Add limit and offset to Connect DSL > --- > > Key: SPARK-40813 > URL: https://issues.apache.org/jira/browse/SPARK-40813 > Project: Spark > Issue Type: Sub-task > Components: Connect >Affects Versions: 3.4.0 >Reporter: Rui Wang >Assignee: Rui Wang >Priority: Major > Fix For: 3.4.0 > >
[jira] [Created] (SPARK-40851) TimestampFormatter behavior changed when using the latest Java
Yang Jie created SPARK-40851: Summary: TimestampFormatter behavior changed when using the latest Java Key: SPARK-40851 URL: https://issues.apache.org/jira/browse/SPARK-40851 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.4.0 Reporter: Yang Jie {code:java} [info] *** 12 TESTS FAILED *** [error] Failed: Total 6746, Failed 12, Errors 0, Passed 6734, Ignored 5 [error] Failed tests: [error] org.apache.spark.sql.catalyst.expressions.CastWithAnsiOffSuite [error] org.apache.spark.sql.catalyst.util.TimestampFormatterSuite [error] org.apache.spark.sql.catalyst.expressions.CastWithAnsiOnSuite [error] org.apache.spark.sql.catalyst.util.RebaseDateTimeSuite [error] org.apache.spark.sql.catalyst.expressions.TryCastSuite {code} We can reproduce this issue using Java 8u352/11.0.17/17.0.5, the test errors are similar to the following: run {code:java} build/sbt clean "catalyst/testOnly *CastWithAnsiOffSuite" {code} with 8u352: {code:java} [info] - SPARK-35711: cast timestamp without time zone to timestamp with local time zone *** FAILED *** (190 milliseconds) [info] Incorrect evaluation (codegen off): cast(0001-01-01 00:00:00 as timestamp), actual: -6213561782000, expected: -621355968 (ExpressionEvalHelper.scala:209) [info] org.scalatest.exceptions.TestFailedException: [info] at org.scalatest.Assertions.newAssertionFailedException(Assertions.scala:472) [info] at org.scalatest.Assertions.newAssertionFailedException$(Assertions.scala:471) [info] at org.scalatest.funsuite.AnyFunSuite.newAssertionFailedException(AnyFunSuite.scala:1564) [info] at org.scalatest.Assertions.fail(Assertions.scala:933) [info] at org.scalatest.Assertions.fail$(Assertions.scala:929) [info] at org.scalatest.funsuite.AnyFunSuite.fail(AnyFunSuite.scala:1564) [info] at org.apache.spark.sql.catalyst.expressions.ExpressionEvalHelper.checkEvaluationWithoutCodegen(ExpressionEvalHelper.scala:209) [info] at 
org.apache.spark.sql.catalyst.expressions.ExpressionEvalHelper.checkEvaluationWithoutCodegen$(ExpressionEvalHelper.scala:199) [info] at org.apache.spark.sql.catalyst.expressions.CastSuiteBase.checkEvaluationWithoutCodegen(CastSuiteBase.scala:49) [info] at org.apache.spark.sql.catalyst.expressions.ExpressionEvalHelper.checkEvaluation(ExpressionEvalHelper.scala:87) [info] at org.apache.spark.sql.catalyst.expressions.ExpressionEvalHelper.checkEvaluation$(ExpressionEvalHelper.scala:82) [info] at org.apache.spark.sql.catalyst.expressions.CastSuiteBase.checkEvaluation(CastSuiteBase.scala:49) [info] at org.apache.spark.sql.catalyst.expressions.CastSuiteBase.$anonfun$new$198(CastSuiteBase.scala:893) [info] at org.apache.spark.sql.catalyst.expressions.CastSuiteBase.$anonfun$new$198$adapted(CastSuiteBase.scala:890) [info] at scala.collection.immutable.List.foreach(List.scala:431) [info] at org.apache.spark.sql.catalyst.expressions.CastSuiteBase.$anonfun$new$197(CastSuiteBase.scala:890) [info] at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23) [info] at org.apache.spark.sql.catalyst.util.DateTimeTestUtils$.withDefaultTimeZone(DateTimeTestUtils.scala:61) [info] at org.apache.spark.sql.catalyst.expressions.CastSuiteBase.$anonfun$new$196(CastSuiteBase.scala:890) [info] at org.apache.spark.sql.catalyst.expressions.CastSuiteBase.$anonfun$new$196$adapted(CastSuiteBase.scala:888) [info] at scala.collection.immutable.List.foreach(List.scala:431) [info] at org.apache.spark.sql.catalyst.expressions.CastSuiteBase.$anonfun$new$195(CastSuiteBase.scala:888) [info] at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23) [info] at org.scalatest.OutcomeOf.outcomeOf(OutcomeOf.scala:85) [info] at org.scalatest.OutcomeOf.outcomeOf$(OutcomeOf.scala:83) [info] at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104) [info] at org.scalatest.Transformer.apply(Transformer.scala:22) [info] at org.scalatest.Transformer.apply(Transformer.scala:20) [info] at 
org.scalatest.funsuite.AnyFunSuiteLike$$anon$1.apply(AnyFunSuiteLike.scala:226) [info] at org.apache.spark.SparkFunSuite.withFixture(SparkFunSuite.scala:207) [info] at org.scalatest.funsuite.AnyFunSuiteLike.invokeWithFixture$1(AnyFunSuiteLike.scala:224) [info] at org.scalatest.funsuite.AnyFunSuiteLike.$anonfun$runTest$1(AnyFunSuiteLike.scala:236) [info] at org.scalatest.SuperEngine.runTestImpl(Engine.scala:306) [info] at org.scalatest.funsuite.AnyFunSuiteLike.runTest(AnyFunSuiteLike.scala:236) [info] at org.scalatest.funsuite.AnyFunSuiteLike.runTest$(AnyFunSuiteLike.scala:218) [info] at org.apache.spark.SparkFunSuite.org$scalatest$BeforeAndAfterEach$$super$runTest(SparkFunSuite.scala:66) [info] at org.scalatest.BeforeAndAfterEach.runTest(BeforeAndAfterEach.scala:234) [info] at org.scalatest.B
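As context for the failure above (an illustration, not part of the report): the truncated `expected` value is consistent with the offset of 0001-01-01 00:00:00 from the Unix epoch in the proleptic Gregorian calendar used by java.time, which can be checked independently in plain Python:

```python
from datetime import datetime

# Python's datetime, like java.time, uses the proleptic Gregorian calendar,
# so the distance from 0001-01-01 to the Unix epoch can be computed directly.
epoch = datetime(1970, 1, 1)
seconds = int((datetime(1, 1, 1) - epoch).total_seconds())
print(seconds)  # -62135596800 (719162 days before the epoch)
```

In microseconds that is -62135596800000000, matching the truncated `expected` value in the log; the `actual` value differs by an amount on the order of a time-zone offset, which would fit a tzdata change in the newer JDK builds rather than a calendar change.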
[jira] [Updated] (SPARK-40851) TimestampFormatter behavior changed when using the latest Java 8/11/17
[ https://issues.apache.org/jira/browse/SPARK-40851?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yang Jie updated SPARK-40851: - Summary: TimestampFormatter behavior changed when using the latest Java 8/11/17 (was: TimestampFormatter behavior changed when using the latest Java) > TimestampFormatter behavior changed when using the latest Java 8/11/17 > -- > > Key: SPARK-40851 > URL: https://issues.apache.org/jira/browse/SPARK-40851 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.4.0 >Reporter: Yang Jie >Priority: Blocker > > {code:java} > [info] *** 12 TESTS FAILED *** > [error] Failed: Total 6746, Failed 12, Errors 0, Passed 6734, Ignored 5 > [error] Failed tests: > [error] org.apache.spark.sql.catalyst.expressions.CastWithAnsiOffSuite > [error] org.apache.spark.sql.catalyst.util.TimestampFormatterSuite > [error] org.apache.spark.sql.catalyst.expressions.CastWithAnsiOnSuite > [error] org.apache.spark.sql.catalyst.util.RebaseDateTimeSuite > [error] org.apache.spark.sql.catalyst.expressions.TryCastSuite {code} > We can reproduce this issue using Java 8u352/11.0.17/17.0.5, the test errors > are similar to the following: > run > {code:java} > build/sbt clean "catalyst/testOnly *CastWithAnsiOffSuite" {code} > with 8u352: > {code:java} > [info] - SPARK-35711: cast timestamp without time zone to timestamp with > local time zone *** FAILED *** (190 milliseconds) > [info] Incorrect evaluation (codegen off): cast(0001-01-01 00:00:00 as > timestamp), actual: -6213561782000, expected: -621355968 > (ExpressionEvalHelper.scala:209) > [info] org.scalatest.exceptions.TestFailedException: > [info] at > org.scalatest.Assertions.newAssertionFailedException(Assertions.scala:472) > [info] at > org.scalatest.Assertions.newAssertionFailedException$(Assertions.scala:471) > [info] at > org.scalatest.funsuite.AnyFunSuite.newAssertionFailedException(AnyFunSuite.scala:1564) > [info] at org.scalatest.Assertions.fail(Assertions.scala:933) > 
[info] at org.scalatest.Assertions.fail$(Assertions.scala:929) > [info] at org.scalatest.funsuite.AnyFunSuite.fail(AnyFunSuite.scala:1564) > [info] at > org.apache.spark.sql.catalyst.expressions.ExpressionEvalHelper.checkEvaluationWithoutCodegen(ExpressionEvalHelper.scala:209) > [info] at > org.apache.spark.sql.catalyst.expressions.ExpressionEvalHelper.checkEvaluationWithoutCodegen$(ExpressionEvalHelper.scala:199) > [info] at > org.apache.spark.sql.catalyst.expressions.CastSuiteBase.checkEvaluationWithoutCodegen(CastSuiteBase.scala:49) > [info] at > org.apache.spark.sql.catalyst.expressions.ExpressionEvalHelper.checkEvaluation(ExpressionEvalHelper.scala:87) > [info] at > org.apache.spark.sql.catalyst.expressions.ExpressionEvalHelper.checkEvaluation$(ExpressionEvalHelper.scala:82) > [info] at > org.apache.spark.sql.catalyst.expressions.CastSuiteBase.checkEvaluation(CastSuiteBase.scala:49) > [info] at > org.apache.spark.sql.catalyst.expressions.CastSuiteBase.$anonfun$new$198(CastSuiteBase.scala:893) > [info] at > org.apache.spark.sql.catalyst.expressions.CastSuiteBase.$anonfun$new$198$adapted(CastSuiteBase.scala:890) > [info] at scala.collection.immutable.List.foreach(List.scala:431) > [info] at > org.apache.spark.sql.catalyst.expressions.CastSuiteBase.$anonfun$new$197(CastSuiteBase.scala:890) > [info] at > scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23) > [info] at > org.apache.spark.sql.catalyst.util.DateTimeTestUtils$.withDefaultTimeZone(DateTimeTestUtils.scala:61) > [info] at > org.apache.spark.sql.catalyst.expressions.CastSuiteBase.$anonfun$new$196(CastSuiteBase.scala:890) > [info] at > org.apache.spark.sql.catalyst.expressions.CastSuiteBase.$anonfun$new$196$adapted(CastSuiteBase.scala:888) > [info] at scala.collection.immutable.List.foreach(List.scala:431) > [info] at > org.apache.spark.sql.catalyst.expressions.CastSuiteBase.$anonfun$new$195(CastSuiteBase.scala:888) > [info] at > 
scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23) > [info] at org.scalatest.OutcomeOf.outcomeOf(OutcomeOf.scala:85) > [info] at org.scalatest.OutcomeOf.outcomeOf$(OutcomeOf.scala:83) > [info] at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104) > [info] at org.scalatest.Transformer.apply(Transformer.scala:22) > [info] at org.scalatest.Transformer.apply(Transformer.scala:20) > [info] at > org.scalatest.funsuite.AnyFunSuiteLike$$anon$1.apply(AnyFunSuiteLike.scala:226) > [info] at > org.apache.spark.SparkFunSuite.withFixture(SparkFunSuite.scala:207) > [info] at > org.scalatest.funsuite.AnyFunSuiteLike.invokeWithFixture$1(AnyFunSuiteLike.scala:224) > [info] at > org.scalatest.funsuite.AnyFunSu
[jira] [Created] (SPARK-40850) Tests for Spark SQL Interpreted Queries may execute Codegen
Holden Karau created SPARK-40850: - Summary: Tests for Spark SQL Interpreted Queries may execute Codegen Key: SPARK-40850 URL: https://issues.apache.org/jira/browse/SPARK-40850 Project: Spark Issue Type: Bug Components: SQL, Tests Affects Versions: 3.3.0, 3.3.1 Reporter: Holden Karau We also need to set SQLConf.WHOLESTAGE_CODEGEN_ENABLED.key -> "false" in PlanTest
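For illustration only (`PlanTest` itself is a Scala test helper, so this is a hypothetical PySpark sketch of the same switch): `SQLConf.WHOLESTAGE_CODEGEN_ENABLED.key` resolves to the SQL conf key `spark.sql.codegen.wholeStage`, and disabling it forces plans through the interpreted path.

```python
# Hypothetical sketch, not the ticket's actual fix: build a session with
# whole-stage codegen disabled so queries execute in interpreted mode.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .master("local[1]")
    .config("spark.sql.codegen.wholeStage", "false")  # SQLConf.WHOLESTAGE_CODEGEN_ENABLED.key
    .getOrCreate()
)
```

The ticket proposes setting the equivalent conf pair inside `PlanTest` so every planner test inherits it.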
[jira] [Resolved] (SPARK-40847) SPARK: Load Data from Dataframe or RDD to DynamoDB
[ https://issues.apache.org/jira/browse/SPARK-40847?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-40847. -- Resolution: Invalid Let's ask questions on the Spark user mailing list before filing an issue. You would be able to get a better answer there. > SPARK: Load Data from Dataframe or RDD to DynamoDB > --- > > Key: SPARK-40847 > URL: https://issues.apache.org/jira/browse/SPARK-40847 > Project: Spark > Issue Type: Question > Components: Deploy >Affects Versions: 2.1.1 >Reporter: Vivek Garg >Priority: Major > Labels: spark > > I am using Spark 2.1 on EMR and I have a DataFrame like this: > ClientNum | Value_1 | Value_2 | Value_3 | Value_4 > 14 | A | B | C | null > 19 | X | Y | null | null > 21 | R | null | null | null > I want to load data into a DynamoDB table with ClientNum as the key, following: > Analyze Your Data on Amazon DynamoDB with Apache Spark > Using Spark SQL for ETL > Here is the code I tried: > var jobConf = new JobConf(sc.hadoopConfiguration) > jobConf.set("dynamodb.servicename", "dynamodb") > jobConf.set("dynamodb.input.tableName", "table_name") > jobConf.set("dynamodb.output.tableName", "table_name") > jobConf.set("dynamodb.endpoint", "dynamodb.eu-west-1.amazonaws.com") > jobConf.set("dynamodb.regionid", "eu-west-1") > jobConf.set("dynamodb.throughput.read", "1") > jobConf.set("dynamodb.throughput.read.percent", "1") > jobConf.set("dynamodb.throughput.write", "1") > jobConf.set("dynamodb.throughput.write.percent", "1") > jobConf.set("mapred.output.format.class", > "org.apache.hadoop.dynamodb.write.DynamoDBOutputFormat") > jobConf.set("mapred.input.format.class", > "org.apache.hadoop.dynamodb.read.DynamoDBInputFormat") > #Import Data > val df = sqlContext.read.format("com.databricks.spark.csv").option("header", > "true").option("inferSchema", "true").load(path) > I performed a transformation to have an RDD that matches the types that the > DynamoDB custom output format knows how to write.
The custom output format > expects a tuple containing the Text and DynamoDBItemWritable types. > Create a new RDD with those types in it, in the following map call: > #Convert the dataframe to rdd > val df_rdd = df.rdd > > df_rdd: org.apache.spark.rdd.RDD[org.apache.spark.sql.Row] = > > MapPartitionsRDD[10] at rdd at :41 > #Print first rdd > df_rdd.take(1) > > res12: Array[org.apache.spark.sql.Row] = Array([14,A,B,C,null]) > var ddbInsertFormattedRDD = df_rdd.map(a => > { var ddbMap = new HashMap[String, AttributeValue]() var ClientNum = new > AttributeValue() ClientNum.setN(a.get(0).toString) ddbMap.put("ClientNum", > ClientNum) var Value_1 = new AttributeValue() Value_1.setS(a.get(1).toString) > ddbMap.put("Value_1", Value_1) var Value_2 = new AttributeValue() > Value_2.setS(a.get(2).toString) ddbMap.put("Value_2", Value_2) var Value_3 = > new AttributeValue() Value_3.setS(a.get(3).toString) ddbMap.put("Value_3", > Value_3) var Value_4 = new AttributeValue() Value_4.setS(a.get(4).toString) > ddbMap.put("Value_4", Value_4) var item = new DynamoDBItemWritable() > item.setItem(ddbMap) (new Text(""), item) } > ) > This last call uses the job configuration that defines the EMR-DDB connector > to write out the new RDD you created in the expected format: > ddbInsertFormattedRDD.saveAsHadoopDataset(jobConf) > fails with the following error: > Caused by: java.lang.NullPointerException > Null values caused the error; if I try with only ClientNum and Value_1, it works and > data is correctly inserted into the DynamoDB table. > Thank you.
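The NullPointerException comes from calling `setS` on a null column value (rows 19 and 21 have nulls). A minimal illustration of the guard, sketched in plain Python rather than the Scala above (the function name and the `{"S": ...}` attribute shape are illustrative, not the connector's API): skip null columns when building the item map, since DynamoDB attributes may be absent but must not be null.

```python
def to_dynamodb_item(columns, row):
    """Build a DynamoDB-style attribute map, skipping null (None) values.

    Calling setS(null) on an AttributeValue is what raises the NPE in the
    Scala snippet above; omitting absent attributes avoids it.
    """
    item = {}
    for name, value in zip(columns, row):
        if value is None:  # guard: drop the attribute instead of writing null
            continue
        item[name] = {"S": str(value)}
    return item

# Row 14 from the example: a row with no nulls keeps all five attributes,
# while Value_4 = None on other rows would simply be omitted.
print(to_dynamodb_item(
    ["ClientNum", "Value_1", "Value_2", "Value_3", "Value_4"],
    [14, "A", "B", "C", None]))
```

The same pattern in the Scala map call would wrap each `setS` in a null check (or `Option`) before `ddbMap.put`.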
[jira] [Assigned] (SPARK-40539) PySpark readwriter API parity for Spark Connect
[ https://issues.apache.org/jira/browse/SPARK-40539?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-40539: --- Assignee: Rui Wang > PySpark readwriter API parity for Spark Connect > --- > > Key: SPARK-40539 > URL: https://issues.apache.org/jira/browse/SPARK-40539 > Project: Spark > Issue Type: Sub-task > Components: Connect >Affects Versions: 3.4.0 >Reporter: Martin Grund >Assignee: Rui Wang >Priority: Major > Fix For: 3.4.0 > > > Spark Connect / PySpark ReadWriter parity.
[jira] [Resolved] (SPARK-39590) Python API Parity in Structure Streaming
[ https://issues.apache.org/jira/browse/SPARK-39590?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jungtaek Lim resolved SPARK-39590. -- Resolution: Duplicate Closed as a duplicate. > Python API Parity in Structure Streaming > > > Key: SPARK-39590 > URL: https://issues.apache.org/jira/browse/SPARK-39590 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Affects Versions: 3.3.0 >Reporter: Boyang Jerry Peng >Priority: Major > > New APIs in Structured Streaming tend to get added to Java/Scala first. This > creates a situation where the Python APIs have fallen behind. For example, > map/flatMapGroupsWithState is not supported in PySpark. We need the PySpark > API to catch up with the Java/Scala APIs and, where necessary, provide > tighter integrations with native Python data processing frameworks such as > Pandas.
[jira] [Resolved] (SPARK-40539) PySpark readwriter API parity for Spark Connect
[ https://issues.apache.org/jira/browse/SPARK-40539?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-40539. - Fix Version/s: 3.4.0 Resolution: Fixed Issue resolved by pull request 38086 [https://github.com/apache/spark/pull/38086] > PySpark readwriter API parity for Spark Connect > --- > > Key: SPARK-40539 > URL: https://issues.apache.org/jira/browse/SPARK-40539 > Project: Spark > Issue Type: Sub-task > Components: Connect >Affects Versions: 3.4.0 >Reporter: Martin Grund >Priority: Major > Fix For: 3.4.0 > > > Spark Connect / PySpark ReadWriter parity.
[jira] [Updated] (SPARK-40025) Project Lightspeed: Faster and Simpler Stream Processing with Apache Spark
[ https://issues.apache.org/jira/browse/SPARK-40025?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jungtaek Lim updated SPARK-40025: - Description: Project Lightspeed is an umbrella project aimed at improving a couple of key aspects of Spark Streaming: * Improving the latency and ensuring it is predictable * Enhancing functionality for processing data with new operators and APIs Umbrella Jira to track all tickets under Project Lightspeed SPARK-39585 - Multiple Stateful Operators in Structured Streaming SPARK-39586 - Advanced Windowing in Structured Streaming SPARK-39587 - Schema Evolution for Stateful Pipelines SPARK-39589 - Asynchronous I/O support SPARK-40431 - Python API for Arbitrary Stateful Processing SPARK-39591 - Offset Management Improvements SPARK-40849 - Async log purge SPARK-39592 - Asynchronous State Checkpointing SPARK-39593 - Configurable State Checkpointing Frequency was: Project Lightspeed is an umbrella project aimed at improving a couple of key aspects of Spark Streaming: * Improving the latency and ensuring it is predictable * Enhancing functionality for processing data with new operators and APIs Umbrella Jira to track all tickets under Project Lightspeed SPARK-39585 - Multiple Stateful Operators in Structured Streaming SPARK-39586 - Advanced Windowing in Structured Streaming SPARK-39587 - Schema Evolution for Stateful Pipelines SPARK-39589 - Asynchronous I/O support SPARK-39590 - Python API for Arbitrary Stateful Processing SPARK-39591 - Offset Management Improvements SPARK-40849 - Async log purge SPARK-39592 - Asynchronous State Checkpointing SPARK-39593 - Configurable State Checkpointing Frequency > Project Lightspeed: Faster and Simpler Stream Processing with Apache Spark > -- > > Key: SPARK-40025 > URL: https://issues.apache.org/jira/browse/SPARK-40025 > Project: Spark > Issue Type: Umbrella > Components: Structured Streaming >Affects Versions: 3.2.2 >Reporter: Boyang Jerry Peng >Priority: Major > > Project Lightspeed 
is an umbrella project aimed at improving a couple of key > aspects of Spark Streaming: > * Improving the latency and ensuring it is predictable > * Enhancing functionality for processing data with new operators and APIs > > Umbrella Jira to track all tickets under Project Lightspeed > SPARK-39585 - Multiple Stateful Operators in Structured Streaming > SPARK-39586 - Advanced Windowing in Structured Streaming > SPARK-39587 - Schema Evolution for Stateful Pipelines > SPARK-39589 - Asynchronous I/O support > SPARK-40431 - Python API for Arbitrary Stateful Processing > SPARK-39591 - Offset Management Improvements > SPARK-40849 - Async log purge > SPARK-39592 - Asynchronous State Checkpointing > SPARK-39593 - Configurable State Checkpointing Frequency
[jira] [Resolved] (SPARK-40656) Schema-registry support for Protobuf format
[ https://issues.apache.org/jira/browse/SPARK-40656?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Raghu Angadi resolved SPARK-40656. -- Resolution: Won't Do > Schema-registry support for Protobuf format > --- > > Key: SPARK-40656 > URL: https://issues.apache.org/jira/browse/SPARK-40656 > Project: Spark > Issue Type: Improvement > Components: Protobuf, Structured Streaming >Affects Versions: 3.3.0 >Reporter: Raghu Angadi >Priority: Major > > Add support for reading protobuf schema (definition) from Confluent > schema-registry.
[jira] [Commented] (SPARK-40659) Schema evolution for protobuf (and Avro too?)
[ https://issues.apache.org/jira/browse/SPARK-40659?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17620661#comment-17620661 ] Raghu Angadi commented on SPARK-40659: -- Right, I don't think that is the reason. Databricks might port Avro schema-registry support to open-source later. Maybe the team that added it didn't get around to open-sourcing it. > Schema evolution for protobuf (and Avro too?) > - > > Key: SPARK-40659 > URL: https://issues.apache.org/jira/browse/SPARK-40659 > Project: Spark > Issue Type: Improvement > Components: Protobuf, Structured Streaming >Affects Versions: 3.3.0 >Reporter: Raghu Angadi >Priority: Major > > Protobuf & Avro should support schema evolution in streaming. We need to > throw a specific error message when we detect a newer version of the schema > in the schema registry. > A couple of options for detecting a version change at runtime: > * How do we detect a newer version from the schema registry? It is contacted only > during planning currently. > * We could detect version ids in incoming messages. > ** What if the id in the incoming message is newer than what our > schema-registry reports after the restart? > *** This indicates delayed syncs between the customer's schema-registry servers > (should be rare). We can keep erroring out until it is fixed. > *** Make sure we log the schema id used during planning.
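The runtime check the ticket sketches can be illustrated as follows (hypothetical names; the actual detection mechanism is still an open question in the discussion above):

```python
def check_message_schema_id(planned_id: int, message_id: int) -> None:
    """Raise when an incoming message carries a newer schema id than the one
    resolved from the schema registry at planning time.

    Assumes registry ids are monotonically increasing versions; messages
    written with the planned or an older schema pass through unchanged.
    """
    if message_id > planned_id:
        raise RuntimeError(
            f"Schema evolved: message written with schema id {message_id}, "
            f"but id {planned_id} was resolved at planning time; "
            "restart the query to pick up the new schema"
        )

check_message_schema_id(7, 7)  # same version: fine
check_message_schema_id(7, 5)  # older message: fine
```

Logging the planned id (the ticket's last point) is what makes the error above diagnosable when registry replicas are out of sync.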
[jira] [Commented] (SPARK-40658) Protobuf v2 & v3 support
[ https://issues.apache.org/jira/browse/SPARK-40658?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17620659#comment-17620659 ] Raghu Angadi commented on SPARK-40658: -- That's awesome! Let's look at both and merge them. > Protobuf v2 & v3 support > > > Key: SPARK-40658 > URL: https://issues.apache.org/jira/browse/SPARK-40658 > Project: Spark > Issue Type: Improvement > Components: Protobuf, Structured Streaming >Affects Versions: 3.3.0 >Reporter: Raghu Angadi >Priority: Major > > We want to ensure Protobuf functions support both Protobuf version 2 and > version 3 schemas (e.g. descriptor file or compiled classes with v2 and v3). > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-40658) Protobuf v2 & v3 support
[ https://issues.apache.org/jira/browse/SPARK-40658?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17620657#comment-17620657 ] Mohan Parthasarathy commented on SPARK-40658: - [~rangadi] I did base it off your latest PR. I am essentially running most of the protobuf functions suite for v2 and v3. I also added a test case for defaultValues. I will issue a PR once merged. > Protobuf v2 & v3 support > > > Key: SPARK-40658 > URL: https://issues.apache.org/jira/browse/SPARK-40658 > Project: Spark > Issue Type: Improvement > Components: Protobuf, Structured Streaming >Affects Versions: 3.3.0 >Reporter: Raghu Angadi >Priority: Major > > We want to ensure Protobuf functions support both Protobuf version 2 and > version 3 schemas (e.g. descriptor file or compiled classes with v2 and v3). > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-40658) Protobuf v2 & v3 support
[ https://issues.apache.org/jira/browse/SPARK-40658?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17620655#comment-17620655 ] Raghu Angadi commented on SPARK-40658: -- [~mposdev21] More tests are good. Let's add them. In my branch on top of the [Java support PR|https://github.com/apache/spark/pull/38286], I am running pretty much all of the current tests with V2 and V3 protobufs (both with Java classes & descriptor files). I will send that PR soon after the Java support PR merges. I am able to get the Maven build to generate V2 and V3 classes and descriptor sets. I haven't figured out how to do the same with SBT. > Protobuf v2 & v3 support > > > Key: SPARK-40658 > URL: https://issues.apache.org/jira/browse/SPARK-40658 > Project: Spark > Issue Type: Improvement > Components: Protobuf, Structured Streaming >Affects Versions: 3.3.0 >Reporter: Raghu Angadi >Priority: Major > > We want to ensure Protobuf functions support both Protobuf version 2 and > version 3 schemas (e.g. descriptor file or compiled classes with v2 and v3). > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-40846) GA test failed with Java 8u352
[ https://issues.apache.org/jira/browse/SPARK-40846?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-40846: Assignee: Yang Jie > GA test failed with Java 8u352 > -- > > Key: SPARK-40846 > URL: https://issues.apache.org/jira/browse/SPARK-40846 > Project: Spark > Issue Type: Improvement > Components: Tests >Affects Versions: 3.4.0 >Reporter: Yang Jie >Assignee: Yang Jie >Priority: Major > > catalyst test failed > {code:java} > [info] *** 12 TESTS FAILED *** > [error] Failed: Total 6746, Failed 12, Errors 0, Passed 6734, Ignored 5 > [error] Failed tests: > [error] org.apache.spark.sql.catalyst.expressions.CastWithAnsiOffSuite > [error] org.apache.spark.sql.catalyst.util.TimestampFormatterSuite > [error] org.apache.spark.sql.catalyst.expressions.CastWithAnsiOnSuite > [error] org.apache.spark.sql.catalyst.util.RebaseDateTimeSuite > [error] org.apache.spark.sql.catalyst.expressions.TryCastSuite {code} > run TimestampFormatterSuite with 8u352 locally: > > {code:java} > [info] - SPARK-31557: rebasing in legacy formatters/parsers *** FAILED *** > (21 milliseconds) > [info] zoneId = Antarctica/Vostok 1000-01-01T06:52:23 did not equal > 1000-01-01T01:02:03 (TimestampFormatterSuite.scala:281) > [info] org.scalatest.exceptions.TestFailedException: > [info] at > org.scalatest.Assertions.newAssertionFailedException(Assertions.scala:472) > [info] at > org.scalatest.Assertions.newAssertionFailedException$(Assertions.scala:471) > [info] at > org.scalatest.Assertions$.newAssertionFailedException(Assertions.scala:1231) > [info] at > org.scalatest.Assertions$AssertionsHelper.macroAssert(Assertions.scala:1295) > [info] at > org.apache.spark.sql.catalyst.util.TimestampFormatterSuite.$anonfun$new$33(TimestampFormatterSuite.scala:281) > [info] at scala.collection.Iterator.foreach(Iterator.scala:943) > [info] at scala.collection.Iterator.foreach$(Iterator.scala:943) > [info] at scala.collection.AbstractIterator.foreach(Iterator.scala:1431) > 
[info] at scala.collection.IterableLike.foreach(IterableLike.scala:74) > [info] at scala.collection.IterableLike.foreach$(IterableLike.scala:73) > [info] at scala.collection.AbstractIterable.foreach(Iterable.scala:56) > [info] at > org.apache.spark.sql.catalyst.util.TimestampFormatterSuite.$anonfun$new$31(TimestampFormatterSuite.scala:280) > [info] at > scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23) > [info] at org.scalatest.Assertions.withClue(Assertions.scala:1065) > [info] at org.scalatest.Assertions.withClue$(Assertions.scala:1052) > [info] at > org.scalatest.funsuite.AnyFunSuite.withClue(AnyFunSuite.scala:1564) > [info] at > org.apache.spark.sql.catalyst.util.TimestampFormatterSuite.$anonfun$new$30(TimestampFormatterSuite.scala:271) > [info] at > scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23) > [info] at > org.apache.spark.sql.catalyst.util.DateTimeTestUtils$.withDefaultTimeZone(DateTimeTestUtils.scala:61) > [info] at > org.apache.spark.sql.catalyst.util.TimestampFormatterSuite.$anonfun$new$29(TimestampFormatterSuite.scala:271) > [info] at > org.apache.spark.sql.catalyst.plans.SQLHelper.withSQLConf(SQLHelper.scala:54) > [info] at > org.apache.spark.sql.catalyst.plans.SQLHelper.withSQLConf$(SQLHelper.scala:38) > [info] at > org.apache.spark.sql.catalyst.util.TimestampFormatterSuite.withSQLConf(TimestampFormatterSuite.scala:31) > [info] at > org.apache.spark.sql.catalyst.util.TimestampFormatterSuite.$anonfun$new$28(TimestampFormatterSuite.scala:270) > [info] at > org.apache.spark.sql.catalyst.util.TimestampFormatterSuite.$anonfun$new$28$adapted(TimestampFormatterSuite.scala:268) > [info] at scala.collection.immutable.List.foreach(List.scala:431) > [info] at > org.apache.spark.sql.catalyst.util.TimestampFormatterSuite.$anonfun$new$27(TimestampFormatterSuite.scala:268) > [info] at > org.apache.spark.sql.catalyst.plans.SQLHelper.withSQLConf(SQLHelper.scala:54) > [info] at > 
org.apache.spark.sql.catalyst.plans.SQLHelper.withSQLConf$(SQLHelper.scala:38) > [info] at > org.apache.spark.sql.catalyst.util.TimestampFormatterSuite.withSQLConf(TimestampFormatterSuite.scala:31) > [info] at > org.apache.spark.sql.catalyst.util.TimestampFormatterSuite.$anonfun$new$26(TimestampFormatterSuite.scala:268) > [info] at > scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23) > [info] at org.scalatest.OutcomeOf.outcomeOf(OutcomeOf.scala:85) > [info] at org.scalatest.OutcomeOf.outcomeOf$(OutcomeOf.scala:83) > [info] at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104) > [info] at org.scalatest.Transformer.apply(Transformer.scala:22) > [info] at org.scalatest.Transformer.apply(Transformer.scala:2
[jira] [Resolved] (SPARK-40846) GA test failed with Java 8u352
[ https://issues.apache.org/jira/browse/SPARK-40846?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-40846. -- Fix Version/s: 3.4.0 Resolution: Fixed Issue resolved by pull request 38311 [https://github.com/apache/spark/pull/38311] > GA test failed with Java 8u352 > -- > > Key: SPARK-40846 > URL: https://issues.apache.org/jira/browse/SPARK-40846 > Project: Spark > Issue Type: Improvement > Components: Tests >Affects Versions: 3.4.0 >Reporter: Yang Jie >Assignee: Yang Jie >Priority: Major > Fix For: 3.4.0 > > > catalyst test failed > {code:java} > [info] *** 12 TESTS FAILED *** > [error] Failed: Total 6746, Failed 12, Errors 0, Passed 6734, Ignored 5 > [error] Failed tests: > [error] org.apache.spark.sql.catalyst.expressions.CastWithAnsiOffSuite > [error] org.apache.spark.sql.catalyst.util.TimestampFormatterSuite > [error] org.apache.spark.sql.catalyst.expressions.CastWithAnsiOnSuite > [error] org.apache.spark.sql.catalyst.util.RebaseDateTimeSuite > [error] org.apache.spark.sql.catalyst.expressions.TryCastSuite {code} > run TimestampFormatterSuite with 8u352 locally: > > {code:java} > [info] - SPARK-31557: rebasing in legacy formatters/parsers *** FAILED *** > (21 milliseconds) > [info] zoneId = Antarctica/Vostok 1000-01-01T06:52:23 did not equal > 1000-01-01T01:02:03 (TimestampFormatterSuite.scala:281) > [info] org.scalatest.exceptions.TestFailedException: > [info] at > org.scalatest.Assertions.newAssertionFailedException(Assertions.scala:472) > [info] at > org.scalatest.Assertions.newAssertionFailedException$(Assertions.scala:471) > [info] at > org.scalatest.Assertions$.newAssertionFailedException(Assertions.scala:1231) > [info] at > org.scalatest.Assertions$AssertionsHelper.macroAssert(Assertions.scala:1295) > [info] at > org.apache.spark.sql.catalyst.util.TimestampFormatterSuite.$anonfun$new$33(TimestampFormatterSuite.scala:281) > [info] at scala.collection.Iterator.foreach(Iterator.scala:943) > [info] at 
scala.collection.Iterator.foreach$(Iterator.scala:943) > [info] at scala.collection.AbstractIterator.foreach(Iterator.scala:1431) > [info] at scala.collection.IterableLike.foreach(IterableLike.scala:74) > [info] at scala.collection.IterableLike.foreach$(IterableLike.scala:73) > [info] at scala.collection.AbstractIterable.foreach(Iterable.scala:56) > [info] at > org.apache.spark.sql.catalyst.util.TimestampFormatterSuite.$anonfun$new$31(TimestampFormatterSuite.scala:280) > [info] at > scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23) > [info] at org.scalatest.Assertions.withClue(Assertions.scala:1065) > [info] at org.scalatest.Assertions.withClue$(Assertions.scala:1052) > [info] at > org.scalatest.funsuite.AnyFunSuite.withClue(AnyFunSuite.scala:1564) > [info] at > org.apache.spark.sql.catalyst.util.TimestampFormatterSuite.$anonfun$new$30(TimestampFormatterSuite.scala:271) > [info] at > scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23) > [info] at > org.apache.spark.sql.catalyst.util.DateTimeTestUtils$.withDefaultTimeZone(DateTimeTestUtils.scala:61) > [info] at > org.apache.spark.sql.catalyst.util.TimestampFormatterSuite.$anonfun$new$29(TimestampFormatterSuite.scala:271) > [info] at > org.apache.spark.sql.catalyst.plans.SQLHelper.withSQLConf(SQLHelper.scala:54) > [info] at > org.apache.spark.sql.catalyst.plans.SQLHelper.withSQLConf$(SQLHelper.scala:38) > [info] at > org.apache.spark.sql.catalyst.util.TimestampFormatterSuite.withSQLConf(TimestampFormatterSuite.scala:31) > [info] at > org.apache.spark.sql.catalyst.util.TimestampFormatterSuite.$anonfun$new$28(TimestampFormatterSuite.scala:270) > [info] at > org.apache.spark.sql.catalyst.util.TimestampFormatterSuite.$anonfun$new$28$adapted(TimestampFormatterSuite.scala:268) > [info] at scala.collection.immutable.List.foreach(List.scala:431) > [info] at > org.apache.spark.sql.catalyst.util.TimestampFormatterSuite.$anonfun$new$27(TimestampFormatterSuite.scala:268) > [info] at > 
org.apache.spark.sql.catalyst.plans.SQLHelper.withSQLConf(SQLHelper.scala:54) > [info] at > org.apache.spark.sql.catalyst.plans.SQLHelper.withSQLConf$(SQLHelper.scala:38) > [info] at > org.apache.spark.sql.catalyst.util.TimestampFormatterSuite.withSQLConf(TimestampFormatterSuite.scala:31) > [info] at > org.apache.spark.sql.catalyst.util.TimestampFormatterSuite.$anonfun$new$26(TimestampFormatterSuite.scala:268) > [info] at > scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23) > [info] at org.scalatest.OutcomeOf.outcomeOf(OutcomeOf.scala:85) > [info] at org.scalatest.OutcomeOf.outcomeOf$(OutcomeOf.scala:83) > [info] at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:10
[jira] [Commented] (SPARK-40658) Protobuf v2 & v3 support
[ https://issues.apache.org/jira/browse/SPARK-40658?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17620651#comment-17620651 ] Mohan Parthasarathy commented on SPARK-40658: - [~rangadi] I started adding some test cases recently, and they are mostly working except for one. We can discuss the test cases further if you are interested. > Protobuf v2 & v3 support > > > Key: SPARK-40658 > URL: https://issues.apache.org/jira/browse/SPARK-40658 > Project: Spark > Issue Type: Improvement > Components: Protobuf, Structured Streaming >Affects Versions: 3.3.0 >Reporter: Raghu Angadi >Priority: Major > > We want to ensure Protobuf functions support both Protobuf version 2 and > version 3 schemas (e.g. descriptor file or compiled classes with v2 and v3). > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-40849) Async log purge
[ https://issues.apache.org/jira/browse/SPARK-40849?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17620641#comment-17620641 ] Apache Spark commented on SPARK-40849: -- User 'jerrypeng' has created a pull request for this issue: https://github.com/apache/spark/pull/38313 > Async log purge > --- > > Key: SPARK-40849 > URL: https://issues.apache.org/jira/browse/SPARK-40849 > Project: Spark > Issue Type: New Feature > Components: Structured Streaming >Affects Versions: 3.4.0 >Reporter: Boyang Jerry Peng >Priority: Major > > Purging old entries in both the offset log and commit log will be done > asynchronously. > > For every micro-batch, older entries in both offset log and commit log are > deleted. This is done so that the offset log and commit log do not > continually grow. Please reference logic here > > [https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/MicroBatchExecution.scala#L539] > > > The time spent performing these log purges is grouped with the “walCommit” > execution time in the StreamingProgressListener metrics. Around two thirds > of the “walCommit” execution time is spent performing these purge operations, > so making them asynchronous will also reduce latency. Also, we do > not necessarily need to perform the purges every micro-batch. When these > purges are executed asynchronously, they do not need to block micro-batch > execution and we don’t need to start another purge until the current one is > finished. The purges can happen essentially in the background. We will just > have to synchronize the purges with the offset WAL commits and completion > commits so that we don’t have concurrent modifications of the offset log and > commit log. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
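The ticket description above proposes three things: move purges off the micro-batch path, skip starting a new purge while one is in flight, and synchronize purges with WAL/completion commits so the logs are never modified concurrently. A minimal language-agnostic sketch of that scheme (Python here for brevity; the class and method names are hypothetical, not Spark's actual implementation, which is Scala):

```python
import threading
from concurrent.futures import ThreadPoolExecutor

class AsyncLogPurger:
    """Sketch: purge old offset/commit log entries in the background.

    A single lock serializes purges with commits, so the two never
    mutate the logs concurrently; an event flag ensures at most one
    purge is in flight at a time.
    """
    def __init__(self, offset_log, commit_log):
        self._executor = ThreadPoolExecutor(max_workers=1)
        self._log_lock = threading.Lock()       # guards both logs
        self._purge_running = threading.Event()
        self.offset_log = offset_log            # batch_id -> offsets
        self.commit_log = commit_log            # batch_id -> completion marker

    def commit(self, batch_id):
        # Commits take the same lock as purges, so no concurrent mutation.
        with self._log_lock:
            self.offset_log[batch_id] = "offsets"
            self.commit_log[batch_id] = "done"

    def maybe_purge(self, threshold):
        # Don't queue another purge if one is already running.
        if self._purge_running.is_set():
            return
        self._purge_running.set()
        self._executor.submit(self._purge, threshold)

    def _purge(self, threshold):
        try:
            with self._log_lock:
                for log in (self.offset_log, self.commit_log):
                    for batch_id in [b for b in log if b < threshold]:
                        del log[batch_id]
        finally:
            self._purge_running.clear()

purger = AsyncLogPurger({}, {})
for b in range(5):
    purger.commit(b)
purger.maybe_purge(threshold=3)       # runs in the background
purger._executor.shutdown(wait=True)  # for the demo, wait for it to finish
assert sorted(purger.offset_log) == [3, 4]
```

The micro-batch loop only pays the cost of the `is_set()` check and the submit; the deletes happen on the background thread, which is the latency win the ticket describes.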
[jira] [Assigned] (SPARK-40849) Async log purge
[ https://issues.apache.org/jira/browse/SPARK-40849?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-40849: Assignee: Apache Spark > Async log purge > --- > > Key: SPARK-40849 > URL: https://issues.apache.org/jira/browse/SPARK-40849 > Project: Spark > Issue Type: New Feature > Components: Structured Streaming >Affects Versions: 3.4.0 >Reporter: Boyang Jerry Peng >Assignee: Apache Spark >Priority: Major > > Purging old entries in both the offset log and commit log will be done > asynchronously. > > For every micro-batch, older entries in both offset log and commit log are > deleted. This is done so that the offset log and commit log do not > continually grow. Please reference logic here > > [https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/MicroBatchExecution.scala#L539] > > > The time spent performing these log purges is grouped with the “walCommit” > execution time in the StreamingProgressListener metrics. Around two thirds > of the “walCommit” execution time is performing these purge operations thus > making these operations asynchronous will also reduce latency. Also, we do > not necessarily need to perform the purges every micro-batch. When these > purges are executed asynchronously, they do not need to block micro-batch > execution and we don’t need to start another purge until the current one is > finished. The purges can happen essentially in the background. We will just > have to synchronize the purges with the offset WAL commits and completion > commits so that we don’t have concurrent modifications of the offset log and > commit log. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-40849) Async log purge
[ https://issues.apache.org/jira/browse/SPARK-40849?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-40849: Assignee: (was: Apache Spark) > Async log purge > --- > > Key: SPARK-40849 > URL: https://issues.apache.org/jira/browse/SPARK-40849 > Project: Spark > Issue Type: New Feature > Components: Structured Streaming >Affects Versions: 3.4.0 >Reporter: Boyang Jerry Peng >Priority: Major > > Purging old entries in both the offset log and commit log will be done > asynchronously. > > For every micro-batch, older entries in both offset log and commit log are > deleted. This is done so that the offset log and commit log do not > continually grow. Please reference logic here > > [https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/MicroBatchExecution.scala#L539] > > > The time spent performing these log purges is grouped with the “walCommit” > execution time in the StreamingProgressListener metrics. Around two thirds > of the “walCommit” execution time is performing these purge operations thus > making these operations asynchronous will also reduce latency. Also, we do > not necessarily need to perform the purges every micro-batch. When these > purges are executed asynchronously, they do not need to block micro-batch > execution and we don’t need to start another purge until the current one is > finished. The purges can happen essentially in the background. We will just > have to synchronize the purges with the offset WAL commits and completion > commits so that we don’t have concurrent modifications of the offset log and > commit log. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-40849) Async log purge
[ https://issues.apache.org/jira/browse/SPARK-40849?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Boyang Jerry Peng updated SPARK-40849: -- Description: Purging old entries in both the offset log and commit log will be done asynchronously. For every micro-batch, older entries in both offset log and commit log are deleted. This is done so that the offset log and commit log do not continually grow. Please reference logic here [https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/MicroBatchExecution.scala#L539] The time spent performing these log purges is grouped with the “walCommit” execution time in the StreamingProgressListener metrics. Around two thirds of the “walCommit” execution time is performing these purge operations thus making these operations asynchronous will also reduce latency. Also, we do not necessarily need to perform the purges every micro-batch. When these purges are executed asynchronously, they do not need to block micro-batch execution and we don’t need to start another purge until the current one is finished. The purges can happen essentially in the background. We will just have to synchronize the purges with the offset WAL commits and completion commits so that we don’t have concurrent modifications of the offset log and commit log. was:Purging old entries in both the offset log and commit log will be done asynchronously > Async log purge > --- > > Key: SPARK-40849 > URL: https://issues.apache.org/jira/browse/SPARK-40849 > Project: Spark > Issue Type: New Feature > Components: Structured Streaming >Affects Versions: 3.4.0 >Reporter: Boyang Jerry Peng >Priority: Major > > Purging old entries in both the offset log and commit log will be done > asynchronously. > > For every micro-batch, older entries in both offset log and commit log are > deleted. This is done so that the offset log and commit log do not > continually grow. 
Please reference logic here > > [https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/MicroBatchExecution.scala#L539] > > > The time spent performing these log purges is grouped with the “walCommit” > execution time in the StreamingProgressListener metrics. Around two thirds > of the “walCommit” execution time is performing these purge operations thus > making these operations asynchronous will also reduce latency. Also, we do > not necessarily need to perform the purges every micro-batch. When these > purges are executed asynchronously, they do not need to block micro-batch > execution and we don’t need to start another purge until the current one is > finished. The purges can happen essentially in the background. We will just > have to synchronize the purges with the offset WAL commits and completion > commits so that we don’t have concurrent modifications of the offset log and > commit log. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-40849) Async log purge
[ https://issues.apache.org/jira/browse/SPARK-40849?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Boyang Jerry Peng updated SPARK-40849: -- Description: Purging old entries in both the offset log and commit log will be done asynchronously > Async log purge > --- > > Key: SPARK-40849 > URL: https://issues.apache.org/jira/browse/SPARK-40849 > Project: Spark > Issue Type: New Feature > Components: Structured Streaming >Affects Versions: 3.4.0 >Reporter: Boyang Jerry Peng >Priority: Major > > Purging old entries in both the offset log and commit log will be done > asynchronously -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-40025) Project Lightspeed: Faster and Simpler Stream Processing with Apache Spark
[ https://issues.apache.org/jira/browse/SPARK-40025?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Boyang Jerry Peng updated SPARK-40025: -- Description: Project Lightspeed is an umbrella project aimed at improving a couple of key aspects of Spark Streaming: * Improving the latency and ensuring it is predictable * Enhancing functionality for processing data with new operators and APIs Umbrella Jira to track all tickets under Project Lightspeed SPARK-39585 - Multiple Stateful Operators in Structured Streaming SPARK-39586 - Advanced Windowing in Structured Streaming SPARK-39587 - Schema Evolution for Stateful Pipelines SPARK-39589 - Asynchronous I/O support SPARK-39590 - Python API for Arbitrary Stateful Processing SPARK-39591 - Offset Management Improvements SPARK-40849 SPARK-39592 - Asynchronous State Checkpointing SPARK-39593 - Configurable State Checkpointing Frequency was: Project Lightspeed is an umbrella project aimed at improving a couple of key aspects of Spark Streaming: * Improving the latency and ensuring it is predictable * Enhancing functionality for processing data with new operators and APIs Umbrella Jira to track all tickets under Project Lightspeed SPARK-39585 - Multiple Stateful Operators in Structured Streaming SPARK-39586 - Advanced Windowing in Structured Streaming SPARK-39587 - Schema Evolution for Stateful Pipelines SPARK-39589 - Asynchronous I/O support SPARK-39590 - Python API for Arbitrary Stateful Processing SPARK-39591 - Offset Management Improvements SPARK-39592 - Asynchronous State Checkpointing SPARK-39593 - Configurable State Checkpointing Frequency > Project Lightspeed: Faster and Simpler Stream Processing with Apache Spark > -- > > Key: SPARK-40025 > URL: https://issues.apache.org/jira/browse/SPARK-40025 > Project: Spark > Issue Type: Umbrella > Components: Structured Streaming >Affects Versions: 3.2.2 >Reporter: Boyang Jerry Peng >Priority: Major > > Project Lightspeed is an umbrella project aimed at improving 
a couple of key > aspects of Spark Streaming: > * Improving the latency and ensuring it is predictable > * Enhancing functionality for processing data with new operators and APIs > > Umbrella Jira to track all tickets under Project Lightspeed > SPARK-39585 - Multiple Stateful Operators in Structured Streaming > SPARK-39586 - Advanced Windowing in Structured Streaming > SPARK-39587 - Schema Evolution for Stateful Pipelines > SPARK-39589 - Asynchronous I/O support > SPARK-39590 - Python API for Arbitrary Stateful Processing > SPARK-39591 - Offset Management Improvements > SPARK-40849 > > SPARK-39592 - Asynchronous State Checkpointing > SPARK-39593 - Configurable State Checkpointing Frequency -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-40025) Project Lightspeed: Faster and Simpler Stream Processing with Apache Spark
[ https://issues.apache.org/jira/browse/SPARK-40025?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Boyang Jerry Peng updated SPARK-40025: -- Description: Project Lightspeed is an umbrella project aimed at improving a couple of key aspects of Spark Streaming: * Improving the latency and ensuring it is predictable * Enhancing functionality for processing data with new operators and APIs Umbrella Jira to track all tickets under Project Lightspeed SPARK-39585 - Multiple Stateful Operators in Structured Streaming SPARK-39586 - Advanced Windowing in Structured Streaming SPARK-39587 - Schema Evolution for Stateful Pipelines SPARK-39589 - Asynchronous I/O support SPARK-39590 - Python API for Arbitrary Stateful Processing SPARK-39591 - Offset Management Improvements SPARK-40849 - Async log purge SPARK-39592 - Asynchronous State Checkpointing SPARK-39593 - Configurable State Checkpointing Frequency was: Project Lightspeed is an umbrella project aimed at improving a couple of key aspects of Spark Streaming: * Improving the latency and ensuring it is predictable * Enhancing functionality for processing data with new operators and APIs Umbrella Jira to track all tickets under Project Lightspeed SPARK-39585 - Multiple Stateful Operators in Structured Streaming SPARK-39586 - Advanced Windowing in Structured Streaming SPARK-39587 - Schema Evolution for Stateful Pipelines SPARK-39589 - Asynchronous I/O support SPARK-39590 - Python API for Arbitrary Stateful Processing SPARK-39591 - Offset Management Improvements SPARK-40849 SPARK-39592 - Asynchronous State Checkpointing SPARK-39593 - Configurable State Checkpointing Frequency > Project Lightspeed: Faster and Simpler Stream Processing with Apache Spark > -- > > Key: SPARK-40025 > URL: https://issues.apache.org/jira/browse/SPARK-40025 > Project: Spark > Issue Type: Umbrella > Components: Structured Streaming >Affects Versions: 3.2.2 >Reporter: Boyang Jerry Peng >Priority: Major > > Project Lightspeed is an 
umbrella project aimed at improving a couple of key > aspects of Spark Streaming: > * Improving the latency and ensuring it is predictable > * Enhancing functionality for processing data with new operators and APIs > > Umbrella Jira to track all tickets under Project Lightspeed > SPARK-39585 - Multiple Stateful Operators in Structured Streaming > SPARK-39586 - Advanced Windowing in Structured Streaming > SPARK-39587 - Schema Evolution for Stateful Pipelines > SPARK-39589 - Asynchronous I/O support > SPARK-39590 - Python API for Arbitrary Stateful Processing > SPARK-39591 - Offset Management Improvements > SPARK-40849 - Async log purge > SPARK-39592 - Asynchronous State Checkpointing > SPARK-39593 - Configurable State Checkpointing Frequency -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-40849) Async log purge
Boyang Jerry Peng created SPARK-40849: - Summary: Async log purge Key: SPARK-40849 URL: https://issues.apache.org/jira/browse/SPARK-40849 Project: Spark Issue Type: New Feature Components: Structured Streaming Affects Versions: 3.4.0 Reporter: Boyang Jerry Peng -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-40025) Project Lightspeed: Faster and Simpler Stream Processing with Apache Spark
[ https://issues.apache.org/jira/browse/SPARK-40025?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Boyang Jerry Peng updated SPARK-40025: -- Description: Project Lightspeed is an umbrella project aimed at improving a couple of key aspects of Spark Streaming: * Improving the latency and ensuring it is predictable * Enhancing functionality for processing data with new operators and APIs Umbrella Jira to track all tickets under Project Lightspeed SPARK-39585 - Multiple Stateful Operators in Structured Streaming SPARK-39586 - Advanced Windowing in Structured Streaming SPARK-39587 - Schema Evolution for Stateful Pipelines SPARK-39589 - Asynchronous I/O support SPARK-39590 - Python API for Arbitrary Stateful Processing SPARK-39591 - Offset Management Improvements SPARK-39592 - Asynchronous State Checkpointing SPARK-39593 - Configurable State Checkpointing Frequency was: Project Lightspeed is an umbrella project aimed at improving a couple of key aspects of Spark Streaming: * Improving the latency and ensuring it is predictable * Enhancing functionality for processing data with new operators and APIs Umbrella Jira to track all tickets under Project Lightspeed SPARK-39585 - Multiple Stateful Operators in Structured Streaming SPARK-39586 - Advanced Windowing in Structured Streaming SPARK-39587 - Schema Evolution for Stateful Pipelines SPARK-39589 - Asynchronous I/O support SPARK-39590 - Python API for Arbitrary Stateful Processing SPARK-39591 - Offset Management Improvements SPARK-39592 - Asynchronous State Checkpointing SPARK-39593 - Configurable State Checkpointing Frequency > Project Lightspeed: Faster and Simpler Stream Processing with Apache Spark > -- > > Key: SPARK-40025 > URL: https://issues.apache.org/jira/browse/SPARK-40025 > Project: Spark > Issue Type: Umbrella > Components: Structured Streaming >Affects Versions: 3.2.2 >Reporter: Boyang Jerry Peng >Priority: Major > > Project Lightspeed is an umbrella project aimed at improving a couple of 
key > aspects of Spark Streaming: > * Improving the latency and ensuring it is predictable > * Enhancing functionality for processing data with new operators and APIs > > Umbrella Jira to track all tickets under Project Lightspeed > SPARK-39585 - Multiple Stateful Operators in Structured Streaming > SPARK-39586 - Advanced Windowing in Structured Streaming > SPARK-39587 - Schema Evolution for Stateful Pipelines > SPARK-39589 - Asynchronous I/O support > SPARK-39590 - Python API for Arbitrary Stateful Processing > SPARK-39591 - Offset Management Improvements > SPARK-39592 - Asynchronous State Checkpointing > SPARK-39593 - Configurable State Checkpointing Frequency -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-40659) Schema evolution for protobuf (and Avro too?)
[ https://issues.apache.org/jira/browse/SPARK-40659?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17620581#comment-17620581 ] Sandish Kumar HN commented on SPARK-40659: -- [~rangadi] is the reason that the Confluent schema registry is not open-sourced? I see that Apache Flink uses the Confluent schema registry [https://github.com/apache/flink/blob/master/flink-formats/flink-avro-confluent-registry/pom.xml#L39] > Schema evolution for protobuf (and Avro too?) > - > > Key: SPARK-40659 > URL: https://issues.apache.org/jira/browse/SPARK-40659 > Project: Spark > Issue Type: Improvement > Components: Protobuf, Structured Streaming >Affects Versions: 3.3.0 >Reporter: Raghu Angadi >Priority: Major > > Protobuf & Avro should support schema evolution in streaming. We need to > throw a specific error message when we detect a newer version of the schema > in the schema registry. > A couple of options for detecting version change at runtime: > * How do we detect a newer version from the schema registry? It is contacted only > during planning currently. > * We could detect the version id in incoming messages. > ** What if the id in the incoming message is newer than what our > schema-registry reports after the restart? > *** This indicates delayed syncs between customers' schema-registry servers > (should be rare). We can keep erroring out until it is fixed. > *** Make sure we log the schema id used during planning. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
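One option for the "detect the version id in incoming messages" idea discussed above: messages produced with Confluent's serializers carry the schema id in a documented wire format (a magic byte 0x0, then the schema id as a 4-byte big-endian integer, then the serialized payload). A minimal sketch of extracting that id, independent of Spark and of any registry client (the class and method names here are illustrative, not from the Spark codebase):

```java
import java.nio.ByteBuffer;

public class SchemaIdSniffer {
    // Confluent wire format: byte 0 is the magic byte (0x0), bytes 1-4 are
    // the schema id as a big-endian int, and the serialized record follows.
    static int schemaId(byte[] message) {
        if (message.length < 5 || message[0] != 0x0) {
            throw new IllegalArgumentException("not in Confluent wire format");
        }
        return ByteBuffer.wrap(message, 1, 4).getInt();
    }

    public static void main(String[] args) {
        // Magic byte, schema id 42, then a one-byte dummy payload.
        byte[] msg = {0x0, 0x00, 0x00, 0x00, 0x2A, 0x01};
        System.out.println(schemaId(msg)); // 42
    }
}
```

Comparing this id against the one recorded at planning time would flag a schema change at runtime, which is the detection path sketched in the ticket.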
[jira] [Updated] (SPARK-40844) Flip the default value of Kafka offset fetching config
[ https://issues.apache.org/jira/browse/SPARK-40844?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jungtaek Lim updated SPARK-40844: - Labels: release-notes (was: ) > Flip the default value of Kafka offset fetching config > -- > > Key: SPARK-40844 > URL: https://issues.apache.org/jira/browse/SPARK-40844 > Project: Spark > Issue Type: Task > Components: Structured Streaming >Affects Versions: 3.4.0 >Reporter: Jungtaek Lim >Assignee: Jungtaek Lim >Priority: Major > Labels: release-notes > Fix For: 3.4.0 > > > Discussion thread: > [https://lists.apache.org/thread/spkco94gw33sj8355mhlxz1vl7gl1g5c] > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-40844) Flip the default value of Kafka offset fetching config
[ https://issues.apache.org/jira/browse/SPARK-40844?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jungtaek Lim reassigned SPARK-40844: Assignee: Jungtaek Lim > Flip the default value of Kafka offset fetching config > -- > > Key: SPARK-40844 > URL: https://issues.apache.org/jira/browse/SPARK-40844 > Project: Spark > Issue Type: Task > Components: Structured Streaming >Affects Versions: 3.4.0 >Reporter: Jungtaek Lim >Assignee: Jungtaek Lim >Priority: Major > > Discussion thread: > [https://lists.apache.org/thread/spkco94gw33sj8355mhlxz1vl7gl1g5c] > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-40844) Flip the default value of Kafka offset fetching config
[ https://issues.apache.org/jira/browse/SPARK-40844?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jungtaek Lim resolved SPARK-40844. -- Fix Version/s: 3.4.0 Resolution: Fixed Issue resolved by pull request 38306 [https://github.com/apache/spark/pull/38306] > Flip the default value of Kafka offset fetching config > -- > > Key: SPARK-40844 > URL: https://issues.apache.org/jira/browse/SPARK-40844 > Project: Spark > Issue Type: Task > Components: Structured Streaming >Affects Versions: 3.4.0 >Reporter: Jungtaek Lim >Assignee: Jungtaek Lim >Priority: Major > Fix For: 3.4.0 > > > Discussion thread: > [https://lists.apache.org/thread/spkco94gw33sj8355mhlxz1vl7gl1g5c] > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
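The config whose default is flipped here is, per the Structured Streaming + Kafka integration guide, `spark.sql.streaming.kafka.useDeprecatedOffsetFetching`: from 3.4.0 it defaults to false, so offsets are fetched with the Kafka AdminClient instead of a driver-side consumer. A hedged sketch of restoring the pre-3.4 behaviour (verify the config name against your Spark version's documentation):

```
# Pre-3.4 behaviour: fetch offsets with a KafkaConsumer on the driver.
spark.sql.streaming.kafka.useDeprecatedOffsetFetching  true
```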
[jira] [Commented] (SPARK-40659) Schema evolution for protobuf (and Avro too?)
[ https://issues.apache.org/jira/browse/SPARK-40659?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17620557#comment-17620557 ] Raghu Angadi commented on SPARK-40659: -- Schema-support for Avro is not in open-source Spark. When we implement automatic schema evolution, we will do that for both Avro and Protobuf. I think we should close this for now. WDYT? When we backport schema-registry support, we will do it for both Avro and Protobuf together. > Schema evolution for protobuf (and Avro too?) > - > > Key: SPARK-40659 > URL: https://issues.apache.org/jira/browse/SPARK-40659 > Project: Spark > Issue Type: Improvement > Components: Protobuf, Structured Streaming >Affects Versions: 3.3.0 >Reporter: Raghu Angadi >Priority: Major > > Protobuf & Avro should support schema evolution in streaming. We need to > throw a specific error message when we detect a newer version of the schema > in the schema registry. > A couple of options for detecting version change at runtime: > * How do we detect a newer version from the schema registry? It is contacted only > during planning currently. > * We could detect the version id in incoming messages. > ** What if the id in the incoming message is newer than what our > schema-registry reports after the restart? > *** This indicates delayed syncs between customers' schema-registry servers > (should be rare). We can keep erroring out until it is fixed. > *** Make sure we log the schema id used during planning. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-40656) Schema-registry support for Protobuf format
[ https://issues.apache.org/jira/browse/SPARK-40656?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17620556#comment-17620556 ] Raghu Angadi commented on SPARK-40656: -- Schema-support for Avro is not here. I will close this. When we back port schema-registry support, we will do it for both Avro and Protobuf together. Committers, please close this as "Won't Do". > Schema-registry support for Protobuf format > --- > > Key: SPARK-40656 > URL: https://issues.apache.org/jira/browse/SPARK-40656 > Project: Spark > Issue Type: Improvement > Components: Protobuf, Structured Streaming >Affects Versions: 3.3.0 >Reporter: Raghu Angadi >Priority: Major > > Add support for reading protobuf schema (definition) from Confluent > schema-registry. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-40777) Use error classes for Protobuf exceptions
[ https://issues.apache.org/jira/browse/SPARK-40777?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17620553#comment-17620553 ] Raghu Angadi commented on SPARK-40777: -- I filed a separate ticket SPARK-40848 for generating descriptor files. I will do that there; we don't need to do it here. > Use error classes for Protobuf exceptions > - > > Key: SPARK-40777 > URL: https://issues.apache.org/jira/browse/SPARK-40777 > Project: Spark > Issue Type: Improvement > Components: Protobuf, Structured Streaming >Affects Versions: 3.3.0 >Reporter: Raghu Angadi >Priority: Major > > We should use error classes for all the exceptions. > A follow-up from the Protobuf PR [https://github.com/apache/spark/pull/37972] > > cc: [~sanysand...@gmail.com] -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-40658) Protobuf v2 & v3 support
[ https://issues.apache.org/jira/browse/SPARK-40658?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17620551#comment-17620551 ] Raghu Angadi commented on SPARK-40658: -- [~sanysand...@gmail.com] , [~mposdev21] I am working on this. I think we can support both without any issue. > Protobuf v2 & v3 support > > > Key: SPARK-40658 > URL: https://issues.apache.org/jira/browse/SPARK-40658 > Project: Spark > Issue Type: Improvement > Components: Protobuf, Structured Streaming >Affects Versions: 3.3.0 >Reporter: Raghu Angadi >Priority: Major > > We want to ensure Protobuf functions support both Protobuf version 2 and > version 3 schemas (e.g. descriptor file or compiled classes with v2 and v3). > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-40848) Protobuf: Generate descriptor files at build time
Raghu Angadi created SPARK-40848: Summary: Protobuf: Generate descriptor files at build time Key: SPARK-40848 URL: https://issues.apache.org/jira/browse/SPARK-40848 Project: Spark Issue Type: Improvement Components: Protobuf Affects Versions: 3.3.0 Reporter: Raghu Angadi Generate descriptor files during the build rather than pre-creating them. [~rangadi] will do this. cc: [~sanysand...@gmail.com] -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-34827) Support fetching shuffle blocks in batch with i/o encryption
[ https://issues.apache.org/jira/browse/SPARK-34827?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17620511#comment-17620511 ] Pankaj Nagla commented on SPARK-34827: -- > Support fetching shuffle blocks in batch with i/o encryption > > > Key: SPARK-34827 > URL: https://issues.apache.org/jira/browse/SPARK-34827 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.2.0, 3.3.0 >Reporter: Dongjoon Hyun >Priority: Blocker > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-40791) The semantics of `F` in `DateTimeFormatter` have changed
[ https://issues.apache.org/jira/browse/SPARK-40791?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17620487#comment-17620487 ] Dongjoon Hyun edited comment on SPARK-40791 at 10/19/22 6:12 PM: - According to the PR comment, Java 11 has the same issue, [~LuciferYang]? bq. hmm... the latest 11(11.0.17) and 17(17.0.5) have the same issue ... was (Author: dongjoon): According to the PR comment, Java 11 has the same issue, [~LuciferYang]? > hmm... the latest 11(11.0.17) and 17(17.0.5) have the same issue ... > The semantics of `F` in `DateTimeFormatter` have changed > > > Key: SPARK-40791 > URL: https://issues.apache.org/jira/browse/SPARK-40791 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.4.0 >Reporter: Yang Jie >Priority: Major > > {code:java} > val createSql = > """ > |create temporary view v as select col from values > | (timestamp '1582-06-01 11:33:33.123UTC+08'), > | (timestamp '1970-01-01 00:00:00.000Europe/Paris'), > | (timestamp '1970-12-31 23:59:59.999Asia/Srednekolymsk'), > | (timestamp '1996-04-01 00:33:33.123Australia/Darwin'), > | (timestamp '2018-11-17 13:33:33.123Z'), > | (timestamp '2020-01-01 01:33:33.123Asia/Shanghai'), > | (timestamp '2100-01-01 01:33:33.123America/Los_Angeles') t(col) > | """.stripMargin > sql(createSql) > withSQLConf(SQLConf.WHOLESTAGE_CODEGEN_ENABLED.key -> false.toString) { > val rows = sql("select col, date_format(col, 'F') from v").collect() > // scalastyle:off > rows.foreach(println) > } {code} > > Before Java 19, the result is > > {code:java} > [1582-05-31 19:40:35.123,3] > [1969-12-31 15:00:00.0,3] > [1970-12-31 04:59:59.999,3] > [1996-03-31 07:03:33.123,3] > [2018-11-17 05:33:33.123,3] > [2019-12-31 09:33:33.123,3] > [2100-01-01 01:33:33.123,1] {code} > Java 19 > > {code:java} > [1582-05-31 19:40:35.123,5] > [1969-12-31 15:00:00.0,5] > [1970-12-31 04:59:59.999,5] > [1996-03-31 07:03:33.123,5] > [2018-11-17 05:33:33.123,3] > [2019-12-31 09:33:33.123,5] > 
[2100-01-01 01:33:33.123,1] {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-40791) The semantics of `F` in `DateTimeFormatter` have changed
[ https://issues.apache.org/jira/browse/SPARK-40791?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17620487#comment-17620487 ] Dongjoon Hyun commented on SPARK-40791: --- According to the PR comment, Java 11 has the same issue, [~LuciferYang]? > hmm... the latest 11(11.0.17) and 17(17.0.5) have the same issue ... > The semantics of `F` in `DateTimeFormatter` have changed > > > Key: SPARK-40791 > URL: https://issues.apache.org/jira/browse/SPARK-40791 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.4.0 >Reporter: Yang Jie >Priority: Major > > {code:java} > val createSql = > """ > |create temporary view v as select col from values > | (timestamp '1582-06-01 11:33:33.123UTC+08'), > | (timestamp '1970-01-01 00:00:00.000Europe/Paris'), > | (timestamp '1970-12-31 23:59:59.999Asia/Srednekolymsk'), > | (timestamp '1996-04-01 00:33:33.123Australia/Darwin'), > | (timestamp '2018-11-17 13:33:33.123Z'), > | (timestamp '2020-01-01 01:33:33.123Asia/Shanghai'), > | (timestamp '2100-01-01 01:33:33.123America/Los_Angeles') t(col) > | """.stripMargin > sql(createSql) > withSQLConf(SQLConf.WHOLESTAGE_CODEGEN_ENABLED.key -> false.toString) { > val rows = sql("select col, date_format(col, 'F') from v").collect() > // scalastyle:off > rows.foreach(println) > } {code} > > Before Java 19, the result is > > {code:java} > [1582-05-31 19:40:35.123,3] > [1969-12-31 15:00:00.0,3] > [1970-12-31 04:59:59.999,3] > [1996-03-31 07:03:33.123,3] > [2018-11-17 05:33:33.123,3] > [2019-12-31 09:33:33.123,3] > [2100-01-01 01:33:33.123,1] {code} > Java 19 > > {code:java} > [1582-05-31 19:40:35.123,5] > [1969-12-31 15:00:00.0,5] > [1970-12-31 04:59:59.999,5] > [1996-03-31 07:03:33.123,5] > [2018-11-17 05:33:33.123,3] > [2019-12-31 09:33:33.123,5] > [2100-01-01 01:33:33.123,1] {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: 
issues-h...@spark.apache.org
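The two result sets in the ticket are consistent with a JDK change, shipped in Java 19 and (per the comment above) backported to 11.0.17 and 17.0.5, that remapped pattern letter `F` from the aligned day-of-week-in-month field to the documented week-of-month field. A minimal Java reproduction without Spark; for 2018-11-17 both fields happen to equal 3, which is why that row is identical in the ticket's "before" and "after" lists:

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;

public class FPatternDemo {
    public static void main(String[] args) {
        DateTimeFormatter f = DateTimeFormatter.ofPattern("F");
        // Same on old and new JDKs: both candidate fields evaluate to 3
        // for 17 Nov 2018.
        System.out.println(LocalDate.of(2018, 11, 17).format(f)); // 3
        // Differs: 3 on pre-fix JDKs, 5 on fixed ones, matching the
        // ticket's before/after output for 2019-12-31.
        System.out.println(LocalDate.of(2019, 12, 31).format(f));
    }
}
```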
[jira] [Updated] (SPARK-40847) SPARK: Load Data from Dataframe or RDD to DynamoDB
[ https://issues.apache.org/jira/browse/SPARK-40847?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vivek Garg updated SPARK-40847: --- Description: I am using Spark 2.1 on EMR and I have a dataframe like this: ClientNum | Value_1 | Value_2 | Value_3 | Value_4 14 | A | B | C | null 19 | X | Y | null | null 21 | R | null | null | null I want to load data into a DynamoDB table with ClientNum as the key, following "Analyze Your Data on Amazon DynamoDB with Apache Spark" and "Using Spark SQL for ETL". Here is the code I tried: var jobConf = new JobConf(sc.hadoopConfiguration) jobConf.set("dynamodb.servicename", "dynamodb") jobConf.set("dynamodb.input.tableName", "table_name") jobConf.set("dynamodb.output.tableName", "table_name") jobConf.set("dynamodb.endpoint", "dynamodb.eu-west-1.amazonaws.com") jobConf.set("dynamodb.regionid", "eu-west-1") jobConf.set("dynamodb.throughput.read", "1") jobConf.set("dynamodb.throughput.read.percent", "1") jobConf.set("dynamodb.throughput.write", "1") jobConf.set("dynamodb.throughput.write.percent", "1") jobConf.set("mapred.output.format.class", "org.apache.hadoop.dynamodb.write.DynamoDBOutputFormat") jobConf.set("mapred.input.format.class", "org.apache.hadoop.dynamodb.read.DynamoDBInputFormat") #Import Data val df = sqlContext.read.format("com.databricks.spark.csv").option("header", "true").option("inferSchema", "true").load(path) I performed a transformation to have an RDD that matches the types that the DynamoDB custom output format knows how to write. The custom output format expects a tuple containing the Text and DynamoDBItemWritable types. 
Create a new RDD with those types in it, in the following map call: #Convert the dataframe to rdd val df_rdd = df.rdd > df_rdd: org.apache.spark.rdd.RDD[org.apache.spark.sql.Row] = > MapPartitionsRDD[10] at rdd at :41 #Print first rdd df_rdd.take(1) > res12: Array[org.apache.spark.sql.Row] = Array([14,A,B,C,null]) var ddbInsertFormattedRDD = df_rdd.map(a => { var ddbMap = new HashMap[String, AttributeValue]() var ClientNum = new AttributeValue() ClientNum.setN(a.get(0).toString) ddbMap.put("ClientNum", ClientNum) var Value_1 = new AttributeValue() Value_1.setS(a.get(1).toString) ddbMap.put("Value_1", Value_1) var Value_2 = new AttributeValue() Value_2.setS(a.get(2).toString) ddbMap.put("Value_2", Value_2) var Value_3 = new AttributeValue() Value_3.setS(a.get(3).toString) ddbMap.put("Value_3", Value_3) var Value_4 = new AttributeValue() Value_4.setS(a.get(4).toString) ddbMap.put("Value_4", Value_4) var item = new DynamoDBItemWritable() item.setItem(ddbMap) (new Text(""), item) } ) This last call uses the job configuration that defines the EMR-DDB connector to write out the new RDD you created in the expected format: ddbInsertFormattedRDD.saveAsHadoopDataset(jobConf) fails with the following error: Caused by: java.lang.NullPointerException Null values caused the error; if I try with only ClientNum and Value_1 it works, and data is correctly inserted into the DynamoDB table. Thank you. 
[jira] [Created] (SPARK-40847) SPARK: Load Data from Dataframe or RDD to DynamoDB
Vivek Garg created SPARK-40847: -- Summary: SPARK: Load Data from Dataframe or RDD to DynamoDB Key: SPARK-40847 URL: https://issues.apache.org/jira/browse/SPARK-40847 Project: Spark Issue Type: Question Components: Deploy Affects Versions: 2.1.1 Reporter: Vivek Garg I am using Spark 2.1 on EMR and I have a dataframe like this: ClientNum | Value_1 | Value_2 | Value_3 | Value_4 14 | A | B | C | null 19 | X | Y | null | null 21 | R | null | null | null I want to load data into a DynamoDB table with ClientNum as the key, following "Analyze Your Data on Amazon DynamoDB with Apache Spark" and "Using Spark SQL for ETL". Here is the code I tried: var jobConf = new JobConf(sc.hadoopConfiguration) jobConf.set("dynamodb.servicename", "dynamodb") jobConf.set("dynamodb.input.tableName", "table_name") jobConf.set("dynamodb.output.tableName", "table_name") jobConf.set("dynamodb.endpoint", "dynamodb.eu-west-1.amazonaws.com") jobConf.set("dynamodb.regionid", "eu-west-1") jobConf.set("dynamodb.throughput.read", "1") jobConf.set("dynamodb.throughput.read.percent", "1") jobConf.set("dynamodb.throughput.write", "1") jobConf.set("dynamodb.throughput.write.percent", "1") jobConf.set("mapred.output.format.class", "org.apache.hadoop.dynamodb.write.DynamoDBOutputFormat") jobConf.set("mapred.input.format.class", "org.apache.hadoop.dynamodb.read.DynamoDBInputFormat") #Import Data val df = sqlContext.read.format("com.databricks.spark.csv").option("header", "true").option("inferSchema", "true").load(path) I performed a transformation to have an RDD that matches the types that the DynamoDB custom output format knows how to write. The custom output format expects a tuple containing the Text and DynamoDBItemWritable types. 
Create a new RDD with those types in it, in the following map call: #Convert the dataframe to rdd val df_rdd = df.rdd > df_rdd: org.apache.spark.rdd.RDD[org.apache.spark.sql.Row] = > MapPartitionsRDD[10] at rdd at :41 #Print first rdd df_rdd.take(1) > res12: Array[org.apache.spark.sql.Row] = Array([14,A,B,C,null]) var ddbInsertFormattedRDD = df_rdd.map(a => { var ddbMap = new HashMap[String, AttributeValue]() var ClientNum = new AttributeValue() ClientNum.setN(a.get(0).toString) ddbMap.put("ClientNum", ClientNum) var Value_1 = new AttributeValue() Value_1.setS(a.get(1).toString) ddbMap.put("Value_1", Value_1) var Value_2 = new AttributeValue() Value_2.setS(a.get(2).toString) ddbMap.put("Value_2", Value_2) var Value_3 = new AttributeValue() Value_3.setS(a.get(3).toString) ddbMap.put("Value_3", Value_3) var Value_4 = new AttributeValue() Value_4.setS(a.get(4).toString) ddbMap.put("Value_4", Value_4) var item = new DynamoDBItemWritable() item.setItem(ddbMap) (new Text(""), item) } ) This last call uses the job configuration that defines the EMR-DDB connector to write out the new RDD you created in the expected format: ddbInsertFormattedRDD.saveAsHadoopDataset(jobConf) fails with the following error: Caused by: java.lang.NullPointerException Null values caused the error; if I try with only ClientNum and Value_1 it works, and data is correctly inserted into the DynamoDB table. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
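The NullPointerException comes from the map above: `a.get(i).toString` is called unconditionally, so any null column (Value_4 in the first row) blows up before anything reaches DynamoDB. The usual fix is to skip null columns when building the item. A sketch of that guard in plain Java, with a Map standing in for the HashMap[String, AttributeValue] and Row types from the ticket (the helper name is illustrative):

```java
import java.util.HashMap;
import java.util.Map;

public class NullGuardDemo {
    // Add an attribute only when the column value is non-null, mirroring
    // the check needed before AttributeValue.setS(...) in the ticket's code.
    static void putIfPresent(Map<String, String> item, String name, Object value) {
        if (value != null) {
            item.put(name, value.toString());
        }
    }

    public static void main(String[] args) {
        // Stand-in for the Row [14,A,B,C,null] shown above.
        Object[] row = {14, "A", "B", "C", null};
        String[] names = {"ClientNum", "Value_1", "Value_2", "Value_3", "Value_4"};
        Map<String, String> item = new HashMap<>();
        for (int i = 0; i < row.length; i++) {
            putIfPresent(item, names[i], row[i]);
        }
        System.out.println(item); // Value_4 is skipped instead of throwing
    }
}
```

In the Spark job itself the same guard can be expressed with `!a.isNullAt(i)` on the Row before constructing each AttributeValue.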
[jira] [Assigned] (SPARK-40819) Parquet INT64 (TIMESTAMP(NANOS,true)) now throwing Illegal Parquet type instead of automatically converting to LongType
[ https://issues.apache.org/jira/browse/SPARK-40819?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-40819: Assignee: (was: Apache Spark) > Parquet INT64 (TIMESTAMP(NANOS,true)) now throwing Illegal Parquet type > instead of automatically converting to LongType > > > Key: SPARK-40819 > URL: https://issues.apache.org/jira/browse/SPARK-40819 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.2.0, 3.2.1, 3.3.0, 3.2.2 >Reporter: Alfred Davidson >Priority: Critical > > Since 3.2 parquet files containing attributes with type "INT64 > (TIMESTAMP(NANOS, true))" are no longer readable and attempting to read > throws: > > {code:java} > Caused by: org.apache.spark.sql.AnalysisException: Illegal Parquet type: > INT64 (TIMESTAMP(NANOS,true)) > at > org.apache.spark.sql.errors.QueryCompilationErrors$.illegalParquetTypeError(QueryCompilationErrors.scala:1284) > at > org.apache.spark.sql.execution.datasources.parquet.ParquetToSparkSchemaConverter.illegalType$1(ParquetSchemaConverter.scala:105) > at > org.apache.spark.sql.execution.datasources.parquet.ParquetToSparkSchemaConverter.convertPrimitiveField(ParquetSchemaConverter.scala:174) > at > org.apache.spark.sql.execution.datasources.parquet.ParquetToSparkSchemaConverter.convertField(ParquetSchemaConverter.scala:90) > at > org.apache.spark.sql.execution.datasources.parquet.ParquetToSparkSchemaConverter.$anonfun$convert$1(ParquetSchemaConverter.scala:72) > at > scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:238) > at scala.collection.Iterator.foreach(Iterator.scala:941) > at scala.collection.Iterator.foreach$(Iterator.scala:941) > at scala.collection.AbstractIterator.foreach(Iterator.scala:1429) > at scala.collection.IterableLike.foreach(IterableLike.scala:74) > at scala.collection.IterableLike.foreach$(IterableLike.scala:73) > at scala.collection.AbstractIterable.foreach(Iterable.scala:56) > at 
scala.collection.TraversableLike.map(TraversableLike.scala:238) > at scala.collection.TraversableLike.map$(TraversableLike.scala:231) > at scala.collection.AbstractTraversable.map(Traversable.scala:108) > at > org.apache.spark.sql.execution.datasources.parquet.ParquetToSparkSchemaConverter.convert(ParquetSchemaConverter.scala:66) > at > org.apache.spark.sql.execution.datasources.parquet.ParquetToSparkSchemaConverter.convert(ParquetSchemaConverter.scala:63) > at > org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$.$anonfun$readSchemaFromFooter$2(ParquetFileFormat.scala:548) > at scala.Option.getOrElse(Option.scala:189) > at > org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$.readSchemaFromFooter(ParquetFileFormat.scala:548) > at > org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$.$anonfun$mergeSchemasInParallel$2(ParquetFileFormat.scala:528) > at scala.collection.immutable.Stream.map(Stream.scala:418) > at > org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$.$anonfun$mergeSchemasInParallel$1(ParquetFileFormat.scala:528) > at > org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$.$anonfun$mergeSchemasInParallel$1$adapted(ParquetFileFormat.scala:521) > at > org.apache.spark.sql.execution.datasources.SchemaMergeUtils$.$anonfun$mergeSchemasInParallel$2(SchemaMergeUtils.scala:76) > {code} > Prior to 3.2 successfully reads the parquet automatically converting to a > LongType. 
> I believe work done as part of https://issues.apache.org/jira/browse/SPARK-34661 > introduced the change in behaviour, more specifically here: > [https://github.com/apache/spark/pull/31776/files#diff-3730a913c4b95edf09fb78f8739c538bae53f7269555b6226efe7ccee1901b39R154] > which throws the QueryCompilationErrors.illegalParquetTypeError -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-40819) Parquet INT64 (TIMESTAMP(NANOS,true)) now throwing Illegal Parquet type instead of automatically converting to LongType
[ https://issues.apache.org/jira/browse/SPARK-40819?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-40819: Assignee: Apache Spark > Parquet INT64 (TIMESTAMP(NANOS,true)) now throwing Illegal Parquet type > instead of automatically converting to LongType > > > Key: SPARK-40819 > URL: https://issues.apache.org/jira/browse/SPARK-40819 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.2.0, 3.2.1, 3.3.0, 3.2.2 >Reporter: Alfred Davidson >Assignee: Apache Spark >Priority: Critical > > Since 3.2 parquet files containing attributes with type "INT64 > (TIMESTAMP(NANOS, true))" are no longer readable and attempting to read > throws: > > {code:java} > Caused by: org.apache.spark.sql.AnalysisException: Illegal Parquet type: > INT64 (TIMESTAMP(NANOS,true)) > at > org.apache.spark.sql.errors.QueryCompilationErrors$.illegalParquetTypeError(QueryCompilationErrors.scala:1284) > at > org.apache.spark.sql.execution.datasources.parquet.ParquetToSparkSchemaConverter.illegalType$1(ParquetSchemaConverter.scala:105) > at > org.apache.spark.sql.execution.datasources.parquet.ParquetToSparkSchemaConverter.convertPrimitiveField(ParquetSchemaConverter.scala:174) > at > org.apache.spark.sql.execution.datasources.parquet.ParquetToSparkSchemaConverter.convertField(ParquetSchemaConverter.scala:90) > at > org.apache.spark.sql.execution.datasources.parquet.ParquetToSparkSchemaConverter.$anonfun$convert$1(ParquetSchemaConverter.scala:72) > at > scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:238) > at scala.collection.Iterator.foreach(Iterator.scala:941) > at scala.collection.Iterator.foreach$(Iterator.scala:941) > at scala.collection.AbstractIterator.foreach(Iterator.scala:1429) > at scala.collection.IterableLike.foreach(IterableLike.scala:74) > at scala.collection.IterableLike.foreach$(IterableLike.scala:73) > at scala.collection.AbstractIterable.foreach(Iterable.scala:56) > at 
scala.collection.TraversableLike.map(TraversableLike.scala:238) > at scala.collection.TraversableLike.map$(TraversableLike.scala:231) > at scala.collection.AbstractTraversable.map(Traversable.scala:108) > at > org.apache.spark.sql.execution.datasources.parquet.ParquetToSparkSchemaConverter.convert(ParquetSchemaConverter.scala:66) > at > org.apache.spark.sql.execution.datasources.parquet.ParquetToSparkSchemaConverter.convert(ParquetSchemaConverter.scala:63) > at > org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$.$anonfun$readSchemaFromFooter$2(ParquetFileFormat.scala:548) > at scala.Option.getOrElse(Option.scala:189) > at > org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$.readSchemaFromFooter(ParquetFileFormat.scala:548) > at > org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$.$anonfun$mergeSchemasInParallel$2(ParquetFileFormat.scala:528) > at scala.collection.immutable.Stream.map(Stream.scala:418) > at > org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$.$anonfun$mergeSchemasInParallel$1(ParquetFileFormat.scala:528) > at > org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$.$anonfun$mergeSchemasInParallel$1$adapted(ParquetFileFormat.scala:521) > at > org.apache.spark.sql.execution.datasources.SchemaMergeUtils$.$anonfun$mergeSchemasInParallel$2(SchemaMergeUtils.scala:76) > {code} > Prior to 3.2 successfully reads the parquet automatically converting to a > LongType. 
> I believe the work done as part of https://issues.apache.org/jira/browse/SPARK-34661 > introduced the change in behaviour, more specifically here: > [https://github.com/apache/spark/pull/31776/files#diff-3730a913c4b95edf09fb78f8739c538bae53f7269555b6226efe7ccee1901b39R154] > which throws QueryCompilationErrors.illegalParquetTypeError
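For context on the LongType fallback described above: a Parquet INT64 (TIMESTAMP(NANOS,true)) column physically stores nanoseconds since the Unix epoch, so reading it as a plain long simply exposed the raw stored value. A minimal sketch of that semantics in plain Python (no Spark or Parquet libraries involved; the helper name is made up for illustration):

```python
from datetime import datetime, timezone

def timestamp_to_nanos_long(dt: datetime) -> int:
    # The INT64 value a nanosecond-precision Parquet writer stores:
    # whole seconds since the epoch in nanos, plus sub-second nanos.
    return int(dt.timestamp()) * 1_000_000_000 + dt.microsecond * 1_000

ts = datetime(2021, 1, 1, tzinfo=timezone.utc)
print(timestamp_to_nanos_long(ts))  # 1609459200000000000
```

This is only meant to show why the pre-3.2 LongType fallback was lossless at the value level: the long and the nanosecond timestamp are the same 64-bit quantity.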
[jira] [Commented] (SPARK-40819) Parquet INT64 (TIMESTAMP(NANOS,true)) now throwing Illegal Parquet type instead of automatically converting to LongType
[ https://issues.apache.org/jira/browse/SPARK-40819?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17620298#comment-17620298 ] Apache Spark commented on SPARK-40819: -- User 'awdavidson' has created a pull request for this issue: https://github.com/apache/spark/pull/38312 > Parquet INT64 (TIMESTAMP(NANOS,true)) now throwing Illegal Parquet type > instead of automatically converting to LongType > > > Key: SPARK-40819 > URL: https://issues.apache.org/jira/browse/SPARK-40819 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.2.0, 3.2.1, 3.3.0, 3.2.2 >Reporter: Alfred Davidson >Priority: Critical > > Since 3.2 parquet files containing attributes with type "INT64 > (TIMESTAMP(NANOS, true))" are no longer readable and attempting to read > throws: > > {code:java} > Caused by: org.apache.spark.sql.AnalysisException: Illegal Parquet type: > INT64 (TIMESTAMP(NANOS,true)) > at > org.apache.spark.sql.errors.QueryCompilationErrors$.illegalParquetTypeError(QueryCompilationErrors.scala:1284) > at > org.apache.spark.sql.execution.datasources.parquet.ParquetToSparkSchemaConverter.illegalType$1(ParquetSchemaConverter.scala:105) > at > org.apache.spark.sql.execution.datasources.parquet.ParquetToSparkSchemaConverter.convertPrimitiveField(ParquetSchemaConverter.scala:174) > at > org.apache.spark.sql.execution.datasources.parquet.ParquetToSparkSchemaConverter.convertField(ParquetSchemaConverter.scala:90) > at > org.apache.spark.sql.execution.datasources.parquet.ParquetToSparkSchemaConverter.$anonfun$convert$1(ParquetSchemaConverter.scala:72) > at > scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:238) > at scala.collection.Iterator.foreach(Iterator.scala:941) > at scala.collection.Iterator.foreach$(Iterator.scala:941) > at scala.collection.AbstractIterator.foreach(Iterator.scala:1429) > at scala.collection.IterableLike.foreach(IterableLike.scala:74) > at 
scala.collection.IterableLike.foreach$(IterableLike.scala:73) > at scala.collection.AbstractIterable.foreach(Iterable.scala:56) > at scala.collection.TraversableLike.map(TraversableLike.scala:238) > at scala.collection.TraversableLike.map$(TraversableLike.scala:231) > at scala.collection.AbstractTraversable.map(Traversable.scala:108) > at > org.apache.spark.sql.execution.datasources.parquet.ParquetToSparkSchemaConverter.convert(ParquetSchemaConverter.scala:66) > at > org.apache.spark.sql.execution.datasources.parquet.ParquetToSparkSchemaConverter.convert(ParquetSchemaConverter.scala:63) > at > org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$.$anonfun$readSchemaFromFooter$2(ParquetFileFormat.scala:548) > at scala.Option.getOrElse(Option.scala:189) > at > org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$.readSchemaFromFooter(ParquetFileFormat.scala:548) > at > org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$.$anonfun$mergeSchemasInParallel$2(ParquetFileFormat.scala:528) > at scala.collection.immutable.Stream.map(Stream.scala:418) > at > org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$.$anonfun$mergeSchemasInParallel$1(ParquetFileFormat.scala:528) > at > org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$.$anonfun$mergeSchemasInParallel$1$adapted(ParquetFileFormat.scala:521) > at > org.apache.spark.sql.execution.datasources.SchemaMergeUtils$.$anonfun$mergeSchemasInParallel$2(SchemaMergeUtils.scala:76) > {code} > Prior to 3.2 successfully reads the parquet automatically converting to a > LongType. 
> I believe the work done as part of https://issues.apache.org/jira/browse/SPARK-34661 > introduced the change in behaviour, more specifically here: > [https://github.com/apache/spark/pull/31776/files#diff-3730a913c4b95edf09fb78f8739c538bae53f7269555b6226efe7ccee1901b39R154] > which throws QueryCompilationErrors.illegalParquetTypeError
[jira] [Commented] (SPARK-40846) GA test failed with Java 8u352
[ https://issues.apache.org/jira/browse/SPARK-40846?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17620272#comment-17620272 ] Apache Spark commented on SPARK-40846: -- User 'LuciferYang' has created a pull request for this issue: https://github.com/apache/spark/pull/38311 > GA test failed with Java 8u352 > -- > > Key: SPARK-40846 > URL: https://issues.apache.org/jira/browse/SPARK-40846 > Project: Spark > Issue Type: Improvement > Components: Tests >Affects Versions: 3.4.0 >Reporter: Yang Jie >Priority: Major > > catalyst test failed > {code:java} > [info] *** 12 TESTS FAILED *** > [error] Failed: Total 6746, Failed 12, Errors 0, Passed 6734, Ignored 5 > [error] Failed tests: > [error] org.apache.spark.sql.catalyst.expressions.CastWithAnsiOffSuite > [error] org.apache.spark.sql.catalyst.util.TimestampFormatterSuite > [error] org.apache.spark.sql.catalyst.expressions.CastWithAnsiOnSuite > [error] org.apache.spark.sql.catalyst.util.RebaseDateTimeSuite > [error] org.apache.spark.sql.catalyst.expressions.TryCastSuite {code} > run TimestampFormatterSuite with 8u352 locally: > > {code:java} > [info] - SPARK-31557: rebasing in legacy formatters/parsers *** FAILED *** > (21 milliseconds) > [info] zoneId = Antarctica/Vostok 1000-01-01T06:52:23 did not equal > 1000-01-01T01:02:03 (TimestampFormatterSuite.scala:281) > [info] org.scalatest.exceptions.TestFailedException: > [info] at > org.scalatest.Assertions.newAssertionFailedException(Assertions.scala:472) > [info] at > org.scalatest.Assertions.newAssertionFailedException$(Assertions.scala:471) > [info] at > org.scalatest.Assertions$.newAssertionFailedException(Assertions.scala:1231) > [info] at > org.scalatest.Assertions$AssertionsHelper.macroAssert(Assertions.scala:1295) > [info] at > org.apache.spark.sql.catalyst.util.TimestampFormatterSuite.$anonfun$new$33(TimestampFormatterSuite.scala:281) > [info] at scala.collection.Iterator.foreach(Iterator.scala:943) > [info] at 
scala.collection.Iterator.foreach$(Iterator.scala:943) > [info] at scala.collection.AbstractIterator.foreach(Iterator.scala:1431) > [info] at scala.collection.IterableLike.foreach(IterableLike.scala:74) > [info] at scala.collection.IterableLike.foreach$(IterableLike.scala:73) > [info] at scala.collection.AbstractIterable.foreach(Iterable.scala:56) > [info] at > org.apache.spark.sql.catalyst.util.TimestampFormatterSuite.$anonfun$new$31(TimestampFormatterSuite.scala:280) > [info] at > scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23) > [info] at org.scalatest.Assertions.withClue(Assertions.scala:1065) > [info] at org.scalatest.Assertions.withClue$(Assertions.scala:1052) > [info] at > org.scalatest.funsuite.AnyFunSuite.withClue(AnyFunSuite.scala:1564) > [info] at > org.apache.spark.sql.catalyst.util.TimestampFormatterSuite.$anonfun$new$30(TimestampFormatterSuite.scala:271) > [info] at > scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23) > [info] at > org.apache.spark.sql.catalyst.util.DateTimeTestUtils$.withDefaultTimeZone(DateTimeTestUtils.scala:61) > [info] at > org.apache.spark.sql.catalyst.util.TimestampFormatterSuite.$anonfun$new$29(TimestampFormatterSuite.scala:271) > [info] at > org.apache.spark.sql.catalyst.plans.SQLHelper.withSQLConf(SQLHelper.scala:54) > [info] at > org.apache.spark.sql.catalyst.plans.SQLHelper.withSQLConf$(SQLHelper.scala:38) > [info] at > org.apache.spark.sql.catalyst.util.TimestampFormatterSuite.withSQLConf(TimestampFormatterSuite.scala:31) > [info] at > org.apache.spark.sql.catalyst.util.TimestampFormatterSuite.$anonfun$new$28(TimestampFormatterSuite.scala:270) > [info] at > org.apache.spark.sql.catalyst.util.TimestampFormatterSuite.$anonfun$new$28$adapted(TimestampFormatterSuite.scala:268) > [info] at scala.collection.immutable.List.foreach(List.scala:431) > [info] at > org.apache.spark.sql.catalyst.util.TimestampFormatterSuite.$anonfun$new$27(TimestampFormatterSuite.scala:268) > [info] at > 
org.apache.spark.sql.catalyst.plans.SQLHelper.withSQLConf(SQLHelper.scala:54) > [info] at > org.apache.spark.sql.catalyst.plans.SQLHelper.withSQLConf$(SQLHelper.scala:38) > [info] at > org.apache.spark.sql.catalyst.util.TimestampFormatterSuite.withSQLConf(TimestampFormatterSuite.scala:31) > [info] at > org.apache.spark.sql.catalyst.util.TimestampFormatterSuite.$anonfun$new$26(TimestampFormatterSuite.scala:268) > [info] at > scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23) > [info] at org.scalatest.OutcomeOf.outcomeOf(OutcomeOf.scala:85) > [info] at org.scalatest.OutcomeOf.outcomeOf$(OutcomeOf.scala:83) > [info] at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104) > [info] at org.scalatest.Tran
[jira] [Assigned] (SPARK-40846) GA test failed with Java 8u352
[ https://issues.apache.org/jira/browse/SPARK-40846?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-40846: Assignee: (was: Apache Spark) > GA test failed with Java 8u352 > -- > > Key: SPARK-40846 > URL: https://issues.apache.org/jira/browse/SPARK-40846 > Project: Spark > Issue Type: Improvement > Components: Tests >Affects Versions: 3.4.0 >Reporter: Yang Jie >Priority: Major > > catalyst test failed > {code:java} > [info] *** 12 TESTS FAILED *** > [error] Failed: Total 6746, Failed 12, Errors 0, Passed 6734, Ignored 5 > [error] Failed tests: > [error] org.apache.spark.sql.catalyst.expressions.CastWithAnsiOffSuite > [error] org.apache.spark.sql.catalyst.util.TimestampFormatterSuite > [error] org.apache.spark.sql.catalyst.expressions.CastWithAnsiOnSuite > [error] org.apache.spark.sql.catalyst.util.RebaseDateTimeSuite > [error] org.apache.spark.sql.catalyst.expressions.TryCastSuite {code} > run TimestampFormatterSuite with 8u352 locally: > > {code:java} > [info] - SPARK-31557: rebasing in legacy formatters/parsers *** FAILED *** > (21 milliseconds) > [info] zoneId = Antarctica/Vostok 1000-01-01T06:52:23 did not equal > 1000-01-01T01:02:03 (TimestampFormatterSuite.scala:281) > [info] org.scalatest.exceptions.TestFailedException: > [info] at > org.scalatest.Assertions.newAssertionFailedException(Assertions.scala:472) > [info] at > org.scalatest.Assertions.newAssertionFailedException$(Assertions.scala:471) > [info] at > org.scalatest.Assertions$.newAssertionFailedException(Assertions.scala:1231) > [info] at > org.scalatest.Assertions$AssertionsHelper.macroAssert(Assertions.scala:1295) > [info] at > org.apache.spark.sql.catalyst.util.TimestampFormatterSuite.$anonfun$new$33(TimestampFormatterSuite.scala:281) > [info] at scala.collection.Iterator.foreach(Iterator.scala:943) > [info] at scala.collection.Iterator.foreach$(Iterator.scala:943) > [info] at scala.collection.AbstractIterator.foreach(Iterator.scala:1431) > [info] at 
scala.collection.IterableLike.foreach(IterableLike.scala:74) > [info] at scala.collection.IterableLike.foreach$(IterableLike.scala:73) > [info] at scala.collection.AbstractIterable.foreach(Iterable.scala:56) > [info] at > org.apache.spark.sql.catalyst.util.TimestampFormatterSuite.$anonfun$new$31(TimestampFormatterSuite.scala:280) > [info] at > scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23) > [info] at org.scalatest.Assertions.withClue(Assertions.scala:1065) > [info] at org.scalatest.Assertions.withClue$(Assertions.scala:1052) > [info] at > org.scalatest.funsuite.AnyFunSuite.withClue(AnyFunSuite.scala:1564) > [info] at > org.apache.spark.sql.catalyst.util.TimestampFormatterSuite.$anonfun$new$30(TimestampFormatterSuite.scala:271) > [info] at > scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23) > [info] at > org.apache.spark.sql.catalyst.util.DateTimeTestUtils$.withDefaultTimeZone(DateTimeTestUtils.scala:61) > [info] at > org.apache.spark.sql.catalyst.util.TimestampFormatterSuite.$anonfun$new$29(TimestampFormatterSuite.scala:271) > [info] at > org.apache.spark.sql.catalyst.plans.SQLHelper.withSQLConf(SQLHelper.scala:54) > [info] at > org.apache.spark.sql.catalyst.plans.SQLHelper.withSQLConf$(SQLHelper.scala:38) > [info] at > org.apache.spark.sql.catalyst.util.TimestampFormatterSuite.withSQLConf(TimestampFormatterSuite.scala:31) > [info] at > org.apache.spark.sql.catalyst.util.TimestampFormatterSuite.$anonfun$new$28(TimestampFormatterSuite.scala:270) > [info] at > org.apache.spark.sql.catalyst.util.TimestampFormatterSuite.$anonfun$new$28$adapted(TimestampFormatterSuite.scala:268) > [info] at scala.collection.immutable.List.foreach(List.scala:431) > [info] at > org.apache.spark.sql.catalyst.util.TimestampFormatterSuite.$anonfun$new$27(TimestampFormatterSuite.scala:268) > [info] at > org.apache.spark.sql.catalyst.plans.SQLHelper.withSQLConf(SQLHelper.scala:54) > [info] at > 
org.apache.spark.sql.catalyst.plans.SQLHelper.withSQLConf$(SQLHelper.scala:38) > [info] at > org.apache.spark.sql.catalyst.util.TimestampFormatterSuite.withSQLConf(TimestampFormatterSuite.scala:31) > [info] at > org.apache.spark.sql.catalyst.util.TimestampFormatterSuite.$anonfun$new$26(TimestampFormatterSuite.scala:268) > [info] at > scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23) > [info] at org.scalatest.OutcomeOf.outcomeOf(OutcomeOf.scala:85) > [info] at org.scalatest.OutcomeOf.outcomeOf$(OutcomeOf.scala:83) > [info] at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104) > [info] at org.scalatest.Transformer.apply(Transformer.scala:22) > [info] at org.scalatest.Transformer.apply(Transformer.scala:20) > [info] at
[jira] [Assigned] (SPARK-40846) GA test failed with Java 8u352
[ https://issues.apache.org/jira/browse/SPARK-40846?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-40846: Assignee: Apache Spark > GA test failed with Java 8u352 > -- > > Key: SPARK-40846 > URL: https://issues.apache.org/jira/browse/SPARK-40846 > Project: Spark > Issue Type: Improvement > Components: Tests >Affects Versions: 3.4.0 >Reporter: Yang Jie >Assignee: Apache Spark >Priority: Major > > catalyst test failed > {code:java} > [info] *** 12 TESTS FAILED *** > [error] Failed: Total 6746, Failed 12, Errors 0, Passed 6734, Ignored 5 > [error] Failed tests: > [error] org.apache.spark.sql.catalyst.expressions.CastWithAnsiOffSuite > [error] org.apache.spark.sql.catalyst.util.TimestampFormatterSuite > [error] org.apache.spark.sql.catalyst.expressions.CastWithAnsiOnSuite > [error] org.apache.spark.sql.catalyst.util.RebaseDateTimeSuite > [error] org.apache.spark.sql.catalyst.expressions.TryCastSuite {code} > run TimestampFormatterSuite with 8u352 locally: > > {code:java} > [info] - SPARK-31557: rebasing in legacy formatters/parsers *** FAILED *** > (21 milliseconds) > [info] zoneId = Antarctica/Vostok 1000-01-01T06:52:23 did not equal > 1000-01-01T01:02:03 (TimestampFormatterSuite.scala:281) > [info] org.scalatest.exceptions.TestFailedException: > [info] at > org.scalatest.Assertions.newAssertionFailedException(Assertions.scala:472) > [info] at > org.scalatest.Assertions.newAssertionFailedException$(Assertions.scala:471) > [info] at > org.scalatest.Assertions$.newAssertionFailedException(Assertions.scala:1231) > [info] at > org.scalatest.Assertions$AssertionsHelper.macroAssert(Assertions.scala:1295) > [info] at > org.apache.spark.sql.catalyst.util.TimestampFormatterSuite.$anonfun$new$33(TimestampFormatterSuite.scala:281) > [info] at scala.collection.Iterator.foreach(Iterator.scala:943) > [info] at scala.collection.Iterator.foreach$(Iterator.scala:943) > [info] at 
scala.collection.AbstractIterator.foreach(Iterator.scala:1431) > [info] at scala.collection.IterableLike.foreach(IterableLike.scala:74) > [info] at scala.collection.IterableLike.foreach$(IterableLike.scala:73) > [info] at scala.collection.AbstractIterable.foreach(Iterable.scala:56) > [info] at > org.apache.spark.sql.catalyst.util.TimestampFormatterSuite.$anonfun$new$31(TimestampFormatterSuite.scala:280) > [info] at > scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23) > [info] at org.scalatest.Assertions.withClue(Assertions.scala:1065) > [info] at org.scalatest.Assertions.withClue$(Assertions.scala:1052) > [info] at > org.scalatest.funsuite.AnyFunSuite.withClue(AnyFunSuite.scala:1564) > [info] at > org.apache.spark.sql.catalyst.util.TimestampFormatterSuite.$anonfun$new$30(TimestampFormatterSuite.scala:271) > [info] at > scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23) > [info] at > org.apache.spark.sql.catalyst.util.DateTimeTestUtils$.withDefaultTimeZone(DateTimeTestUtils.scala:61) > [info] at > org.apache.spark.sql.catalyst.util.TimestampFormatterSuite.$anonfun$new$29(TimestampFormatterSuite.scala:271) > [info] at > org.apache.spark.sql.catalyst.plans.SQLHelper.withSQLConf(SQLHelper.scala:54) > [info] at > org.apache.spark.sql.catalyst.plans.SQLHelper.withSQLConf$(SQLHelper.scala:38) > [info] at > org.apache.spark.sql.catalyst.util.TimestampFormatterSuite.withSQLConf(TimestampFormatterSuite.scala:31) > [info] at > org.apache.spark.sql.catalyst.util.TimestampFormatterSuite.$anonfun$new$28(TimestampFormatterSuite.scala:270) > [info] at > org.apache.spark.sql.catalyst.util.TimestampFormatterSuite.$anonfun$new$28$adapted(TimestampFormatterSuite.scala:268) > [info] at scala.collection.immutable.List.foreach(List.scala:431) > [info] at > org.apache.spark.sql.catalyst.util.TimestampFormatterSuite.$anonfun$new$27(TimestampFormatterSuite.scala:268) > [info] at > 
org.apache.spark.sql.catalyst.plans.SQLHelper.withSQLConf(SQLHelper.scala:54) > [info] at > org.apache.spark.sql.catalyst.plans.SQLHelper.withSQLConf$(SQLHelper.scala:38) > [info] at > org.apache.spark.sql.catalyst.util.TimestampFormatterSuite.withSQLConf(TimestampFormatterSuite.scala:31) > [info] at > org.apache.spark.sql.catalyst.util.TimestampFormatterSuite.$anonfun$new$26(TimestampFormatterSuite.scala:268) > [info] at > scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23) > [info] at org.scalatest.OutcomeOf.outcomeOf(OutcomeOf.scala:85) > [info] at org.scalatest.OutcomeOf.outcomeOf$(OutcomeOf.scala:83) > [info] at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104) > [info] at org.scalatest.Transformer.apply(Transformer.scala:22) > [info] at org.scalatest.Transformer.apply(Transformer
[jira] [Commented] (SPARK-40846) GA test failed with Java 8u352
[ https://issues.apache.org/jira/browse/SPARK-40846?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17620260#comment-17620260 ] Apache Spark commented on SPARK-40846: -- User 'LuciferYang' has created a pull request for this issue: https://github.com/apache/spark/pull/38311 > GA test failed with Java 8u352 > -- > > Key: SPARK-40846 > URL: https://issues.apache.org/jira/browse/SPARK-40846 > Project: Spark > Issue Type: Improvement > Components: Tests >Affects Versions: 3.4.0 >Reporter: Yang Jie >Priority: Major > > catalyst test failed > {code:java} > [info] *** 12 TESTS FAILED *** > [error] Failed: Total 6746, Failed 12, Errors 0, Passed 6734, Ignored 5 > [error] Failed tests: > [error] org.apache.spark.sql.catalyst.expressions.CastWithAnsiOffSuite > [error] org.apache.spark.sql.catalyst.util.TimestampFormatterSuite > [error] org.apache.spark.sql.catalyst.expressions.CastWithAnsiOnSuite > [error] org.apache.spark.sql.catalyst.util.RebaseDateTimeSuite > [error] org.apache.spark.sql.catalyst.expressions.TryCastSuite {code} > run TimestampFormatterSuite with 8u352 locally: > > {code:java} > [info] - SPARK-31557: rebasing in legacy formatters/parsers *** FAILED *** > (21 milliseconds) > [info] zoneId = Antarctica/Vostok 1000-01-01T06:52:23 did not equal > 1000-01-01T01:02:03 (TimestampFormatterSuite.scala:281) > [info] org.scalatest.exceptions.TestFailedException: > [info] at > org.scalatest.Assertions.newAssertionFailedException(Assertions.scala:472) > [info] at > org.scalatest.Assertions.newAssertionFailedException$(Assertions.scala:471) > [info] at > org.scalatest.Assertions$.newAssertionFailedException(Assertions.scala:1231) > [info] at > org.scalatest.Assertions$AssertionsHelper.macroAssert(Assertions.scala:1295) > [info] at > org.apache.spark.sql.catalyst.util.TimestampFormatterSuite.$anonfun$new$33(TimestampFormatterSuite.scala:281) > [info] at scala.collection.Iterator.foreach(Iterator.scala:943) > [info] at 
scala.collection.Iterator.foreach$(Iterator.scala:943) > [info] at scala.collection.AbstractIterator.foreach(Iterator.scala:1431) > [info] at scala.collection.IterableLike.foreach(IterableLike.scala:74) > [info] at scala.collection.IterableLike.foreach$(IterableLike.scala:73) > [info] at scala.collection.AbstractIterable.foreach(Iterable.scala:56) > [info] at > org.apache.spark.sql.catalyst.util.TimestampFormatterSuite.$anonfun$new$31(TimestampFormatterSuite.scala:280) > [info] at > scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23) > [info] at org.scalatest.Assertions.withClue(Assertions.scala:1065) > [info] at org.scalatest.Assertions.withClue$(Assertions.scala:1052) > [info] at > org.scalatest.funsuite.AnyFunSuite.withClue(AnyFunSuite.scala:1564) > [info] at > org.apache.spark.sql.catalyst.util.TimestampFormatterSuite.$anonfun$new$30(TimestampFormatterSuite.scala:271) > [info] at > scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23) > [info] at > org.apache.spark.sql.catalyst.util.DateTimeTestUtils$.withDefaultTimeZone(DateTimeTestUtils.scala:61) > [info] at > org.apache.spark.sql.catalyst.util.TimestampFormatterSuite.$anonfun$new$29(TimestampFormatterSuite.scala:271) > [info] at > org.apache.spark.sql.catalyst.plans.SQLHelper.withSQLConf(SQLHelper.scala:54) > [info] at > org.apache.spark.sql.catalyst.plans.SQLHelper.withSQLConf$(SQLHelper.scala:38) > [info] at > org.apache.spark.sql.catalyst.util.TimestampFormatterSuite.withSQLConf(TimestampFormatterSuite.scala:31) > [info] at > org.apache.spark.sql.catalyst.util.TimestampFormatterSuite.$anonfun$new$28(TimestampFormatterSuite.scala:270) > [info] at > org.apache.spark.sql.catalyst.util.TimestampFormatterSuite.$anonfun$new$28$adapted(TimestampFormatterSuite.scala:268) > [info] at scala.collection.immutable.List.foreach(List.scala:431) > [info] at > org.apache.spark.sql.catalyst.util.TimestampFormatterSuite.$anonfun$new$27(TimestampFormatterSuite.scala:268) > [info] at > 
org.apache.spark.sql.catalyst.plans.SQLHelper.withSQLConf(SQLHelper.scala:54) > [info] at > org.apache.spark.sql.catalyst.plans.SQLHelper.withSQLConf$(SQLHelper.scala:38) > [info] at > org.apache.spark.sql.catalyst.util.TimestampFormatterSuite.withSQLConf(TimestampFormatterSuite.scala:31) > [info] at > org.apache.spark.sql.catalyst.util.TimestampFormatterSuite.$anonfun$new$26(TimestampFormatterSuite.scala:268) > [info] at > scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23) > [info] at org.scalatest.OutcomeOf.outcomeOf(OutcomeOf.scala:85) > [info] at org.scalatest.OutcomeOf.outcomeOf$(OutcomeOf.scala:83) > [info] at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104) > [info] at org.scalatest.Tran
[jira] [Created] (SPARK-40846) GA test failed with Java 8u352
Yang Jie created SPARK-40846: Summary: GA test failed with Java 8u352 Key: SPARK-40846 URL: https://issues.apache.org/jira/browse/SPARK-40846 Project: Spark Issue Type: Improvement Components: Tests Affects Versions: 3.4.0 Reporter: Yang Jie catalyst test failed {code:java} [info] *** 12 TESTS FAILED *** [error] Failed: Total 6746, Failed 12, Errors 0, Passed 6734, Ignored 5 [error] Failed tests: [error] org.apache.spark.sql.catalyst.expressions.CastWithAnsiOffSuite [error] org.apache.spark.sql.catalyst.util.TimestampFormatterSuite [error] org.apache.spark.sql.catalyst.expressions.CastWithAnsiOnSuite [error] org.apache.spark.sql.catalyst.util.RebaseDateTimeSuite [error] org.apache.spark.sql.catalyst.expressions.TryCastSuite {code} run TimestampFormatterSuite with 8u352 locally: {code:java} [info] - SPARK-31557: rebasing in legacy formatters/parsers *** FAILED *** (21 milliseconds) [info] zoneId = Antarctica/Vostok 1000-01-01T06:52:23 did not equal 1000-01-01T01:02:03 (TimestampFormatterSuite.scala:281) [info] org.scalatest.exceptions.TestFailedException: [info] at org.scalatest.Assertions.newAssertionFailedException(Assertions.scala:472) [info] at org.scalatest.Assertions.newAssertionFailedException$(Assertions.scala:471) [info] at org.scalatest.Assertions$.newAssertionFailedException(Assertions.scala:1231) [info] at org.scalatest.Assertions$AssertionsHelper.macroAssert(Assertions.scala:1295) [info] at org.apache.spark.sql.catalyst.util.TimestampFormatterSuite.$anonfun$new$33(TimestampFormatterSuite.scala:281) [info] at scala.collection.Iterator.foreach(Iterator.scala:943) [info] at scala.collection.Iterator.foreach$(Iterator.scala:943) [info] at scala.collection.AbstractIterator.foreach(Iterator.scala:1431) [info] at scala.collection.IterableLike.foreach(IterableLike.scala:74) [info] at scala.collection.IterableLike.foreach$(IterableLike.scala:73) [info] at scala.collection.AbstractIterable.foreach(Iterable.scala:56) [info] at 
org.apache.spark.sql.catalyst.util.TimestampFormatterSuite.$anonfun$new$31(TimestampFormatterSuite.scala:280) [info] at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23) [info] at org.scalatest.Assertions.withClue(Assertions.scala:1065) [info] at org.scalatest.Assertions.withClue$(Assertions.scala:1052) [info] at org.scalatest.funsuite.AnyFunSuite.withClue(AnyFunSuite.scala:1564) [info] at org.apache.spark.sql.catalyst.util.TimestampFormatterSuite.$anonfun$new$30(TimestampFormatterSuite.scala:271) [info] at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23) [info] at org.apache.spark.sql.catalyst.util.DateTimeTestUtils$.withDefaultTimeZone(DateTimeTestUtils.scala:61) [info] at org.apache.spark.sql.catalyst.util.TimestampFormatterSuite.$anonfun$new$29(TimestampFormatterSuite.scala:271) [info] at org.apache.spark.sql.catalyst.plans.SQLHelper.withSQLConf(SQLHelper.scala:54) [info] at org.apache.spark.sql.catalyst.plans.SQLHelper.withSQLConf$(SQLHelper.scala:38) [info] at org.apache.spark.sql.catalyst.util.TimestampFormatterSuite.withSQLConf(TimestampFormatterSuite.scala:31) [info] at org.apache.spark.sql.catalyst.util.TimestampFormatterSuite.$anonfun$new$28(TimestampFormatterSuite.scala:270) [info] at org.apache.spark.sql.catalyst.util.TimestampFormatterSuite.$anonfun$new$28$adapted(TimestampFormatterSuite.scala:268) [info] at scala.collection.immutable.List.foreach(List.scala:431) [info] at org.apache.spark.sql.catalyst.util.TimestampFormatterSuite.$anonfun$new$27(TimestampFormatterSuite.scala:268) [info] at org.apache.spark.sql.catalyst.plans.SQLHelper.withSQLConf(SQLHelper.scala:54) [info] at org.apache.spark.sql.catalyst.plans.SQLHelper.withSQLConf$(SQLHelper.scala:38) [info] at org.apache.spark.sql.catalyst.util.TimestampFormatterSuite.withSQLConf(TimestampFormatterSuite.scala:31) [info] at org.apache.spark.sql.catalyst.util.TimestampFormatterSuite.$anonfun$new$26(TimestampFormatterSuite.scala:268) [info] at 
scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23) [info] at org.scalatest.OutcomeOf.outcomeOf(OutcomeOf.scala:85) [info] at org.scalatest.OutcomeOf.outcomeOf$(OutcomeOf.scala:83) [info] at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104) [info] at org.scalatest.Transformer.apply(Transformer.scala:22) [info] at org.scalatest.Transformer.apply(Transformer.scala:20) [info] at org.scalatest.funsuite.AnyFunSuiteLike$$anon$1.apply(AnyFunSuiteLike.scala:226) [info] at org.apache.spark.SparkFunSuite.withFixture(SparkFunSuite.scala:207) [info] at org.scalatest.funsuite.AnyFunSuiteLike.invokeWithFixture$1(AnyFunSuiteLike.scala:224) [info] at org.scalatest.funsuite.AnyFunSuiteLike.$anonfun$runTest$1(AnyFunSuiteLike.scala:236) [info] at org.scalatest.Super
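The mismatch quoted above ("1000-01-01T06:52:23 did not equal 1000-01-01T01:02:03" for Antarctica/Vostok) is the kind of drift that shows up when a JDK update ships a newer tzdata: for dates far before a zone's first recorded transition, the resolved offset comes from the earliest tzdata entry, and those entries can change between tzdata releases. A quick way to inspect what the locally installed tzdata says, using Python's stdlib zoneinfo as a stand-in for the JDK lookup (the printed offset depends entirely on which tzdata release is installed, so no fixed value is claimed):

```python
from datetime import datetime
from zoneinfo import ZoneInfo  # Python 3.9+; reads the system/installed tzdata

# For a year-1000 date the offset is taken from the zone's earliest rule,
# which is exactly the data that shifted with the 8u352 tzdata update.
vostok = ZoneInfo("Antarctica/Vostok")
dt = datetime(1000, 1, 1, 1, 2, 3, tzinfo=vostok)
print(dt.utcoffset())  # value depends on the installed tzdata release
```

Running the same lookup under two different tzdata versions is a simple way to confirm whether a rebasing-test failure like this one is environmental rather than a Spark regression.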
[jira] [Commented] (SPARK-40734) KafkaMicroBatchV2SourceWithAdminSuite failed
[ https://issues.apache.org/jira/browse/SPARK-40734?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17620167#comment-17620167 ] Yang Jie commented on SPARK-40734: -- It looks like a flaky test; re-running succeeds, but the cause of the failure has not been found yet {code:java} - ensure stream-stream self-join generates only one offset in log and correct metrics *** FAILED *** Timed out waiting for stream: The code passed to failAfter did not complete within 30 seconds. java.base/java.lang.Thread.getStackTrace(Thread.java:2550) org.scalatest.concurrent.TimeLimits$.failAfterImpl(TimeLimits.scala:277) org.scalatest.concurrent.TimeLimits.failAfter(TimeLimits.scala:231) org.scalatest.concurrent.TimeLimits.failAfter$(TimeLimits.scala:230) org.apache.spark.sql.kafka010.KafkaSourceTest.failAfter(KafkaMicroBatchSourceSuite.scala:53) org.apache.spark.sql.streaming.StreamTest.$anonfun$testStream$7(StreamTest.scala:479) org.apache.spark.sql.streaming.StreamTest.$anonfun$testStream$7$adapted(StreamTest.scala:478) scala.collection.mutable.HashMap.$anonfun$foreach$1(HashMap.scala:149) scala.collection.mutable.HashTable.foreachEntry(HashTable.scala:237) scala.collection.mutable.HashTable.foreachEntry$(HashTable.scala:230) Caused by: null java.base/java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:1766) org.apache.spark.sql.execution.streaming.StreamExecution.awaitOffset(StreamExecution.scala:465) org.apache.spark.sql.streaming.StreamTest.$anonfun$testStream$8(StreamTest.scala:480) scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23) org.scalatest.enablers.Timed$$anon$1.timeoutAfter(Timed.scala:127) org.scalatest.concurrent.TimeLimits$.failAfterImpl(TimeLimits.scala:282) org.scalatest.concurrent.TimeLimits.failAfter(TimeLimits.scala:231) org.scalatest.concurrent.TimeLimits.failAfter$(TimeLimits.scala:230) 
org.apache.spark.sql.kafka010.KafkaSourceTest.failAfter(KafkaMicroBatchSourceSuite.scala:53) org.apache.spark.sql.streaming.StreamTest.$anonfun$testStream$7(StreamTest.scala:479) == Progress == AssertOnQuery(, ) AddKafkaData(topics = Set(topic-51), data = WrappedArray(1, 2), message = ) => CheckAnswer: [1,1,1],[2,2,2] AddKafkaData(topics = Set(topic-51), data = WrappedArray(6, 3), message = ) CheckAnswer: [1,1,1],[2,2,2],[3,3,3],[1,6,1],[1,1,6],[1,6,6] AssertOnQuery(, ) == Stream == Output Mode: Append Stream state: {KafkaV2[Subscribe[topic-51]]: {"topic-51":{"1":0,"0":1}}} Thread state: alive Thread stack trace: java.base/java.lang.ProcessImpl.forkAndExec(Native Method) java.base/java.lang.ProcessImpl.(ProcessImpl.java:319) java.base/java.lang.ProcessImpl.start(ProcessImpl.java:249) java.base/java.lang.ProcessBuilder.start(ProcessBuilder.java:1110) java.base/java.lang.ProcessBuilder.start(ProcessBuilder.java:1073) org.apache.hadoop.util.Shell.runCommand(Shell.java:937) org.apache.hadoop.util.Shell.run(Shell.java:900) org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:1212) org.apache.hadoop.util.Shell.execCommand(Shell.java:1306) org.apache.hadoop.util.Shell.execCommand(Shell.java:1288) org.apache.hadoop.fs.RawLocalFileSystem.setPermission(RawLocalFileSystem.java:978) org.apache.hadoop.fs.RawLocalFileSystem$LocalFSFileOutputStream.(RawLocalFileSystem.java:324) org.apache.hadoop.fs.RawLocalFileSystem$LocalFSFileOutputStream.(RawLocalFileSystem.java:294) org.apache.hadoop.fs.RawLocalFileSystem.createOutputStreamWithMode(RawLocalFileSystem.java:439) org.apache.hadoop.fs.RawLocalFileSystem.create(RawLocalFileSystem.java:428) org.apache.hadoop.fs.RawLocalFileSystem.create(RawLocalFileSystem.java:459) org.apache.hadoop.fs.FileSystem.primitiveCreate(FileSystem.java:1305) org.apache.hadoop.fs.DelegateToFileSystem.createInternal(DelegateToFileSystem.java:102) org.apache.hadoop.fs.ChecksumFs$ChecksumFSOutputSummer.(ChecksumFs.java:360) 
org.apache.hadoop.fs.ChecksumFs.createInternal(ChecksumFs.java:400) org.apache.hadoop.fs.AbstractFileSystem.create(AbstractFileSystem.java:626) org.apache.hadoop.fs.FileContext$3.next(FileContext.java:701) org.apache.hadoop.fs.FileContext$3.next(FileContext.java:697) org.apache.hadoop.fs.FSLinkResolver.resolve(FSLinkResolver.java:90) org.apache.hadoop.fs.FileContext.create(FileContext.java:703) org.apache.spark.sql.execution.streaming.FileContextBasedCheckpointFileManager.createTempFile(CheckpointFileManager.scala:359) org.apache.spark.sql.execution.streaming.CheckpointFileManager$RenameBasedFSDataOutputStream.(CheckpointFileManager.scala:140) org.apache.spa
[jira] [Updated] (SPARK-40734) KafkaMicroBatchV2SourceWithAdminSuite failed
[ https://issues.apache.org/jira/browse/SPARK-40734?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yang Jie updated SPARK-40734: - Summary: KafkaMicroBatchV2SourceWithAdminSuite failed (was: KafkaMicroBatchSourceSuite failed) > KafkaMicroBatchV2SourceWithAdminSuite failed > > > Key: SPARK-40734 > URL: https://issues.apache.org/jira/browse/SPARK-40734 > Project: Spark > Issue Type: Sub-task > Components: Structured Streaming >Affects Versions: 3.4.0 >Reporter: Yang Jie >Priority: Minor > > "ensure stream-stream self-join generates only one offset in log and correct > metrics" failed > Failure reason to be supplemented -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-40734) KafkaMicroBatchV2SourceWithAdminSuite failed
[ https://issues.apache.org/jira/browse/SPARK-40734?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yang Jie updated SPARK-40734: - Description: - ensure stream-stream self-join generates only one offset in log and correct metrics *** FAILED *** - read Kafka transactional messages: read_committed *** FAILED *** Failure reason to be supplemented was: "ensure stream-stream self-join generates only one offset in log and correct metrics" failed Failure reason to be supplemented > KafkaMicroBatchV2SourceWithAdminSuite failed > > > Key: SPARK-40734 > URL: https://issues.apache.org/jira/browse/SPARK-40734 > Project: Spark > Issue Type: Sub-task > Components: Structured Streaming >Affects Versions: 3.4.0 >Reporter: Yang Jie >Priority: Minor > > - ensure stream-stream self-join generates only one offset in log and correct > metrics *** FAILED *** > - read Kafka transactional messages: read_committed *** FAILED *** > Failure reason to be supplemented
[jira] [Assigned] (SPARK-40753) Fix bug in test case for catalog directory operation
[ https://issues.apache.org/jira/browse/SPARK-40753?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-40753: --- Assignee: xiaoping.huang > Fix bug in test case for catalog directory operation > > > Key: SPARK-40753 > URL: https://issues.apache.org/jira/browse/SPARK-40753 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.4.0 >Reporter: xiaoping.huang >Assignee: xiaoping.huang >Priority: Minor > > The implementation class of ExternalCatalog performs folder operations when executing operations such as create/drop database/table/partition. The test case creates the folder in advance when obtaining the DB/Partition path URI, so the result of the test case is not convincing enough.
[jira] [Resolved] (SPARK-40753) Fix bug in test case for catalog directory operation
[ https://issues.apache.org/jira/browse/SPARK-40753?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-40753. - Fix Version/s: 3.4.0 Resolution: Fixed Issue resolved by pull request 38206 [https://github.com/apache/spark/pull/38206] > Fix bug in test case for catalog directory operation > > > Key: SPARK-40753 > URL: https://issues.apache.org/jira/browse/SPARK-40753 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.4.0 >Reporter: xiaoping.huang >Assignee: xiaoping.huang >Priority: Minor > Fix For: 3.4.0 > > > The implementation class of ExternalCatalog performs folder operations when executing operations such as create/drop database/table/partition. The test case creates the folder in advance when obtaining the DB/Partition path URI, so the result of the test case is not convincing enough.
[jira] [Commented] (SPARK-40820) Creating StructType from Json
[ https://issues.apache.org/jira/browse/SPARK-40820?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17620099#comment-17620099 ] Anthony Wainer Cachay Guivin commented on SPARK-40820: -- Here is an example: many DataFrames are created from a schema, and that schema is created from JSON. The entry point for creating the schema is StructType.fromJson(json), which internally uses StructField.fromJson(). The issue is that when StructField parses the JSON, it forces the nullable and metadata attributes to be defined inside. [https://user-images.githubusercontent.com/7476964/196637396-d437278c-f462-41dd-8323-3d613c05214b.png] It is understandable that name and type are mandatory, but the others should be optional; the current parsing does not allow this. If more than 1000 fields are defined, this becomes a headache and adds unnecessary metadata. > Creating StructType from Json > - > > Key: SPARK-40820 > URL: https://issues.apache.org/jira/browse/SPARK-40820 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 3.3.0 >Reporter: Anthony Wainer Cachay Guivin >Priority: Minor > > When creating a StructType from a Python dictionary, you use the > [StructType.fromJson|https://github.com/apache/spark/blob/master/python/pyspark/sql/types.py#L569-L571] > method. > A schema can be created as in the code below, but it requires nullable and metadata to be put inside the JSON; this is inconsistent because within the DataType classes these have defaults.
> {code:python}
> json = {
>     "name": "name",
>     "type": "string"
> }
> StructField.fromJson(json)
> {code}
> Error:
> {code:python}
> from pyspark.sql.types import StructField
> json = {
>     "name": "name",
>     "type": "string"
> }
> StructField.fromJson(json)
> >>
> Traceback (most recent call last):
>   File "code.py", line 90, in runcode
>     exec(code, self.locals)
>   File "", line 1, in
>   File "pyspark/sql/types.py", line 583, in fromJson
>     json["nullable"],
> KeyError: 'nullable'
> {code}
>
> Proposed coding solution:
> Instead of indexing into the dictionary, it would be better to use .get with a default:
> {code:python}
> def fromJson(cls, json: Dict[str, Any]) -> "StructField":
>     return StructField(
>         json["name"],
>         _parse_datatype_json_value(json["type"]),
>         json.get("nullable", True),
>         json.get("metadata"),
>     )
> {code}
>
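Until such a parser change lands, the KeyError can be worked around by normalizing field dicts before parsing. The sketch below is a hypothetical helper, not part of PySpark: `with_field_defaults` fills in the defaults that StructField's constructor would apply (nullable=True, empty metadata), so minimal JSON fields parse cleanly.

```python
# Hypothetical workaround helper (not part of PySpark): fill in the optional
# attributes that StructField.fromJson currently requires. The defaults mirror
# StructField's constructor: nullable=True and empty metadata.
from typing import Any, Dict


def with_field_defaults(field: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy of a field dict with 'nullable' and 'metadata' filled in."""
    out = dict(field)                  # don't mutate the caller's dict
    out.setdefault("nullable", True)   # Spark's default for StructField
    out.setdefault("metadata", {})     # empty metadata, as Spark serializes it
    return out


fields = [
    {"name": "name", "type": "string"},                  # minimal field
    {"name": "age", "type": "long", "nullable": False},  # explicit nullable
]
normalized = [with_field_defaults(f) for f in fields]

# With PySpark on the path, these now parse without a KeyError:
# from pyspark.sql.types import StructField, StructType
# schema = StructType([StructField.fromJson(f) for f in normalized])
```

This keeps hand-written schema files minimal while staying compatible with the current strict parser.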
[jira] [Assigned] (SPARK-40813) Add limit and offset to Connect DSL
[ https://issues.apache.org/jira/browse/SPARK-40813?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-40813: Assignee: Rui Wang > Add limit and offset to Connect DSL > --- > > Key: SPARK-40813 > URL: https://issues.apache.org/jira/browse/SPARK-40813 > Project: Spark > Issue Type: Sub-task > Components: Connect >Affects Versions: 3.4.0 >Reporter: Rui Wang >Assignee: Rui Wang >Priority: Major >
[jira] [Resolved] (SPARK-40813) Add limit and offset to Connect DSL
[ https://issues.apache.org/jira/browse/SPARK-40813?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-40813. -- Fix Version/s: 3.4.0 Resolution: Fixed Issue resolved by pull request 38275 [https://github.com/apache/spark/pull/38275] > Add limit and offset to Connect DSL > --- > > Key: SPARK-40813 > URL: https://issues.apache.org/jira/browse/SPARK-40813 > Project: Spark > Issue Type: Sub-task > Components: Connect >Affects Versions: 3.4.0 >Reporter: Rui Wang >Assignee: Rui Wang >Priority: Major > Fix For: 3.4.0 > >
[jira] [Commented] (SPARK-40839) [Python] Implement `DataFrame.sample`
[ https://issues.apache.org/jira/browse/SPARK-40839?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17620086#comment-17620086 ] Apache Spark commented on SPARK-40839: -- User 'zhengruifeng' has created a pull request for this issue: https://github.com/apache/spark/pull/38310 > [Python] Implement `DataFrame.sample` > - > > Key: SPARK-40839 > URL: https://issues.apache.org/jira/browse/SPARK-40839 > Project: Spark > Issue Type: Sub-task > Components: Connect, PySpark >Affects Versions: 3.4.0 >Reporter: Ruifeng Zheng >Assignee: Ruifeng Zheng >Priority: Major >
[jira] [Assigned] (SPARK-40839) [Python] Implement `DataFrame.sample`
[ https://issues.apache.org/jira/browse/SPARK-40839?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-40839: Assignee: Apache Spark (was: Ruifeng Zheng) > [Python] Implement `DataFrame.sample` > - > > Key: SPARK-40839 > URL: https://issues.apache.org/jira/browse/SPARK-40839 > Project: Spark > Issue Type: Sub-task > Components: Connect, PySpark >Affects Versions: 3.4.0 >Reporter: Ruifeng Zheng >Assignee: Apache Spark >Priority: Major >
[jira] [Assigned] (SPARK-40839) [Python] Implement `DataFrame.sample`
[ https://issues.apache.org/jira/browse/SPARK-40839?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-40839: Assignee: Ruifeng Zheng (was: Apache Spark) > [Python] Implement `DataFrame.sample` > - > > Key: SPARK-40839 > URL: https://issues.apache.org/jira/browse/SPARK-40839 > Project: Spark > Issue Type: Sub-task > Components: Connect, PySpark >Affects Versions: 3.4.0 >Reporter: Ruifeng Zheng >Assignee: Ruifeng Zheng >Priority: Major >
[jira] [Assigned] (SPARK-40779) Fix `corrwith` to work properly with different anchor.
[ https://issues.apache.org/jira/browse/SPARK-40779?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-40779: Assignee: Haejoon Lee > Fix `corrwith` to work properly with different anchor. > -- > > Key: SPARK-40779 > URL: https://issues.apache.org/jira/browse/SPARK-40779 > Project: Spark > Issue Type: Sub-task > Components: Pandas API on Spark >Affects Versions: 3.4.0 >Reporter: Haejoon Lee >Assignee: Haejoon Lee >Priority: Major > > DataFrame.corrwith is not working properly with a different anchor in pandas > 1.5.0 >
[jira] [Resolved] (SPARK-40779) Fix `corrwith` to work properly with different anchor.
[ https://issues.apache.org/jira/browse/SPARK-40779?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-40779. -- Fix Version/s: 3.4.0 Resolution: Fixed Issue resolved by pull request 38292 [https://github.com/apache/spark/pull/38292] > Fix `corrwith` to work properly with different anchor. > -- > > Key: SPARK-40779 > URL: https://issues.apache.org/jira/browse/SPARK-40779 > Project: Spark > Issue Type: Sub-task > Components: Pandas API on Spark >Affects Versions: 3.4.0 >Reporter: Haejoon Lee >Assignee: Haejoon Lee >Priority: Major > Fix For: 3.4.0 > > > DataFrame.corrwith is not working properly with a different anchor in pandas > 1.5.0 >
[jira] [Created] (SPARK-40845) Add template support for SPARK_GPG_KEY
Yikun Jiang created SPARK-40845: --- Summary: Add template support for SPARK_GPG_KEY Key: SPARK-40845 URL: https://issues.apache.org/jira/browse/SPARK-40845 Project: Spark Issue Type: Sub-task Components: Spark Docker Affects Versions: 3.4.0 Reporter: Yikun Jiang
[jira] [Commented] (SPARK-40823) Connect Proto should carry unparsed identifiers
[ https://issues.apache.org/jira/browse/SPARK-40823?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17620072#comment-17620072 ] Apache Spark commented on SPARK-40823: -- User 'HyukjinKwon' has created a pull request for this issue: https://github.com/apache/spark/pull/38309 > Connect Proto should carry unparsed identifiers > --- > > Key: SPARK-40823 > URL: https://issues.apache.org/jira/browse/SPARK-40823 > Project: Spark > Issue Type: Sub-task > Components: Connect >Affects Versions: 3.4.0 >Reporter: Rui Wang >Assignee: Rui Wang >Priority: Major > Fix For: 3.4.0 > >
[jira] [Commented] (SPARK-40823) Connect Proto should carry unparsed identifiers
[ https://issues.apache.org/jira/browse/SPARK-40823?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17620067#comment-17620067 ] Apache Spark commented on SPARK-40823: -- User 'zhengruifeng' has created a pull request for this issue: https://github.com/apache/spark/pull/38308 > Connect Proto should carry unparsed identifiers > --- > > Key: SPARK-40823 > URL: https://issues.apache.org/jira/browse/SPARK-40823 > Project: Spark > Issue Type: Sub-task > Components: Connect >Affects Versions: 3.4.0 >Reporter: Rui Wang >Assignee: Rui Wang >Priority: Major > Fix For: 3.4.0 > >
[jira] [Assigned] (SPARK-40844) Flip the default value of Kafka offset fetching config
[ https://issues.apache.org/jira/browse/SPARK-40844?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-40844: Assignee: (was: Apache Spark) > Flip the default value of Kafka offset fetching config > -- > > Key: SPARK-40844 > URL: https://issues.apache.org/jira/browse/SPARK-40844 > Project: Spark > Issue Type: Task > Components: Structured Streaming >Affects Versions: 3.4.0 >Reporter: Jungtaek Lim >Priority: Major > > Discussion thread: > [https://lists.apache.org/thread/spkco94gw33sj8355mhlxz1vl7gl1g5c] >
[jira] [Commented] (SPARK-40844) Flip the default value of Kafka offset fetching config
[ https://issues.apache.org/jira/browse/SPARK-40844?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17620051#comment-17620051 ] Apache Spark commented on SPARK-40844: -- User 'HeartSaVioR' has created a pull request for this issue: https://github.com/apache/spark/pull/38306 > Flip the default value of Kafka offset fetching config > -- > > Key: SPARK-40844 > URL: https://issues.apache.org/jira/browse/SPARK-40844 > Project: Spark > Issue Type: Task > Components: Structured Streaming >Affects Versions: 3.4.0 >Reporter: Jungtaek Lim >Priority: Major > > Discussion thread: > [https://lists.apache.org/thread/spkco94gw33sj8355mhlxz1vl7gl1g5c] >
[jira] [Assigned] (SPARK-40844) Flip the default value of Kafka offset fetching config
[ https://issues.apache.org/jira/browse/SPARK-40844?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-40844: Assignee: Apache Spark > Flip the default value of Kafka offset fetching config > -- > > Key: SPARK-40844 > URL: https://issues.apache.org/jira/browse/SPARK-40844 > Project: Spark > Issue Type: Task > Components: Structured Streaming >Affects Versions: 3.4.0 >Reporter: Jungtaek Lim >Assignee: Apache Spark >Priority: Major > > Discussion thread: > [https://lists.apache.org/thread/spkco94gw33sj8355mhlxz1vl7gl1g5c] > -- This message was sent by Atlassian Jira (v8.20.10#820010)
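For readers tracking SPARK-40844: the flag being flipped is, assuming the config name from the linked discussion and pull request, spark.sql.streaming.kafka.useDeprecatedOffsetFetching (consumer-based offset fetching when true, AdminClient-based when false). A sketch of pinning it explicitly so that deployments are unaffected by the default change; verify the name against the Structured Streaming + Kafka integration guide for your Spark version:

{code}
# spark-defaults.conf (sketch): pin offset fetching explicitly so the
# default flip in SPARK-40844 does not change runtime behavior on upgrade.
spark.sql.streaming.kafka.useDeprecatedOffsetFetching  false
{code}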