[jira] [Resolved] (SPARK-44837) Improve error message for ALTER TABLE ALTER COLUMN on partition columns in non-delta tables
[ https://issues.apache.org/jira/browse/SPARK-44837?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Max Gekk resolved SPARK-44837. -- Fix Version/s: 4.0.0 Resolution: Fixed Issue resolved by pull request 42524 [https://github.com/apache/spark/pull/42524] > Improve error message for ALTER TABLE ALTER COLUMN on partition columns in > non-delta tables > --- > > Key: SPARK-44837 > URL: https://issues.apache.org/jira/browse/SPARK-44837 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.3, 3.1.3, 3.2.4, 3.3.2, 3.4.1, 4.0.0 >Reporter: Michael Zhang >Assignee: Michael Zhang >Priority: Minor > Labels: pull-request-available > Fix For: 4.0.0 > > > > {code:java} > -- hive table > sql("create table some_table (x int, y int, z int) using parquet PARTITIONED > BY (x, y) " + > "location '/Users/someone/runtime/tmp-data/some_table'") > sql("alter table some_table alter column x comment 'some-comment'").collect() > Can't find column `x` given table data columns [`z`].{code} > Improve error message to indicate to users that the command is not supported > on partition columns in non-delta tables. > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
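A hedged aside on the repro above: the confusing wording comes from ALTER COLUMN only seeing the non-partition ("data") columns of the table. A quick way to confirm that `x` is a partition column in that session (a sketch, not part of the ticket):

{code:scala}
// Assumes the spark-shell session and the `some_table` created in the repro above.
spark.sql("DESCRIBE TABLE some_table").show(truncate = false)
// The output's "# Partition Information" section lists `x` and `y`, so the only
// data column the command sees is `z` -- which is exactly what the old error
// message ("given table data columns [`z`]") is complaining about.
{code}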
[jira] [Commented] (SPARK-45608) Migrate SchemaColumnConvertNotSupportedException onto DATATYPE_MISMATCH error classes
[ https://issues.apache.org/jira/browse/SPARK-45608?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=1574#comment-1574 ] Max Gekk commented on SPARK-45608: -- The ticket came from https://github.com/apache/spark/pull/43451#discussion_r1365683194 > Migrate SchemaColumnConvertNotSupportedException onto DATATYPE_MISMATCH error > classes > - > > Key: SPARK-45608 > URL: https://issues.apache.org/jira/browse/SPARK-45608 > Project: Spark > Issue Type: Sub-task > Components: Spark Core >Affects Versions: 3.5.0 >Reporter: Zamil Majdy >Priority: Major > > SchemaColumnConvertNotSupportedException is not currently part of > SparkThrowable. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-45366) Remove productHash from TreeNode
[ https://issues.apache.org/jira/browse/SPARK-45366?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] BingKun Pan resolved SPARK-45366. - Resolution: Duplicate > Remove productHash from TreeNode > > > Key: SPARK-45366 > URL: https://issues.apache.org/jira/browse/SPARK-45366 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 4.0.0 >Reporter: BingKun Pan >Priority: Minor > Labels: pull-request-available > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-44753) XML: Add Python and sparkR binding including Spark Connect
[ https://issues.apache.org/jira/browse/SPARK-44753?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated SPARK-44753: --- Labels: pull-request-available (was: ) > XML: Add Python and sparkR binding including Spark Connect > -- > > Key: SPARK-44753 > URL: https://issues.apache.org/jira/browse/SPARK-44753 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 4.0.0 >Reporter: Sandip Agarwala >Priority: Major > Labels: pull-request-available > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-45289) ClassCastException when reading Delta table on AWS S3
[ https://issues.apache.org/jira/browse/SPARK-45289?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tanawat Panmongkol resolved SPARK-45289. Resolution: Fixed > ClassCastException when reading Delta table on AWS S3 > - > > Key: SPARK-45289 > URL: https://issues.apache.org/jira/browse/SPARK-45289 > Project: Spark > Issue Type: Bug > Components: Spark Core, SQL >Affects Versions: 3.5.0 > Environment: Spark version: 3.5.0 > Deployment mode: spark-shell > OS: Ubuntu (Docker image) > Java/JVM version: OpenJDK 11 > Packages: hadoop-aws:3.3.4, delta-core_2.12:2.4.0 >Reporter: Tanawat Panmongkol >Priority: Major > > When attempting to read a Delta table from S3 using version 3.5.0, a > _*{{ClassCastException}}*_ occurs involving > {{_*org.apache.hadoop.fs.s3a.S3AFileStatus*_}} and > {_}*{{org.apache.spark.sql.execution.datasources.FileStatusWithMetadata}}*{_}. > The error appears to be related to the new feature SPARK-43039. > _*Steps to Reproduce:*_ > {code:java} > export AWS_ACCESS_KEY_ID='' > export AWS_SECRET_ACCESS_KEY='' > export AWS_REGION='' > docker run --rm -it apache/spark:3.5.0-scala2.12-java11-ubuntu > /opt/spark/bin/spark-shell \ > --packages > 'org.apache.hadoop:hadoop-aws:3.3.4,io.delta:delta-core_2.12:2.4.0' \ > --conf > "spark.sql.catalog.spark_catalog=org.apache.spark.sql.delta.catalog.DeltaCatalog" > \ > --conf "spark.sql.extensions=io.delta.sql.DeltaSparkSessionExtension" \ > --conf "spark.hadoop.aws.region=$AWS_REGION" \ > --conf "spark.hadoop.fs.s3a.access.key=$AWS_ACCESS_KEY_ID" \ > --conf "spark.hadoop.fs.s3a.secret.key=$AWS_SECRET_ACCESS_KEY" \ > --conf "spark.hadoop.fs.s3.impl=org.apache.hadoop.fs.s3a.S3AFileSystem" \ > --conf "spark.hadoop.fs.s3a.impl=org.apache.hadoop.fs.s3a.S3AFileSystem" \ > --conf "spark.hadoop.fs.s3a.path.style.access=true" \ > --conf "spark.hadoop.fs.s3a.connection.ssl.enabled=true" \ > --conf "spark.jars.ivy=/tmp/ivy/cache"{code} > {code:java} > scala> > spark.read.format("delta").load("s3:").show() > {code} > *Logs:* > {code:java} > java.lang.ClassCastException: class org.apache.hadoop.fs.s3a.S3AFileStatus > cannot be cast to class > org.apache.spark.sql.execution.datasources.FileStatusWithMetadata > (org.apache.hadoop.fs.s3a.S3AFileStatus is in unnamed module of loader > scala.reflect.internal.util.ScalaClassLoader$URLClassLoader @4552f905; > org.apache.spark.sql.execution.datasources.FileStatusWithMetadata is in > unnamed module of loader 'app') > at > scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:286) > at > scala.collection.IndexedSeqOptimized.foreach(IndexedSeqOptimized.scala:36) > at > scala.collection.IndexedSeqOptimized.foreach$(IndexedSeqOptimized.scala:33) > at scala.collection.mutable.WrappedArray.foreach(WrappedArray.scala:38) > at scala.collection.TraversableLike.map(TraversableLike.scala:286) > at scala.collection.TraversableLike.map$(TraversableLike.scala:279) > at scala.collection.AbstractTraversable.map(Traversable.scala:108) > at > org.apache.spark.sql.execution.FileSourceScanLike.$anonfun$setFilesNumAndSizeMetric$2(DataSourceScanExec.scala:466) > at > org.apache.spark.sql.execution.FileSourceScanLike.$anonfun$setFilesNumAndSizeMetric$2$adapted(DataSourceScanExec.scala:466) > at scala.collection.immutable.List.map(List.scala:293) > at > org.apache.spark.sql.execution.FileSourceScanLike.setFilesNumAndSizeMetric(DataSourceScanExec.scala:466) > at > org.apache.spark.sql.execution.FileSourceScanLike.selectedPartitions(DataSourceScanExec.scala:257) > at > 
org.apache.spark.sql.execution.FileSourceScanLike.selectedPartitions$(DataSourceScanExec.scala:251) > at > org.apache.spark.sql.execution.FileSourceScanExec.selectedPartitions$lzycompute(DataSourceScanExec.scala:506) > at > org.apache.spark.sql.execution.FileSourceScanExec.selectedPartitions(DataSourceScanExec.scala:506) > at > org.apache.spark.sql.execution.FileSourceScanLike.dynamicallySelectedPartitions(DataSourceScanExec.scala:286) > at > org.apache.spark.sql.execution.FileSourceScanLike.dynamicallySelectedPartitions$(DataSourceScanExec.scala:267) > at > org.apache.spark.sql.execution.FileSourceScanExec.dynamicallySelectedPartitions$lzycompute(DataSourceScanExec.scala:506) > at > org.apache.spark.sql.execution.FileSourceScanExec.dynamicallySelectedPartitions(DataSourceScanExec.scala:506) > at > org.apache.spark.sql.execution.FileSourceScanExec.inputRDD$lzycompute(DataSourceScanExec.scala:553) > at > org.apache.spark.sql.execution.FileSourceScanExec.inputRDD(DataSourceScanExec.scala:537) > at > org.apache.spark.sql.execution.FileSourceScanExec.doExecute(DataSourceScanExec.scala:575) > at >
[jira] [Created] (SPARK-45614) Assign name to _LEGACY_ERROR_TEMP_215[6,7,8]
Deng Ziming created SPARK-45614: --- Summary: Assign name to _LEGACY_ERROR_TEMP_215[6,7,8] Key: SPARK-45614 URL: https://issues.apache.org/jira/browse/SPARK-45614 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 3.5.0 Reporter: Deng Ziming Assignee: Deng Ziming Fix For: 4.0.0 Choose proper names for the error classes *_LEGACY_ERROR_TEMP_215[6,7,8]* defined in {*}core/src/main/resources/error/error-classes.json{*}. The names should be short but complete (look at the examples in error-classes.json). Add a test which triggers the error from user code if such a test doesn't already exist. Check exception fields by using {*}checkError(){*}. That function checks only the valuable error fields and avoids depending on the error text message, so tech editors can modify the error format in error-classes.json without worrying about breaking Spark's internal tests. Migrate other tests that might trigger the error onto checkError(). If you cannot reproduce the error from user space (using a SQL query), replace the error by an internal error, see {*}SparkException.internalError(){*}. Improve the error message format in error-classes.json if the current one is not clear. Propose to users how to avoid and fix such errors. Please look at the PRs below as examples: * [https://github.com/apache/spark/pull/38685] * [https://github.com/apache/spark/pull/38656] * [https://github.com/apache/spark/pull/38490] -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
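For anyone picking this up, the checkError() flow described above (a helper in Spark's SparkFunSuite/QueryTest test utilities) usually looks like the sketch below; the triggering query, the new error class name and the parameters are placeholders, not the values this ticket will settle on:

{code:scala}
// Sketch only, inside a QueryTest-based suite: names and parameters are hypothetical.
checkError(
  exception = intercept[org.apache.spark.sql.AnalysisException] {
    sql("SELECT ...")  // a user-space query that used to raise _LEGACY_ERROR_TEMP_215x
  },
  errorClass = "NEWLY_ASSIGNED_NAME",
  parameters = Map("objectName" -> "`some_object`"))
{code}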
[jira] [Resolved] (SPARK-45591) Upgrade ASM to 9.6
[ https://issues.apache.org/jira/browse/SPARK-45591?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yang Jie resolved SPARK-45591. -- Fix Version/s: 4.0.0 Resolution: Fixed Issue resolved by pull request 43431 [https://github.com/apache/spark/pull/43431] > Upgrade ASM to 9.6 > -- > > Key: SPARK-45591 > URL: https://issues.apache.org/jira/browse/SPARK-45591 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 4.0.0 >Reporter: Yang Jie >Assignee: Yang Jie >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-45610) Handle "Auto-application to `()` is deprecated."
[ https://issues.apache.org/jira/browse/SPARK-45610?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=1539#comment-1539 ] Yang Jie edited comment on SPARK-45610 at 10/20/23 2:31 AM: Okay, I can start preparing this PR. was (Author: luciferyang): Okay, I can start preparing this PR. > Handle "Auto-application to `()` is deprecated." > > > Key: SPARK-45610 > URL: https://issues.apache.org/jira/browse/SPARK-45610 > Project: Spark > Issue Type: Sub-task > Components: GraphX, MLlib, Spark Core, SQL, Structured Streaming >Affects Versions: 4.0.0 >Reporter: Yang Jie >Priority: Major > > For the following case, a compile warning will be issued in Scala 2.13: > > {code:java} > Welcome to Scala 2.13.12 (OpenJDK 64-Bit Server VM, Java 17.0.8). > Type in expressions for evaluation. Or try :help. > scala> class Foo { > | def isEmpty(): Boolean = true > | def isTrue(x: Boolean): Boolean = x > | } > class Foo > scala> val foo = new Foo > val foo: Foo = Foo@7061622 > scala> val ret = foo.isEmpty > ^ > warning: Auto-application to `()` is deprecated. Supply the empty > argument list `()` explicitly to invoke method isEmpty, > or remove the empty argument list from its definition (Java-defined > methods are exempt). > In Scala 3, an unapplied method like this will be eta-expanded into a > function. [quickfixable] > val ret: Boolean = true {code} > But for Scala 3, it is a compile error: > {code:java} > Welcome to Scala 3.3.1 (17.0.8, Java OpenJDK 64-Bit Server VM). > Type in expressions for evaluation. Or try :help. > > > > > scala> class Foo { > | def isEmpty(): Boolean = true > | def isTrue(x: Boolean): Boolean = x > | } > // defined class Foo > > > > > scala> val foo = new Foo > val foo: Foo = Foo@591f6f83 > > > > > scala> val ret = foo.isEmpty > -- [E100] Syntax Error: > > 1 |val ret = foo.isEmpty > | ^^^ > | method isEmpty in class Foo must be called with () argument > | > | longer explanation available when compiling with `-explain` > 1 error found {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-45610) Handle "Auto-application to `()` is deprecated."
[ https://issues.apache.org/jira/browse/SPARK-45610?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=1539#comment-1539 ] Yang Jie commented on SPARK-45610: -- Okay, I can start preparing this PR. > Handle "Auto-application to `()` is deprecated." > > > Key: SPARK-45610 > URL: https://issues.apache.org/jira/browse/SPARK-45610 > Project: Spark > Issue Type: Sub-task > Components: GraphX, MLlib, Spark Core, SQL, Structured Streaming >Affects Versions: 4.0.0 >Reporter: Yang Jie >Priority: Major > > For the following case, a compile warning will be issued in Scala 2.13: > > {code:java} > Welcome to Scala 2.13.12 (OpenJDK 64-Bit Server VM, Java 17.0.8). > Type in expressions for evaluation. Or try :help. > scala> class Foo { > | def isEmpty(): Boolean = true > | def isTrue(x: Boolean): Boolean = x > | } > class Foo > scala> val foo = new Foo > val foo: Foo = Foo@7061622 > scala> val ret = foo.isEmpty > ^ > warning: Auto-application to `()` is deprecated. Supply the empty > argument list `()` explicitly to invoke method isEmpty, > or remove the empty argument list from its definition (Java-defined > methods are exempt). > In Scala 3, an unapplied method like this will be eta-expanded into a > function. [quickfixable] > val ret: Boolean = true {code} > But for Scala 3, it is a compile error: > {code:java} > Welcome to Scala 3.3.1 (17.0.8, Java OpenJDK 64-Bit Server VM). > Type in expressions for evaluation. Or try :help. > > > > > scala> class Foo { > | def isEmpty(): Boolean = true > | def isTrue(x: Boolean): Boolean = x > | } > // defined class Foo > > > > > scala> val foo = new Foo > val foo: Foo = Foo@591f6f83 > > > > > scala> val ret = foo.isEmpty > -- [E100] Syntax Error: > > 1 |val ret = foo.isEmpty > | ^^^ > | method isEmpty in class Foo must be called with () argument > | > | longer explanation available when compiling with `-explain` > 1 error found {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
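As the warning text itself suggests, the clean-up is mechanical: either supply the empty argument list at the call site or drop it from the definition. A small sketch of both options, reusing the Foo example from the description:

{code:scala}
class Foo {
  def isEmpty(): Boolean = true
}
val foo = new Foo

// Option 1: supply the empty argument list explicitly at the call site.
val ret1 = foo.isEmpty()

// Option 2: remove the empty argument list from the definition instead,
// so existing parameterless call sites keep compiling unchanged.
class Bar {
  def isEmpty: Boolean = true
}
val ret2 = (new Bar).isEmpty
{code}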
[jira] [Commented] (SPARK-44405) Reduce code duplication in group-based DELETE and MERGE tests
[ https://issues.apache.org/jira/browse/SPARK-44405?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=1538#comment-1538 ] Min Zhao commented on SPARK-44405: -- Hello, are you working on it and would you like me to try it? > Reduce code duplication in group-based DELETE and MERGE tests > - > > Key: SPARK-44405 > URL: https://issues.apache.org/jira/browse/SPARK-44405 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.5.0 >Reporter: Anton Okolnychyi >Priority: Major > > See [this|https://github.com/apache/spark/pull/41600#discussion_r1230014119] > discussion. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-45601) stackoverflow when executing rule ExtractWindowExpressions
[ https://issues.apache.org/jira/browse/SPARK-45601?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] JacobZheng resolved SPARK-45601. Fix Version/s: 3.3.0 Resolution: Resolved > stackoverflow when executing rule ExtractWindowExpressions > -- > > Key: SPARK-45601 > URL: https://issues.apache.org/jira/browse/SPARK-45601 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.2.3 >Reporter: JacobZheng >Priority: Major > Fix For: 3.3.0 > > > I am encountering stackoverflow errors while executing the following test > case. Looking at the source code, ExtractWindowExpressions does not extract > the window correctly and gets stuck in a dead loop at > resolveOperatorsDownWithPruning, which causes the overflow. > {code:scala} > test("agg filter contains window") { > val src = Seq((1, "b", "c")).toDF("col1", "col2", "col3") > .withColumn("test", > expr("count(col1) filter (where min(col1) over(partition by col2 > order by col3)>1)")) > src.show() > } > {code} > Now my question is: is a window function inside an agg filter the correct > usage? Or should I add a check, like Spark SQL does for the WHERE clause, and > throw an error "It is not allowed to use window functions inside WHERE > clause"? -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
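Until the analyzer either supports or explicitly rejects this pattern, one equivalent formulation (a hedged sketch, not from the ticket) is to materialize the window expression as an ordinary column first, so the aggregate FILTER clause no longer contains a window function:

{code:scala}
import org.apache.spark.sql.functions.expr
import spark.implicits._  // assumes a spark-shell / SparkSession named `spark`

val src = Seq((1, "b", "c")).toDF("col1", "col2", "col3")

// Compute the window expression as a plain column first...
val withMin = src.withColumn(
  "min_col1",
  expr("min(col1) over (partition by col2 order by col3)"))

// ...then the aggregate FILTER clause only references that column, so
// ExtractWindowExpressions never has to look inside the filter.
withMin.selectExpr("count(col1) filter (where min_col1 > 1) as test").show()
{code}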
[jira] [Commented] (SPARK-45601) stackoverflow when executing rule ExtractWindowExpressions
[ https://issues.apache.org/jira/browse/SPARK-45601?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=1536#comment-1536 ] JacobZheng commented on SPARK-45601: Got it, Thanks [~bersprockets] > stackoverflow when executing rule ExtractWindowExpressions > -- > > Key: SPARK-45601 > URL: https://issues.apache.org/jira/browse/SPARK-45601 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.2.3 >Reporter: JacobZheng >Priority: Major > > I am encountering stackoverflow errors while executing the following test > case. I looked at the source code and it is ExtractWindowExpressions not > extracting the window correctly and encountering a dead loop at > resolveOperatorsDownWithPruning that is causing it. > {code:scala} > test("agg filter contains window") { > val src = Seq((1, "b", "c")).toDF("col1", "col2", "col3") > .withColumn("test", > expr("count(col1) filter (where min(col1) over(partition by col2 > order by col3)>1)")) > src.show() > } > {code} > Now my question is this kind of in agg filter (window) is the correct usage? > Or should I add a check like spark sql and throw an error "It is not allowed > to use window functions inside WHERE clause"? -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-45613) Expose DeterministicLevel as a DeveloperApi
[ https://issues.apache.org/jira/browse/SPARK-45613?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated SPARK-45613: --- Labels: pull-request-available (was: ) > Expose DeterministicLevel as a DeveloperApi > --- > > Key: SPARK-45613 > URL: https://issues.apache.org/jira/browse/SPARK-45613 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.4.0, 3.5.0, 4.0.0 >Reporter: Mridul Muralidharan >Priority: Major > Labels: pull-request-available > > {{RDD.getOutputDeterministicLevel}} is a {{DeveloperApi}} which users can > override to specify the {{DeterministicLevel}} of the {{RDD}}. > Unfortunately, {{DeterministicLevel}} itself is {{private[spark]}}. > Expose {{DeterministicLevel}} to allow users to use this method. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-45613) Expose DeterministicLevel as a DeveloperApi
Mridul Muralidharan created SPARK-45613: --- Summary: Expose DeterministicLevel as a DeveloperApi Key: SPARK-45613 URL: https://issues.apache.org/jira/browse/SPARK-45613 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 3.5.0, 3.4.0, 4.0.0 Reporter: Mridul Muralidharan {{RDD.getOutputDeterministicLevel}} is a {{DeveloperApi}} which users can override to specify the {{DeterministicLevel}} of the {{RDD}}. Unfortunately, {{DeterministicLevel}} itself is {{private[spark]}}. Expose {{DeterministicLevel}} to allow users to use this method. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
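Once the enum is public, user code would presumably look something like the sketch below; it is a hypothetical custom RDD, and today it does not compile outside org.apache.spark precisely because DeterministicLevel is private[spark]:

{code:scala}
import scala.reflect.ClassTag
import org.apache.spark.{Partition, TaskContext}
import org.apache.spark.rdd.{DeterministicLevel, RDD}

// Hypothetical RDD whose output changes on re-computation (random sampling),
// so Spark must not assume its output is reproducible when stages are retried.
class RandomSampleRDD[T: ClassTag](parent: RDD[T], fraction: Double)
    extends RDD[T](parent) {

  override protected def getPartitions: Array[Partition] = firstParent[T].partitions

  override def compute(split: Partition, context: TaskContext): Iterator[T] =
    firstParent[T].iterator(split, context)
      .filter(_ => scala.util.Random.nextDouble() < fraction)

  // The DeveloperApi hook this ticket is about: declare that re-running this
  // RDD may produce different data, not merely a different ordering.
  override protected def getOutputDeterministicLevel: DeterministicLevel.Value =
    DeterministicLevel.INDETERMINATE
}
{code}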
[jira] [Resolved] (SPARK-45603) merge_spark_pr shall notice us about GITHUB_OAUTH_KEY expiry
[ https://issues.apache.org/jira/browse/SPARK-45603?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ruifeng Zheng resolved SPARK-45603. --- Fix Version/s: 4.0.0 Resolution: Fixed Issue resolved by pull request 43447 [https://github.com/apache/spark/pull/43447] > merge_spark_pr shall notice us about GITHUB_OAUTH_KEY expiry > > > Key: SPARK-45603 > URL: https://issues.apache.org/jira/browse/SPARK-45603 > Project: Spark > Issue Type: Bug > Components: Project Infra >Affects Versions: 4.0.0 >Reporter: Kent Yao >Assignee: Kent Yao >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-45603) merge_spark_pr shall notice us about GITHUB_OAUTH_KEY expiry
[ https://issues.apache.org/jira/browse/SPARK-45603?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ruifeng Zheng reassigned SPARK-45603: - Assignee: Kent Yao > merge_spark_pr shall notice us about GITHUB_OAUTH_KEY expiry > > > Key: SPARK-45603 > URL: https://issues.apache.org/jira/browse/SPARK-45603 > Project: Spark > Issue Type: Bug > Components: Project Infra >Affects Versions: 4.0.0 >Reporter: Kent Yao >Assignee: Kent Yao >Priority: Major > Labels: pull-request-available > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-45612) Allow cached RDDs to be migrated to fallback storage during decommission
[ https://issues.apache.org/jira/browse/SPARK-45612?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated SPARK-45612: --- Labels: pull-request-available (was: ) > Allow cached RDDs to be migrated to fallback storage during decommission > > > Key: SPARK-45612 > URL: https://issues.apache.org/jira/browse/SPARK-45612 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.5.0 >Reporter: Frank Yin >Priority: Major > Labels: pull-request-available > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-45584) Execution fails when there are subqueries in TakeOrderedAndProjectExec
[ https://issues.apache.org/jira/browse/SPARK-45584?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-45584: --- Assignee: Allison Wang > Execution fails when there are subqueries in TakeOrderedAndProjectExec > -- > > Key: SPARK-45584 > URL: https://issues.apache.org/jira/browse/SPARK-45584 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.5.0 >Reporter: Allison Wang >Assignee: Allison Wang >Priority: Major > Labels: pull-request-available > > When there are subqueries in TakeOrderedAndProjectExec, the query can throw > this exception: > java.lang.IllegalArgumentException: requirement failed: Subquery > subquery#242, [id=#109|#109] has not finished > This is because TakeOrderedAndProjectExec does not wait for subquery > execution. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-45584) Execution fails when there are subqueries in TakeOrderedAndProjectExec
[ https://issues.apache.org/jira/browse/SPARK-45584?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-45584. - Fix Version/s: 3.5.1 4.0.0 Resolution: Fixed Issue resolved by pull request 43419 [https://github.com/apache/spark/pull/43419] > Execution fails when there are subqueries in TakeOrderedAndProjectExec > -- > > Key: SPARK-45584 > URL: https://issues.apache.org/jira/browse/SPARK-45584 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.5.0 >Reporter: Allison Wang >Assignee: Allison Wang >Priority: Major > Labels: pull-request-available > Fix For: 3.5.1, 4.0.0 > > > When there are subqueries in TakeOrderedAndProjectExec, the query can throw > this exception: > java.lang.IllegalArgumentException: requirement failed: Subquery > subquery#242, [id=#109|#109] has not finished > This is because TakeOrderedAndProjectExec does not wait for subquery > execution. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
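For readers trying to reproduce this, the failing shape is an ORDER BY ... LIMIT query (planned as TakeOrderedAndProjectExec) whose project list contains a subquery; a hypothetical repro along those lines, not taken from the PR:

{code:scala}
// Hypothetical repro shape: a scalar subquery in the projection of an
// ORDER BY ... LIMIT query, which Spark plans as TakeOrderedAndProjectExec.
spark.range(100).createOrReplaceTempView("t")
spark.sql(
  """SELECT id, (SELECT max(id) FROM t) AS max_id
    |FROM t
    |ORDER BY id
    |LIMIT 3""".stripMargin).collect()
{code}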
[jira] [Resolved] (SPARK-45611) spark.python.pyspark.sql.functions Typo at date_format Function
[ https://issues.apache.org/jira/browse/SPARK-45611?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-45611. -- Fix Version/s: 4.0.0 Resolution: Fixed Issue resolved by pull request 43442 [https://github.com/apache/spark/pull/43442] > spark.python.pyspark.sql.functions Typo at date_format Function > --- > > Key: SPARK-45611 > URL: https://issues.apache.org/jira/browse/SPARK-45611 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 3.5.0 >Reporter: Mete Can Akar >Assignee: Mete Can Akar >Priority: Minor > Labels: pull-request-available > Fix For: 4.0.0 > > Attachments: image-2023-10-19-19-46-22-918.png > > > In the spark.python.pyspark.sql.functions module, at the {{date_format}} > method's doctest, there is a typo in the year format. > Instead of {{'MM/dd/yyy'}}, it should be {{'MM/dd/yyyy'}}, as the expected > output {{[Row(date='04/08/2015')]}} indicates the four-digit year format > {{"MM/dd/yyyy"}}. > > From the official documentation: > !image-2023-10-19-19-46-22-918.png|width=633,height=365! > {code:java} > df = spark.createDataFrame([('2015-04-08',)], ['dt']) > df.select(date_format('dt', 'MM/dd/yyy').alias('date')).collect() > [Row(date='04/08/2015')] > {code} > > As a solution, I proposed the PR [https://github.com/apache/spark/pull/43442]. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
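For reference, the corrected four-letter year pattern can also be verified from a Scala session (a sketch equivalent to the fixed Python doctest):

{code:scala}
import org.apache.spark.sql.functions.{col, date_format}
import spark.implicits._  // assumes a spark-shell / SparkSession named `spark`

val df = Seq("2015-04-08").toDF("dt")
// 'MM/dd/yyyy' (four-letter year) matches the documented output 04/08/2015;
// the doctest previously showed the three-letter 'MM/dd/yyy' by mistake.
df.select(date_format(col("dt"), "MM/dd/yyyy").as("date")).show()
// +----------+
// |      date|
// +----------+
// |04/08/2015|
// +----------+
{code}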
[jira] [Assigned] (SPARK-45611) spark.python.pyspark.sql.functions Typo at date_format Function
[ https://issues.apache.org/jira/browse/SPARK-45611?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-45611: Assignee: Mete Can Akar > spark.python.pyspark.sql.functions Typo at date_format Function > --- > > Key: SPARK-45611 > URL: https://issues.apache.org/jira/browse/SPARK-45611 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 3.5.0 >Reporter: Mete Can Akar >Assignee: Mete Can Akar >Priority: Minor > Labels: pull-request-available > Attachments: image-2023-10-19-19-46-22-918.png > > > In the spark.python.pyspark.sql.functions module, at the {{date_format}} > method's doctest, there is a typo in the year format. > Instead of '{{{}MM/dd/yyy'{}}}, it should be {{'MM/dd/'}} as the expected > output{{{}[Row(date='04/08/2015')]{}}} indicates the following format > {{"MM/dd/".}} > > From the official documentation: > !image-2023-10-19-19-46-22-918.png|width=633,height=365! > {code:java} > df = spark.createDataFrame([('2015-04-08',)], ['dt']) > df.select(date_format('dt', 'MM/dd/yyy').alias('date')).collect() > [Row(date='04/08/2015')] > {code} > > As a solution, I proposed the PR [https://github.com/apache/spark/pull/43442]. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-45428) Add Matomo analytics to all released docs pages
[ https://issues.apache.org/jira/browse/SPARK-45428?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-45428. -- Fix Version/s: 4.0.0 Resolution: Fixed Issue resolved by pull request 43454 [https://github.com/apache/spark/pull/43454] > Add Matomo analytics to all released docs pages > --- > > Key: SPARK-45428 > URL: https://issues.apache.org/jira/browse/SPARK-45428 > Project: Spark > Issue Type: Sub-task > Components: Documentation, PySpark >Affects Versions: 3.5.0, 4.0.0 >Reporter: Allison Wang >Assignee: BingKun Pan >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0 > > > Matomo analytics has been added to some pages of the Spark website. Here is > Sean's initial PR: > [https://github.com/apache/spark-website/pull/479.|https://www.google.com/url?q=https://github.com/apache/spark-website/pull/479=D=docs=1696544881650480=AOvVaw11SNfWcd4UJzlO8EJvzdoe] > You can find analytics for Spark website here: https://analytics.apache.org > We need to add this to all API pages. This is very important for us to > prioritize documentation improvements and search engine optimization. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-45611) spark.python.pyspark.sql.functions Typo at date_format Function
[ https://issues.apache.org/jira/browse/SPARK-45611?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated SPARK-45611: --- Labels: pull-request-available (was: ) > spark.python.pyspark.sql.functions Typo at date_format Function > --- > > Key: SPARK-45611 > URL: https://issues.apache.org/jira/browse/SPARK-45611 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 3.5.0 >Reporter: Mete Can Akar >Priority: Minor > Labels: pull-request-available > Attachments: image-2023-10-19-19-46-22-918.png > > > In the spark.python.pyspark.sql.functions module, at the {{date_format}} > method's doctest, there is a typo in the year format. > Instead of '{{{}MM/dd/yyy'{}}}, it should be {{'MM/dd/'}} as the expected > output{{{}[Row(date='04/08/2015')]{}}} indicates the following format > {{"MM/dd/".}} > > From the official documentation: > !image-2023-10-19-19-46-22-918.png|width=633,height=365! > {code:java} > df = spark.createDataFrame([('2015-04-08',)], ['dt']) > df.select(date_format('dt', 'MM/dd/yyy').alias('date')).collect() > [Row(date='04/08/2015')] > {code} > > As a solution, I proposed the PR [https://github.com/apache/spark/pull/43442]. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-45612) Allow cached RDDs to be migrated to fallback storage during decommission
Frank Yin created SPARK-45612: - Summary: Allow cached RDDs to be migrated to fallback storage during decommission Key: SPARK-45612 URL: https://issues.apache.org/jira/browse/SPARK-45612 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 3.5.0 Reporter: Frank Yin -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-45583) Spark SQL returning incorrect values for full outer join on keys with the same name.
[ https://issues.apache.org/jira/browse/SPARK-45583?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=1502#comment-1502 ] Huw commented on SPARK-45583: - Ahh, apologies, it looks like I was running 3.4.1 when I found this issue. Testing in 3.5 it does appear to be resolved. > Spark SQL returning incorrect values for full outer join on keys with the > same name. > > > Key: SPARK-45583 > URL: https://issues.apache.org/jira/browse/SPARK-45583 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.4.1 >Reporter: Huw >Priority: Major > Fix For: 3.5.0 > > > {{The following query gives the wrong results.}} > > {{WITH people as (}} > {{ SELECT * FROM (VALUES }} > {{ (1, 'Peter'), }} > {{ (2, 'Homer'), }} > {{ (3, 'Ned'),}} > {{ (3, 'Jenny')}} > {{ ) AS Idiots(id, FirstName)}} > {{{}){}}}{{{}, location as ({}}} > {{ SELECT * FROM (VALUES}} > {{ (1, 'sample0'),}} > {{ (1, 'sample1'),}} > {{ (2, 'sample2') }} > {{ ) as Locations(id, address)}} > {{{}){}}}{{{}SELECT{}}} > {{ *}} > {{FROM}} > {{ people}} > {{FULL OUTER JOIN}} > {{ location}} > {{ON}} > {{ people.id = location.id}} > {{We find the following table:}} > ||id: integer||FirstName: string||id: integer||address: string|| > |2|Homer|2|sample2| > |null|Ned|null|null| > |null|Jenny|null|null| > |1|Peter|1|sample0| > |1|Peter|1|sample1| > {{But clearly the first `id` column is wrong, the nulls should be 3.}} > If we rename the id column in (only) the person table to pid we get the > correct results: > ||pid: integer||FirstName: string||id: integer||address: string|| > |2|Homer|2|sample2| > |3|Ned|null|null| > |3|Jenny|null|null| > |1|Peter|1|sample0| > |1|Peter|1|sample1| -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-45583) Spark SQL returning incorrect values for full outer join on keys with the same name.
[ https://issues.apache.org/jira/browse/SPARK-45583?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Huw updated SPARK-45583: Affects Version/s: 3.4.1 (was: 3.5.0) > Spark SQL returning incorrect values for full outer join on keys with the > same name. > > > Key: SPARK-45583 > URL: https://issues.apache.org/jira/browse/SPARK-45583 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.4.1 >Reporter: Huw >Priority: Major > > {{The following query gives the wrong results.}} > > {{WITH people as (}} > {{ SELECT * FROM (VALUES }} > {{ (1, 'Peter'), }} > {{ (2, 'Homer'), }} > {{ (3, 'Ned'),}} > {{ (3, 'Jenny')}} > {{ ) AS Idiots(id, FirstName)}} > {{{}){}}}{{{}, location as ({}}} > {{ SELECT * FROM (VALUES}} > {{ (1, 'sample0'),}} > {{ (1, 'sample1'),}} > {{ (2, 'sample2') }} > {{ ) as Locations(id, address)}} > {{{}){}}}{{{}SELECT{}}} > {{ *}} > {{FROM}} > {{ people}} > {{FULL OUTER JOIN}} > {{ location}} > {{ON}} > {{ people.id = location.id}} > {{We find the following table:}} > ||id: integer||FirstName: string||id: integer||address: string|| > |2|Homer|2|sample2| > |null|Ned|null|null| > |null|Jenny|null|null| > |1|Peter|1|sample0| > |1|Peter|1|sample1| > {{But clearly the first `id` column is wrong, the nulls should be 3.}} > If we rename the id column in (only) the person table to pid we get the > correct results: > ||pid: integer||FirstName: string||id: integer||address: string|| > |2|Homer|2|sample2| > |3|Ned|null|null| > |3|Jenny|null|null| > |1|Peter|1|sample0| > |1|Peter|1|sample1| -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-45583) Spark SQL returning incorrect values for full outer join on keys with the same name.
[ https://issues.apache.org/jira/browse/SPARK-45583?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Huw updated SPARK-45583: Fix Version/s: 3.5.0 > Spark SQL returning incorrect values for full outer join on keys with the > same name. > > > Key: SPARK-45583 > URL: https://issues.apache.org/jira/browse/SPARK-45583 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.4.1 >Reporter: Huw >Priority: Major > Fix For: 3.5.0 > > > {{The following query gives the wrong results.}} > > {{WITH people as (}} > {{ SELECT * FROM (VALUES }} > {{ (1, 'Peter'), }} > {{ (2, 'Homer'), }} > {{ (3, 'Ned'),}} > {{ (3, 'Jenny')}} > {{ ) AS Idiots(id, FirstName)}} > {{{}){}}}{{{}, location as ({}}} > {{ SELECT * FROM (VALUES}} > {{ (1, 'sample0'),}} > {{ (1, 'sample1'),}} > {{ (2, 'sample2') }} > {{ ) as Locations(id, address)}} > {{{}){}}}{{{}SELECT{}}} > {{ *}} > {{FROM}} > {{ people}} > {{FULL OUTER JOIN}} > {{ location}} > {{ON}} > {{ people.id = location.id}} > {{We find the following table:}} > ||id: integer||FirstName: string||id: integer||address: string|| > |2|Homer|2|sample2| > |null|Ned|null|null| > |null|Jenny|null|null| > |1|Peter|1|sample0| > |1|Peter|1|sample1| > {{But clearly the first `id` column is wrong, the nulls should be 3.}} > If we rename the id column in (only) the person table to pid we get the > correct results: > ||pid: integer||FirstName: string||id: integer||address: string|| > |2|Homer|2|sample2| > |3|Ned|null|null| > |3|Jenny|null|null| > |1|Peter|1|sample0| > |1|Peter|1|sample1| -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-45604) Converting timestamp_ntz to array can cause NPE or SEGFAULT on parquet vectorized reader
[ https://issues.apache.org/jira/browse/SPARK-45604?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zamil Majdy updated SPARK-45604: Description: Repro: ``` val path = "/tmp/sample_parquet_file" spark.sql("SELECT CAST('2019-01-01' AS TIMESTAMP_NTZ) AS field").write.parquet(path) spark.read.schema("field ARRAY").parquet(path).collect() ``` Depending on the memory mode, it will throw an NPE on OnHeap mode and SEGFAULT on OffHeap mode. was: Repro: val path = "/tmp/sample_parquet_file" spark.sql("SELECT CAST('2019-01-01' AS TIMESTAMP_NTZ) AS field").write.parquet(path) spark.read.schema("field ARRAY").parquet(path).collect() Depending on the memory mode, it will throw an NPE on OnHeap mode and SEGFAULT on OffHeap mode. > Converting timestamp_ntz to array can cause NPE or SEGFAULT on > parquet vectorized reader > --- > > Key: SPARK-45604 > URL: https://issues.apache.org/jira/browse/SPARK-45604 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.5.0 >Reporter: Zamil Majdy >Priority: Major > Labels: pull-request-available > > Repro: > ``` > val path = "/tmp/sample_parquet_file" > spark.sql("SELECT CAST('2019-01-01' AS TIMESTAMP_NTZ) AS > field").write.parquet(path) > spark.read.schema("field ARRAY").parquet(path).collect() > ``` > Depending on the memory mode, it will throw an NPE on OnHeap mode and > SEGFAULT on OffHeap mode. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
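A hedged note on the repro above: the "memory mode" the description refers to is the column-vector allocation mode of the vectorized Parquet reader, and the crash can be sidestepped (not fixed) by falling back to the row-based reader. The config keys below exist in Spark; whether they fully avoid the issue is an assumption, not something stated in the ticket:

{code:scala}
// spark.sql.columnVector.offheap.enabled = false -> on-heap vectors (reported NPE)
// spark.sql.columnVector.offheap.enabled = true  -> off-heap vectors (reported SEGFAULT)
// Possible workaround until the fix: skip the vectorized Parquet reader entirely.
spark.conf.set("spark.sql.parquet.enableVectorizedReader", "false")
{code}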
[jira] [Updated] (SPARK-45604) Converting timestamp_ntz to array can cause NPE or SEGFAULT on parquet vectorized reader
[ https://issues.apache.org/jira/browse/SPARK-45604?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zamil Majdy updated SPARK-45604: Description: Repro: val path = "/tmp/sample_parquet_file" spark.sql("SELECT CAST('2019-01-01' AS TIMESTAMP_NTZ) AS field").write.parquet(path) spark.read.schema("field ARRAY").parquet(path).collect() Depending on the memory mode, it will throw an NPE on OnHeap mode and SEGFAULT on OffHeap mode. was: Repro: {{{}{}}}``` val path = "/tmp/zamil/timestamp" spark.sql("SELECT CAST('2019-01-01' AS TIMESTAMP_NTZ) AS field").write.parquet(path) spark.read.schema("field ARRAY").parquet(path).collect() ``` Depending on the memory mode, it will throw an NPE on OnHeap mode and SEGFAULT on OffHeap mode. > Converting timestamp_ntz to array can cause NPE or SEGFAULT on > parquet vectorized reader > --- > > Key: SPARK-45604 > URL: https://issues.apache.org/jira/browse/SPARK-45604 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.5.0 >Reporter: Zamil Majdy >Priority: Major > Labels: pull-request-available > > Repro: > val path = "/tmp/sample_parquet_file" > spark.sql("SELECT CAST('2019-01-01' AS TIMESTAMP_NTZ) AS > field").write.parquet(path) > spark.read.schema("field ARRAY").parquet(path).collect() > Depending on the memory mode, it will throw an NPE on OnHeap mode and > SEGFAULT on OffHeap mode. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-45604) Converting timestamp_ntz to array can cause NPE or SEGFAULT on parquet vectorized reader
[ https://issues.apache.org/jira/browse/SPARK-45604?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zamil Majdy updated SPARK-45604: Description: Repro: {{{}{}}}``` val path = "/tmp/zamil/timestamp" spark.sql("SELECT CAST('2019-01-01' AS TIMESTAMP_NTZ) AS field").write.parquet(path) spark.read.schema("field ARRAY").parquet(path).collect() ``` Depending on the memory mode, it will throw an NPE on OnHeap mode and SEGFAULT on OffHeap mode. was: Repro: {{val path = "/tmp/someparquetfile"}} {{spark.sql("SELECT CAST('2019-01-01 00:00:00' AS TIMESTAMP_NTZ) AS field").write.mode("overwrite").parquet(path)}} {{spark.read.schema("field array").parquet(path).collect()}} Depending on the memory mode, it will throw an NPE on OnHeap mode and SEGFAULT on OffHeap mode. > Converting timestamp_ntz to array can cause NPE or SEGFAULT on > parquet vectorized reader > --- > > Key: SPARK-45604 > URL: https://issues.apache.org/jira/browse/SPARK-45604 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.5.0 >Reporter: Zamil Majdy >Priority: Major > Labels: pull-request-available > > Repro: > {{{}{}}}``` > val path = "/tmp/zamil/timestamp" > spark.sql("SELECT CAST('2019-01-01' AS TIMESTAMP_NTZ) AS > field").write.parquet(path) > spark.read.schema("field ARRAY").parquet(path).collect() > ``` > Depending on the memory mode, it will throw an NPE on OnHeap mode and > SEGFAULT on OffHeap mode. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-45604) Converting timestamp_ntz to array can cause NPE or SEGFAULT on parquet vectorized reader
[ https://issues.apache.org/jira/browse/SPARK-45604?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zamil Majdy updated SPARK-45604: Description: Repro: {{val path = "/tmp/someparquetfile"}} {{spark.sql("SELECT CAST('2019-01-01 00:00:00' AS TIMESTAMP_NTZ) AS field").write.mode("overwrite").parquet(path)}} {{spark.read.schema("field array").parquet(path).collect()}} Depending on the memory mode, it will throw an NPE on OnHeap mode and SEGFAULT on OffHeap mode. was: Repro: {{val path = "/tmp/someparquetfile" spark.sql("SELECT CAST('2019-01-01 00:00:00' AS TIMESTAMP_NTZ) AS field").write.mode("overwrite").parquet(path) spark.read.schema("field array").parquet(path).collect()}} Depending on the memory mode, it will throw an NPE on OnHeap mode and SEGFAULT on OffHeap mode. > Converting timestamp_ntz to array can cause NPE or SEGFAULT on > parquet vectorized reader > --- > > Key: SPARK-45604 > URL: https://issues.apache.org/jira/browse/SPARK-45604 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.5.0 >Reporter: Zamil Majdy >Priority: Major > Labels: pull-request-available > > Repro: > {{val path = "/tmp/someparquetfile"}} > {{spark.sql("SELECT CAST('2019-01-01 00:00:00' AS TIMESTAMP_NTZ) AS > field").write.mode("overwrite").parquet(path)}} > {{spark.read.schema("field array").parquet(path).collect()}} > Depending on the memory mode, it will throw an NPE on OnHeap mode and > SEGFAULT on OffHeap mode. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-45604) Converting timestamp_ntz to array can cause NPE or SEGFAULT on parquet vectorized reader
[ https://issues.apache.org/jira/browse/SPARK-45604?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zamil Majdy updated SPARK-45604: Description: Repro: {{val path = "/tmp/someparquetfile" spark.sql("SELECT CAST('2019-01-01 00:00:00' AS TIMESTAMP_NTZ) AS field").write.mode("overwrite").parquet(path) spark.read.schema("field array").parquet(path).collect()}} Depending on the memory mode, it will throw an NPE on OnHeap mode and SEGFAULT on OffHeap mode. was: Repro: ``` val path = "/tmp/someparquetfile" val df = sql("SELECT MAP('key', CAST('2019-01-01 00:00:00' AS TIMESTAMP_NTZ)) AS field") df.write.mode("overwrite").parquet(path) spark.read.schema("field map>").parquet(path).collect() ``` Depending on the memory mode is used, it will produced NPE on on-heap mode, and segfault on off-heap > Converting timestamp_ntz to array can cause NPE or SEGFAULT on > parquet vectorized reader > --- > > Key: SPARK-45604 > URL: https://issues.apache.org/jira/browse/SPARK-45604 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.5.0 >Reporter: Zamil Majdy >Priority: Major > Labels: pull-request-available > > Repro: > {{val path = "/tmp/someparquetfile" > spark.sql("SELECT CAST('2019-01-01 00:00:00' AS TIMESTAMP_NTZ) AS > field").write.mode("overwrite").parquet(path) > spark.read.schema("field array").parquet(path).collect()}} > Depending on the memory mode, it will throw an NPE on OnHeap mode and > SEGFAULT on OffHeap mode. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-45611) spark.python.pyspark.sql.functions Typo at date_format Function
[ https://issues.apache.org/jira/browse/SPARK-45611?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mete Can Akar updated SPARK-45611: -- Description: In the spark.python.pyspark.sql.functions module, at the {{date_format}} method's doctest, there is a typo in the year format. Instead of '{{{}MM/dd/yyy'{}}}, it should be {{'MM/dd/'}} as the expected output{{{}[Row(date='04/08/2015')]{}}} indicates the following format {{"MM/dd/".}} >From the official documentation: !image-2023-10-19-19-46-22-918.png|width=633,height=365! {code:java} df = spark.createDataFrame([('2015-04-08',)], ['dt']) df.select(date_format('dt', 'MM/dd/yyy').alias('date')).collect() [Row(date='04/08/2015')] {code} As a solution, I proposed the PR [https://github.com/apache/spark/pull/43442]. was: In the spark.python.pyspark.sql.functions module, at the {{date_format}} method's doctest, there is a typo in the year format. Instead of '{{{}MM/dd/yyy'{}}}, it should be {{'MM/dd/'}} as the expected output{{{}[Row(date='04/08/2015')]{}}} indicates the following format {{"MM/dd/".}} !image-2023-10-19-19-46-22-918.png|width=633,height=365! {code:java} df = spark.createDataFrame([('2015-04-08',)], ['dt']) df.select(date_format('dt', 'MM/dd/yyy').alias('date')).collect() [Row(date='04/08/2015')] {code} As a solution, I proposed the PR [https://github.com/apache/spark/pull/43442]. > spark.python.pyspark.sql.functions Typo at date_format Function > --- > > Key: SPARK-45611 > URL: https://issues.apache.org/jira/browse/SPARK-45611 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 3.5.0 >Reporter: Mete Can Akar >Priority: Minor > Attachments: image-2023-10-19-19-46-22-918.png > > > In the spark.python.pyspark.sql.functions module, at the {{date_format}} > method's doctest, there is a typo in the year format. > Instead of '{{{}MM/dd/yyy'{}}}, it should be {{'MM/dd/'}} as the expected > output{{{}[Row(date='04/08/2015')]{}}} indicates the following format > {{"MM/dd/".}} > > From the official documentation: > !image-2023-10-19-19-46-22-918.png|width=633,height=365! > {code:java} > df = spark.createDataFrame([('2015-04-08',)], ['dt']) > df.select(date_format('dt', 'MM/dd/yyy').alias('date')).collect() > [Row(date='04/08/2015')] > {code} > > As a solution, I proposed the PR [https://github.com/apache/spark/pull/43442]. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-45611) spark.python.pyspark.sql.functions Typo at date_format Function
[ https://issues.apache.org/jira/browse/SPARK-45611?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mete Can Akar updated SPARK-45611: -- Description: In the spark.python.pyspark.sql.functions module, at the {{date_format}} method's doctest, there is a typo in the year format. Instead of '{{{}MM/dd/yyy'{}}}, it should be {{'MM/dd/'}} as the expected output{{{}[Row(date='04/08/2015')]{}}} indicates the following format {{"MM/dd/".}} !image-2023-10-19-19-46-22-918.png|width=633,height=365! {code:java} df = spark.createDataFrame([('2015-04-08',)], ['dt']) df.select(date_format('dt', 'MM/dd/yyy').alias('date')).collect() [Row(date='04/08/2015')] {code} As a solution, I proposed the PR [https://github.com/apache/spark/pull/43442]. was: In the spark.python.pyspark.sql.functions module, at the {{date_format}} method's doctest, there is a typo in the year format. Instead of '{{{}MM/dd/yyy'{}}}, it should be {{'MM/dd/'}} as the expected output{{{}[Row(date='04/08/2015')]{}}} indicates the following format {{"MM/dd/".}} {code:java} df = spark.createDataFrame([('2015-04-08',)], ['dt']) df.select(date_format('dt', 'MM/dd/yyy').alias('date')).collect() [Row(date='04/08/2015')] {code} As a solution, I proposed the PR [https://github.com/apache/spark/pull/43442]. > spark.python.pyspark.sql.functions Typo at date_format Function > --- > > Key: SPARK-45611 > URL: https://issues.apache.org/jira/browse/SPARK-45611 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 3.5.0 >Reporter: Mete Can Akar >Priority: Minor > Attachments: image-2023-10-19-19-46-22-918.png > > > In the spark.python.pyspark.sql.functions module, at the {{date_format}} > method's doctest, there is a typo in the year format. > Instead of '{{{}MM/dd/yyy'{}}}, it should be {{'MM/dd/'}} as the expected > output{{{}[Row(date='04/08/2015')]{}}} indicates the following format > {{"MM/dd/".}} > > !image-2023-10-19-19-46-22-918.png|width=633,height=365! > {code:java} > df = spark.createDataFrame([('2015-04-08',)], ['dt']) > df.select(date_format('dt', 'MM/dd/yyy').alias('date')).collect() > [Row(date='04/08/2015')] > {code} > > As a solution, I proposed the PR [https://github.com/apache/spark/pull/43442]. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-45611) spark.python.pyspark.sql.functions Typo at date_format Function
[ https://issues.apache.org/jira/browse/SPARK-45611?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mete Can Akar updated SPARK-45611: -- Attachment: image-2023-10-19-19-46-22-918.png > spark.python.pyspark.sql.functions Typo at date_format Function > --- > > Key: SPARK-45611 > URL: https://issues.apache.org/jira/browse/SPARK-45611 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 3.5.0 >Reporter: Mete Can Akar >Priority: Minor > Attachments: image-2023-10-19-19-46-22-918.png > > > In the spark.python.pyspark.sql.functions module, at the {{date_format}} > method's doctest, there is a typo in the year format. > Instead of '{{{}MM/dd/yyy'{}}}, it should be {{'MM/dd/'}} as the expected > output{{{}[Row(date='04/08/2015')]{}}} indicates the following format > {{"MM/dd/".}} > > {code:java} > df = spark.createDataFrame([('2015-04-08',)], ['dt']) > df.select(date_format('dt', 'MM/dd/yyy').alias('date')).collect() > [Row(date='04/08/2015')] > {code} > > As a solution, I proposed the PR [https://github.com/apache/spark/pull/43442]. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-45611) spark.python.pyspark.sql.functions Typo at date_format Function
[ https://issues.apache.org/jira/browse/SPARK-45611?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mete Can Akar updated SPARK-45611: -- Description: In the spark.python.pyspark.sql.functions module, at the {{date_format}} method's doctest, there is a typo in the year format. Instead of '{{{}MM/dd/yyy'{}}}, it should be {{'MM/dd/'}} as the expected output{{{}[Row(date='04/08/2015')]{}}} indicates the following format {{"MM/dd/".}} {code:java} df = spark.createDataFrame([('2015-04-08',)], ['dt']) df.select(date_format('dt', 'MM/dd/yyy').alias('date')).collect() [Row(date='04/08/2015')] {code} As a solution, I proposed the PR [https://github.com/apache/spark/pull/43442].}}{}}}}}{}}} was: In the spark.python.pyspark.sql.functions module, at the {{date_format}} method's doctest, there is a typo in the year format. Instead of '{{{}MM/dd/yyy'{}}}, it should be {{'MM/dd/'}} as the expected output{{{}[Row(date='04/08/2015')]{}}} indicates the following format {{"MM/dd/".}} {{ {code:java} df = spark.createDataFrame([('2015-04-08',)], ['dt']) df.select(date_format('dt', 'MM/dd/yyy').alias('date')).collect() [Row(date='04/08/2015')] {code} }} As a solution, I proposed the PR [https://github.com/apache/spark/pull/43442].}}{}}}}}{}}} > spark.python.pyspark.sql.functions Typo at date_format Function > --- > > Key: SPARK-45611 > URL: https://issues.apache.org/jira/browse/SPARK-45611 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 3.5.0 >Reporter: Mete Can Akar >Priority: Minor > > In the spark.python.pyspark.sql.functions module, at the {{date_format}} > method's doctest, there is a typo in the year format. > Instead of '{{{}MM/dd/yyy'{}}}, it should be {{'MM/dd/'}} as the expected > output{{{}[Row(date='04/08/2015')]{}}} indicates the following format > {{"MM/dd/".}} > > {code:java} > df = spark.createDataFrame([('2015-04-08',)], ['dt']) > df.select(date_format('dt', 'MM/dd/yyy').alias('date')).collect() > [Row(date='04/08/2015')] > {code} > > As a solution, I proposed the PR > [https://github.com/apache/spark/pull/43442].}}{}}}}}{}}} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-45611) spark.python.pyspark.sql.functions Typo at date_format Function
[ https://issues.apache.org/jira/browse/SPARK-45611?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mete Can Akar updated SPARK-45611: -- Description: In the spark.python.pyspark.sql.functions module, at the {{date_format}} method's doctest, there is a typo in the year format. Instead of '{{{}MM/dd/yyy'{}}}, it should be {{'MM/dd/'}} as the expected output{{{}[Row(date='04/08/2015')]{}}} indicates the following format {{"MM/dd/".}} {{ {code:java} df = spark.createDataFrame([('2015-04-08',)], ['dt']) df.select(date_format('dt', 'MM/dd/yyy').alias('date')).collect() [Row(date='04/08/2015')] {code} }} As a solution, I proposed the PR [https://github.com/apache/spark/pull/43442].}}{}}}}}{}}} was: In the spark.python.pyspark.sql.functions module, at the {{date_format}} method's doctest, there is a typo in the year format. Instead of '{{{}MM/dd/yyy'{}}}, it should be {{'MM/dd/'}} as the expected output{{{}[Row(date='04/08/2015')]{}}} indicates the following format {{"MM/dd/".}} {{}} As a solution, I proposed the PR https://github.com/apache/spark/pull/43442.{{{}{}}}{{{}{}}} > spark.python.pyspark.sql.functions Typo at date_format Function > --- > > Key: SPARK-45611 > URL: https://issues.apache.org/jira/browse/SPARK-45611 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 3.5.0 >Reporter: Mete Can Akar >Priority: Minor > > In the spark.python.pyspark.sql.functions module, at the {{date_format}} > method's doctest, there is a typo in the year format. > Instead of '{{{}MM/dd/yyy'{}}}, it should be {{'MM/dd/'}} as the expected > output{{{}[Row(date='04/08/2015')]{}}} indicates the following format > {{"MM/dd/".}} > {{ > {code:java} > df = spark.createDataFrame([('2015-04-08',)], ['dt']) > df.select(date_format('dt', 'MM/dd/yyy').alias('date')).collect() > [Row(date='04/08/2015')] > {code} > }} > As a solution, I proposed the PR > [https://github.com/apache/spark/pull/43442].}}{}}}}}{}}} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-45611) spark.python.pyspark.sql.functions Typo at date_format Function
[ https://issues.apache.org/jira/browse/SPARK-45611?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mete Can Akar updated SPARK-45611: -- Description: In the spark.python.pyspark.sql.functions module, at the {{date_format}} method's doctest, there is a typo in the year format. Instead of {{'MM/dd/yyy'}}, it should be {{'MM/dd/yyyy'}}, as the expected output {{[Row(date='04/08/2015')]}} indicates the format {{'MM/dd/yyyy'}}. {code:java} df = spark.createDataFrame([('2015-04-08',)], ['dt']) df.select(date_format('dt', 'MM/dd/yyy').alias('date')).collect() [Row(date='04/08/2015')] {code} As a solution, I proposed the PR [https://github.com/apache/spark/pull/43442]. was: In the spark.python.pyspark.sql.functions module, at the {{date_format}} method's doctest, there is a typo in the year format. Instead of {{'MM/dd/yyy'}}, it should be {{'MM/dd/yyyy'}}, as the expected output {{[Row(date='04/08/2015')]}} indicates the format {{'MM/dd/yyyy'}}. {code:java} df = spark.createDataFrame([('2015-04-08',)], ['dt']) df.select(date_format('dt', 'MM/dd/yyy').alias('date')).collect() [Row(date='04/08/2015')] {code} As a solution, I proposed the PR [https://github.com/apache/spark/pull/43442]. > spark.python.pyspark.sql.functions Typo at date_format Function > --- > > Key: SPARK-45611 > URL: https://issues.apache.org/jira/browse/SPARK-45611 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 3.5.0 >Reporter: Mete Can Akar >Priority: Minor > > In the spark.python.pyspark.sql.functions module, at the {{date_format}} > method's doctest, there is a typo in the year format. > Instead of {{'MM/dd/yyy'}}, it should be {{'MM/dd/yyyy'}} as the expected > output {{[Row(date='04/08/2015')]}} indicates the format {{'MM/dd/yyyy'}}. > > {code:java} > df = spark.createDataFrame([('2015-04-08',)], ['dt']) > df.select(date_format('dt', 'MM/dd/yyy').alias('date')).collect() > [Row(date='04/08/2015')] > {code} > > As a solution, I proposed the PR [https://github.com/apache/spark/pull/43442]. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-45611) spark.python.pyspark.sql.functions Typo at date_format Function
Mete Can Akar created SPARK-45611: - Summary: spark.python.pyspark.sql.functions Typo at date_format Function Key: SPARK-45611 URL: https://issues.apache.org/jira/browse/SPARK-45611 Project: Spark Issue Type: Bug Components: PySpark Affects Versions: 3.5.0 Reporter: Mete Can Akar In the spark.python.pyspark.sql.functions module, at the {{date_format}} method's doctest, there is a typo in the year format. Instead of {{'MM/dd/yyy'}}, it should be {{'MM/dd/yyyy'}}, as the expected output {{[Row(date='04/08/2015')]}} indicates the format {{'MM/dd/yyyy'}}. As a solution, I proposed the PR https://github.com/apache/spark/pull/43442. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
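For reference, a minimal sketch of the corrected pattern, written against the Scala API rather than the PySpark doctest; the local SparkSession setup and column name are illustrative only:
{code:scala}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.date_format

val spark = SparkSession.builder().master("local[*]").getOrCreate()
import spark.implicits._

val df = Seq("2015-04-08").toDF("dt")

// 'yyyy' (four letters) matches the four-digit year in the expected output
// [Row(date='04/08/2015')]; the doctest's 'MM/dd/yyy' is the typo being fixed.
df.select(date_format($"dt", "MM/dd/yyyy").as("date")).show()
// +----------+
// |      date|
// +----------+
// |04/08/2015|
// +----------+
{code}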
[jira] [Commented] (SPARK-45610) Handle "Auto-application to `()` is deprecated."
[ https://issues.apache.org/jira/browse/SPARK-45610?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=1405#comment-1405 ] Sean R. Owen commented on SPARK-45610: -- I think it's better to make big changes at major version boundaries. I'd expect we support Scala 3 at some point for Spark 4.x. Therefore I think it'd be OK to proceed with these changes now for 4.0. > Handle "Auto-application to `()` is deprecated." > > > Key: SPARK-45610 > URL: https://issues.apache.org/jira/browse/SPARK-45610 > Project: Spark > Issue Type: Sub-task > Components: GraphX, MLlib, Spark Core, SQL, Structured Streaming >Affects Versions: 4.0.0 >Reporter: Yang Jie >Priority: Major > > For the following case, a compile warning will be issued in Scala 2.13: > > {code:java} > Welcome to Scala 2.13.12 (OpenJDK 64-Bit Server VM, Java 17.0.8). > Type in expressions for evaluation. Or try :help. > scala> class Foo { > | def isEmpty(): Boolean = true > | def isTrue(x: Boolean): Boolean = x > | } > class Foo > scala> val foo = new Foo > val foo: Foo = Foo@7061622 > scala> val ret = foo.isEmpty > ^ > warning: Auto-application to `()` is deprecated. Supply the empty > argument list `()` explicitly to invoke method isEmpty, > or remove the empty argument list from its definition (Java-defined > methods are exempt). > In Scala 3, an unapplied method like this will be eta-expanded into a > function. [quickfixable] > val ret: Boolean = true {code} > But for Scala 3, it is a compile error: > {code:java} > Welcome to Scala 3.3.1 (17.0.8, Java OpenJDK 64-Bit Server VM). > Type in expressions for evaluation. Or try :help. > > > > > scala> class Foo { > | def isEmpty(): Boolean = true > | def isTrue(x: Boolean): Boolean = x > | } > // defined class Foo > > > > > scala> val foo = new Foo > val foo: Foo = Foo@591f6f83 > > > > > scala> val ret = foo.isEmpty > -- [E100] Syntax Error: > > 1 |val ret = foo.isEmpty > | ^^^ > | method isEmpty in class Foo must be called with () argument > | > | longer explanation available when compiling with `-explain` > 1 error found {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
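The mechanical fix implied by the ticket is simply to supply the empty argument list at each call site; a tiny sketch (not an actual Spark diff) using the example from the description:
{code:scala}
class Foo {
  def isEmpty(): Boolean = true
  def isTrue(x: Boolean): Boolean = x
}

val foo = new Foo

// Deprecated in Scala 2.13 and a compile error in Scala 3:
//   val ret = foo.isEmpty
// Fixed by invoking the method with its declared empty argument list:
val ret = foo.isEmpty()
{code}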
[jira] [Commented] (SPARK-45610) Handle "Auto-application to `()` is deprecated."
[ https://issues.apache.org/jira/browse/SPARK-45610?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=1396#comment-1396 ] Yang Jie commented on SPARK-45610: -- In Spark, this involves a massive amount of cases. Since this is a compile error for Scala 3, it seems that we will have to fix this when we prepare to support Scala 3. As the plan to support Scala 3 is not clear at the moment, should we wait until the schedule for supporting Scala 3 is confirmed before we proceed with the fix? I would like to know your thoughts. [~srowen] [~dongjoon] [~gurwls223] > Handle "Auto-application to `()` is deprecated." > > > Key: SPARK-45610 > URL: https://issues.apache.org/jira/browse/SPARK-45610 > Project: Spark > Issue Type: Sub-task > Components: GraphX, MLlib, Spark Core, SQL, Structured Streaming >Affects Versions: 4.0.0 >Reporter: Yang Jie >Priority: Major > > For the following case, a compile warning will be issued in Scala 2.13: > > {code:java} > Welcome to Scala 2.13.12 (OpenJDK 64-Bit Server VM, Java 17.0.8). > Type in expressions for evaluation. Or try :help. > scala> class Foo { > | def isEmpty(): Boolean = true > | def isTrue(x: Boolean): Boolean = x > | } > class Foo > scala> val foo = new Foo > val foo: Foo = Foo@7061622 > scala> val ret = foo.isEmpty > ^ > warning: Auto-application to `()` is deprecated. Supply the empty > argument list `()` explicitly to invoke method isEmpty, > or remove the empty argument list from its definition (Java-defined > methods are exempt). > In Scala 3, an unapplied method like this will be eta-expanded into a > function. [quickfixable] > val ret: Boolean = true {code} > But for Scala 3, it is a compile error: > {code:java} > Welcome to Scala 3.3.1 (17.0.8, Java OpenJDK 64-Bit Server VM). > Type in expressions for evaluation. Or try :help. > > > > > scala> class Foo { > | def isEmpty(): Boolean = true > | def isTrue(x: Boolean): Boolean = x > | } > // defined class Foo > > > > > scala> val foo = new Foo > val foo: Foo = Foo@591f6f83 > > > > > scala> val ret = foo.isEmpty > -- [E100] Syntax Error: > > 1 |val ret = foo.isEmpty > | ^^^ > | method isEmpty in class Foo must be called with () argument > | > | longer explanation available when compiling with `-explain` > 1 error found {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-45610) Handle "Auto-application to `()` is deprecated."
[ https://issues.apache.org/jira/browse/SPARK-45610?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yang Jie updated SPARK-45610: - Description: For the following case, a compile warning will be issued in Scala 2.13: {code:java} Welcome to Scala 2.13.12 (OpenJDK 64-Bit Server VM, Java 17.0.8). Type in expressions for evaluation. Or try :help. scala> class Foo { | def isEmpty(): Boolean = true | def isTrue(x: Boolean): Boolean = x | } class Foo scala> val foo = new Foo val foo: Foo = Foo@7061622 scala> val ret = foo.isEmpty ^ warning: Auto-application to `()` is deprecated. Supply the empty argument list `()` explicitly to invoke method isEmpty, or remove the empty argument list from its definition (Java-defined methods are exempt). In Scala 3, an unapplied method like this will be eta-expanded into a function. [quickfixable] val ret: Boolean = true {code} But for Scala 3, it is a compile error: {code:java} Welcome to Scala 3.3.1 (17.0.8, Java OpenJDK 64-Bit Server VM). Type in expressions for evaluation. Or try :help. scala> class Foo { | def isEmpty(): Boolean = true | def isTrue(x: Boolean): Boolean = x | } // defined class Foo scala> val foo = new Foo val foo: Foo = Foo@591f6f83 scala> val ret = foo.isEmpty -- [E100] Syntax Error: 1 |val ret = foo.isEmpty | ^^^ | method isEmpty in class Foo must be called with () argument | | longer explanation available when compiling with `-explain` 1 error found {code} > Handle "Auto-application to `()` is deprecated." > > > Key: SPARK-45610 > URL: https://issues.apache.org/jira/browse/SPARK-45610 > Project: Spark > Issue Type: Sub-task > Components: GraphX, MLlib, Spark Core, SQL, Structured Streaming >Affects Versions: 4.0.0 >Reporter: Yang Jie >Priority: Major > > For the following case, a compile warning will be issued in Scala 2.13: > > {code:java} > Welcome to Scala 2.13.12 (OpenJDK 64-Bit Server VM, Java 17.0.8). > Type in expressions for evaluation. Or try :help. > scala> class Foo { > | def isEmpty(): Boolean = true > | def isTrue(x: Boolean): Boolean = x > | } > class Foo > scala> val foo = new Foo > val foo: Foo = Foo@7061622 > scala> val ret = foo.isEmpty > ^ > warning: Auto-application to `()` is deprecated. Supply the empty > argument list `()` explicitly to invoke method isEmpty, > or remove the empty argument list from its definition (Java-defined > methods are exempt). > In Scala 3, an unapplied method like this will be eta-expanded into a > function. [quickfixable] > val ret: Boolean = true {code} > But for Scala 3, it is a compile error: > {code:java} > Welcome to Scala 3.3.1 (17.0.8, Java OpenJDK 64-Bit Server VM). > Type in expressions for evaluation. Or try :help. > > > > > scala> class Foo { > | def isEmpty(): Boolean = true > | def isTrue(x: Boolean): Boolean = x > | } > // defined class Foo > > > > > scala> val foo = new Foo > val foo: Foo = Foo@591f6f83 > > > > > scala> val ret = foo.isEmpty > -- [E100] Syntax Error: > > 1 |val ret =
[jira] [Created] (SPARK-45610) Handle "Auto-application to `()` is deprecated."
Yang Jie created SPARK-45610: Summary: Handle "Auto-application to `()` is deprecated." Key: SPARK-45610 URL: https://issues.apache.org/jira/browse/SPARK-45610 Project: Spark Issue Type: Sub-task Components: GraphX, MLlib, Spark Core, SQL, Structured Streaming Affects Versions: 4.0.0 Reporter: Yang Jie -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-25078) Standalone does not work with spark.authenticate.secret and deploy-mode=cluster
[ https://issues.apache.org/jira/browse/SPARK-25078?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=1377#comment-1377 ] Yaroslav commented on SPARK-25078: -- Hi, this issue is still reproducible. In SPARK-8129 they changed the way Worker sends "spark.authenticate.secret" value to Driver from Java options to environment variable to be more secure (because other processes can freely view this java option while only the process owner can see its environment variables). So the sender should [add|https://github.com/apache/spark/blob/v3.5.0/core/src/main/scala/org/apache/spark/deploy/worker/CommandUtils.scala#L89-L92] the value to environment and the receiver should take it from there, not from spark config. They have created this universal method [getSecretKey |https://github.com/apache/spark/blob/v3.5.0/core/src/main/scala/org/apache/spark/SecurityManager.scala#L282-L307]which can get the value either from config or from env. But for some reason inside initializeAuth() they still [search|https://github.com/apache/spark/blob/v3.5.0/core/src/main/scala/org/apache/spark/SecurityManager.scala#L337] this key in spark config, which fails and throws such error. Doing such change would fix that and I suppose getSecretKey method was created exactly for such kind of use: {code:java} - require(sparkConf.contains(SPARK_AUTH_SECRET_CONF), + require(getSecretKey() != null, {code} I guess it won't affect anything since even if key is in the config and not in the environment, this method will still try to search there and return the value. Whilst searching only in config does not cover all cases. So [~irashid] , [~maropu] could you please review status of this issue since it's Marked as Resolved (Incomplete) while the error is still easily reproducible and easily fixable as well? Thanks! > Standalone does not work with spark.authenticate.secret and > deploy-mode=cluster > --- > > Key: SPARK-25078 > URL: https://issues.apache.org/jira/browse/SPARK-25078 > Project: Spark > Issue Type: Bug > Components: Deploy >Affects Versions: 2.4.0 >Reporter: Imran Rashid >Priority: Major > Labels: bulk-closed > > When running a spark standalone cluster with spark.authenticate.secret setup, > you cannot submit a program in cluster mode, even with the right secret. The > driver fails with: > {noformat} > 18/08/09 08:17:21 INFO SecurityManager: SecurityManager: authentication > enabled; ui acls disabled; users with view permissions: Set(systest); groups > with view permissions: Set(); users with modify permissions: Set(systest); > groups with modify permissions: Set() > 18/08/09 08:17:21 ERROR SparkContext: Error initializing SparkContext. > java.lang.IllegalArgumentException: requirement failed: A secret key must be > specified via the spark.authenticate.secret config. > at scala.Predef$.require(Predef.scala:224) > at > org.apache.spark.SecurityManager.initializeAuth(SecurityManager.scala:361) > at org.apache.spark.SparkEnv$.create(SparkEnv.scala:238) > at org.apache.spark.SparkEnv$.createDriverEnv(SparkEnv.scala:175) > at > org.apache.spark.SparkContext.createSparkEnv(SparkContext.scala:257) > at org.apache.spark.SparkContext.(SparkContext.scala:424) > ... > {noformat} > but its actually doing the wrong check in > {{SecurityManager.initializeAuth()}}. The secret is there, its just in an > environment variable {{_SPARK_AUTH_SECRET}} (so its not visible to another > process). > *Workaround*: In your program, you can pass in a dummy secret to your spark > conf. 
It doesn't matter what it is at all, later it'll be ignored and when > establishing connections, the secret from the env variable will be used. Eg. > {noformat} > val conf = new SparkConf() > conf.setIfMissing("spark.authenticate.secret", "doesn't matter") > val sc = new SparkContext(conf) > {noformat} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
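A minimal sketch of the env-then-config lookup described in the comment above; this is illustrative only, not Spark's actual SecurityManager code, and the helper name is hypothetical (the env variable and config key come from the comment):
{code:scala}
// Hypothetical helper mirroring what getSecretKey() is described to do:
// prefer the secret the Worker injected via the environment, and fall back
// to the spark.authenticate.secret config entry only if the env var is absent.
def resolveAuthSecret(conf: Map[String, String], env: Map[String, String]): Option[String] =
  env.get("_SPARK_AUTH_SECRET").orElse(conf.get("spark.authenticate.secret"))

// The check in initializeAuth() would then assert on the resolved value rather
// than on the config alone, e.g.:
//   require(resolveAuthSecret(conf, sys.env).isDefined,
//     "A secret key must be specified via the spark.authenticate.secret config.")
{code}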
[jira] [Updated] (SPARK-45609) Include SqlState in SparkThrowable proto message
[ https://issues.apache.org/jira/browse/SPARK-45609?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated SPARK-45609: --- Labels: pull-request-available (was: ) > Include SqlState in SparkThrowable proto message > > > Key: SPARK-45609 > URL: https://issues.apache.org/jira/browse/SPARK-45609 > Project: Spark > Issue Type: Test > Components: Connect >Affects Versions: 4.0.0 >Reporter: Yihong He >Priority: Major > Labels: pull-request-available > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-45609) Include SqlState in SparkThrowable proto message
Yihong He created SPARK-45609: - Summary: Include SqlState in SparkThrowable proto message Key: SPARK-45609 URL: https://issues.apache.org/jira/browse/SPARK-45609 Project: Spark Issue Type: Test Components: Connect Affects Versions: 4.0.0 Reporter: Yihong He -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-45598) Delta table 3.0.0 not working with Spark Connect 3.5.0
[ https://issues.apache.org/jira/browse/SPARK-45598?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=1246#comment-1246 ] Faiz Halde edited comment on SPARK-45598 at 10/19/23 4:04 PM: -- Hi [~sdaberdaku] , corrected the title. I tested it with 3.0.0 delta. What I meant was, delta table does not work with {*}spark connect{*}. It does work with vanilla spark 3.5.0 otherwise was (Author: JIRAUSER300204): Hi [~sdaberdaku] , corrected the title. I tested it with 3.0.0 delta. What I meant was, delta table does not work with {*}spark connect{*}. It does work with spark 3.5.0 otherwise > Delta table 3.0.0 not working with Spark Connect 3.5.0 > -- > > Key: SPARK-45598 > URL: https://issues.apache.org/jira/browse/SPARK-45598 > Project: Spark > Issue Type: Bug > Components: Connect >Affects Versions: 3.5.0 >Reporter: Faiz Halde >Priority: Major > > Spark version 3.5.0 > Spark Connect version 3.5.0 > Delta table 3.0-rc2 > Spark connect server was started using > *{{./sbin/start-connect-server.sh --master spark://localhost:7077 --packages > org.apache.spark:spark-connect_2.12:3.5.0,io.delta:delta-spark_2.12:3.0.0rc2 > --conf "spark.sql.extensions=io.delta.sql.DeltaSparkSessionExtension" --conf > "spark.sql.catalog.spark_catalog=org.apache.spark.sql.delta.catalog.DeltaCatalog" > --conf > 'spark.jars.repositories=[https://oss.sonatype.org/content/repositories/iodelta-1120']}}* > {{Connect client depends on}} > *libraryDependencies += "io.delta" %% "delta-spark" % "3.0.0rc2"* > *and the connect libraries* > > When trying to run a simple job that writes to a delta table > {{val spark = SparkSession.builder().remote("sc://localhost").getOrCreate()}} > {{val data = spark.read.json("profiles.json")}} > {{data.write.format("delta").save("/tmp/delta")}} > > {{Error log in connect client}} > {{Exception in thread "main" org.apache.spark.SparkException: > io.grpc.StatusRuntimeException: INTERNAL: Job aborted due to stage failure: > Task 0 in stage 1.0 failed 4 times, most recent failure: Lost task 0.3 in > stage 1.0 (TID 4) (172.23.128.15 executor 0): java.lang.ClassCastException: > cannot assign instance of java.lang.invoke.SerializedLambda to field > org.apache.spark.sql.catalyst.expressions.ScalaUDF.f of type scala.Function1 > in instance of org.apache.spark.sql.catalyst.expressions.ScalaUDF}} > {{ at > java.io.ObjectStreamClass$FieldReflector.setObjFieldValues(ObjectStreamClass.java:2301)}} > {{ at > java.io.ObjectStreamClass.setObjFieldValues(ObjectStreamClass.java:1431)}} > {{ at > java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2437)}} > {{ at > java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2355)}} > {{ at > java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2213)}} > {{ at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1669)}} > {{ at java.io.ObjectInputStream.readArray(ObjectInputStream.java:2119)}} > {{ at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1657)}} > {{ at > java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2431)}} > {{ at > java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2355)}} > {{ at > java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2213)}} > {{ at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1669)}} > {{ at > java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2431)}} > {{ at > java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2355)}} > {{ at > 
java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2213)}} > {{ at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1669)}} > {{ at java.io.ObjectInputStream.readArray(ObjectInputStream.java:2119)}} > {{ at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1657)}} > {{ at > java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2431)}} > {{ at > java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2355)}} > {{ at > java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2213)}} > {{ at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1669)}} > {{...}} > {{ at > org.apache.spark.sql.connect.client.GrpcExceptionConverter$.toThrowable(GrpcExceptionConverter.scala:110)}} > {{ at > org.apache.spark.sql.connect.client.GrpcExceptionConverter$.convert(GrpcExceptionConverter.scala:41)}} > {{ at > org.apache.spark.sql.connect.client.GrpcExceptionConverter$$anon$1.hasNext(GrpcExceptionConverter.scala:49)}} > {{ at scala.collection.Iterator.foreach(Iterator.scala:943)}} > {{ at scala.collection.Iterator.foreach$(Iterator.scala:943)}} > {{ at >
[jira] [Updated] (SPARK-45368) Remove scala2.12 compatibility logic for DoubleType, FloatType, Decimal
[ https://issues.apache.org/jira/browse/SPARK-45368?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated SPARK-45368: --- Labels: pull-request-available (was: ) > Remove scala2.12 compatibility logic for DoubleType, FloatType, Decimal > --- > > Key: SPARK-45368 > URL: https://issues.apache.org/jira/browse/SPARK-45368 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 4.0.0 >Reporter: BingKun Pan >Priority: Minor > Labels: pull-request-available > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-45601) stackoverflow when executing rule ExtractWindowExpressions
[ https://issues.apache.org/jira/browse/SPARK-45601?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=1304#comment-1304 ] Bruce Robbins commented on SPARK-45601: --- Possibly SPARK-38666 > stackoverflow when executing rule ExtractWindowExpressions > -- > > Key: SPARK-45601 > URL: https://issues.apache.org/jira/browse/SPARK-45601 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.2.3 >Reporter: JacobZheng >Priority: Major > > I am encountering stack overflow errors while executing the following test > case. Looking at the source code, ExtractWindowExpressions does not extract > the window correctly, and the plan gets stuck in an endless loop at > resolveOperatorsDownWithPruning. > {code:scala} > test("agg filter contains window") { > val src = Seq((1, "b", "c")).toDF("col1", "col2", "col3") > .withColumn("test", > expr("count(col1) filter (where min(col1) over(partition by col2 > order by col3)>1)")) > src.show() > } > {code} > Now my question is: is a window function inside an aggregate filter valid > usage at all? Or should I add a check, similar to what Spark SQL does for WHERE > clauses, and throw an error such as "It is not allowed to use window functions > inside WHERE clause"? -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
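Until the analyzer handles (or explicitly rejects) this pattern, one way to express the same intent is to materialize the window expression as an ordinary column first and reference it from the aggregate. A hedged sketch of that rewrite follows; it assumes an existing SparkSession named spark and reuses the column names from the test case above:
{code:scala}
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, count, min, when}

import spark.implicits._  // assumes a SparkSession named `spark` is in scope

val src = Seq((1, "b", "c")).toDF("col1", "col2", "col3")
val w = Window.partitionBy("col2").orderBy("col3")

// Compute the window expression as a regular column, then aggregate.
// count(when(cond, x)) counts only the rows where cond holds, which mirrors
// the FILTER (WHERE ...) semantics of the original expression.
val result = src
  .withColumn("min_col1_over_w", min(col("col1")).over(w))
  .agg(count(when(col("min_col1_over_w") > 1, col("col1"))).as("test"))
{code}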
[jira] [Created] (SPARK-45608) Migrate SchemaColumnConvertNotSupportedException onto DATATYPE_MISMATCH error classes
Zamil Majdy created SPARK-45608: --- Summary: Migrate SchemaColumnConvertNotSupportedException onto DATATYPE_MISMATCH error classes Key: SPARK-45608 URL: https://issues.apache.org/jira/browse/SPARK-45608 Project: Spark Issue Type: Sub-task Components: Spark Core Affects Versions: 3.5.0 Reporter: Zamil Majdy SchemaColumnConvertNotSupportedException is not currently part of SparkThrowable. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-45569) Assign name to _LEGACY_ERROR_TEMP_2152
[ https://issues.apache.org/jira/browse/SPARK-45569?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Deng Ziming updated SPARK-45569: Description: Choose a proper name for the error class *_LEGACY_ERROR_TEMP_2152* defined in {*}core/src/main/resources/error/error-classes.json{*}. The name should be short but complete (look at the example in error-classes.json). Add a test which triggers the error from user code if such test still doesn't exist. Check exception fields by using {*}checkError(){*}. The last function checks valuable error fields only, and avoids dependencies from error text message. In this way, tech editors can modify error format in error-classes.json, and don't worry of Spark's internal tests. Migrate other tests that might trigger the error onto checkError(). If you cannot reproduce the error from user space (using SQL query), replace the error by an internal error, see {*}SparkException.internalError(){*}. Improve the error message format in error-classes.json if the current is not clear. Propose a solution to users how to avoid and fix such kind of errors. Please, look at the PR below as examples: * [https://github.com/apache/spark/pull/38685] * [https://github.com/apache/spark/pull/38656] * [https://github.com/apache/spark/pull/38490] was: in DatasetSuite test("CLASS_UNSUPPORTED_BY_MAP_OBJECTS when creating dataset") , we are using _LEGACY_ERROR_TEMP_2151, We should use proper error class name rather than `_LEGACY_ERROR_TEMP_xxx`. *NOTE:* Please reply to this ticket before start working on it, to avoid working on same ticket at a time > Assign name to _LEGACY_ERROR_TEMP_2152 > -- > > Key: SPARK-45569 > URL: https://issues.apache.org/jira/browse/SPARK-45569 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.5.0 >Reporter: Deng Ziming >Assignee: Deng Ziming >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0 > > > Choose a proper name for the error class *_LEGACY_ERROR_TEMP_2152* defined in > {*}core/src/main/resources/error/error-classes.json{*}. The name should be > short but complete (look at the example in error-classes.json). > Add a test which triggers the error from user code if such test still doesn't > exist. Check exception fields by using {*}checkError(){*}. The last function > checks valuable error fields only, and avoids dependencies from error text > message. In this way, tech editors can modify error format in > error-classes.json, and don't worry of Spark's internal tests. Migrate other > tests that might trigger the error onto checkError(). > If you cannot reproduce the error from user space (using SQL query), replace > the error by an internal error, see {*}SparkException.internalError(){*}. > Improve the error message format in error-classes.json if the current is not > clear. Propose a solution to users how to avoid and fix such kind of errors. > Please, look at the PR below as examples: > * [https://github.com/apache/spark/pull/38685] > * [https://github.com/apache/spark/pull/38656] > * [https://github.com/apache/spark/pull/38490] -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-45573) Assign name to _LEGACY_ERROR_TEMP_2153
[ https://issues.apache.org/jira/browse/SPARK-45573?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Deng Ziming updated SPARK-45573: Description: Choose a proper name for the error class *_LEGACY_ERROR_TEMP_2153* defined in {*}core/src/main/resources/error/error-classes.json{*}. The name should be short but complete (look at the example in error-classes.json). Add a test which triggers the error from user code if such test still doesn't exist. Check exception fields by using {*}checkError(){*}. The last function checks valuable error fields only, and avoids dependencies from error text message. In this way, tech editors can modify error format in error-classes.json, and don't worry of Spark's internal tests. Migrate other tests that might trigger the error onto checkError(). If you cannot reproduce the error from user space (using SQL query), replace the error by an internal error, see {*}SparkException.internalError(){*}. Improve the error message format in error-classes.json if the current is not clear. Propose a solution to users how to avoid and fix such kind of errors. Please, look at the PR below as examples: * [https://github.com/apache/spark/pull/38685] * [https://github.com/apache/spark/pull/38656] * [https://github.com/apache/spark/pull/38490] was: in DatasetSuite test("CLASS_UNSUPPORTED_BY_MAP_OBJECTS when creating dataset") , we are using _LEGACY_ERROR_TEMP_2151, We should use proper error class name rather than `_LEGACY_ERROR_TEMP_xxx`. *NOTE:* Please reply to this ticket before start working on it, to avoid working on same ticket at a time > Assign name to _LEGACY_ERROR_TEMP_2153 > -- > > Key: SPARK-45573 > URL: https://issues.apache.org/jira/browse/SPARK-45573 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.5.0 >Reporter: Deng Ziming >Assignee: Deng Ziming >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0 > > > Choose a proper name for the error class *_LEGACY_ERROR_TEMP_2153* defined in > {*}core/src/main/resources/error/error-classes.json{*}. The name should be > short but complete (look at the example in error-classes.json). > Add a test which triggers the error from user code if such test still doesn't > exist. Check exception fields by using {*}checkError(){*}. The last function > checks valuable error fields only, and avoids dependencies from error text > message. In this way, tech editors can modify error format in > error-classes.json, and don't worry of Spark's internal tests. Migrate other > tests that might trigger the error onto checkError(). > If you cannot reproduce the error from user space (using SQL query), replace > the error by an internal error, see {*}SparkException.internalError(){*}. > Improve the error message format in error-classes.json if the current is not > clear. Propose a solution to users how to avoid and fix such kind of errors. > Please, look at the PR below as examples: > * [https://github.com/apache/spark/pull/38685] > * [https://github.com/apache/spark/pull/38656] > * [https://github.com/apache/spark/pull/38490] -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
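For anyone picking up these tickets, the shape of a {{checkError()}} assertion in Spark's test suites looks roughly like the sketch below; the intercepted exception type, the triggering query, the error class name, and the parameters are all placeholders, not the names actually chosen for these legacy error classes:
{code:scala}
// Inside a suite that extends SparkFunSuite (which provides checkError):
val e = intercept[org.apache.spark.SparkRuntimeException] {
  // placeholder: whatever user-facing query or API call triggers the error
  spark.sql("SELECT ...").collect()
}
checkError(
  exception = e,
  errorClass = "SOME_DESCRIPTIVE_NAME",            // placeholder error class
  parameters = Map("objectName" -> "`someObject`") // placeholder parameters
)
{code}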
[jira] [Commented] (SPARK-45598) Delta table 3.0.0 not working with Spark Connect 3.5.0
[ https://issues.apache.org/jira/browse/SPARK-45598?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=1246#comment-1246 ] Faiz Halde commented on SPARK-45598: Hi [~sdaberdaku] , corrected the title. I tested it with 3.0.0 delta. What I meant was, delta table does not work with {*}spark connect{*}. It does work with spark 3.5.0 otherwise > Delta table 3.0.0 not working with Spark Connect 3.5.0 > -- > > Key: SPARK-45598 > URL: https://issues.apache.org/jira/browse/SPARK-45598 > Project: Spark > Issue Type: Bug > Components: Connect >Affects Versions: 3.5.0 >Reporter: Faiz Halde >Priority: Major > > Spark version 3.5.0 > Spark Connect version 3.5.0 > Delta table 3.0-rc2 > Spark connect server was started using > *{{./sbin/start-connect-server.sh --master spark://localhost:7077 --packages > org.apache.spark:spark-connect_2.12:3.5.0,io.delta:delta-spark_2.12:3.0.0rc2 > --conf "spark.sql.extensions=io.delta.sql.DeltaSparkSessionExtension" --conf > "spark.sql.catalog.spark_catalog=org.apache.spark.sql.delta.catalog.DeltaCatalog" > --conf > 'spark.jars.repositories=[https://oss.sonatype.org/content/repositories/iodelta-1120']}}* > {{Connect client depends on}} > *libraryDependencies += "io.delta" %% "delta-spark" % "3.0.0rc2"* > *and the connect libraries* > > When trying to run a simple job that writes to a delta table > {{val spark = SparkSession.builder().remote("sc://localhost").getOrCreate()}} > {{val data = spark.read.json("profiles.json")}} > {{data.write.format("delta").save("/tmp/delta")}} > > {{Error log in connect client}} > {{Exception in thread "main" org.apache.spark.SparkException: > io.grpc.StatusRuntimeException: INTERNAL: Job aborted due to stage failure: > Task 0 in stage 1.0 failed 4 times, most recent failure: Lost task 0.3 in > stage 1.0 (TID 4) (172.23.128.15 executor 0): java.lang.ClassCastException: > cannot assign instance of java.lang.invoke.SerializedLambda to field > org.apache.spark.sql.catalyst.expressions.ScalaUDF.f of type scala.Function1 > in instance of org.apache.spark.sql.catalyst.expressions.ScalaUDF}} > {{ at > java.io.ObjectStreamClass$FieldReflector.setObjFieldValues(ObjectStreamClass.java:2301)}} > {{ at > java.io.ObjectStreamClass.setObjFieldValues(ObjectStreamClass.java:1431)}} > {{ at > java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2437)}} > {{ at > java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2355)}} > {{ at > java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2213)}} > {{ at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1669)}} > {{ at java.io.ObjectInputStream.readArray(ObjectInputStream.java:2119)}} > {{ at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1657)}} > {{ at > java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2431)}} > {{ at > java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2355)}} > {{ at > java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2213)}} > {{ at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1669)}} > {{ at > java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2431)}} > {{ at > java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2355)}} > {{ at > java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2213)}} > {{ at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1669)}} > {{ at java.io.ObjectInputStream.readArray(ObjectInputStream.java:2119)}} > {{ at 
java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1657)}} > {{ at > java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2431)}} > {{ at > java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2355)}} > {{ at > java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2213)}} > {{ at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1669)}} > {{...}} > {{ at > org.apache.spark.sql.connect.client.GrpcExceptionConverter$.toThrowable(GrpcExceptionConverter.scala:110)}} > {{ at > org.apache.spark.sql.connect.client.GrpcExceptionConverter$.convert(GrpcExceptionConverter.scala:41)}} > {{ at > org.apache.spark.sql.connect.client.GrpcExceptionConverter$$anon$1.hasNext(GrpcExceptionConverter.scala:49)}} > {{ at scala.collection.Iterator.foreach(Iterator.scala:943)}} > {{ at scala.collection.Iterator.foreach$(Iterator.scala:943)}} > {{ at > org.apache.spark.sql.connect.client.GrpcExceptionConverter$$anon$1.foreach(GrpcExceptionConverter.scala:46)}} > {{ at scala.collection.generic.Growable.$plus$plus$eq(Growable.scala:62)}} > {{ at scala.collection.generic.Growable.$plus$plus$eq$(Growable.scala:53)}} > {{ at
[jira] [Updated] (SPARK-45598) Delta table 3.0.0 not working with Spark Connect 3.5.0
[ https://issues.apache.org/jira/browse/SPARK-45598?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Faiz Halde updated SPARK-45598: --- Summary: Delta table 3.0.0 not working with Spark Connect 3.5.0 (was: Delta table 3.0-rc2 not working with Spark Connect 3.5.0) > Delta table 3.0.0 not working with Spark Connect 3.5.0 > -- > > Key: SPARK-45598 > URL: https://issues.apache.org/jira/browse/SPARK-45598 > Project: Spark > Issue Type: Bug > Components: Connect >Affects Versions: 3.5.0 >Reporter: Faiz Halde >Priority: Major > > Spark version 3.5.0 > Spark Connect version 3.5.0 > Delta table 3.0-rc2 > Spark connect server was started using > *{{./sbin/start-connect-server.sh --master spark://localhost:7077 --packages > org.apache.spark:spark-connect_2.12:3.5.0,io.delta:delta-spark_2.12:3.0.0rc2 > --conf "spark.sql.extensions=io.delta.sql.DeltaSparkSessionExtension" --conf > "spark.sql.catalog.spark_catalog=org.apache.spark.sql.delta.catalog.DeltaCatalog" > --conf > 'spark.jars.repositories=[https://oss.sonatype.org/content/repositories/iodelta-1120']}}* > {{Connect client depends on}} > *libraryDependencies += "io.delta" %% "delta-spark" % "3.0.0rc2"* > *and the connect libraries* > > When trying to run a simple job that writes to a delta table > {{val spark = SparkSession.builder().remote("sc://localhost").getOrCreate()}} > {{val data = spark.read.json("profiles.json")}} > {{data.write.format("delta").save("/tmp/delta")}} > > {{Error log in connect client}} > {{Exception in thread "main" org.apache.spark.SparkException: > io.grpc.StatusRuntimeException: INTERNAL: Job aborted due to stage failure: > Task 0 in stage 1.0 failed 4 times, most recent failure: Lost task 0.3 in > stage 1.0 (TID 4) (172.23.128.15 executor 0): java.lang.ClassCastException: > cannot assign instance of java.lang.invoke.SerializedLambda to field > org.apache.spark.sql.catalyst.expressions.ScalaUDF.f of type scala.Function1 > in instance of org.apache.spark.sql.catalyst.expressions.ScalaUDF}} > {{ at > java.io.ObjectStreamClass$FieldReflector.setObjFieldValues(ObjectStreamClass.java:2301)}} > {{ at > java.io.ObjectStreamClass.setObjFieldValues(ObjectStreamClass.java:1431)}} > {{ at > java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2437)}} > {{ at > java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2355)}} > {{ at > java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2213)}} > {{ at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1669)}} > {{ at java.io.ObjectInputStream.readArray(ObjectInputStream.java:2119)}} > {{ at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1657)}} > {{ at > java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2431)}} > {{ at > java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2355)}} > {{ at > java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2213)}} > {{ at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1669)}} > {{ at > java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2431)}} > {{ at > java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2355)}} > {{ at > java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2213)}} > {{ at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1669)}} > {{ at java.io.ObjectInputStream.readArray(ObjectInputStream.java:2119)}} > {{ at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1657)}} > {{ at > 
java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2431)}} > {{ at > java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2355)}} > {{ at > java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2213)}} > {{ at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1669)}} > {{...}} > {{ at > org.apache.spark.sql.connect.client.GrpcExceptionConverter$.toThrowable(GrpcExceptionConverter.scala:110)}} > {{ at > org.apache.spark.sql.connect.client.GrpcExceptionConverter$.convert(GrpcExceptionConverter.scala:41)}} > {{ at > org.apache.spark.sql.connect.client.GrpcExceptionConverter$$anon$1.hasNext(GrpcExceptionConverter.scala:49)}} > {{ at scala.collection.Iterator.foreach(Iterator.scala:943)}} > {{ at scala.collection.Iterator.foreach$(Iterator.scala:943)}} > {{ at > org.apache.spark.sql.connect.client.GrpcExceptionConverter$$anon$1.foreach(GrpcExceptionConverter.scala:46)}} > {{ at scala.collection.generic.Growable.$plus$plus$eq(Growable.scala:62)}} > {{ at scala.collection.generic.Growable.$plus$plus$eq$(Growable.scala:53)}} > {{ at > scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:105)}} > {{ at
[jira] [Resolved] (SPARK-45573) Assign name to _LEGACY_ERROR_TEMP_2153
[ https://issues.apache.org/jira/browse/SPARK-45573?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Max Gekk resolved SPARK-45573. -- Resolution: Fixed > Assign name to _LEGACY_ERROR_TEMP_2153 > -- > > Key: SPARK-45573 > URL: https://issues.apache.org/jira/browse/SPARK-45573 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.5.0 >Reporter: Deng Ziming >Assignee: Deng Ziming >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0 > > > in DatasetSuite test("CLASS_UNSUPPORTED_BY_MAP_OBJECTS when creating > dataset") , we are using _LEGACY_ERROR_TEMP_2151, We should use proper error > class name rather than `_LEGACY_ERROR_TEMP_xxx`. > > *NOTE:* Please reply to this ticket before start working on it, to avoid > working on same ticket at a time -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-45573) Assign name to _LEGACY_ERROR_TEMP_2153
[ https://issues.apache.org/jira/browse/SPARK-45573?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=1238#comment-1238 ] Max Gekk commented on SPARK-45573: -- Resolved by https://github.com/apache/spark/pull/43414 > Assign name to _LEGACY_ERROR_TEMP_2153 > -- > > Key: SPARK-45573 > URL: https://issues.apache.org/jira/browse/SPARK-45573 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.5.0 >Reporter: Deng Ziming >Assignee: Deng Ziming >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0 > > > in DatasetSuite test("CLASS_UNSUPPORTED_BY_MAP_OBJECTS when creating > dataset") , we are using _LEGACY_ERROR_TEMP_2151, We should use proper error > class name rather than `_LEGACY_ERROR_TEMP_xxx`. > > *NOTE:* Please reply to this ticket before start working on it, to avoid > working on same ticket at a time -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-45543) InferWindowGroupLimit causes bug if the other window functions haven't the same window frame as the rank-like functions
[ https://issues.apache.org/jira/browse/SPARK-45543?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jiaan Geng resolved SPARK-45543. Fix Version/s: 3.5.1 4.0.0 Resolution: Fixed Issue resolved by pull request 43385 [https://github.com/apache/spark/pull/43385] > InferWindowGroupLimit causes bug if the other window functions haven't the > same window frame as the rank-like functions > --- > > Key: SPARK-45543 > URL: https://issues.apache.org/jira/browse/SPARK-45543 > Project: Spark > Issue Type: Bug > Components: Optimizer, Spark Core, SQL >Affects Versions: 3.5.0 >Reporter: Ron Serruya >Assignee: Jiaan Geng >Priority: Critical > Labels: correctness, data-loss, pull-request-available > Fix For: 3.5.1, 4.0.0 > > > First, it's my first bug, so I'm hoping I'm doing it right, also, as I'm not > very knowledgeable about spark internals, I hope I diagnosed the problem > correctly > I found the degradation in spark version 3.5.0: > When using multiple windows that share the same partition and ordering (but > with different "frame boundaries", where one window is a ranking function, > "WindowGroupLimit" is added to the plan causing wrong values to be created > from the other windows. > *This behavior didn't exist in versions 3.3 and 3.4.* > Example: > > {code:python} > import pysparkfrom pyspark.sql import functions as F, Window > df = spark.createDataFrame([ > {'row_id': 1, 'name': 'Dave', 'score': 1, 'year': 2020}, > {'row_id': 1, 'name': 'Dave', 'score': 2, 'year': 2022}, > {'row_id': 1, 'name': 'Dave', 'score': 3, 'year': 2023}, > {'row_id': 2, 'name': 'Amy', 'score': 6, 'year': 2021}, > ]) > # Create first window for row number > window_spec = Window.partitionBy('row_id', 'name').orderBy(F.desc('year')) > # Create additional window from the first window with unbounded frame > unbound_spec = window_spec.rowsBetween(Window.unboundedPreceding, > Window.unboundedFollowing) > # Try to keep the first row by year, and also collect all scores into a list > df2 = df.withColumn( > 'rn', > F.row_number().over(window_spec) > ).withColumn( > 'all_scores', > F.collect_list('score').over(unbound_spec) > ){code} > So far everything works, and if we display df2: > > {noformat} > ++--+-++---+--+ > |name|row_id|score|year|rn |all_scores| > ++--+-++---+--+ > |Dave|1 |3|2023|1 |[3, 2, 1] | > |Dave|1 |2|2022|2 |[3, 2, 1] | > |Dave|1 |1|2020|3 |[3, 2, 1] | > |Amy |2 |6|2021|1 |[6] | > ++--+-++---+--+{noformat} > > However, once we filter to keep only the first row number: > > {noformat} > df2.filter("rn=1").show(truncate=False) > ++--+-++---+--+ > |name|row_id|score|year|rn |all_scores| > ++--+-++---+--+ > |Dave|1 |3|2023|1 |[3] | > |Amy |2 |6|2021|1 |[6] | > ++--+-++---+--+{noformat} > As you can see just filtering changed the "all_scores" array for Dave. 
> (This example uses `collect_list`, however, the same result happens with > other functions, such as max, min, count, etc) > > Now, if instead of using the two windows we used, I will use the first window > and a window with different ordering, or create a completely new window with > same partition but no ordering, it will work fine: > {code:python} > new_window = Window.partitionBy('row_id', > 'name').rowsBetween(Window.unboundedPreceding, Window.unboundedFollowing) > df3 = df.withColumn( > 'rn', > F.row_number().over(window_spec) > ).withColumn( > 'all_scores', > F.collect_list('score').over(new_window) > ) > df3.filter("rn=1").show(truncate=False){code} > {noformat} > ++--+-++---+--+ > |name|row_id|score|year|rn |all_scores| > ++--+-++---+--+ > |Dave|1 |3|2023|1 |[3, 2, 1] | > |Amy |2 |6|2021|1 |[6] | > ++--+-++---+--+ > {noformat} > In addition, if we use all 3 windows to create 3 different columns, it will > also work ok. So it seems the issue happens only when all the windows used > share the same partition and ordering. > Here is the final plan for the faulty dataframe: > {noformat} > df2.filter("rn=1").explain() > == Physical Plan == > AdaptiveSparkPlan isFinalPlan=false > +- Filter (rn#9 = 1) > +- Window [row_number() windowspecdefinition(row_id#1L, name#0, year#3L > DESC NULLS LAST, specifiedwindowframe(RowFrame, unboundedpreceding$(), > currentrow$())) AS rn#9, collect_list(score#2L, 0, 0) > windowspecdefinition(row_id#1L, name#0, year#3L DESC NULLS
[jira] [Commented] (SPARK-45598) Delta table 3.0-rc2 not working with Spark Connect 3.5.0
[ https://issues.apache.org/jira/browse/SPARK-45598?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=1172#comment-1172 ] Sebastian Daberdaku commented on SPARK-45598: - Hello [~haldefaiz], you need to use the latest delta-spark version 3.0.0 which came out just yesterday. It now supports delta with Spark 3.5.0. [https://github.com/delta-io/delta/releases/tag/v3.0.0] > Delta table 3.0-rc2 not working with Spark Connect 3.5.0 > > > Key: SPARK-45598 > URL: https://issues.apache.org/jira/browse/SPARK-45598 > Project: Spark > Issue Type: Bug > Components: Connect >Affects Versions: 3.5.0 >Reporter: Faiz Halde >Priority: Major > > Spark version 3.5.0 > Spark Connect version 3.5.0 > Delta table 3.0-rc2 > Spark connect server was started using > *{{./sbin/start-connect-server.sh --master spark://localhost:7077 --packages > org.apache.spark:spark-connect_2.12:3.5.0,io.delta:delta-spark_2.12:3.0.0rc2 > --conf "spark.sql.extensions=io.delta.sql.DeltaSparkSessionExtension" --conf > "spark.sql.catalog.spark_catalog=org.apache.spark.sql.delta.catalog.DeltaCatalog" > --conf > 'spark.jars.repositories=[https://oss.sonatype.org/content/repositories/iodelta-1120']}}* > {{Connect client depends on}} > *libraryDependencies += "io.delta" %% "delta-spark" % "3.0.0rc2"* > *and the connect libraries* > > When trying to run a simple job that writes to a delta table > {{val spark = SparkSession.builder().remote("sc://localhost").getOrCreate()}} > {{val data = spark.read.json("profiles.json")}} > {{data.write.format("delta").save("/tmp/delta")}} > > {{Error log in connect client}} > {{Exception in thread "main" org.apache.spark.SparkException: > io.grpc.StatusRuntimeException: INTERNAL: Job aborted due to stage failure: > Task 0 in stage 1.0 failed 4 times, most recent failure: Lost task 0.3 in > stage 1.0 (TID 4) (172.23.128.15 executor 0): java.lang.ClassCastException: > cannot assign instance of java.lang.invoke.SerializedLambda to field > org.apache.spark.sql.catalyst.expressions.ScalaUDF.f of type scala.Function1 > in instance of org.apache.spark.sql.catalyst.expressions.ScalaUDF}} > {{ at > java.io.ObjectStreamClass$FieldReflector.setObjFieldValues(ObjectStreamClass.java:2301)}} > {{ at > java.io.ObjectStreamClass.setObjFieldValues(ObjectStreamClass.java:1431)}} > {{ at > java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2437)}} > {{ at > java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2355)}} > {{ at > java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2213)}} > {{ at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1669)}} > {{ at java.io.ObjectInputStream.readArray(ObjectInputStream.java:2119)}} > {{ at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1657)}} > {{ at > java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2431)}} > {{ at > java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2355)}} > {{ at > java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2213)}} > {{ at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1669)}} > {{ at > java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2431)}} > {{ at > java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2355)}} > {{ at > java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2213)}} > {{ at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1669)}} > {{ at java.io.ObjectInputStream.readArray(ObjectInputStream.java:2119)}} > {{ at 
java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1657)}} > {{ at > java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2431)}} > {{ at > java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2355)}} > {{ at > java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2213)}} > {{ at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1669)}} > {{...}} > {{ at > org.apache.spark.sql.connect.client.GrpcExceptionConverter$.toThrowable(GrpcExceptionConverter.scala:110)}} > {{ at > org.apache.spark.sql.connect.client.GrpcExceptionConverter$.convert(GrpcExceptionConverter.scala:41)}} > {{ at > org.apache.spark.sql.connect.client.GrpcExceptionConverter$$anon$1.hasNext(GrpcExceptionConverter.scala:49)}} > {{ at scala.collection.Iterator.foreach(Iterator.scala:943)}} > {{ at scala.collection.Iterator.foreach$(Iterator.scala:943)}} > {{ at > org.apache.spark.sql.connect.client.GrpcExceptionConverter$$anon$1.foreach(GrpcExceptionConverter.scala:46)}} > {{ at scala.collection.generic.Growable.$plus$plus$eq(Growable.scala:62)}} > {{ at
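If that suggestion is followed, the client build simply moves from the release-candidate artifact to the GA release; a sketch of the change, assuming the 3.0.0 artifacts are published to Maven Central so the staging repository setting is no longer needed:
{code:scala}
// build.sbt: replace the release-candidate dependency with the GA artifact
libraryDependencies += "io.delta" %% "delta-spark" % "3.0.0"

// and on the server side pass
//   --packages org.apache.spark:spark-connect_2.12:3.5.0,io.delta:delta-spark_2.12:3.0.0
// instead of the ...:3.0.0rc2 coordinate, dropping the spark.jars.repositories override.
{code}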
[jira] [Commented] (SPARK-45289) ClassCastException when reading Delta table on AWS S3
[ https://issues.apache.org/jira/browse/SPARK-45289?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=1170#comment-1170 ] Sebastian Daberdaku commented on SPARK-45289: - Hello [~tanawatpan], you need to use the latest delta-spark version 3.0.0 which came out just yesterday. It now supports delta with Spark 3.5.0. https://github.com/delta-io/delta/releases/tag/v3.0.0 > ClassCastException when reading Delta table on AWS S3 > - > > Key: SPARK-45289 > URL: https://issues.apache.org/jira/browse/SPARK-45289 > Project: Spark > Issue Type: Bug > Components: Spark Core, SQL >Affects Versions: 3.5.0 > Environment: Spark version: 3.5.0 > Deployment mode: spark-shell > OS: Ubuntu (Docker image) > Java/JVM version: OpenJDK 11 > Packages: hadoop-aws:3.3.4, delta-core_2.12:2.4.0 >Reporter: Tanawat Panmongkol >Priority: Major > > When attempting to read a Delta table from S3 using version 3.5.0, a > _*{{ClassCastException}}*_ occurs involving > {{_*org.apache.hadoop.fs.s3a.S3AFileStatus*_}} and > {_}*{{org.apache.spark.sql.execution.datasources.FileStatusWithMetadata}}*{_}. > The error appears to be related to the new feature SPARK-43039. > _*Steps to Reproduce:*_ > {code:java} > export AWS_ACCESS_KEY_ID='' > export AWS_SECRET_ACCESS_KEY='' > export AWS_REGION='' > docker run --rm -it apache/spark:3.5.0-scala2.12-java11-ubuntu > /opt/spark/bin/spark-shell \ > --packages > 'org.apache.hadoop:hadoop-aws:3.3.4,io.delta:delta-core_2.12:2.4.0' \ > --conf > "spark.sql.catalog.spark_catalog=org.apache.spark.sql.delta.catalog.DeltaCatalog" > \ > --conf "spark.sql.extensions=io.delta.sql.DeltaSparkSessionExtension" \ > --conf "spark.hadoop.aws.region=$AWS_REGION" \ > --conf "spark.hadoop.fs.s3a.access.key=$AWS_ACCESS_KEY_ID" \ > --conf "spark.hadoop.fs.s3a.secret.key=$AWS_SECRET_ACCESS_KEY" \ > --conf "spark.hadoop.fs.s3.impl=org.apache.hadoop.fs.s3a.S3AFileSystem" \ > --conf "spark.hadoop.fs.s3a.impl=org.apache.hadoop.fs.s3a.S3AFileSystem" \ > --conf "spark.hadoop.fs.s3a.path.style.access=true" \ > --conf "spark.hadoop.fs.s3a.connection.ssl.enabled=true" \ > --conf "spark.jars.ivy=/tmp/ivy/cache"{code} > {code:java} > scala> > spark.read.format("delta").load("s3:").show() > {code} > *Logs:* > {code:java} > java.lang.ClassCastException: class org.apache.hadoop.fs.s3a.S3AFileStatus > cannot be cast to class > org.apache.spark.sql.execution.datasources.FileStatusWithMetadata > (org.apache.hadoop.fs.s3a.S3AFileStatus is in unnamed module of loader > scala.reflect.internal.util.ScalaClassLoader$URLClassLoader @4552f905; > org.apache.spark.sql.execution.datasources.FileStatusWithMetadata is in > unnamed module of loader 'app') > at > scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:286) > at > scala.collection.IndexedSeqOptimized.foreach(IndexedSeqOptimized.scala:36) > at > scala.collection.IndexedSeqOptimized.foreach$(IndexedSeqOptimized.scala:33) > at scala.collection.mutable.WrappedArray.foreach(WrappedArray.scala:38) > at scala.collection.TraversableLike.map(TraversableLike.scala:286) > at scala.collection.TraversableLike.map$(TraversableLike.scala:279) > at scala.collection.AbstractTraversable.map(Traversable.scala:108) > at > org.apache.spark.sql.execution.FileSourceScanLike.$anonfun$setFilesNumAndSizeMetric$2(DataSourceScanExec.scala:466) > at > org.apache.spark.sql.execution.FileSourceScanLike.$anonfun$setFilesNumAndSizeMetric$2$adapted(DataSourceScanExec.scala:466) > at scala.collection.immutable.List.map(List.scala:293) > at > 
org.apache.spark.sql.execution.FileSourceScanLike.setFilesNumAndSizeMetric(DataSourceScanExec.scala:466) > at > org.apache.spark.sql.execution.FileSourceScanLike.selectedPartitions(DataSourceScanExec.scala:257) > at > org.apache.spark.sql.execution.FileSourceScanLike.selectedPartitions$(DataSourceScanExec.scala:251) > at > org.apache.spark.sql.execution.FileSourceScanExec.selectedPartitions$lzycompute(DataSourceScanExec.scala:506) > at > org.apache.spark.sql.execution.FileSourceScanExec.selectedPartitions(DataSourceScanExec.scala:506) > at > org.apache.spark.sql.execution.FileSourceScanLike.dynamicallySelectedPartitions(DataSourceScanExec.scala:286) > at > org.apache.spark.sql.execution.FileSourceScanLike.dynamicallySelectedPartitions$(DataSourceScanExec.scala:267) > at > org.apache.spark.sql.execution.FileSourceScanExec.dynamicallySelectedPartitions$lzycompute(DataSourceScanExec.scala:506) > at > org.apache.spark.sql.execution.FileSourceScanExec.dynamicallySelectedPartitions(DataSourceScanExec.scala:506) > at > org.apache.spark.sql.execution.FileSourceScanExec.inputRDD$lzycompute(DataSourceScanExec.scala:553) > at >
[jira] [Assigned] (SPARK-45428) Add Matomo analytics to all released docs pages
[ https://issues.apache.org/jira/browse/SPARK-45428?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot reassigned SPARK-45428: -- Assignee: BingKun Pan (was: Apache Spark) > Add Matomo analytics to all released docs pages > --- > > Key: SPARK-45428 > URL: https://issues.apache.org/jira/browse/SPARK-45428 > Project: Spark > Issue Type: Sub-task > Components: Documentation, PySpark >Affects Versions: 3.5.0, 4.0.0 >Reporter: Allison Wang >Assignee: BingKun Pan >Priority: Major > Labels: pull-request-available > > Matomo analytics has been added to some pages of the Spark website. Here is > Sean's initial PR: > [https://github.com/apache/spark-website/pull/479.|https://www.google.com/url?q=https://github.com/apache/spark-website/pull/479=D=docs=1696544881650480=AOvVaw11SNfWcd4UJzlO8EJvzdoe] > You can find analytics for Spark website here: https://analytics.apache.org > We need to add this to all API pages. This is very important for us to > prioritize documentation improvements and search engine optimization. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-45428) Add Matomo analytics to all released docs pages
[ https://issues.apache.org/jira/browse/SPARK-45428?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot reassigned SPARK-45428: -- Assignee: Apache Spark (was: BingKun Pan) > Add Matomo analytics to all released docs pages > --- > > Key: SPARK-45428 > URL: https://issues.apache.org/jira/browse/SPARK-45428 > Project: Spark > Issue Type: Sub-task > Components: Documentation, PySpark >Affects Versions: 3.5.0, 4.0.0 >Reporter: Allison Wang >Assignee: Apache Spark >Priority: Major > Labels: pull-request-available > > Matomo analytics has been added to some pages of the Spark website. Here is > Sean's initial PR: > [https://github.com/apache/spark-website/pull/479.|https://www.google.com/url?q=https://github.com/apache/spark-website/pull/479=D=docs=1696544881650480=AOvVaw11SNfWcd4UJzlO8EJvzdoe] > You can find analytics for Spark website here: https://analytics.apache.org > We need to add this to all API pages. This is very important for us to > prioritize documentation improvements and search engine optimization. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-45605) Replace `s.c.MapOps.mapValues` with `s.c.MapOps.view.mapValues`
[ https://issues.apache.org/jira/browse/SPARK-45605?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot reassigned SPARK-45605: -- Assignee: (was: Apache Spark) >Replace `s.c.MapOps.mapValues` with `s.c.MapOps.view.mapValues` > -- > > Key: SPARK-45605 > URL: https://issues.apache.org/jira/browse/SPARK-45605 > Project: Spark > Issue Type: Sub-task > Components: Connect, DStreams, Examples, MLlib, Spark Core, SS >Affects Versions: 4.0.0 >Reporter: Yang Jie >Priority: Major > Labels: pull-request-available > > {code:java} > @deprecated("Use .view.mapValues(f). A future version will include a strict > version of this method (for now, .view.mapValues(f).toMap).", "2.13.0") > def mapValues[W](f: V => W): MapView[K, W] = new MapView.MapValues(this, f) > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
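To make the migration in SPARK-45605 concrete, here is a minimal sketch in plain Scala 2.13 (the map contents and variable names are illustrative only): `mapValues` is replaced by the lazy `view.mapValues`, with an explicit `.toMap` wherever a strict `Map` is still required.
{code:java}
// Plain Scala 2.13 sketch; values are illustrative.
val m = Map("a" -> 1, "b" -> 2)

// Deprecated since Scala 2.13:
// val doubled = m.mapValues(_ * 2)

// Replacement: a lazy MapView, materialized explicitly where a strict Map is needed.
val doubledView = m.view.mapValues(_ * 2) // scala.collection.MapView[String, Int]
val doubled = doubledView.toMap           // Map("a" -> 2, "b" -> 4)
{code}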
[jira] [Updated] (SPARK-45428) Add Matomo analytics to all released docs pages
[ https://issues.apache.org/jira/browse/SPARK-45428?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated SPARK-45428: --- Labels: pull-request-available (was: ) > Add Matomo analytics to all released docs pages > --- > > Key: SPARK-45428 > URL: https://issues.apache.org/jira/browse/SPARK-45428 > Project: Spark > Issue Type: Sub-task > Components: Documentation, PySpark >Affects Versions: 3.5.0, 4.0.0 >Reporter: Allison Wang >Assignee: BingKun Pan >Priority: Major > Labels: pull-request-available > > Matomo analytics has been added to some pages of the Spark website. Here is > Sean's initial PR: > [https://github.com/apache/spark-website/pull/479.|https://www.google.com/url?q=https://github.com/apache/spark-website/pull/479=D=docs=1696544881650480=AOvVaw11SNfWcd4UJzlO8EJvzdoe] > You can find analytics for Spark website here: https://analytics.apache.org > We need to add this to all API pages. This is very important for us to > prioritize documentation improvements and search engine optimization. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-45594) Auto repartition before writing data into partitioned or bucket table
[ https://issues.apache.org/jira/browse/SPARK-45594?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot reassigned SPARK-45594: -- Assignee: (was: Apache Spark) > Auto repartition before writing data into partitioned or bucket table > -- > > Key: SPARK-45594 > URL: https://issues.apache.org/jira/browse/SPARK-45594 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 4.0.0 >Reporter: Wan Kun >Priority: Major > Labels: pull-request-available > > Currently, when writing data into a partitioned table, there will be at least > *dynamicPartitions * shuffleNum* files; when writing data into a bucketed table, > there will be at least *bucketNums * shuffleNum* files. > We can shuffle by the dynamic partition or bucket columns before writing > data into the table, so that only shuffleNum files are created. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
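To illustrate the idea behind SPARK-45594, here is a hedged sketch of the manual workaround the proposal would automate (the session, table, and column names are hypothetical): repartitioning by the dynamic partition column before a partitioned write leaves roughly one file per partition value instead of up to shuffleNum files per partition.
{code:java}
// Assumes a spark-shell style session named `spark`; table and column names are hypothetical.
import spark.implicits._

val df = spark.table("source_events") // source with a `dt` column used for dynamic partitioning

df.repartition($"dt")                 // shuffle by the dynamic partition column first
  .write
  .mode("overwrite")
  .partitionBy("dt")                  // dynamic-partitioned write
  .saveAsTable("events_partitioned")  // now roughly one file per `dt` value per write
{code}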
[jira] [Assigned] (SPARK-45594) Auto repartition before writing data into partitioned or bucket table
[ https://issues.apache.org/jira/browse/SPARK-45594?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot reassigned SPARK-45594: -- Assignee: Apache Spark > Auto repartition before writing data into partitioned or bucket table > -- > > Key: SPARK-45594 > URL: https://issues.apache.org/jira/browse/SPARK-45594 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 4.0.0 >Reporter: Wan Kun >Assignee: Apache Spark >Priority: Major > Labels: pull-request-available > > Currently, when writing data into a partitioned table, there will be at least > *dynamicPartitions * shuffleNum* files; when writing data into a bucketed table, > there will be at least *bucketNums * shuffleNum* files. > We can shuffle by the dynamic partition or bucket columns before writing > data into the table, so that only shuffleNum files are created. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-45607) Collapse repartition operators with a project
[ https://issues.apache.org/jira/browse/SPARK-45607?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot reassigned SPARK-45607: -- Assignee: (was: Apache Spark) > Collapse repartition operators with a project > - > > Key: SPARK-45607 > URL: https://issues.apache.org/jira/browse/SPARK-45607 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 4.0.0 >Reporter: Wan Kun >Priority: Major > Labels: pull-request-available > > We can collapse two repartition operators with a project between them. > For example: > df.repartition($"a").select($"a", $"b", $"a" + $"b").repartition($"b") > is the same as > df.select($"a", $"b", $"a" + $"b").repartition($"b") -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
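A small sketch of the equivalence claimed in the SPARK-45607 description, assuming a spark-shell style session named `spark` (the sample data is made up): comparing the two physical plans shows whether the leading exchange from the first repartition survives, which is exactly what the proposed rule would collapse.
{code:java}
// Assumes a spark-shell style session named `spark`; the data is made up.
import spark.implicits._

val df = Seq((1, 2), (3, 4)).toDF("a", "b")

val withRedundantRepartition = df
  .repartition($"a")                  // overridden by the later repartition($"b")
  .select($"a", $"b", $"a" + $"b")
  .repartition($"b")

val withoutRedundantRepartition = df
  .select($"a", $"b", $"a" + $"b")
  .repartition($"b")

// Without the proposed rule the first plan typically keeps an extra Exchange for
// repartition($"a"); with it, both plans should be effectively identical.
withRedundantRepartition.explain()
withoutRedundantRepartition.explain()
{code}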
[jira] [Assigned] (SPARK-45607) Collapse repartition operators with a project
[ https://issues.apache.org/jira/browse/SPARK-45607?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot reassigned SPARK-45607: -- Assignee: Apache Spark > Collapse repartition operators with a project > - > > Key: SPARK-45607 > URL: https://issues.apache.org/jira/browse/SPARK-45607 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 4.0.0 >Reporter: Wan Kun >Assignee: Apache Spark >Priority: Major > Labels: pull-request-available > > We can collapse two repartition operators with a project between them. > For example: > df.repartition($"a").select($"a", $"b", $"a" + $"b").repartition($"b") > is the same as > df.select($"a", $"b", $"a" + $"b").repartition($"b") -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-45607) Collapse repartition operators with a project
[ https://issues.apache.org/jira/browse/SPARK-45607?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated SPARK-45607: --- Labels: pull-request-available (was: ) > Collapse repartition operators with a project > - > > Key: SPARK-45607 > URL: https://issues.apache.org/jira/browse/SPARK-45607 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 4.0.0 >Reporter: Wan Kun >Priority: Major > Labels: pull-request-available > > We can collapse two repartition operators with a project between them. > For example: > df.repartition($"a").select($"a", $"b", $"a" + $"b").repartition($"b") > is the same as > df.select($"a", $"b", $"a" + $"b").repartition($"b") -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-45607) Collapse repartition operators with a project
[ https://issues.apache.org/jira/browse/SPARK-45607?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wan Kun updated SPARK-45607: Summary: Collapse repartition operators with a project (was: Collapse repartition operators with project) > Collapse repartition operators with a project > - > > Key: SPARK-45607 > URL: https://issues.apache.org/jira/browse/SPARK-45607 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 4.0.0 >Reporter: Wan Kun >Priority: Major > > We can collapse two repartition operators with a project between them. > For example: > df.repartition($"a").select($"a", $"b", $"a" + $"b").repartition($"b") > is the same as > df.select($"a", $"b", $"a" + $"b").repartition($"b") -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-45607) Collapse repartition operators with project
[ https://issues.apache.org/jira/browse/SPARK-45607?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wan Kun updated SPARK-45607: Summary: Collapse repartition operators with project (was: Collapse repartitions with project) > Collapse repartition operators with project > --- > > Key: SPARK-45607 > URL: https://issues.apache.org/jira/browse/SPARK-45607 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 4.0.0 >Reporter: Wan Kun >Priority: Major > > We can collapse two repartition operators with a project between them. > For example: > df.repartition($"a").select($"a", $"b", $"a" + $"b").repartition($"b") > is the same as > df.select($"a", $"b", $"a" + $"b").repartition($"b") -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-45607) Collapse repartitions with project
[ https://issues.apache.org/jira/browse/SPARK-45607?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wan Kun updated SPARK-45607: Summary: Collapse repartitions with project (was: Collapse repartition with project) > Collapse repartitions with project > -- > > Key: SPARK-45607 > URL: https://issues.apache.org/jira/browse/SPARK-45607 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 4.0.0 >Reporter: Wan Kun >Priority: Major > > We can collapse two repartition operators with a project between them. > For example: > df.repartition($"a").select($"a", $"b", $"a" + $"b").repartition($"b") > is the same as > df.select($"a", $"b", $"a" + $"b").repartition($"b") -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-45607) Collapse repartition with project
Wan Kun created SPARK-45607: --- Summary: Collapse repartition with project Key: SPARK-45607 URL: https://issues.apache.org/jira/browse/SPARK-45607 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 4.0.0 Reporter: Wan Kun We can collapse two repartition operators with a project between them. For example: df.repartition($"a").select($"a", $"b", $"a" + $"b").repartition($"b") is the same as df.select($"a", $"b", $"a" + $"b").repartition($"b") -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-45604) Converting timestamp_ntz to array can cause NPE or SEGFAULT on parquet vectorized reader
[ https://issues.apache.org/jira/browse/SPARK-45604?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated SPARK-45604: --- Labels: pull-request-available (was: ) > Converting timestamp_ntz to array can cause NPE or SEGFAULT on > parquet vectorized reader > --- > > Key: SPARK-45604 > URL: https://issues.apache.org/jira/browse/SPARK-45604 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.5.0 >Reporter: Zamil Majdy >Priority: Major > Labels: pull-request-available > > Repro: > ``` > val path = "/tmp/someparquetfile" > val df = sql("SELECT MAP('key', CAST('2019-01-01 00:00:00' AS TIMESTAMP_NTZ)) > AS field") > df.write.mode("overwrite").parquet(path) > spark.read.schema("field map array>").parquet(path).collect() > ``` > Depending on which memory mode is used, it will produce an NPE in on-heap mode > and a segfault in off-heap mode. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-45554) Introduce flexible parameter to assertSchemaEqual
[ https://issues.apache.org/jira/browse/SPARK-45554?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated SPARK-45554: --- Labels: pull-request-available (was: ) > Introduce flexible parameter to assertSchemaEqual > - > > Key: SPARK-45554 > URL: https://issues.apache.org/jira/browse/SPARK-45554 > Project: Spark > Issue Type: Sub-task > Components: PySpark, Tests >Affects Versions: 4.0.0 >Reporter: Haejoon Lee >Priority: Major > Labels: pull-request-available > > Add new parameter ignoreColumnNames to the assertSchemaEqual. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-45606) Release restrictions on multi-layer runtime filter
[ https://issues.apache.org/jira/browse/SPARK-45606?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated SPARK-45606: --- Labels: pull-request-available (was: ) > Release restrictions on multi-layer runtime filter > -- > > Key: SPARK-45606 > URL: https://issues.apache.org/jira/browse/SPARK-45606 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 4.0.0 >Reporter: Jiaan Geng >Assignee: Jiaan Geng >Priority: Major > Labels: pull-request-available > > Before https://issues.apache.org/jira/browse/SPARK-41674, Spark could only > insert a runtime filter for the application side of a shuffle join at a single layer. > Because it was considered not worthwhile to insert another runtime filter when one side of the > shuffle join already had one, Spark restricted it. > After https://issues.apache.org/jira/browse/SPARK-41674, Spark supports > inserting a runtime filter for one side of any shuffle join across multiple layers, but > the restriction on multi-layer runtime filters now looks outdated. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
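For context on SPARK-45606, a hedged sketch of the query shape it targets (all table and column names are hypothetical, and whether a runtime filter is actually injected still depends on the usual size and selectivity heuristics): once the restriction is relaxed, more than one layer of a multi-join plan becomes a candidate for runtime filter injection.
{code:java}
// Assumes a spark-shell style session named `spark`; tables and columns are made up.
import spark.implicits._

val fact = spark.table("fact_sales")
val dim1 = spark.table("dim_store").filter($"region" === "EU")        // selective build side
val dim2 = spark.table("dim_product").filter($"category" === "toys")  // selective build side

// Layer 1: fact_sales joins dim_store; layer 2: that result joins dim_product.
// Each shuffle-join layer is a potential injection point for a runtime filter
// (for example a bloom filter on store_id, and another on product_id).
val result = fact
  .join(dim1, Seq("store_id"))
  .join(dim2, Seq("product_id"))

result.explain()
{code}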
[jira] [Updated] (SPARK-45605) Replace `s.c.MapOps.mapValues` with `s.c.MapOps.view.mapValues`
[ https://issues.apache.org/jira/browse/SPARK-45605?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated SPARK-45605: --- Labels: pull-request-available (was: ) >Replace `s.c.MapOps.mapValues` with `s.c.MapOps.view.mapValues` > -- > > Key: SPARK-45605 > URL: https://issues.apache.org/jira/browse/SPARK-45605 > Project: Spark > Issue Type: Sub-task > Components: Connect, DStreams, Examples, MLlib, Spark Core, SS >Affects Versions: 4.0.0 >Reporter: Yang Jie >Priority: Major > Labels: pull-request-available > > {code:java} > @deprecated("Use .view.mapValues(f). A future version will include a strict > version of this method (for now, .view.mapValues(f).toMap).", "2.13.0") > def mapValues[W](f: V => W): MapView[K, W] = new MapView.MapValues(this, f) > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-45605) Replace `s.c.MapOps.mapValues` with `s.c.MapOps.view.mapValues`
[ https://issues.apache.org/jira/browse/SPARK-45605?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yang Jie updated SPARK-45605: - Description: {code:java} @deprecated("Use .view.mapValues(f). A future version will include a strict version of this method (for now, .view.mapValues(f).toMap).", "2.13.0") def mapValues[W](f: V => W): MapView[K, W] = new MapView.MapValues(this, f) {code} was: {code:java} // code placeholder {code} >Replace `s.c.MapOps.mapValues` with `s.c.MapOps.view.mapValues` > -- > > Key: SPARK-45605 > URL: https://issues.apache.org/jira/browse/SPARK-45605 > Project: Spark > Issue Type: Sub-task > Components: Connect, DStreams, Examples, MLlib, Spark Core, SS >Affects Versions: 4.0.0 >Reporter: Yang Jie >Priority: Major > > {code:java} > @deprecated("Use .view.mapValues(f). A future version will include a strict > version of this method (for now, .view.mapValues(f).toMap).", "2.13.0") > def mapValues[W](f: V => W): MapView[K, W] = new MapView.MapValues(this, f) > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-45606) Release restrictions on multi-layer runtime filter
Jiaan Geng created SPARK-45606: -- Summary: Release restrictions on multi-layer runtime filter Key: SPARK-45606 URL: https://issues.apache.org/jira/browse/SPARK-45606 Project: Spark Issue Type: New Feature Components: SQL Affects Versions: 4.0.0 Reporter: Jiaan Geng Assignee: Jiaan Geng Before https://issues.apache.org/jira/browse/SPARK-41674, Spark could only insert a runtime filter for the application side of a shuffle join at a single layer. Because it was considered not worthwhile to insert another runtime filter when one side of the shuffle join already had one, Spark restricted it. After https://issues.apache.org/jira/browse/SPARK-41674, Spark supports inserting a runtime filter for one side of any shuffle join across multiple layers, but the restriction on multi-layer runtime filters now looks outdated. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-45605) Replace `s.c.MapOps.mapValues` with `s.c.MapOps.view.mapValues`
[ https://issues.apache.org/jira/browse/SPARK-45605?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yang Jie updated SPARK-45605: - Description: {code:java} // code placeholder {code} >Replace `s.c.MapOps.mapValues` with `s.c.MapOps.view.mapValues` > -- > > Key: SPARK-45605 > URL: https://issues.apache.org/jira/browse/SPARK-45605 > Project: Spark > Issue Type: Sub-task > Components: Connect, DStreams, Examples, MLlib, Spark Core, SS >Affects Versions: 4.0.0 >Reporter: Yang Jie >Priority: Major > > {code:java} > // code placeholder > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-45605) Replace `s.c.MapOps.mapValues` with `s.c.MapOps.view.mapValues`
Yang Jie created SPARK-45605: Summary:Replace `s.c.MapOps.mapValues` with `s.c.MapOps.view.mapValues` Key: SPARK-45605 URL: https://issues.apache.org/jira/browse/SPARK-45605 Project: Spark Issue Type: Sub-task Components: SS, Connect, DStreams, Examples, MLlib, Spark Core Affects Versions: 4.0.0 Reporter: Yang Jie -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-45604) Converting timestamp_ntz to array can cause NPE or SEGFAULT on parquet vectorized reader
[ https://issues.apache.org/jira/browse/SPARK-45604?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zamil Majdy updated SPARK-45604: Description: Repro: ``` val path = "/tmp/someparquetfile" val df = sql("SELECT MAP('key', CAST('2019-01-01 00:00:00' AS TIMESTAMP_NTZ)) AS field") df.write.mode("overwrite").parquet(path) spark.read.schema("field map>").parquet(path).collect() ``` Depending on the memory mode is used, it will produced NPE on on-heap mode, and segfault on off-heap was: Repro: ``` val path = "/tmp/zamil/timestamp" val df = sql("SELECT MAP('key', CAST('2019-01-01 00:00:00' AS TIMESTAMP_NTZ)) AS field") df.write.mode("overwrite").parquet(path) spark.read.schema("field map>").parquet(path).collect() ``` Depending on the memory mode is used, it will produced NPE on on-heap mode, and segfault on off-heap > Converting timestamp_ntz to array can cause NPE or SEGFAULT on > parquet vectorized reader > --- > > Key: SPARK-45604 > URL: https://issues.apache.org/jira/browse/SPARK-45604 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.5.0 >Reporter: Zamil Majdy >Priority: Major > > Repro: > ``` > val path = "/tmp/someparquetfile" > val df = sql("SELECT MAP('key', CAST('2019-01-01 00:00:00' AS TIMESTAMP_NTZ)) > AS field") > df.write.mode("overwrite").parquet(path) > spark.read.schema("field map array>").parquet(path).collect() > ``` > Depending on the memory mode is used, it will produced NPE on on-heap mode, > and segfault on off-heap -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-45604) Converting timestamp_ntz to array can cause NPE or SEGFAULT on parquet vectorized reader
[ https://issues.apache.org/jira/browse/SPARK-45604?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zamil Majdy updated SPARK-45604: Description: Repro: ``` spark.conf.set("spark.databricks.photon.enabled", "false") val path = "/tmp/zamil/timestamp" val df = sql("SELECT MAP('key', CAST('2019-01-01 00:00:00' AS TIMESTAMP_NTZ)) AS field") df.write.mode("overwrite").parquet(path) spark.read.schema("field map>").parquet(path).collect() ``` Depending on the memory mode is used, it will produced NPE on on-heap mode, and segfault on off-heap was: Repro: ``` spark.conf.set("spark.databricks.photon.enabled", "false") val path = "/tmp/zamil/timestamp" val df = sql("SELECT MAP('key', CAST('2019-01-01 00:00:00' AS TIMESTAMP_NTZ)) AS field") df.write.mode("overwrite").parquet(path) spark.read.schema("field map>").parquet(path).collect() ``` Depending on the memory mode is used, it will produced NPE on on-heap mode, and segfault on off-heap > Converting timestamp_ntz to array can cause NPE or SEGFAULT on > parquet vectorized reader > --- > > Key: SPARK-45604 > URL: https://issues.apache.org/jira/browse/SPARK-45604 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.5.0 >Reporter: Zamil Majdy >Priority: Major > > Repro: > ``` > spark.conf.set("spark.databricks.photon.enabled", "false") > val path = "/tmp/zamil/timestamp" > val df = sql("SELECT MAP('key', CAST('2019-01-01 00:00:00' AS TIMESTAMP_NTZ)) > AS field") > df.write.mode("overwrite").parquet(path) > spark.read.schema("field map array>").parquet(path).collect() > ``` > Depending on the memory mode is used, it will produced NPE on on-heap mode, > and segfault on off-heap -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-45604) Converting timestamp_ntz to array can cause NPE or SEGFAULT on parquet vectorized reader
[ https://issues.apache.org/jira/browse/SPARK-45604?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zamil Majdy updated SPARK-45604: Description: Repro: ``` val path = "/tmp/zamil/timestamp" val df = sql("SELECT MAP('key', CAST('2019-01-01 00:00:00' AS TIMESTAMP_NTZ)) AS field") df.write.mode("overwrite").parquet(path) spark.read.schema("field map>").parquet(path).collect() ``` Depending on the memory mode is used, it will produced NPE on on-heap mode, and segfault on off-heap was: Repro: ``` spark.conf.set("spark.databricks.photon.enabled", "false") val path = "/tmp/zamil/timestamp" val df = sql("SELECT MAP('key', CAST('2019-01-01 00:00:00' AS TIMESTAMP_NTZ)) AS field") df.write.mode("overwrite").parquet(path) spark.read.schema("field map>").parquet(path).collect() ``` Depending on the memory mode is used, it will produced NPE on on-heap mode, and segfault on off-heap > Converting timestamp_ntz to array can cause NPE or SEGFAULT on > parquet vectorized reader > --- > > Key: SPARK-45604 > URL: https://issues.apache.org/jira/browse/SPARK-45604 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.5.0 >Reporter: Zamil Majdy >Priority: Major > > Repro: > ``` > val path = "/tmp/zamil/timestamp" > val df = sql("SELECT MAP('key', CAST('2019-01-01 00:00:00' AS TIMESTAMP_NTZ)) > AS field") > df.write.mode("overwrite").parquet(path) > spark.read.schema("field map array>").parquet(path).collect() > ``` > Depending on the memory mode is used, it will produced NPE on on-heap mode, > and segfault on off-heap -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-45604) Converting timestamp_ntz to array can cause NPE or SEGFAULT on parquet vectorized reader
[ https://issues.apache.org/jira/browse/SPARK-45604?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zamil Majdy updated SPARK-45604: Description: Repro: ``` spark.conf.set("spark.databricks.photon.enabled", "false") val path = "/tmp/zamil/timestamp" val df = sql("SELECT MAP('key', CAST('2019-01-01 00:00:00' AS TIMESTAMP_NTZ)) AS field") df.write.mode("overwrite").parquet(path) spark.read.schema("field map>").parquet(path).collect() ``` Depending on the memory mode is used, it will produced NPE on on-heap mode, and segfault on off-heap was: Repro: {{```}} spark.conf.set("spark.databricks.photon.enabled", "false") val path = "/tmp/zamil/timestamp" val df = sql("SELECT MAP('key', CAST('2019-01-01 00:00:00' AS TIMESTAMP_NTZ)) AS field") df.write.mode("overwrite").parquet(path) spark.read.schema("field map>").parquet(path).collect() {{{}{}}}``` Depending on the memory mode is used, it will produced NPE on on-heap mode, and segfault on off-heap > Converting timestamp_ntz to array can cause NPE or SEGFAULT on > parquet vectorized reader > --- > > Key: SPARK-45604 > URL: https://issues.apache.org/jira/browse/SPARK-45604 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.5.0 >Reporter: Zamil Majdy >Priority: Major > > Repro: > > ``` > spark.conf.set("spark.databricks.photon.enabled", "false") > val path = "/tmp/zamil/timestamp" > val df = sql("SELECT MAP('key', CAST('2019-01-01 00:00:00' AS TIMESTAMP_NTZ)) > AS field") > df.write.mode("overwrite").parquet(path) > spark.read.schema("field map array>").parquet(path).collect() > ``` > Depending on the memory mode is used, it will produced NPE on on-heap mode, > and segfault on off-heap -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-45604) Converting timestamp_ntz to array can cause NPE or SEGFAULT on parquet vectorized reader
Zamil Majdy created SPARK-45604: --- Summary: Converting timestamp_ntz to array can cause NPE or SEGFAULT on parquet vectorized reader Key: SPARK-45604 URL: https://issues.apache.org/jira/browse/SPARK-45604 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 3.5.0 Reporter: Zamil Majdy Repro: {{{}```{}}}{{{}{}}} spark.conf.set("spark.databricks.photon.enabled", "false") {{}} val path = "/tmp/somepath" val df = sql("SELECT MAP('key', CAST('2019-01-01 00:00:00' AS TIMESTAMP_NTZ)) AS field") {{}} df.write.mode("overwrite").parquet(path) spark.read.schema("field map>").parquet(path).collect() {{{}{}}}{{{}```{}}} Depending on the memory mode is used, it will produced NPE on on-heap mode, and segfault on off-heap -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-45604) Converting timestamp_ntz to array can cause NPE or SEGFAULT on parquet vectorized reader
[ https://issues.apache.org/jira/browse/SPARK-45604?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zamil Majdy updated SPARK-45604: Description: Repro: {{```}} spark.conf.set("spark.databricks.photon.enabled", "false") val path = "/tmp/zamil/timestamp" val df = sql("SELECT MAP('key', CAST('2019-01-01 00:00:00' AS TIMESTAMP_NTZ)) AS field") df.write.mode("overwrite").parquet(path) spark.read.schema("field map>").parquet(path).collect() {{{}{}}}``` Depending on the memory mode is used, it will produced NPE on on-heap mode, and segfault on off-heap was: Repro: {{{}```{}}}{{{}{}}} spark.conf.set("spark.databricks.photon.enabled", "false") {{}} val path = "/tmp/somepath" val df = sql("SELECT MAP('key', CAST('2019-01-01 00:00:00' AS TIMESTAMP_NTZ)) AS field") {{}} df.write.mode("overwrite").parquet(path) spark.read.schema("field map>").parquet(path).collect() {{{}{}}}{{{}```{}}} Depending on the memory mode is used, it will produced NPE on on-heap mode, and segfault on off-heap > Converting timestamp_ntz to array can cause NPE or SEGFAULT on > parquet vectorized reader > --- > > Key: SPARK-45604 > URL: https://issues.apache.org/jira/browse/SPARK-45604 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.5.0 >Reporter: Zamil Majdy >Priority: Major > > Repro: > > {{```}} > spark.conf.set("spark.databricks.photon.enabled", "false") > val path = "/tmp/zamil/timestamp" > val df = sql("SELECT MAP('key', CAST('2019-01-01 00:00:00' AS TIMESTAMP_NTZ)) > AS field") > df.write.mode("overwrite").parquet(path) > spark.read.schema("field map array>").parquet(path).collect() > {{{}{}}}``` > Depending on the memory mode is used, it will produced NPE on on-heap mode, > and segfault on off-heap -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-44734) Add documentation for type casting rules in Python UDFs/UDTFs
[ https://issues.apache.org/jira/browse/SPARK-44734?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=1064#comment-1064 ] BingKun Pan commented on SPARK-44734: - [~phildakin] Sorry, I didn't see that the previous PR is actually strongly related to this PR. For completeness, you can continue with this PR and I will stop this work. > Add documentation for type casting rules in Python UDFs/UDTFs > - > > Key: SPARK-44734 > URL: https://issues.apache.org/jira/browse/SPARK-44734 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 4.0.0 >Reporter: Allison Wang >Priority: Major > > In addition to type mappings between Spark data types and Python data types > (SPARK-44733), we should add the type casting rules for regular and > arrow-optimized Python UDFs/UDTFs. > We currently have this table in code: > * Arrow: > [https://github.com/apache/spark/blob/master/python/pyspark/sql/pandas/functions.py#L311-L329] > * Python UDF: > [https://github.com/apache/spark/blob/master/python/pyspark/sql/udf.py#L101-L116] > We should add a proper documentation page for the type casting rules. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-45569) Assign name to _LEGACY_ERROR_TEMP_2152
[ https://issues.apache.org/jira/browse/SPARK-45569?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Max Gekk resolved SPARK-45569. -- Resolution: Fixed Issue resolved by pull request 43414 [https://github.com/apache/spark/pull/43414] > Assign name to _LEGACY_ERROR_TEMP_2152 > -- > > Key: SPARK-45569 > URL: https://issues.apache.org/jira/browse/SPARK-45569 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.5.0 >Reporter: Deng Ziming >Assignee: Deng Ziming >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0 > > > In DatasetSuite test("CLASS_UNSUPPORTED_BY_MAP_OBJECTS when creating > dataset"), we are using _LEGACY_ERROR_TEMP_2151. We should use a proper error > class name rather than `_LEGACY_ERROR_TEMP_xxx`. > > *NOTE:* Please reply to this ticket before starting work on it, to avoid > two people working on the same ticket at the same time -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org