[jira] [Assigned] (SPARK-40215) Add SQL configs to control CSV/JSON date and timestamp parsing behaviour
[ https://issues.apache.org/jira/browse/SPARK-40215?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-40215: Assignee: Apache Spark > Add SQL configs to control CSV/JSON date and timestamp parsing behaviour > > > Key: SPARK-40215 > URL: https://issues.apache.org/jira/browse/SPARK-40215 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.4.0 >Reporter: Ivan Sadikov >Assignee: Apache Spark >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-40215) Add SQL configs to control CSV/JSON date and timestamp parsing behaviour
[ https://issues.apache.org/jira/browse/SPARK-40215?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-40215: Assignee: (was: Apache Spark) > Add SQL configs to control CSV/JSON date and timestamp parsing behaviour > > > Key: SPARK-40215 > URL: https://issues.apache.org/jira/browse/SPARK-40215 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.4.0 >Reporter: Ivan Sadikov >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-40039) Introducing a streaming checkpoint file manager based on Hadoop's Abortable interface
[ https://issues.apache.org/jira/browse/SPARK-40039?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jungtaek Lim resolved SPARK-40039. -- Fix Version/s: 3.4.0 Resolution: Fixed Issue resolved by pull request 37474 [https://github.com/apache/spark/pull/37474] > Introducing a streaming checkpoint file manager based on Hadoop's Abortable > interface > - > > Key: SPARK-40039 > URL: https://issues.apache.org/jira/browse/SPARK-40039 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Affects Versions: 3.4.0 >Reporter: Attila Zsolt Piros >Assignee: Attila Zsolt Piros >Priority: Major > Fix For: 3.4.0 > > > Currently on S3 the checkpoint file manager (called > FileContextBasedCheckpointFileManager) is based on rename. So when a file is > opened for an atomic stream, a temporary file is used instead, and when the stream > is committed the file is renamed. > But on S3 a rename will be a file copy, so it has some serious performance > implications. > But on Hadoop 3 there is a new interface, called *Abortable*, and > *S3AFileSystem* has this capability, which is implemented on top of S3's > multipart upload. So when the file is committed a POST is sent > ([https://docs.aws.amazon.com/AmazonS3/latest/API/API_CompleteMultipartUpload.html]) > and when aborted a DELETE will be sent > ([https://docs.aws.amazon.com/AmazonS3/latest/API/API_AbortMultipartUpload.html]) -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
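Editor's note: for readers unfamiliar with the commit-or-abort model described above, the sketch below shows the general shape of writing a file that is committed on close or aborted on failure. It is an illustration only, not the actual checkpoint file manager code; the helper name and the cast to Abortable (available on S3A output streams in Hadoop 3.3+) are assumptions.
{code:scala}
// Illustration only: commit-or-abort file writes on top of Hadoop 3.3+'s
// org.apache.hadoop.fs.Abortable, which S3A output streams back with a multipart upload.
// The helper name and the asInstanceOf cast are assumptions for this sketch.
import java.io.OutputStream
import org.apache.hadoop.fs.{Abortable, FileSystem, Path}

def writeCommitOrAbort(fs: FileSystem, path: Path)(body: OutputStream => Unit): Unit = {
  val out = fs.create(path, true)          // on S3A this starts a multipart upload
  try {
    body(out)
    out.close()                            // commit: CompleteMultipartUpload (POST)
  } catch {
    case t: Throwable =>
      out.asInstanceOf[Abortable].abort()  // abort: AbortMultipartUpload (DELETE), no rename needed
      throw t
  }
}
{code}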
[jira] [Commented] (SPARK-40212) SparkSQL castPartValue does not properly handle byte & short
[ https://issues.apache.org/jira/browse/SPARK-40212?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17584613#comment-17584613 ] Yuming Wang commented on SPARK-40212: - How to reproduce this issue? > SparkSQL castPartValue does not properly handle byte & short > > > Key: SPARK-40212 > URL: https://issues.apache.org/jira/browse/SPARK-40212 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.3.0 >Reporter: Brennan Stein >Priority: Major > > Reading in a parquet file partitioned on disk by a `Byte`-type column fails > with the following exception: > > {code:java} > [info] Cause: java.lang.ClassCastException: java.lang.Integer cannot be > cast to java.lang.Byte > [info] at scala.runtime.BoxesRunTime.unboxToByte(BoxesRunTime.java:95) > [info] at > org.apache.spark.sql.catalyst.expressions.BaseGenericInternalRow.getByte(rows.scala:39) > [info] at > org.apache.spark.sql.catalyst.expressions.BaseGenericInternalRow.getByte$(rows.scala:39) > [info] at > org.apache.spark.sql.catalyst.expressions.GenericInternalRow.getByte(rows.scala:195) > [info] at > org.apache.spark.sql.catalyst.expressions.JoinedRow.getByte(JoinedRow.scala:86) > [info] at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.writeFields_0_6$(Unknown > Source) > [info] at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown > Source) > [info] at > org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat.$anonfun$buildReaderWithPartitionValues$8(ParquetFileFormat.scala:385) > [info] at > org.apache.spark.sql.execution.datasources.RecordReaderIterator$$anon$1.next(RecordReaderIterator.scala:62) > [info] at > org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.next(FileScanRDD.scala:189) > [info] at scala.collection.Iterator$$anon$10.next(Iterator.scala:461) > [info] at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown > Source) > [info] at > org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) > [info] at > org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:760) > [info] at > org.apache.spark.sql.execution.SparkPlan.$anonfun$getByteArrayRdd$1(SparkPlan.scala:364) > [info] at > org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2(RDD.scala:890) > [info] at > org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2$adapted(RDD.scala:890) > [info] at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) > [info] at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:365) > [info] at org.apache.spark.rdd.RDD.iterator(RDD.scala:329) > [info] at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90) > [info] at org.apache.spark.scheduler.Task.run(Task.scala:136) > [info] at > org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:548) > [info] at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1504) > [info] at > org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:551) > [info] at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) > [info] at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) > [info] at java.lang.Thread.run(Thread.java:748) {code} > I believe the issue to stem from > 
[PartitioningUtils::castPartValueToDesiredType|https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/PartitioningUtils.scala#L533] > returning an Integer for ByteType and ShortType (which then fails to unbox > to the expected type): > > {code:java} > case ByteType | ShortType | IntegerType => Integer.parseInt(value) {code} > > The issue appears to have been introduced in [this > commit|https://github.com/apache/spark/commit/fc29c91f27d866502f5b6cc4261d4943b57e] > so likely affects Spark 3.2 as well, though I've only tested on 3.3.0. > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
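Editor's note: in answer to the reproduction question above, a minimal sketch that should hit the described ClassCastException follows. It assumes an existing SparkSession named `spark` on an affected version (3.3.0) and a hypothetical output path; the user-supplied schema is what forces the partition column to ByteType.
{code:scala}
// Minimal reproduction sketch (assumes an existing SparkSession named `spark`):
// write a parquet table partitioned by a ByteType column, then read it back with a
// schema that keeps the partition column as BYTE.
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.ByteType

val path = "/tmp/spark-40212-repro"  // hypothetical path
spark.range(10)
  .withColumn("part", (col("id") % 3).cast(ByteType))
  .write.mode("overwrite").partitionBy("part").parquet(path)

spark.read.schema("id LONG, part BYTE").parquet(path).show()
// On affected versions this is reported to fail with:
// java.lang.ClassCastException: java.lang.Integer cannot be cast to java.lang.Byte
{code}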
[jira] [Created] (SPARK-40215) Add SQL configs to control CSV/JSON date and timestamp parsing behaviour
Ivan Sadikov created SPARK-40215: Summary: Add SQL configs to control CSV/JSON date and timestamp parsing behaviour Key: SPARK-40215 URL: https://issues.apache.org/jira/browse/SPARK-40215 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.4.0 Reporter: Ivan Sadikov -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-40215) Add SQL configs to control CSV/JSON date and timestamp parsing behaviour
[ https://issues.apache.org/jira/browse/SPARK-40215?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17584612#comment-17584612 ] Ivan Sadikov commented on SPARK-40215: -- Follow-up. > Add SQL configs to control CSV/JSON date and timestamp parsing behaviour > > > Key: SPARK-40215 > URL: https://issues.apache.org/jira/browse/SPARK-40215 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.4.0 >Reporter: Ivan Sadikov >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-40213) Incorrect ASCII value for Latin-1 Supplement characters
[ https://issues.apache.org/jira/browse/SPARK-40213?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-40213. -- Fix Version/s: 3.4.0 3.3.1 3.2.3 Assignee: Linhong Liu Resolution: Fixed Fixed in https://github.com/apache/spark/pull/37651 > Incorrect ASCII value for Latin-1 Supplement characters > --- > > Key: SPARK-40213 > URL: https://issues.apache.org/jira/browse/SPARK-40213 > Project: Spark > Issue Type: Task > Components: SQL >Affects Versions: 3.2.2 >Reporter: Linhong Liu >Assignee: Linhong Liu >Priority: Major > Fix For: 3.4.0, 3.3.1, 3.2.3 > > > the `ascii()` built-in function in spark doesn't support Latin-1 Supplement > characters which value between [128, 256). Instead, it produces a wrong > value, -62 or -61 for all the chars. But the `chr()` built-in function > supports value in [0, 256) and normally `ascii` should be the inverse of > `chr()` -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-40214) Add `get` to dataframe functions
[ https://issues.apache.org/jira/browse/SPARK-40214?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17584589#comment-17584589 ] Apache Spark commented on SPARK-40214: -- User 'zhengruifeng' has created a pull request for this issue: https://github.com/apache/spark/pull/37652 > Add `get` to dataframe functions > > > Key: SPARK-40214 > URL: https://issues.apache.org/jira/browse/SPARK-40214 > Project: Spark > Issue Type: Improvement > Components: PySpark, SQL >Affects Versions: 3.4.0 >Reporter: Ruifeng Zheng >Assignee: Ruifeng Zheng >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-40214) Add `get` to dataframe functions
[ https://issues.apache.org/jira/browse/SPARK-40214?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-40214: Assignee: Apache Spark (was: Ruifeng Zheng) > Add `get` to dataframe functions > > > Key: SPARK-40214 > URL: https://issues.apache.org/jira/browse/SPARK-40214 > Project: Spark > Issue Type: Improvement > Components: PySpark, SQL >Affects Versions: 3.4.0 >Reporter: Ruifeng Zheng >Assignee: Apache Spark >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-40214) Add `get` to dataframe functions
[ https://issues.apache.org/jira/browse/SPARK-40214?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-40214: Assignee: Ruifeng Zheng (was: Apache Spark) > Add `get` to dataframe functions > > > Key: SPARK-40214 > URL: https://issues.apache.org/jira/browse/SPARK-40214 > Project: Spark > Issue Type: Improvement > Components: PySpark, SQL >Affects Versions: 3.4.0 >Reporter: Ruifeng Zheng >Assignee: Ruifeng Zheng >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-40214) Add `get` to dataframe functions
[ https://issues.apache.org/jira/browse/SPARK-40214?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17584588#comment-17584588 ] Apache Spark commented on SPARK-40214: -- User 'zhengruifeng' has created a pull request for this issue: https://github.com/apache/spark/pull/37652 > Add `get` to dataframe functions > > > Key: SPARK-40214 > URL: https://issues.apache.org/jira/browse/SPARK-40214 > Project: Spark > Issue Type: Improvement > Components: PySpark, SQL >Affects Versions: 3.4.0 >Reporter: Ruifeng Zheng >Assignee: Ruifeng Zheng >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-40214) Add `get` to dataframe functions
[ https://issues.apache.org/jira/browse/SPARK-40214?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ruifeng Zheng reassigned SPARK-40214: - Assignee: Ruifeng Zheng > Add `get` to dataframe functions > > > Key: SPARK-40214 > URL: https://issues.apache.org/jira/browse/SPARK-40214 > Project: Spark > Issue Type: Improvement > Components: PySpark, SQL >Affects Versions: 3.4.0 >Reporter: Ruifeng Zheng >Assignee: Ruifeng Zheng >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-40214) Add `get` to dataframe functions
Ruifeng Zheng created SPARK-40214: - Summary: Add `get` to dataframe functions Key: SPARK-40214 URL: https://issues.apache.org/jira/browse/SPARK-40214 Project: Spark Issue Type: Improvement Components: PySpark, SQL Affects Versions: 3.4.0 Reporter: Ruifeng Zheng -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-40202) Allow a dictionary in SparkSession.config in PySpark
[ https://issues.apache.org/jira/browse/SPARK-40202?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-40202: Assignee: Hyukjin Kwon > Allow a dictionary in SparkSession.config in PySpark > > > Key: SPARK-40202 > URL: https://issues.apache.org/jira/browse/SPARK-40202 > Project: Spark > Issue Type: Bug > Components: PySpark, SQL >Affects Versions: 3.4.0 >Reporter: Hyukjin Kwon >Assignee: Hyukjin Kwon >Priority: Major > > SPARK-40163 added a new signature in SparkSession.conf. We should better have > the same one in PySpark too. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-40202) Allow a dictionary in SparkSession.config in PySpark
[ https://issues.apache.org/jira/browse/SPARK-40202?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-40202. -- Fix Version/s: 3.4.0 Resolution: Fixed Issue resolved by pull request 37642 [https://github.com/apache/spark/pull/37642] > Allow a dictionary in SparkSession.config in PySpark > > > Key: SPARK-40202 > URL: https://issues.apache.org/jira/browse/SPARK-40202 > Project: Spark > Issue Type: Bug > Components: PySpark, SQL >Affects Versions: 3.4.0 >Reporter: Hyukjin Kwon >Assignee: Hyukjin Kwon >Priority: Major > Fix For: 3.4.0 > > > SPARK-40163 added a new signature in SparkSession.conf. We should better have > the same one in PySpark too. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-40055) listCatalogs should also return spark_catalog even spark_catalog implementation is defaultSessionCatalog
[ https://issues.apache.org/jira/browse/SPARK-40055?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-40055. - Fix Version/s: 3.4.0 Resolution: Fixed Issue resolved by pull request 37488 [https://github.com/apache/spark/pull/37488] > listCatalogs should also return spark_catalog even spark_catalog > implementation is defaultSessionCatalog > > > Key: SPARK-40055 > URL: https://issues.apache.org/jira/browse/SPARK-40055 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.4.0 >Reporter: Rui Wang >Assignee: Rui Wang >Priority: Major > Fix For: 3.4.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-40055) listCatalogs should also return spark_catalog even spark_catalog implementation is defaultSessionCatalog
[ https://issues.apache.org/jira/browse/SPARK-40055?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-40055: --- Assignee: Rui Wang > listCatalogs should also return spark_catalog even spark_catalog > implementation is defaultSessionCatalog > > > Key: SPARK-40055 > URL: https://issues.apache.org/jira/browse/SPARK-40055 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.4.0 >Reporter: Rui Wang >Assignee: Rui Wang >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-40213) Incorrect ASCII value for Latin-1 Supplement characters
[ https://issues.apache.org/jira/browse/SPARK-40213?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-40213: Assignee: (was: Apache Spark) > Incorrect ASCII value for Latin-1 Supplement characters > --- > > Key: SPARK-40213 > URL: https://issues.apache.org/jira/browse/SPARK-40213 > Project: Spark > Issue Type: Task > Components: SQL >Affects Versions: 3.2.2 >Reporter: Linhong Liu >Priority: Major > > the `ascii()` built-in function in spark doesn't support Latin-1 Supplement > characters which value between [128, 256). Instead, it produces a wrong > value, -62 or -61 for all the chars. But the `chr()` built-in function > supports value in [0, 256) and normally `ascii` should be the inverse of > `chr()` -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-40213) Incorrect ASCII value for Latin-1 Supplement characters
[ https://issues.apache.org/jira/browse/SPARK-40213?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-40213: Assignee: Apache Spark > Incorrect ASCII value for Latin-1 Supplement characters > --- > > Key: SPARK-40213 > URL: https://issues.apache.org/jira/browse/SPARK-40213 > Project: Spark > Issue Type: Task > Components: SQL >Affects Versions: 3.2.2 >Reporter: Linhong Liu >Assignee: Apache Spark >Priority: Major > > the `ascii()` built-in function in spark doesn't support Latin-1 Supplement > characters which value between [128, 256). Instead, it produces a wrong > value, -62 or -61 for all the chars. But the `chr()` built-in function > supports value in [0, 256) and normally `ascii` should be the inverse of > `chr()` -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-40213) Incorrect ASCII value for Latin-1 Supplement characters
[ https://issues.apache.org/jira/browse/SPARK-40213?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17584529#comment-17584529 ] Apache Spark commented on SPARK-40213: -- User 'linhongliu-db' has created a pull request for this issue: https://github.com/apache/spark/pull/37651 > Incorrect ASCII value for Latin-1 Supplement characters > --- > > Key: SPARK-40213 > URL: https://issues.apache.org/jira/browse/SPARK-40213 > Project: Spark > Issue Type: Task > Components: SQL >Affects Versions: 3.2.2 >Reporter: Linhong Liu >Priority: Major > > the `ascii()` built-in function in spark doesn't support Latin-1 Supplement > characters which value between [128, 256). Instead, it produces a wrong > value, -62 or -61 for all the chars. But the `chr()` built-in function > supports value in [0, 256) and normally `ascii` should be the inverse of > `chr()` -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-40213) Incorrect ASCII value for Latin-1 Supplement characters
Linhong Liu created SPARK-40213: --- Summary: Incorrect ASCII value for Latin-1 Supplement characters Key: SPARK-40213 URL: https://issues.apache.org/jira/browse/SPARK-40213 Project: Spark Issue Type: Task Components: SQL Affects Versions: 3.2.2 Reporter: Linhong Liu The `ascii()` built-in function in Spark doesn't support Latin-1 Supplement characters, whose values are in [128, 256). Instead, it produces a wrong value, -62 or -61, for all such characters. But the `chr()` built-in function supports values in [0, 256), and normally `ascii()` should be the inverse of `chr()`. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
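Editor's note: a small illustration of the reported behaviour, assuming a SparkSession named `spark`. The character 'é' (U+00E9, code point 233) encodes as 0xC3 0xA9 in UTF-8, and interpreting the first byte as a signed value gives -61, which matches the wrong result described above.
{code:scala}
// Sketch demonstrating the report (assumes an existing SparkSession named `spark`).
spark.sql("SELECT ascii('é'), chr(233)").show()
// On affected versions ascii('é') reportedly returns -61, while chr(233) correctly
// returns 'é'; after a fix, ascii('é') should return 233 so that ascii() inverts chr().
{code}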
[jira] [Created] (SPARK-40212) SparkSQL castPartValue does not properly handle byte & short
Brennan Stein created SPARK-40212: - Summary: SparkSQL castPartValue does not properly handle byte & short Key: SPARK-40212 URL: https://issues.apache.org/jira/browse/SPARK-40212 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.3.0 Reporter: Brennan Stein Reading in a parquet file partitioned on disk by a `Byte`-type column fails with the following exception: {code:java} [info] Cause: java.lang.ClassCastException: java.lang.Integer cannot be cast to java.lang.Byte [info] at scala.runtime.BoxesRunTime.unboxToByte(BoxesRunTime.java:95) [info] at org.apache.spark.sql.catalyst.expressions.BaseGenericInternalRow.getByte(rows.scala:39) [info] at org.apache.spark.sql.catalyst.expressions.BaseGenericInternalRow.getByte$(rows.scala:39) [info] at org.apache.spark.sql.catalyst.expressions.GenericInternalRow.getByte(rows.scala:195) [info] at org.apache.spark.sql.catalyst.expressions.JoinedRow.getByte(JoinedRow.scala:86) [info] at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.writeFields_0_6$(Unknown Source) [info] at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown Source) [info] at org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat.$anonfun$buildReaderWithPartitionValues$8(ParquetFileFormat.scala:385) [info] at org.apache.spark.sql.execution.datasources.RecordReaderIterator$$anon$1.next(RecordReaderIterator.scala:62) [info] at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.next(FileScanRDD.scala:189) [info] at scala.collection.Iterator$$anon$10.next(Iterator.scala:461) [info] at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source) [info] at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) [info] at org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:760) [info] at org.apache.spark.sql.execution.SparkPlan.$anonfun$getByteArrayRdd$1(SparkPlan.scala:364) [info] at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2(RDD.scala:890) [info] at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2$adapted(RDD.scala:890) [info] at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) [info] at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:365) [info] at org.apache.spark.rdd.RDD.iterator(RDD.scala:329) [info] at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90) [info] at org.apache.spark.scheduler.Task.run(Task.scala:136) [info] at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:548) [info] at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1504) [info] at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:551) [info] at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [info] at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [info] at java.lang.Thread.run(Thread.java:748) {code} I believe the issue to stem from [PartitioningUtils::castPartValueToDesiredType|https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/PartitioningUtils.scala#L533] returning an Integer for ByteType and ShortType (which then fails to unbox to the expected type): {code:java} case ByteType | ShortType | IntegerType => Integer.parseInt(value) {code} The issue appears to have been introduced in [this 
commit|https://github.com/apache/spark/commit/fc29c91f27d866502f5b6cc4261d4943b57e] so likely affects Spark 3.2 as well, though I've only tested on 3.3.0. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
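Editor's note: one possible shape of a fix for the match arm quoted above is sketched below. It illustrates the idea (return a value whose boxed runtime class matches the requested type); it is not the patch that was eventually merged, and the simplified signature omits the other cases and parameters of the real castPartValueToDesiredType.
{code:scala}
// Sketch only: split the ByteType/ShortType/IntegerType arm so the boxed runtime class
// matches the requested partition column type. Not the merged patch.
import org.apache.spark.sql.types._

def castPartValueToDesiredTypeSketch(desiredType: DataType, value: String): Any = desiredType match {
  case ByteType    => value.toByte   // boxes to java.lang.Byte
  case ShortType   => value.toShort  // boxes to java.lang.Short
  case IntegerType => value.toInt    // boxes to java.lang.Integer
  case other => throw new IllegalArgumentException(s"case not covered in this sketch: $other")
}
{code}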
[jira] [Commented] (SPARK-40211) Allow executeTake() / collectLimit's number of starting partitions to be customized
[ https://issues.apache.org/jira/browse/SPARK-40211?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17584458#comment-17584458 ] Ziqi Liu commented on SPARK-40211: -- I'm actively working on this > Allow executeTake() / collectLimit's number of starting partitions to be > customized > --- > > Key: SPARK-40211 > URL: https://issues.apache.org/jira/browse/SPARK-40211 > Project: Spark > Issue Type: Story > Components: Spark Core, SQL >Affects Versions: 3.4.0 >Reporter: Ziqi Liu >Priority: Major > > Today, Spark’s executeTake() code allow for the limitScaleUpFactor to be > customized but does not allow for the initial number of partitions to be > customized: it’s currently hardcoded to {{{}1{}}}. > We should add a configuration so that the initial partition count can be > customized. By setting this new configuration to a high value we could > effectively mitigate the “run multiple jobs” overhead in {{take}} behavior. > We could also set it to higher-than-1-but-still-small values (like, say, > {{{}10{}}}) to achieve a middle-ground trade-off. > > Essentially, we need to make {{numPartsToTry = 1L}} > ([code|https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/SparkPlan.scala#L481]) > customizable. We should do this via a new SQL conf, similar to the > {{limitScaleUpFactor}} conf. > > Spark has several near-duplicate versions of this code ([see code > search|https://github.com/apache/spark/search?q=numPartsToTry+%3D+1]) in: > * SparkPlan > * RDD > * pyspark rdd > Also, in pyspark {{limitScaleUpFactor}} is not supported either. So for > now, I will focus on scala side first, leaving python side untouched and > meanwhile sync with pyspark members. Depending on the progress we can do them > all in one PR or make scala side change first and leave pyspark change as a > follow-up. > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-40211) Allow executeTake() / collectLimit's number of starting partitions to be customized
Ziqi Liu created SPARK-40211: Summary: Allow executeTake() / collectLimit's number of starting partitions to be customized Key: SPARK-40211 URL: https://issues.apache.org/jira/browse/SPARK-40211 Project: Spark Issue Type: Story Components: Spark Core, SQL Affects Versions: 3.4.0 Reporter: Ziqi Liu Today, Spark’s executeTake() code allows for the limitScaleUpFactor to be customized but does not allow for the initial number of partitions to be customized: it’s currently hardcoded to {{1}}. We should add a configuration so that the initial partition count can be customized. By setting this new configuration to a high value we could effectively mitigate the “run multiple jobs” overhead in {{take}} behavior. We could also set it to higher-than-1-but-still-small values (like, say, {{10}}) to achieve a middle-ground trade-off. Essentially, we need to make {{numPartsToTry = 1L}} ([code|https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/SparkPlan.scala#L481]) customizable. We should do this via a new SQL conf, similar to the {{limitScaleUpFactor}} conf. Spark has several near-duplicate versions of this code ([see code search|https://github.com/apache/spark/search?q=numPartsToTry+%3D+1]) in: * SparkPlan * RDD * pyspark rdd Also, in pyspark {{limitScaleUpFactor}} is not supported either. So for now, I will focus on the scala side first, leaving the python side untouched, and meanwhile sync with pyspark members. Depending on the progress we can do them all in one PR or make the scala side change first and leave the pyspark change as a follow-up. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
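Editor's note: the sketch below shows what such a SQL conf could look like, following the builder style of existing SQLConf entries such as spark.sql.limit.scaleUpFactor. The key name, default, and field name are assumptions for illustration, not necessarily what the ticket ends up adding.
{code:scala}
// Sketch of a possible SQLConf entry, as it would appear inside
// org.apache.spark.sql.internal.SQLConf where buildConf is defined
// (key name and default are assumptions):
val LIMIT_INITIAL_NUM_PARTITIONS = buildConf("spark.sql.limit.initialNumPartitions")
  .doc("Initial number of partitions scanned by executeTake()/collectLimit before " +
    "scaling up by spark.sql.limit.scaleUpFactor.")
  .version("3.4.0")
  .intConf
  .checkValue(_ > 0, "initial number of partitions must be positive")
  .createWithDefault(1)

// In SparkPlan.executeTake, the hardcoded `var numPartsToTry = 1L` would then read the
// new conf instead, e.g. `var numPartsToTry = conf.limitInitialNumPartitions.toLong`.
{code}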
[jira] [Commented] (SPARK-40210) Fix math atan2, hypot, pow and pmod float argument call
[ https://issues.apache.org/jira/browse/SPARK-40210?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17584447#comment-17584447 ] Apache Spark commented on SPARK-40210: -- User 'khalidmammadov' has created a pull request for this issue: https://github.com/apache/spark/pull/37650 > Fix math atan2, hypot, pow and pmod float argument call > --- > > Key: SPARK-40210 > URL: https://issues.apache.org/jira/browse/SPARK-40210 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 3.4.0 >Reporter: Khalid Mammadov >Priority: Minor > > PySpark atan2, hypot, pow and pmod functions marked as accepting float type > as argument but produce error when called together -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-40210) Fix math atan2, hypot, pow and pmod float argument call
[ https://issues.apache.org/jira/browse/SPARK-40210?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-40210: Assignee: (was: Apache Spark) > Fix math atan2, hypot, pow and pmod float argument call > --- > > Key: SPARK-40210 > URL: https://issues.apache.org/jira/browse/SPARK-40210 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 3.4.0 >Reporter: Khalid Mammadov >Priority: Minor > > PySpark atan2, hypot, pow and pmod functions marked as accepting float type > as argument but produce error when called together -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-40210) Fix math atan2, hypot, pow and pmod float argument call
[ https://issues.apache.org/jira/browse/SPARK-40210?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-40210: Assignee: Apache Spark > Fix math atan2, hypot, pow and pmod float argument call > --- > > Key: SPARK-40210 > URL: https://issues.apache.org/jira/browse/SPARK-40210 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 3.4.0 >Reporter: Khalid Mammadov >Assignee: Apache Spark >Priority: Minor > > PySpark atan2, hypot, pow and pmod functions marked as accepting float type > as argument but produce error when called together -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-40210) Fix math atan2, hypot, pow and pmod float argument call
Khalid Mammadov created SPARK-40210: --- Summary: Fix math atan2, hypot, pow and pmod float argument call Key: SPARK-40210 URL: https://issues.apache.org/jira/browse/SPARK-40210 Project: Spark Issue Type: Improvement Components: PySpark Affects Versions: 3.4.0 Reporter: Khalid Mammadov PySpark's atan2, hypot, pow and pmod functions are marked as accepting float-type arguments but produce an error when called with them. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-40209) Incorrect value in the error message of NUMERIC_VALUE_OUT_OF_RANGE
[ https://issues.apache.org/jira/browse/SPARK-40209?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-40209: Assignee: Max Gekk (was: Apache Spark) > Incorrect value in the error message of NUMERIC_VALUE_OUT_OF_RANGE > -- > > Key: SPARK-40209 > URL: https://issues.apache.org/jira/browse/SPARK-40209 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.4.0 >Reporter: Max Gekk >Assignee: Max Gekk >Priority: Major > > The example below demonstrates the issue: > {code:sql} > spark-sql> select cast(interval '10.123' second as decimal(1, 0)); > [NUMERIC_VALUE_OUT_OF_RANGE] 0.10 cannot be represented as Decimal(1, 0). > If necessary set "spark.sql.ansi.enabled" to "false" to bypass this error. > {code} > The value 0.10 is not related to 10.123. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-40209) Incorrect value in the error message of NUMERIC_VALUE_OUT_OF_RANGE
[ https://issues.apache.org/jira/browse/SPARK-40209?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-40209: Assignee: Apache Spark (was: Max Gekk) > Incorrect value in the error message of NUMERIC_VALUE_OUT_OF_RANGE > -- > > Key: SPARK-40209 > URL: https://issues.apache.org/jira/browse/SPARK-40209 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.4.0 >Reporter: Max Gekk >Assignee: Apache Spark >Priority: Major > > The example below demonstrates the issue: > {code:sql} > spark-sql> select cast(interval '10.123' second as decimal(1, 0)); > [NUMERIC_VALUE_OUT_OF_RANGE] 0.10 cannot be represented as Decimal(1, 0). > If necessary set "spark.sql.ansi.enabled" to "false" to bypass this error. > {code} > The value 0.10 is not related to 10.123. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-40209) Incorrect value in the error message of NUMERIC_VALUE_OUT_OF_RANGE
[ https://issues.apache.org/jira/browse/SPARK-40209?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17584436#comment-17584436 ] Apache Spark commented on SPARK-40209: -- User 'MaxGekk' has created a pull request for this issue: https://github.com/apache/spark/pull/37649 > Incorrect value in the error message of NUMERIC_VALUE_OUT_OF_RANGE > -- > > Key: SPARK-40209 > URL: https://issues.apache.org/jira/browse/SPARK-40209 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.4.0 >Reporter: Max Gekk >Assignee: Max Gekk >Priority: Major > > The example below demonstrates the issue: > {code:sql} > spark-sql> select cast(interval '10.123' second as decimal(1, 0)); > [NUMERIC_VALUE_OUT_OF_RANGE] 0.10 cannot be represented as Decimal(1, 0). > If necessary set "spark.sql.ansi.enabled" to "false" to bypass this error. > {code} > The value 0.10 is not related to 10.123. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-40209) Incorrect value in the error message of NUMERIC_VALUE_OUT_OF_RANGE
[ https://issues.apache.org/jira/browse/SPARK-40209?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Max Gekk updated SPARK-40209: - Description: The example below demonstrates the issue: {code:sql} spark-sql> select cast(interval '10.123' second as decimal(1, 0)); [NUMERIC_VALUE_OUT_OF_RANGE] 0.10 cannot be represented as Decimal(1, 0). If necessary set "spark.sql.ansi.enabled" to "false" to bypass this error. {code} The value 0.10 is not related to 10.123. was: The example below demonstrates the issue: {code:sql} spark-sql> select cast(interval '10.123' second as decimal(1, 0)); [NUMERIC_VALUE_OUT_OF_RANGE] 0.10 cannot be represented as Decimal(1, 0). If necessary set "spark.sql.ansi.enabled" to "false" to bypass this error. {code} > Incorrect value in the error message of NUMERIC_VALUE_OUT_OF_RANGE > -- > > Key: SPARK-40209 > URL: https://issues.apache.org/jira/browse/SPARK-40209 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.4.0 >Reporter: Max Gekk >Assignee: Max Gekk >Priority: Major > > The example below demonstrates the issue: > {code:sql} > spark-sql> select cast(interval '10.123' second as decimal(1, 0)); > [NUMERIC_VALUE_OUT_OF_RANGE] 0.10 cannot be represented as Decimal(1, 0). > If necessary set "spark.sql.ansi.enabled" to "false" to bypass this error. > {code} > The value 0.10 is not related to 10.123. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-40209) Incorrect value in the error message of NUMERIC_VALUE_OUT_OF_RANGE
Max Gekk created SPARK-40209: Summary: Incorrect value in the error message of NUMERIC_VALUE_OUT_OF_RANGE Key: SPARK-40209 URL: https://issues.apache.org/jira/browse/SPARK-40209 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 3.4.0 Reporter: Max Gekk Assignee: Max Gekk The example below demonstrates the issue: {code:sql} spark-sql> select cast(interval '10.123' second as decimal(1, 0)); [NUMERIC_VALUE_OUT_OF_RANGE] 0.10 cannot be represented as Decimal(1, 0). If necessary set "spark.sql.ansi.enabled" to "false" to bypass this error. {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-40195) Add PrunedScanWithAQESuite
[ https://issues.apache.org/jira/browse/SPARK-40195?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kazuyuki Tanimura resolved SPARK-40195. --- Resolution: Invalid I just realized the suite is not for AQE, so closing > Add PrunedScanWithAQESuite > -- > > Key: SPARK-40195 > URL: https://issues.apache.org/jira/browse/SPARK-40195 > Project: Spark > Issue Type: Test > Components: SQL, Tests >Affects Versions: 3.4.0 >Reporter: Kazuyuki Tanimura >Priority: Minor > > Currently `PrunedScanSuite` assumes that AQE is always not applied. We should > also test with AQE force applied. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-40094) Send TaskEnd event when task failed with NotSerializableException or TaskOutputFileAlreadyExistException to release executors for dynamic allocation
[ https://issues.apache.org/jira/browse/SPARK-40094?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mridul Muralidharan resolved SPARK-40094. - Fix Version/s: 3.4.0 Resolution: Fixed Issue resolved by pull request 37528 [https://github.com/apache/spark/pull/37528] > Send TaskEnd event when task failed with NotSerializableException or > TaskOutputFileAlreadyExistException to release executors for dynamic > allocation > -- > > Key: SPARK-40094 > URL: https://issues.apache.org/jira/browse/SPARK-40094 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.3.0 >Reporter: wangshengjie >Assignee: wangshengjie >Priority: Major > Fix For: 3.4.0 > > > We found that if a task fails with NotSerializableException or > TaskOutputFileAlreadyExistException, Spark won't send a TaskEnd event, and this will > prevent dynamic allocation from releasing the executor normally. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
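Editor's note: the sketch below is a minimal listener for observing whether TaskEnd events are delivered for failed tasks; it is an observation aid for the behaviour described above, not part of the fix. It assumes an existing SparkContext named `sc`.
{code:scala}
// Minimal sketch: log every TaskEnd event, including its end reason, so missing events
// for failed tasks become visible. Assumes an existing SparkContext named `sc`.
import org.apache.spark.scheduler.{SparkListener, SparkListenerTaskEnd}

class TaskEndLogger extends SparkListener {
  override def onTaskEnd(taskEnd: SparkListenerTaskEnd): Unit = {
    // reason is e.g. Success, ExceptionFailure, or another TaskEndReason subtype
    println(s"TaskEnd: stage=${taskEnd.stageId} task=${taskEnd.taskInfo.taskId} reason=${taskEnd.reason}")
  }
}

sc.addSparkListener(new TaskEndLogger)
{code}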
[jira] [Assigned] (SPARK-40094) Send TaskEnd event when task failed with NotSerializableException or TaskOutputFileAlreadyExistException to release executors for dynamic allocation
[ https://issues.apache.org/jira/browse/SPARK-40094?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mridul Muralidharan reassigned SPARK-40094: --- Assignee: wangshengjie > Send TaskEnd event when task failed with NotSerializableException or > TaskOutputFileAlreadyExistException to release executors for dynamic > allocation > -- > > Key: SPARK-40094 > URL: https://issues.apache.org/jira/browse/SPARK-40094 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.3.0 >Reporter: wangshengjie >Assignee: wangshengjie >Priority: Major > > We found if task failed with NotSerializableException or > TaskOutputFileAlreadyExistException, wont send TaskEnd event, and this will > cause dynamic allocation not release executor normally. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-40208) New OFFSET clause does not use new error framework
[ https://issues.apache.org/jira/browse/SPARK-40208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17584351#comment-17584351 ] Serge Rielau commented on SPARK-40208: -- [~maxgekk] (FYI) Also (I'm sure LIMIT is the same, maybe fix in one fell swoop?) spark-sql> SELECT name, age FROM person ORDER BY name OFFSET -1; Error in query: The offset expression must be equal to or greater than 0, but got -1; Offset -1 +- Sort [name#185 ASC NULLS FIRST], true +- Project [name#185, age#186] +- SubqueryAlias person +- View (`person`, [name#185,age#186]) +- Project [cast(col1#187 as string) AS name#185, cast(col2#188 as int) AS age#186] +- LocalRelation [col1#187, col2#188] > New OFFSET clause does not use new error framework > --- > > Key: SPARK-40208 > URL: https://issues.apache.org/jira/browse/SPARK-40208 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.4.0 >Reporter: Serge Rielau >Priority: Minor > > CREATE TEMP VIEW person (name, age) > AS VALUES ('Zen Hui', 25), > ('Anil B' , 18), > ('Shone S', 16), > ('Mike A' , 25), > ('John A' , 18), > ('Jack N' , 16); > SELECT name, age FROM person ORDER BY name OFFSET length(name); > Error in query: The offset expression must evaluate to a constant value, but > got length(person.name); > Offset length(name#181) > +- Sort [name#181 ASC NULLS FIRST], true > +- Project [name#181, age#182] > +- SubqueryAlias person > +- View (`person`, [name#181,age#182]) > +- Project [cast(col1#183 as string) AS name#181, cast(col2#184 > as int) AS age#182] > +- LocalRelation [col1#183, col2#184|#183, col2#184] > > Returning the plan here is quite pointless as well. The context would be more > interesting. > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-40208) New OFFSET clause does not use new error framework
Serge Rielau created SPARK-40208: Summary: New OFFSET clause does not use new error framework Key: SPARK-40208 URL: https://issues.apache.org/jira/browse/SPARK-40208 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 3.4.0 Reporter: Serge Rielau CREATE TEMP VIEW person (name, age) AS VALUES ('Zen Hui', 25), ('Anil B' , 18), ('Shone S', 16), ('Mike A' , 25), ('John A' , 18), ('Jack N' , 16); SELECT name, age FROM person ORDER BY name OFFSET length(name); Error in query: The offset expression must evaluate to a constant value, but got length(person.name); Offset length(name#181) +- Sort [name#181 ASC NULLS FIRST], true +- Project [name#181, age#182] +- SubqueryAlias person +- View (`person`, [name#181,age#182]) +- Project [cast(col1#183 as string) AS name#181, cast(col2#184 as int) AS age#182] +- LocalRelation [col1#183, col2#184|#183, col2#184] Returning the plan here is quite pointless as well. The context would be more interesting. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-40207) Specify the column name when the data type is not supported by datasource
[ https://issues.apache.org/jira/browse/SPARK-40207?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang updated SPARK-40207: Fix Version/s: (was: 3.4.0) > Specify the column name when the data type is not supported by datasource > - > > Key: SPARK-40207 > URL: https://issues.apache.org/jira/browse/SPARK-40207 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.4.0 >Reporter: Yi kaifei >Assignee: Apache Spark >Priority: Major > > Currently, If the data type is not supported by the data source, the > exception message thrown does not contain the column name, which is less > clear for locating the problem, this Jira aims to optimize error message > description -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-40207) Specify the column name when the data type is not supported by datasource
[ https://issues.apache.org/jira/browse/SPARK-40207?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-40207: Assignee: Apache Spark > Specify the column name when the data type is not supported by datasource > - > > Key: SPARK-40207 > URL: https://issues.apache.org/jira/browse/SPARK-40207 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.4.0 >Reporter: Yi kaifei >Assignee: Apache Spark >Priority: Major > Fix For: 3.4.0 > > > Currently, If the data type is not supported by the data source, the > exception message thrown does not contain the column name, which is less > clear for locating the problem, this Jira aims to optimize error message > description -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-40207) Specify the column name when the data type is not supported by datasource
[ https://issues.apache.org/jira/browse/SPARK-40207?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-40207: Assignee: Apache Spark > Specify the column name when the data type is not supported by datasource > - > > Key: SPARK-40207 > URL: https://issues.apache.org/jira/browse/SPARK-40207 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.4.0 >Reporter: Yi kaifei >Assignee: Apache Spark >Priority: Major > Fix For: 3.4.0 > > > Currently, If the data type is not supported by the data source, the > exception message thrown does not contain the column name, which is less > clear for locating the problem, this Jira aims to optimize error message > description -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-40207) Specify the column name when the data type is not supported by datasource
[ https://issues.apache.org/jira/browse/SPARK-40207?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-40207: Assignee: (was: Apache Spark) > Specify the column name when the data type is not supported by datasource > - > > Key: SPARK-40207 > URL: https://issues.apache.org/jira/browse/SPARK-40207 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.4.0 >Reporter: Yi kaifei >Priority: Major > Fix For: 3.4.0 > > > Currently, If the data type is not supported by the data source, the > exception message thrown does not contain the column name, which is less > clear for locating the problem, this Jira aims to optimize error message > description -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-40207) Specify the column name when the data type is not supported by datasource
[ https://issues.apache.org/jira/browse/SPARK-40207?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17584268#comment-17584268 ] Apache Spark commented on SPARK-40207: -- User 'Yikf' has created a pull request for this issue: https://github.com/apache/spark/pull/37574 > Specify the column name when the data type is not supported by datasource > - > > Key: SPARK-40207 > URL: https://issues.apache.org/jira/browse/SPARK-40207 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.4.0 >Reporter: Yi kaifei >Priority: Major > Fix For: 3.4.0 > > > Currently, If the data type is not supported by the data source, the > exception message thrown does not contain the column name, which is less > clear for locating the problem, this Jira aims to optimize error message > description -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-40207) Specify the column name when the data type is not supported by datasource
[ https://issues.apache.org/jira/browse/SPARK-40207?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yi kaifei updated SPARK-40207: -- Description: Currently, if the data type is not supported by the data source, the exception message thrown does not contain the column name, which makes the problem harder to locate. This Jira aims to improve the error message description. > Specify the column name when the data type is not supported by datasource > - > > Key: SPARK-40207 > URL: https://issues.apache.org/jira/browse/SPARK-40207 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.4.0 >Reporter: Yi kaifei >Priority: Major > Fix For: 3.4.0 > > > Currently, if the data type is not supported by the data source, the > exception message thrown does not contain the column name, which makes the > problem harder to locate. This Jira aims to improve the error message > description. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
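Editor's note: a sketch of one way to hit this class of error, assuming a SparkSession named `spark` and a hypothetical output path. CSV does not support nested types, so the write below fails; the ticket proposes naming the offending column in that message. The exact message text varies across versions.
{code:scala}
// Sketch: trigger an "unsupported data type" error from the CSV data source
// (assumes an existing SparkSession named `spark`; the path is hypothetical).
val df = spark.range(1).selectExpr("id", "named_struct('a', id) AS s")
df.write.mode("overwrite").csv("/tmp/spark-40207-demo")
// Fails with something like "CSV data source does not support struct<a:bigint> data type.";
// the improvement is to also say which column (here `s`) has the unsupported type.
{code}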
[jira] [Created] (SPARK-40207) Specify the column name when the data type is not supported by datasource
Yi kaifei created SPARK-40207: - Summary: Specify the column name when the data type is not supported by datasource Key: SPARK-40207 URL: https://issues.apache.org/jira/browse/SPARK-40207 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.4.0 Reporter: Yi kaifei Fix For: 3.4.0 -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-38909) Encapsulate LevelDB used by ExternalShuffleBlockResolver and YarnShuffleService as LocalDB
[ https://issues.apache.org/jira/browse/SPARK-38909?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17584247#comment-17584247 ] Apache Spark commented on SPARK-38909: -- User 'LuciferYang' has created a pull request for this issue: https://github.com/apache/spark/pull/37648 > Encapsulate LevelDB used by ExternalShuffleBlockResolver and > YarnShuffleService as LocalDB > -- > > Key: SPARK-38909 > URL: https://issues.apache.org/jira/browse/SPARK-38909 > Project: Spark > Issue Type: Improvement > Components: Spark Core, YARN >Affects Versions: 3.4.0 >Reporter: Yang Jie >Assignee: Yang Jie >Priority: Minor > Fix For: 3.4.0 > > > {{ExternalShuffleBlockResolver}} and {{YarnShuffleService}} use {{LevelDB}} > directly; this is not conducive to extending the use of {{RocksDB}} in > this scenario. This PR adds the encapsulation for extensibility. It will be the > pre-work of SPARK-3 -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-38909) Encapsulate LevelDB used by ExternalShuffleBlockResolver and YarnShuffleService as LocalDB
[ https://issues.apache.org/jira/browse/SPARK-38909?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17584245#comment-17584245 ] Apache Spark commented on SPARK-38909: -- User 'LuciferYang' has created a pull request for this issue: https://github.com/apache/spark/pull/37648 > Encapsulate LevelDB used by ExternalShuffleBlockResolver and > YarnShuffleService as LocalDB > -- > > Key: SPARK-38909 > URL: https://issues.apache.org/jira/browse/SPARK-38909 > Project: Spark > Issue Type: Improvement > Components: Spark Core, YARN >Affects Versions: 3.4.0 >Reporter: Yang Jie >Assignee: Yang Jie >Priority: Minor > Fix For: 3.4.0 > > > {{ExternalShuffleBlockResolver}} and {{YarnShuffleService}} use {{LevelDB}} > directly; this is not conducive to extending the use of {{RocksDB}} in > this scenario. This PR adds the encapsulation for extensibility. It will be the > pre-work of SPARK-3 -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
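Editor's note: the sketch below illustrates the kind of abstraction the issue describes, a small key-value store trait that both a LevelDB-backed and a RocksDB-backed implementation could satisfy. The trait and class names are assumptions for illustration, not the interface that was actually merged.
{code:scala}
// Illustrative sketch (names are assumptions, not the merged interface): a minimal
// key-value store abstraction that LevelDB and RocksDB implementations could both back.
trait LocalDB extends AutoCloseable {
  def get(key: Array[Byte]): Array[Byte]
  def put(key: Array[Byte], value: Array[Byte]): Unit
  def delete(key: Array[Byte]): Unit
}

class LevelDBLocalDB(db: org.iq80.leveldb.DB) extends LocalDB {
  override def get(key: Array[Byte]): Array[Byte] = db.get(key)
  override def put(key: Array[Byte], value: Array[Byte]): Unit = db.put(key, value)
  override def delete(key: Array[Byte]): Unit = db.delete(key)
  override def close(): Unit = db.close()
}
{code}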
[jira] [Assigned] (SPARK-39957) Delay onDisconnected to enable Driver receives ExecutorExitCode
[ https://issues.apache.org/jira/browse/SPARK-39957?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] wuyi reassigned SPARK-39957: Assignee: Kai-Hsun Chen > Delay onDisconnected to enable Driver receives ExecutorExitCode > --- > > Key: SPARK-39957 > URL: https://issues.apache.org/jira/browse/SPARK-39957 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.4.0 >Reporter: Kai-Hsun Chen >Assignee: Kai-Hsun Chen >Priority: Major > > There are two methods to detect executor loss. First, when RPC fails, the > function {{onDisconnected}} will be triggered. Second, when executor exits > with ExecutorExitCode, the exit code will be passed from ExecutorRunner to > Driver. These two methods may categorize same cases into different > conclusions. We hope to categorize the ExecutorLossReason by > ExecutorExitCode. This PR aims to make sure Driver receives ExecutorExitCode > before onDisconnected is called. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-39957) Delay onDisconnected to enable Driver receives ExecutorExitCode
[ https://issues.apache.org/jira/browse/SPARK-39957?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] wuyi resolved SPARK-39957. -- Resolution: Fixed Issue resolved by https://github.com/apache/spark/pull/37400 > Delay onDisconnected to enable Driver receives ExecutorExitCode > --- > > Key: SPARK-39957 > URL: https://issues.apache.org/jira/browse/SPARK-39957 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.4.0 >Reporter: Kai-Hsun Chen >Assignee: Kai-Hsun Chen >Priority: Major > > There are two methods to detect executor loss. First, when RPC fails, the > function {{onDisconnected}} will be triggered. Second, when executor exits > with ExecutorExitCode, the exit code will be passed from ExecutorRunner to > Driver. These two methods may categorize same cases into different > conclusions. We hope to categorize the ExecutorLossReason by > ExecutorExitCode. This PR aims to make sure Driver receives ExecutorExitCode > before onDisconnected is called. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-39957) Delay onDisconnected to enable Driver receives ExecutorExitCode
[ https://issues.apache.org/jira/browse/SPARK-39957?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] wuyi updated SPARK-39957: - Fix Version/s: 3.4.0 > Delay onDisconnected to enable Driver receives ExecutorExitCode > --- > > Key: SPARK-39957 > URL: https://issues.apache.org/jira/browse/SPARK-39957 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.4.0 >Reporter: Kai-Hsun Chen >Assignee: Kai-Hsun Chen >Priority: Major > Fix For: 3.4.0 > > > There are two methods to detect executor loss. First, when RPC fails, the > function {{onDisconnected}} will be triggered. Second, when executor exits > with ExecutorExitCode, the exit code will be passed from ExecutorRunner to > Driver. These two methods may categorize same cases into different > conclusions. We hope to categorize the ExecutorLossReason by > ExecutorExitCode. This PR aims to make sure Driver receives ExecutorExitCode > before onDisconnected is called. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
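The SPARK-39957 description above is about a race between the RPC-level disconnect callback and the ExecutorExitCode reported through ExecutorRunner. A common way to resolve such a race is to defer the disconnect-based loss decision briefly so the exit code, if any, can arrive first. The sketch below only illustrates that idea with hypothetical names (DelayedDisconnectHandler, defaultDelayMs); it is not the change merged in the linked pull request.

{code:scala}
// Illustrative sketch: on disconnect, wait briefly before falling back to a
// generic loss reason, so an ExecutorExitCode reported by ExecutorRunner can
// win the race. All names here are hypothetical.
import java.util.concurrent.{ConcurrentHashMap, Executors, TimeUnit}

class DelayedDisconnectHandler(removeExecutor: (String, String) => Unit,
                               defaultDelayMs: Long = 5000L) {
  private val scheduler = Executors.newSingleThreadScheduledExecutor()
  private val exitReasons = new ConcurrentHashMap[String, String]()

  // Called when ExecutorRunner reports an ExecutorExitCode for an executor.
  def onExitCode(executorId: String, reason: String): Unit = {
    exitReasons.put(executorId, reason)
    removeExecutor(executorId, reason)
  }

  // Called when the RPC connection drops; defer the decision briefly.
  def onDisconnected(executorId: String): Unit = {
    scheduler.schedule(new Runnable {
      override def run(): Unit = {
        // Only fall back to a generic reason if no exit code arrived in time.
        if (!exitReasons.containsKey(executorId)) {
          removeExecutor(executorId, "Remote RPC client disassociated")
        }
      }
    }, defaultDelayMs, TimeUnit.MILLISECONDS)
  }
}
{code}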
[jira] [Assigned] (SPARK-38752) Test the error class: UNSUPPORTED_DATATYPE
[ https://issues.apache.org/jira/browse/SPARK-38752?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Max Gekk reassigned SPARK-38752: Assignee: lvshaokang > Test the error class: UNSUPPORTED_DATATYPE > -- > > Key: SPARK-38752 > URL: https://issues.apache.org/jira/browse/SPARK-38752 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.4.0 >Reporter: Max Gekk >Assignee: lvshaokang >Priority: Minor > Labels: starter > > Add a test for the error classes *UNSUPPORTED_DATATYPE* to > QueryExecutionErrorsSuite. The test should cover the exception throw in > QueryExecutionErrors: > {code:scala} > def dataTypeUnsupportedError(dataType: String, failure: String): Throwable > = { > new SparkIllegalArgumentException(errorClass = "UNSUPPORTED_DATATYPE", > messageParameters = Array(dataType + failure)) > } > {code} > For example, here is a test for the error class *UNSUPPORTED_FEATURE*: > https://github.com/apache/spark/blob/34e3029a43d2a8241f70f2343be8285cb7f231b9/sql/core/src/test/scala/org/apache/spark/sql/errors/QueryCompilationErrorsSuite.scala#L151-L170 > +The test must have a check of:+ > # the entire error message > # sqlState if it is defined in the error-classes.json file > # the error class -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-38752) Test the error class: UNSUPPORTED_DATATYPE
[ https://issues.apache.org/jira/browse/SPARK-38752?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Max Gekk resolved SPARK-38752. -- Fix Version/s: 3.4.0 Resolution: Fixed Issue resolved by pull request 37640 [https://github.com/apache/spark/pull/37640] > Test the error class: UNSUPPORTED_DATATYPE > -- > > Key: SPARK-38752 > URL: https://issues.apache.org/jira/browse/SPARK-38752 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.4.0 >Reporter: Max Gekk >Assignee: lvshaokang >Priority: Minor > Labels: starter > Fix For: 3.4.0 > > > Add a test for the error classes *UNSUPPORTED_DATATYPE* to > QueryExecutionErrorsSuite. The test should cover the exception throw in > QueryExecutionErrors: > {code:scala} > def dataTypeUnsupportedError(dataType: String, failure: String): Throwable > = { > new SparkIllegalArgumentException(errorClass = "UNSUPPORTED_DATATYPE", > messageParameters = Array(dataType + failure)) > } > {code} > For example, here is a test for the error class *UNSUPPORTED_FEATURE*: > https://github.com/apache/spark/blob/34e3029a43d2a8241f70f2343be8285cb7f231b9/sql/core/src/test/scala/org/apache/spark/sql/errors/QueryCompilationErrorsSuite.scala#L151-L170 > +The test must have a check of:+ > # the entire error message > # sqlState if it is defined in the error-classes.json file > # the error class -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
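For orientation, a test of the shape SPARK-38752 asks for (inside QueryExecutionErrorsSuite) triggers the quoted error builder and asserts on the error class and message. This is a hedged sketch of the pattern, not the test that was merged; the exact message text has to be taken from error-classes.json.

{code:scala}
// Hedged sketch of the test pattern; meant to live inside
// QueryExecutionErrorsSuite so that QueryExecutionErrors is accessible.
import org.apache.spark.SparkIllegalArgumentException
import org.apache.spark.sql.errors.QueryExecutionErrors

test("UNSUPPORTED_DATATYPE: report unsupported data type") {
  val e = intercept[SparkIllegalArgumentException] {
    // Any code path ending in dataTypeUnsupportedError() works; calling the
    // builder directly keeps the sketch small.
    throw QueryExecutionErrors.dataTypeUnsupportedError("invalid_type", " is not supported")
  }
  assert(e.getErrorClass === "UNSUPPORTED_DATATYPE")
  // The real test should also check the full message and sqlState against
  // error-classes.json, as the ticket requires.
}
{code}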
[jira] [Resolved] (SPARK-40203) Add test cases for Spark Decimal
[ https://issues.apache.org/jira/browse/SPARK-40203?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Max Gekk resolved SPARK-40203. -- Fix Version/s: 3.4.0 Resolution: Fixed Issue resolved by pull request 37644 [https://github.com/apache/spark/pull/37644] > Add test cases for Spark Decimal > > > Key: SPARK-40203 > URL: https://issues.apache.org/jira/browse/SPARK-40203 > Project: Spark > Issue Type: Test > Components: SQL >Affects Versions: 3.4.0 >Reporter: jiaan.geng >Assignee: jiaan.geng >Priority: Major > Fix For: 3.4.0 > > > Spark Decimal have a lot of method without unit tests. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-40203) Add test cases for Spark Decimal
[ https://issues.apache.org/jira/browse/SPARK-40203?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Max Gekk reassigned SPARK-40203: Assignee: jiaan.geng > Add test cases for Spark Decimal > > > Key: SPARK-40203 > URL: https://issues.apache.org/jira/browse/SPARK-40203 > Project: Spark > Issue Type: Test > Components: SQL >Affects Versions: 3.4.0 >Reporter: jiaan.geng >Assignee: jiaan.geng >Priority: Major > > Spark Decimal have a lot of method without unit tests. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-39791) In Spark 3.0 standalone cluster mode, unable to customize driver JVM path
[ https://issues.apache.org/jira/browse/SPARK-39791?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17570717#comment-17570717 ] Obobj edited comment on SPARK-39791 at 8/24/22 10:49 AM: - [~hyukjin.kwon] Thanks for the reply. In standalone mode, javaHome is always empty, so childEnv.get("JAVA_HOME") is always taken was (Author: JIRAUSER292149): In standalone mode, javaHome always empty, so always take childEnv.get("JAVA_HOME") > In Spark 3.0 standalone cluster mode, unable to customize driver JVM path > - > > Key: SPARK-39791 > URL: https://issues.apache.org/jira/browse/SPARK-39791 > Project: Spark > Issue Type: Question > Components: Spark Submit >Affects Versions: 3.0.0 >Reporter: Obobj >Priority: Minor > Labels: spark-submit, standalone > Original Estimate: 0.5h > Remaining Estimate: 0.5h > > In Spark 3.0 standalone mode, the driver JVM path cannot be customized; instead > the JAVA_HOME of the spark-submit submission machine is used, but the JVM > paths of my submission machine and the cluster machines are different > {code:java} > launcher/src/main/java/org/apache/spark/launcher/AbstractCommandBuilder.java > List<String> buildJavaCommand(String extraClassPath) throws IOException { > List<String> cmd = new ArrayList<>(); > String firstJavaHome = firstNonEmpty(javaHome, > childEnv.get("JAVA_HOME"), > System.getenv("JAVA_HOME"), > System.getProperty("java.home")); {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
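The resolution order in the snippet quoted above explains the behaviour the reporter sees: in standalone cluster mode the builder's own javaHome field is never set, so the JAVA_HOME captured in the child environment (i.e. from the submitting machine) wins. A tiny sketch of that fallback chain, with hypothetical paths, makes the precedence explicit; it is illustrative only, not Spark code.

{code:scala}
// Sketch of the firstNonEmpty() precedence used by the launcher: the first
// non-empty candidate wins. In standalone cluster mode javaHome is null, so
// the submitter's JAVA_HOME (childEnv) is used even if the worker's java.home
// differs. All paths below are hypothetical.
def firstNonEmpty(candidates: String*): Option[String] =
  candidates.find(s => s != null && s.nonEmpty)

val javaHome: String = null                         // never set in standalone cluster mode
val childEnvJavaHome = "/opt/jdk-on-submit-machine" // captured from spark-submit's environment
val workerJavaHome = "/opt/jdk-on-worker"           // what the reporter actually wants used

println(firstNonEmpty(javaHome, childEnvJavaHome, workerJavaHome))
// => Some(/opt/jdk-on-submit-machine)
{code}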
[jira] [Updated] (SPARK-40206) Spark SQL Predict Pushdown for Hive Bucketed Table
[ https://issues.apache.org/jira/browse/SPARK-40206?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Raymond Tang updated SPARK-40206: - Labels: hive hive-buckets spark spark-sql (was: hive hive-buckets spark) > Spark SQL Predict Pushdown for Hive Bucketed Table > -- > > Key: SPARK-40206 > URL: https://issues.apache.org/jira/browse/SPARK-40206 > Project: Spark > Issue Type: Question > Components: Spark Core >Affects Versions: 3.3.0 >Reporter: Raymond Tang >Priority: Minor > Labels: hive, hive-buckets, spark, spark-sql > > Hi team, > I was testing out Hive bucket table features. One of the benefits as most > documentation suggested is that bucketed hive table can be used for query > filer/predict pushdown to improve query performance. > However through my exploration, that doesn't seem to be true. *Can you please > help to clarify if Spark SQL supports query optimizations when using Hive > bucketed table?* > > How to produce the issue: > Create a Hive 3 table using the following DDL: > {code:java} > create table test_db.bucket_table(user_id int, key string) > comment 'A bucketed table' > partitioned by(country string) > clustered by(user_id) sorted by (key) into 10 buckets > stored as ORC;{code} > And then insert into this table using the following PySpark script: > {code:java} > from pyspark.sql import SparkSession > appName = "PySpark Hive Bucketing Example" > master = "local" > # Create Spark session with Hive supported. > spark = SparkSession.builder \ > .appName(appName) \ > .master(master) \ > .enableHiveSupport() \ > .getOrCreate() > # prepare sample data for inserting into hive table > data = [] > countries = ['CN', 'AU'] > for i in range(0, 1000): > data.append([int(i), 'U'+str(i), countries[i % 2]]) > df = spark.createDataFrame(data, ['user_id', 'key', 'country']) > df.show() > # Save df to Hive table test_db.bucket_table > df.write.mode('append').insertInto('test_db.bucket_table') {code} > Then query the table using the following script: > {code:java} > from pyspark.sql import SparkSession > appName = "PySpark Hive Bucketing Example" > master = "local" > # Create Spark session with Hive supported. > spark = SparkSession.builder \ > .appName(appName) \ > .master(master) \ > .enableHiveSupport() \ > .getOrCreate() > df = spark.sql("""select * from test_db.bucket_table > where country='AU' and user_id=101 > """) > df.show() > df.explain(extended=True) {code} > I am expecting to read from only one bucket file in HDFS but instead Spark > scanned all bucket files in partition folder country=AU. 
> {code:java} > == Parsed Logical Plan == > 'Project [*] > - 'Filter (('country = AU) AND ('t1.user_id = 101)) > - 'SubqueryAlias t1 >- 'UnresolvedRelation [test_db, bucket_table], [], false > == Analyzed Logical Plan == > user_id: int, key: string, country: string > Project [user_id#20, key#21, country#22] > - Filter ((country#22 = AU) AND (user_id#20 = 101)) > - SubqueryAlias t1 >- SubqueryAlias spark_catalog.test_db.bucket_table > - Relation test_db.bucket_table[user_id#20,key#21,country#22] orc > == Optimized Logical Plan == > Filter (((isnotnull(country#22) AND isnotnull(user_id#20)) AND (country#22 = > AU)) AND (user_id#20 = 101)) > - Relation test_db.bucket_table[user_id#20,key#21,country#22] orc > == Physical Plan == > *(1) Filter (isnotnull(user_id#20) AND (user_id#20 = 101)) > - *(1) ColumnarToRow > - FileScan orc test_db.bucket_table[user_id#20,key#21,country#22] > Batched: true, DataFilters: [isnotnull(user_id#20), (user_id#20 = 101)], > Format: ORC, Location: InMemoryFileIndex(1 > paths)[hdfs://localhost:9000/user/hive/warehouse/test_db.db/bucket_table/coun..., > PartitionFilters: [isnotnull(country#22), (country#22 = AU)], PushedFilters: > [IsNotNull(user_id), EqualTo(user_id,101)], ReadSchema: > struct {code} > *Am I doing something wrong? or is it because Spark doesn't support it? Your > guidance and help will be appreciated.* > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
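For comparison, Spark's own (non-Hive) bucketing does expose bucket pruning in the physical plan: when a table is written with bucketBy and filtered on the bucket column, the FileScan node reports a SelectedBucketsCount. The sketch below, which assumes an existing SparkSession named spark and uses an arbitrary table name, is one way to observe that on a Spark-native bucketed table; it does not answer whether Hive-created bucketed tables get the same treatment, which is what SPARK-40206 is asking.

{code:scala}
// Sketch: write a Spark-native bucketed table and check the plan for bucket
// pruning (look for "SelectedBucketsCount" in the FileScan line of explain()).
// Table and column names are arbitrary examples.
import spark.implicits._

spark.sql("CREATE DATABASE IF NOT EXISTS test_db")

(0 until 1000).map(i => (i, s"U$i"))
  .toDF("user_id", "key")
  .write
  .mode("overwrite")
  .bucketBy(10, "user_id")
  .sortBy("key")
  .saveAsTable("test_db.spark_bucket_table")

spark.sql("SELECT * FROM test_db.spark_bucket_table WHERE user_id = 101").explain()
{code}

When pruning applies, the FileScan line reports something like "SelectedBucketsCount: 1 out of 10"; if that marker is absent, every bucket was read, which matches what the reporter observes for the Hive-created table.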
[jira] [Updated] (SPARK-40206) Spark SQL Predict Pushdown for Hive Bucketed Table
[ https://issues.apache.org/jira/browse/SPARK-40206?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Raymond Tang updated SPARK-40206: - Description: Hi team, I was testing out Hive bucket table features. One of the benefits as most documentation suggested is that bucketed hive table can be used for query filer/predict pushdown to improve query performance. However through my exploration, that doesn't seem to be true. *Can you please help to clarify if Spark SQL supports query optimizations when using Hive bucketed table?* How to produce the issue: Create a Hive 3 table using the following DDL: {code:java} create table test_db.bucket_table(user_id int, key string) comment 'A bucketed table' partitioned by(country string) clustered by(user_id) sorted by (key) into 10 buckets stored as ORC;{code} And then insert into this table using the following PySpark script: {code:java} from pyspark.sql import SparkSession appName = "PySpark Hive Bucketing Example" master = "local" # Create Spark session with Hive supported. spark = SparkSession.builder \ .appName(appName) \ .master(master) \ .enableHiveSupport() \ .getOrCreate() # prepare sample data for inserting into hive table data = [] countries = ['CN', 'AU'] for i in range(0, 1000): data.append([int(i), 'U'+str(i), countries[i % 2]]) df = spark.createDataFrame(data, ['user_id', 'key', 'country']) df.show() # Save df to Hive table test_db.bucket_table df.write.mode('append').insertInto('test_db.bucket_table') {code} Then query the table using the following script: {code:java} from pyspark.sql import SparkSession appName = "PySpark Hive Bucketing Example" master = "local" # Create Spark session with Hive supported. spark = SparkSession.builder \ .appName(appName) \ .master(master) \ .enableHiveSupport() \ .getOrCreate() df = spark.sql("""select * from test_db.bucket_table where country='AU' and user_id=101 """) df.show() df.explain(extended=True) {code} I am expecting to read from only one bucket file in HDFS but instead Spark scanned all bucket files in partition folder country=AU. {code:java} == Parsed Logical Plan == 'Project [*] - 'Filter (('country = AU) AND ('t1.user_id = 101)) - 'SubqueryAlias t1 - 'UnresolvedRelation [test_db, bucket_table], [], false == Analyzed Logical Plan == user_id: int, key: string, country: string Project [user_id#20, key#21, country#22] - Filter ((country#22 = AU) AND (user_id#20 = 101)) - SubqueryAlias t1 - SubqueryAlias spark_catalog.test_db.bucket_table - Relation test_db.bucket_table[user_id#20,key#21,country#22] orc == Optimized Logical Plan == Filter (((isnotnull(country#22) AND isnotnull(user_id#20)) AND (country#22 = AU)) AND (user_id#20 = 101)) - Relation test_db.bucket_table[user_id#20,key#21,country#22] orc == Physical Plan == *(1) Filter (isnotnull(user_id#20) AND (user_id#20 = 101)) - *(1) ColumnarToRow - FileScan orc test_db.bucket_table[user_id#20,key#21,country#22] Batched: true, DataFilters: [isnotnull(user_id#20), (user_id#20 = 101)], Format: ORC, Location: InMemoryFileIndex(1 paths)[hdfs://localhost:9000/user/hive/warehouse/test_db.db/bucket_table/coun..., PartitionFilters: [isnotnull(country#22), (country#22 = AU)], PushedFilters: [IsNotNull(user_id), EqualTo(user_id,101)], ReadSchema: struct {code} *Am I doing something wrong? or is it because Spark doesn't support it? Your guidance and help will be appreciated.* was: Hi team, I was testing out Hive bucket table features. 
One of the benefits as most documentation suggested is that bucketed hive table can be used for query filer/predict pushdown to improve query performance. However through my exploration, that doesn't seem to be true. *Can you please help to clarify if Spark SQL supports query optimizations when using Hive bucketed table?* How to produce the issue: Create a Hive 3 table using the following DDL: {code:java} create table test_db.bucket_table(user_id int, key string) comment 'A bucketed table' partitioned by(country string) clustered by(user_id) sorted by (key) into 10 buckets stored as ORC;{code} And then insert into this table using the following PySpark script: {code:java} from pyspark.sql import SparkSession appName = "PySpark Hive Bucketing Example" master = "local" # Create Spark session with Hive supported. spark = SparkSession.builder \ .appName(appName) \ .master(master) \ .enableHiveSupport() \ .getOrCreate() # prepare sample data for inserting into hive table data = [] countries = ['CN', 'AU'] for i in range(0, 1000): data.append([int(i), 'U'+str(i), countries[i % 2]]) df = spark.createDataFrame(data, ['user_id', 'key', 'country']) df.show() # Save df to Hive table test_db.bucket_table df.write.mode('append').insertInto('test_db.bucket_table') {code} Then query the table using the following script: {code:java}
[jira] [Updated] (SPARK-40206) Spark SQL Predict Pushdown for Hive Bucketed Table
[ https://issues.apache.org/jira/browse/SPARK-40206?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Raymond Tang updated SPARK-40206: - Description: Hi team, I was testing out Hive bucket table features. One of the benefits as most documentation suggested is that bucketed hive table can be used for query filer/predict pushdown to improve query performance. However through my exploration, that doesn't seem to be true. *Can you please help to clarify if Spark SQL supports query optimizations when using Hive bucketed table?* How to produce the issue: Create a Hive 3 table using the following DDL: {code:java} create table test_db.bucket_table(user_id int, key string) comment 'A bucketed table' partitioned by(country string) clustered by(user_id) sorted by (key) into 10 buckets stored as ORC;{code} And then insert into this table using the following PySpark script: {code:java} from pyspark.sql import SparkSession appName = "PySpark Hive Bucketing Example" master = "local" # Create Spark session with Hive supported. spark = SparkSession.builder \ .appName(appName) \ .master(master) \ .enableHiveSupport() \ .getOrCreate() # prepare sample data for inserting into hive table data = [] countries = ['CN', 'AU'] for i in range(0, 1000): data.append([int(i), 'U'+str(i), countries[i % 2]]) df = spark.createDataFrame(data, ['user_id', 'key', 'country']) df.show() # Save df to Hive table test_db.bucket_table df.write.mode('append').insertInto('test_db.bucket_table') {code} Then query the table using the following script: {code:java} from pyspark.sql import SparkSession appName = "PySpark Hive Bucketing Example" master = "local" # Create Spark session with Hive supported. spark = SparkSession.builder \ .appName(appName) \ .master(master) \ .enableHiveSupport() \ .getOrCreate() df = spark.sql("""select * from test_db.bucket_table where country='AU' and user_id=101 """) df.show() df.explain(extended=True) {code} I am expecting to read from only one bucket file in HDFS but instead Spark scanned all bucket files in partition folder country=AU. Am I doing something wrong? or is it because Spark doesn't support it? Your guidance and help will be appreciated. was: Hi team, I was testing out Hive bucket table features. One of the benefits as most documentation suggested is that bucketed hive table can be used for query filer/predict pushdown to improve query performance. However through my exploration, that doesn't seem to be true. *Can you please help to clarify if Spark SQL supports query optimizations when using Hive bucketed table?* How to produce the issue: Create a Hive 3 table using the following DDL: {code:java} create table test_db.bucket_table(user_id int, key string) comment 'A bucketed table' partitioned by(country string) clustered by(user_id) sorted by (key) into 10 buckets stored as ORC;{code} And then insert into this table using the following PySpark script: {code:java} from pyspark.sql import SparkSession appName = "PySpark Hive Bucketing Example" master = "local" # Create Spark session with Hive supported. 
spark = SparkSession.builder \ .appName(appName) \ .master(master) \ .enableHiveSupport() \ .getOrCreate() # prepare sample data for inserting into hive table data = [] countries = ['CN', 'AU'] for i in range(0, 1000): data.append([int(i), 'U'+str(i), countries[i % 2]]) df = spark.createDataFrame(data, ['country', 'user_id', 'key']) df.show() # Save df to Hive table test_db.bucket_table df.write.mode('append').insertInto('test_db.bucket_table') {code} Then query the table using the following script: {code:java} from pyspark.sql import SparkSession appName = "PySpark Hive Bucketing Example" master = "local" # Create Spark session with Hive supported. spark = SparkSession.builder \ .appName(appName) \ .master(master) \ .enableHiveSupport() \ .getOrCreate() df = spark.sql("""select * from test_db.bucket_table where country='AU' and user_id=101 """) df.show() df.explain(extended=True) {code} I am expecting to read from only one bucket file in HDFS but instead Spark scanned all bucket files in partition folder country=AU. Am I doing something wrong? or is it because Spark doesn't support it? Your guidance and help will be appreciated. > Spark SQL Predict Pushdown for Hive Bucketed Table > -- > > Key: SPARK-40206 > URL: https://issues.apache.org/jira/browse/SPARK-40206 > Project: Spark > Issue Type: Question > Components: Spark Core >Affects Versions: 3.3.0 >Reporter: Raymond Tang >Priority: Minor > Labels: hive, hive-buckets, spark > > Hi team, > I was testing out Hive bucket table features. One of the benefits as most > documentation suggested is that
[jira] [Created] (SPARK-40206) Spark SQL Predict Pushdown for Hive Bucketed Table
Raymond Tang created SPARK-40206: Summary: Spark SQL Predict Pushdown for Hive Bucketed Table Key: SPARK-40206 URL: https://issues.apache.org/jira/browse/SPARK-40206 Project: Spark Issue Type: Question Components: Spark Core Affects Versions: 3.3.0 Reporter: Raymond Tang Hi team, I was testing out Hive bucket table features. One of the benefits as most documentation suggested is that bucketed hive table can be used for query filer/predict pushdown to improve query performance. However through my exploration, that doesn't seem to be true. *Can you please help to clarify if Spark SQL supports query optimizations when using Hive bucketed table?* How to produce the issue: Create a Hive 3 table using the following DDL: {code:java} create table test_db.bucket_table(user_id int, key string) comment 'A bucketed table' partitioned by(country string) clustered by(user_id) sorted by (key) into 10 buckets stored as ORC;{code} And then insert into this table using the following PySpark script: {code:java} from pyspark.sql import SparkSession appName = "PySpark Hive Bucketing Example" master = "local" # Create Spark session with Hive supported. spark = SparkSession.builder \ .appName(appName) \ .master(master) \ .enableHiveSupport() \ .getOrCreate() # prepare sample data for inserting into hive table data = [] countries = ['CN', 'AU'] for i in range(0, 1000): data.append([int(i), 'U'+str(i), countries[i % 2]]) df = spark.createDataFrame(data, ['country', 'user_id', 'key']) df.show() # Save df to Hive table test_db.bucket_table df.write.mode('append').insertInto('test_db.bucket_table') {code} Then query the table using the following script: {code:java} from pyspark.sql import SparkSession appName = "PySpark Hive Bucketing Example" master = "local" # Create Spark session with Hive supported. spark = SparkSession.builder \ .appName(appName) \ .master(master) \ .enableHiveSupport() \ .getOrCreate() df = spark.sql("""select * from test_db.bucket_table where country='AU' and user_id=101 """) df.show() df.explain(extended=True) {code} I am expecting to read from only one bucket file in HDFS but instead Spark scanned all bucket files in partition folder country=AU. Am I doing something wrong? or is it because Spark doesn't support it? Your guidance and help will be appreciated. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-40205) Provide a query context of ELEMENT_AT_BY_INDEX_ZERO
[ https://issues.apache.org/jira/browse/SPARK-40205?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-40205: Assignee: Max Gekk (was: Apache Spark) > Provide a query context of ELEMENT_AT_BY_INDEX_ZERO > --- > > Key: SPARK-40205 > URL: https://issues.apache.org/jira/browse/SPARK-40205 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.4.0 >Reporter: Max Gekk >Assignee: Max Gekk >Priority: Major > > Pass a query context to elementAtByIndexZeroError() in ElementAt -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-40205) Provide a query context of ELEMENT_AT_BY_INDEX_ZERO
[ https://issues.apache.org/jira/browse/SPARK-40205?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17584118#comment-17584118 ] Apache Spark commented on SPARK-40205: -- User 'MaxGekk' has created a pull request for this issue: https://github.com/apache/spark/pull/37645 > Provide a query context of ELEMENT_AT_BY_INDEX_ZERO > --- > > Key: SPARK-40205 > URL: https://issues.apache.org/jira/browse/SPARK-40205 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.4.0 >Reporter: Max Gekk >Assignee: Max Gekk >Priority: Major > > Pass a query context to elementAtByIndexZeroError() in ElementAt -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-40205) Provide a query context of ELEMENT_AT_BY_INDEX_ZERO
[ https://issues.apache.org/jira/browse/SPARK-40205?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-40205: Assignee: Apache Spark (was: Max Gekk) > Provide a query context of ELEMENT_AT_BY_INDEX_ZERO > --- > > Key: SPARK-40205 > URL: https://issues.apache.org/jira/browse/SPARK-40205 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.4.0 >Reporter: Max Gekk >Assignee: Apache Spark >Priority: Major > > Pass a query context to elementAtByIndexZeroError() in ElementAt -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-40205) Provide a query context of ELEMENT_AT_BY_INDEX_ZERO
Max Gekk created SPARK-40205: Summary: Provide a query context of ELEMENT_AT_BY_INDEX_ZERO Key: SPARK-40205 URL: https://issues.apache.org/jira/browse/SPARK-40205 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 3.4.0 Reporter: Max Gekk Assignee: Max Gekk Pass a query context to elementAtByIndexZeroError() in ElementAt -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
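For context, the ELEMENT_AT_BY_INDEX_ZERO error class fires when element_at is called with index 0 (array indices in Spark SQL start at 1), and the query context mentioned in SPARK-40205 is the fragment of the original SQL attached to that error. A hedged reproduction, assuming an existing SparkSession named spark:

{code:scala}
// Illustrative reproduction of the error the query context is attached to.
// The exact error message wording may differ between Spark versions.
spark.sql("SELECT element_at(array(1, 2, 3), 0)").collect()
// Expected: an error with error class ELEMENT_AT_BY_INDEX_ZERO, since
// element_at indices are 1-based (or negative, counting from the end).
{code}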
[jira] [Commented] (SPARK-40203) Add test cases for Spark Decimal
[ https://issues.apache.org/jira/browse/SPARK-40203?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17584082#comment-17584082 ] Apache Spark commented on SPARK-40203: -- User 'beliefer' has created a pull request for this issue: https://github.com/apache/spark/pull/37644 > Add test cases for Spark Decimal > > > Key: SPARK-40203 > URL: https://issues.apache.org/jira/browse/SPARK-40203 > Project: Spark > Issue Type: Test > Components: SQL >Affects Versions: 3.4.0 >Reporter: jiaan.geng >Priority: Major > > Spark Decimal have a lot of method without unit tests. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-40203) Add test cases for Spark Decimal
[ https://issues.apache.org/jira/browse/SPARK-40203?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-40203: Assignee: Apache Spark > Add test cases for Spark Decimal > > > Key: SPARK-40203 > URL: https://issues.apache.org/jira/browse/SPARK-40203 > Project: Spark > Issue Type: Test > Components: SQL >Affects Versions: 3.4.0 >Reporter: jiaan.geng >Assignee: Apache Spark >Priority: Major > > Spark Decimal have a lot of method without unit tests. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-40203) Add test cases for Spark Decimal
[ https://issues.apache.org/jira/browse/SPARK-40203?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-40203: Assignee: (was: Apache Spark) > Add test cases for Spark Decimal > > > Key: SPARK-40203 > URL: https://issues.apache.org/jira/browse/SPARK-40203 > Project: Spark > Issue Type: Test > Components: SQL >Affects Versions: 3.4.0 >Reporter: jiaan.geng >Priority: Major > > Spark Decimal have a lot of method without unit tests. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-40203) Add test cases for Spark Decimal
[ https://issues.apache.org/jira/browse/SPARK-40203?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17584085#comment-17584085 ] Apache Spark commented on SPARK-40203: -- User 'beliefer' has created a pull request for this issue: https://github.com/apache/spark/pull/37644 > Add test cases for Spark Decimal > > > Key: SPARK-40203 > URL: https://issues.apache.org/jira/browse/SPARK-40203 > Project: Spark > Issue Type: Test > Components: SQL >Affects Versions: 3.4.0 >Reporter: jiaan.geng >Priority: Major > > Spark Decimal have a lot of method without unit tests. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-40203) Add test cases for Spark Decimal
jiaan.geng created SPARK-40203: -- Summary: Add test cases for Spark Decimal Key: SPARK-40203 URL: https://issues.apache.org/jira/browse/SPARK-40203 Project: Spark Issue Type: Test Components: SQL Affects Versions: 3.4.0 Reporter: jiaan.geng Spark Decimal have a lot of method without unit tests. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
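The class SPARK-40203 refers to is org.apache.spark.sql.types.Decimal. As a flavour of the kind of unit coverage the ticket asks for, here is a minimal sketch (not the merged tests) exercising construction, precision/scale, conversions, and changePrecision; the values are chosen purely for illustration.

{code:scala}
// Minimal sketch of Decimal unit checks; not the tests added by the PR.
import org.apache.spark.sql.types.Decimal

val d = Decimal("123.45")
assert(d.precision == 5 && d.scale == 2)
assert(d.toDouble == 123.45)
assert(Decimal(BigDecimal("1.005")).toJavaBigDecimal.compareTo(new java.math.BigDecimal("1.005")) == 0)

// changePrecision mutates the value and returns false when it does not fit.
assert(Decimal("123.45").changePrecision(4, 1))   // rounds to 123.5, which fits (4, 1)
assert(!Decimal("123.45").changePrecision(2, 0))  // 123 needs precision 3, does not fit (2, 0)
{code}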
[jira] [Created] (SPARK-40204) Whether it is possible to support querying the status of a specific application in a subsequent version
bitao created SPARK-40204: - Summary: Whether it is possible to support querying the status of a specific application in a subsequent version Key: SPARK-40204 URL: https://issues.apache.org/jira/browse/SPARK-40204 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 2.4.6, 2.4.4 Environment: Standalone Cluster Mode Reporter: bitao The current SparkAppHandle cannot obtain the application status in standalone cluster mode. One way is to query the status of a specified driver through the StandaloneRestServer, but it cannot query the status of a specified application. Is it possible to add a method (e.g. handleAppStatus) to the StandaloneRestServer that asks the Master, via the RequestMasterState message, for the state of a specified application? The current MasterWebUI does this, but only because it uses the same RpcEnv as the Master endpoint. Often we care about the status of the application rather than the status of the driver, so we hope a later version can add this function to support obtaining the status of a specified application in standalone cluster mode. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
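The proposal in SPARK-40204 is essentially to let the REST server ask the Master for its state and pick out one application. A very rough sketch of that idea follows; it would have to live inside Spark's deploy package (like StandaloneRestServer itself), the helper name handleAppStatus comes from the ticket, and none of this is an existing Spark API.

{code:scala}
// Very rough sketch of the proposed handleAppStatus idea: ask the Master for
// its state over RPC and look up one application by id. Written as code that
// would sit inside Spark's deploy package; this is not an existing API.
import org.apache.spark.rpc.RpcEndpointRef
import org.apache.spark.deploy.DeployMessages.{MasterStateResponse, RequestMasterState}

def handleAppStatus(master: RpcEndpointRef, appId: String): Option[String] = {
  val state = master.askSync[MasterStateResponse](RequestMasterState)
  (state.activeApps ++ state.completedApps)
    .find(_.id == appId)
    .map(_.state.toString)   // e.g. WAITING, RUNNING, FINISHED, KILLED
}
{code}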
[jira] [Resolved] (SPARK-40180) Format error messages by spark-sql
[ https://issues.apache.org/jira/browse/SPARK-40180?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Max Gekk resolved SPARK-40180. -- Fix Version/s: 3.4.0 Resolution: Fixed Issue resolved by pull request 37590 [https://github.com/apache/spark/pull/37590] > Format error messages by spark-sql > -- > > Key: SPARK-40180 > URL: https://issues.apache.org/jira/browse/SPARK-40180 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.4.0 >Reporter: Max Gekk >Assignee: Max Gekk >Priority: Major > Fix For: 3.4.0 > > > Respect the SQL config spark.sql.error.messageFormat in the implementation of > the SQL CLI: spark-sql. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-40176) Enhance collapse window optimization to work in case partition or order by keys are expressions
[ https://issues.apache.org/jira/browse/SPARK-40176?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-40176: - Fix Version/s: (was: 3.3.1) > Enhance collapse window optimization to work in case partition or order by > keys are expressions > --- > > Key: SPARK-40176 > URL: https://issues.apache.org/jira/browse/SPARK-40176 > Project: Spark > Issue Type: Task > Components: SQL >Affects Versions: 3.2.0, 3.2.1, 3.3.0 >Reporter: Ayushi Agarwal >Priority: Major > > In window operator with multiple window functions, if any expression is > present in partition by or sort order columns, windows are not collapsed even > if partition and order by expression is the same for all those window functions. > E.g. query: > val w = > Window.partitionBy("key").orderBy(lower(col("value"))) > df.select(lead("key", 1).over(w), lead("value", 1).over(w)) > Current Plan: > -Window(lead(value,1), key, _w1) -- W1 > - Sort (key, _w1) > -Project (lower(“value”) as _w1) - P1 > -Window(lead(key,1), key, _w0) W2 > -Sort(key, _w0) > -Exchange(key) > -Project (lower(“value”) as _w0) P2 > -Scan > > W1 and W2 can be merged into a single window -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-40181) DataFrame.intersect and .intersectAll are inconsistently dropping rows
[ https://issues.apache.org/jira/browse/SPARK-40181?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-40181: - Component/s: SQL > DataFrame.intersect and .intersectAll are inconsistently dropping rows > -- > > Key: SPARK-40181 > URL: https://issues.apache.org/jira/browse/SPARK-40181 > Project: Spark > Issue Type: Bug > Components: PySpark, SQL >Affects Versions: 3.0.1 >Reporter: Luke >Priority: Major > > I don't have a minimal reproducible example for this, but the place where it > shows up in our workflow is very simple. > The data in "COLUMN" are a few hundred million distinct strings (gets > deduplicated in the plan also) and it is being compared against itself using > intersect. > The code that is failing is essentially: > {quote}values = [...] # python list containing many unique strings, none of > which are None > df = spark.createDataFrame( > spark.sparkContext.parallelize( > [(value,) for value in values], numSlices=2 + len(values) // 1 > ), > schema=StructType([StructField("COLUMN", StringType())]), > ) > df = df.distinct() > assert df.count() == df.intersect(df).count() > assert df.count() == df.intersectAll(df).count() > {quote} > The issue is that both of the above asserts sometimes pass, and sometimes > fail (technically we haven't seen intersectAll pass yet, but we have only > tried a few times). One thing which is striking is that if you call > df.intersect(df).count() multiple times, the returned count is not always the > same. Sometimes it is exactly df.count(), sometimes it is ~1% lower, but how > much lower exactly seems random. > In particular, we have called df.intersect(df).count() twice in a row, and > got two different counts, which is very surprising given that df should be > deterministic, and suggests maybe there is some kind of > concurrency/inconsistent hashing issue? > One other thing which is possibly noteworthy is that using df.join(df, > df.columns, how="inner") does seem to reliably have the desired behavior (not > dropping any rows). 
> Here is the resulting plan from df.intersect(df) > {quote}== Parsed Logical Plan == > 'Intersect false > :- Deduplicate [COLUMN#144487] > : +- LogicalRDD [COLUMN#144487], false > +- Deduplicate [COLUMN#144487] > +- LogicalRDD [COLUMN#144487], false > == Analyzed Logical Plan == > COLUMN: string > Intersect false > :- Deduplicate [COLUMN#144487] > : +- LogicalRDD [COLUMN#144487], false > +- Deduplicate [COLUMN#144523] > +- LogicalRDD [COLUMN#144523], false > == Optimized Logical Plan == > Aggregate [COLUMN#144487], [COLUMN#144487] > +- Join LeftSemi, (COLUMN#144487 <=> COLUMN#144523) > :- LogicalRDD [COLUMN#144487], false > +- Aggregate [COLUMN#144523], [COLUMN#144523] > +- LogicalRDD [COLUMN#144523], false > == Physical Plan == > *(7) HashAggregate(keys=[COLUMN#144487], functions=[], output=[COLUMN#144487]) > +- Exchange hashpartitioning(COLUMN#144487, 200), true, [id=#22790] > +- *(6) HashAggregate(keys=[COLUMN#144487], functions=[], > output=[COLUMN#144487]) > +- *(6) SortMergeJoin [coalesce(COLUMN#144487, ), > isnull(COLUMN#144487)], [coalesce(COLUMN#144523, ), isnull(COLUMN#144523)], > LeftSemi > :- *(2) Sort [coalesce(COLUMN#144487, ) ASC NULLS FIRST, > isnull(COLUMN#144487) ASC NULLS FIRST], false, 0 > : +- Exchange hashpartitioning(coalesce(COLUMN#144487, ), > isnull(COLUMN#144487), 200), true, [id=#22772] > : +- *(1) Scan ExistingRDD[COLUMN#144487] > +- *(5) Sort [coalesce(COLUMN#144523, ) ASC NULLS FIRST, > isnull(COLUMN#144523) ASC NULLS FIRST], false, 0 > +- Exchange hashpartitioning(coalesce(COLUMN#144523, ), > isnull(COLUMN#144523), 200), true, [id=#22782] > +- *(4) HashAggregate(keys=[COLUMN#144523], functions=[], > output=[COLUMN#144523]) > +- Exchange hashpartitioning(COLUMN#144523, 200), true, > [id=#22778] > +- *(3) HashAggregate(keys=[COLUMN#144523], > functions=[], output=[COLUMN#144523]) > +- *(3) Scan ExistingRDD[COLUMN#144523] > {quote} > and for df.intersectAll(df) > {quote}== Parsed Logical Plan == > 'IntersectAll true > :- Deduplicate [COLUMN#144487] > : +- LogicalRDD [COLUMN#144487], false > +- Deduplicate [COLUMN#144487] > +- LogicalRDD [COLUMN#144487], false > == Analyzed Logical Plan == > COLUMN: string > IntersectAll true > :- Deduplicate [COLUMN#144487] > : +- LogicalRDD [COLUMN#144487], false > +- Deduplicate [COLUMN#144533] > +- LogicalRDD [COLUMN#144533], false > == Optimized Logical Plan == > Project [COLUMN#144487] > +- Generate replicaterows(min_count#144566L, COLUMN#144487), [1], false, > [COLUMN#144487] > +- Project [COLUMN#144487, if
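The SPARK-40181 report above notes that an inner join on all columns behaves deterministically where intersect does not. A hedged Scala equivalent of that workaround (the PySpark original is in the report; column handling here is illustrative):

{code:scala}
// Sketch of the join-based workaround mentioned in the report: an inner join
// on every column followed by distinct approximates intersect for this case.
// Note that, unlike intersect, a plain join drops rows whose key columns are
// null; the reporter's data contains no nulls, so that is acceptable here.
import org.apache.spark.sql.DataFrame

def intersectViaJoin(left: DataFrame, right: DataFrame): DataFrame =
  left.join(right, left.columns.toSeq, "inner").distinct()

// intersectViaJoin(df, df).count() should match df.distinct().count()
{code}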
[jira] [Updated] (SPARK-40176) Enhance collapse window optimization to work in case partition or order by keys are expressions
[ https://issues.apache.org/jira/browse/SPARK-40176?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-40176: - Target Version/s: (was: 3.3.1) > Enhance collapse window optimization to work in case partition or order by > keys are expressions > --- > > Key: SPARK-40176 > URL: https://issues.apache.org/jira/browse/SPARK-40176 > Project: Spark > Issue Type: Task > Components: SQL >Affects Versions: 3.2.0, 3.2.1, 3.3.0 >Reporter: Ayushi Agarwal >Priority: Major > Fix For: 3.3.1 > > > In window operator with multiple window functions, if any expression is > present in partition by or sort order columns, windows are not collapsed even > if partition and order by expression is the same for all those window functions. > E.g. query: > val w = > Window.partitionBy("key").orderBy(lower(col("value"))) > df.select(lead("key", 1).over(w), lead("value", 1).over(w)) > Current Plan: > -Window(lead(value,1), key, _w1) -- W1 > - Sort (key, _w1) > -Project (lower(“value”) as _w1) - P1 > -Window(lead(key,1), key, _w0) W2 > -Sort(key, _w0) > -Exchange(key) > -Project (lower(“value”) as _w0) P2 > -Scan > > W1 and W2 can be merged into a single window -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
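To make the SPARK-40176 example easier to read, here is the same two-window query with the Jira markup noise removed, assuming an existing DataFrame df with columns key and value. Both lead() calls share the partition key and the lower(col("value")) ordering expression, which is why one would expect a single Window node after the optimization.

{code:scala}
// Cleaned-up version of the query from the ticket: both lead() calls use the
// same window spec, whose ORDER BY key is an expression (lower(value)).
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, lead, lower}

val w = Window.partitionBy("key").orderBy(lower(col("value")))
val result = df.select(lead("key", 1).over(w), lead("value", 1).over(w))
result.explain()  // without the optimization, two Window nodes appear in the plan
{code}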
[jira] [Commented] (SPARK-39528) Use V2 Filter in SupportsRuntimeFiltering
[ https://issues.apache.org/jira/browse/SPARK-39528?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17584012#comment-17584012 ] Apache Spark commented on SPARK-39528: -- User 'huaxingao' has created a pull request for this issue: https://github.com/apache/spark/pull/37643 > Use V2 Filter in SupportsRuntimeFiltering > - > > Key: SPARK-39528 > URL: https://issues.apache.org/jira/browse/SPARK-39528 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.4.0 >Reporter: Huaxin Gao >Assignee: Huaxin Gao >Priority: Major > Fix For: 3.4.0 > > > Currently, SupportsRuntimeFiltering uses v1 filter. We should use v2 filter > instead. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-39528) Use V2 Filter in SupportsRuntimeFiltering
[ https://issues.apache.org/jira/browse/SPARK-39528?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17584011#comment-17584011 ] Apache Spark commented on SPARK-39528: -- User 'huaxingao' has created a pull request for this issue: https://github.com/apache/spark/pull/37643 > Use V2 Filter in SupportsRuntimeFiltering > - > > Key: SPARK-39528 > URL: https://issues.apache.org/jira/browse/SPARK-39528 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.4.0 >Reporter: Huaxin Gao >Assignee: Huaxin Gao >Priority: Major > Fix For: 3.4.0 > > > Currently, SupportsRuntimeFiltering uses v1 filter. We should use v2 filter > instead. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org