[jira] [Updated] (SPARK-39858) Remove unnecessary AliasHelper or PredicateHelper for some rules
[ https://issues.apache.org/jira/browse/SPARK-39858?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] jiaan.geng updated SPARK-39858: --- Summary: Remove unnecessary AliasHelper or PredicateHelper for some rules (was: Remove unnecessary AliasHelper for some rules) > Remove unnecessary AliasHelper or PredicateHelper for some rules > > > Key: SPARK-39858 > URL: https://issues.apache.org/jira/browse/SPARK-39858 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.4.0 >Reporter: jiaan.geng >Priority: Major > > While using AliasHelper, I found that some rules extend it (or PredicateHelper) but do not actually use it.
[jira] [Created] (SPARK-39858) Remove unnecessary AliasHelper for some rules
jiaan.geng created SPARK-39858: -- Summary: Remove unnecessary AliasHelper for some rules Key: SPARK-39858 URL: https://issues.apache.org/jira/browse/SPARK-39858 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 2.4.0 Reporter: jiaan.geng While using AliasHelper, I found that some rules extend it but do not actually use it.
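For context, a minimal sketch of the pattern being cleaned up (the rule name below is hypothetical; PredicateHelper and splitConjunctivePredicates are real Catalyst APIs): a rule mixes in a helper trait without ever calling its methods, so the mix-in can simply be dropped.

{code:java}
import org.apache.spark.sql.catalyst.expressions.PredicateHelper
import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
import org.apache.spark.sql.catalyst.rules.Rule

// Hypothetical rule illustrating the anti-pattern: it mixes in
// PredicateHelper but never calls any helper method (such as
// splitConjunctivePredicates), so "with PredicateHelper" is dead weight.
object SomeCleanupRule extends Rule[LogicalPlan] with PredicateHelper {
  override def apply(plan: LogicalPlan): LogicalPlan = plan // no helper usage
}
{code}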
[jira] [Resolved] (SPARK-39837) Filesystem leak when running `TPC-DS queries with SF=1`
[ https://issues.apache.org/jira/browse/SPARK-39837?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yang Jie resolved SPARK-39837. -- Resolution: Not A Bug The connections are just closed late, not leaked. > Filesystem leak when running `TPC-DS queries with SF=1` > --- > > Key: SPARK-39837 > URL: https://issues.apache.org/jira/browse/SPARK-39837 > Project: Spark > Issue Type: Bug > Components: Tests >Affects Versions: 3.4.0 >Reporter: Yang Jie >Priority: Major > > The following log appears in the `TPC-DS queries with SF=1` GA logs: > > {code:java} > 2022-07-22T00:19:52.8539664Z 00:19:52.849 WARN > org.apache.spark.DebugFilesystem: Leaked filesystem connection created at: > 2022-07-22T00:19:52.8548926Z java.lang.Throwable > 2022-07-22T00:19:52.8568135Z at > org.apache.spark.DebugFilesystem$.addOpenStream(DebugFilesystem.scala:35) > 2022-07-22T00:19:52.8573547Z at > org.apache.spark.DebugFilesystem.open(DebugFilesystem.scala:75) > 2022-07-22T00:19:52.8574108Z at > org.apache.hadoop.fs.FileSystem.open(FileSystem.java:976) > 2022-07-22T00:19:52.8578427Z at > org.apache.parquet.hadoop.util.HadoopInputFile.newStream(HadoopInputFile.java:69) > 2022-07-22T00:19:52.8579211Z at > org.apache.parquet.hadoop.ParquetFileReader.<init>(ParquetFileReader.java:774) > 2022-07-22T00:19:52.8589698Z at > org.apache.spark.sql.execution.datasources.parquet.SpecificParquetRecordReaderBase.initialize(SpecificParquetRecordReaderBase.java:100) > 2022-07-22T00:19:52.8590842Z at > org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.initialize(VectorizedParquetRecordReader.java:175) > 2022-07-22T00:19:52.8594751Z at > org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat.$anonfun$buildReaderWithPartitionValues$1(ParquetFileFormat.scala:340) > 2022-07-22T00:19:52.8595634Z at > org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.org$apache$spark$sql$execution$datasources$FileScanRDD$$anon$$readCurrentFile(FileScanRDD.scala:211) > 2022-07-22T00:19:52.8598975Z at > org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:272) > 2022-07-22T00:19:52.8599639Z at > org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:118) > 2022-07-22T00:19:52.8602839Z at > org.apache.spark.sql.execution.FileSourceScanExec$$anon$1.hasNext(DataSourceScanExec.scala:583) > 2022-07-22T00:19:52.8603625Z at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage5.columnartorow_nextBatch_0$(Unknown > Source) > 2022-07-22T00:19:52.8606618Z at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage5.processNext(Unknown > Source) > 2022-07-22T00:19:52.8609954Z at > org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) > 2022-07-22T00:19:52.8620028Z at > org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:760) > 2022-07-22T00:19:52.8623148Z at > scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460) > 2022-07-22T00:19:52.8623812Z at > org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:140) > 2022-07-22T00:19:52.8627344Z at > org.apache.spark.shuffle.ShuffleWriteProcessor.write(ShuffleWriteProcessor.scala:59) > 2022-07-22T00:19:52.8628031Z at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:101) > 2022-07-22T00:19:52.8637881Z at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53) > 
2022-07-22T00:19:52.8638603Z at > org.apache.spark.scheduler.Task.run(Task.scala:139) > 2022-07-22T00:19:52.8644696Z at > org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:548) > 2022-07-22T00:19:52.8645352Z at > org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1490) > 2022-07-22T00:19:52.8649598Z at > org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:551) > 2022-07-22T00:19:52.8650238Z at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) > 2022-07-22T00:19:52.8657783Z at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) > 2022-07-22T00:19:52.8658260Z at java.lang.Thread.run(Thread.java:750){code} > > > The following GitHub Actions runs show similar logs: > * [https://github.com/apache/spark/runs/7460003953?check_suite_focus=true] > * [https://github.com/apache/spark/runs/7459868605?check_suite_focus=true] > * [https://github.com/apache/spark/runs/7460262731?check_suite_focus=true]
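The warning above comes from Spark's test-only DebugFilesystem, which remembers where every stream was opened so anything still open at check time can be reported with its creation stack trace. A simplified sketch of that tracking idea (not the actual Spark code) also shows why a late close can masquerade as a leak:

{code:java}
import java.util.concurrent.ConcurrentHashMap
import scala.collection.JavaConverters._

// Simplified sketch: record the creation site of every opened stream so
// that anything still open at check time can be reported.
object OpenStreamTracker {
  private val openStreams = new ConcurrentHashMap[AnyRef, Throwable]()

  def addOpenStream(stream: AnyRef): Unit =
    openStreams.put(stream, new Throwable()) // capture where it was opened

  def removeOpenStream(stream: AnyRef): Unit =
    openStreams.remove(stream)

  // A stream closed *after* this runs looks like a leak even though it is
  // merely closed late, which is what SPARK-39837 turned out to be.
  def reportOpenStreams(): Unit =
    openStreams.values().asScala.foreach { creationSite =>
      println("Leaked filesystem connection created at:")
      creationSite.getStackTrace.foreach(frame => println(s"  at $frame"))
    }
}
{code}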
[jira] [Assigned] (SPARK-39857) V2ExpressionBuilder uses the wrong LiteralValue data type for In predicate
[ https://issues.apache.org/jira/browse/SPARK-39857?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-39857: Assignee: Apache Spark > V2ExpressionBuilder uses the wrong LiteralValue data type for In predicate > -- > > Key: SPARK-39857 > URL: https://issues.apache.org/jira/browse/SPARK-39857 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.4.0 >Reporter: Huaxin Gao >Assignee: Apache Spark >Priority: Minor > > When building the V2 In predicate in V2ExpressionBuilder, InSet.dataType (which is BooleanType) is used to build the LiteralValue; InSet.child.dataType should be used instead.
[jira] [Commented] (SPARK-39857) V2ExpressionBuilder uses the wrong LiteralValue data type for In predicate
[ https://issues.apache.org/jira/browse/SPARK-39857?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17570622#comment-17570622 ] Apache Spark commented on SPARK-39857: -- User 'huaxingao' has created a pull request for this issue: https://github.com/apache/spark/pull/37271 > V2ExpressionBuilder uses the wrong LiteralValue data type for In predicate > -- > > Key: SPARK-39857 > URL: https://issues.apache.org/jira/browse/SPARK-39857 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.4.0 >Reporter: Huaxin Gao >Priority: Minor > > When building the V2 In predicate in V2ExpressionBuilder, InSet.dataType (which is BooleanType) is used to build the LiteralValue; InSet.child.dataType should be used instead.
[jira] [Commented] (SPARK-39857) V2ExpressionBuilder uses the wrong LiteralValue data type for In predicate
[ https://issues.apache.org/jira/browse/SPARK-39857?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17570621#comment-17570621 ] Apache Spark commented on SPARK-39857: -- User 'huaxingao' has created a pull request for this issue: https://github.com/apache/spark/pull/37271 > V2ExpressionBuilder uses the wrong LiteralValue data type for In predicate > -- > > Key: SPARK-39857 > URL: https://issues.apache.org/jira/browse/SPARK-39857 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.4.0 >Reporter: Huaxin Gao >Priority: Minor > > When building the V2 In predicate in V2ExpressionBuilder, InSet.dataType (which is BooleanType) is used to build the LiteralValue; InSet.child.dataType should be used instead.
[jira] [Assigned] (SPARK-39857) V2ExpressionBuilder uses the wrong LiteralValue data type for In predicate
[ https://issues.apache.org/jira/browse/SPARK-39857?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-39857: Assignee: (was: Apache Spark) > V2ExpressionBuilder uses the wrong LiteralValue data type for In predicate > -- > > Key: SPARK-39857 > URL: https://issues.apache.org/jira/browse/SPARK-39857 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.4.0 >Reporter: Huaxin Gao >Priority: Minor > > When building the V2 In predicate in V2ExpressionBuilder, InSet.dataType (which is BooleanType) is used to build the LiteralValue; InSet.child.dataType should be used instead.
[jira] [Created] (SPARK-39857) V2ExpressionBuilder uses the wrong LiteralValue data type for In predicate
Huaxin Gao created SPARK-39857: -- Summary: V2ExpressionBuilder uses the wrong LiteralValue data type for In predicate Key: SPARK-39857 URL: https://issues.apache.org/jira/browse/SPARK-39857 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.4.0 Reporter: Huaxin Gao When building the V2 In predicate in V2ExpressionBuilder, InSet.dataType (which is BooleanType) is used to build the LiteralValue; InSet.child.dataType should be used instead.
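A sketch of the described fix, using the real connector API types (LiteralValue, Predicate); the helper function and names below are illustrative, not the actual V2ExpressionBuilder code:

{code:java}
import org.apache.spark.sql.catalyst.expressions.InSet
import org.apache.spark.sql.connector.expressions.{Expression => V2Expression, LiteralValue}
import org.apache.spark.sql.connector.expressions.filter.Predicate

// Illustrative helper. For `col IN (1, 2, 3)`, each element literal must
// carry the type of the child expression; InSet.dataType is the type of the
// predicate itself (BooleanType), which is what the bug attached to every
// LiteralValue.
def buildV2InPredicate(inSet: InSet, column: V2Expression): Predicate = {
  val elementType = inSet.child.dataType // correct element type
  // val elementType = inSet.dataType    // the bug: always BooleanType
  val values = inSet.hset.toSeq.map(v => LiteralValue(v, elementType))
  new Predicate("IN", (column +: values).toArray)
}
{code}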
[jira] [Resolved] (SPARK-39856) Avoid OOM in TPC-DS build with SMJ
[ https://issues.apache.org/jira/browse/SPARK-39856?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-39856. -- Fix Version/s: 3.3.1 3.0.4 3.1.4 3.2.3 3.4.0 Resolution: Fixed Issue resolved by pull request 37270 [https://github.com/apache/spark/pull/37270] > Avoid OOM in TPC-DS build with SMJ > -- > > Key: SPARK-39856 > URL: https://issues.apache.org/jira/browse/SPARK-39856 > Project: Spark > Issue Type: Test > Components: SQL >Affects Versions: 3.4.0 >Reporter: Hyukjin Kwon >Assignee: Hyukjin Kwon >Priority: Major > Fix For: 3.3.1, 3.0.4, 3.1.4, 3.2.3, 3.4.0 > > > The TPC-DS build consistently fails (see https://github.com/apache/spark/runs/7491836477?check_suite_focus=true), presumably because of an out-of-memory error.
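For context, "with SMJ" refers to the TPC-DS job variant that runs the queries with sort-merge join. A sketch of how a test harness typically forces SMJ (both config keys are real Spark SQL configs; their use here is an assumption about the build setup, not taken from the ticket):

{code:java}
import org.apache.spark.sql.SparkSession

// Assumed test-harness setup: disabling broadcast joins (threshold -1) and
// preferring sort-merge join steers the planner to SMJ for every query.
val spark = SparkSession.builder()
  .master("local[2]")
  .config("spark.sql.autoBroadcastJoinThreshold", "-1")
  .config("spark.sql.join.preferSortMergeJoin", "true")
  .getOrCreate()
{code}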
[jira] [Assigned] (SPARK-39856) Avoid OOM in TPC-DS build with SMJ
[ https://issues.apache.org/jira/browse/SPARK-39856?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-39856: Assignee: Hyukjin Kwon > Avoid OOM in TPC-DS build with SMJ > -- > > Key: SPARK-39856 > URL: https://issues.apache.org/jira/browse/SPARK-39856 > Project: Spark > Issue Type: Test > Components: SQL >Affects Versions: 3.4.0 >Reporter: Hyukjin Kwon >Assignee: Hyukjin Kwon >Priority: Major > > The TPC-DS build consistently fails (see https://github.com/apache/spark/runs/7491836477?check_suite_focus=true), presumably because of an out-of-memory error.
[jira] [Resolved] (SPARK-39840) Factor PythonArrowInput out as a symmetry to PythonArrowOutput
[ https://issues.apache.org/jira/browse/SPARK-39840?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-39840. -- Fix Version/s: 3.4.0 Resolution: Fixed Issue resolved by pull request 37253 [https://github.com/apache/spark/pull/37253] > Factor PythonArrowInput out as a symmetry to PythonArrowOutput > -- > > Key: SPARK-39840 > URL: https://issues.apache.org/jira/browse/SPARK-39840 > Project: Spark > Issue Type: Improvement > Components: PySpark, SQL >Affects Versions: 3.4.0 >Reporter: Hyukjin Kwon >Assignee: Hyukjin Kwon >Priority: Major > Fix For: 3.4.0 > > > In https://issues.apache.org/jira/browse/SPARK-29317, we factored {{PythonArrowOutput}} out. It would be better to factor {{PythonArrowInput}} out too, for consistency.
[jira] [Assigned] (SPARK-39840) Factor PythonArrowInput out as a symmetry to PythonArrowOutput
[ https://issues.apache.org/jira/browse/SPARK-39840?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-39840: Assignee: Hyukjin Kwon > Factor PythonArrowInput out as a symmetry to PythonArrowOutput > -- > > Key: SPARK-39840 > URL: https://issues.apache.org/jira/browse/SPARK-39840 > Project: Spark > Issue Type: Improvement > Components: PySpark, SQL >Affects Versions: 3.4.0 >Reporter: Hyukjin Kwon >Assignee: Hyukjin Kwon >Priority: Major > > In https://issues.apache.org/jira/browse/SPARK-29317, we factored {{PythonArrowOutput}} out. It would be better to factor {{PythonArrowInput}} out too, for consistency.
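A rough sketch of the symmetry the ticket asks for; the trait shapes and signatures below are illustrative assumptions, not Spark's actual definitions:

{code:java}
import java.io.{DataInputStream, DataOutputStream}

// Illustrative trait shapes only. PythonArrowOutput covers "read Arrow
// batches coming back from the Python worker"; the ticket factors out the
// mirror image, "write input data as Arrow batches to the Python worker".
trait PythonArrowInput[IN] {
  protected def writeArrowBatches(input: Iterator[IN], out: DataOutputStream): Unit
}

trait PythonArrowOutput[OUT] {
  protected def readArrowBatches(in: DataInputStream): Iterator[OUT]
}
{code}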
[jira] [Commented] (SPARK-39856) Avoid OOM in TPC-DS build with SMJ
[ https://issues.apache.org/jira/browse/SPARK-39856?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17570588#comment-17570588 ] Apache Spark commented on SPARK-39856: -- User 'HyukjinKwon' has created a pull request for this issue: https://github.com/apache/spark/pull/37270 > Avoid OOM in TPC-DS build with SMJ > -- > > Key: SPARK-39856 > URL: https://issues.apache.org/jira/browse/SPARK-39856 > Project: Spark > Issue Type: Test > Components: SQL >Affects Versions: 3.4.0 >Reporter: Hyukjin Kwon >Priority: Major > > The TPC-DS build consistently fails (see https://github.com/apache/spark/runs/7491836477?check_suite_focus=true), presumably because of an out-of-memory error.
[jira] [Commented] (SPARK-39856) Avoid OOM in TPC-DS build with SMJ
[ https://issues.apache.org/jira/browse/SPARK-39856?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17570587#comment-17570587 ] Apache Spark commented on SPARK-39856: -- User 'HyukjinKwon' has created a pull request for this issue: https://github.com/apache/spark/pull/37270 > Avoid OOM in TPC-DS build with SMJ > -- > > Key: SPARK-39856 > URL: https://issues.apache.org/jira/browse/SPARK-39856 > Project: Spark > Issue Type: Test > Components: SQL >Affects Versions: 3.4.0 >Reporter: Hyukjin Kwon >Priority: Major > > The TPC-DS build consistently fails (see https://github.com/apache/spark/runs/7491836477?check_suite_focus=true), presumably because of an out-of-memory error.
[jira] [Assigned] (SPARK-39856) Avoid OOM in TPC-DS build with SMJ
[ https://issues.apache.org/jira/browse/SPARK-39856?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-39856: Assignee: Apache Spark > Avoid OOM in TPC-DS build with SMJ > -- > > Key: SPARK-39856 > URL: https://issues.apache.org/jira/browse/SPARK-39856 > Project: Spark > Issue Type: Test > Components: SQL >Affects Versions: 3.4.0 >Reporter: Hyukjin Kwon >Assignee: Apache Spark >Priority: Major > > The TPC-DS build consistently fails (see https://github.com/apache/spark/runs/7491836477?check_suite_focus=true), presumably because of an out-of-memory error.
[jira] [Assigned] (SPARK-39856) Avoid OOM in TPC-DS build with SMJ
[ https://issues.apache.org/jira/browse/SPARK-39856?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-39856: Assignee: (was: Apache Spark) > Avoid OOM in TPC-DS build with SMJ > -- > > Key: SPARK-39856 > URL: https://issues.apache.org/jira/browse/SPARK-39856 > Project: Spark > Issue Type: Test > Components: SQL >Affects Versions: 3.4.0 >Reporter: Hyukjin Kwon >Priority: Major > > The TPC-DS build consistently fails (see https://github.com/apache/spark/runs/7491836477?check_suite_focus=true), presumably because of an out-of-memory error.
[jira] [Created] (SPARK-39856) Avoid OOM in TPC-DS build with SMJ
Hyukjin Kwon created SPARK-39856: Summary: Avoid OOM in TPC-DS build with SMJ Key: SPARK-39856 URL: https://issues.apache.org/jira/browse/SPARK-39856 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.4.0 Reporter: Hyukjin Kwon The TPC-DS build consistently fails (see https://github.com/apache/spark/runs/7491836477?check_suite_focus=true), presumably because of an out-of-memory error.
[jira] [Updated] (SPARK-39856) Avoid OOM in TPC-DS build with SMJ
[ https://issues.apache.org/jira/browse/SPARK-39856?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-39856: - Issue Type: Test (was: Improvement) > Avoid OOM in TPC-DS build with SMJ > -- > > Key: SPARK-39856 > URL: https://issues.apache.org/jira/browse/SPARK-39856 > Project: Spark > Issue Type: Test > Components: SQL >Affects Versions: 3.4.0 >Reporter: Hyukjin Kwon >Priority: Major > > The TPC-DS build consistently fails (see https://github.com/apache/spark/runs/7491836477?check_suite_focus=true), presumably because of an out-of-memory error.
[jira] [Assigned] (SPARK-39854) Catalyst 'ColumnPruning' Optimizer does not play well with sql function 'explode'
[ https://issues.apache.org/jira/browse/SPARK-39854?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-39854: Assignee: Apache Spark > Catalyst 'ColumnPruning' Optimizer does not play well with sql function > 'explode' > - > > Key: SPARK-39854 > URL: https://issues.apache.org/jira/browse/SPARK-39854 > Project: Spark > Issue Type: Bug > Components: Optimizer >Affects Versions: 3.2.1, 3.3.0 > Environment: Spark version: the latest (3.4.0-SNAPSHOT) > OS: Ubuntu 20.04 > JDK: Amazon corretto-11.0.14.1 >Reporter: Jiaji Wu >Assignee: Apache Spark >Priority: Major > > The *ColumnPruning* optimizer batch does not always work with *explode* sql > function. > * Here's a code snippet to repro the issue: > > {code:java} > import spark.implicits._ > val testJson = > """{ > | "b": { > | "id": "id00", > | "data": [{ > | "b1": "vb1", > | "b2": 101, > | "ex2": [ > |{ "fb1": false, "fb2": 11, "fb3": "t1" }, > |{ "fb1": true, "fb2": 12, "fb3": "t2" } > | ]}, { > | "b1": "vb2", > | "b2": 102, > | "ex2": [ > |{ "fb1": false, "fb2": 13, "fb3": "t3" }, > |{ "fb1": true, "fb2": 14, "fb3": "t4" } > | ]} > | ], > | "fa": "tes", > | "v": "1.5" > | } > |} > |""".stripMargin > val df = spark.read.json((testJson :: Nil).toDS()) > .withColumn("ex_b", explode($"b.data.ex2")) > .withColumn("ex_b2", explode($"ex_b")) > val df1 = df > .withColumn("rt", struct( > $"b.fa".alias("rt_fa"), > $"b.v".alias("rt_v") > )) > .drop("b", "ex_b") > df1.show(false){code} > * the result exception: > {code:java} > Exception in thread "main" java.lang.IllegalStateException: Couldn't find > _extract_v#35 in [_extract_fa#36,ex_b2#13] > at > org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:80) > at > org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:73) > at > org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDownWithPruning$1(TreeNode.scala:584) > at > org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:176) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformDownWithPruning(TreeNode.scala:584) > at > org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDownWithPruning$3(TreeNode.scala:589) > at > scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:286) > at scala.collection.Iterator.foreach(Iterator.scala:943) > at scala.collection.Iterator.foreach$(Iterator.scala:943) > at scala.collection.AbstractIterator.foreach(Iterator.scala:1431) > at scala.collection.IterableLike.foreach(IterableLike.scala:74) > at scala.collection.IterableLike.foreach$(IterableLike.scala:73) > at scala.collection.AbstractIterable.foreach(Iterable.scala:56) > at scala.collection.TraversableLike.map(TraversableLike.scala:286) > at scala.collection.TraversableLike.map$(TraversableLike.scala:279) > at scala.collection.AbstractTraversable.map(Traversable.scala:108) > at > org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:698) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformDownWithPruning(TreeNode.scala:589) > at > org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDownWithPruning$3(TreeNode.scala:589) > at > org.apache.spark.sql.catalyst.trees.UnaryLike.mapChildren(TreeNode.scala:1196) > at > org.apache.spark.sql.catalyst.trees.UnaryLike.mapChildren$(TreeNode.scala:1195) > at > org.apache.spark.sql.catalyst.expressions.UnaryExpression.mapChildren(Expression.scala:513) > at > 
org.apache.spark.sql.catalyst.trees.TreeNode.transformDownWithPruning(TreeNode.scala:589) > at > org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDownWithPruning$3(TreeNode.scala:589) > at > org.apache.spark.sql.catalyst.trees.UnaryLike.mapChildren(TreeNode.scala:1196) > at > org.apache.spark.sql.catalyst.trees.UnaryLike.mapChildren$(TreeNode.scala:1195) > at > org.apache.spark.sql.catalyst.expressions.UnaryExpression.mapChildren(Expression.scala:513) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformDownWithPruning(TreeNode.scala:589) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:560) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transform(TreeNode.scala:528) > at > org.apache.spark.sql.catalyst.expressions.BindReferences$.bindReference(BoundAttribute.scala:73) > at >
[jira] [Commented] (SPARK-39854) Catalyst 'ColumnPruning' Optimizer does not play well with sql function 'explode'
[ https://issues.apache.org/jira/browse/SPARK-39854?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17570540#comment-17570540 ] Apache Spark commented on SPARK-39854: -- User 'jiaji-wu' has created a pull request for this issue: https://github.com/apache/spark/pull/37269 > Catalyst 'ColumnPruning' Optimizer does not play well with sql function > 'explode' > - > > Key: SPARK-39854 > URL: https://issues.apache.org/jira/browse/SPARK-39854 > Project: Spark > Issue Type: Bug > Components: Optimizer >Affects Versions: 3.2.1, 3.3.0 > Environment: Spark version: the latest (3.4.0-SNAPSHOT) > OS: Ubuntu 20.04 > JDK: Amazon corretto-11.0.14.1 >Reporter: Jiaji Wu >Priority: Major > > The *ColumnPruning* optimizer batch does not always work with *explode* sql > function. > * Here's a code snippet to repro the issue: > > {code:java} > import spark.implicits._ > val testJson = > """{ > | "b": { > | "id": "id00", > | "data": [{ > | "b1": "vb1", > | "b2": 101, > | "ex2": [ > |{ "fb1": false, "fb2": 11, "fb3": "t1" }, > |{ "fb1": true, "fb2": 12, "fb3": "t2" } > | ]}, { > | "b1": "vb2", > | "b2": 102, > | "ex2": [ > |{ "fb1": false, "fb2": 13, "fb3": "t3" }, > |{ "fb1": true, "fb2": 14, "fb3": "t4" } > | ]} > | ], > | "fa": "tes", > | "v": "1.5" > | } > |} > |""".stripMargin > val df = spark.read.json((testJson :: Nil).toDS()) > .withColumn("ex_b", explode($"b.data.ex2")) > .withColumn("ex_b2", explode($"ex_b")) > val df1 = df > .withColumn("rt", struct( > $"b.fa".alias("rt_fa"), > $"b.v".alias("rt_v") > )) > .drop("b", "ex_b") > df1.show(false){code} > * the result exception: > {code:java} > Exception in thread "main" java.lang.IllegalStateException: Couldn't find > _extract_v#35 in [_extract_fa#36,ex_b2#13] > at > org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:80) > at > org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:73) > at > org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDownWithPruning$1(TreeNode.scala:584) > at > org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:176) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformDownWithPruning(TreeNode.scala:584) > at > org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDownWithPruning$3(TreeNode.scala:589) > at > scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:286) > at scala.collection.Iterator.foreach(Iterator.scala:943) > at scala.collection.Iterator.foreach$(Iterator.scala:943) > at scala.collection.AbstractIterator.foreach(Iterator.scala:1431) > at scala.collection.IterableLike.foreach(IterableLike.scala:74) > at scala.collection.IterableLike.foreach$(IterableLike.scala:73) > at scala.collection.AbstractIterable.foreach(Iterable.scala:56) > at scala.collection.TraversableLike.map(TraversableLike.scala:286) > at scala.collection.TraversableLike.map$(TraversableLike.scala:279) > at scala.collection.AbstractTraversable.map(Traversable.scala:108) > at > org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:698) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformDownWithPruning(TreeNode.scala:589) > at > org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDownWithPruning$3(TreeNode.scala:589) > at > org.apache.spark.sql.catalyst.trees.UnaryLike.mapChildren(TreeNode.scala:1196) > at > org.apache.spark.sql.catalyst.trees.UnaryLike.mapChildren$(TreeNode.scala:1195) > at > 
org.apache.spark.sql.catalyst.expressions.UnaryExpression.mapChildren(Expression.scala:513) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformDownWithPruning(TreeNode.scala:589) > at > org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDownWithPruning$3(TreeNode.scala:589) > at > org.apache.spark.sql.catalyst.trees.UnaryLike.mapChildren(TreeNode.scala:1196) > at > org.apache.spark.sql.catalyst.trees.UnaryLike.mapChildren$(TreeNode.scala:1195) > at > org.apache.spark.sql.catalyst.expressions.UnaryExpression.mapChildren(Expression.scala:513) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformDownWithPruning(TreeNode.scala:589) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:560) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transform(TreeNode.scala:528) > at >
[jira] [Assigned] (SPARK-39854) Catalyst 'ColumnPruning' Optimizer does not play well with sql function 'explode'
[ https://issues.apache.org/jira/browse/SPARK-39854?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-39854: Assignee: (was: Apache Spark) > Catalyst 'ColumnPruning' Optimizer does not play well with sql function > 'explode' > - > > Key: SPARK-39854 > URL: https://issues.apache.org/jira/browse/SPARK-39854 > Project: Spark > Issue Type: Bug > Components: Optimizer >Affects Versions: 3.2.1, 3.3.0 > Environment: Spark version: the latest (3.4.0-SNAPSHOT) > OS: Ubuntu 20.04 > JDK: Amazon corretto-11.0.14.1 >Reporter: Jiaji Wu >Priority: Major > > The *ColumnPruning* optimizer batch does not always work with *explode* sql > function. > * Here's a code snippet to repro the issue: > > {code:java} > import spark.implicits._ > val testJson = > """{ > | "b": { > | "id": "id00", > | "data": [{ > | "b1": "vb1", > | "b2": 101, > | "ex2": [ > |{ "fb1": false, "fb2": 11, "fb3": "t1" }, > |{ "fb1": true, "fb2": 12, "fb3": "t2" } > | ]}, { > | "b1": "vb2", > | "b2": 102, > | "ex2": [ > |{ "fb1": false, "fb2": 13, "fb3": "t3" }, > |{ "fb1": true, "fb2": 14, "fb3": "t4" } > | ]} > | ], > | "fa": "tes", > | "v": "1.5" > | } > |} > |""".stripMargin > val df = spark.read.json((testJson :: Nil).toDS()) > .withColumn("ex_b", explode($"b.data.ex2")) > .withColumn("ex_b2", explode($"ex_b")) > val df1 = df > .withColumn("rt", struct( > $"b.fa".alias("rt_fa"), > $"b.v".alias("rt_v") > )) > .drop("b", "ex_b") > df1.show(false){code} > * the result exception: > {code:java} > Exception in thread "main" java.lang.IllegalStateException: Couldn't find > _extract_v#35 in [_extract_fa#36,ex_b2#13] > at > org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:80) > at > org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:73) > at > org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDownWithPruning$1(TreeNode.scala:584) > at > org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:176) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformDownWithPruning(TreeNode.scala:584) > at > org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDownWithPruning$3(TreeNode.scala:589) > at > scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:286) > at scala.collection.Iterator.foreach(Iterator.scala:943) > at scala.collection.Iterator.foreach$(Iterator.scala:943) > at scala.collection.AbstractIterator.foreach(Iterator.scala:1431) > at scala.collection.IterableLike.foreach(IterableLike.scala:74) > at scala.collection.IterableLike.foreach$(IterableLike.scala:73) > at scala.collection.AbstractIterable.foreach(Iterable.scala:56) > at scala.collection.TraversableLike.map(TraversableLike.scala:286) > at scala.collection.TraversableLike.map$(TraversableLike.scala:279) > at scala.collection.AbstractTraversable.map(Traversable.scala:108) > at > org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:698) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformDownWithPruning(TreeNode.scala:589) > at > org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDownWithPruning$3(TreeNode.scala:589) > at > org.apache.spark.sql.catalyst.trees.UnaryLike.mapChildren(TreeNode.scala:1196) > at > org.apache.spark.sql.catalyst.trees.UnaryLike.mapChildren$(TreeNode.scala:1195) > at > org.apache.spark.sql.catalyst.expressions.UnaryExpression.mapChildren(Expression.scala:513) > at > 
org.apache.spark.sql.catalyst.trees.TreeNode.transformDownWithPruning(TreeNode.scala:589) > at > org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDownWithPruning$3(TreeNode.scala:589) > at > org.apache.spark.sql.catalyst.trees.UnaryLike.mapChildren(TreeNode.scala:1196) > at > org.apache.spark.sql.catalyst.trees.UnaryLike.mapChildren$(TreeNode.scala:1195) > at > org.apache.spark.sql.catalyst.expressions.UnaryExpression.mapChildren(Expression.scala:513) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformDownWithPruning(TreeNode.scala:589) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:560) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transform(TreeNode.scala:528) > at > org.apache.spark.sql.catalyst.expressions.BindReferences$.bindReference(BoundAttribute.scala:73) > at >
[jira] [Commented] (SPARK-39854) Catalyst 'ColumnPruning' Optimizer does not play well with sql function 'explode'
[ https://issues.apache.org/jira/browse/SPARK-39854?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17570541#comment-17570541 ] Apache Spark commented on SPARK-39854: -- User 'jiaji-wu' has created a pull request for this issue: https://github.com/apache/spark/pull/37269 > Catalyst 'ColumnPruning' Optimizer does not play well with sql function > 'explode' > - > > Key: SPARK-39854 > URL: https://issues.apache.org/jira/browse/SPARK-39854 > Project: Spark > Issue Type: Bug > Components: Optimizer >Affects Versions: 3.2.1, 3.3.0 > Environment: Spark version: the latest (3.4.0-SNAPSHOT) > OS: Ubuntu 20.04 > JDK: Amazon corretto-11.0.14.1 >Reporter: Jiaji Wu >Assignee: Apache Spark >Priority: Major > > The *ColumnPruning* optimizer batch does not always work with *explode* sql > function. > * Here's a code snippet to repro the issue: > > {code:java} > import spark.implicits._ > val testJson = > """{ > | "b": { > | "id": "id00", > | "data": [{ > | "b1": "vb1", > | "b2": 101, > | "ex2": [ > |{ "fb1": false, "fb2": 11, "fb3": "t1" }, > |{ "fb1": true, "fb2": 12, "fb3": "t2" } > | ]}, { > | "b1": "vb2", > | "b2": 102, > | "ex2": [ > |{ "fb1": false, "fb2": 13, "fb3": "t3" }, > |{ "fb1": true, "fb2": 14, "fb3": "t4" } > | ]} > | ], > | "fa": "tes", > | "v": "1.5" > | } > |} > |""".stripMargin > val df = spark.read.json((testJson :: Nil).toDS()) > .withColumn("ex_b", explode($"b.data.ex2")) > .withColumn("ex_b2", explode($"ex_b")) > val df1 = df > .withColumn("rt", struct( > $"b.fa".alias("rt_fa"), > $"b.v".alias("rt_v") > )) > .drop("b", "ex_b") > df1.show(false){code} > * the result exception: > {code:java} > Exception in thread "main" java.lang.IllegalStateException: Couldn't find > _extract_v#35 in [_extract_fa#36,ex_b2#13] > at > org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:80) > at > org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:73) > at > org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDownWithPruning$1(TreeNode.scala:584) > at > org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:176) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformDownWithPruning(TreeNode.scala:584) > at > org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDownWithPruning$3(TreeNode.scala:589) > at > scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:286) > at scala.collection.Iterator.foreach(Iterator.scala:943) > at scala.collection.Iterator.foreach$(Iterator.scala:943) > at scala.collection.AbstractIterator.foreach(Iterator.scala:1431) > at scala.collection.IterableLike.foreach(IterableLike.scala:74) > at scala.collection.IterableLike.foreach$(IterableLike.scala:73) > at scala.collection.AbstractIterable.foreach(Iterable.scala:56) > at scala.collection.TraversableLike.map(TraversableLike.scala:286) > at scala.collection.TraversableLike.map$(TraversableLike.scala:279) > at scala.collection.AbstractTraversable.map(Traversable.scala:108) > at > org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:698) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformDownWithPruning(TreeNode.scala:589) > at > org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDownWithPruning$3(TreeNode.scala:589) > at > org.apache.spark.sql.catalyst.trees.UnaryLike.mapChildren(TreeNode.scala:1196) > at > 
org.apache.spark.sql.catalyst.trees.UnaryLike.mapChildren$(TreeNode.scala:1195) > at > org.apache.spark.sql.catalyst.expressions.UnaryExpression.mapChildren(Expression.scala:513) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformDownWithPruning(TreeNode.scala:589) > at > org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDownWithPruning$3(TreeNode.scala:589) > at > org.apache.spark.sql.catalyst.trees.UnaryLike.mapChildren(TreeNode.scala:1196) > at > org.apache.spark.sql.catalyst.trees.UnaryLike.mapChildren$(TreeNode.scala:1195) > at > org.apache.spark.sql.catalyst.expressions.UnaryExpression.mapChildren(Expression.scala:513) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformDownWithPruning(TreeNode.scala:589) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:560) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transform(TreeNode.scala:528) > at >
[jira] [Commented] (SPARK-39854) Catalyst 'ColumnPruning' Optimizer does not play well with sql function 'explode'
[ https://issues.apache.org/jira/browse/SPARK-39854?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17570535#comment-17570535 ] Jiaji Wu commented on SPARK-39854: -- One workaround is to exclude *ColumnPruning* by setting the Spark config: {color:#54b33e}"spark.sql.optimizer.excludedRules" {color}-> {color:#54b33e}"org.apache.spark.sql.catalyst.optimizer.ColumnPruning"{color} > Catalyst 'ColumnPruning' Optimizer does not play well with sql function > 'explode' > - > > Key: SPARK-39854 > URL: https://issues.apache.org/jira/browse/SPARK-39854 > Project: Spark > Issue Type: Bug > Components: Optimizer >Affects Versions: 3.2.1, 3.3.0 > Environment: Spark version: the latest (3.4.0-SNAPSHOT) > OS: Ubuntu 20.04 > JDK: Amazon corretto-11.0.14.1 >Reporter: Jiaji Wu >Priority: Major > > The *ColumnPruning* optimizer batch does not always work with *explode* sql > function. > * Here's a code snippet to repro the issue: > > {code:java} > import spark.implicits._ > val testJson = > """{ > | "b": { > | "id": "id00", > | "data": [{ > | "b1": "vb1", > | "b2": 101, > | "ex2": [ > |{ "fb1": false, "fb2": 11, "fb3": "t1" }, > |{ "fb1": true, "fb2": 12, "fb3": "t2" } > | ]}, { > | "b1": "vb2", > | "b2": 102, > | "ex2": [ > |{ "fb1": false, "fb2": 13, "fb3": "t3" }, > |{ "fb1": true, "fb2": 14, "fb3": "t4" } > | ]} > | ], > | "fa": "tes", > | "v": "1.5" > | } > |} > |""".stripMargin > val df = spark.read.json((testJson :: Nil).toDS()) > .withColumn("ex_b", explode($"b.data.ex2")) > .withColumn("ex_b2", explode($"ex_b")) > val df1 = df > .withColumn("rt", struct( > $"b.fa".alias("rt_fa"), > $"b.v".alias("rt_v") > )) > .drop("b", "ex_b") > df1.show(false){code} > * the result exception: > {code:java} > Exception in thread "main" java.lang.IllegalStateException: Couldn't find > _extract_v#35 in [_extract_fa#36,ex_b2#13] > at > org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:80) > at > org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:73) > at > org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDownWithPruning$1(TreeNode.scala:584) > at > org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:176) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformDownWithPruning(TreeNode.scala:584) > at > org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDownWithPruning$3(TreeNode.scala:589) > at > scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:286) > at scala.collection.Iterator.foreach(Iterator.scala:943) > at scala.collection.Iterator.foreach$(Iterator.scala:943) > at scala.collection.AbstractIterator.foreach(Iterator.scala:1431) > at scala.collection.IterableLike.foreach(IterableLike.scala:74) > at scala.collection.IterableLike.foreach$(IterableLike.scala:73) > at scala.collection.AbstractIterable.foreach(Iterable.scala:56) > at scala.collection.TraversableLike.map(TraversableLike.scala:286) > at scala.collection.TraversableLike.map$(TraversableLike.scala:279) > at scala.collection.AbstractTraversable.map(Traversable.scala:108) > at > org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:698) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformDownWithPruning(TreeNode.scala:589) > at > org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDownWithPruning$3(TreeNode.scala:589) > at > org.apache.spark.sql.catalyst.trees.UnaryLike.mapChildren(TreeNode.scala:1196) > at > 
org.apache.spark.sql.catalyst.trees.UnaryLike.mapChildren$(TreeNode.scala:1195) > at > org.apache.spark.sql.catalyst.expressions.UnaryExpression.mapChildren(Expression.scala:513) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformDownWithPruning(TreeNode.scala:589) > at > org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDownWithPruning$3(TreeNode.scala:589) > at > org.apache.spark.sql.catalyst.trees.UnaryLike.mapChildren(TreeNode.scala:1196) > at > org.apache.spark.sql.catalyst.trees.UnaryLike.mapChildren$(TreeNode.scala:1195) > at > org.apache.spark.sql.catalyst.expressions.UnaryExpression.mapChildren(Expression.scala:513) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformDownWithPruning(TreeNode.scala:589) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:560) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transform(TreeNode.scala:528) > at
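The workaround from the comment above, in code form ({{spark.sql.optimizer.excludedRules}} is a standard, runtime-configurable SQL config; note that excluding ColumnPruning disables the optimization globally, so expect wider plans):

{code:java}
import org.apache.spark.sql.SparkSession

// Workaround from the comment above: excluding ColumnPruning avoids the
// IllegalStateException, at the cost of disabling column pruning globally.
val spark = SparkSession.builder()
  .config("spark.sql.optimizer.excludedRules",
    "org.apache.spark.sql.catalyst.optimizer.ColumnPruning")
  .getOrCreate()

// The conf is runtime-configurable, so it can also be set on a live session:
spark.conf.set("spark.sql.optimizer.excludedRules",
  "org.apache.spark.sql.catalyst.optimizer.ColumnPruning")
{code}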
[jira] [Commented] (SPARK-39855) Unable to set zstd compression level while writing orc files
[ https://issues.apache.org/jira/browse/SPARK-39855?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17570504#comment-17570504 ] shezm commented on SPARK-39855: --- I will follow up on this issue. > Unable to set zstd compression level while writing orc files > > > Key: SPARK-39855 > URL: https://issues.apache.org/jira/browse/SPARK-39855 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.2.0 >Reporter: shezm >Priority: Major > > Similar to this issue: https://issues.apache.org/jira/browse/SPARK-39743
[jira] [Created] (SPARK-39855) Unable to set zstd compression level while writing orc files
shezm created SPARK-39855: - Summary: Unable to set zstd compression level while writing orc files Key: SPARK-39855 URL: https://issues.apache.org/jira/browse/SPARK-39855 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.2.0 Reporter: shezm Similar to this issue: https://issues.apache.org/jira/browse/SPARK-39743
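To make the report concrete: selecting zstd as the ORC codec already works, but there is no knob for the compression level. In the sketch below, the codec config is a real Spark SQL config; the level option shown commented out is hypothetical and only illustrates what is missing (the Parquet-side analogue is SPARK-39743):

{code:java}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[2]").getOrCreate()
val df = spark.range(1000).toDF("id")

// Choosing zstd as the ORC codec works today:
spark.conf.set("spark.sql.orc.compression.codec", "zstd")
df.write.format("orc").save("/tmp/orc-zstd")

// ...but there is no supported way to pick the zstd compression *level*; a
// hypothetical option like the one below does not exist as of this report:
// df.write.format("orc")
//   .option("compression", "zstd")
//   .option("compression.level", "9") // hypothetical, illustrating the gap
//   .save("/tmp/orc-zstd-level9")
{code}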
[jira] [Updated] (SPARK-39854) Catalyst 'ColumnPruning' Optimizer does not play well with sql function 'explode'
[ https://issues.apache.org/jira/browse/SPARK-39854?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jiaji Wu updated SPARK-39854: - Affects Version/s: 3.2.1 > Catalyst 'ColumnPruning' Optimizer does not play well with sql function > 'explode' > - > > Key: SPARK-39854 > URL: https://issues.apache.org/jira/browse/SPARK-39854 > Project: Spark > Issue Type: Bug > Components: Optimizer >Affects Versions: 3.2.1, 3.3.0 > Environment: Spark version: the latest (3.4.0-SNAPSHOT) > OS: Ubuntu 20.04 > JDK: Amazon corretto-11.0.14.1 >Reporter: Jiaji Wu >Priority: Major > > The *ColumnPruning* optimizer batch does not always work with *explode* sql > function. > * Here's a code snippet to repro the issue: > > {code:java} > import spark.implicits._ > val testJson = > """{ > | "b": { > | "id": "id00", > | "data": [{ > | "b1": "vb1", > | "b2": 101, > | "ex2": [ > |{ "fb1": false, "fb2": 11, "fb3": "t1" }, > |{ "fb1": true, "fb2": 12, "fb3": "t2" } > | ]}, { > | "b1": "vb2", > | "b2": 102, > | "ex2": [ > |{ "fb1": false, "fb2": 13, "fb3": "t3" }, > |{ "fb1": true, "fb2": 14, "fb3": "t4" } > | ]} > | ], > | "fa": "tes", > | "v": "1.5" > | } > |} > |""".stripMargin > val df = spark.read.json((testJson :: Nil).toDS()) > .withColumn("ex_b", explode($"b.data.ex2")) > .withColumn("ex_b2", explode($"ex_b")) > val df1 = df > .withColumn("rt", struct( > $"b.fa".alias("rt_fa"), > $"b.v".alias("rt_v") > )) > .drop("b", "ex_b") > df1.show(false){code} > * the result exception: > {code:java} > Exception in thread "main" java.lang.IllegalStateException: Couldn't find > _extract_v#35 in [_extract_fa#36,ex_b2#13] > at > org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:80) > at > org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:73) > at > org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDownWithPruning$1(TreeNode.scala:584) > at > org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:176) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformDownWithPruning(TreeNode.scala:584) > at > org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDownWithPruning$3(TreeNode.scala:589) > at > scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:286) > at scala.collection.Iterator.foreach(Iterator.scala:943) > at scala.collection.Iterator.foreach$(Iterator.scala:943) > at scala.collection.AbstractIterator.foreach(Iterator.scala:1431) > at scala.collection.IterableLike.foreach(IterableLike.scala:74) > at scala.collection.IterableLike.foreach$(IterableLike.scala:73) > at scala.collection.AbstractIterable.foreach(Iterable.scala:56) > at scala.collection.TraversableLike.map(TraversableLike.scala:286) > at scala.collection.TraversableLike.map$(TraversableLike.scala:279) > at scala.collection.AbstractTraversable.map(Traversable.scala:108) > at > org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:698) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformDownWithPruning(TreeNode.scala:589) > at > org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDownWithPruning$3(TreeNode.scala:589) > at > org.apache.spark.sql.catalyst.trees.UnaryLike.mapChildren(TreeNode.scala:1196) > at > org.apache.spark.sql.catalyst.trees.UnaryLike.mapChildren$(TreeNode.scala:1195) > at > org.apache.spark.sql.catalyst.expressions.UnaryExpression.mapChildren(Expression.scala:513) > at > 
org.apache.spark.sql.catalyst.trees.TreeNode.transformDownWithPruning(TreeNode.scala:589) > at > org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDownWithPruning$3(TreeNode.scala:589) > at > org.apache.spark.sql.catalyst.trees.UnaryLike.mapChildren(TreeNode.scala:1196) > at > org.apache.spark.sql.catalyst.trees.UnaryLike.mapChildren$(TreeNode.scala:1195) > at > org.apache.spark.sql.catalyst.expressions.UnaryExpression.mapChildren(Expression.scala:513) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformDownWithPruning(TreeNode.scala:589) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:560) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transform(TreeNode.scala:528) > at > org.apache.spark.sql.catalyst.expressions.BindReferences$.bindReference(BoundAttribute.scala:73) > at >
[jira] [Created] (SPARK-39854) Catalyst 'ColumnPruning' Optimizer does not play well with sql function 'explode'
Jiaji Wu created SPARK-39854: Summary: Catalyst 'ColumnPruning' Optimizer does not play well with sql function 'explode' Key: SPARK-39854 URL: https://issues.apache.org/jira/browse/SPARK-39854 Project: Spark Issue Type: Bug Components: Optimizer Affects Versions: 3.3.0 Environment: Spark version: the latest (3.4.0-SNAPSHOT) OS: Ubuntu 20.04 JDK: Amazon corretto-11.0.14.1 Reporter: Jiaji Wu The *ColumnPruning* optimizer batch does not always work with *explode* sql function. * Here's a code snippet to repro the issue: {code:java} import spark.implicits._ val testJson = """{ | "b": { | "id": "id00", | "data": [{ | "b1": "vb1", | "b2": 101, | "ex2": [ |{ "fb1": false, "fb2": 11, "fb3": "t1" }, |{ "fb1": true, "fb2": 12, "fb3": "t2" } | ]}, { | "b1": "vb2", | "b2": 102, | "ex2": [ |{ "fb1": false, "fb2": 13, "fb3": "t3" }, |{ "fb1": true, "fb2": 14, "fb3": "t4" } | ]} | ], | "fa": "tes", | "v": "1.5" | } |} |""".stripMargin val df = spark.read.json((testJson :: Nil).toDS()) .withColumn("ex_b", explode($"b.data.ex2")) .withColumn("ex_b2", explode($"ex_b")) val df1 = df .withColumn("rt", struct( $"b.fa".alias("rt_fa"), $"b.v".alias("rt_v") )) .drop("b", "ex_b") df1.show(false){code} * the result exception: {code:java} Exception in thread "main" java.lang.IllegalStateException: Couldn't find _extract_v#35 in [_extract_fa#36,ex_b2#13] at org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:80) at org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:73) at org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDownWithPruning$1(TreeNode.scala:584) at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:176) at org.apache.spark.sql.catalyst.trees.TreeNode.transformDownWithPruning(TreeNode.scala:584) at org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDownWithPruning$3(TreeNode.scala:589) at scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:286) at scala.collection.Iterator.foreach(Iterator.scala:943) at scala.collection.Iterator.foreach$(Iterator.scala:943) at scala.collection.AbstractIterator.foreach(Iterator.scala:1431) at scala.collection.IterableLike.foreach(IterableLike.scala:74) at scala.collection.IterableLike.foreach$(IterableLike.scala:73) at scala.collection.AbstractIterable.foreach(Iterable.scala:56) at scala.collection.TraversableLike.map(TraversableLike.scala:286) at scala.collection.TraversableLike.map$(TraversableLike.scala:279) at scala.collection.AbstractTraversable.map(Traversable.scala:108) at org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:698) at org.apache.spark.sql.catalyst.trees.TreeNode.transformDownWithPruning(TreeNode.scala:589) at org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDownWithPruning$3(TreeNode.scala:589) at org.apache.spark.sql.catalyst.trees.UnaryLike.mapChildren(TreeNode.scala:1196) at org.apache.spark.sql.catalyst.trees.UnaryLike.mapChildren$(TreeNode.scala:1195) at org.apache.spark.sql.catalyst.expressions.UnaryExpression.mapChildren(Expression.scala:513) at org.apache.spark.sql.catalyst.trees.TreeNode.transformDownWithPruning(TreeNode.scala:589) at org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDownWithPruning$3(TreeNode.scala:589) at org.apache.spark.sql.catalyst.trees.UnaryLike.mapChildren(TreeNode.scala:1196) at org.apache.spark.sql.catalyst.trees.UnaryLike.mapChildren$(TreeNode.scala:1195) at 
org.apache.spark.sql.catalyst.expressions.UnaryExpression.mapChildren(Expression.scala:513) at org.apache.spark.sql.catalyst.trees.TreeNode.transformDownWithPruning(TreeNode.scala:589) at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:560) at org.apache.spark.sql.catalyst.trees.TreeNode.transform(TreeNode.scala:528) at org.apache.spark.sql.catalyst.expressions.BindReferences$.bindReference(BoundAttribute.scala:73) at org.apache.spark.sql.catalyst.expressions.BindReferences$.$anonfun$bindReferences$1(BoundAttribute.scala:94) at scala.collection.immutable.List.map(List.scala:297) at org.apache.spark.sql.catalyst.expressions.BindReferences$.bindReferences(BoundAttribute.scala:94) at org.apache.spark.sql.execution.ProjectExec.doConsume(basicPhysicalOperators.scala:69) at org.apache.spark.sql.execution.CodegenSupport.consume(WholeStageCodegenExec.scala:196) at org.apache.spark.sql.execution.CodegenSupport.consume$(WholeStageCodegenExec.scala:151) at
[jira] [Assigned] (SPARK-39853) Support stage level schedule for standalone cluster when dynamic allocation is disabled
[ https://issues.apache.org/jira/browse/SPARK-39853?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-39853: Assignee: (was: Apache Spark) > Support stage level schedule for standalone cluster when dynamic allocation is disabled > --- > > Key: SPARK-39853 > URL: https://issues.apache.org/jira/browse/SPARK-39853 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.3.0 >Reporter: huangtengfei >Priority: Major > > [SPARK-39062|https://issues.apache.org/jira/browse/SPARK-39062] added stage-level scheduling support for standalone clusters when dynamic allocation is enabled: Spark requests executors for the different resource profiles. When dynamic allocation is disabled, we can still leverage stage-level scheduling to schedule tasks onto executors with the default resource profile, based on the task resource requests of each profile.
[jira] [Commented] (SPARK-39853) Support stage level schedule for standalone cluster when dynamic allocation is disabled
[ https://issues.apache.org/jira/browse/SPARK-39853?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17570501#comment-17570501 ] Apache Spark commented on SPARK-39853: -- User 'ivoson' has created a pull request for this issue: https://github.com/apache/spark/pull/37268 > Support stage level schedule for standalone cluster when dynamic allocation is disabled > --- > > Key: SPARK-39853 > URL: https://issues.apache.org/jira/browse/SPARK-39853 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.3.0 >Reporter: huangtengfei >Priority: Major > > [SPARK-39062|https://issues.apache.org/jira/browse/SPARK-39062] added stage-level scheduling support for standalone clusters when dynamic allocation is enabled: Spark requests executors for the different resource profiles. When dynamic allocation is disabled, we can still leverage stage-level scheduling to schedule tasks onto executors with the default resource profile, based on the task resource requests of each profile.
[jira] [Assigned] (SPARK-39853) Support stage level schedule for standalone cluster when dynamic allocation is disabled
[ https://issues.apache.org/jira/browse/SPARK-39853?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-39853: Assignee: Apache Spark > Support stage level schedule for standalone cluster when dynamic allocation is disabled > --- > > Key: SPARK-39853 > URL: https://issues.apache.org/jira/browse/SPARK-39853 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.3.0 >Reporter: huangtengfei >Assignee: Apache Spark >Priority: Major > > [SPARK-39062|https://issues.apache.org/jira/browse/SPARK-39062] added stage-level scheduling support for standalone clusters when dynamic allocation is enabled: Spark requests executors for the different resource profiles. When dynamic allocation is disabled, we can still leverage stage-level scheduling to schedule tasks onto executors with the default resource profile, based on the task resource requests of each profile.
[jira] [Created] (SPARK-39853) Support stage level schedule for standalone cluster when dynamic allocation is disabled
huangtengfei created SPARK-39853: -- Summary: Support stage level schedule for standalone cluster when dynamic allocation is disabled Key: SPARK-39853 URL: https://issues.apache.org/jira/browse/SPARK-39853 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 3.3.0 Reporter: huangtengfei [SPARK-39062|https://issues.apache.org/jira/browse/SPARK-39062] added stage-level scheduling support for standalone clusters when dynamic allocation is enabled: Spark requests executors to match each resource profile. When dynamic allocation is disabled, we can also leverage stage-level scheduling to schedule tasks onto executors with the default resource profile, based on each profile's task resource requests. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
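For reference, a minimal sketch of how stage-level scheduling is driven from user code with the existing ResourceProfile API (the profile contents and the RDD are illustrative, not taken from the patch; `sc` is an active SparkContext). The proposal is that with dynamic allocation disabled, executors keep the default profile and only the per-task requests change how tasks are placed:

{code:scala}
import org.apache.spark.resource.{ResourceProfileBuilder, TaskResourceRequests}

// Illustrative task-only profile: each task in the tagged stage needs 4 cores.
val taskReqs = new TaskResourceRequests().cpus(4)
val profile = new ResourceProfileBuilder().require(taskReqs).build

val result = sc.parallelize(1 to 1000, 8)
  .withResources(profile) // stages computing this RDD honor the task requests
  .map(_ * 2)
  .collect()
{code}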
[jira] [Updated] (SPARK-39851) Improve join stats estimation if one side can keep uniqueness
[ https://issues.apache.org/jira/browse/SPARK-39851?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang updated SPARK-39851: Summary: Improve join stats estimation if one side can keep uniqueness (was: Fix join stats estimation if one side can keep uniqueness) > Improve join stats estimation if one side can keep uniqueness > - > > Key: SPARK-39851 > URL: https://issues.apache.org/jira/browse/SPARK-39851 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.4.0 >Reporter: Yuming Wang >Priority: Major > > {code:sql} > SELECT i_item_sk ss_item_sk > FROM item, >(SELECT DISTINCT iss.i_brand_id brand_id, > iss.i_class_id class_id, > iss.i_category_id category_id > FROM item iss) x > WHERE i_brand_id = brand_id >AND i_class_id = class_id >AND i_category_id = category_id > {code} > Current: > {noformat} > == Optimized Logical Plan == > Project [i_item_sk#4 AS ss_item_sk#54], Statistics(sizeInBytes=370.8 MiB, > rowCount=3.24E+7) > +- Join Inner, (((i_brand_id#11 = brand_id#51) AND (i_class_id#13 = > class_id#52)) AND (i_category_id#15 = category_id#53)), > Statistics(sizeInBytes=1112.3 MiB, rowCount=3.24E+7) >:- Project [i_item_sk#4, i_brand_id#11, i_class_id#13, i_category_id#15], > Statistics(sizeInBytes=4.6 MiB, rowCount=2.02E+5) >: +- Filter ((isnotnull(i_brand_id#11) AND isnotnull(i_class_id#13)) AND > isnotnull(i_category_id#15)), Statistics(sizeInBytes=84.6 MiB, > rowCount=2.02E+5) >: +- Relation > spark_catalog.default.item[i_item_sk#4,i_item_id#5,i_rec_start_date#6,i_rec_end_date#7,i_item_desc#8,i_current_price#9,i_wholesale_cost#10,i_brand_id#11,i_brand#12,i_class_id#13,i_class#14,i_category_id#15,i_category#16,i_manufact_id#17,i_manufact#18,i_size#19,i_formulation#20,i_color#21,i_units#22,i_container#23,i_manager_id#24,i_product_name#25] > parquet, Statistics(sizeInBytes=85.2 MiB, rowCount=2.04E+5) >+- Aggregate [brand_id#51, class_id#52, category_id#53], [brand_id#51, > class_id#52, category_id#53], Statistics(sizeInBytes=2.6 MiB, > rowCount=1.37E+5) > +- Project [i_brand_id#62 AS brand_id#51, i_class_id#64 AS class_id#52, > i_category_id#66 AS category_id#53], Statistics(sizeInBytes=3.9 MiB, > rowCount=2.02E+5) > +- Filter ((isnotnull(i_brand_id#62) AND isnotnull(i_class_id#64)) > AND isnotnull(i_category_id#66)), Statistics(sizeInBytes=84.6 MiB, > rowCount=2.02E+5) > +- Relation > spark_catalog.default.item[i_item_sk#55,i_item_id#56,i_rec_start_date#57,i_rec_end_date#58,i_item_desc#59,i_current_price#60,i_wholesale_cost#61,i_brand_id#62,i_brand#63,i_class_id#64,i_class#65,i_category_id#66,i_category#67,i_manufact_id#68,i_manufact#69,i_size#70,i_formulation#71,i_color#72,i_units#73,i_container#74,i_manager_id#75,i_product_name#76] > parquet, Statistics(sizeInBytes=85.2 MiB, rowCount=2.04E+5) > {noformat} > Expected: > {noformat} > == Optimized Logical Plan == > Project [i_item_sk#4 AS ss_item_sk#54], Statistics(sizeInBytes=2.3 MiB, > rowCount=2.02E+5) > +- Join Inner, (((i_brand_id#11 = brand_id#51) AND (i_class_id#13 = > class_id#52)) AND (i_category_id#15 = category_id#53)), > Statistics(sizeInBytes=7.0 MiB, rowCount=2.02E+5) >:- Project [i_item_sk#4, i_brand_id#11, i_class_id#13, i_category_id#15], > Statistics(sizeInBytes=4.6 MiB, rowCount=2.02E+5) >: +- Filter ((isnotnull(i_brand_id#11) AND isnotnull(i_class_id#13)) AND > isnotnull(i_category_id#15)), Statistics(sizeInBytes=84.6 MiB, > rowCount=2.02E+5) >: +- Relation > 
spark_catalog.default.item[i_item_sk#4,i_item_id#5,i_rec_start_date#6,i_rec_end_date#7,i_item_desc#8,i_current_price#9,i_wholesale_cost#10,i_brand_id#11,i_brand#12,i_class_id#13,i_class#14,i_category_id#15,i_category#16,i_manufact_id#17,i_manufact#18,i_size#19,i_formulation#20,i_color#21,i_units#22,i_container#23,i_manager_id#24,i_product_name#25] > parquet, Statistics(sizeInBytes=85.2 MiB, rowCount=2.04E+5) >+- Aggregate [brand_id#51, class_id#52, category_id#53], [brand_id#51, > class_id#52, category_id#53], Statistics(sizeInBytes=2.6 MiB, > rowCount=1.37E+5) > +- Project [i_brand_id#62 AS brand_id#51, i_class_id#64 AS class_id#52, > i_category_id#66 AS category_id#53], Statistics(sizeInBytes=3.9 MiB, > rowCount=2.02E+5) > +- Filter ((isnotnull(i_brand_id#62) AND isnotnull(i_class_id#64)) > AND isnotnull(i_category_id#66)), Statistics(sizeInBytes=84.6 MiB, > rowCount=2.02E+5) > +- Relation >
[jira] [Assigned] (SPARK-39851) Fix join stats estimation if one side can keep uniqueness
[ https://issues.apache.org/jira/browse/SPARK-39851?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-39851: Assignee: (was: Apache Spark) > Fix join stats estimation if one side can keep uniqueness > - > > Key: SPARK-39851 > URL: https://issues.apache.org/jira/browse/SPARK-39851 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.4.0 >Reporter: Yuming Wang >Priority: Major > > {code:sql} > SELECT i_item_sk ss_item_sk > FROM item, >(SELECT DISTINCT iss.i_brand_id brand_id, > iss.i_class_id class_id, > iss.i_category_id category_id > FROM item iss) x > WHERE i_brand_id = brand_id >AND i_class_id = class_id >AND i_category_id = category_id > {code} > Current: > {noformat} > == Optimized Logical Plan == > Project [i_item_sk#4 AS ss_item_sk#54], Statistics(sizeInBytes=370.8 MiB, > rowCount=3.24E+7) > +- Join Inner, (((i_brand_id#11 = brand_id#51) AND (i_class_id#13 = > class_id#52)) AND (i_category_id#15 = category_id#53)), > Statistics(sizeInBytes=1112.3 MiB, rowCount=3.24E+7) >:- Project [i_item_sk#4, i_brand_id#11, i_class_id#13, i_category_id#15], > Statistics(sizeInBytes=4.6 MiB, rowCount=2.02E+5) >: +- Filter ((isnotnull(i_brand_id#11) AND isnotnull(i_class_id#13)) AND > isnotnull(i_category_id#15)), Statistics(sizeInBytes=84.6 MiB, > rowCount=2.02E+5) >: +- Relation > spark_catalog.default.item[i_item_sk#4,i_item_id#5,i_rec_start_date#6,i_rec_end_date#7,i_item_desc#8,i_current_price#9,i_wholesale_cost#10,i_brand_id#11,i_brand#12,i_class_id#13,i_class#14,i_category_id#15,i_category#16,i_manufact_id#17,i_manufact#18,i_size#19,i_formulation#20,i_color#21,i_units#22,i_container#23,i_manager_id#24,i_product_name#25] > parquet, Statistics(sizeInBytes=85.2 MiB, rowCount=2.04E+5) >+- Aggregate [brand_id#51, class_id#52, category_id#53], [brand_id#51, > class_id#52, category_id#53], Statistics(sizeInBytes=2.6 MiB, > rowCount=1.37E+5) > +- Project [i_brand_id#62 AS brand_id#51, i_class_id#64 AS class_id#52, > i_category_id#66 AS category_id#53], Statistics(sizeInBytes=3.9 MiB, > rowCount=2.02E+5) > +- Filter ((isnotnull(i_brand_id#62) AND isnotnull(i_class_id#64)) > AND isnotnull(i_category_id#66)), Statistics(sizeInBytes=84.6 MiB, > rowCount=2.02E+5) > +- Relation > spark_catalog.default.item[i_item_sk#55,i_item_id#56,i_rec_start_date#57,i_rec_end_date#58,i_item_desc#59,i_current_price#60,i_wholesale_cost#61,i_brand_id#62,i_brand#63,i_class_id#64,i_class#65,i_category_id#66,i_category#67,i_manufact_id#68,i_manufact#69,i_size#70,i_formulation#71,i_color#72,i_units#73,i_container#74,i_manager_id#75,i_product_name#76] > parquet, Statistics(sizeInBytes=85.2 MiB, rowCount=2.04E+5) > {noformat} > Expected: > {noformat} > == Optimized Logical Plan == > Project [i_item_sk#4 AS ss_item_sk#54], Statistics(sizeInBytes=2.3 MiB, > rowCount=2.02E+5) > +- Join Inner, (((i_brand_id#11 = brand_id#51) AND (i_class_id#13 = > class_id#52)) AND (i_category_id#15 = category_id#53)), > Statistics(sizeInBytes=7.0 MiB, rowCount=2.02E+5) >:- Project [i_item_sk#4, i_brand_id#11, i_class_id#13, i_category_id#15], > Statistics(sizeInBytes=4.6 MiB, rowCount=2.02E+5) >: +- Filter ((isnotnull(i_brand_id#11) AND isnotnull(i_class_id#13)) AND > isnotnull(i_category_id#15)), Statistics(sizeInBytes=84.6 MiB, > rowCount=2.02E+5) >: +- Relation > 
spark_catalog.default.item[i_item_sk#4,i_item_id#5,i_rec_start_date#6,i_rec_end_date#7,i_item_desc#8,i_current_price#9,i_wholesale_cost#10,i_brand_id#11,i_brand#12,i_class_id#13,i_class#14,i_category_id#15,i_category#16,i_manufact_id#17,i_manufact#18,i_size#19,i_formulation#20,i_color#21,i_units#22,i_container#23,i_manager_id#24,i_product_name#25] > parquet, Statistics(sizeInBytes=85.2 MiB, rowCount=2.04E+5) >+- Aggregate [brand_id#51, class_id#52, category_id#53], [brand_id#51, > class_id#52, category_id#53], Statistics(sizeInBytes=2.6 MiB, > rowCount=1.37E+5) > +- Project [i_brand_id#62 AS brand_id#51, i_class_id#64 AS class_id#52, > i_category_id#66 AS category_id#53], Statistics(sizeInBytes=3.9 MiB, > rowCount=2.02E+5) > +- Filter ((isnotnull(i_brand_id#62) AND isnotnull(i_class_id#64)) > AND isnotnull(i_category_id#66)), Statistics(sizeInBytes=84.6 MiB, > rowCount=2.02E+5) > +- Relation >
[jira] [Commented] (SPARK-39851) Fix join stats estimation if one side can keep uniqueness
[ https://issues.apache.org/jira/browse/SPARK-39851?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17570473#comment-17570473 ] Apache Spark commented on SPARK-39851: -- User 'wangyum' has created a pull request for this issue: https://github.com/apache/spark/pull/37267 > Fix join stats estimation if one side can keep uniqueness > - > > Key: SPARK-39851 > URL: https://issues.apache.org/jira/browse/SPARK-39851 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.4.0 >Reporter: Yuming Wang >Priority: Major > > {code:sql} > SELECT i_item_sk ss_item_sk > FROM item, >(SELECT DISTINCT iss.i_brand_id brand_id, > iss.i_class_id class_id, > iss.i_category_id category_id > FROM item iss) x > WHERE i_brand_id = brand_id >AND i_class_id = class_id >AND i_category_id = category_id > {code} > Current: > {noformat} > == Optimized Logical Plan == > Project [i_item_sk#4 AS ss_item_sk#54], Statistics(sizeInBytes=370.8 MiB, > rowCount=3.24E+7) > +- Join Inner, (((i_brand_id#11 = brand_id#51) AND (i_class_id#13 = > class_id#52)) AND (i_category_id#15 = category_id#53)), > Statistics(sizeInBytes=1112.3 MiB, rowCount=3.24E+7) >:- Project [i_item_sk#4, i_brand_id#11, i_class_id#13, i_category_id#15], > Statistics(sizeInBytes=4.6 MiB, rowCount=2.02E+5) >: +- Filter ((isnotnull(i_brand_id#11) AND isnotnull(i_class_id#13)) AND > isnotnull(i_category_id#15)), Statistics(sizeInBytes=84.6 MiB, > rowCount=2.02E+5) >: +- Relation > spark_catalog.default.item[i_item_sk#4,i_item_id#5,i_rec_start_date#6,i_rec_end_date#7,i_item_desc#8,i_current_price#9,i_wholesale_cost#10,i_brand_id#11,i_brand#12,i_class_id#13,i_class#14,i_category_id#15,i_category#16,i_manufact_id#17,i_manufact#18,i_size#19,i_formulation#20,i_color#21,i_units#22,i_container#23,i_manager_id#24,i_product_name#25] > parquet, Statistics(sizeInBytes=85.2 MiB, rowCount=2.04E+5) >+- Aggregate [brand_id#51, class_id#52, category_id#53], [brand_id#51, > class_id#52, category_id#53], Statistics(sizeInBytes=2.6 MiB, > rowCount=1.37E+5) > +- Project [i_brand_id#62 AS brand_id#51, i_class_id#64 AS class_id#52, > i_category_id#66 AS category_id#53], Statistics(sizeInBytes=3.9 MiB, > rowCount=2.02E+5) > +- Filter ((isnotnull(i_brand_id#62) AND isnotnull(i_class_id#64)) > AND isnotnull(i_category_id#66)), Statistics(sizeInBytes=84.6 MiB, > rowCount=2.02E+5) > +- Relation > spark_catalog.default.item[i_item_sk#55,i_item_id#56,i_rec_start_date#57,i_rec_end_date#58,i_item_desc#59,i_current_price#60,i_wholesale_cost#61,i_brand_id#62,i_brand#63,i_class_id#64,i_class#65,i_category_id#66,i_category#67,i_manufact_id#68,i_manufact#69,i_size#70,i_formulation#71,i_color#72,i_units#73,i_container#74,i_manager_id#75,i_product_name#76] > parquet, Statistics(sizeInBytes=85.2 MiB, rowCount=2.04E+5) > {noformat} > Expected: > {noformat} > == Optimized Logical Plan == > Project [i_item_sk#4 AS ss_item_sk#54], Statistics(sizeInBytes=2.3 MiB, > rowCount=2.02E+5) > +- Join Inner, (((i_brand_id#11 = brand_id#51) AND (i_class_id#13 = > class_id#52)) AND (i_category_id#15 = category_id#53)), > Statistics(sizeInBytes=7.0 MiB, rowCount=2.02E+5) >:- Project [i_item_sk#4, i_brand_id#11, i_class_id#13, i_category_id#15], > Statistics(sizeInBytes=4.6 MiB, rowCount=2.02E+5) >: +- Filter ((isnotnull(i_brand_id#11) AND isnotnull(i_class_id#13)) AND > isnotnull(i_category_id#15)), Statistics(sizeInBytes=84.6 MiB, > rowCount=2.02E+5) >: +- Relation > 
spark_catalog.default.item[i_item_sk#4,i_item_id#5,i_rec_start_date#6,i_rec_end_date#7,i_item_desc#8,i_current_price#9,i_wholesale_cost#10,i_brand_id#11,i_brand#12,i_class_id#13,i_class#14,i_category_id#15,i_category#16,i_manufact_id#17,i_manufact#18,i_size#19,i_formulation#20,i_color#21,i_units#22,i_container#23,i_manager_id#24,i_product_name#25] > parquet, Statistics(sizeInBytes=85.2 MiB, rowCount=2.04E+5) >+- Aggregate [brand_id#51, class_id#52, category_id#53], [brand_id#51, > class_id#52, category_id#53], Statistics(sizeInBytes=2.6 MiB, > rowCount=1.37E+5) > +- Project [i_brand_id#62 AS brand_id#51, i_class_id#64 AS class_id#52, > i_category_id#66 AS category_id#53], Statistics(sizeInBytes=3.9 MiB, > rowCount=2.02E+5) > +- Filter ((isnotnull(i_brand_id#62) AND isnotnull(i_class_id#64)) > AND isnotnull(i_category_id#66)), Statistics(sizeInBytes=84.6 MiB, > rowCount=2.02E+5) > +- Relation >
[jira] [Assigned] (SPARK-39851) Fix join stats estimation if one side can keep uniqueness
[ https://issues.apache.org/jira/browse/SPARK-39851?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-39851: Assignee: Apache Spark > Fix join stats estimation if one side can keep uniqueness > - > > Key: SPARK-39851 > URL: https://issues.apache.org/jira/browse/SPARK-39851 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.4.0 >Reporter: Yuming Wang >Assignee: Apache Spark >Priority: Major > > {code:sql} > SELECT i_item_sk ss_item_sk > FROM item, >(SELECT DISTINCT iss.i_brand_id brand_id, > iss.i_class_id class_id, > iss.i_category_id category_id > FROM item iss) x > WHERE i_brand_id = brand_id >AND i_class_id = class_id >AND i_category_id = category_id > {code} > Current: > {noformat} > == Optimized Logical Plan == > Project [i_item_sk#4 AS ss_item_sk#54], Statistics(sizeInBytes=370.8 MiB, > rowCount=3.24E+7) > +- Join Inner, (((i_brand_id#11 = brand_id#51) AND (i_class_id#13 = > class_id#52)) AND (i_category_id#15 = category_id#53)), > Statistics(sizeInBytes=1112.3 MiB, rowCount=3.24E+7) >:- Project [i_item_sk#4, i_brand_id#11, i_class_id#13, i_category_id#15], > Statistics(sizeInBytes=4.6 MiB, rowCount=2.02E+5) >: +- Filter ((isnotnull(i_brand_id#11) AND isnotnull(i_class_id#13)) AND > isnotnull(i_category_id#15)), Statistics(sizeInBytes=84.6 MiB, > rowCount=2.02E+5) >: +- Relation > spark_catalog.default.item[i_item_sk#4,i_item_id#5,i_rec_start_date#6,i_rec_end_date#7,i_item_desc#8,i_current_price#9,i_wholesale_cost#10,i_brand_id#11,i_brand#12,i_class_id#13,i_class#14,i_category_id#15,i_category#16,i_manufact_id#17,i_manufact#18,i_size#19,i_formulation#20,i_color#21,i_units#22,i_container#23,i_manager_id#24,i_product_name#25] > parquet, Statistics(sizeInBytes=85.2 MiB, rowCount=2.04E+5) >+- Aggregate [brand_id#51, class_id#52, category_id#53], [brand_id#51, > class_id#52, category_id#53], Statistics(sizeInBytes=2.6 MiB, > rowCount=1.37E+5) > +- Project [i_brand_id#62 AS brand_id#51, i_class_id#64 AS class_id#52, > i_category_id#66 AS category_id#53], Statistics(sizeInBytes=3.9 MiB, > rowCount=2.02E+5) > +- Filter ((isnotnull(i_brand_id#62) AND isnotnull(i_class_id#64)) > AND isnotnull(i_category_id#66)), Statistics(sizeInBytes=84.6 MiB, > rowCount=2.02E+5) > +- Relation > spark_catalog.default.item[i_item_sk#55,i_item_id#56,i_rec_start_date#57,i_rec_end_date#58,i_item_desc#59,i_current_price#60,i_wholesale_cost#61,i_brand_id#62,i_brand#63,i_class_id#64,i_class#65,i_category_id#66,i_category#67,i_manufact_id#68,i_manufact#69,i_size#70,i_formulation#71,i_color#72,i_units#73,i_container#74,i_manager_id#75,i_product_name#76] > parquet, Statistics(sizeInBytes=85.2 MiB, rowCount=2.04E+5) > {noformat} > Expected: > {noformat} > == Optimized Logical Plan == > Project [i_item_sk#4 AS ss_item_sk#54], Statistics(sizeInBytes=2.3 MiB, > rowCount=2.02E+5) > +- Join Inner, (((i_brand_id#11 = brand_id#51) AND (i_class_id#13 = > class_id#52)) AND (i_category_id#15 = category_id#53)), > Statistics(sizeInBytes=7.0 MiB, rowCount=2.02E+5) >:- Project [i_item_sk#4, i_brand_id#11, i_class_id#13, i_category_id#15], > Statistics(sizeInBytes=4.6 MiB, rowCount=2.02E+5) >: +- Filter ((isnotnull(i_brand_id#11) AND isnotnull(i_class_id#13)) AND > isnotnull(i_category_id#15)), Statistics(sizeInBytes=84.6 MiB, > rowCount=2.02E+5) >: +- Relation > 
spark_catalog.default.item[i_item_sk#4,i_item_id#5,i_rec_start_date#6,i_rec_end_date#7,i_item_desc#8,i_current_price#9,i_wholesale_cost#10,i_brand_id#11,i_brand#12,i_class_id#13,i_class#14,i_category_id#15,i_category#16,i_manufact_id#17,i_manufact#18,i_size#19,i_formulation#20,i_color#21,i_units#22,i_container#23,i_manager_id#24,i_product_name#25] > parquet, Statistics(sizeInBytes=85.2 MiB, rowCount=2.04E+5) >+- Aggregate [brand_id#51, class_id#52, category_id#53], [brand_id#51, > class_id#52, category_id#53], Statistics(sizeInBytes=2.6 MiB, > rowCount=1.37E+5) > +- Project [i_brand_id#62 AS brand_id#51, i_class_id#64 AS class_id#52, > i_category_id#66 AS category_id#53], Statistics(sizeInBytes=3.9 MiB, > rowCount=2.02E+5) > +- Filter ((isnotnull(i_brand_id#62) AND isnotnull(i_class_id#64)) > AND isnotnull(i_category_id#66)), Statistics(sizeInBytes=84.6 MiB, > rowCount=2.02E+5) > +- Relation >
[jira] [Commented] (SPARK-39852) Unify v1 and v2 DESCRIBE TABLE tests for columns
[ https://issues.apache.org/jira/browse/SPARK-39852?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17570468#comment-17570468 ] Apache Spark commented on SPARK-39852: -- User 'MaxGekk' has created a pull request for this issue: https://github.com/apache/spark/pull/37266 > Unify v1 and v2 DESCRIBE TABLE tests for columns > > > Key: SPARK-39852 > URL: https://issues.apache.org/jira/browse/SPARK-39852 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.4.0 >Reporter: Max Gekk >Assignee: Max Gekk >Priority: Major > > Write or move v1 and v2 tests for the DESCRIBE TABLE command for columns. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-39852) Unify v1 and v2 DESCRIBE TABLE tests for columns
[ https://issues.apache.org/jira/browse/SPARK-39852?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-39852: Assignee: Max Gekk (was: Apache Spark) > Unify v1 and v2 DESCRIBE TABLE tests for columns > > > Key: SPARK-39852 > URL: https://issues.apache.org/jira/browse/SPARK-39852 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.4.0 >Reporter: Max Gekk >Assignee: Max Gekk >Priority: Major > > Write or move v1 and v2 tests for the DESCRIBE TABLE command for columns. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-39852) Unify v1 and v2 DESCRIBE TABLE tests for columns
[ https://issues.apache.org/jira/browse/SPARK-39852?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-39852: Assignee: Apache Spark (was: Max Gekk) > Unify v1 and v2 DESCRIBE TABLE tests for columns > > > Key: SPARK-39852 > URL: https://issues.apache.org/jira/browse/SPARK-39852 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.4.0 >Reporter: Max Gekk >Assignee: Apache Spark >Priority: Major > > Write or move v1 and v2 tests for the DESCRIBE TABLE command for columns. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-39852) Unify v1 and v2 DESCRIBE TABLE tests for columns
Max Gekk created SPARK-39852: Summary: Unify v1 and v2 DESCRIBE TABLE tests for columns Key: SPARK-39852 URL: https://issues.apache.org/jira/browse/SPARK-39852 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 3.4.0 Reporter: Max Gekk Assignee: Max Gekk Write or move v1 and v2 tests for the DESCRIBE TABLE command for columns. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
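For context, the column variant of the command that these unified tests exercise looks roughly like this (the table and column names are hypothetical, not from the patch):

{code:scala}
// DESCRIBE TABLE with a trailing column name reports that column's
// metadata; EXTENDED also shows min/max/NDV stats once ANALYZE has run.
spark.sql("CREATE TABLE t (c INT COMMENT 'an int column') USING parquet")
spark.sql("DESCRIBE TABLE EXTENDED t c").show(truncate = false)
{code}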
[jira] [Created] (SPARK-39851) Fix join stats estimation if one side can keep uniqueness
Yuming Wang created SPARK-39851: --- Summary: Fix join stats estimation if one side can keep uniqueness Key: SPARK-39851 URL: https://issues.apache.org/jira/browse/SPARK-39851 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.4.0 Reporter: Yuming Wang {code:sql} SELECT i_item_sk ss_item_sk FROM item, (SELECT DISTINCT iss.i_brand_id brand_id, iss.i_class_id class_id, iss.i_category_id category_id FROM item iss) x WHERE i_brand_id = brand_id AND i_class_id = class_id AND i_category_id = category_id {code} Current: {noformat} == Optimized Logical Plan == Project [i_item_sk#4 AS ss_item_sk#54], Statistics(sizeInBytes=370.8 MiB, rowCount=3.24E+7) +- Join Inner, (((i_brand_id#11 = brand_id#51) AND (i_class_id#13 = class_id#52)) AND (i_category_id#15 = category_id#53)), Statistics(sizeInBytes=1112.3 MiB, rowCount=3.24E+7) :- Project [i_item_sk#4, i_brand_id#11, i_class_id#13, i_category_id#15], Statistics(sizeInBytes=4.6 MiB, rowCount=2.02E+5) : +- Filter ((isnotnull(i_brand_id#11) AND isnotnull(i_class_id#13)) AND isnotnull(i_category_id#15)), Statistics(sizeInBytes=84.6 MiB, rowCount=2.02E+5) : +- Relation spark_catalog.default.item[i_item_sk#4,i_item_id#5,i_rec_start_date#6,i_rec_end_date#7,i_item_desc#8,i_current_price#9,i_wholesale_cost#10,i_brand_id#11,i_brand#12,i_class_id#13,i_class#14,i_category_id#15,i_category#16,i_manufact_id#17,i_manufact#18,i_size#19,i_formulation#20,i_color#21,i_units#22,i_container#23,i_manager_id#24,i_product_name#25] parquet, Statistics(sizeInBytes=85.2 MiB, rowCount=2.04E+5) +- Aggregate [brand_id#51, class_id#52, category_id#53], [brand_id#51, class_id#52, category_id#53], Statistics(sizeInBytes=2.6 MiB, rowCount=1.37E+5) +- Project [i_brand_id#62 AS brand_id#51, i_class_id#64 AS class_id#52, i_category_id#66 AS category_id#53], Statistics(sizeInBytes=3.9 MiB, rowCount=2.02E+5) +- Filter ((isnotnull(i_brand_id#62) AND isnotnull(i_class_id#64)) AND isnotnull(i_category_id#66)), Statistics(sizeInBytes=84.6 MiB, rowCount=2.02E+5) +- Relation spark_catalog.default.item[i_item_sk#55,i_item_id#56,i_rec_start_date#57,i_rec_end_date#58,i_item_desc#59,i_current_price#60,i_wholesale_cost#61,i_brand_id#62,i_brand#63,i_class_id#64,i_class#65,i_category_id#66,i_category#67,i_manufact_id#68,i_manufact#69,i_size#70,i_formulation#71,i_color#72,i_units#73,i_container#74,i_manager_id#75,i_product_name#76] parquet, Statistics(sizeInBytes=85.2 MiB, rowCount=2.04E+5) {noformat} Expected: {noformat} == Optimized Logical Plan == Project [i_item_sk#4 AS ss_item_sk#54], Statistics(sizeInBytes=2.3 MiB, rowCount=2.02E+5) +- Join Inner, (((i_brand_id#11 = brand_id#51) AND (i_class_id#13 = class_id#52)) AND (i_category_id#15 = category_id#53)), Statistics(sizeInBytes=7.0 MiB, rowCount=2.02E+5) :- Project [i_item_sk#4, i_brand_id#11, i_class_id#13, i_category_id#15], Statistics(sizeInBytes=4.6 MiB, rowCount=2.02E+5) : +- Filter ((isnotnull(i_brand_id#11) AND isnotnull(i_class_id#13)) AND isnotnull(i_category_id#15)), Statistics(sizeInBytes=84.6 MiB, rowCount=2.02E+5) : +- Relation spark_catalog.default.item[i_item_sk#4,i_item_id#5,i_rec_start_date#6,i_rec_end_date#7,i_item_desc#8,i_current_price#9,i_wholesale_cost#10,i_brand_id#11,i_brand#12,i_class_id#13,i_class#14,i_category_id#15,i_category#16,i_manufact_id#17,i_manufact#18,i_size#19,i_formulation#20,i_color#21,i_units#22,i_container#23,i_manager_id#24,i_product_name#25] parquet, Statistics(sizeInBytes=85.2 MiB, rowCount=2.04E+5) +- Aggregate [brand_id#51, class_id#52, 
category_id#53], Statistics(sizeInBytes=2.6 MiB, rowCount=1.37E+5) +- Project [i_brand_id#62 AS brand_id#51, i_class_id#64 AS class_id#52, i_category_id#66 AS category_id#53], Statistics(sizeInBytes=3.9 MiB, rowCount=2.02E+5) +- Filter ((isnotnull(i_brand_id#62) AND isnotnull(i_class_id#64)) AND isnotnull(i_category_id#66)), Statistics(sizeInBytes=84.6 MiB, rowCount=2.02E+5) +- Relation spark_catalog.default.item[i_item_sk#55,i_item_id#56,i_rec_start_date#57,i_rec_end_date#58,i_item_desc#59,i_current_price#60,i_wholesale_cost#61,i_brand_id#62,i_brand#63,i_class_id#64,i_class#65,i_category_id#66,i_category#67,i_manufact_id#68,i_manufact#69,i_size#70,i_formulation#71,i_color#72,i_units#73,i_container#74,i_manager_id#75,i_product_name#76] parquet, Statistics(sizeInBytes=85.2 MiB, rowCount=2.04E+5) {noformat} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
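To reproduce the estimates quoted above, one can print the optimized plan with its statistics; a rough sketch, assuming CBO is enabled and column statistics have been collected for `item`:

{code:scala}
spark.sql("SET spark.sql.cbo.enabled=true")
spark.sql("ANALYZE TABLE item COMPUTE STATISTICS FOR ALL COLUMNS")

val df = spark.sql("""
  SELECT i_item_sk ss_item_sk
  FROM item,
       (SELECT DISTINCT iss.i_brand_id brand_id,
                        iss.i_class_id class_id,
                        iss.i_category_id category_id
        FROM item iss) x
  WHERE i_brand_id = brand_id
    AND i_class_id = class_id
    AND i_category_id = category_id
""")

// "cost" mode prints the optimized logical plan annotated with
// Statistics(sizeInBytes=..., rowCount=...), as in the plans above.
df.explain("cost")
{code}

Since the aggregate's output is unique on (brand_id, class_id, category_id), each probe-side row can match at most one build-side row, which is why the expected join row count stays at 2.02E+5 rather than blowing up to 3.24E+7.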
[jira] [Assigned] (SPARK-39850) Print applicationId once applied from yarn rm
[ https://issues.apache.org/jira/browse/SPARK-39850?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-39850: Assignee: (was: Apache Spark) > Print applicationId once applied from yarn rm > - > > Key: SPARK-39850 > URL: https://issues.apache.org/jira/browse/SPARK-39850 > Project: Spark > Issue Type: Improvement > Components: YARN >Affects Versions: 3.3.0 >Reporter: LiDongwei >Priority: Major > > As we all know, between when the client gets an application from YARN and when it submits the application to YARN, there is still a lot of work to do. If an application fails during this work, the user cannot easily find out the application ID. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-39850) Print applicationId once applied from yarn rm
[ https://issues.apache.org/jira/browse/SPARK-39850?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17570447#comment-17570447 ] Apache Spark commented on SPARK-39850: -- User 'DongweiLee' has created a pull request for this issue: https://github.com/apache/spark/pull/37265 > Print applicationId once applied from yarn rm > - > > Key: SPARK-39850 > URL: https://issues.apache.org/jira/browse/SPARK-39850 > Project: Spark > Issue Type: Improvement > Components: YARN >Affects Versions: 3.3.0 >Reporter: LiDongwei >Priority: Major > > As we all know, between when the client gets an application from YARN and when it submits the application to YARN, there is still a lot of work to do. If an application fails during this work, the user cannot easily find out the application ID. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-39850) Print applicationId once applied from yarn rm
[ https://issues.apache.org/jira/browse/SPARK-39850?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-39850: Assignee: Apache Spark > Print applicationId once applied from yarn rm > - > > Key: SPARK-39850 > URL: https://issues.apache.org/jira/browse/SPARK-39850 > Project: Spark > Issue Type: Improvement > Components: YARN >Affects Versions: 3.3.0 >Reporter: LiDongwei >Assignee: Apache Spark >Priority: Major > > As we all know, between when the client gets an application from YARN and when it submits the application to YARN, there is still a lot of work to do. If an application fails during this work, the user cannot easily find out the application ID. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-39850) Print applicationId once applied from yarn rm
LiDongwei created SPARK-39850: - Summary: Print applicationId once applied from yarn rm Key: SPARK-39850 URL: https://issues.apache.org/jira/browse/SPARK-39850 Project: Spark Issue Type: Improvement Components: YARN Affects Versions: 3.3.0 Reporter: LiDongwei As we all know, between when the client gets an application from YARN and when it submits the application to YARN, there is still a lot of work to do. If an application fails during this work, the user cannot easily find out the application ID. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
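The application ID is already known as soon as the RM grants a new application, before the submission steps that can still fail; a hypothetical sketch against the plain YARN client API (not the actual Spark Client code) of where such a log line would go:

{code:scala}
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.yarn.client.api.YarnClient

val yarnClient = YarnClient.createYarnClient()
yarnClient.init(new Configuration())
yarnClient.start()

// The id exists here, before the container launch context is built,
// resources are uploaded, and the application is actually submitted.
val newApp = yarnClient.createApplication()
val appId = newApp.getNewApplicationResponse.getApplicationId
println(s"Application id obtained from RM: $appId")
{code}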
[jira] [Commented] (SPARK-39849) Dataset.as(StructType) fills missing new columns with null value
[ https://issues.apache.org/jira/browse/SPARK-39849?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17570443#comment-17570443 ] Apache Spark commented on SPARK-39849: -- User 'c21' has created a pull request for this issue: https://github.com/apache/spark/pull/37264 > Dataset.as(StructType) fills missing new columns with null value > > > Key: SPARK-39849 > URL: https://issues.apache.org/jira/browse/SPARK-39849 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.4.0 >Reporter: Cheng Su >Priority: Minor > > As a followup of > [https://github.com/apache/spark/pull/37011#discussion_r917700960] , it would > be great to fill missing new columns with null values, instead of failing out > loud. Note it would only work for nullable columns. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-39849) Dataset.as(StructType) fills missing new columns with null value
[ https://issues.apache.org/jira/browse/SPARK-39849?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-39849: Assignee: (was: Apache Spark) > Dataset.as(StructType) fills missing new columns with null value > > > Key: SPARK-39849 > URL: https://issues.apache.org/jira/browse/SPARK-39849 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.4.0 >Reporter: Cheng Su >Priority: Minor > > As a followup of > [https://github.com/apache/spark/pull/37011#discussion_r917700960] , it would > be great to fill missing new columns with null values, instead of failing out > loud. Note it would only work for nullable columns. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-39849) Dataset.as(StructType) fills missing new columns with null value
[ https://issues.apache.org/jira/browse/SPARK-39849?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-39849: Assignee: Apache Spark > Dataset.as(StructType) fills missing new columns with null value > > > Key: SPARK-39849 > URL: https://issues.apache.org/jira/browse/SPARK-39849 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.4.0 >Reporter: Cheng Su >Assignee: Apache Spark >Priority: Minor > > As a followup of > [https://github.com/apache/spark/pull/37011#discussion_r917700960] , it would > be great to fill missing new columns with null values, instead of failing out > loud. Note it would only work for nullable columns. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-39849) Dataset.as(StructType) fills missing new columns with null value
Cheng Su created SPARK-39849: Summary: Dataset.as(StructType) fills missing new columns with null value Key: SPARK-39849 URL: https://issues.apache.org/jira/browse/SPARK-39849 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.4.0 Reporter: Cheng Su As a followup of [https://github.com/apache/spark/pull/37011#discussion_r917700960] , it would be great to fill missing new columns with null values, instead of failing out loud. Note it would only work for nullable columns. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
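A sketch of the requested behavior, assuming the `Dataset.as(StructType)` API from the linked PR (`spark` is a SparkSession; the `extra` column is hypothetical):

{code:scala}
import org.apache.spark.sql.types._
import spark.implicits._

val df = Seq((1, "a"), (2, "b")).toDF("id", "name")

val target = StructType(Seq(
  StructField("id", IntegerType),
  StructField("name", StringType),
  StructField("extra", StringType, nullable = true) // not present in df
))

// Today the missing `extra` column makes this fail; with the proposed
// change, a missing *nullable* column would instead be filled with nulls.
df.as(target).show()
{code}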
[jira] [Commented] (SPARK-39743) Unable to set zstd compression level while writing parquet files
[ https://issues.apache.org/jira/browse/SPARK-39743?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17570435#comment-17570435 ] Apache Spark commented on SPARK-39743: -- User 'ming95' has created a pull request for this issue: https://github.com/apache/spark/pull/37263 > Unable to set zstd compression level while writing parquet files > > > Key: SPARK-39743 > URL: https://issues.apache.org/jira/browse/SPARK-39743 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.2.0 >Reporter: Yeachan Park >Priority: Minor > > While writing zstd compressed parquet files, the setting `spark.io.compression.zstd.level` does not have any effect on the compression level of zstd. > All files seem to be written with the default zstd compression level, and the config option seems to be ignored. > Using the zstd cli tool, we confirmed that setting a higher compression level for the same file tested in Spark resulted in a smaller file. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-39743) Unable to set zstd compression level while writing parquet files
[ https://issues.apache.org/jira/browse/SPARK-39743?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-39743: Assignee: (was: Apache Spark) > Unable to set zstd compression level while writing parquet files > > > Key: SPARK-39743 > URL: https://issues.apache.org/jira/browse/SPARK-39743 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.2.0 >Reporter: Yeachan Park >Priority: Minor > > While writing zstd compressed parquet files, the setting `spark.io.compression.zstd.level` does not have any effect on the compression level of zstd. > All files seem to be written with the default zstd compression level, and the config option seems to be ignored. > Using the zstd cli tool, we confirmed that setting a higher compression level for the same file tested in Spark resulted in a smaller file. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-39743) Unable to set zstd compression level while writing parquet files
[ https://issues.apache.org/jira/browse/SPARK-39743?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-39743: Assignee: Apache Spark > Unable to set zstd compression level while writing parquet files > > > Key: SPARK-39743 > URL: https://issues.apache.org/jira/browse/SPARK-39743 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.2.0 >Reporter: Yeachan Park >Assignee: Apache Spark >Priority: Minor > > While writing zstd compressed parquet files, the setting `spark.io.compression.zstd.level` does not have any effect on the compression level of zstd. > All files seem to be written with the default zstd compression level, and the config option seems to be ignored. > Using the zstd cli tool, we confirmed that setting a higher compression level for the same file tested in Spark resulted in a smaller file. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
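A repro sketch of the report (data and path are illustrative): the session-level setting is accepted, but the Parquet output comes out at zstd's default level.

{code:scala}
// spark.io.compression.zstd.level tunes Spark's internal IO streams
// (shuffle spills, event logs, etc.); the Parquet writer never reads it.
spark.conf.set("spark.io.compression.zstd.level", "19")

spark.range(0, 1000000L).write
  .option("compression", "zstd")
  .parquet("/tmp/zstd-level-test") // size matches the default level
{code}

If the underlying writer is parquet-mr, the knob that actually reaches the codec is presumably the Hadoop property `parquet.compression.codec.zstd.level`, which is what a fix would need to expose or pass through.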