[jira] [Updated] (SPARK-39858) Remove unnecessary AliasHelper or PredicateHelper for some rules
[ https://issues.apache.org/jira/browse/SPARK-39858?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] jiaan.geng updated SPARK-39858: --- Summary: Remove unnecessary AliasHelper or PredicateHelper for some rules (was: Remove unnecessary AliasHelper for some rules) > Remove unnecessary AliasHelper or PredicateHelper for some rules > > > Key: SPARK-39858 > URL: https://issues.apache.org/jira/browse/SPARK-39858 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.4.0 >Reporter: jiaan.geng >Priority: Major > > While using AliasHelper, I found that some rules extend it (or PredicateHelper) but do not actually use it.
[jira] [Created] (SPARK-39858) Remove unnecessary AliasHelper for some rules
jiaan.geng created SPARK-39858: -- Summary: Remove unnecessary AliasHelper for some rules Key: SPARK-39858 URL: https://issues.apache.org/jira/browse/SPARK-39858 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 2.4.0 Reporter: jiaan.geng While using AliasHelper, I found that some rules extend it but do not actually use it.
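For context, a minimal sketch of the pattern being cleaned up (the rule name below is hypothetical; PredicateHelper and splitConjunctivePredicates are real Catalyst APIs): a rule mixes in a helper trait without ever calling its methods, so the mix-in can simply be dropped.

{code:java}
import org.apache.spark.sql.catalyst.expressions.PredicateHelper
import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
import org.apache.spark.sql.catalyst.rules.Rule

// Hypothetical rule illustrating the anti-pattern: it mixes in
// PredicateHelper but never calls any helper method (such as
// splitConjunctivePredicates), so "with PredicateHelper" is dead weight.
object SomeCleanupRule extends Rule[LogicalPlan] with PredicateHelper {
  override def apply(plan: LogicalPlan): LogicalPlan = plan // no helper usage
}
{code}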
[jira] [Resolved] (SPARK-39837) Filesystem leak when running `TPC-DS queries with SF=1`
[ https://issues.apache.org/jira/browse/SPARK-39837?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yang Jie resolved SPARK-39837. -- Resolution: Not A Bug The connections are just closed late, not leaked. > Filesystem leak when running `TPC-DS queries with SF=1` > --- > > Key: SPARK-39837 > URL: https://issues.apache.org/jira/browse/SPARK-39837 > Project: Spark > Issue Type: Bug > Components: Tests >Affects Versions: 3.4.0 >Reporter: Yang Jie >Priority: Major > > The following log appears in the `TPC-DS queries with SF=1` GA logs: > > {code:java} > 2022-07-22T00:19:52.8539664Z 00:19:52.849 WARN > org.apache.spark.DebugFilesystem: Leaked filesystem connection created at: > 2022-07-22T00:19:52.8548926Z java.lang.Throwable > 2022-07-22T00:19:52.8568135Z at > org.apache.spark.DebugFilesystem$.addOpenStream(DebugFilesystem.scala:35) > 2022-07-22T00:19:52.8573547Z at > org.apache.spark.DebugFilesystem.open(DebugFilesystem.scala:75) > 2022-07-22T00:19:52.8574108Z at > org.apache.hadoop.fs.FileSystem.open(FileSystem.java:976) > 2022-07-22T00:19:52.8578427Z at > org.apache.parquet.hadoop.util.HadoopInputFile.newStream(HadoopInputFile.java:69) > 2022-07-22T00:19:52.8579211Z at > org.apache.parquet.hadoop.ParquetFileReader.<init>(ParquetFileReader.java:774) > 2022-07-22T00:19:52.8589698Z at > org.apache.spark.sql.execution.datasources.parquet.SpecificParquetRecordReaderBase.initialize(SpecificParquetRecordReaderBase.java:100) > 2022-07-22T00:19:52.8590842Z at > org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.initialize(VectorizedParquetRecordReader.java:175) > 2022-07-22T00:19:52.8594751Z at > org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat.$anonfun$buildReaderWithPartitionValues$1(ParquetFileFormat.scala:340) > 2022-07-22T00:19:52.8595634Z at > org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.org$apache$spark$sql$execution$datasources$FileScanRDD$$anon$$readCurrentFile(FileScanRDD.scala:211) > 2022-07-22T00:19:52.8598975Z at > org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:272) > 2022-07-22T00:19:52.8599639Z at > org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:118) > 2022-07-22T00:19:52.8602839Z at > org.apache.spark.sql.execution.FileSourceScanExec$$anon$1.hasNext(DataSourceScanExec.scala:583) > 2022-07-22T00:19:52.8603625Z at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage5.columnartorow_nextBatch_0$(Unknown > Source) > 2022-07-22T00:19:52.8606618Z at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage5.processNext(Unknown > Source) > 2022-07-22T00:19:52.8609954Z at > org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) > 2022-07-22T00:19:52.8620028Z at > org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:760) > 2022-07-22T00:19:52.8623148Z at > scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460) > 2022-07-22T00:19:52.8623812Z at > org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:140) > 2022-07-22T00:19:52.8627344Z at > org.apache.spark.shuffle.ShuffleWriteProcessor.write(ShuffleWriteProcessor.scala:59) > 2022-07-22T00:19:52.8628031Z at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:101) > 2022-07-22T00:19:52.8637881Z at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53) > 
2022-07-22T00:19:52.8638603Z at > org.apache.spark.scheduler.Task.run(Task.scala:139) > 2022-07-22T00:19:52.8644696Z at > org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:548) > 2022-07-22T00:19:52.8645352Z at > org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1490) > 2022-07-22T00:19:52.8649598Z at > org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:551) > 2022-07-22T00:19:52.8650238Z at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) > 2022-07-22T00:19:52.8657783Z at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) > 2022-07-22T00:19:52.8658260Z at java.lang.Thread.run(Thread.java:750){code} > > > The following GitHub Actions runs show similar logs: > * [https://github.com/apache/spark/runs/7460003953?check_suite_focus=true] > * [https://github.com/apache/spark/runs/7459868605?check_suite_focus=true] > * [https://github.com/apache/spark/runs/7460262731?check_suite_focus=true]
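The warning above comes from Spark's test-only DebugFilesystem, which remembers where every stream was opened so anything still open at check time can be reported with its creation stack trace. A simplified sketch of that tracking idea (not the actual Spark code) also shows why a late close can masquerade as a leak:

{code:java}
import java.util.concurrent.ConcurrentHashMap
import scala.collection.JavaConverters._

// Simplified sketch: record the creation site of every opened stream so
// that anything still open at check time can be reported.
object OpenStreamTracker {
  private val openStreams = new ConcurrentHashMap[AnyRef, Throwable]()

  def addOpenStream(stream: AnyRef): Unit =
    openStreams.put(stream, new Throwable()) // capture where it was opened

  def removeOpenStream(stream: AnyRef): Unit =
    openStreams.remove(stream)

  // A stream closed *after* this runs looks like a leak even though it is
  // merely closed late, which is what SPARK-39837 turned out to be.
  def reportOpenStreams(): Unit =
    openStreams.values().asScala.foreach { creationSite =>
      println("Leaked filesystem connection created at:")
      creationSite.getStackTrace.foreach(frame => println(s"  at $frame"))
    }
}
{code}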
[jira] [Assigned] (SPARK-39857) V2ExpressionBuilder uses the wrong LiteralValue data type for In predicate
[ https://issues.apache.org/jira/browse/SPARK-39857?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-39857: Assignee: Apache Spark > V2ExpressionBuilder uses the wrong LiteralValue data type for In predicate > -- > > Key: SPARK-39857 > URL: https://issues.apache.org/jira/browse/SPARK-39857 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.4.0 >Reporter: Huaxin Gao >Assignee: Apache Spark >Priority: Minor > > When building the V2 In predicate in V2ExpressionBuilder, InSet.dataType (which is BooleanType) is used to build the LiteralValue; InSet.child.dataType should be used instead.
[jira] [Commented] (SPARK-39857) V2ExpressionBuilder uses the wrong LiteralValue data type for In predicate
[ https://issues.apache.org/jira/browse/SPARK-39857?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17570622#comment-17570622 ] Apache Spark commented on SPARK-39857: -- User 'huaxingao' has created a pull request for this issue: https://github.com/apache/spark/pull/37271 > V2ExpressionBuilder uses the wrong LiteralValue data type for In predicate > -- > > Key: SPARK-39857 > URL: https://issues.apache.org/jira/browse/SPARK-39857 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.4.0 >Reporter: Huaxin Gao >Priority: Minor > > When building the V2 In predicate in V2ExpressionBuilder, InSet.dataType (which is BooleanType) is used to build the LiteralValue; InSet.child.dataType should be used instead.
[jira] [Commented] (SPARK-39857) V2ExpressionBuilder uses the wrong LiteralValue data type for In predicate
[ https://issues.apache.org/jira/browse/SPARK-39857?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17570621#comment-17570621 ] Apache Spark commented on SPARK-39857: -- User 'huaxingao' has created a pull request for this issue: https://github.com/apache/spark/pull/37271 > V2ExpressionBuilder uses the wrong LiteralValue data type for In predicate > -- > > Key: SPARK-39857 > URL: https://issues.apache.org/jira/browse/SPARK-39857 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.4.0 >Reporter: Huaxin Gao >Priority: Minor > > When building the V2 In predicate in V2ExpressionBuilder, InSet.dataType (which is BooleanType) is used to build the LiteralValue; InSet.child.dataType should be used instead.
[jira] [Assigned] (SPARK-39857) V2ExpressionBuilder uses the wrong LiteralValue data type for In predicate
[ https://issues.apache.org/jira/browse/SPARK-39857?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-39857: Assignee: (was: Apache Spark) > V2ExpressionBuilder uses the wrong LiteralValue data type for In predicate > -- > > Key: SPARK-39857 > URL: https://issues.apache.org/jira/browse/SPARK-39857 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.4.0 >Reporter: Huaxin Gao >Priority: Minor > > When building the V2 In predicate in V2ExpressionBuilder, InSet.dataType (which is BooleanType) is used to build the LiteralValue; InSet.child.dataType should be used instead.
[jira] [Created] (SPARK-39857) V2ExpressionBuilder uses the wrong LiteralValue data type for In predicate
Huaxin Gao created SPARK-39857: -- Summary: V2ExpressionBuilder uses the wrong LiteralValue data type for In predicate Key: SPARK-39857 URL: https://issues.apache.org/jira/browse/SPARK-39857 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.4.0 Reporter: Huaxin Gao When building the V2 In predicate in V2ExpressionBuilder, InSet.dataType (which is BooleanType) is used to build the LiteralValue; InSet.child.dataType should be used instead.
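A sketch of the described fix, using the real connector API types (LiteralValue, Predicate); the helper function and names below are illustrative, not the actual V2ExpressionBuilder code:

{code:java}
import org.apache.spark.sql.catalyst.expressions.InSet
import org.apache.spark.sql.connector.expressions.{Expression => V2Expression, LiteralValue}
import org.apache.spark.sql.connector.expressions.filter.Predicate

// Illustrative helper. For `col IN (1, 2, 3)`, each element literal must
// carry the type of the child expression; InSet.dataType is the type of the
// predicate itself (BooleanType), which is what the bug attached to every
// LiteralValue.
def buildV2InPredicate(inSet: InSet, column: V2Expression): Predicate = {
  val elementType = inSet.child.dataType // correct element type
  // val elementType = inSet.dataType    // the bug: always BooleanType
  val values = inSet.hset.toSeq.map(v => LiteralValue(v, elementType))
  new Predicate("IN", (column +: values).toArray)
}
{code}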
[jira] [Resolved] (SPARK-39856) Avoid OOM in TPC-DS build with SMJ
[ https://issues.apache.org/jira/browse/SPARK-39856?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-39856. -- Fix Version/s: 3.3.1 3.0.4 3.1.4 3.2.3 3.4.0 Resolution: Fixed Issue resolved by pull request 37270 [https://github.com/apache/spark/pull/37270] > Avoid OOM in TPC-DS build with SMJ > -- > > Key: SPARK-39856 > URL: https://issues.apache.org/jira/browse/SPARK-39856 > Project: Spark > Issue Type: Test > Components: SQL >Affects Versions: 3.4.0 >Reporter: Hyukjin Kwon >Assignee: Hyukjin Kwon >Priority: Major > Fix For: 3.3.1, 3.0.4, 3.1.4, 3.2.3, 3.4.0 > > > The TPC-DS build consistently fails (see https://github.com/apache/spark/runs/7491836477?check_suite_focus=true), presumably because of an out-of-memory error.
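For context, "with SMJ" refers to the TPC-DS job variant that runs the queries with sort-merge join. A sketch of how a test harness typically forces SMJ (both config keys are real Spark SQL configs; their use here is an assumption about the build setup, not taken from the ticket):

{code:java}
import org.apache.spark.sql.SparkSession

// Assumed test-harness setup: disabling broadcast joins (threshold -1) and
// preferring sort-merge join steers the planner to SMJ for every query.
val spark = SparkSession.builder()
  .master("local[2]")
  .config("spark.sql.autoBroadcastJoinThreshold", "-1")
  .config("spark.sql.join.preferSortMergeJoin", "true")
  .getOrCreate()
{code}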
[jira] [Assigned] (SPARK-39856) Avoid OOM in TPC-DS build with SMJ
[ https://issues.apache.org/jira/browse/SPARK-39856?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-39856: Assignee: Hyukjin Kwon > Avoid OOM in TPC-DS build with SMJ > -- > > Key: SPARK-39856 > URL: https://issues.apache.org/jira/browse/SPARK-39856 > Project: Spark > Issue Type: Test > Components: SQL >Affects Versions: 3.4.0 >Reporter: Hyukjin Kwon >Assignee: Hyukjin Kwon >Priority: Major > > The TPC-DS build consistently fails (see https://github.com/apache/spark/runs/7491836477?check_suite_focus=true), presumably because of an out-of-memory error.
[jira] [Resolved] (SPARK-39840) Factor PythonArrowInput out as a symmetry to PythonArrowOutput
[ https://issues.apache.org/jira/browse/SPARK-39840?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-39840. -- Fix Version/s: 3.4.0 Resolution: Fixed Issue resolved by pull request 37253 [https://github.com/apache/spark/pull/37253] > Factor PythonArrowInput out as a symmetry to PythonArrowOutput > -- > > Key: SPARK-39840 > URL: https://issues.apache.org/jira/browse/SPARK-39840 > Project: Spark > Issue Type: Improvement > Components: PySpark, SQL >Affects Versions: 3.4.0 >Reporter: Hyukjin Kwon >Assignee: Hyukjin Kwon >Priority: Major > Fix For: 3.4.0 > > > In https://issues.apache.org/jira/browse/SPARK-29317, we factored {{PythonArrowOutput}} out. It would be better to factor {{PythonArrowInput}} out too, for consistency.
[jira] [Assigned] (SPARK-39840) Factor PythonArrowInput out as a symmetry to PythonArrowOutput
[ https://issues.apache.org/jira/browse/SPARK-39840?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-39840: Assignee: Hyukjin Kwon > Factor PythonArrowInput out as a symmetry to PythonArrowOutput > -- > > Key: SPARK-39840 > URL: https://issues.apache.org/jira/browse/SPARK-39840 > Project: Spark > Issue Type: Improvement > Components: PySpark, SQL >Affects Versions: 3.4.0 >Reporter: Hyukjin Kwon >Assignee: Hyukjin Kwon >Priority: Major > > In https://issues.apache.org/jira/browse/SPARK-29317, we factored {{PythonArrowOutput}} out. It would be better to factor {{PythonArrowInput}} out too, for consistency.
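A rough sketch of the symmetry the ticket asks for; the trait shapes and signatures below are illustrative assumptions, not Spark's actual definitions:

{code:java}
import java.io.{DataInputStream, DataOutputStream}

// Illustrative trait shapes only. PythonArrowOutput covers "read Arrow
// batches coming back from the Python worker"; the ticket factors out the
// mirror image, "write input data as Arrow batches to the Python worker".
trait PythonArrowInput[IN] {
  protected def writeArrowBatches(input: Iterator[IN], out: DataOutputStream): Unit
}

trait PythonArrowOutput[OUT] {
  protected def readArrowBatches(in: DataInputStream): Iterator[OUT]
}
{code}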
[jira] [Commented] (SPARK-39856) Avoid OOM in TPC-DS build with SMJ
[ https://issues.apache.org/jira/browse/SPARK-39856?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17570588#comment-17570588 ] Apache Spark commented on SPARK-39856: -- User 'HyukjinKwon' has created a pull request for this issue: https://github.com/apache/spark/pull/37270 > Avoid OOM in TPC-DS build with SMJ > -- > > Key: SPARK-39856 > URL: https://issues.apache.org/jira/browse/SPARK-39856 > Project: Spark > Issue Type: Test > Components: SQL >Affects Versions: 3.4.0 >Reporter: Hyukjin Kwon >Priority: Major > > The TPC-DS build consistently fails (see https://github.com/apache/spark/runs/7491836477?check_suite_focus=true), presumably because of an out-of-memory error.
[jira] [Commented] (SPARK-39856) Avoid OOM in TPC-DS build with SMJ
[ https://issues.apache.org/jira/browse/SPARK-39856?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17570587#comment-17570587 ] Apache Spark commented on SPARK-39856: -- User 'HyukjinKwon' has created a pull request for this issue: https://github.com/apache/spark/pull/37270 > Avoid OOM in TPC-DS build with SMJ > -- > > Key: SPARK-39856 > URL: https://issues.apache.org/jira/browse/SPARK-39856 > Project: Spark > Issue Type: Test > Components: SQL >Affects Versions: 3.4.0 >Reporter: Hyukjin Kwon >Priority: Major > > The TPC-DS build consistently fails (see https://github.com/apache/spark/runs/7491836477?check_suite_focus=true), presumably because of an out-of-memory error.
[jira] [Assigned] (SPARK-39856) Avoid OOM in TPC-DS build with SMJ
[ https://issues.apache.org/jira/browse/SPARK-39856?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-39856: Assignee: Apache Spark > Avoid OOM in TPC-DS build with SMJ > -- > > Key: SPARK-39856 > URL: https://issues.apache.org/jira/browse/SPARK-39856 > Project: Spark > Issue Type: Test > Components: SQL >Affects Versions: 3.4.0 >Reporter: Hyukjin Kwon >Assignee: Apache Spark >Priority: Major > > The TPC-DS build consistently fails (see https://github.com/apache/spark/runs/7491836477?check_suite_focus=true), presumably because of an out-of-memory error.
[jira] [Assigned] (SPARK-39856) Avoid OOM in TPC-DS build with SMJ
[ https://issues.apache.org/jira/browse/SPARK-39856?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-39856: Assignee: (was: Apache Spark) > Avoid OOM in TPC-DS build with SMJ > -- > > Key: SPARK-39856 > URL: https://issues.apache.org/jira/browse/SPARK-39856 > Project: Spark > Issue Type: Test > Components: SQL >Affects Versions: 3.4.0 >Reporter: Hyukjin Kwon >Priority: Major > > The TPC-DS build consistently fails (see https://github.com/apache/spark/runs/7491836477?check_suite_focus=true), presumably because of an out-of-memory error.
[jira] [Created] (SPARK-39856) Avoid OOM in TPC-DS build with SMJ
Hyukjin Kwon created SPARK-39856: Summary: Avoid OOM in TPC-DS build with SMJ Key: SPARK-39856 URL: https://issues.apache.org/jira/browse/SPARK-39856 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.4.0 Reporter: Hyukjin Kwon The TPC-DS build consistently fails (see https://github.com/apache/spark/runs/7491836477?check_suite_focus=true), presumably because of an out-of-memory error.
[jira] [Updated] (SPARK-39856) Avoid OOM in TPC-DS build with SMJ
[ https://issues.apache.org/jira/browse/SPARK-39856?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-39856: - Issue Type: Test (was: Improvement) > Avoid OOM in TPC-DS build with SMJ > -- > > Key: SPARK-39856 > URL: https://issues.apache.org/jira/browse/SPARK-39856 > Project: Spark > Issue Type: Test > Components: SQL >Affects Versions: 3.4.0 >Reporter: Hyukjin Kwon >Priority: Major > > The TPC-DS build consistently fails (see https://github.com/apache/spark/runs/7491836477?check_suite_focus=true), presumably because of an out-of-memory error.
[jira] [Assigned] (SPARK-39854) Catalyst 'ColumnPruning' Optimizer does not play well with sql function 'explode'
[ https://issues.apache.org/jira/browse/SPARK-39854?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-39854: Assignee: Apache Spark > Catalyst 'ColumnPruning' Optimizer does not play well with sql function > 'explode' > - > > Key: SPARK-39854 > URL: https://issues.apache.org/jira/browse/SPARK-39854 > Project: Spark > Issue Type: Bug > Components: Optimizer >Affects Versions: 3.2.1, 3.3.0 > Environment: Spark version: the latest (3.4.0-SNAPSHOT) > OS: Ubuntu 20.04 > JDK: Amazon corretto-11.0.14.1 >Reporter: Jiaji Wu >Assignee: Apache Spark >Priority: Major > > The *ColumnPruning* optimizer batch does not always work with *explode* sql > function. > * Here's a code snippet to repro the issue: > > {code:java} > import spark.implicits._ > val testJson = > """{ > | "b": { > | "id": "id00", > | "data": [{ > | "b1": "vb1", > | "b2": 101, > | "ex2": [ > |{ "fb1": false, "fb2": 11, "fb3": "t1" }, > |{ "fb1": true, "fb2": 12, "fb3": "t2" } > | ]}, { > | "b1": "vb2", > | "b2": 102, > | "ex2": [ > |{ "fb1": false, "fb2": 13, "fb3": "t3" }, > |{ "fb1": true, "fb2": 14, "fb3": "t4" } > | ]} > | ], > | "fa": "tes", > | "v": "1.5" > | } > |} > |""".stripMargin > val df = spark.read.json((testJson :: Nil).toDS()) > .withColumn("ex_b", explode($"b.data.ex2")) > .withColumn("ex_b2", explode($"ex_b")) > val df1 = df > .withColumn("rt", struct( > $"b.fa".alias("rt_fa"), > $"b.v".alias("rt_v") > )) > .drop("b", "ex_b") > df1.show(false){code} > * the result exception: > {code:java} > Exception in thread "main" java.lang.IllegalStateException: Couldn't find > _extract_v#35 in [_extract_fa#36,ex_b2#13] > at > org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:80) > at > org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:73) > at > org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDownWithPruning$1(TreeNode.scala:584) > at > org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:176) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformDownWithPruning(TreeNode.scala:584) > at > org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDownWithPruning$3(TreeNode.scala:589) > at > scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:286) > at scala.collection.Iterator.foreach(Iterator.scala:943) > at scala.collection.Iterator.foreach$(Iterator.scala:943) > at scala.collection.AbstractIterator.foreach(Iterator.scala:1431) > at scala.collection.IterableLike.foreach(IterableLike.scala:74) > at scala.collection.IterableLike.foreach$(IterableLike.scala:73) > at scala.collection.AbstractIterable.foreach(Iterable.scala:56) > at scala.collection.TraversableLike.map(TraversableLike.scala:286) > at scala.collection.TraversableLike.map$(TraversableLike.scala:279) > at scala.collection.AbstractTraversable.map(Traversable.scala:108) > at > org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:698) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformDownWithPruning(TreeNode.scala:589) > at > org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDownWithPruning$3(TreeNode.scala:589) > at > org.apache.spark.sql.catalyst.trees.UnaryLike.mapChildren(TreeNode.scala:1196) > at > org.apache.spark.sql.catalyst.trees.UnaryLike.mapChildren$(TreeNode.scala:1195) > at > org.apache.spark.sql.catalyst.expressions.UnaryExpression.mapChildren(Expression.scala:513) > at > 
org.apache.spark.sql.catalyst.trees.TreeNode.transformDownWithPruning(TreeNode.scala:589) > at > org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDownWithPruning$3(TreeNode.scala:589) > at > org.apache.spark.sql.catalyst.trees.UnaryLike.mapChildren(TreeNode.scala:1196) > at > org.apache.spark.sql.catalyst.trees.UnaryLike.mapChildren$(TreeNode.scala:1195) > at > org.apache.spark.sql.catalyst.expressions.UnaryExpression.mapChildren(Expression.scala:513) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformDownWithPruning(TreeNode.scala:589) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:560) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transform(TreeNode.scala:528) > at > org.apache.spark.sql.catalyst.expressions.BindReferences$.bindReference(BoundAttribute.scala:73) > at >
[jira] [Commented] (SPARK-39854) Catalyst 'ColumnPruning' Optimizer does not play well with sql function 'explode'
[ https://issues.apache.org/jira/browse/SPARK-39854?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17570540#comment-17570540 ] Apache Spark commented on SPARK-39854: -- User 'jiaji-wu' has created a pull request for this issue: https://github.com/apache/spark/pull/37269 > Catalyst 'ColumnPruning' Optimizer does not play well with sql function > 'explode' > - > > Key: SPARK-39854 > URL: https://issues.apache.org/jira/browse/SPARK-39854 > Project: Spark > Issue Type: Bug > Components: Optimizer >Affects Versions: 3.2.1, 3.3.0 > Environment: Spark version: the latest (3.4.0-SNAPSHOT) > OS: Ubuntu 20.04 > JDK: Amazon corretto-11.0.14.1 >Reporter: Jiaji Wu >Priority: Major > > The *ColumnPruning* optimizer batch does not always work with *explode* sql > function. > * Here's a code snippet to repro the issue: > > {code:java} > import spark.implicits._ > val testJson = > """{ > | "b": { > | "id": "id00", > | "data": [{ > | "b1": "vb1", > | "b2": 101, > | "ex2": [ > |{ "fb1": false, "fb2": 11, "fb3": "t1" }, > |{ "fb1": true, "fb2": 12, "fb3": "t2" } > | ]}, { > | "b1": "vb2", > | "b2": 102, > | "ex2": [ > |{ "fb1": false, "fb2": 13, "fb3": "t3" }, > |{ "fb1": true, "fb2": 14, "fb3": "t4" } > | ]} > | ], > | "fa": "tes", > | "v": "1.5" > | } > |} > |""".stripMargin > val df = spark.read.json((testJson :: Nil).toDS()) > .withColumn("ex_b", explode($"b.data.ex2")) > .withColumn("ex_b2", explode($"ex_b")) > val df1 = df > .withColumn("rt", struct( > $"b.fa".alias("rt_fa"), > $"b.v".alias("rt_v") > )) > .drop("b", "ex_b") > df1.show(false){code} > * the result exception: > {code:java} > Exception in thread "main" java.lang.IllegalStateException: Couldn't find > _extract_v#35 in [_extract_fa#36,ex_b2#13] > at > org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:80) > at > org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:73) > at > org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDownWithPruning$1(TreeNode.scala:584) > at > org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:176) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformDownWithPruning(TreeNode.scala:584) > at > org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDownWithPruning$3(TreeNode.scala:589) > at > scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:286) > at scala.collection.Iterator.foreach(Iterator.scala:943) > at scala.collection.Iterator.foreach$(Iterator.scala:943) > at scala.collection.AbstractIterator.foreach(Iterator.scala:1431) > at scala.collection.IterableLike.foreach(IterableLike.scala:74) > at scala.collection.IterableLike.foreach$(IterableLike.scala:73) > at scala.collection.AbstractIterable.foreach(Iterable.scala:56) > at scala.collection.TraversableLike.map(TraversableLike.scala:286) > at scala.collection.TraversableLike.map$(TraversableLike.scala:279) > at scala.collection.AbstractTraversable.map(Traversable.scala:108) > at > org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:698) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformDownWithPruning(TreeNode.scala:589) > at > org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDownWithPruning$3(TreeNode.scala:589) > at > org.apache.spark.sql.catalyst.trees.UnaryLike.mapChildren(TreeNode.scala:1196) > at > org.apache.spark.sql.catalyst.trees.UnaryLike.mapChildren$(TreeNode.scala:1195) > at > 
org.apache.spark.sql.catalyst.expressions.UnaryExpression.mapChildren(Expression.scala:513) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformDownWithPruning(TreeNode.scala:589) > at > org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDownWithPruning$3(TreeNode.scala:589) > at > org.apache.spark.sql.catalyst.trees.UnaryLike.mapChildren(TreeNode.scala:1196) > at > org.apache.spark.sql.catalyst.trees.UnaryLike.mapChildren$(TreeNode.scala:1195) > at > org.apache.spark.sql.catalyst.expressions.UnaryExpression.mapChildren(Expression.scala:513) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformDownWithPruning(TreeNode.scala:589) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:560) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transform(TreeNode.scala:528) > at >
[jira] [Assigned] (SPARK-39854) Catalyst 'ColumnPruning' Optimizer does not play well with sql function 'explode'
[ https://issues.apache.org/jira/browse/SPARK-39854?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-39854: Assignee: (was: Apache Spark) > Catalyst 'ColumnPruning' Optimizer does not play well with sql function > 'explode' > - > > Key: SPARK-39854 > URL: https://issues.apache.org/jira/browse/SPARK-39854 > Project: Spark > Issue Type: Bug > Components: Optimizer >Affects Versions: 3.2.1, 3.3.0 > Environment: Spark version: the latest (3.4.0-SNAPSHOT) > OS: Ubuntu 20.04 > JDK: Amazon corretto-11.0.14.1 >Reporter: Jiaji Wu >Priority: Major > > The *ColumnPruning* optimizer batch does not always work with *explode* sql > function. > * Here's a code snippet to repro the issue: > > {code:java} > import spark.implicits._ > val testJson = > """{ > | "b": { > | "id": "id00", > | "data": [{ > | "b1": "vb1", > | "b2": 101, > | "ex2": [ > |{ "fb1": false, "fb2": 11, "fb3": "t1" }, > |{ "fb1": true, "fb2": 12, "fb3": "t2" } > | ]}, { > | "b1": "vb2", > | "b2": 102, > | "ex2": [ > |{ "fb1": false, "fb2": 13, "fb3": "t3" }, > |{ "fb1": true, "fb2": 14, "fb3": "t4" } > | ]} > | ], > | "fa": "tes", > | "v": "1.5" > | } > |} > |""".stripMargin > val df = spark.read.json((testJson :: Nil).toDS()) > .withColumn("ex_b", explode($"b.data.ex2")) > .withColumn("ex_b2", explode($"ex_b")) > val df1 = df > .withColumn("rt", struct( > $"b.fa".alias("rt_fa"), > $"b.v".alias("rt_v") > )) > .drop("b", "ex_b") > df1.show(false){code} > * the result exception: > {code:java} > Exception in thread "main" java.lang.IllegalStateException: Couldn't find > _extract_v#35 in [_extract_fa#36,ex_b2#13] > at > org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:80) > at > org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:73) > at > org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDownWithPruning$1(TreeNode.scala:584) > at > org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:176) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformDownWithPruning(TreeNode.scala:584) > at > org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDownWithPruning$3(TreeNode.scala:589) > at > scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:286) > at scala.collection.Iterator.foreach(Iterator.scala:943) > at scala.collection.Iterator.foreach$(Iterator.scala:943) > at scala.collection.AbstractIterator.foreach(Iterator.scala:1431) > at scala.collection.IterableLike.foreach(IterableLike.scala:74) > at scala.collection.IterableLike.foreach$(IterableLike.scala:73) > at scala.collection.AbstractIterable.foreach(Iterable.scala:56) > at scala.collection.TraversableLike.map(TraversableLike.scala:286) > at scala.collection.TraversableLike.map$(TraversableLike.scala:279) > at scala.collection.AbstractTraversable.map(Traversable.scala:108) > at > org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:698) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformDownWithPruning(TreeNode.scala:589) > at > org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDownWithPruning$3(TreeNode.scala:589) > at > org.apache.spark.sql.catalyst.trees.UnaryLike.mapChildren(TreeNode.scala:1196) > at > org.apache.spark.sql.catalyst.trees.UnaryLike.mapChildren$(TreeNode.scala:1195) > at > org.apache.spark.sql.catalyst.expressions.UnaryExpression.mapChildren(Expression.scala:513) > at > 
org.apache.spark.sql.catalyst.trees.TreeNode.transformDownWithPruning(TreeNode.scala:589) > at > org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDownWithPruning$3(TreeNode.scala:589) > at > org.apache.spark.sql.catalyst.trees.UnaryLike.mapChildren(TreeNode.scala:1196) > at > org.apache.spark.sql.catalyst.trees.UnaryLike.mapChildren$(TreeNode.scala:1195) > at > org.apache.spark.sql.catalyst.expressions.UnaryExpression.mapChildren(Expression.scala:513) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformDownWithPruning(TreeNode.scala:589) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:560) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transform(TreeNode.scala:528) > at > org.apache.spark.sql.catalyst.expressions.BindReferences$.bindReference(BoundAttribute.scala:73) > at >
[jira] [Commented] (SPARK-39854) Catalyst 'ColumnPruning' Optimizer does not play well with sql function 'explode'
[ https://issues.apache.org/jira/browse/SPARK-39854?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17570541#comment-17570541 ] Apache Spark commented on SPARK-39854: -- User 'jiaji-wu' has created a pull request for this issue: https://github.com/apache/spark/pull/37269 > Catalyst 'ColumnPruning' Optimizer does not play well with sql function > 'explode' > - > > Key: SPARK-39854 > URL: https://issues.apache.org/jira/browse/SPARK-39854 > Project: Spark > Issue Type: Bug > Components: Optimizer >Affects Versions: 3.2.1, 3.3.0 > Environment: Spark version: the latest (3.4.0-SNAPSHOT) > OS: Ubuntu 20.04 > JDK: Amazon corretto-11.0.14.1 >Reporter: Jiaji Wu >Assignee: Apache Spark >Priority: Major > > The *ColumnPruning* optimizer batch does not always work with *explode* sql > function. > * Here's a code snippet to repro the issue: > > {code:java} > import spark.implicits._ > val testJson = > """{ > | "b": { > | "id": "id00", > | "data": [{ > | "b1": "vb1", > | "b2": 101, > | "ex2": [ > |{ "fb1": false, "fb2": 11, "fb3": "t1" }, > |{ "fb1": true, "fb2": 12, "fb3": "t2" } > | ]}, { > | "b1": "vb2", > | "b2": 102, > | "ex2": [ > |{ "fb1": false, "fb2": 13, "fb3": "t3" }, > |{ "fb1": true, "fb2": 14, "fb3": "t4" } > | ]} > | ], > | "fa": "tes", > | "v": "1.5" > | } > |} > |""".stripMargin > val df = spark.read.json((testJson :: Nil).toDS()) > .withColumn("ex_b", explode($"b.data.ex2")) > .withColumn("ex_b2", explode($"ex_b")) > val df1 = df > .withColumn("rt", struct( > $"b.fa".alias("rt_fa"), > $"b.v".alias("rt_v") > )) > .drop("b", "ex_b") > df1.show(false){code} > * the result exception: > {code:java} > Exception in thread "main" java.lang.IllegalStateException: Couldn't find > _extract_v#35 in [_extract_fa#36,ex_b2#13] > at > org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:80) > at > org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:73) > at > org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDownWithPruning$1(TreeNode.scala:584) > at > org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:176) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformDownWithPruning(TreeNode.scala:584) > at > org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDownWithPruning$3(TreeNode.scala:589) > at > scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:286) > at scala.collection.Iterator.foreach(Iterator.scala:943) > at scala.collection.Iterator.foreach$(Iterator.scala:943) > at scala.collection.AbstractIterator.foreach(Iterator.scala:1431) > at scala.collection.IterableLike.foreach(IterableLike.scala:74) > at scala.collection.IterableLike.foreach$(IterableLike.scala:73) > at scala.collection.AbstractIterable.foreach(Iterable.scala:56) > at scala.collection.TraversableLike.map(TraversableLike.scala:286) > at scala.collection.TraversableLike.map$(TraversableLike.scala:279) > at scala.collection.AbstractTraversable.map(Traversable.scala:108) > at > org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:698) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformDownWithPruning(TreeNode.scala:589) > at > org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDownWithPruning$3(TreeNode.scala:589) > at > org.apache.spark.sql.catalyst.trees.UnaryLike.mapChildren(TreeNode.scala:1196) > at > 
org.apache.spark.sql.catalyst.trees.UnaryLike.mapChildren$(TreeNode.scala:1195) > at > org.apache.spark.sql.catalyst.expressions.UnaryExpression.mapChildren(Expression.scala:513) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformDownWithPruning(TreeNode.scala:589) > at > org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDownWithPruning$3(TreeNode.scala:589) > at > org.apache.spark.sql.catalyst.trees.UnaryLike.mapChildren(TreeNode.scala:1196) > at > org.apache.spark.sql.catalyst.trees.UnaryLike.mapChildren$(TreeNode.scala:1195) > at > org.apache.spark.sql.catalyst.expressions.UnaryExpression.mapChildren(Expression.scala:513) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformDownWithPruning(TreeNode.scala:589) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:560) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transform(TreeNode.scala:528) > at >
[jira] [Commented] (SPARK-39854) Catalyst 'ColumnPruning' Optimizer does not play well with sql function 'explode'
[ https://issues.apache.org/jira/browse/SPARK-39854?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17570535#comment-17570535 ] Jiaji Wu commented on SPARK-39854: -- One workaround is to exclude *ColumnPruning* by setting the Spark config: {color:#54b33e}"spark.sql.optimizer.excludedRules" {color}-> {color:#54b33e}"org.apache.spark.sql.catalyst.optimizer.ColumnPruning"{color} > Catalyst 'ColumnPruning' Optimizer does not play well with sql function > 'explode' > - > > Key: SPARK-39854 > URL: https://issues.apache.org/jira/browse/SPARK-39854 > Project: Spark > Issue Type: Bug > Components: Optimizer >Affects Versions: 3.2.1, 3.3.0 > Environment: Spark version: the latest (3.4.0-SNAPSHOT) > OS: Ubuntu 20.04 > JDK: Amazon corretto-11.0.14.1 >Reporter: Jiaji Wu >Priority: Major > > The *ColumnPruning* optimizer batch does not always work with *explode* sql > function. > * Here's a code snippet to repro the issue: > > {code:java} > import spark.implicits._ > val testJson = > """{ > | "b": { > | "id": "id00", > | "data": [{ > | "b1": "vb1", > | "b2": 101, > | "ex2": [ > |{ "fb1": false, "fb2": 11, "fb3": "t1" }, > |{ "fb1": true, "fb2": 12, "fb3": "t2" } > | ]}, { > | "b1": "vb2", > | "b2": 102, > | "ex2": [ > |{ "fb1": false, "fb2": 13, "fb3": "t3" }, > |{ "fb1": true, "fb2": 14, "fb3": "t4" } > | ]} > | ], > | "fa": "tes", > | "v": "1.5" > | } > |} > |""".stripMargin > val df = spark.read.json((testJson :: Nil).toDS()) > .withColumn("ex_b", explode($"b.data.ex2")) > .withColumn("ex_b2", explode($"ex_b")) > val df1 = df > .withColumn("rt", struct( > $"b.fa".alias("rt_fa"), > $"b.v".alias("rt_v") > )) > .drop("b", "ex_b") > df1.show(false){code} > * the result exception: > {code:java} > Exception in thread "main" java.lang.IllegalStateException: Couldn't find > _extract_v#35 in [_extract_fa#36,ex_b2#13] > at > org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:80) > at > org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:73) > at > org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDownWithPruning$1(TreeNode.scala:584) > at > org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:176) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformDownWithPruning(TreeNode.scala:584) > at > org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDownWithPruning$3(TreeNode.scala:589) > at > scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:286) > at scala.collection.Iterator.foreach(Iterator.scala:943) > at scala.collection.Iterator.foreach$(Iterator.scala:943) > at scala.collection.AbstractIterator.foreach(Iterator.scala:1431) > at scala.collection.IterableLike.foreach(IterableLike.scala:74) > at scala.collection.IterableLike.foreach$(IterableLike.scala:73) > at scala.collection.AbstractIterable.foreach(Iterable.scala:56) > at scala.collection.TraversableLike.map(TraversableLike.scala:286) > at scala.collection.TraversableLike.map$(TraversableLike.scala:279) > at scala.collection.AbstractTraversable.map(Traversable.scala:108) > at > org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:698) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformDownWithPruning(TreeNode.scala:589) > at > org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDownWithPruning$3(TreeNode.scala:589) > at > org.apache.spark.sql.catalyst.trees.UnaryLike.mapChildren(TreeNode.scala:1196) > at > 
org.apache.spark.sql.catalyst.trees.UnaryLike.mapChildren$(TreeNode.scala:1195) > at > org.apache.spark.sql.catalyst.expressions.UnaryExpression.mapChildren(Expression.scala:513) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformDownWithPruning(TreeNode.scala:589) > at > org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDownWithPruning$3(TreeNode.scala:589) > at > org.apache.spark.sql.catalyst.trees.UnaryLike.mapChildren(TreeNode.scala:1196) > at > org.apache.spark.sql.catalyst.trees.UnaryLike.mapChildren$(TreeNode.scala:1195) > at > org.apache.spark.sql.catalyst.expressions.UnaryExpression.mapChildren(Expression.scala:513) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformDownWithPruning(TreeNode.scala:589) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:560) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transform(TreeNode.scala:528) > at
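The workaround from the comment above, in code form ({{spark.sql.optimizer.excludedRules}} is a standard, runtime-configurable SQL config; note that excluding ColumnPruning disables the optimization globally, so expect wider plans):

{code:java}
import org.apache.spark.sql.SparkSession

// Workaround from the comment above: excluding ColumnPruning avoids the
// IllegalStateException, at the cost of disabling column pruning globally.
val spark = SparkSession.builder()
  .config("spark.sql.optimizer.excludedRules",
    "org.apache.spark.sql.catalyst.optimizer.ColumnPruning")
  .getOrCreate()

// The conf is runtime-configurable, so it can also be set on a live session:
spark.conf.set("spark.sql.optimizer.excludedRules",
  "org.apache.spark.sql.catalyst.optimizer.ColumnPruning")
{code}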
[jira] [Commented] (SPARK-39855) Unable to set zstd compression level while writing orc files
[ https://issues.apache.org/jira/browse/SPARK-39855?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17570504#comment-17570504 ] shezm commented on SPARK-39855: --- I will follow up on this issue. > Unable to set zstd compression level while writing orc files > > > Key: SPARK-39855 > URL: https://issues.apache.org/jira/browse/SPARK-39855 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.2.0 >Reporter: shezm >Priority: Major > > Similar to this issue: https://issues.apache.org/jira/browse/SPARK-39743
[jira] [Created] (SPARK-39855) Unable to set zstd compression level while writing orc files
shezm created SPARK-39855: - Summary: Unable to set zstd compression level while writing orc files Key: SPARK-39855 URL: https://issues.apache.org/jira/browse/SPARK-39855 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.2.0 Reporter: shezm Similar to this issue: https://issues.apache.org/jira/browse/SPARK-39743
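To make the report concrete: selecting zstd as the ORC codec already works, but there is no knob for the compression level. In the sketch below, the codec config is a real Spark SQL config; the level option shown commented out is hypothetical and only illustrates what is missing (the Parquet-side analogue is SPARK-39743):

{code:java}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[2]").getOrCreate()
val df = spark.range(1000).toDF("id")

// Choosing zstd as the ORC codec works today:
spark.conf.set("spark.sql.orc.compression.codec", "zstd")
df.write.format("orc").save("/tmp/orc-zstd")

// ...but there is no supported way to pick the zstd compression *level*; a
// hypothetical option like the one below does not exist as of this report:
// df.write.format("orc")
//   .option("compression", "zstd")
//   .option("compression.level", "9") // hypothetical, illustrating the gap
//   .save("/tmp/orc-zstd-level9")
{code}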
[jira] [Updated] (SPARK-39854) Catalyst 'ColumnPruning' Optimizer does not play well with sql function 'explode'
[ https://issues.apache.org/jira/browse/SPARK-39854?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jiaji Wu updated SPARK-39854: - Affects Version/s: 3.2.1 > Catalyst 'ColumnPruning' Optimizer does not play well with sql function > 'explode' > - > > Key: SPARK-39854 > URL: https://issues.apache.org/jira/browse/SPARK-39854 > Project: Spark > Issue Type: Bug > Components: Optimizer >Affects Versions: 3.2.1, 3.3.0 > Environment: Spark version: the latest (3.4.0-SNAPSHOT) > OS: Ubuntu 20.04 > JDK: Amazon corretto-11.0.14.1 >Reporter: Jiaji Wu >Priority: Major > > The *ColumnPruning* optimizer batch does not always work with *explode* sql > function. > * Here's a code snippet to repro the issue: > > {code:java} > import spark.implicits._ > val testJson = > """{ > | "b": { > | "id": "id00", > | "data": [{ > | "b1": "vb1", > | "b2": 101, > | "ex2": [ > |{ "fb1": false, "fb2": 11, "fb3": "t1" }, > |{ "fb1": true, "fb2": 12, "fb3": "t2" } > | ]}, { > | "b1": "vb2", > | "b2": 102, > | "ex2": [ > |{ "fb1": false, "fb2": 13, "fb3": "t3" }, > |{ "fb1": true, "fb2": 14, "fb3": "t4" } > | ]} > | ], > | "fa": "tes", > | "v": "1.5" > | } > |} > |""".stripMargin > val df = spark.read.json((testJson :: Nil).toDS()) > .withColumn("ex_b", explode($"b.data.ex2")) > .withColumn("ex_b2", explode($"ex_b")) > val df1 = df > .withColumn("rt", struct( > $"b.fa".alias("rt_fa"), > $"b.v".alias("rt_v") > )) > .drop("b", "ex_b") > df1.show(false){code} > * the result exception: > {code:java} > Exception in thread "main" java.lang.IllegalStateException: Couldn't find > _extract_v#35 in [_extract_fa#36,ex_b2#13] > at > org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:80) > at > org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:73) > at > org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDownWithPruning$1(TreeNode.scala:584) > at > org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:176) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformDownWithPruning(TreeNode.scala:584) > at > org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDownWithPruning$3(TreeNode.scala:589) > at > scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:286) > at scala.collection.Iterator.foreach(Iterator.scala:943) > at scala.collection.Iterator.foreach$(Iterator.scala:943) > at scala.collection.AbstractIterator.foreach(Iterator.scala:1431) > at scala.collection.IterableLike.foreach(IterableLike.scala:74) > at scala.collection.IterableLike.foreach$(IterableLike.scala:73) > at scala.collection.AbstractIterable.foreach(Iterable.scala:56) > at scala.collection.TraversableLike.map(TraversableLike.scala:286) > at scala.collection.TraversableLike.map$(TraversableLike.scala:279) > at scala.collection.AbstractTraversable.map(Traversable.scala:108) > at > org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:698) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformDownWithPruning(TreeNode.scala:589) > at > org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDownWithPruning$3(TreeNode.scala:589) > at > org.apache.spark.sql.catalyst.trees.UnaryLike.mapChildren(TreeNode.scala:1196) > at > org.apache.spark.sql.catalyst.trees.UnaryLike.mapChildren$(TreeNode.scala:1195) > at > org.apache.spark.sql.catalyst.expressions.UnaryExpression.mapChildren(Expression.scala:513) > at > 
org.apache.spark.sql.catalyst.trees.TreeNode.transformDownWithPruning(TreeNode.scala:589) > at > org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDownWithPruning$3(TreeNode.scala:589) > at > org.apache.spark.sql.catalyst.trees.UnaryLike.mapChildren(TreeNode.scala:1196) > at > org.apache.spark.sql.catalyst.trees.UnaryLike.mapChildren$(TreeNode.scala:1195) > at > org.apache.spark.sql.catalyst.expressions.UnaryExpression.mapChildren(Expression.scala:513) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformDownWithPruning(TreeNode.scala:589) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:560) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transform(TreeNode.scala:528) > at > org.apache.spark.sql.catalyst.expressions.BindReferences$.bindReference(BoundAttribute.scala:73) > at >
[jira] [Created] (SPARK-39854) Catalyst 'ColumnPruning' Optimizer does not play well with sql function 'explode'
Jiaji Wu created SPARK-39854: Summary: Catalyst 'ColumnPruning' Optimizer does not play well with sql function 'explode' Key: SPARK-39854 URL: https://issues.apache.org/jira/browse/SPARK-39854 Project: Spark Issue Type: Bug Components: Optimizer Affects Versions: 3.3.0 Environment: Spark version: the latest (3.4.0-SNAPSHOT) OS: Ubuntu 20.04 JDK: Amazon corretto-11.0.14.1 Reporter: Jiaji Wu The *ColumnPruning* optimizer batch does not always work with *explode* sql function. * Here's a code snippet to repro the issue: {code:java} import spark.implicits._ val testJson = """{ | "b": { | "id": "id00", | "data": [{ | "b1": "vb1", | "b2": 101, | "ex2": [ |{ "fb1": false, "fb2": 11, "fb3": "t1" }, |{ "fb1": true, "fb2": 12, "fb3": "t2" } | ]}, { | "b1": "vb2", | "b2": 102, | "ex2": [ |{ "fb1": false, "fb2": 13, "fb3": "t3" }, |{ "fb1": true, "fb2": 14, "fb3": "t4" } | ]} | ], | "fa": "tes", | "v": "1.5" | } |} |""".stripMargin val df = spark.read.json((testJson :: Nil).toDS()) .withColumn("ex_b", explode($"b.data.ex2")) .withColumn("ex_b2", explode($"ex_b")) val df1 = df .withColumn("rt", struct( $"b.fa".alias("rt_fa"), $"b.v".alias("rt_v") )) .drop("b", "ex_b") df1.show(false){code} * the result exception: {code:java} Exception in thread "main" java.lang.IllegalStateException: Couldn't find _extract_v#35 in [_extract_fa#36,ex_b2#13] at org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:80) at org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:73) at org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDownWithPruning$1(TreeNode.scala:584) at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:176) at org.apache.spark.sql.catalyst.trees.TreeNode.transformDownWithPruning(TreeNode.scala:584) at org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDownWithPruning$3(TreeNode.scala:589) at scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:286) at scala.collection.Iterator.foreach(Iterator.scala:943) at scala.collection.Iterator.foreach$(Iterator.scala:943) at scala.collection.AbstractIterator.foreach(Iterator.scala:1431) at scala.collection.IterableLike.foreach(IterableLike.scala:74) at scala.collection.IterableLike.foreach$(IterableLike.scala:73) at scala.collection.AbstractIterable.foreach(Iterable.scala:56) at scala.collection.TraversableLike.map(TraversableLike.scala:286) at scala.collection.TraversableLike.map$(TraversableLike.scala:279) at scala.collection.AbstractTraversable.map(Traversable.scala:108) at org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:698) at org.apache.spark.sql.catalyst.trees.TreeNode.transformDownWithPruning(TreeNode.scala:589) at org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDownWithPruning$3(TreeNode.scala:589) at org.apache.spark.sql.catalyst.trees.UnaryLike.mapChildren(TreeNode.scala:1196) at org.apache.spark.sql.catalyst.trees.UnaryLike.mapChildren$(TreeNode.scala:1195) at org.apache.spark.sql.catalyst.expressions.UnaryExpression.mapChildren(Expression.scala:513) at org.apache.spark.sql.catalyst.trees.TreeNode.transformDownWithPruning(TreeNode.scala:589) at org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDownWithPruning$3(TreeNode.scala:589) at org.apache.spark.sql.catalyst.trees.UnaryLike.mapChildren(TreeNode.scala:1196) at org.apache.spark.sql.catalyst.trees.UnaryLike.mapChildren$(TreeNode.scala:1195) at 
org.apache.spark.sql.catalyst.expressions.UnaryExpression.mapChildren(Expression.scala:513) at org.apache.spark.sql.catalyst.trees.TreeNode.transformDownWithPruning(TreeNode.scala:589) at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:560) at org.apache.spark.sql.catalyst.trees.TreeNode.transform(TreeNode.scala:528) at org.apache.spark.sql.catalyst.expressions.BindReferences$.bindReference(BoundAttribute.scala:73) at org.apache.spark.sql.catalyst.expressions.BindReferences$.$anonfun$bindReferences$1(BoundAttribute.scala:94) at scala.collection.immutable.List.map(List.scala:297) at org.apache.spark.sql.catalyst.expressions.BindReferences$.bindReferences(BoundAttribute.scala:94) at org.apache.spark.sql.execution.ProjectExec.doConsume(basicPhysicalOperators.scala:69) at org.apache.spark.sql.execution.CodegenSupport.consume(WholeStageCodegenExec.scala:196) at org.apache.spark.sql.execution.CodegenSupport.consume$(WholeStageCodegenExec.scala:151) at
[jira] [Assigned] (SPARK-39853) Support stage level schedule for standalone cluster when dynamic allocation is disabled
[ https://issues.apache.org/jira/browse/SPARK-39853?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-39853: Assignee: (was: Apache Spark) > Support stage level schedule for standalone cluster when dynamic allocation is disabled > --- > > Key: SPARK-39853 > URL: https://issues.apache.org/jira/browse/SPARK-39853 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.3.0 >Reporter: huangtengfei >Priority: Major > > [SPARK-39062|https://issues.apache.org/jira/browse/SPARK-39062] added stage-level scheduling support for standalone clusters when dynamic allocation is enabled: Spark requests executors for the different resource profiles. When dynamic allocation is disabled, we can still leverage stage-level scheduling to schedule tasks onto executors with the default resource profile, based on the task resource requests of each profile.
[jira] [Commented] (SPARK-39853) Support stage level schedule for standalone cluster when dynamic allocation is disabled
[ https://issues.apache.org/jira/browse/SPARK-39853?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17570501#comment-17570501 ] Apache Spark commented on SPARK-39853: -- User 'ivoson' has created a pull request for this issue: https://github.com/apache/spark/pull/37268 > Support stage level schedule for standalone cluster when dynamic allocation is disabled > --- > > Key: SPARK-39853 > URL: https://issues.apache.org/jira/browse/SPARK-39853 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.3.0 >Reporter: huangtengfei >Priority: Major > > [SPARK-39062|https://issues.apache.org/jira/browse/SPARK-39062] added stage-level scheduling support for standalone clusters when dynamic allocation is enabled: Spark requests executors for the different resource profiles. When dynamic allocation is disabled, we can still leverage stage-level scheduling to schedule tasks onto executors with the default resource profile, based on the task resource requests of each profile.
[jira] [Assigned] (SPARK-39853) Support stage level schedule for standalone cluster when dynamic allocation is disabled
[ https://issues.apache.org/jira/browse/SPARK-39853?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-39853: Assignee: Apache Spark > Support stage level schedule for standalone cluster when dynamic allocation is disabled > --- > > Key: SPARK-39853 > URL: https://issues.apache.org/jira/browse/SPARK-39853 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.3.0 >Reporter: huangtengfei >Assignee: Apache Spark >Priority: Major > > [SPARK-39062|https://issues.apache.org/jira/browse/SPARK-39062] added stage-level scheduling support for standalone clusters when dynamic allocation is enabled: Spark requests executors for the different resource profiles. When dynamic allocation is disabled, we can still leverage stage-level scheduling to schedule tasks onto executors with the default resource profile, based on the task resource requests of each profile.
[jira] [Created] (SPARK-39853) Support stage level schedule for standalone cluster when dynamic allocation is disabled
huangtengfei created SPARK-39853: -- Summary: Support stage level schedule for standalone cluster when dynamic allocation is disabled Key: SPARK-39853 URL: https://issues.apache.org/jira/browse/SPARK-39853 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 3.3.0 Reporter: huangtengfei [SPARK-39062|https://issues.apache.org/jira/browse/SPARK-39062] added stage-level scheduling support for standalone clusters when dynamic allocation is enabled: Spark requests executors to match each resource profile. When dynamic allocation is disabled, we can also leverage stage-level scheduling to schedule tasks onto executors with the default resource profile, based on each profile's task resource requests. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
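For reference, a minimal sketch of how stage-level scheduling is driven from user code with the existing ResourceProfile API (the profile contents and the RDD are illustrative, not taken from the patch; `sc` is an active SparkContext). The proposal is that with dynamic allocation disabled, executors keep the default profile and only the per-task requests change how tasks are placed:

{code:scala}
import org.apache.spark.resource.{ResourceProfileBuilder, TaskResourceRequests}

// Illustrative task-only profile: each task in the tagged stage needs 4 cores.
val taskReqs = new TaskResourceRequests().cpus(4)
val profile = new ResourceProfileBuilder().require(taskReqs).build

val result = sc.parallelize(1 to 1000, 8)
  .withResources(profile) // stages computing this RDD honor the task requests
  .map(_ * 2)
  .collect()
{code}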
[jira] [Updated] (SPARK-39851) Improve join stats estimation if one side can keep uniqueness
[ https://issues.apache.org/jira/browse/SPARK-39851?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang updated SPARK-39851: Summary: Improve join stats estimation if one side can keep uniqueness (was: Fix join stats estimation if one side can keep uniqueness) > Improve join stats estimation if one side can keep uniqueness > - > > Key: SPARK-39851 > URL: https://issues.apache.org/jira/browse/SPARK-39851 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.4.0 >Reporter: Yuming Wang >Priority: Major > > {code:sql} > SELECT i_item_sk ss_item_sk > FROM item, >(SELECT DISTINCT iss.i_brand_id brand_id, > iss.i_class_id class_id, > iss.i_category_id category_id > FROM item iss) x > WHERE i_brand_id = brand_id >AND i_class_id = class_id >AND i_category_id = category_id > {code} > Current: > {noformat} > == Optimized Logical Plan == > Project [i_item_sk#4 AS ss_item_sk#54], Statistics(sizeInBytes=370.8 MiB, > rowCount=3.24E+7) > +- Join Inner, (((i_brand_id#11 = brand_id#51) AND (i_class_id#13 = > class_id#52)) AND (i_category_id#15 = category_id#53)), > Statistics(sizeInBytes=1112.3 MiB, rowCount=3.24E+7) >:- Project [i_item_sk#4, i_brand_id#11, i_class_id#13, i_category_id#15], > Statistics(sizeInBytes=4.6 MiB, rowCount=2.02E+5) >: +- Filter ((isnotnull(i_brand_id#11) AND isnotnull(i_class_id#13)) AND > isnotnull(i_category_id#15)), Statistics(sizeInBytes=84.6 MiB, > rowCount=2.02E+5) >: +- Relation > spark_catalog.default.item[i_item_sk#4,i_item_id#5,i_rec_start_date#6,i_rec_end_date#7,i_item_desc#8,i_current_price#9,i_wholesale_cost#10,i_brand_id#11,i_brand#12,i_class_id#13,i_class#14,i_category_id#15,i_category#16,i_manufact_id#17,i_manufact#18,i_size#19,i_formulation#20,i_color#21,i_units#22,i_container#23,i_manager_id#24,i_product_name#25] > parquet, Statistics(sizeInBytes=85.2 MiB, rowCount=2.04E+5) >+- Aggregate [brand_id#51, class_id#52, category_id#53], [brand_id#51, > class_id#52, category_id#53], Statistics(sizeInBytes=2.6 MiB, > rowCount=1.37E+5) > +- Project [i_brand_id#62 AS brand_id#51, i_class_id#64 AS class_id#52, > i_category_id#66 AS category_id#53], Statistics(sizeInBytes=3.9 MiB, > rowCount=2.02E+5) > +- Filter ((isnotnull(i_brand_id#62) AND isnotnull(i_class_id#64)) > AND isnotnull(i_category_id#66)), Statistics(sizeInBytes=84.6 MiB, > rowCount=2.02E+5) > +- Relation > spark_catalog.default.item[i_item_sk#55,i_item_id#56,i_rec_start_date#57,i_rec_end_date#58,i_item_desc#59,i_current_price#60,i_wholesale_cost#61,i_brand_id#62,i_brand#63,i_class_id#64,i_class#65,i_category_id#66,i_category#67,i_manufact_id#68,i_manufact#69,i_size#70,i_formulation#71,i_color#72,i_units#73,i_container#74,i_manager_id#75,i_product_name#76] > parquet, Statistics(sizeInBytes=85.2 MiB, rowCount=2.04E+5) > {noformat} > Expected: > {noformat} > == Optimized Logical Plan == > Project [i_item_sk#4 AS ss_item_sk#54], Statistics(sizeInBytes=2.3 MiB, > rowCount=2.02E+5) > +- Join Inner, (((i_brand_id#11 = brand_id#51) AND (i_class_id#13 = > class_id#52)) AND (i_category_id#15 = category_id#53)), > Statistics(sizeInBytes=7.0 MiB, rowCount=2.02E+5) >:- Project [i_item_sk#4, i_brand_id#11, i_class_id#13, i_category_id#15], > Statistics(sizeInBytes=4.6 MiB, rowCount=2.02E+5) >: +- Filter ((isnotnull(i_brand_id#11) AND isnotnull(i_class_id#13)) AND > isnotnull(i_category_id#15)), Statistics(sizeInBytes=84.6 MiB, > rowCount=2.02E+5) >: +- Relation > 
spark_catalog.default.item[i_item_sk#4,i_item_id#5,i_rec_start_date#6,i_rec_end_date#7,i_item_desc#8,i_current_price#9,i_wholesale_cost#10,i_brand_id#11,i_brand#12,i_class_id#13,i_class#14,i_category_id#15,i_category#16,i_manufact_id#17,i_manufact#18,i_size#19,i_formulation#20,i_color#21,i_units#22,i_container#23,i_manager_id#24,i_product_name#25] > parquet, Statistics(sizeInBytes=85.2 MiB, rowCount=2.04E+5) >+- Aggregate [brand_id#51, class_id#52, category_id#53], [brand_id#51, > class_id#52, category_id#53], Statistics(sizeInBytes=2.6 MiB, > rowCount=1.37E+5) > +- Project [i_brand_id#62 AS brand_id#51, i_class_id#64 AS class_id#52, > i_category_id#66 AS category_id#53], Statistics(sizeInBytes=3.9 MiB, > rowCount=2.02E+5) > +- Filter ((isnotnull(i_brand_id#62) AND isnotnull(i_class_id#64)) > AND isnotnull(i_category_id#66)), Statistics(sizeInBytes=84.6 MiB, > rowCount=2.02E+5) > +- Relation >
[jira] [Assigned] (SPARK-39851) Fix join stats estimation if one side can keep uniqueness
[ https://issues.apache.org/jira/browse/SPARK-39851?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-39851: Assignee: (was: Apache Spark) > Fix join stats estimation if one side can keep uniqueness > - > > Key: SPARK-39851 > URL: https://issues.apache.org/jira/browse/SPARK-39851 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.4.0 >Reporter: Yuming Wang >Priority: Major > > {code:sql} > SELECT i_item_sk ss_item_sk > FROM item, >(SELECT DISTINCT iss.i_brand_id brand_id, > iss.i_class_id class_id, > iss.i_category_id category_id > FROM item iss) x > WHERE i_brand_id = brand_id >AND i_class_id = class_id >AND i_category_id = category_id > {code} > Current: > {noformat} > == Optimized Logical Plan == > Project [i_item_sk#4 AS ss_item_sk#54], Statistics(sizeInBytes=370.8 MiB, > rowCount=3.24E+7) > +- Join Inner, (((i_brand_id#11 = brand_id#51) AND (i_class_id#13 = > class_id#52)) AND (i_category_id#15 = category_id#53)), > Statistics(sizeInBytes=1112.3 MiB, rowCount=3.24E+7) >:- Project [i_item_sk#4, i_brand_id#11, i_class_id#13, i_category_id#15], > Statistics(sizeInBytes=4.6 MiB, rowCount=2.02E+5) >: +- Filter ((isnotnull(i_brand_id#11) AND isnotnull(i_class_id#13)) AND > isnotnull(i_category_id#15)), Statistics(sizeInBytes=84.6 MiB, > rowCount=2.02E+5) >: +- Relation > spark_catalog.default.item[i_item_sk#4,i_item_id#5,i_rec_start_date#6,i_rec_end_date#7,i_item_desc#8,i_current_price#9,i_wholesale_cost#10,i_brand_id#11,i_brand#12,i_class_id#13,i_class#14,i_category_id#15,i_category#16,i_manufact_id#17,i_manufact#18,i_size#19,i_formulation#20,i_color#21,i_units#22,i_container#23,i_manager_id#24,i_product_name#25] > parquet, Statistics(sizeInBytes=85.2 MiB, rowCount=2.04E+5) >+- Aggregate [brand_id#51, class_id#52, category_id#53], [brand_id#51, > class_id#52, category_id#53], Statistics(sizeInBytes=2.6 MiB, > rowCount=1.37E+5) > +- Project [i_brand_id#62 AS brand_id#51, i_class_id#64 AS class_id#52, > i_category_id#66 AS category_id#53], Statistics(sizeInBytes=3.9 MiB, > rowCount=2.02E+5) > +- Filter ((isnotnull(i_brand_id#62) AND isnotnull(i_class_id#64)) > AND isnotnull(i_category_id#66)), Statistics(sizeInBytes=84.6 MiB, > rowCount=2.02E+5) > +- Relation > spark_catalog.default.item[i_item_sk#55,i_item_id#56,i_rec_start_date#57,i_rec_end_date#58,i_item_desc#59,i_current_price#60,i_wholesale_cost#61,i_brand_id#62,i_brand#63,i_class_id#64,i_class#65,i_category_id#66,i_category#67,i_manufact_id#68,i_manufact#69,i_size#70,i_formulation#71,i_color#72,i_units#73,i_container#74,i_manager_id#75,i_product_name#76] > parquet, Statistics(sizeInBytes=85.2 MiB, rowCount=2.04E+5) > {noformat} > Expected: > {noformat} > == Optimized Logical Plan == > Project [i_item_sk#4 AS ss_item_sk#54], Statistics(sizeInBytes=2.3 MiB, > rowCount=2.02E+5) > +- Join Inner, (((i_brand_id#11 = brand_id#51) AND (i_class_id#13 = > class_id#52)) AND (i_category_id#15 = category_id#53)), > Statistics(sizeInBytes=7.0 MiB, rowCount=2.02E+5) >:- Project [i_item_sk#4, i_brand_id#11, i_class_id#13, i_category_id#15], > Statistics(sizeInBytes=4.6 MiB, rowCount=2.02E+5) >: +- Filter ((isnotnull(i_brand_id#11) AND isnotnull(i_class_id#13)) AND > isnotnull(i_category_id#15)), Statistics(sizeInBytes=84.6 MiB, > rowCount=2.02E+5) >: +- Relation > 
spark_catalog.default.item[i_item_sk#4,i_item_id#5,i_rec_start_date#6,i_rec_end_date#7,i_item_desc#8,i_current_price#9,i_wholesale_cost#10,i_brand_id#11,i_brand#12,i_class_id#13,i_class#14,i_category_id#15,i_category#16,i_manufact_id#17,i_manufact#18,i_size#19,i_formulation#20,i_color#21,i_units#22,i_container#23,i_manager_id#24,i_product_name#25] > parquet, Statistics(sizeInBytes=85.2 MiB, rowCount=2.04E+5) >+- Aggregate [brand_id#51, class_id#52, category_id#53], [brand_id#51, > class_id#52, category_id#53], Statistics(sizeInBytes=2.6 MiB, > rowCount=1.37E+5) > +- Project [i_brand_id#62 AS brand_id#51, i_class_id#64 AS class_id#52, > i_category_id#66 AS category_id#53], Statistics(sizeInBytes=3.9 MiB, > rowCount=2.02E+5) > +- Filter ((isnotnull(i_brand_id#62) AND isnotnull(i_class_id#64)) > AND isnotnull(i_category_id#66)), Statistics(sizeInBytes=84.6 MiB, > rowCount=2.02E+5) > +- Relation >
[jira] [Commented] (SPARK-39851) Fix join stats estimation if one side can keep uniqueness
[ https://issues.apache.org/jira/browse/SPARK-39851?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17570473#comment-17570473 ] Apache Spark commented on SPARK-39851: -- User 'wangyum' has created a pull request for this issue: https://github.com/apache/spark/pull/37267 > Fix join stats estimation if one side can keep uniqueness > - > > Key: SPARK-39851 > URL: https://issues.apache.org/jira/browse/SPARK-39851 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.4.0 >Reporter: Yuming Wang >Priority: Major > > {code:sql} > SELECT i_item_sk ss_item_sk > FROM item, >(SELECT DISTINCT iss.i_brand_id brand_id, > iss.i_class_id class_id, > iss.i_category_id category_id > FROM item iss) x > WHERE i_brand_id = brand_id >AND i_class_id = class_id >AND i_category_id = category_id > {code} > Current: > {noformat} > == Optimized Logical Plan == > Project [i_item_sk#4 AS ss_item_sk#54], Statistics(sizeInBytes=370.8 MiB, > rowCount=3.24E+7) > +- Join Inner, (((i_brand_id#11 = brand_id#51) AND (i_class_id#13 = > class_id#52)) AND (i_category_id#15 = category_id#53)), > Statistics(sizeInBytes=1112.3 MiB, rowCount=3.24E+7) >:- Project [i_item_sk#4, i_brand_id#11, i_class_id#13, i_category_id#15], > Statistics(sizeInBytes=4.6 MiB, rowCount=2.02E+5) >: +- Filter ((isnotnull(i_brand_id#11) AND isnotnull(i_class_id#13)) AND > isnotnull(i_category_id#15)), Statistics(sizeInBytes=84.6 MiB, > rowCount=2.02E+5) >: +- Relation > spark_catalog.default.item[i_item_sk#4,i_item_id#5,i_rec_start_date#6,i_rec_end_date#7,i_item_desc#8,i_current_price#9,i_wholesale_cost#10,i_brand_id#11,i_brand#12,i_class_id#13,i_class#14,i_category_id#15,i_category#16,i_manufact_id#17,i_manufact#18,i_size#19,i_formulation#20,i_color#21,i_units#22,i_container#23,i_manager_id#24,i_product_name#25] > parquet, Statistics(sizeInBytes=85.2 MiB, rowCount=2.04E+5) >+- Aggregate [brand_id#51, class_id#52, category_id#53], [brand_id#51, > class_id#52, category_id#53], Statistics(sizeInBytes=2.6 MiB, > rowCount=1.37E+5) > +- Project [i_brand_id#62 AS brand_id#51, i_class_id#64 AS class_id#52, > i_category_id#66 AS category_id#53], Statistics(sizeInBytes=3.9 MiB, > rowCount=2.02E+5) > +- Filter ((isnotnull(i_brand_id#62) AND isnotnull(i_class_id#64)) > AND isnotnull(i_category_id#66)), Statistics(sizeInBytes=84.6 MiB, > rowCount=2.02E+5) > +- Relation > spark_catalog.default.item[i_item_sk#55,i_item_id#56,i_rec_start_date#57,i_rec_end_date#58,i_item_desc#59,i_current_price#60,i_wholesale_cost#61,i_brand_id#62,i_brand#63,i_class_id#64,i_class#65,i_category_id#66,i_category#67,i_manufact_id#68,i_manufact#69,i_size#70,i_formulation#71,i_color#72,i_units#73,i_container#74,i_manager_id#75,i_product_name#76] > parquet, Statistics(sizeInBytes=85.2 MiB, rowCount=2.04E+5) > {noformat} > Expected: > {noformat} > == Optimized Logical Plan == > Project [i_item_sk#4 AS ss_item_sk#54], Statistics(sizeInBytes=2.3 MiB, > rowCount=2.02E+5) > +- Join Inner, (((i_brand_id#11 = brand_id#51) AND (i_class_id#13 = > class_id#52)) AND (i_category_id#15 = category_id#53)), > Statistics(sizeInBytes=7.0 MiB, rowCount=2.02E+5) >:- Project [i_item_sk#4, i_brand_id#11, i_class_id#13, i_category_id#15], > Statistics(sizeInBytes=4.6 MiB, rowCount=2.02E+5) >: +- Filter ((isnotnull(i_brand_id#11) AND isnotnull(i_class_id#13)) AND > isnotnull(i_category_id#15)), Statistics(sizeInBytes=84.6 MiB, > rowCount=2.02E+5) >: +- Relation > 
spark_catalog.default.item[i_item_sk#4,i_item_id#5,i_rec_start_date#6,i_rec_end_date#7,i_item_desc#8,i_current_price#9,i_wholesale_cost#10,i_brand_id#11,i_brand#12,i_class_id#13,i_class#14,i_category_id#15,i_category#16,i_manufact_id#17,i_manufact#18,i_size#19,i_formulation#20,i_color#21,i_units#22,i_container#23,i_manager_id#24,i_product_name#25] > parquet, Statistics(sizeInBytes=85.2 MiB, rowCount=2.04E+5) >+- Aggregate [brand_id#51, class_id#52, category_id#53], [brand_id#51, > class_id#52, category_id#53], Statistics(sizeInBytes=2.6 MiB, > rowCount=1.37E+5) > +- Project [i_brand_id#62 AS brand_id#51, i_class_id#64 AS class_id#52, > i_category_id#66 AS category_id#53], Statistics(sizeInBytes=3.9 MiB, > rowCount=2.02E+5) > +- Filter ((isnotnull(i_brand_id#62) AND isnotnull(i_class_id#64)) > AND isnotnull(i_category_id#66)), Statistics(sizeInBytes=84.6 MiB, > rowCount=2.02E+5) > +- Relation >
[jira] [Assigned] (SPARK-39851) Fix join stats estimation if one side can keep uniqueness
[ https://issues.apache.org/jira/browse/SPARK-39851?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-39851: Assignee: Apache Spark > Fix join stats estimation if one side can keep uniqueness > - > > Key: SPARK-39851 > URL: https://issues.apache.org/jira/browse/SPARK-39851 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.4.0 >Reporter: Yuming Wang >Assignee: Apache Spark >Priority: Major > > {code:sql} > SELECT i_item_sk ss_item_sk > FROM item, >(SELECT DISTINCT iss.i_brand_id brand_id, > iss.i_class_id class_id, > iss.i_category_id category_id > FROM item iss) x > WHERE i_brand_id = brand_id >AND i_class_id = class_id >AND i_category_id = category_id > {code} > Current: > {noformat} > == Optimized Logical Plan == > Project [i_item_sk#4 AS ss_item_sk#54], Statistics(sizeInBytes=370.8 MiB, > rowCount=3.24E+7) > +- Join Inner, (((i_brand_id#11 = brand_id#51) AND (i_class_id#13 = > class_id#52)) AND (i_category_id#15 = category_id#53)), > Statistics(sizeInBytes=1112.3 MiB, rowCount=3.24E+7) >:- Project [i_item_sk#4, i_brand_id#11, i_class_id#13, i_category_id#15], > Statistics(sizeInBytes=4.6 MiB, rowCount=2.02E+5) >: +- Filter ((isnotnull(i_brand_id#11) AND isnotnull(i_class_id#13)) AND > isnotnull(i_category_id#15)), Statistics(sizeInBytes=84.6 MiB, > rowCount=2.02E+5) >: +- Relation > spark_catalog.default.item[i_item_sk#4,i_item_id#5,i_rec_start_date#6,i_rec_end_date#7,i_item_desc#8,i_current_price#9,i_wholesale_cost#10,i_brand_id#11,i_brand#12,i_class_id#13,i_class#14,i_category_id#15,i_category#16,i_manufact_id#17,i_manufact#18,i_size#19,i_formulation#20,i_color#21,i_units#22,i_container#23,i_manager_id#24,i_product_name#25] > parquet, Statistics(sizeInBytes=85.2 MiB, rowCount=2.04E+5) >+- Aggregate [brand_id#51, class_id#52, category_id#53], [brand_id#51, > class_id#52, category_id#53], Statistics(sizeInBytes=2.6 MiB, > rowCount=1.37E+5) > +- Project [i_brand_id#62 AS brand_id#51, i_class_id#64 AS class_id#52, > i_category_id#66 AS category_id#53], Statistics(sizeInBytes=3.9 MiB, > rowCount=2.02E+5) > +- Filter ((isnotnull(i_brand_id#62) AND isnotnull(i_class_id#64)) > AND isnotnull(i_category_id#66)), Statistics(sizeInBytes=84.6 MiB, > rowCount=2.02E+5) > +- Relation > spark_catalog.default.item[i_item_sk#55,i_item_id#56,i_rec_start_date#57,i_rec_end_date#58,i_item_desc#59,i_current_price#60,i_wholesale_cost#61,i_brand_id#62,i_brand#63,i_class_id#64,i_class#65,i_category_id#66,i_category#67,i_manufact_id#68,i_manufact#69,i_size#70,i_formulation#71,i_color#72,i_units#73,i_container#74,i_manager_id#75,i_product_name#76] > parquet, Statistics(sizeInBytes=85.2 MiB, rowCount=2.04E+5) > {noformat} > Expected: > {noformat} > == Optimized Logical Plan == > Project [i_item_sk#4 AS ss_item_sk#54], Statistics(sizeInBytes=2.3 MiB, > rowCount=2.02E+5) > +- Join Inner, (((i_brand_id#11 = brand_id#51) AND (i_class_id#13 = > class_id#52)) AND (i_category_id#15 = category_id#53)), > Statistics(sizeInBytes=7.0 MiB, rowCount=2.02E+5) >:- Project [i_item_sk#4, i_brand_id#11, i_class_id#13, i_category_id#15], > Statistics(sizeInBytes=4.6 MiB, rowCount=2.02E+5) >: +- Filter ((isnotnull(i_brand_id#11) AND isnotnull(i_class_id#13)) AND > isnotnull(i_category_id#15)), Statistics(sizeInBytes=84.6 MiB, > rowCount=2.02E+5) >: +- Relation > 
spark_catalog.default.item[i_item_sk#4,i_item_id#5,i_rec_start_date#6,i_rec_end_date#7,i_item_desc#8,i_current_price#9,i_wholesale_cost#10,i_brand_id#11,i_brand#12,i_class_id#13,i_class#14,i_category_id#15,i_category#16,i_manufact_id#17,i_manufact#18,i_size#19,i_formulation#20,i_color#21,i_units#22,i_container#23,i_manager_id#24,i_product_name#25] > parquet, Statistics(sizeInBytes=85.2 MiB, rowCount=2.04E+5) >+- Aggregate [brand_id#51, class_id#52, category_id#53], [brand_id#51, > class_id#52, category_id#53], Statistics(sizeInBytes=2.6 MiB, > rowCount=1.37E+5) > +- Project [i_brand_id#62 AS brand_id#51, i_class_id#64 AS class_id#52, > i_category_id#66 AS category_id#53], Statistics(sizeInBytes=3.9 MiB, > rowCount=2.02E+5) > +- Filter ((isnotnull(i_brand_id#62) AND isnotnull(i_class_id#64)) > AND isnotnull(i_category_id#66)), Statistics(sizeInBytes=84.6 MiB, > rowCount=2.02E+5) > +- Relation >
[jira] [Commented] (SPARK-39852) Unify v1 and v2 DESCRIBE TABLE tests for columns
[ https://issues.apache.org/jira/browse/SPARK-39852?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17570468#comment-17570468 ] Apache Spark commented on SPARK-39852: -- User 'MaxGekk' has created a pull request for this issue: https://github.com/apache/spark/pull/37266 > Unify v1 and v2 DESCRIBE TABLE tests for columns > > > Key: SPARK-39852 > URL: https://issues.apache.org/jira/browse/SPARK-39852 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.4.0 >Reporter: Max Gekk >Assignee: Max Gekk >Priority: Major > > Write or move v1 and v2 tests for the DESCRIBE TABLE command for columns. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-39852) Unify v1 and v2 DESCRIBE TABLE tests for columns
[ https://issues.apache.org/jira/browse/SPARK-39852?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-39852: Assignee: Max Gekk (was: Apache Spark) > Unify v1 and v2 DESCRIBE TABLE tests for columns > > > Key: SPARK-39852 > URL: https://issues.apache.org/jira/browse/SPARK-39852 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.4.0 >Reporter: Max Gekk >Assignee: Max Gekk >Priority: Major > > Write or move v1 and v2 tests for the DESCRIBE TABLE command for columns. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-39852) Unify v1 and v2 DESCRIBE TABLE tests for columns
[ https://issues.apache.org/jira/browse/SPARK-39852?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-39852: Assignee: Apache Spark (was: Max Gekk) > Unify v1 and v2 DESCRIBE TABLE tests for columns > > > Key: SPARK-39852 > URL: https://issues.apache.org/jira/browse/SPARK-39852 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.4.0 >Reporter: Max Gekk >Assignee: Apache Spark >Priority: Major > > Write or move v1 and v2 tests for the DESCRIBE TABLE command for columns. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-39852) Unify v1 and v2 DESCRIBE TABLE tests for columns
Max Gekk created SPARK-39852: Summary: Unify v1 and v2 DESCRIBE TABLE tests for columns Key: SPARK-39852 URL: https://issues.apache.org/jira/browse/SPARK-39852 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 3.4.0 Reporter: Max Gekk Assignee: Max Gekk Write or move v1 and v2 tests for the DESCRIBE TABLE command for columns. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
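For context, the column variant of the command that these unified tests exercise looks roughly like this (the table and column names are hypothetical, not from the patch):

{code:scala}
// DESCRIBE TABLE with a trailing column name reports that column's
// metadata; EXTENDED also shows min/max/NDV stats once ANALYZE has run.
spark.sql("CREATE TABLE t (c INT COMMENT 'an int column') USING parquet")
spark.sql("DESCRIBE TABLE EXTENDED t c").show(truncate = false)
{code}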
[jira] [Created] (SPARK-39851) Fix join stats estimation if one side can keep uniqueness
Yuming Wang created SPARK-39851: --- Summary: Fix join stats estimation if one side can keep uniqueness Key: SPARK-39851 URL: https://issues.apache.org/jira/browse/SPARK-39851 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.4.0 Reporter: Yuming Wang {code:sql} SELECT i_item_sk ss_item_sk FROM item, (SELECT DISTINCT iss.i_brand_id brand_id, iss.i_class_id class_id, iss.i_category_id category_id FROM item iss) x WHERE i_brand_id = brand_id AND i_class_id = class_id AND i_category_id = category_id {code} Current: {noformat} == Optimized Logical Plan == Project [i_item_sk#4 AS ss_item_sk#54], Statistics(sizeInBytes=370.8 MiB, rowCount=3.24E+7) +- Join Inner, (((i_brand_id#11 = brand_id#51) AND (i_class_id#13 = class_id#52)) AND (i_category_id#15 = category_id#53)), Statistics(sizeInBytes=1112.3 MiB, rowCount=3.24E+7) :- Project [i_item_sk#4, i_brand_id#11, i_class_id#13, i_category_id#15], Statistics(sizeInBytes=4.6 MiB, rowCount=2.02E+5) : +- Filter ((isnotnull(i_brand_id#11) AND isnotnull(i_class_id#13)) AND isnotnull(i_category_id#15)), Statistics(sizeInBytes=84.6 MiB, rowCount=2.02E+5) : +- Relation spark_catalog.default.item[i_item_sk#4,i_item_id#5,i_rec_start_date#6,i_rec_end_date#7,i_item_desc#8,i_current_price#9,i_wholesale_cost#10,i_brand_id#11,i_brand#12,i_class_id#13,i_class#14,i_category_id#15,i_category#16,i_manufact_id#17,i_manufact#18,i_size#19,i_formulation#20,i_color#21,i_units#22,i_container#23,i_manager_id#24,i_product_name#25] parquet, Statistics(sizeInBytes=85.2 MiB, rowCount=2.04E+5) +- Aggregate [brand_id#51, class_id#52, category_id#53], [brand_id#51, class_id#52, category_id#53], Statistics(sizeInBytes=2.6 MiB, rowCount=1.37E+5) +- Project [i_brand_id#62 AS brand_id#51, i_class_id#64 AS class_id#52, i_category_id#66 AS category_id#53], Statistics(sizeInBytes=3.9 MiB, rowCount=2.02E+5) +- Filter ((isnotnull(i_brand_id#62) AND isnotnull(i_class_id#64)) AND isnotnull(i_category_id#66)), Statistics(sizeInBytes=84.6 MiB, rowCount=2.02E+5) +- Relation spark_catalog.default.item[i_item_sk#55,i_item_id#56,i_rec_start_date#57,i_rec_end_date#58,i_item_desc#59,i_current_price#60,i_wholesale_cost#61,i_brand_id#62,i_brand#63,i_class_id#64,i_class#65,i_category_id#66,i_category#67,i_manufact_id#68,i_manufact#69,i_size#70,i_formulation#71,i_color#72,i_units#73,i_container#74,i_manager_id#75,i_product_name#76] parquet, Statistics(sizeInBytes=85.2 MiB, rowCount=2.04E+5) {noformat} Expected: {noformat} == Optimized Logical Plan == Project [i_item_sk#4 AS ss_item_sk#54], Statistics(sizeInBytes=2.3 MiB, rowCount=2.02E+5) +- Join Inner, (((i_brand_id#11 = brand_id#51) AND (i_class_id#13 = class_id#52)) AND (i_category_id#15 = category_id#53)), Statistics(sizeInBytes=7.0 MiB, rowCount=2.02E+5) :- Project [i_item_sk#4, i_brand_id#11, i_class_id#13, i_category_id#15], Statistics(sizeInBytes=4.6 MiB, rowCount=2.02E+5) : +- Filter ((isnotnull(i_brand_id#11) AND isnotnull(i_class_id#13)) AND isnotnull(i_category_id#15)), Statistics(sizeInBytes=84.6 MiB, rowCount=2.02E+5) : +- Relation spark_catalog.default.item[i_item_sk#4,i_item_id#5,i_rec_start_date#6,i_rec_end_date#7,i_item_desc#8,i_current_price#9,i_wholesale_cost#10,i_brand_id#11,i_brand#12,i_class_id#13,i_class#14,i_category_id#15,i_category#16,i_manufact_id#17,i_manufact#18,i_size#19,i_formulation#20,i_color#21,i_units#22,i_container#23,i_manager_id#24,i_product_name#25] parquet, Statistics(sizeInBytes=85.2 MiB, rowCount=2.04E+5) +- Aggregate [brand_id#51, class_id#52, 
category_id#53], Statistics(sizeInBytes=2.6 MiB, rowCount=1.37E+5) +- Project [i_brand_id#62 AS brand_id#51, i_class_id#64 AS class_id#52, i_category_id#66 AS category_id#53], Statistics(sizeInBytes=3.9 MiB, rowCount=2.02E+5) +- Filter ((isnotnull(i_brand_id#62) AND isnotnull(i_class_id#64)) AND isnotnull(i_category_id#66)), Statistics(sizeInBytes=84.6 MiB, rowCount=2.02E+5) +- Relation spark_catalog.default.item[i_item_sk#55,i_item_id#56,i_rec_start_date#57,i_rec_end_date#58,i_item_desc#59,i_current_price#60,i_wholesale_cost#61,i_brand_id#62,i_brand#63,i_class_id#64,i_class#65,i_category_id#66,i_category#67,i_manufact_id#68,i_manufact#69,i_size#70,i_formulation#71,i_color#72,i_units#73,i_container#74,i_manager_id#75,i_product_name#76] parquet, Statistics(sizeInBytes=85.2 MiB, rowCount=2.04E+5) {noformat} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
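To reproduce the estimates quoted above, one can print the optimized plan with its statistics; a rough sketch, assuming CBO is enabled and column statistics have been collected for `item`:

{code:scala}
spark.sql("SET spark.sql.cbo.enabled=true")
spark.sql("ANALYZE TABLE item COMPUTE STATISTICS FOR ALL COLUMNS")

val df = spark.sql("""
  SELECT i_item_sk ss_item_sk
  FROM item,
       (SELECT DISTINCT iss.i_brand_id brand_id,
                        iss.i_class_id class_id,
                        iss.i_category_id category_id
        FROM item iss) x
  WHERE i_brand_id = brand_id
    AND i_class_id = class_id
    AND i_category_id = category_id
""")

// "cost" mode prints the optimized logical plan annotated with
// Statistics(sizeInBytes=..., rowCount=...), as in the plans above.
df.explain("cost")
{code}

Since the aggregate's output is unique on (brand_id, class_id, category_id), each probe-side row can match at most one build-side row, which is why the expected join row count stays at 2.02E+5 rather than blowing up to 3.24E+7.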
[jira] [Assigned] (SPARK-39850) Print applicationId once applied from yarn rm
[ https://issues.apache.org/jira/browse/SPARK-39850?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-39850: Assignee: (was: Apache Spark) > Print applicationId once applied from yarn rm > - > > Key: SPARK-39850 > URL: https://issues.apache.org/jira/browse/SPARK-39850 > Project: Spark > Issue Type: Improvement > Components: YARN >Affects Versions: 3.3.0 >Reporter: LiDongwei >Priority: Major > > As we all know, between when the client gets an application from YARN and when it submits the application to YARN, there is still a lot of work to do. If an application fails during this work, the user cannot easily find out the application ID. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-39850) Print applicationId once applied from yarn rm
[ https://issues.apache.org/jira/browse/SPARK-39850?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17570447#comment-17570447 ] Apache Spark commented on SPARK-39850: -- User 'DongweiLee' has created a pull request for this issue: https://github.com/apache/spark/pull/37265 > Print applicationId once applied from yarn rm > - > > Key: SPARK-39850 > URL: https://issues.apache.org/jira/browse/SPARK-39850 > Project: Spark > Issue Type: Improvement > Components: YARN >Affects Versions: 3.3.0 >Reporter: LiDongwei >Priority: Major > > As we all know, between when the client gets an application from YARN and when it submits the application to YARN, there is still a lot of work to do. If an application fails during this work, the user cannot easily find out the application ID. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-39850) Print applicationId once applied from yarn rm
[ https://issues.apache.org/jira/browse/SPARK-39850?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-39850: Assignee: Apache Spark > Print applicationId once applied from yarn rm > - > > Key: SPARK-39850 > URL: https://issues.apache.org/jira/browse/SPARK-39850 > Project: Spark > Issue Type: Improvement > Components: YARN >Affects Versions: 3.3.0 >Reporter: LiDongwei >Assignee: Apache Spark >Priority: Major > > As we all know, between when the client gets an application from YARN and when it submits the application to YARN, there is still a lot of work to do. If an application fails during this work, the user cannot easily find out the application ID. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-39850) Print applicationId once applied from yarn rm
LiDongwei created SPARK-39850: - Summary: Print applicationId once applied from yarn rm Key: SPARK-39850 URL: https://issues.apache.org/jira/browse/SPARK-39850 Project: Spark Issue Type: Improvement Components: YARN Affects Versions: 3.3.0 Reporter: LiDongwei As we all know, between when the client gets an application from YARN and when it submits the application to YARN, there is still a lot of work to do. If an application fails during this work, the user cannot easily find out the application ID. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
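The application ID is already known as soon as the RM grants a new application, before the submission steps that can still fail; a hypothetical sketch against the plain YARN client API (not the actual Spark Client code) of where such a log line would go:

{code:scala}
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.yarn.client.api.YarnClient

val yarnClient = YarnClient.createYarnClient()
yarnClient.init(new Configuration())
yarnClient.start()

// The id exists here, before the container launch context is built,
// resources are uploaded, and the application is actually submitted.
val newApp = yarnClient.createApplication()
val appId = newApp.getNewApplicationResponse.getApplicationId
println(s"Application id obtained from RM: $appId")
{code}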
[jira] [Commented] (SPARK-39849) Dataset.as(StructType) fills missing new columns with null value
[ https://issues.apache.org/jira/browse/SPARK-39849?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17570443#comment-17570443 ] Apache Spark commented on SPARK-39849: -- User 'c21' has created a pull request for this issue: https://github.com/apache/spark/pull/37264 > Dataset.as(StructType) fills missing new columns with null value > > > Key: SPARK-39849 > URL: https://issues.apache.org/jira/browse/SPARK-39849 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.4.0 >Reporter: Cheng Su >Priority: Minor > > As a followup of > [https://github.com/apache/spark/pull/37011#discussion_r917700960] , it would > be great to fill missing new columns with null values, instead of failing out > loud. Note it would only work for nullable columns. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-39849) Dataset.as(StructType) fills missing new columns with null value
[ https://issues.apache.org/jira/browse/SPARK-39849?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-39849: Assignee: (was: Apache Spark) > Dataset.as(StructType) fills missing new columns with null value > > > Key: SPARK-39849 > URL: https://issues.apache.org/jira/browse/SPARK-39849 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.4.0 >Reporter: Cheng Su >Priority: Minor > > As a followup of > [https://github.com/apache/spark/pull/37011#discussion_r917700960] , it would > be great to fill missing new columns with null values, instead of failing out > loud. Note it would only work for nullable columns. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-39849) Dataset.as(StructType) fills missing new columns with null value
[ https://issues.apache.org/jira/browse/SPARK-39849?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-39849: Assignee: Apache Spark > Dataset.as(StructType) fills missing new columns with null value > > > Key: SPARK-39849 > URL: https://issues.apache.org/jira/browse/SPARK-39849 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.4.0 >Reporter: Cheng Su >Assignee: Apache Spark >Priority: Minor > > As a followup of > [https://github.com/apache/spark/pull/37011#discussion_r917700960] , it would > be great to fill missing new columns with null values, instead of failing out > loud. Note it would only work for nullable columns. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-39849) Dataset.as(StructType) fills missing new columns with null value
Cheng Su created SPARK-39849: Summary: Dataset.as(StructType) fills missing new columns with null value Key: SPARK-39849 URL: https://issues.apache.org/jira/browse/SPARK-39849 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.4.0 Reporter: Cheng Su As a followup of [https://github.com/apache/spark/pull/37011#discussion_r917700960] , it would be great to fill missing new columns with null values, instead of failing out loud. Note it would only work for nullable columns. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
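A sketch of the requested behavior, assuming the `Dataset.as(StructType)` API from the linked PR (`spark` is a SparkSession; the `extra` column is hypothetical):

{code:scala}
import org.apache.spark.sql.types._
import spark.implicits._

val df = Seq((1, "a"), (2, "b")).toDF("id", "name")

val target = StructType(Seq(
  StructField("id", IntegerType),
  StructField("name", StringType),
  StructField("extra", StringType, nullable = true) // not present in df
))

// Today the missing `extra` column makes this fail; with the proposed
// change, a missing *nullable* column would instead be filled with nulls.
df.as(target).show()
{code}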
[jira] [Commented] (SPARK-39743) Unable to set zstd compression level while writing parquet files
[ https://issues.apache.org/jira/browse/SPARK-39743?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17570435#comment-17570435 ] Apache Spark commented on SPARK-39743: -- User 'ming95' has created a pull request for this issue: https://github.com/apache/spark/pull/37263 > Unable to set zstd compression level while writing parquet files > > > Key: SPARK-39743 > URL: https://issues.apache.org/jira/browse/SPARK-39743 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.2.0 >Reporter: Yeachan Park >Priority: Minor > > While writing zstd compressed parquet files, the setting `spark.io.compression.zstd.level` does not have any effect on the compression level of zstd. > All files seem to be written with the default zstd compression level, and the config option seems to be ignored. > Using the zstd cli tool, we confirmed that setting a higher compression level for the same file tested in Spark resulted in a smaller file. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-39743) Unable to set zstd compression level while writing parquet files
[ https://issues.apache.org/jira/browse/SPARK-39743?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-39743: Assignee: (was: Apache Spark) > Unable to set zstd compression level while writing parquet files > > > Key: SPARK-39743 > URL: https://issues.apache.org/jira/browse/SPARK-39743 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.2.0 >Reporter: Yeachan Park >Priority: Minor > > While writing zstd compressed parquet files, the setting `spark.io.compression.zstd.level` does not have any effect on the compression level of zstd. > All files seem to be written with the default zstd compression level, and the config option seems to be ignored. > Using the zstd cli tool, we confirmed that setting a higher compression level for the same file tested in Spark resulted in a smaller file. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-39743) Unable to set zstd compression level while writing parquet files
[ https://issues.apache.org/jira/browse/SPARK-39743?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-39743: Assignee: Apache Spark > Unable to set zstd compression level while writing parquet files > > > Key: SPARK-39743 > URL: https://issues.apache.org/jira/browse/SPARK-39743 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.2.0 >Reporter: Yeachan Park >Assignee: Apache Spark >Priority: Minor > > While writing zstd compressed parquet files, the setting `spark.io.compression.zstd.level` does not have any effect on the compression level of zstd. > All files seem to be written with the default zstd compression level, and the config option seems to be ignored. > Using the zstd cli tool, we confirmed that setting a higher compression level for the same file tested in Spark resulted in a smaller file. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
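A repro sketch of the report (data and path are illustrative): the session-level setting is accepted, but the Parquet output comes out at zstd's default level.

{code:scala}
// spark.io.compression.zstd.level tunes Spark's internal IO streams
// (shuffle spills, event logs, etc.); the Parquet writer never reads it.
spark.conf.set("spark.io.compression.zstd.level", "19")

spark.range(0, 1000000L).write
  .option("compression", "zstd")
  .parquet("/tmp/zstd-level-test") // size matches the default level
{code}

If the underlying writer is parquet-mr, the knob that actually reaches the codec is presumably the Hadoop property `parquet.compression.codec.zstd.level`, which is what a fix would need to expose or pass through.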