[ https://issues.apache.org/jira/browse/SPARK-46981?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Noritaka Sekiyama updated SPARK-46981:
--------------------------------------
    Attachment: test_and_twodays_simplified.sql

> Driver OOM happens in query planning phase with empty tables
> ------------------------------------------------------------
>
>                 Key: SPARK-46981
>                 URL: https://issues.apache.org/jira/browse/SPARK-46981
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 3.5.0
>         Environment: * OSS Spark 3.5.0
>  * Amazon EMR Spark 3.3.0 (EMR release label 6.9.0)
>  * AWS Glue Spark 3.3.0 (Glue version 4.0)
>            Reporter: Noritaka Sekiyama
>            Priority: Major
>         Attachments: create_sanitized_tables.py, 
> test_and_twodays_simplified.sql
>
>
> We have observed that a driver OOM happens in the query planning phase with 
> empty tables when we run specific patterns of queries.
> h2. Issue details
> If we run the query with the WHERE condition {{pt>='20231004' and pt<='20231004'}}, 
> then the query fails in the planning phase due to a driver OOM, more 
> specifically {{java.lang.OutOfMemoryError: GC overhead limit exceeded}}.
> If we change the WHERE condition from {{pt>='20231004' and pt<='20231004'}} 
> to {{pt='20231004' or pt='20231005'}}, the SQL runs without any error.
>  
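> To make the two predicate forms concrete, here is a minimal sketch from 
> spark-shell. The table name {{sample_table}} is a hypothetical stand-in; the 
> actual failing query is the attached test_and_twodays_simplified.sql, not 
> this simplified SELECT:
> {code:scala}
> // Predicate form that triggers the driver OOM during planning:
> spark.sql("SELECT * FROM sample_table WHERE pt >= '20231004' AND pt <= '20231004'")
>
> // Rewritten predicate form that plans and runs without error:
> spark.sql("SELECT * FROM sample_table WHERE pt = '20231004' OR pt = '20231005'")
> {code}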
> This issue happened even with empty tables, and it happened before any actual 
> data was loaded. This seems to be an issue on the Catalyst side.
> h2. Reproduction steps
> Attaching a script and a query that reproduce the issue (a simplified sketch 
> of the setup follows the list):
>  * create_sanitized_tables.py: script that creates the table definitions
>  * test_and_twodays_simplified.sql: query that reproduces the issue
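> Because the failure is in planning and the tables are empty, it should be 
> reproducible without loading any data. A sketch of that setup in spark-shell, 
> again with hypothetical table and column names (the attachments hold the 
> exact definitions):
> {code:scala}
> // Create an empty Hive-partitioned table; no data is ever loaded
> spark.sql(
>   """CREATE TABLE sample_table (col1 STRING, col2 BIGINT)
>     |PARTITIONED BY (pt STRING)
>     |STORED AS PARQUET""".stripMargin)
>
> // If the failure is purely in planning, explain() alone should trigger
> // the OOM, before any tasks are launched
> val query = scala.io.Source.fromFile("test_and_twodays_simplified.sql").mkString
> spark.sql(query).explain()
> {code}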
> Here's the typical stack trace:
> {code}
> java.lang.OutOfMemoryError: GC overhead limit exceeded
>     at scala.collection.immutable.Vector.iterator(Vector.scala:100)
>     at scala.collection.immutable.Vector.iterator(Vector.scala:69)
>     at scala.collection.IterableLike.foreach(IterableLike.scala:74)
>     at scala.collection.IterableLike.foreach$(IterableLike.scala:73)
>     at scala.collection.AbstractIterable.foreach(Iterable.scala:56)
>     at scala.collection.generic.GenericTraversableTemplate.transpose(GenericTraversableTemplate.scala:219)
>     at scala.collection.generic.GenericTraversableTemplate.transpose$(GenericTraversableTemplate.scala:211)
>     at scala.collection.AbstractTraversable.transpose(Traversable.scala:108)
>     at org.apache.spark.sql.catalyst.plans.logical.Union.output(basicLogicalOperators.scala:461)
>     at org.apache.spark.sql.catalyst.plans.logical.Window.output(basicLogicalOperators.scala:1205)
>     at org.apache.spark.sql.catalyst.planning.PhysicalOperation$.$anonfun$unapply$2(patterns.scala:119)
>     at org.apache.spark.sql.catalyst.planning.PhysicalOperation$$$Lambda$1874/539825188.apply(Unknown Source)
>     at scala.Option.getOrElse(Option.scala:189)
>     at org.apache.spark.sql.catalyst.planning.PhysicalOperation$.unapply(patterns.scala:119)
>     at org.apache.spark.sql.hive.HiveStrategies$HiveTableScans$.apply(HiveStrategies.scala:307)
>     at org.apache.spark.sql.catalyst.planning.QueryPlanner.$anonfun$plan$1(QueryPlanner.scala:63)
>     at org.apache.spark.sql.catalyst.planning.QueryPlanner$$Lambda$2114/1104718965.apply(Unknown Source)
>     at scala.collection.Iterator$$anon$11.nextCur(Iterator.scala:486)
>     at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:492)
>     at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:491)
>     at org.apache.spark.sql.catalyst.planning.QueryPlanner.plan(QueryPlanner.scala:93)
>     at org.apache.spark.sql.execution.SparkStrategies.plan(SparkStrategies.scala:70)
>     at org.apache.spark.sql.catalyst.planning.QueryPlanner.$anonfun$plan$3(QueryPlanner.scala:78)
>     at org.apache.spark.sql.catalyst.planning.QueryPlanner$$Lambda$2117/2079515765.apply(Unknown Source)
>     at scala.collection.TraversableOnce$folder$1.apply(TraversableOnce.scala:196)
>     at scala.collection.TraversableOnce$folder$1.apply(TraversableOnce.scala:194)
>     at scala.collection.Iterator.foreach(Iterator.scala:943)
>     at scala.collection.Iterator.foreach$(Iterator.scala:943)
>     at scala.collection.AbstractIterator.foreach(Iterator.scala:1431)
>     at scala.collection.TraversableOnce.foldLeft(TraversableOnce.scala:199)
>     at scala.collection.TraversableOnce.foldLeft$(TraversableOnce.scala:192)
>     at scala.collection.AbstractIterator.foldLeft(Iterator.scala:1431)
> {code}
>  
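> Reading the trace: {{PhysicalOperation.unapply}} (invoked from 
> {{HiveTableScans}}) asks a {{Window}} over a {{Union}} for its {{output}}, 
> and {{Union.output}} computes a transpose over the outputs of all of its 
> children each time it is called. The following self-contained sketch (plain 
> Scala, not Spark code) shows how that pattern can degenerate when plans nest 
> and share subtrees; it is offered as one plausible reading of the trace, not 
> a confirmed root-cause analysis.
> {code:scala}
> // Toy model of a logical plan whose output is recomputed on every access,
> // mirroring the Union.output -> transpose frames in the trace above.
> sealed trait Plan { def output: Seq[String] }
> case class Leaf(cols: Seq[String]) extends Plan {
>   def output: Seq[String] = cols
> }
> case class UnionNode(children: Seq[Plan]) extends Plan {
>   // No caching: every call re-walks the whole subtree, so a DAG with
>   // shared children costs 2^depth leaf visits per access.
>   def output: Seq[String] = children.map(_.output).transpose.map(_.head)
> }
>
> object Demo extends App {
>   val leaf: Plan = Leaf(Seq("col1", "col2", "pt"))
>   // 20 levels of two-way unions over the same shared subtree
>   val nested = (1 to 20).foldLeft(leaf)((p, _) => UnionNode(Seq(p, p)))
>   // A single output call already performs ~2^20 leaf traversals and
>   // allocates a fresh transposed collection at every level.
>   println(nested.output)
> }
> {code}
> If the planner evaluates such an {{output}} many times while pattern 
> matching, allocation pressure alone can surface as {{GC overhead limit 
> exceeded}} even though the tables are empty.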


