[ https://issues.apache.org/jira/browse/SPARK-46981?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Noritaka Sekiyama updated SPARK-46981:
--------------------------------------
Description:

We have observed that a Driver OOM occurs in the query planning phase, even with empty tables, when specific patterns of queries are run.

h2. Issue details

If we run the query with the where condition {{pt>='20231004' and pt<='20231004'}}, the query fails in the planning phase due to a Driver OOM, more specifically {{java.lang.OutOfMemoryError: GC overhead limit exceeded}}.

If we change the where condition from {{pt>='20231004' and pt<='20231004'}} to {{pt='20231004' or pt='20231005'}}, the SQL runs without any error.

This issue happens even with an empty table, and it happens before any actual data is loaded, so it appears to be an issue on the Catalyst side.

h2. Reproduction steps

Attaching a script and a query to reproduce the issue:
* create_sanitized_tables.py: Script to create the table definitions
* test_and_twodays_simplified.sql: Query to reproduce the issue

Here's the typical stack trace:
{noformat}
at scala.collection.immutable.Vector.iterator(Vector.scala:100)
at scala.collection.immutable.Vector.iterator(Vector.scala:69)
at scala.collection.IterableLike.foreach(IterableLike.scala:74)
at scala.collection.IterableLike.foreach$(IterableLike.scala:73)
at scala.collection.AbstractIterable.foreach(Iterable.scala:56)
at scala.collection.generic.GenericTraversableTemplate.transpose(GenericTraversableTemplate.scala:219)
at scala.collection.generic.GenericTraversableTemplate.transpose$(GenericTraversableTemplate.scala:211)
at scala.collection.AbstractTraversable.transpose(Traversable.scala:108)
at org.apache.spark.sql.catalyst.plans.logical.Union.output(basicLogicalOperators.scala:461)
at org.apache.spark.sql.catalyst.plans.logical.Window.output(basicLogicalOperators.scala:1205)
at org.apache.spark.sql.catalyst.planning.PhysicalOperation$.$anonfun$unapply$2(patterns.scala:119)
at org.apache.spark.sql.catalyst.planning.PhysicalOperation$$$Lambda$1874/539825188.apply(Unknown Source)
at scala.Option.getOrElse(Option.scala:189)
at org.apache.spark.sql.catalyst.planning.PhysicalOperation$.unapply(patterns.scala:119)
at org.apache.spark.sql.hive.HiveStrategies$HiveTableScans$.apply(HiveStrategies.scala:307)
at org.apache.spark.sql.catalyst.planning.QueryPlanner.$anonfun$plan$1(QueryPlanner.scala:63)
at org.apache.spark.sql.catalyst.planning.QueryPlanner$$Lambda$2114/1104718965.apply(Unknown Source)
at scala.collection.Iterator$$anon$11.nextCur(Iterator.scala:486)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:492)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:491)
at org.apache.spark.sql.catalyst.planning.QueryPlanner.plan(QueryPlanner.scala:93)
at org.apache.spark.sql.execution.SparkStrategies.plan(SparkStrategies.scala:70)
at org.apache.spark.sql.catalyst.planning.QueryPlanner.$anonfun$plan$3(QueryPlanner.scala:78)
at org.apache.spark.sql.catalyst.planning.QueryPlanner$$Lambda$2117/2079515765.apply(Unknown Source)
at scala.collection.TraversableOnce$folder$1.apply(TraversableOnce.scala:196)
at scala.collection.TraversableOnce$folder$1.apply(TraversableOnce.scala:194)
at scala.collection.Iterator.foreach(Iterator.scala:943)
at scala.collection.Iterator.foreach$(Iterator.scala:943)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1431)
at scala.collection.TraversableOnce.foldLeft(TraversableOnce.scala:199)
at scala.collection.TraversableOnce.foldLeft$(TraversableOnce.scala:192)
at scala.collection.AbstractIterator.foldLeft(Iterator.scala:1431)
GC overhead limit exceeded
java.lang.OutOfMemoryError: GC overhead limit exceeded
at scala.collection.immutable.Vector.iterator(Vector.scala:100)
at scala.collection.immutable.Vector.iterator(Vector.scala:69)
at scala.collection.IterableLike.foreach(IterableLike.scala:74)
at scala.collection.IterableLike.foreach$(IterableLike.scala:73)
at scala.collection.AbstractIterable.foreach(Iterable.scala:56)
at scala.collection.generic.GenericTraversableTemplate.transpose(GenericTraversableTemplate.scala:219)
at scala.collection.generic.GenericTraversableTemplate.transpose$(GenericTraversableTemplate.scala:211)
at scala.collection.AbstractTraversable.transpose(Traversable.scala:108)
at org.apache.spark.sql.catalyst.plans.logical.Union.output(basicLogicalOperators.scala:461)
at org.apache.spark.sql.catalyst.plans.logical.Window.output(basicLogicalOperators.scala:1205)
at org.apache.spark.sql.catalyst.planning.PhysicalOperation$.$anonfun$unapply$2(patterns.scala:119)
at org.apache.spark.sql.catalyst.planning.PhysicalOperation$$$Lambda$1874/539825188.apply(Unknown Source)
at scala.Option.getOrElse(Option.scala:189)
at org.apache.spark.sql.catalyst.planning.PhysicalOperation$.unapply(patterns.scala:119)
at org.apache.spark.sql.hive.HiveStrategies$HiveTableScans$.apply(HiveStrategies.scala:307)
at org.apache.spark.sql.catalyst.planning.QueryPlanner.$anonfun$plan$1(QueryPlanner.scala:63)
at org.apache.spark.sql.catalyst.planning.QueryPlanner$$Lambda$2114/1104718965.apply(Unknown Source)
at scala.collection.Iterator$$anon$11.nextCur(Iterator.scala:486)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:492)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:491)
at org.apache.spark.sql.catalyst.planning.QueryPlanner.plan(QueryPlanner.scala:93)
at org.apache.spark.sql.execution.SparkStrategies.plan(SparkStrategies.scala:70)
at org.apache.spark.sql.catalyst.planning.QueryPlanner.$anonfun$plan$3(QueryPlanner.scala:78)
at org.apache.spark.sql.catalyst.planning.QueryPlanner$$Lambda$2117/2079515765.apply(Unknown Source)
at scala.collection.TraversableOnce$folder$1.apply(TraversableOnce.scala:196)
at scala.collection.TraversableOnce$folder$1.apply(TraversableOnce.scala:194)
at scala.collection.Iterator.foreach(Iterator.scala:943)
at scala.collection.Iterator.foreach$(Iterator.scala:943)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1431)
at scala.collection.TraversableOnce.foldLeft(TraversableOnce.scala:199)
at scala.collection.TraversableOnce.foldLeft$(TraversableOnce.scala:192)
at scala.collection.AbstractIterator.foldLeft(Iterator.scala:1431)
{noformat}
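For reference, the two predicate shapes can be sketched as follows (the table name and select list are illustrative placeholders, not taken from the attached reproduction; {{pt}} is the string-typed partition column from the report):

{code:sql}
-- Fails in the planning phase with "java.lang.OutOfMemoryError: GC overhead limit exceeded":
SELECT count(*) FROM example_table
WHERE pt >= '20231004' AND pt <= '20231004';

-- Completes planning without error:
SELECT count(*) FROM example_table
WHERE pt = '20231004' OR pt = '20231005';
{code}

Note that the two conditions are not logically equivalent (the second also matches {{pt='20231005'}}); the point is that the range form triggers the OOM during planning while the equality form does not.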
> Driver OOM happens in query planning phase with empty tables
> ------------------------------------------------------------
>
>                 Key: SPARK-46981
>                 URL: https://issues.apache.org/jira/browse/SPARK-46981
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 3.5.0
>         Environment: * OSS Spark 3.5.0
> * Amazon EMR Spark 3.3.0 (EMR release label 6.9.0)
> * AWS Glue Spark 3.3.0 (Glue version 4.0)
>            Reporter: Noritaka Sekiyama
>            Priority: Major
>         Attachments: create_sanitized_tables.py, test_and_twodays_simplified.sql
--
This message was sent by Atlassian Jira
(v8.20.10#820010)