[ https://issues.apache.org/jira/browse/SPARK-46981?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Noritaka Sekiyama updated SPARK-46981:
--------------------------------------
    Attachment: test_and_twodays_simplified.sql

> Driver OOM happens in query planning phase with empty tables
> -------------------------------------------------------------
>
>                 Key: SPARK-46981
>                 URL: https://issues.apache.org/jira/browse/SPARK-46981
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 3.5.0
>         Environment: * OSS Spark 3.5.0
>                      * Amazon EMR Spark 3.3.0 (EMR release label 6.9.0)
>                      * AWS Glue Spark 3.3.0 (Glue version 4.0)
>            Reporter: Noritaka Sekiyama
>            Priority: Major
>         Attachments: create_sanitized_tables.py, test_and_twodays_simplified.sql
>
>
> We have observed that a driver OOM happens in the query planning phase with empty tables when we run specific patterns of queries.
>
> h2. Issue details
> If we run the query with the where condition {{pt>='20231004' and pt<='20231004'}}, the query fails in the planning phase due to a driver OOM, more specifically {{java.lang.OutOfMemoryError: GC overhead limit exceeded}}.
> If we change the where condition from {{pt>='20231004' and pt<='20231004'}} to {{pt='20231004' or pt='20231005'}}, the SQL runs without any error.
> This issue happens even with an empty table, and it happens before any actual data is loaded. This seems to be an issue on the catalyst side.
>
> h2. Reproduction steps
> Attaching a script and a query to reproduce the issue (an illustrative sketch follows this list):
> * create_sanitized_tables.py: script to create the table definitions
> * test_and_twodays_simplified.sql: query to reproduce the issue
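> A minimal sketch of the two predicate shapes, assuming a hypothetical empty partitioned table {{t}} with a string partition column {{pt}} (the table name and schema here are illustrative only; the real reproduction uses the attached table definitions and query):
> {code:python}
> # Hypothetical minimal setup; the actual reproduction uses the attached
> # create_sanitized_tables.py and test_and_twodays_simplified.sql.
> from pyspark.sql import SparkSession
>
> spark = (
>     SparkSession.builder
>     .appName("SPARK-46981-predicates")
>     .enableHiveSupport()  # the stacktrace below goes through HiveTableScans
>     .getOrCreate()
> )
>
> # An empty partitioned table is enough: the failure is reported to happen
> # during planning, before any data is read.
> spark.sql("CREATE TABLE IF NOT EXISTS t (id INT) PARTITIONED BY (pt STRING)")
>
> # Failing shape: a range predicate on the partition column.
> failing = "SELECT * FROM t WHERE pt >= '20231004' AND pt <= '20231004'"
>
> # Working shape: the same filter expressed as equality predicates.
> working = "SELECT * FROM t WHERE pt = '20231004' OR pt = '20231005'"
>
> # Planning the working shape succeeds; planning the failing shape is
> # where the reported OOM occurs (see the sketch after the stacktrace).
> spark.sql(working).explain()
> {code}
> Only the shape of the where condition differs between the two forms; the tables, projections, and the rest of the query stay the same.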
> Here's the typical stacktrace:
> {{java.lang.OutOfMemoryError: GC overhead limit exceeded
>  at scala.collection.immutable.Vector.iterator(Vector.scala:100)
>  at scala.collection.immutable.Vector.iterator(Vector.scala:69)
>  at scala.collection.IterableLike.foreach(IterableLike.scala:74)
>  at scala.collection.IterableLike.foreach$(IterableLike.scala:73)
>  at scala.collection.AbstractIterable.foreach(Iterable.scala:56)
>  at scala.collection.generic.GenericTraversableTemplate.transpose(GenericTraversableTemplate.scala:219)
>  at scala.collection.generic.GenericTraversableTemplate.transpose$(GenericTraversableTemplate.scala:211)
>  at scala.collection.AbstractTraversable.transpose(Traversable.scala:108)
>  at org.apache.spark.sql.catalyst.plans.logical.Union.output(basicLogicalOperators.scala:461)
>  at org.apache.spark.sql.catalyst.plans.logical.Window.output(basicLogicalOperators.scala:1205)
>  at org.apache.spark.sql.catalyst.planning.PhysicalOperation$.$anonfun$unapply$2(patterns.scala:119)
>  at org.apache.spark.sql.catalyst.planning.PhysicalOperation$$$Lambda$1874/539825188.apply(Unknown Source)
>  at scala.Option.getOrElse(Option.scala:189)
>  at org.apache.spark.sql.catalyst.planning.PhysicalOperation$.unapply(patterns.scala:119)
>  at org.apache.spark.sql.hive.HiveStrategies$HiveTableScans$.apply(HiveStrategies.scala:307)
>  at org.apache.spark.sql.catalyst.planning.QueryPlanner.$anonfun$plan$1(QueryPlanner.scala:63)
>  at org.apache.spark.sql.catalyst.planning.QueryPlanner$$Lambda$2114/1104718965.apply(Unknown Source)
>  at scala.collection.Iterator$$anon$11.nextCur(Iterator.scala:486)
>  at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:492)
>  at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:491)
>  at org.apache.spark.sql.catalyst.planning.QueryPlanner.plan(QueryPlanner.scala:93)
>  at org.apache.spark.sql.execution.SparkStrategies.plan(SparkStrategies.scala:70)
>  at org.apache.spark.sql.catalyst.planning.QueryPlanner.$anonfun$plan$3(QueryPlanner.scala:78)
>  at org.apache.spark.sql.catalyst.planning.QueryPlanner$$Lambda$2117/2079515765.apply(Unknown Source)
>  at scala.collection.TraversableOnce$folder$1.apply(TraversableOnce.scala:196)
>  at scala.collection.TraversableOnce$folder$1.apply(TraversableOnce.scala:194)
>  at scala.collection.Iterator.foreach(Iterator.scala:943)
>  at scala.collection.Iterator.foreach$(Iterator.scala:943)
>  at scala.collection.AbstractIterator.foreach(Iterator.scala:1431)
>  at scala.collection.TraversableOnce.foldLeft(TraversableOnce.scala:199)
>  at scala.collection.TraversableOnce.foldLeft$(TraversableOnce.scala:192)
>  at scala.collection.AbstractIterator.foldLeft(Iterator.scala:1431)}}
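> The trace shows the allocation happening inside physical planning ({{QueryPlanner.plan}} -> {{PhysicalOperation.unapply}} -> {{Union.output}}), which is consistent with the failure occurring before any data is loaded. A planning-time failure can be confirmed without running the job by requesting only the plan; a minimal sketch, assuming the attached tables have already been created and the attached SQL file (a single statement) is readable from the working directory:
> {code:python}
> # Hedged sketch: trigger analysis and physical planning only, no execution.
> # Assumes the tables from create_sanitized_tables.py already exist and the
> # query text is the attached test_and_twodays_simplified.sql.
> from pyspark.sql import SparkSession
>
> spark = SparkSession.builder.enableHiveSupport().getOrCreate()
>
> with open("test_and_twodays_simplified.sql") as f:
>     query = f.read()
>
> df = spark.sql(query)        # parsing and analysis happen eagerly here
> df.explain(mode="extended")  # physical planning happens here; with the
>                              # failing predicate, this is where the driver
>                              # OOM surfaces
> {code}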