[ https://issues.apache.org/jira/browse/SPARK-28156?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Wenchen Fan updated SPARK-28156: -------------------------------- Fix Version/s: 2.4.4 > Join plan sometimes does not use cached query > --------------------------------------------- > > Key: SPARK-28156 > URL: https://issues.apache.org/jira/browse/SPARK-28156 > Project: Spark > Issue Type: Bug > Components: SQL > Affects Versions: 2.3.3, 3.0.0, 2.4.3 > Reporter: Bruce Robbins > Assignee: Liang-Chi Hsieh > Priority: Major > Fix For: 2.4.4, 3.0.0 > > > I came across a case where a cached query is referenced on both sides of a > join, but the InMemoryRelation is inserted on only one side. This case occurs > only when the cached query uses a (Hive-style) view. > Consider this example: > {noformat} > // create the data > val df1 = Seq.tabulate(10) { x => (x, x + 1, x + 2, x + 3) }.toDF("a", "b", > "c", "d") > df1.write.mode("overwrite").format("orc").saveAsTable("table1") > sql("drop view if exists table1_vw") > sql("create view table1_vw as select * from table1") > // create the cached query > val cacheddataDf = sql(""" > select a, b, c, d > from table1_vw > """) > import org.apache.spark.storage.StorageLevel.DISK_ONLY > cacheddataDf.createOrReplaceTempView("cacheddata") > cacheddataDf.persist(DISK_ONLY) > // main query > val queryDf = sql(s""" > select leftside.a, leftside.b > from cacheddata leftside > join cacheddata rightside > on leftside.a = rightside.a > """) > queryDf.explain(true) > {noformat} > Note that the optimized plan does not use an InMemoryRelation for the right > side, but instead just uses a Relation: > {noformat} > Project [a#45, b#46] > +- Join Inner, (a#45 = a#37) > :- Project [a#45, b#46] > : +- Filter isnotnull(a#45) > : +- InMemoryRelation [a#45, b#46, c#47, d#48], StorageLevel(disk, 1 > replicas) > : +- *(1) FileScan orc default.table1[a#37,b#38,c#39,d#40] > Batched: true, DataFilters: [], Format: ORC, Location: > InMemoryFileIndex[file:/Users/brobbins/github/spark_upstream/spark-warehouse/table1], > PartitionFilters: [], PushedFilters: [], ReadSchema: > struct<a:int,b:int,c:int,d:int> > +- Project [a#37] > +- Filter isnotnull(a#37) > +- Relation[a#37,b#38,c#39,d#40] orc > {noformat} > The fragment does not match the cached query because AliasViewChild adds an > extra projection under the View on the right side (see #2 below). > AliasViewChild adds the extra projection because the exprIds in the View's > output appears to have been renamed by Analyzer$ResolveReferences (#1 below). > I have not yet looked at why. > {noformat} > - > - > - > +- SubqueryAlias `rightside` > +- SubqueryAlias `cacheddata` > +- Project [a#73, b#74, c#75, d#76] > +- SubqueryAlias `default`.`table1_vw` > (#1) -> +- View (`default`.`table1_vw`, [a#73,b#74,c#75,d#76]) > (#2) -> +- Project [cast(a#45 as int) AS a#73, cast(b#46 as int) AS > b#74, cast(c#47 as int) AS c#75, cast(d#48 as int) AS d#76] > +- Project [cast(a#37 as int) AS a#45, cast(b#38 as int) > AS b#46, cast(c#39 as int) AS c#47, cast(d#40 as int) AS d#48] > +- Project [a#37, b#38, c#39, d#40] > +- SubqueryAlias `default`.`table1` > +- Relation[a#37,b#38,c#39,d#40] orc > {noformat} > In a larger query (where cachedata may be referred on either side only > indirectly), this phenomenon can create certain oddities, as the fragment is > not replaced with InMemoryRelation, and the fragment is present when the plan > is optimized as a whole. > In Spark 2.1.3, Spark uses InMemoryRelation on both sides. -- This message was sent by Atlassian JIRA (v7.6.14#76016) --------------------------------------------------------------------- To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org