[ https://issues.apache.org/jira/browse/SPARK-19093?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15806285#comment-15806285 ]
Apache Spark commented on SPARK-19093:
--------------------------------------

User 'dilipbiswal' has created a pull request for this issue:
https://github.com/apache/spark/pull/16493

> Cached tables are not used in SubqueryExpression
> ------------------------------------------------
>
>                 Key: SPARK-19093
>                 URL: https://issues.apache.org/jira/browse/SPARK-19093
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.0.2, 2.1.0
>            Reporter: Josh Rosen
>
> See reproduction at
> https://databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/1903098128019500/2699761537338853/1395282846718893/latest.html
> Consider the following:
> {code}
> Seq(("a", "b"), ("c", "d"))
>   .toDS
>   .write
>   .parquet("/tmp/rows")
> val df = spark.read.parquet("/tmp/rows")
> df.cache()
> df.count()
> df.createOrReplaceTempView("rows")
> spark.sql("""
>   select * from rows cross join rows
> """).explain(true)
> spark.sql("""
>   select * from rows where not exists (select * from rows)
> """).explain(true)
> {code}
> I'd expect both sides of the join to read from the cached table in both the
> cross join and the anti join. However, the left anti join produces the
> following plan, which reads only the left side from the cache and reads the
> right side via a regular non-cached scan:
> {code}
> == Parsed Logical Plan ==
> 'Project [*]
> +- 'Filter NOT exists#3994
>    :  +- 'Project [*]
>    :     +- 'UnresolvedRelation `rows`
>    +- 'UnresolvedRelation `rows`
> == Analyzed Logical Plan ==
> _1: string, _2: string
> Project [_1#3775, _2#3776]
> +- Filter NOT predicate-subquery#3994 []
>    :  +- Project [_1#3775 AS _1#3775#4001, _2#3776 AS _2#3776#4002]
>    :     +- Project [_1#3775, _2#3776]
>    :        +- SubqueryAlias rows
>    :           +- Relation[_1#3775,_2#3776] parquet
>    +- SubqueryAlias rows
>       +- Relation[_1#3775,_2#3776] parquet
> == Optimized Logical Plan ==
> Join LeftAnti
> :- InMemoryRelation [_1#3775, _2#3776], true, 10000, StorageLevel(disk, memory, deserialized, 1 replicas)
> :     +- *FileScan parquet [_1#3775,_2#3776] Batched: true, Format: Parquet, Location: InMemoryFileIndex[dbfs:/tmp/rows], PartitionFilters: [], PushedFilters: [], ReadSchema: struct<_1:string,_2:string>
> +- Project [_1#3775 AS _1#3775#4001, _2#3776 AS _2#3776#4002]
>    +- Relation[_1#3775,_2#3776] parquet
> == Physical Plan ==
> BroadcastNestedLoopJoin BuildRight, LeftAnti
> :- InMemoryTableScan [_1#3775, _2#3776]
> :     +- InMemoryRelation [_1#3775, _2#3776], true, 10000, StorageLevel(disk, memory, deserialized, 1 replicas)
> :           +- *FileScan parquet [_1#3775,_2#3776] Batched: true, Format: Parquet, Location: InMemoryFileIndex[dbfs:/tmp/rows], PartitionFilters: [], PushedFilters: [], ReadSchema: struct<_1:string,_2:string>
> +- BroadcastExchange IdentityBroadcastMode
>    +- *Project [_1#3775 AS _1#3775#4001, _2#3776 AS _2#3776#4002]
>       +- *FileScan parquet [_1#3775,_2#3776] Batched: true, Format: Parquet, Location: InMemoryFileIndex[dbfs:/tmp/rows], PartitionFilters: [], PushedFilters: [], ReadSchema: struct<_1:string,_2:string>
> {code}
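
For readers following along: in the analyzed plan above, the subquery's plan hangs off the Filter's predicate expression rather than off the operator tree. The sketch below is a toy model in plain Scala, not Spark's code, of how a cache substitution pass that walks only operator children could miss such a nested plan. That this is what happens inside Spark is an assumption suggested by the issue title, not something the report states. Every type and function name here (Expr, Plan, CachedRelation, operatorsOnly, withSubqueries) is hypothetical.

```scala
// Toy model only: none of these types are Spark's real classes.
sealed trait Expr
case class Not(child: Expr) extends Expr
// A subquery: a whole plan nested inside an expression.
case class Exists(plan: Plan) extends Expr

sealed trait Plan
case class Relation(name: String) extends Plan
case class CachedRelation(name: String) extends Plan
case class Filter(condition: Expr, child: Plan) extends Plan

object CacheSubstitution {
  // Names of relations that are assumed to be cached.
  val cached: Set[String] = Set("rows")

  // Rewrites only the operator tree; expressions are left untouched,
  // so a plan held by Exists is never visited.
  def operatorsOnly(plan: Plan): Plan = plan match {
    case Relation(n) if cached(n) => CachedRelation(n)
    case Filter(cond, child)      => Filter(cond, operatorsOnly(child))
    case other                    => other
  }

  // Same rewrite, but also descends into plans held by subquery expressions.
  def withSubqueries(plan: Plan): Plan = plan match {
    case Relation(n) if cached(n) => CachedRelation(n)
    case Filter(cond, child)      => Filter(rewriteExpr(cond), withSubqueries(child))
    case other                    => other
  }

  private def rewriteExpr(e: Expr): Expr = e match {
    case Not(c)    => Not(rewriteExpr(c))
    case Exists(p) => Exists(withSubqueries(p))
  }
}

object Demo extends App {
  // Roughly: select * from rows where not exists (select * from rows)
  val query: Plan = Filter(Not(Exists(Relation("rows"))), Relation("rows"))

  // prints Filter(Not(Exists(Relation(rows))),CachedRelation(rows))
  println(CacheSubstitution.operatorsOnly(query))
  // prints Filter(Not(Exists(CachedRelation(rows))),CachedRelation(rows))
  println(CacheSubstitution.withSubqueries(query))
}
```

In the first output only the outer relation is replaced, which mirrors the shape of the reported plan: the left side reads from the cache while the subquery side still scans the Parquet files.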