[ 
https://issues.apache.org/jira/browse/SPARK-19730?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15885108#comment-15885108
 ] 

Shawn Lavelle commented on SPARK-19730:
---------------------------------------

Reference: The query is as simple as "*select * from table1 where col in 
(select value from table2 where key=aThing)*".  In this instance, the filter on 
col won't be pushed down into the query for table1.  So, in order to succeed, 
Spark has to load the entirety of table1 either into memory or onto disk before 
continuing. That seems... not practical.

[~hvanhovell] While I don't know Spark's architecture well, could an IN clause 
also be rewritten as some series of queries UNIONed together instead of a 
"sort merge semi join"?  The "sort merge semi join" is just how Spark translated 
the query above; it's not how I wrote it.  If table1 is significantly larger 
than table2, Spark seems to hit the same issue you raised. :/ 

I haven't been able to rewrite the query as a join such that the column I need 
is pushed down to the table.
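
The closest workaround I have is to run the subquery separately, collect its 
result, and inline it into the outer query as a literal IN list, so the data 
source finally sees a pushable filter. A minimal sketch in plain Python of that 
two-step rewrite (the helper name and the string templating are mine for 
illustration; this is not a Spark API):

```python
# Sketch of the workaround: materialize the subquery result client-side,
# then inline it as a literal IN list so the outer query carries a filter
# the data source can push down. Illustrative only; not Spark API.
def inline_in_filter(outer_sql_template, subquery_values):
    # Render the collected values as a SQL literal list, e.g. "7, 12, 42".
    literals = ", ".join(str(v) for v in sorted(subquery_values))
    return outer_sql_template.format(in_list=literals)

# Pretend these came back from: select point from groups where group_id=5
points = {42, 7, 12}

sql = inline_in_filter(
    "select point, time, value from data "
    "where time between now()-86400 and now() and point in ({in_list})",
    points,
)
print(sql)
```

Of course this only works for small subquery results, and it means two 
round-trips plus hand-built SQL; the optimizer doing this pushdown itself 
would be the real fix.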

> Predicate Subqueries do not push results of subqueries to data source
> ---------------------------------------------------------------------
>
>                 Key: SPARK-19730
>                 URL: https://issues.apache.org/jira/browse/SPARK-19730
>             Project: Spark
>          Issue Type: Bug
>          Components: Optimizer, SQL
>    Affects Versions: 2.1.0
>            Reporter: Shawn Lavelle
>
> When a SparkSQL query contains a subquery in the where clause, such as a 
> predicate query using the IN operator, the results of that subquery are not 
> pushed down as a filter to the DataSourceAPI for the outer query. 
> Example: 
> Select point, time, value from data where time between now()-86400 and now() 
> and point in (select point from groups where group_id=5);
> Two queries will be sent to the data source: one for the subquery and 
> another for the outer query. The subquery works correctly, returning the 
> points in the group; however, the outer query does not push a filter for the 
> point column.
> Effect:
> The "group" table has a few hundred rows grouping a few hundred thousand 
> points.  The data table has several billion rows keyed by point and time.  
> Without the ability to push down the filters for the columns of the outer 
> query, the data source cannot properly conduct its pruned scan.
> The subquery results should be pushed down to the outer query as an IN Filter 
> with the results of the subquery.
> {panel:title=Physical Plan}
> *Project [point#263, value#270]
> +- SortMergeJoin [point#263], [col#284], LeftSemi
>    :- *Sort [point#263 ASC NULLS FIRST], false, 0
>    :  +- Exchange hashpartitioning(point#263, 20)
>    :     +- *Filter ((time#264L >= 1487964691000) && (time#264L <= 
> 1487964696000))
>    :        +- *Scan @4b455128 DBNAME.data[point#263, time#264L, value#270] 
> PushedFilters: [GreaterThanOrEqual(time,1487964691000), 
> LessThanOrEqual(time,1487964696000)], ReadSchema: 
> struct<point:int,time:bigint,value:double>...
>    +- *Sort [col#284 ASC NULLS FIRST], false, 0
>       +- Exchange hashpartitioning(col#284, 20)
>          +- Generate explode(points#273), false, false, [col#284]
>             +- *Project [points#273]
>                +- *Filter (group_id#272 = 1)
>                   +- *Scan @12fb3c1a <db>.groups[points#273,group_id#272] 
> PushedFilters: [EqualTo(group_id,1)], ReadSchema: struct<points:array<int>>
> {panel}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
