[ 
https://issues.apache.org/jira/browse/SPARK-19609?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16511746#comment-16511746
 ] 

David McLennan commented on SPARK-19609:
----------------------------------------

This feature would be extremely useful in making external data lookups more 
efficient.  For example, you have a stream of data coming in with a window of 
10,000 messages.  You need to join each message with reference data on external 
services to enrich it (for example accounts and products).  Today, you would 
either have to pull the entire external data sources into the executors 
(expensive on all sides - even the small datasets are many 10's of gigabyres), 
or lookup the external datasets key by key on a per message basis, which is 
very chatty from a communication perspective.  If this feature is implemented, 
it could reduce the amount of data transfer significantly, if the cardinality 
of the join keys is low (i.e. you might have 10,000 messages, but they 
reference only 15 unique accounts and 50 unique products.)  It would also 
relieve the author of the burden of having to implement something which does 
this themselves - they could just register the dataframes, run a sql context 
ontop of it, and go.

> Broadcast joins should pushdown join constraints as Filter to the larger 
> relation
> ---------------------------------------------------------------------------------
>
>                 Key: SPARK-19609
>                 URL: https://issues.apache.org/jira/browse/SPARK-19609
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 2.1.0
>            Reporter: Nick Dimiduk
>            Priority: Major
>
> For broadcast inner-joins, where the smaller relation is known to be small 
> enough to materialize on a worker, the set of values for all join columns is 
> known and fits in memory. Spark should translate these values into a 
> {{Filter}} pushed down to the datasource. The common join condition of 
> equality, i.e. {{lhs.a == rhs.a}}, can be written as an {{a in ...}} clause. 
> An example of pushing such filters is already present in the form of 
> {{IsNotNull}} filters via [~sameerag]'s work on SPARK-12957 subtasks.
> This optimization could even work when the smaller relation does not fit 
> entirely in memory. This could be done by partitioning the smaller relation 
> into N pieces, applying this predicate pushdown for each piece, and unioning 
> the results.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

Reply via email to