[jira] [Commented] (SPARK-16614) DirectJoin with DataSource for SparkSQL

Russell Spitzer (JIRA) Fri, 02 Sep 2016 08:15:44 -0700

    [ 
https://issues.apache.org/jira/browse/SPARK-16614?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15458784#comment-15458784
 ]


Russell Spitzer commented on SPARK-16614:
-----------------------------------------

Yes. This would be similar to how Presto works by default. The normal operation 
is to pull only the keys in the Source unless the Source is very large at which 
point it does the full scan and then join.


Quick example if DF A only has 3 rows, we only want to do 3 requests against C* 
(if possible)

In the Cassandra case joins against certain columns can be done directly while 
some columns will still require a Full Table Scan

> DirectJoin with DataSource for SparkSQL
> ---------------------------------------
>
>                 Key: SPARK-16614
>                 URL: https://issues.apache.org/jira/browse/SPARK-16614
>             Project: Spark
>          Issue Type: New Feature
>          Components: SQL
>    Affects Versions: 2.0.0
>            Reporter: Russell Spitzer
>
> Join behaviors against some datasources can be improved by skipping a full 
> scan and instead performing a series of point lookups.
> An example
> {code}DataFrame A contains { key1, key5, key302, ... key 50923423} 
>     DataFrame B is a source reading from a C* database with keys {key1, key2, 
> key3 ....}
>     a.join(b){code}
> Currently this will cause the entirety of the DataFrame B to be read into 
> memory before performing a Join. Instead it would be useful if we could 
> expose another api, {{DirectJoinSource}} which allowed connectors to provide 
> a means of requesting a non-contiguous subset of keys from a DataSource.
> This kind of lookup would behave like the joinWithCasandraTable call in the 
> Spark Cassandra Connector 
> https://github.com/datastax/spark-cassandra-connector/blob/master/doc/2_loading.md#using-joinwithcassandratable.
>  
> We find that this is much more useful when the end user is requesting only a 
> small portion of well defined records. I believe this could be applicable to 
> a variety of datasources where reading the entire source is inefficient 
> compared to point lookups.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Commented] (SPARK-16614) DirectJoin with DataSource for SparkSQL

Reply via email to