[jira] [Created] (SPARK-16614) DirectJoin with DataSource for SparkSQL

Russell Spitzer (JIRA) Mon, 18 Jul 2016 14:53:49 -0700

Russell Spitzer created SPARK-16614:
---------------------------------------


             Summary: DirectJoin with DataSource for SparkSQL
                 Key: SPARK-16614
                 URL: https://issues.apache.org/jira/browse/SPARK-16614
             Project: Spark
          Issue Type: New Feature
          Components: SQL
    Affects Versions: 2.0.0
            Reporter: Russell Spitzer


Join behaviors against some datasources can be improved by skipping a full scan 
and instead performing a series of point lookups.

An example

{code}DataFrame A contains { key1, key5, key302, ... key 50923423} 
    DataFrame B is a source reading from a C* database with keys {key1, key2, 
key3 ....}
    a.join(b){code}

Currently this will cause the entirety of the DataFrame B to be read into 
memory before performing a Join. Instead it would be useful if we could expose 
another api, {{DirectJoinSource}} which allowed connectors to provide a means 
of requests a non-contiguous subset of keys from a DataSource.

This kind of lookup would behave like the joinWithCasandraTable call in the 
Spark Cassandra Connector 
https://github.com/datastax/spark-cassandra-connector/blob/master/doc/2_loading.md#using-joinwithcassandratable.
 

We find that this is much more useful when the end user is requesting only a 
small portion of well defined records. I believe this could be applicable to a 
variety of datasources where reading the entire source is inefficient compared 
to point lookups.







--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Created] (SPARK-16614) DirectJoin with DataSource for SparkSQL

Reply via email to