[ https://issues.apache.org/jira/browse/SPARK-16614?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Hyukjin Kwon updated SPARK-16614: --------------------------------- Labels: bulk-closed (was: ) > DirectJoin with DataSource for SparkSQL > --------------------------------------- > > Key: SPARK-16614 > URL: https://issues.apache.org/jira/browse/SPARK-16614 > Project: Spark > Issue Type: New Feature > Components: SQL > Affects Versions: 2.0.0 > Reporter: Russell Spitzer > Priority: Major > Labels: bulk-closed > > Join behaviors against some datasources can be improved by skipping a full > scan and instead performing a series of point lookups. > An example > {code}DataFrame A contains { key1, key5, key302, ... key 50923423} > DataFrame B is a source reading from a C* database with keys {key1, key2, > key3 ....} > a.join(b){code} > Currently this will cause the entirety of the DataFrame B to be read into > memory before performing a Join. Instead it would be useful if we could > expose another api, {{DirectJoinSource}} which allowed connectors to provide > a means of requesting a non-contiguous subset of keys from a DataSource. > This kind of lookup would behave like the joinWithCasandraTable call in the > Spark Cassandra Connector > https://github.com/datastax/spark-cassandra-connector/blob/master/doc/2_loading.md#using-joinwithcassandratable. > > We find that this is much more useful when the end user is requesting only a > small portion of well defined records. I believe this could be applicable to > a variety of datasources where reading the entire source is inefficient > compared to point lookups. -- This message was sent by Atlassian JIRA (v7.6.3#76005) --------------------------------------------------------------------- To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org