Russell Spitzer created SPARK-16616:
---------------------------------------

             Summary: Allow Catalyst to take Advantage of Hash Partitioned 
DataSources
                 Key: SPARK-16616
                 URL: https://issues.apache.org/jira/browse/SPARK-16616
             Project: Spark
          Issue Type: New Feature
            Reporter: Russell Spitzer


Many Distributed Databases provide hash partitioned data (in contrast to data 
partitioned on a specific column) and this information can be used to greatly 
enhance Spark performance. 

For example: 

Data within Cassandra is distributed based on a Hash of the "Partition Key" 
which is a set of columns. This means all values read from the database which 
contain the same "partition key" will exist in the same Spark Partition. When 
these rows are joined with themselves or aggregated on these "Partition Key" 
columns there is no need to do a shuffle.

{code}
CREATE TABLE (UserID int, purchase int, amount int, PRIMARY KEY (customer, 
purchase))
{code}

Would internally (using the SparkCassandraConnector) make an RDD that looks like

{code}
Spark Partition 1 :  (1, 1, 5), (1, 2, 6), (432, 1, 10) .... 
Spark Partition 2 :  (2, 1, 4), (2, 2, 5), (700, 1, 1) ...
{code}

Where the all values for {{UserID}} 1 are in the First Partition but the values 
contained within Spark Partition 1 do not cover a contiguous range of values 
for {{UserID}}

Like with normal RDDs, it would be nice if we could expose a Partitioning 
function that (given the key value) we could indicate what partition the row 
would be in. This information could also be used in aggregates and joins. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

Reply via email to