Russell Spitzer created SPARK-16616: ---------------------------------------
Summary: Allow Catalyst to take Advantage of Hash Partitioned DataSources Key: SPARK-16616 URL: https://issues.apache.org/jira/browse/SPARK-16616 Project: Spark Issue Type: New Feature Reporter: Russell Spitzer Many Distributed Databases provide hash partitioned data (in contrast to data partitioned on a specific column) and this information can be used to greatly enhance Spark performance. For example: Data within Cassandra is distributed based on a Hash of the "Partition Key" which is a set of columns. This means all values read from the database which contain the same "partition key" will exist in the same Spark Partition. When these rows are joined with themselves or aggregated on these "Partition Key" columns there is no need to do a shuffle. {code} CREATE TABLE (UserID int, purchase int, amount int, PRIMARY KEY (customer, purchase)) {code} Would internally (using the SparkCassandraConnector) make an RDD that looks like {code} Spark Partition 1 : (1, 1, 5), (1, 2, 6), (432, 1, 10) .... Spark Partition 2 : (2, 1, 4), (2, 2, 5), (700, 1, 1) ... {code} Where the all values for {{UserID}} 1 are in the First Partition but the values contained within Spark Partition 1 do not cover a contiguous range of values for {{UserID}} Like with normal RDDs, it would be nice if we could expose a Partitioning function that (given the key value) we could indicate what partition the row would be in. This information could also be used in aggregates and joins. -- This message was sent by Atlassian JIRA (v6.3.4#6332) --------------------------------------------------------------------- To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org