Ankur Dave created SPARK-2365:
---------------------------------

             Summary: Add IndexedRDD, an efficient updatable key-value store
                 Key: SPARK-2365
                 URL: https://issues.apache.org/jira/browse/SPARK-2365
             Project: Spark
          Issue Type: New Feature
          Components: GraphX, Spark Core
            Reporter: Ankur Dave
            Assignee: Ankur Dave


RDDs currently provide a bulk-updatable, iterator-based interface. This imposes 
minimal requirements on the storage layer, which only needs to support 
sequential access, enabling on-disk and serialized storage.

However, many applications would benefit from a richer interface. Efficient 
support for point lookups would enable serving data out of RDDs, but a lookup 
currently requires iterating over an entire partition to find the desired 
element. Point updates similarly require copying an entire iterator. Joins are 
also expensive, requiring a shuffle and local hash joins.
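
As a rough illustration of these costs, the sketch below models a partition 
as a plain Scala iterator rather than a real Spark RDD. The object and method 
names (IteratorPartitionSketch, lookup, update) are hypothetical and chosen 
only for this example; they are not part of Spark.

```scala
object IteratorPartitionSketch {
  // Point lookup: with only sequential access, the partition is scanned
  // until the key is found, so the cost is linear in the partition size.
  def lookup[V](partition: Iterator[(Long, V)], key: Long): Option[V] =
    partition.collectFirst { case (k, v) if k == key => v }

  // Point update: every entry is passed through again to replace a single
  // value. Keys that are not already present are left untouched.
  def update[V](partition: Iterator[(Long, V)], key: Long, value: V): Iterator[(Long, V)] =
    partition.map { case (k, v) => if (k == key) (k, value) else (k, v) }
}
```

Both operations touch every element in the worst case, which is the overhead 
a per-partition index is meant to avoid.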

To address these problems, we propose IndexedRDD, an efficient key-value store 
built on RDDs. IndexedRDD would extend RDD[(Long, V)] by enforcing key 
uniqueness and pre-indexing the entries for efficient joins and point lookups, 
updates, and deletions.

It would be implemented by (1) hash-partitioning the entries by key, (2) 
maintaining a hash index within each partition, and (3) using purely functional 
(immutable and efficiently updatable) data structures to enable efficient 
modifications and deletions.
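
The three steps above can be sketched in plain Scala, without Spark, to show 
how they fit together. This is a minimal sketch under stated assumptions, not 
the proposed implementation: IndexedStore and its methods are hypothetical 
names, the partitions live in a local Vector instead of being distributed, 
and Scala's immutable HashMap stands in for whatever purely functional 
structure the real design would use.

```scala
import scala.collection.immutable.HashMap

// Hypothetical, single-machine model of the proposal; not a Spark API.
final case class IndexedStore[V](partitions: Vector[HashMap[Long, V]]) {
  require(partitions.nonEmpty, "at least one partition is required")

  // (1) Hash-partition by key; floorMod keeps the index non-negative.
  private def partitionOf(key: Long): Int =
    java.lang.Math.floorMod(key, partitions.length.toLong).toInt

  // (2) Each partition is a hash index, so a lookup avoids a full scan.
  def get(key: Long): Option[V] = partitions(partitionOf(key)).get(key)

  // (3) Updates return a new store that shares unchanged structure with
  // the old one; the old store stays valid. Keys remain unique because a
  // put on an existing key replaces its value.
  def put(key: Long, value: V): IndexedStore[V] = {
    val p = partitionOf(key)
    copy(partitions = partitions.updated(p, partitions(p).updated(key, value)))
  }

  def delete(key: Long): IndexedStore[V] = {
    val p = partitionOf(key)
    copy(partitions = partitions.updated(p, partitions(p) - key))
  }

  // Two stores with the same partition count place equal keys in the same
  // partition, so they can be joined partition by partition.
  def innerJoin[W](other: IndexedStore[W]): IndexedStore[(V, W)] = {
    require(other.partitions.length == partitions.length, "partition counts must match")
    IndexedStore(partitions.zip(other.partitions).map { case (left, right) =>
      left.collect { case (k, v) if right.contains(k) => (k, (v, right(k))) }
    })
  }
}

object IndexedStore {
  def empty[V](numPartitions: Int): IndexedStore[V] =
    IndexedStore(Vector.fill(numPartitions)(HashMap.empty[Long, V]))
}
```

For example, after `val s1 = IndexedStore.empty[String](4).put(1L, "a")` and 
`val s2 = s1.put(1L, "b")`, `s1.get(1L)` still returns `Some("a")` while 
`s2.get(1L)` returns `Some("b")`, which is the property that makes the 
structure fit RDD immutability.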

GraphX would be the first user of IndexedRDD, since it currently implements a 
limited form of this functionality in VertexRDD. We envision a variety of other 
uses for IndexedRDD, including streaming updates to RDDs, direct serving from 
RDDs, and use as an execution strategy for Spark SQL.



--
This message was sent by Atlassian JIRA
(v6.2#6252)
