Sounds good. I'm looking forward to tracking improvements in this area.

Also, to connect some more dots here, I just remembered that there is currently an initiative to add an IndexedRDD <https://issues.apache.org/jira/browse/SPARK-2365> interface. Some interesting use cases mentioned there include (emphasis added):

> To address these problems, we propose IndexedRDD, an efficient key-value
> store built on RDDs. IndexedRDD would extend RDD[(Long, V)] by enforcing
> key uniqueness and pre-indexing the entries for efficient joins and *point
> lookups, updates, and deletions*.
>
> GraphX would be the first user of IndexedRDD, since it currently implements
> a limited form of this functionality in VertexRDD. We envision a variety of
> other uses for IndexedRDD, including *streaming updates* to RDDs, *direct
> serving* from RDDs, and as an execution strategy for Spark SQL.

Maybe some day we'll have Spark clusters directly serving up point lookups or updates. I imagine the tasks running on clusters like that would be tiny and would benefit from very low task startup times and scheduling latency. Am I painting that picture correctly?

Anyway, thanks for explaining the current status of Sparrow.

Nick
