On Fri, Nov 7, 2014 at 8:04 PM, Nicholas Chammas <nicholas.cham...@gmail.com> wrote:
> Sounds good. I'm looking forward to tracking improvements in this area.
>
> Also, just to connect some more dots here, I just remembered that there is
> currently an initiative to add an IndexedRDD
> <https://issues.apache.org/jira/browse/SPARK-2365> interface. Some
> interesting use cases mentioned there include (emphasis added):
>
> > To address these problems, we propose IndexedRDD, an efficient key-value
> > store built on RDDs. IndexedRDD would extend RDD[(Long, V)] by enforcing
> > key uniqueness and pre-indexing the entries for efficient joins and *point
> > lookups, updates, and deletions*.
> >
> > GraphX would be the first user of IndexedRDD, since it currently implements
> > a limited form of this functionality in VertexRDD. We envision a variety of
> > other uses for IndexedRDD, including *streaming updates* to RDDs, *direct
> > serving* from RDDs, and as an execution strategy for Spark SQL.
>
> Maybe some day we'll have Spark clusters directly serving up point lookups
> or updates. I imagine the tasks running on clusters like that would be tiny
> and would benefit from very low task startup times and scheduling latency.
> Am I painting that picture correctly?

Yeah - we painted a similar picture in a short paper last year titled "The
Case for Tiny Tasks in Compute Clusters"
http://shivaram.org/publications/tinytasks-hotos13.pdf

> Anyway, thanks for explaining the current status of Sparrow.
>
> Nick