Hmm, relevant quote from section 3.3:

> newer frameworks like Spark [35] reduce the overhead to 5ms. To support
> tasks that complete in hundreds of milliseconds, we argue for reducing
> task launch overhead even further to 1ms so that launch overhead
> constitutes at most 1% of task runtime. By maintaining an active thread
> pool for task execution on each worker node and caching binaries, task
> launch overhead can be reduced to the time to make a remote procedure call
> to the slave machine to launch the task. Today’s datacenter networks easily
> allow a RPC to complete within 1ms. In fact, recent work showed that 10μs
> RPCs are possible in the short term [26]; thus, with careful engineering,
> we believe task launch overheads of 50μs are attainable. 50μs task
> launch overheads would enable even smaller tasks that could read data from
> in-memory or from flash storage in order to complete in milliseconds.

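To make the arithmetic in that passage concrete, here is a rough sketch of launch overhead as a fraction of task runtime. The overhead figures (5ms, 1ms, 50μs) come from the quote; the 100ms and 5ms task runtimes are my own assumptions, standing in for "hundreds of milliseconds" and "milliseconds", and the function name is made up for illustration.

```python
def launch_overhead_fraction(launch_overhead_ms: float, task_runtime_ms: float) -> float:
    """Return launch overhead as a fraction of task runtime (hypothetical helper)."""
    if task_runtime_ms <= 0:
        raise ValueError("task_runtime_ms must be positive")
    return launch_overhead_ms / task_runtime_ms


if __name__ == "__main__":
    # Overheads are taken from the quoted passage; runtimes are assumed values.
    scenarios = [
        ("Spark today, 100ms task", 5.0, 100.0),
        ("1ms target, 100ms task", 1.0, 100.0),
        ("50us target, 5ms task", 0.05, 5.0),
    ]
    for label, overhead_ms, runtime_ms in scenarios:
        fraction = launch_overhead_fraction(overhead_ms, runtime_ms)
        print(f"{label}: overhead is {fraction:.1%} of task runtime")
```

Under these assumed runtimes, a 5ms launch overhead is 5% of a 100ms task, while the 1ms target brings it down to the 1% mentioned in the paper, and a 50μs overhead keeps a 5ms task at the same 1%.
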
So it looks like I misunderstood the current cost of task initialization.
It's already as low as 5ms (and not 100ms)?

Nick

On Fri, Nov 7, 2014 at 11:15 PM, Shivaram Venkataraman <shiva...@eecs.berkeley.edu> wrote:

> On Fri, Nov 7, 2014 at 8:04 PM, Nicholas Chammas <nicholas.cham...@gmail.com> wrote:
>
>> Sounds good. I'm looking forward to tracking improvements in this area.
>>
>> Also, just to connect some more dots here, I just remembered that there is
>> currently an initiative to add an IndexedRDD
>> <https://issues.apache.org/jira/browse/SPARK-2365> interface. Some
>> interesting use cases mentioned there include (emphasis added):
>>
>> > To address these problems, we propose IndexedRDD, an efficient key-value
>> > store built on RDDs. IndexedRDD would extend RDD[(Long, V)] by enforcing
>> > key uniqueness and pre-indexing the entries for efficient joins and
>> > *point lookups, updates, and deletions*.
>>
>> > GraphX would be the first user of IndexedRDD, since it currently
>> > implements a limited form of this functionality in VertexRDD. We envision
>> > a variety of other uses for IndexedRDD, including *streaming updates* to
>> > RDDs, *direct serving* from RDDs, and as an execution strategy for Spark
>> > SQL.
>>
>> Maybe some day we'll have Spark clusters directly serving up point lookups
>> or updates. I imagine the tasks running on clusters like that would be tiny
>> and would benefit from very low task startup times and scheduling latency.
>> Am I painting that picture correctly?
>
> Yeah - we painted a similar picture in a short paper last year titled
> "The Case for Tiny Tasks in Compute Clusters"
> http://shivaram.org/publications/tinytasks-hotos13.pdf
>
>> Anyway, thanks for explaining the current status of Sparrow.
>>
>> Nick