Re: Replacing Spark's native scheduler with Sparrow

2014-11-10 Thread Nicholas Chammas
On Sun, Nov 9, 2014 at 1:51 AM, Tathagata Das tathagata.das1...@gmail.com wrote: This causes a scalability vs. latency tradeoff - if your limit is 1000 tasks per second (simplifying from 1500), you could either configure it to use 100 receivers at 100 ms batches (10 blocks/sec), or 1000
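
To make the tradeoff in the quoted message concrete, here is a rough sketch of the arithmetic. It assumes each receiver produces one block, and therefore one task, per batch interval; the function name is hypothetical and not part of Spark.

```python
def tasks_per_second(receivers: int, batch_interval_ms: float) -> float:
    """Tasks the scheduler must launch per second.

    Assumption: one block (one task) per receiver per batch interval.
    """
    blocks_per_receiver_per_sec = 1000.0 / batch_interval_ms
    return receivers * blocks_per_receiver_per_sec


# 100 receivers at 100 ms batches is 10 blocks/sec each, which reaches the 1000 tasks/sec limit.
rate = tasks_per_second(receivers=100, batch_interval_ms=100)
```

Under that assumption, adding receivers or shortening the batch interval both push the task rate toward the same scheduler limit.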

Re: Replacing Spark's native scheduler with Sparrow

2014-11-10 Thread Tathagata Das
Too bad Nick, I don't have anything immediately ready that tests Spark Streaming with those extreme settings. :) On Mon, Nov 10, 2014 at 9:56 AM, Nicholas Chammas nicholas.cham...@gmail.com wrote: On Sun, Nov 9, 2014 at 1:51 AM, Tathagata Das tathagata.das1...@gmail.com wrote: This causes a

Re: Replacing Spark's native scheduler with Sparrow

2014-11-08 Thread Michael Armbrust
However, I haven't seen it be as high as the 100ms Michael quoted (maybe this was for jobs with tasks that have much larger objects that take a long time to deserialize?). I was thinking more about the average end-to-end latency for launching a query that has 100s of partitions. It's also

Replacing Spark's native scheduler with Sparrow

2014-11-07 Thread Nicholas Chammas
I just watched Kay's talk from 2013 on Sparrow https://www.youtube.com/watch?v=ayjH_bG-RC0. Is replacing Spark's native scheduler with Sparrow still on the books? The Sparrow repo https://github.com/radlab/sparrow hasn't been updated recently, and I don't see any JIRA issues about it. It would

Re: Replacing Spark's native scheduler with Sparrow

2014-11-07 Thread Kay Ousterhout
larger clusters, such that Sparrow will be necessary! -Kay On Fri, Nov 7, 2014 at 3:05 PM, Nicholas Chammas nicholas.cham...@gmail.com wrote: I just watched Kay's talk from 2013 on Sparrow https://www.youtube.com/watch?v=ayjH_bG-RC0. Is replacing Spark's native scheduler with Sparrow still

Re: Replacing Spark's native scheduler with Sparrow

2014-11-07 Thread Nicholas Chammas
If, for example, you have a cluster of 100 machines, this means the scheduler can launch 150 tasks per machine per second. Did you mean 15 tasks per machine per second here? Or alternatively, 10 machines? I don't know of any existing Spark clusters that have a large enough number of
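
The question above comes down to simple division. A quick check, assuming the cluster-wide limit of roughly 1500 tasks per second mentioned elsewhere in this thread (the helper names are hypothetical):

```python
def per_machine_rate(total_tasks_per_sec: float, machines: int) -> float:
    """Tasks per machine per second for a given cluster-wide launch rate."""
    return total_tasks_per_sec / machines


def machines_for_rate(total_tasks_per_sec: float, per_machine: float) -> float:
    """Cluster size at which each machine receives `per_machine` tasks/sec."""
    return total_tasks_per_sec / per_machine


# Assumed cluster-wide scheduler limit, taken from the thread.
TOTAL = 1500

on_100_machines = per_machine_rate(TOTAL, 100)      # 15 tasks per machine per second
machines_at_150 = machines_for_rate(TOTAL, 150)     # 10 machines
```

So with that limit, 100 machines gives 15 tasks per machine per second, and 150 tasks per machine per second corresponds to 10 machines.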

Re: Replacing Spark's native scheduler with Sparrow

2014-11-07 Thread Kay Ousterhout
On Fri, Nov 7, 2014 at 6:20 PM, Nicholas Chammas nicholas.cham...@gmail.com wrote: If, for example, you have a cluster of 100 machines, this means the scheduler can launch 150 tasks per machine per second. Did you mean 15 tasks per machine per second here? Or alternatively, 10 machines?

Re: Replacing Spark's native scheduler with Sparrow

2014-11-07 Thread Nicholas Chammas
Sounds good. I'm looking forward to tracking improvements in this area. Also, just to connect some more dots here, I just remembered that there is currently an initiative to add an IndexedRDD https://issues.apache.org/jira/browse/SPARK-2365 interface. Some interesting use cases mentioned there

Re: Replacing Spark's native scheduler with Sparrow

2014-11-07 Thread Shivaram Venkataraman
On Fri, Nov 7, 2014 at 8:04 PM, Nicholas Chammas nicholas.cham...@gmail.com wrote: Sounds good. I'm looking forward to tracking improvements in this area. Also, just to connect some more dots here, I just remembered that there is currently an initiative to add an IndexedRDD

Re: Replacing Spark's native scheduler with Sparrow

2014-11-07 Thread Nicholas Chammas
Hmm, relevant quote from section 3.3: newer frameworks like Spark [35] reduce the overhead to 5ms. To support tasks that complete in hundreds of milliseconds, we argue for reducing task launch overhead even further to 1ms so that launch overhead constitutes at most 1% of task runtime. By
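
The 1% figure in the quote can be checked with a small sketch. It assumes a task runtime of 100 ms, the low end of "hundreds of milliseconds", and treats overhead as a fraction of task runtime as the quote does; the function name is hypothetical.

```python
def overhead_fraction(launch_overhead_ms: float, task_runtime_ms: float) -> float:
    """Launch overhead expressed as a fraction of task runtime."""
    return launch_overhead_ms / task_runtime_ms


# Assumed task runtime: 100 ms.
TASK_MS = 100.0

at_5ms = overhead_fraction(5.0, TASK_MS)   # the 5 ms overhead quoted for Spark
at_1ms = overhead_fraction(1.0, TASK_MS)   # the proposed 1 ms target

print(f"5 ms overhead: {at_5ms:.0%} of a {TASK_MS:.0f} ms task")
print(f"1 ms overhead: {at_1ms:.0%} of a {TASK_MS:.0f} ms task")
```

For a 100 ms task, 5 ms of overhead is 5% of runtime and 1 ms is 1%, which matches the bound stated in the quote.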

Re: Replacing Spark's native scheduler with Sparrow

2014-11-07 Thread Shivaram Venkataraman
I think Kay might be able to give a better answer. The most recent benchmark I remember had the number at somewhere between 8.6ms and 14.6ms depending on the Spark version (https://github.com/apache/spark/pull/2030#issuecomment-52715181). Another point to note is that this is the total time to

Re: Replacing Spark's native scheduler with Sparrow

2014-11-07 Thread Kay Ousterhout
I don't have much more info than what Shivaram said. My sense is that, over time, task launch overhead with Spark has slowly grown as Spark supports more and more functionality. However, I haven't seen it be as high as the 100ms Michael quoted (maybe this was for jobs with tasks that have much