>
> If, for example, you have a cluster of 100 machines, this means the
> scheduler can launch 150 tasks per machine per second.


Did you mean 15 tasks per machine per second here? Or alternatively, 10
machines?
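
(For what it's worth, both corrections are consistent with a single
underlying figure. If the claim is a cluster-wide scheduling rate of
roughly 1,500 tasks per second -- my assumption, since I don't have the
original number in front of me -- then 1500 / 100 machines = 15 tasks
per machine per second, while getting 150 tasks per machine per second
from that same rate would require only 10 machines.)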

> I don't know of any existing Spark clusters that have a large enough number
> of machines or short enough tasks to justify the added complexity of
> distributing the scheduler.


Actually, that was exactly why I took an interest in Sparrow: the idea of a
Spark cluster handling many very short (<< 50 ms) tasks.

At the recent Spark Committer Night
<http://www.meetup.com/Spark-NYC/events/209271842/> in NYC, I asked Michael
if he thought that Spark SQL could eventually fully meet the need for very
low-latency queries currently served by MPP databases like Redshift or
Vertica. If I recall correctly, he said that the main obstacle to that was
simply task startup time, which is on the order of 100 ms.
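
To make that concrete, here is the kind of crude measurement I have in
mind, runnable from the spark-shell (the helper name and iteration counts
are just illustrative, not a rigorous benchmark):

    // Time a trivial one-task job end to end; on an idle cluster this is
    // dominated by scheduling and task launch overhead, not computation.
    def timeOneTaskJob(): Double = {
      val start = System.nanoTime()
      sc.parallelize(Seq(1), numSlices = 1).count() // a single no-op task
      (System.nanoTime() - start) / 1e6             // elapsed milliseconds
    }

    (1 to 5).foreach(_ => timeOneTaskJob())         // warm up the JVM first
    val samples = (1 to 20).map(_ => timeOneTaskJob()).sorted
    println(f"median one-task job: ${samples(samples.size / 2)}%.1f ms")

If the median here sits near 100 ms even for a no-op task, that lines up
with task startup time being the bottleneck for short queries.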

Is there interest in (or perhaps an existing initiative related to)
improving task startup times to the point where one could legitimately look
at Spark SQL as a low-latency database that can serve many users or
applications at once? That would probably make a good use case for Sparrow,
no?

Nick
