Hi Weijie,
Thanks much for the suggestions! It will take a while to digest all of this as
Drill’s existing scheduler (for fragments) is quite complex, but it works. I’ll
need to map those concepts to Sparrow. We still don’t have query-level
scheduling, but perhaps there is something that can
Maybe we need to adjust the MajorFragment execution phase, since the
intermediate MajorFragments are now lazily executed (if the intermediate
fragments' tasks are lazily allocated, or not allocated at all because of
resource restrictions, the downstream tasks that are already running will
be unable to send their data out).
We should let
Hi Paul:
I have been reading the Sparrow and Spark-Sparrow code over the last few
days. It seems Sparrow can match Drill's architecture very well. Following
Sparrow's Spark implementation, every MinorFragment can be treated as a Spark
task, and a MajorFragment can be treated as a Spark task set. We will start a
Hi Weijie,
Thanks for the link. I’d seen this project a bit earlier, along with Apollo
[1]. Sparrow is quite interesting, but is designed to place tasks (processes)
on available nodes. This is not quite what Drill does: Drill launches multiple
waves of “fragments” to all nodes across the
@paul have you noticed the Sparrow project
(https://github.com/radlab/sparrow) and the related paper mentioned in its
GitHub repository? Sparrow is a decentralized, low-latency scheduler. This
seems to meet Drill's needs. I think we can first abstract a scheduler
interface, as Spark does, then we can
Thanks for all your suggestions.
@paul your analysis is impressive. I agree with your opinion. The current
queue solution cannot solve this problem perfectly. Our system has a hard
time once the cluster is under high load. I will think about this more
deeply. I welcome more ideas or
Hi Weijie,
Great analysis. Let’s look at a few more data points.
Drill has no central scheduler (this is a feature: it makes the cluster much
easier to manage and has no single point of failure. It was probably the
easiest possible solution while Drill was being built.) Instead of central
If the control RPC to a drillbit is down, i.e., if a drillbit is not
responding, ZooKeeper should detect that and notify the other drillbits to
remove the dead drillbit from their active lists. Once that happens, the next
query that comes in should not even see that drillbit.
We need a way to differentiate
Hi all:
Drill's current scheduling policy seems a little simple. The
SimpleParallelizer assigns endpoints in a round-robin fashion, which ignores
the system's load and other factors. In critical scenarios, some drillbits
suffer frequent full GCs, which block their control RPC.
Current
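
To make the round-robin concern in the message above concrete, here is a minimal Python sketch contrasting strict rotation with a greedy load-aware assignment. It is an illustration only and not Drill's code: the function names, the endpoint names ("bit-a" and so on), the numeric load values, and the fixed cost per fragment are all hypothetical assumptions. Drill's actual SimpleParallelizer is Java and may behave differently in detail.

```python
from itertools import cycle


def assign_round_robin(endpoints, fragment_count):
    """Assign fragments to endpoints in strict rotation, ignoring load."""
    rotation = cycle(endpoints)
    return [next(rotation) for _ in range(fragment_count)]


def assign_least_loaded(load_by_endpoint, fragment_count, cost_per_fragment=1.0):
    """Greedy alternative: place each fragment on the endpoint with the
    lowest projected load. The load metric is a hypothetical number
    (for example, running fragments or GC pressure), not a Drill API."""
    projected = dict(load_by_endpoint)  # copy so the caller's view is untouched
    assignment = []
    for _ in range(fragment_count):
        # Ties are broken by endpoint name to keep the result deterministic.
        target = min(projected, key=lambda name: (projected[name], name))
        assignment.append(target)
        projected[target] += cost_per_fragment
    return assignment


# Hypothetical cluster: "bit-b" is struggling (say, frequent full GCs).
loads = {"bit-a": 0.0, "bit-b": 5.0, "bit-c": 1.0}
round_robin_plan = assign_round_robin(list(loads), 4)
load_aware_plan = assign_least_loaded(loads, 4)
```

With these assumed loads, rotation still hands a fragment to the overloaded "bit-b", while the greedy variant routes all four fragments to the two healthier endpoints. A real policy would also need a trustworthy, fresh load signal, which is the hard part in a cluster without a central scheduler.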