@paul have you noticed the Sparrow project ( https://github.com/radlab/sparrow ) and related paper mentioned in the github . Sparrow is a non-central ,low latency scheduler . This seems meet Drill's demand. I think we can first abstract a scheduler interface like what Spark does , then we can have different scheduler implementations (central or non-central ,maybe non-central like sparrow be the default one ).
On Mon, Aug 21, 2017 at 11:51 PM, weijie tong <[email protected]> wrote: > Thanks for all your suggestions. > > @paul your analysis is impressive . I agree with your opinion. Current > queue solution can not solve this problem perfectly. Our system is > suffering a hard time once the cluster is in high load. I will think about > this more deeply. welcome more ideas or suggestions to be shared in this > thread,maybe some little improvement . > > > On Mon, 21 Aug 2017 at 4:06 AM Paul Rogers <[email protected]> wrote: > >> Hi Weijie, >> >> Great analysis. Let’s look at a few more data points. >> >> Drill has no central scheduler (this is a feature: it makes the cluster >> much easier to manage and has no single point of failure. It was probably >> the easiest possible solution while Drill was being built.) Instead of >> central control, Drill is based on the assumption of symmetry: all >> Drillbits are identical. So, each Foreman, acting independently, should try >> to schedule its load in a way that evenly distributes work across nodes in >> the cluster. If all Drillbits do the same, then load should be balanced; >> there should be no “hot spots.” >> >> Note, for this to work, Drill should either own the cluster, or any other >> workload on the cluster should also be evenly distributed. >> >> Drill makes another simplification: that the cluster has infinite >> resources (or, equivalently, that the admin sized the cluster for peak >> load.) That is, as Sudheesh puts it, “Drill is optimistic” Therefore, Drill >> usually runs with no throttling mechanism to limit overall cluster load. In >> real clusters, of course, resources are limited and either a large query >> load, or a few large queries, can saturate some or all of the available >> resources. >> >> Drill has a feature, seldom used, to throttle queries based purely on >> number. These ZK-based queues can allow, say, 5 queries to run (each of >> which is assumed to be evenly distributed.) In actual fact, the ZK-based >> queues recognize that typical workload have many small, and a few large, >> queries and so Drill offers the “small query” and “large query” queues. >> >> OK, so that’s where we are today. I think I’m not stepping too far out of >> line to observe that the above model is just a bit naive. It does not take >> into consideration the available cores, memory or disk I/Os. It does not >> consider the fact that memory has a hard upper limit and must be managed. >> Drill creates one thread for each minor fragment limited by the number of >> cores. But, each query can contain dozens or more fragments, resulting in >> far, far more threads per query than a node has cores. That is, Drill’s >> current scheduling model does not consider that, above a certain level, >> adding more threads makes the system slower because of thrashing. >> >> You propose a closed-loop, reactive control system (schedule load based >> on observed load on each Drillbit.) However, control-system theory tells us >> that such a system is subject to oscillation. All Foremen observe that a >> node X is loaded so none send it work. Node X later finishes its work and >> becomes underloaded. All Foremen now prefer node X and it swings back to >> being overloaded. In fact, Impala tried an open-loop design and there is >> some evidence in their documentation that they hit these very problems. >> >> So, what else could we do? As we’ve wrestled with these issues, we’ve >> come to the understanding that we need an open-loop, predictive solution. >> That is a fancy name for what YARN or Mesos does: keep track of available >> resources, reserve them for a task, and monitor the task so that it stays >> within the resource allocation. Predict load via allocation rather than >> reacting to actual load. >> >> In Drill, that might mean a scheduler which looks at all incoming queries >> and assigns cluster resources to each; queueing the query if necessary >> until resources become available. It also means that queries must live >> within their resource allocation. (The planner can help by predicting the >> likely needed resources. Then, at run time, spill-to-disk and other >> mechanisms allow queries to honor the resource limits.) >> >> The scheduler-based design is nothing new: it seems to be what Impala >> settled on, it is what YARN does for batch jobs, and it is a common pattern >> in other query engines. >> >> Back to the RPC issue. With proper scheduling, we limit load on each >> Drillbit so that RPC (and ZK heartbeats) can operate correctly. That is, >> rather than overloading a node, then attempting to recover, we wish instead >> to manage to load to prevent the overload in the first place. >> >> A coming pull request will take a first, small, step: it will allocate >> memory to queries based on the limit set by the ZK-based queues. The next >> step is to figure out how to limit the number of threads per query. (As >> noted above, a single large query can overwhelm the cluster if, say, it >> tries to do 100 subqueries with many sorts, joins, etc.) We welcome >> suggestions and pointers to how others have solved the problem. >> >> We also keep tossing around the idea of introducing that central >> scheduler. But, that is quite a bit of work and we’ve hard that users seem >> to have struggles with maintaining the YARN and Impala schedulers, so we’re >> somewhat hesitant to move away from a purely symmetrical configuration. >> Suggestions in this area are very welcome. >> >> For now, try turning on the ZK queues to limit concurrent queries and >> prevent overload. Ensure your cluster is sized for your workload. Ensure >> other work on the cluster is also symmetrical and doe not compete with >> Drill for resources. >> >> And, please continue to share your experiences! >> >> Thanks, >> >> - Paul >> >> > On Aug 20, 2017, at 5:39 AM, weijie tong <[email protected]> >> wrote: >> > >> > HI all: >> > >> > Drill's current schedule policy seems a little simple. The >> > SimpleParallelizer assigns endpoints in round robin model which ignores >> the >> > system's load and other factors. To critical scenario, some drillbits >> are >> > suffering frequent full GCs which will let their control RPC blocked. >> > Current assignment will not exclude these drillbits from the next coming >> > queries's assignment. then the problem will get worse . >> > I propose to add a zk path to hold bad drillbits. Forman will recognize >> > bad drillbits by waiting timeout (timeout of control response from >> > intermediate fragments), then update the bad drillbits path. Next coming >> > queries will exclude these drillbits from the assignment list. >> > How do you think about it or any suggests ? If sounds ok ,will file a >> > JIRA and give some contributes. >> >>
