Thanks for all your suggestions. @Paul, your analysis is impressive, and I agree with your opinion. The current queue solution cannot solve this problem completely, and our system has a hard time once the cluster is under high load. I will think about this more deeply. More ideas or suggestions are welcome in this thread, even small incremental improvements.
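For reference, the "queue solution" here is the ZK-based query queues Paul describes below. A rough sketch of turning them on through Drill's JDBC driver follows; the numbers are only examples, not tuning advice, and the ZooKeeper address is a placeholder:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

// Rough sketch: enabling the existing ZK-based query queues via Drill's
// JDBC driver. The option names are Drill system options; the values are
// only illustrative.
public class EnableZkQueues {
  public static void main(String[] args) throws Exception {
    // zk=... should point at the cluster's ZooKeeper quorum (placeholder).
    try (Connection conn = DriverManager.getConnection("jdbc:drill:zk=localhost:2181");
         Statement stmt = conn.createStatement()) {
      stmt.execute("ALTER SYSTEM SET `exec.queue.enable` = true");
      stmt.execute("ALTER SYSTEM SET `exec.queue.small` = 10");             // concurrent "small" queries
      stmt.execute("ALTER SYSTEM SET `exec.queue.large` = 2");              // concurrent "large" queries
      stmt.execute("ALTER SYSTEM SET `exec.queue.threshold` = 30000000");   // cost cutoff between the two
      stmt.execute("ALTER SYSTEM SET `exec.queue.timeout_millis` = 300000"); // max wait in the queue
    }
  }
}

The same ALTER SYSTEM statements can of course be issued from any SQL client.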
On Mon, 21 Aug 2017 at 4:06 AM Paul Rogers <prog...@mapr.com> wrote:
> Hi Weijie,
>
> Great analysis. Let's look at a few more data points.
>
> Drill has no central scheduler (this is a feature: it makes the cluster much easier to manage and has no single point of failure. It was probably the easiest possible solution while Drill was being built.) Instead of central control, Drill is based on the assumption of symmetry: all Drillbits are identical. So, each Foreman, acting independently, should try to schedule its load in a way that evenly distributes work across nodes in the cluster. If all Drillbits do the same, then load should be balanced; there should be no "hot spots."
>
> Note, for this to work, Drill should either own the cluster, or any other workload on the cluster should also be evenly distributed.
>
> Drill makes another simplification: that the cluster has infinite resources (or, equivalently, that the admin sized the cluster for peak load.) That is, as Sudheesh puts it, "Drill is optimistic." Therefore, Drill usually runs with no throttling mechanism to limit overall cluster load. In real clusters, of course, resources are limited and either a large query load, or a few large queries, can saturate some or all of the available resources.
>
> Drill has a feature, seldom used, to throttle queries based purely on number. These ZK-based queues can allow, say, 5 queries to run (each of which is assumed to be evenly distributed.) In actual fact, the ZK-based queues recognize that typical workloads have many small, and a few large, queries, and so Drill offers the "small query" and "large query" queues.
>
> OK, so that's where we are today. I think I'm not stepping too far out of line to observe that the above model is just a bit naive. It does not take into consideration the available cores, memory or disk I/Os. It does not consider the fact that memory has a hard upper limit and must be managed. Drill creates one thread for each minor fragment, limited by the number of cores. But each query can contain dozens or more fragments, resulting in far, far more threads per query than a node has cores. That is, Drill's current scheduling model does not consider that, above a certain level, adding more threads makes the system slower because of thrashing.
>
> You propose a closed-loop, reactive control system (schedule load based on observed load on each Drillbit.) However, control-system theory tells us that such a system is subject to oscillation. All Foremen observe that a node X is loaded, so none send it work. Node X later finishes its work and becomes underloaded. All Foremen now prefer node X, and it swings back to being overloaded. In fact, Impala tried a closed-loop design, and there is some evidence in their documentation that they hit these very problems.
>
> So, what else could we do? As we've wrestled with these issues, we've come to the understanding that we need an open-loop, predictive solution. That is a fancy name for what YARN or Mesos does: keep track of available resources, reserve them for a task, and monitor the task so that it stays within the resource allocation. Predict load via allocation rather than reacting to actual load.
>
> In Drill, that might mean a scheduler which looks at all incoming queries and assigns cluster resources to each, queueing the query if necessary until resources become available. It also means that queries must live within their resource allocation. (The planner can help by predicting the likely needed resources. Then, at run time, spill-to-disk and other mechanisms allow queries to honor the resource limits.)
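To make this "reserve resources, queue until they are available" idea concrete, here is a rough sketch of an admission gate. This is not Drill code; the class and method names are hypothetical, it only tracks memory (a real implementation would also budget threads/CPU), and it exists purely to illustrate the open-loop, predictive approach described above:

import java.util.ArrayDeque;
import java.util.Queue;

// Hypothetical admission gate: reserve memory for a query up front and
// queue the query until the reservation fits, instead of reacting to
// observed load after the fact.
public class AdmissionGate {
  private final long totalMemory;      // memory the scheduler may hand out
  private long reservedMemory;         // memory currently promised to queries
  private final Queue<QueuedQuery> waiting = new ArrayDeque<>();

  public AdmissionGate(long totalMemory) {
    this.totalMemory = totalMemory;
  }

  // Called by the Foreman before it starts fragments. The planner's cost
  // estimate (or a per-queue limit) supplies memoryNeeded.
  public synchronized boolean tryAdmit(QueuedQuery query, long memoryNeeded) {
    if (reservedMemory + memoryNeeded <= totalMemory) {
      reservedMemory += memoryNeeded;
      return true;                     // run now, within the reservation
    }
    query.memoryNeeded = memoryNeeded;
    waiting.add(query);                // otherwise wait for capacity
    return false;
  }

  // Called when a query finishes; releases its reservation and admits
  // as many waiting queries as now fit.
  public synchronized void release(long memoryReleased) {
    reservedMemory -= memoryReleased;
    while (!waiting.isEmpty()
        && reservedMemory + waiting.peek().memoryNeeded <= totalMemory) {
      QueuedQuery next = waiting.poll();
      reservedMemory += next.memoryNeeded;
      next.start();                    // resume the queued query
    }
  }

  // Stand-in for whatever object represents a queued query.
  public static class QueuedQuery {
    long memoryNeeded;
    void start() { /* resume planning/execution */ }
  }
}

The key property is that load is predicted from what has been handed out, not from what the Drillbits happen to report, which is what separates this from the reactive, closed-loop approach discussed above.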
> The scheduler-based design is nothing new: it seems to be what Impala settled on, it is what YARN does for batch jobs, and it is a common pattern in other query engines.
>
> Back to the RPC issue. With proper scheduling, we limit load on each Drillbit so that RPC (and ZK heartbeats) can operate correctly. That is, rather than overloading a node, then attempting to recover, we wish instead to manage the load to prevent the overload in the first place.
>
> A coming pull request will take a first, small step: it will allocate memory to queries based on the limit set by the ZK-based queues. The next step is to figure out how to limit the number of threads per query. (As noted above, a single large query can overwhelm the cluster if, say, it tries to do 100 subqueries with many sorts, joins, etc.) We welcome suggestions and pointers to how others have solved the problem.
>
> We also keep tossing around the idea of introducing that central scheduler. But that is quite a bit of work, and we've heard that users seem to struggle with maintaining the YARN and Impala schedulers, so we're somewhat hesitant to move away from a purely symmetrical configuration. Suggestions in this area are very welcome.
>
> For now, try turning on the ZK queues to limit concurrent queries and prevent overload. Ensure your cluster is sized for your workload. Ensure other work on the cluster is also symmetrical and does not compete with Drill for resources.
>
> And, please continue to share your experiences!
>
> Thanks,
>
> - Paul
>
> > On Aug 20, 2017, at 5:39 AM, weijie tong <tongweijie...@gmail.com> wrote:
> >
> > Hi all:
> >
> > Drill's current scheduling policy seems a little simple. The SimpleParallelizer assigns endpoints in a round-robin fashion, which ignores the system's load and other factors. In critical scenarios, some Drillbits suffer frequent full GCs, which block their control RPC. The current assignment will not exclude these Drillbits from the next incoming queries' assignments, so the problem gets worse.
> > I propose adding a ZK path to hold bad Drillbits. The Foreman would recognize bad Drillbits by timeout (timeout of control responses from intermediate fragments), then update the bad-Drillbits path. Subsequent queries would exclude these Drillbits from the assignment list.
> > What do you think about it, or do you have any suggestions? If it sounds OK, I will file a JIRA and contribute.
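To make the original proposal concrete, here is a rough sketch of the bad-Drillbits registry using Curator (which Drill already uses for ZooKeeper). The path name, class, and methods are made up for illustration; the real change would live in the Foreman/parallelizer code:

import java.util.List;
import java.util.stream.Collectors;

import org.apache.curator.framework.CuratorFramework;
import org.apache.zookeeper.CreateMode;

// Hypothetical registry of "bad" Drillbits kept under a ZK path. A Foreman
// that sees a control-RPC timeout from a node would mark it here; later
// queries would filter it out of the endpoint list before parallelization.
public class BadDrillbitRegistry {
  private static final String BAD_BITS_PATH = "/drill/bad-drillbits"; // made-up path

  private final CuratorFramework zk;   // Drill already holds a Curator client

  public BadDrillbitRegistry(CuratorFramework zk) {
    this.zk = zk;
  }

  // Mark a node as bad, e.g. after a control response times out.
  // An ephemeral node means the entry disappears if the marking Foreman dies;
  // a TTL or explicit rehabilitation check would be needed in practice.
  public void markBad(String drillbitAddress) throws Exception {
    String path = BAD_BITS_PATH + "/" + drillbitAddress;
    if (zk.checkExists().forPath(path) == null) {
      zk.create()
        .creatingParentsIfNeeded()
        .withMode(CreateMode.EPHEMERAL)
        .forPath(path);
    }
  }

  // Drop endpoints that are currently marked bad before assignment.
  public List<String> filterEndpoints(List<String> activeEndpoints) throws Exception {
    if (zk.checkExists().forPath(BAD_BITS_PATH) == null) {
      return activeEndpoints;          // nothing marked bad yet
    }
    List<String> bad = zk.getChildren().forPath(BAD_BITS_PATH);
    return activeEndpoints.stream()
        .filter(e -> !bad.contains(e))
        .collect(Collectors.toList());
  }
}

Paul's oscillation concern applies here too: something has to take a node off the bad list again (a TTL, or a successful health check), otherwise a transient GC pause permanently shrinks the cluster.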