Hi Weijie,

Great analysis. Let’s look at a few more data points.

Drill has no central scheduler. (This is a feature: it makes the cluster much
easier to manage and avoids a single point of failure; it was probably also
the simplest solution while Drill was being built.) Instead of central
control, Drill is based on an assumption of symmetry: all Drillbits are
identical. So, each Foreman, acting independently, should try to schedule its
load in a way that evenly distributes work across the nodes of the cluster.
If all Drillbits do the same, then load should be balanced; there should be
no “hot spots.”
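
To make this concrete, the per-Foreman assignment (in SimpleParallelizer,
which you mention below) boils down to something like the sketch below. This
is illustrative Java, not Drill’s actual code; the class and method names are
made up:

    import java.util.List;

    // Illustrative sketch: each Foreman spreads the minor fragments of its
    // query across the Drillbits it knows about (via ZK) in round-robin
    // order, with no knowledge of what other Foremen are doing or of how
    // loaded each node actually is.
    public class RoundRobinAssigner {
      private final List<String> drillbits;  // endpoints registered in ZK
      private int next = 0;

      public RoundRobinAssigner(List<String> drillbits) {
        this.drillbits = drillbits;
      }

      // Assign the next minor fragment purely by position in the list.
      public String assignNextFragment() {
        String endpoint = drillbits.get(next);
        next = (next + 1) % drillbits.size();
        return endpoint;
      }
    }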

Note that, for this to work, Drill must either own the cluster, or any other
workload on the cluster must also be evenly distributed.

Drill makes another simplification: it assumes the cluster has infinite
resources (or, equivalently, that the admin sized the cluster for peak load).
That is, as Sudheesh puts it, “Drill is optimistic.” Therefore, Drill usually
runs with no throttling mechanism to limit overall cluster load. In real
clusters, of course, resources are limited, and either a large query load, or
a few large queries, can saturate some or all of the available resources.

Drill has a feature, seldom used, to throttle queries based purely on query
count. These ZK-based queues can allow, say, 5 queries to run concurrently
(each of which is assumed to be evenly distributed across the cluster). In
fact, the ZK-based queues recognize that a typical workload has many small
queries and a few large ones, so Drill offers separate “small query” and
“large query” queues.
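
Under the covers, a queue of this kind is essentially a distributed semaphore
kept in ZK. Below is a rough sketch of the idea using Curator’s
InterProcessSemaphoreV2; the ZK path, the limit of 5, and the class name are
made up for illustration, and this is not Drill’s actual queue code:

    import java.util.concurrent.TimeUnit;

    import org.apache.curator.framework.CuratorFramework;
    import org.apache.curator.framework.CuratorFrameworkFactory;
    import org.apache.curator.framework.recipes.locks.InterProcessSemaphoreV2;
    import org.apache.curator.framework.recipes.locks.Lease;
    import org.apache.curator.retry.ExponentialBackoffRetry;

    // Illustrative sketch: admit at most 5 "large" queries cluster-wide by
    // acquiring a lease on a ZK-backed semaphore before running the query.
    public class QueueSketch {
      public static void main(String[] args) throws Exception {
        CuratorFramework zk = CuratorFrameworkFactory.newClient(
            "zkhost:2181", new ExponentialBackoffRetry(1000, 3));
        zk.start();

        InterProcessSemaphoreV2 largeQueue =
            new InterProcessSemaphoreV2(zk, "/example/queue/large", 5);

        // Wait (with a timeout) until one of the 5 slots is free.
        Lease lease = largeQueue.acquire(5, TimeUnit.MINUTES);
        if (lease == null) {
          throw new IllegalStateException("timed out waiting for a queue slot");
        }
        try {
          // ... run the query ...
        } finally {
          largeQueue.returnLease(lease);  // free the slot for the next query
        }
      }
    }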

OK, so that’s where we are today. I think I’m not stepping too far out of line
to observe that the above model is just a bit naive. It does not take into
consideration the available cores, memory, or disk I/O. It does not consider
the fact that memory has a hard upper limit and must be managed. Drill creates
one thread for each minor fragment, and the number of minor fragments per
major fragment on a node is capped by the number of cores. But each query can
contain dozens or more fragments, resulting in far, far more threads per query
than a node has cores. That is, Drill’s current scheduling model does not
consider that, above a certain level, adding more threads makes the system
slower because of thrashing.
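
To put rough numbers on that (purely illustrative): on a 24-core node, the
planner by default caps each major fragment at about 70% of the cores
(planner.width.max_per_node), or roughly 16 minor fragments. A query with 10
major fragments can therefore start on the order of 160 threads on that one
node, and five such queries running concurrently put roughly 800 threads in
contention for 24 cores.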

You propose a closed-loop, reactive control system (schedule load based on
observed load on each Drillbit). However, control-system theory tells us that
such a system is subject to oscillation. All Foremen observe that node X is
loaded, so none send it work. Node X later finishes its work and becomes
underloaded. All Foremen now prefer node X, and it swings back to being
overloaded. In fact, Impala tried a reactive design of this kind, and there
is some evidence in their documentation that they hit these very problems.

So, what else could we do? As we’ve wrestled with these issues, we’ve come to
understand that we need an open-loop, predictive solution. That is a fancy
name for what YARN or Mesos does: keep track of available resources, reserve
them for a task, and monitor the task so that it stays within its resource
allocation. The idea is to predict load via allocation rather than react to
actual load.
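
In code, the heart of such a predictive approach is nothing more than a
ledger of promised versus available resources. A minimal sketch (a made-up
class, not a proposal for actual Drill code):

    // Illustrative sketch of open-loop, predictive accounting: resources are
    // reserved up front from the planner's estimate, not observed afterward.
    public class ResourceLedger {
      private long availableMemory;  // bytes of memory not yet promised
      private int availableSlots;    // CPU slots (threads) not yet promised

      public ResourceLedger(long totalMemory, int totalSlots) {
        this.availableMemory = totalMemory;
        this.availableSlots = totalSlots;
      }

      // Try to reserve resources for a query before it starts. Returns false
      // if the query must wait for a running query to finish.
      public synchronized boolean reserve(long memory, int slots) {
        if (memory > availableMemory || slots > availableSlots) {
          return false;
        }
        availableMemory -= memory;
        availableSlots -= slots;
        return true;
      }

      // Return the resources when the query completes (or fails).
      public synchronized void release(long memory, int slots) {
        availableMemory += memory;
        availableSlots += slots;
      }
    }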

In Drill, that might mean a scheduler which looks at all incoming queries and
assigns cluster resources to each, queueing a query, if necessary, until
resources become available. It also means that queries must live within their
resource allocation. (The planner can help by predicting the resources a query
is likely to need. Then, at run time, spill-to-disk and other mechanisms allow
queries to honor their resource limits.)
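
Continuing the sketch above, the scheduler itself then becomes a small
admission-control loop: estimate, reserve, run, release, and queue whatever
does not fit right now. Again, the names are made up and this is only an
illustration, reusing the ResourceLedger class from the previous sketch:

    import java.util.ArrayDeque;
    import java.util.Queue;

    // Illustrative admission-control sketch: queries wait until the ledger
    // can reserve the resources the planner predicted they will need.
    public class AdmissionController {
      // Hypothetical holder for a planned query and its predicted needs.
      public static class QueryPlan {
        final long predictedMemory;
        final int predictedSlots;
        public QueryPlan(long predictedMemory, int predictedSlots) {
          this.predictedMemory = predictedMemory;
          this.predictedSlots = predictedSlots;
        }
      }

      private final ResourceLedger ledger;          // from the sketch above
      private final Queue<QueryPlan> waiting = new ArrayDeque<>();

      public AdmissionController(ResourceLedger ledger) {
        this.ledger = ledger;
      }

      // Admit the query now if its reservation fits, otherwise queue it.
      public synchronized boolean submit(QueryPlan plan) {
        if (ledger.reserve(plan.predictedMemory, plan.predictedSlots)) {
          return true;         // the caller may start the query
        }
        waiting.add(plan);     // wait for a completion to free resources
        return false;
      }

      // On completion, release the reservation and admit waiting queries.
      public synchronized void completed(QueryPlan plan) {
        ledger.release(plan.predictedMemory, plan.predictedSlots);
        while (!waiting.isEmpty()
            && ledger.reserve(waiting.peek().predictedMemory,
                              waiting.peek().predictedSlots)) {
          QueryPlan admitted = waiting.remove();
          // ... start the admitted query here ...
        }
      }
    }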

The scheduler-based design is nothing new: it seems to be what Impala settled 
on, it is what YARN does for batch jobs, and it is a common pattern in other 
query engines.

Back to the RPC issue. With proper scheduling, we limit load on each Drillbit
so that RPC (and ZK heartbeats) can operate correctly. That is, rather than
overloading a node and then attempting to recover, we wish instead to manage
the load to prevent the overload in the first place.

A coming pull request will take a first, small step: it will allocate memory
to queries based on the limit set by the ZK-based queues. The next step is to
figure out how to limit the number of threads per query. (As noted above, a
single large query can overwhelm the cluster if, say, it tries to do 100
subqueries with many sorts, joins, etc.) We welcome suggestions and pointers
to how others have solved this problem.
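
As a made-up illustration of that first step: if a node sets aside, say, 40 GB
of direct memory for queries and the large-query queue admits at most 5
queries, then each admitted query gets a budget on the order of 40 GB / 5 =
8 GB on that node, and the sort and other buffering operators share that
budget, spilling to disk when they reach it.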

We also keep tossing around the idea of introducing a central scheduler. But
that is quite a bit of work, and we’ve heard that users seem to struggle with
maintaining the YARN and Impala schedulers, so we’re somewhat hesitant to move
away from a purely symmetrical configuration. Suggestions in this area are
very welcome.

For now, try turning on the ZK-based queues (the exec.queue.* system options)
to limit concurrent queries and prevent overload. Ensure your cluster is sized
for your workload. Ensure other work on the cluster is also symmetrical and
does not compete with Drill for resources.

And, please continue to share your experiences!

Thanks,

- Paul

> On Aug 20, 2017, at 5:39 AM, weijie tong <tongweijie...@gmail.com> wrote:
> 
> HI all:
> 
>  Drill's current schedule policy seems a little simple. The
> SimpleParallelizer assigns endpoints in round robin model which ignores the
> system's load and other factors. To critical scenario, some drillbits are
> suffering frequent full GCs which will let their control RPC blocked.
> Current assignment will not exclude these drillbits from the next coming
> queries's assignment. then the problem will get worse .
>  I propose to add a zk path to hold bad drillbits. Forman will recognize
> bad drillbits by waiting timeout (timeout of  control response from
> intermediate fragments), then update the bad drillbits path. Next coming
> queries will exclude these drillbits from the assignment list.
>  How do you think about it or any suggests ? If sounds ok ,will file a
> JIRA and give some contributes.
