Thanks for all your suggestions.

@Paul, your analysis is impressive, and I agree with your opinion. The
current queue solution cannot solve this problem completely, and our
system suffers once the cluster is under high load. I will think about
this more deeply. More ideas or suggestions are welcome in this thread,
even small improvements.
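
For reference, here is a rough sketch of the bad-drillbits idea I had in
mind, assuming Apache Curator for the ZK access. The path name and the
registry class are placeholders of mine, not an existing Drill API:

import java.util.HashSet;
import java.util.List;
import java.util.Set;

import org.apache.curator.framework.CuratorFramework;
import org.apache.curator.framework.CuratorFrameworkFactory;
import org.apache.curator.retry.ExponentialBackoffRetry;
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.KeeperException;

public class BadDrillbitRegistry {
  // Hypothetical path; not a node Drill creates today.
  private static final String BAD_BITS_PATH = "/drill/bad_drillbits";
  private final CuratorFramework client;

  public BadDrillbitRegistry(String zkConnect) {
    client = CuratorFrameworkFactory.newClient(
        zkConnect, new ExponentialBackoffRetry(1000, 3));
    client.start();
  }

  /** Called by a Foreman when a control RPC to a drillbit times out. */
  public void markBad(String endpoint) throws Exception {
    try {
      // Ephemeral: the mark vanishes with the Foreman that reported it,
      // so a recovered drillbit is not blacklisted forever.
      client.create().creatingParentsIfNeeded()
          .withMode(CreateMode.EPHEMERAL)
          .forPath(BAD_BITS_PATH + "/" + endpoint);
    } catch (KeeperException.NodeExistsException e) {
      // Another Foreman already marked this drillbit; nothing to do.
    }
  }

  /** Consulted during parallelization to filter the assignment list. */
  public Set<String> badDrillbits() throws Exception {
    if (client.checkExists().forPath(BAD_BITS_PATH) == null) {
      return new HashSet<>();
    }
    List<String> bad = client.getChildren().forPath(BAD_BITS_PATH);
    return new HashSet<>(bad);
  }
}

The SimpleParallelizer would then drop any endpoint present in
badDrillbits() before doing its assignment.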


On Mon, 21 Aug 2017 at 4:06 AM Paul Rogers <prog...@mapr.com> wrote:

> Hi Weijie,
>
> Great analysis. Let’s look at a few more data points.
>
> Drill has no central scheduler (this is a feature: it makes the cluster
> much easier to manage and has no single point of failure. It was probably
> the easiest possible solution while Drill was being built.) Instead of
> central control, Drill is based on the assumption of symmetry: all
> Drillbits are identical. So, each Foreman, acting independently, should try
> to schedule its load in a way that evenly distributes work across nodes in
> the cluster. If all Drillbits do the same, then load should be balanced;
> there should be no “hot spots.”
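>
> (As a rough illustration of the symmetric model — hypothetical names,
> not the actual SimpleParallelizer code — each Foreman might do
> something like this:)
>
> import java.util.ArrayList;
> import java.util.List;
> import java.util.concurrent.atomic.AtomicInteger;
>
> // Each Foreman, independently, spreads minor fragments over all known
> // Drillbits in round-robin order. If every Foreman does the same and
> // all Drillbits are identical, load evens out statistically.
> public class RoundRobinAssigner {
>   private final List<String> endpoints;  // all active Drillbits
>   private final AtomicInteger next = new AtomicInteger();
>
>   public RoundRobinAssigner(List<String> endpoints) {
>     this.endpoints = endpoints;
>   }
>
>   public List<String> assign(int minorFragments) {
>     List<String> assignment = new ArrayList<>();
>     for (int i = 0; i < minorFragments; i++) {
>       assignment.add(endpoints.get(
>           Math.floorMod(next.getAndIncrement(), endpoints.size())));
>     }
>     return assignment;
>   }
> }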
>
> Note, for this to work, Drill should either own the cluster, or any other
> workload on the cluster should also be evenly distributed.
>
> Drill makes another simplification: that the cluster has infinite
> resources (or, equivalently, that the admin sized the cluster for peak
> load.) That is, as Sudheesh puts it, “Drill is optimistic.” Therefore, Drill
> usually runs with no throttling mechanism to limit overall cluster load. In
> real clusters, of course, resources are limited and either a large query
> load, or a few large queries, can saturate some or all of the available
> resources.
>
> Drill has a feature, seldom used, to throttle queries based purely on
> number. These ZK-based queues can allow, say, 5 queries to run (each of
> which is assumed to be evenly distributed.) In actual fact, the ZK-based
> queues recognize that typical workloads have many small, and a few large,
> queries and so Drill offers the “small query” and “large query” queues.
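>
> (For intuition, a minimal sketch of a ZK-backed admission semaphore
> using Apache Curator — an illustration of the idea, not Drill's actual
> queue code; the paths, limits, and timeout are made up:)
>
> import java.util.concurrent.TimeUnit;
>
> import org.apache.curator.framework.CuratorFramework;
> import org.apache.curator.framework.recipes.locks.InterProcessSemaphoreV2;
> import org.apache.curator.framework.recipes.locks.Lease;
>
> public class QueryQueue {
>   // Two cluster-wide semaphores: many slots for small queries, few for
>   // large ones. ZK enforces the limits across all Foremen.
>   private final InterProcessSemaphoreV2 smallQueue;
>   private final InterProcessSemaphoreV2 largeQueue;
>
>   public QueryQueue(CuratorFramework client) {
>     smallQueue = new InterProcessSemaphoreV2(client, "/drill/queue/small", 100);
>     largeQueue = new InterProcessSemaphoreV2(client, "/drill/queue/large", 5);
>   }
>
>   /** Blocks until a slot frees up (or times out); the caller holds the
>       lease for the life of the query, then closes it. */
>   public Lease admit(boolean isLarge) throws Exception {
>     InterProcessSemaphoreV2 queue = isLarge ? largeQueue : smallQueue;
>     Lease lease = queue.acquire(5, TimeUnit.MINUTES);  // null on timeout
>     if (lease == null) {
>       throw new IllegalStateException("timed out waiting for a queue slot");
>     }
>     return lease;
>   }
> }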
>
> OK, so that’s where we are today. I think I’m not stepping too far out of
> line to observe that the above model is just a bit naive. It does not take
> into consideration the available cores, memory or disk I/Os. It does not
> consider the fact that memory has a hard upper limit and must be managed.
> Drill creates one thread for each minor fragment and caps the number of
> minor fragments per node, per major fragment, at the core count. But each
> query can contain dozens or more major fragments, resulting in
> far, far more threads per query than a node has cores. That is, Drill’s
> current scheduling model does not consider that, above a certain level,
> adding more threads makes the system slower because of thrashing.
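>
> (To make the arithmetic concrete — the numbers below are assumptions,
> not Drill defaults:)
>
> // Threads grow multiplicatively: (major fragments) x (width per node).
> public class ThreadMath {
>   public static void main(String[] args) {
>     int cores = 24;            // cores on one node (assumed)
>     int majorFragments = 10;   // scans, joins, sorts, ... in one query
>     int widthPerNode = cores;  // minor fragments per major fragment
>     int threadsPerQuery = majorFragments * widthPerNode;
>     System.out.printf("One query: %d threads on a %d-core node%n",
>         threadsPerQuery, cores);          // 240 threads on 24 cores
>     System.out.printf("Five queries: %d threads%n",
>         5 * threadsPerQuery);             // 1,200 threads: thrashing
>   }
> }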
>
> You propose a closed-loop, reactive control system (schedule load based on
> observed load on each Drillbit.) However, control-system theory tells us
> that such a system is subject to oscillation. All Foremen observe that a
> node X is loaded so none send it work. Node X later finishes its work and
> becomes underloaded. All Foremen now prefer node X and it swings back to
> being overloaded. In fact, Impala tried such a reactive design and there
> is some evidence in their documentation that they hit these very problems.
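>
> (A toy model of that oscillation, just for intuition — all numbers
> invented:)
>
> import java.util.Arrays;
>
> // Every Foreman routes work to the least-loaded node, but all of them
> // act on the same stale observation, so they pile onto one node per
> // tick and the "coolest" node swings between extremes.
> public class OscillationDemo {
>   public static void main(String[] args) {
>     int[] load = new int[4];              // four nodes
>     for (int tick = 0; tick < 6; tick++) {
>       int[] observed = load.clone();      // stale snapshot all share
>       for (int foreman = 0; foreman < 8; foreman++) {
>         int target = 0;
>         for (int n = 1; n < observed.length; n++) {
>           if (observed[n] < observed[target]) target = n;
>         }
>         load[target]++;                   // everyone picks the same node
>       }
>       for (int n = 0; n < load.length; n++) {
>         load[n] = Math.max(0, load[n] - 4);  // each node drains 4/tick
>       }
>       System.out.println("tick " + tick + ": " + Arrays.toString(load));
>     }
>   }
> }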
>
> So, what else could we do? As we’ve wrestled with these issues, we’ve come
> to the understanding that we need an open-loop, predictive solution. That
> is a fancy name for what YARN or Mesos does: keep track of available
> resources, reserve them for a task, and monitor the task so that it stays
> within the resource allocation. Predict load via allocation rather than
> reacting to actual load.
>
> In Drill, that might mean a scheduler which looks at all incoming queries
> and assigns cluster resources to each; queueing the query if necessary
> until resources become available. It also means that queries must live
> within their resource allocation. (The planner can help by predicting the
> likely needed resources. Then, at run time, spill-to-disk and other
> mechanisms allow queries to honor the resource limits.)
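>
> (A bare-bones sketch of that open-loop idea — admit a query only when
> its predicted resources fit, otherwise queue it; all names and numbers
> here are invented:)
>
> import java.util.ArrayDeque;
> import java.util.Queue;
>
> // Load is controlled by allocation (prediction), not by reacting to
> // observed load, so there is no feedback loop to oscillate.
> public class AdmissionController {
>   private long freeMemory;
>   private int freeThreads;
>   private final Queue<Runnable> waiting = new ArrayDeque<>();
>
>   public AdmissionController(long totalMemory, int totalThreads) {
>     this.freeMemory = totalMemory;
>     this.freeThreads = totalThreads;
>   }
>
>   /** memNeeded/threadsNeeded come from the planner's prediction. */
>   public synchronized void submit(long memNeeded, int threadsNeeded,
>       Runnable query) {
>     if (memNeeded <= freeMemory && threadsNeeded <= freeThreads) {
>       freeMemory -= memNeeded;
>       freeThreads -= threadsNeeded;
>       query.run();           // in reality, dispatched asynchronously
>     } else {
>       waiting.add(query);    // runs later, when release() frees room
>     }
>   }
>
>   public synchronized void release(long mem, int threads) {
>     freeMemory += mem;
>     freeThreads += threads;
>     // A real scheduler would now re-check the waiting queue.
>   }
> }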
>
> The scheduler-based design is nothing new: it seems to be what Impala
> settled on, it is what YARN does for batch jobs, and it is a common pattern
> in other query engines.
>
> Back to the RPC issue. With proper scheduling, we limit load on each
> Drillbit so that RPC (and ZK heartbeats) can operate correctly. That is,
> rather than overloading a node, then attempting to recover, we wish instead
> to manage the load to prevent the overload in the first place.
>
> A coming pull request will take a first, small, step: it will allocate
> memory to queries based on the limit set by the ZK-based queues. The next
> step is to figure out how to limit the number of threads per query. (As
> noted above, a single large query can overwhelm the cluster if, say, it
> tries to do 100 subqueries with many sorts, joins, etc.) We welcome
> suggestions and pointers to how others have solved the problem.
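>
> (One plausible shape of that memory calculation — the fraction and
> queue size below are assumptions, not what the pull request actually
> does:)
>
> // With at most N queries admitted (the ZK queue limit), each query can
> // safely be planned against a fixed slice of the node's direct memory.
> public class QueueMemoryCalc {
>   public static void main(String[] args) {
>     long directMemory = 40L << 30;    // 40 GB direct memory per node
>     double queryShare = 0.70;         // share reserved for queries
>     int queueLimit = 5;               // concurrent "large" queries
>     long perQuery = (long) (directMemory * queryShare) / queueLimit;
>     System.out.printf("Each query may plan on %d MB per node%n",
>         perQuery >> 20);              // ~5734 MB each
>   }
> }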
>
> We also keep tossing around the idea of introducing that central
> scheduler. But, that is quite a bit of work, and we’ve heard that users
> struggle with maintaining the YARN and Impala schedulers, so we’re
> somewhat hesitant to move away from a purely symmetrical configuration.
> Suggestions in this area are very welcome.
>
> For now, try turning on the ZK queues to limit concurrent queries and
> prevent overload. Ensure your cluster is sized for your workload. Ensure
> other work on the cluster is also symmetrical and does not compete with
> Drill for resources.
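>
> (For reference, the ZK queues are controlled by system options along
> these lines — the values below are only examples, so check the
> documentation for your Drill version for exact defaults:)
>
> ALTER SYSTEM SET `exec.queue.enable` = true;
> ALTER SYSTEM SET `exec.queue.small` = 20;       -- concurrent small queries
> ALTER SYSTEM SET `exec.queue.large` = 2;        -- concurrent large queries
> ALTER SYSTEM SET `exec.queue.threshold` = 30000000;  -- small/large cost cutoff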
>
> And, please continue to share your experiences!
>
> Thanks,
>
> - Paul
>
> > On Aug 20, 2017, at 5:39 AM, weijie tong <tongweijie...@gmail.com>
> wrote:
> >
> > Hi all:
> >
> >  Drill's current scheduling policy seems a little simple. The
> > SimpleParallelizer assigns endpoints in a round-robin manner that
> > ignores the system's load and other factors. In critical scenarios,
> > some drillbits suffer frequent full GCs, which block their control
> > RPCs. The current assignment does not exclude these drillbits from the
> > assignment of the next incoming queries, so the problem gets worse.
> >  I propose adding a ZK path to hold bad drillbits. The Foreman would
> > recognize bad drillbits by a wait timeout (timeout of the control
> > response from intermediate fragments), then update the bad-drillbits
> > path. Subsequent queries would exclude these drillbits from the
> > assignment list.
> >  What do you think, or do you have any suggestions? If it sounds OK, I
> > will file a JIRA and contribute.
>
>
