@paul have you noticed the Sparrow project (
https://github.com/radlab/sparrow ) and the related paper mentioned on
GitHub? Sparrow is a decentralized, low-latency scheduler, which seems to
meet Drill's demand. I think we could first abstract a scheduler interface,
as Spark does, and then provide different scheduler implementations
(centralized or decentralized, with a decentralized one like Sparrow
perhaps as the default).
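A minimal sketch of such a pluggable abstraction (all names are hypothetical; Drill itself is Java, Python is used here only to illustrate the idea):

```python
import random
from abc import ABC, abstractmethod

class Scheduler(ABC):
    """Hypothetical scheduler abstraction: given a query's fragments and
    the known endpoints, decide where each fragment runs."""

    @abstractmethod
    def assign(self, fragments, endpoints):
        """Return a mapping of fragment id -> endpoint."""

class RoundRobinScheduler(Scheduler):
    """Default behavior in the spirit of today's SimpleParallelizer:
    cycle through endpoints in order, ignoring load."""
    def assign(self, fragments, endpoints):
        return {f: endpoints[i % len(endpoints)]
                for i, f in enumerate(fragments)}

class SparrowLikeScheduler(Scheduler):
    """Decentralized alternative: probe a small random sample of
    endpoints and pick the least loaded, in the spirit of Sparrow's
    power-of-two-choices sampling."""
    def __init__(self, load_fn, probes=2):
        self.load_fn = load_fn   # endpoint -> observed load estimate
        self.probes = probes

    def assign(self, fragments, endpoints):
        result = {}
        for f in fragments:
            sample = random.sample(endpoints,
                                   min(self.probes, len(endpoints)))
            result[f] = min(sample, key=self.load_fn)
        return result
```

With such an interface, the choice of implementation becomes a configuration decision rather than a structural one.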

On Mon, Aug 21, 2017 at 11:51 PM, weijie tong <[email protected]>
wrote:

> Thanks for all your suggestions.
>
> @paul, your analysis is impressive, and I agree with your opinion. The
> current queue solution cannot solve this problem perfectly. Our system
> suffers once the cluster is under high load. I will think about this more
> deeply. More ideas or suggestions, even small improvements, are welcome
> in this thread.
>
>
> On Mon, 21 Aug 2017 at 4:06 AM Paul Rogers <[email protected]> wrote:
>
>> Hi Weijie,
>>
>> Great analysis. Let’s look at a few more data points.
>>
>> Drill has no central scheduler (this is a feature: it makes the cluster
>> much easier to manage and has no single point of failure. It was probably
>> the easiest possible solution while Drill was being built.) Instead of
>> central control, Drill is based on the assumption of symmetry: all
>> Drillbits are identical. So, each Foreman, acting independently, should try
>> to schedule its load in a way that evenly distributes work across nodes in
>> the cluster. If all Drillbits do the same, then load should be balanced;
>> there should be no “hot spots.”
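The symmetry assumption above can be illustrated with a toy sketch (all numbers hypothetical): each Foreman independently round-robins from a random starting offset, with no coordination at all, and the aggregate load still comes out roughly uniform.

```python
import random
from collections import Counter

def foreman_assign(num_fragments, nodes, rng):
    """One Foreman acting alone: round-robin from a random offset."""
    start = rng.randrange(len(nodes))
    return [nodes[(start + i) % len(nodes)] for i in range(num_fragments)]

rng = random.Random(42)
nodes = ["n1", "n2", "n3", "n4"]
tally = Counter()
for _ in range(1000):                 # 1000 queries of 5 fragments each
    tally.update(foreman_assign(5, nodes, rng))
# Each node lands near 1250 fragments: no hot spots, no coordinator.
```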
>>
>> Note, for this to work, Drill should either own the cluster, or any other
>> workload on the cluster should also be evenly distributed.
>>
>> Drill makes another simplification: that the cluster has infinite
>> resources (or, equivalently, that the admin sized the cluster for peak
>> load.) That is, as Sudheesh puts it, “Drill is optimistic.” Therefore, Drill
>> usually runs with no throttling mechanism to limit overall cluster load. In
>> real clusters, of course, resources are limited and either a large query
>> load, or a few large queries, can saturate some or all of the available
>> resources.
>>
>> Drill has a feature, seldom used, to throttle queries based purely on
>> number. These ZK-based queues can allow, say, 5 queries to run (each of
>> which is assumed to be evenly distributed.) In actual fact, the ZK-based
>> queues recognize that typical workloads have many small, and a few large,
>> queries and so Drill offers the “small query” and “large query” queues.
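A sketch of that two-queue admission idea (thresholds, limits, and names here are hypothetical, not Drill's actual configuration):

```python
from collections import deque

class QueryQueues:
    """Admit queries into a 'small' or 'large' queue by estimated cost,
    each queue with its own concurrency limit; excess queries wait."""
    def __init__(self, small_limit=10, large_limit=5, cost_threshold=1000):
        self.limits = {"small": small_limit, "large": large_limit}
        self.running = {"small": 0, "large": 0}
        self.waiting = {"small": deque(), "large": deque()}
        self.cost_threshold = cost_threshold

    def submit(self, query_id, estimated_cost):
        q = "small" if estimated_cost < self.cost_threshold else "large"
        if self.running[q] < self.limits[q]:
            self.running[q] += 1
            return (q, "run")
        self.waiting[q].append(query_id)
        return (q, "queued")

    def complete(self, queue_name):
        """A query finished; admit the next waiter in that queue, if any."""
        self.running[queue_name] -= 1
        if self.waiting[queue_name]:
            qid = self.waiting[queue_name].popleft()
            self.running[queue_name] += 1
            return qid
        return None
```

In Drill the shared state would live in ZooKeeper rather than in memory, but the admission logic is the same shape.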
>>
>> OK, so that’s where we are today. I think I’m not stepping too far out of
>> line to observe that the above model is just a bit naive. It does not take
>> into consideration the available cores, memory or disk I/Os. It does not
>> consider the fact that memory has a hard upper limit and must be managed.
>> Drill creates one thread for each minor fragment, limited by the number of
>> cores. But, each query can contain dozens or more fragments, resulting in
>> far, far more threads per query than a node has cores. That is, Drill’s
>> current scheduling model does not consider that, above a certain level,
>> adding more threads makes the system slower because of thrashing.
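The arithmetic behind that concern is easy to sketch (all numbers purely illustrative):

```python
# Illustrative only: how per-node threads outrun cores.
cores_per_node = 24

# One query: suppose 10 major fragments, each parallelized up to the
# core count on every node (a common width ceiling).
major_fragments = 10
minor_per_major_per_node = cores_per_node
threads_per_node_one_query = major_fragments * minor_per_major_per_node
assert threads_per_node_one_query == 240   # 10x the cores, from ONE query

concurrent_queries = 5
total_threads = threads_per_node_one_query * concurrent_queries
assert total_threads == 1200               # 50x oversubscription
```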
>>
>> You propose a closed-loop, reactive control system (schedule load based
>> on observed load on each Drillbit.) However, control-system theory tells us
>> that such a system is subject to oscillation. All Foremen observe that a
>> node X is loaded so none send it work. Node X later finishes its work and
>> becomes underloaded. All Foremen now prefer node X and it swings back to
>> being overloaded. In fact, Impala tried a closed-loop design and there is
>> some evidence in their documentation that they hit these very problems.
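That oscillation can be reproduced in a toy simulation (all numbers hypothetical): every Foreman reads the same load snapshot and picks the least-loaded node, so the "best" node is flooded each round and the herd swings back and forth.

```python
def simulate(rounds=6, foremen=20):
    """Closed-loop toy model: all foremen see the same load snapshot
    and all pick the minimum, overloading it; work then drains and the
    herd swings to the next apparently idle node."""
    load = [10, 10, 10, 0]               # node 3 looks idle at first
    history = []
    for _ in range(rounds):
        target = load.index(min(load))   # everyone picks the same node
        load[target] += foremen          # herd effect
        load = [max(0, l - 12) for l in load]  # work drains each round
        history.append(target)
    return history

# The chosen node bounces between members rather than spreading load.
```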
>>
>> So, what else could we do? As we’ve wrestled with these issues, we’ve
>> come to the understanding that we need an open-loop, predictive solution.
>> That is a fancy name for what YARN or Mesos does: keep track of available
>> resources, reserve them for a task, and monitor the task so that it stays
>> within the resource allocation. Predict load via allocation rather than
>> reacting to actual load.
>>
>> In Drill, that might mean a scheduler which looks at all incoming queries
>> and assigns cluster resources to each; queueing the query if necessary
>> until resources become available. It also means that queries must live
>> within their resource allocation. (The planner can help by predicting the
>> likely needed resources. Then, at run time, spill-to-disk and other
>> mechanisms allow queries to honor the resource limits.)
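That predict-and-reserve loop might look roughly like this (a sketch under assumed units and names, not any real Drill or YARN API):

```python
class ResourceScheduler:
    """Open-loop admission: admit a query only if its predicted memory
    fits in the remaining reservation budget; otherwise queue it until
    a running query releases its reservation."""
    def __init__(self, total_memory_gb):
        self.free = total_memory_gb
        self.pending = []     # (query_id, predicted_gb) awaiting room

    def submit(self, query_id, predicted_gb):
        if predicted_gb <= self.free:
            self.free -= predicted_gb
            return "admitted"
        self.pending.append((query_id, predicted_gb))
        return "queued"

    def release(self, predicted_gb):
        """A query finished: reclaim its reservation, then admit every
        pending query that now fits."""
        self.free += predicted_gb
        admitted, still_waiting = [], []
        for qid, need in self.pending:
            if need <= self.free:
                self.free -= need
                admitted.append(qid)
            else:
                still_waiting.append((qid, need))
        self.pending = still_waiting
        return admitted
```

Note that nothing here reacts to observed load; decisions depend only on the predictions, which is what keeps the loop from oscillating.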
>>
>> The scheduler-based design is nothing new: it seems to be what Impala
>> settled on, it is what YARN does for batch jobs, and it is a common pattern
>> in other query engines.
>>
>> Back to the RPC issue. With proper scheduling, we limit load on each
>> Drillbit so that RPC (and ZK heartbeats) can operate correctly. That is,
>> rather than overloading a node, then attempting to recover, we wish instead
>> to manage the load to prevent the overload in the first place.
>>
>> A coming pull request will take a first, small, step: it will allocate
>> memory to queries based on the limit set by the ZK-based queues. The next
>> step is to figure out how to limit the number of threads per query. (As
>> noted above, a single large query can overwhelm the cluster if, say, it
>> tries to do 100 subqueries with many sorts, joins, etc.) We welcome
>> suggestions and pointers to how others have solved the problem.
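The first step Paul mentions amounts to simple arithmetic (illustrative numbers and a hypothetical formula; the real one lives in the pull request):

```python
def memory_per_query_gb(node_memory_gb, reserve_fraction,
                        max_concurrent_queries):
    """Divide the memory left after a safety reserve evenly among the
    maximum number of queries the ZK queue admits at once."""
    usable = node_memory_gb * (1 - reserve_fraction)
    return usable / max_concurrent_queries

# e.g. a 128 GB node, 25% reserved for heap/OS, queue limit of 5
# queries: each query may plan against 19.2 GB per node.
```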
>>
>> We also keep tossing around the idea of introducing that central
>> scheduler. But, that is quite a bit of work and we’ve heard that users seem
>> to have struggles with maintaining the YARN and Impala schedulers, so we’re
>> somewhat hesitant to move away from a purely symmetrical configuration.
>> Suggestions in this area are very welcome.
>>
>> For now, try turning on the ZK queues to limit concurrent queries and
>> prevent overload. Ensure your cluster is sized for your workload. Ensure
>> other work on the cluster is also symmetrical and does not compete with
>> Drill for resources.
>>
>> And, please continue to share your experiences!
>>
>> Thanks,
>>
>> - Paul
>>
>> > On Aug 20, 2017, at 5:39 AM, weijie tong <[email protected]>
>> wrote:
>> >
>> > Hi all:
>> >
>> >  Drill's current scheduling policy seems a little simple. The
>> > SimpleParallelizer assigns endpoints in a round-robin model that
>> > ignores the system's load and other factors. In critical scenarios,
>> > some drillbits suffer frequent full GCs, which block their control
>> > RPCs. The current assignment will not exclude these drillbits from the
>> > next incoming queries' assignment, so the problem gets worse.
>> >  I propose to add a ZK path to hold bad drillbits. The Foreman would
>> > recognize bad drillbits by waiting for a timeout (of the control
>> > response from intermediate fragments), then update the bad-drillbits
>> > path. Subsequent queries would exclude these drillbits from the
>> > assignment list.
>> >  What do you think about it? Any suggestions? If it sounds OK, I will
>> > file a JIRA and contribute.
>>
>>
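Weijie's blacklist proposal in the quoted message could be sketched as follows (the ZK path is mocked with an in-memory dict, and all names, including the recovery window, are hypothetical):

```python
import time

class BadBitRegistry:
    """In-memory stand-in for the proposed ZK path: Foremen record
    drillbits whose control RPCs timed out; assignment then excludes
    them until a recovery window elapses."""
    def __init__(self, recovery_secs=300, clock=time.time):
        self.bad = {}                 # endpoint -> time it was marked bad
        self.recovery_secs = recovery_secs
        self.clock = clock            # injectable for testing

    def mark_bad(self, endpoint):
        """Called when a control response from this drillbit times out."""
        self.bad[endpoint] = self.clock()

    def healthy(self, endpoints):
        """Filter the assignment list down to usable drillbits."""
        now = self.clock()
        def ok(e):
            marked = self.bad.get(e)
            return marked is None or now - marked >= self.recovery_secs
        return [e for e in endpoints if ok(e)]
```

One design question this sketch surfaces: a time-based recovery window readmits a drillbit automatically, whereas a purely manual blacklist would need an operator (or a health probe) to clear the ZK path.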
