[ https://issues.apache.org/jira/browse/DRILL-5975?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16258585#comment-16258585 ]

Paul Rogers commented on DRILL-5975:
------------------------------------

The design proposed may work for simple queries, but it is not sufficiently 
general for large queries. Let's start with some background. Here is Drill's 
current design. As you point out, there are many opportunities for improvement. 
But, this is what we have today.

This is an important topic, so I've taken the liberty of explaining our current 
understanding in some detail.

h4. Current Model

Drill's current resource model is based on symmetry: it assumes queries arrive 
randomly at each Drillbit. That is, each user connects to a random Foreman, 
each user presents a similar load, and the number of users is larger than the 
number of Drillbits, so that each Foreman sees roughly equal load.

Next, Drill assumes that users want maximum performance of each query. 
Therefore, it strives for maximum parallelization across both the CPUs in each 
node and all nodes in the cluster.

To do this, Drill assumes nodes are symmetrical: all have the same number 
of available CPUs and the same memory. Drill also assumes that the cluster is 
dedicated to Drill, and so it attempts to saturate all CPUs on all hosts.

Next, Drill assumes that all queries are fast, so that query 1 will, roughly 
speaking, complete before query 2 starts. This means that resource sharing is 
sequential: query 1 uses all memory and CPUs, then query 2 does so.

The above model may not be ideal, but it is the "simplest thing that could 
work." It has gotten Drill this far, but it does clearly have limits.

h4. Limitations of the Current Model

Problems occur, obviously, when the real world does not agree with Drill's 
assumptions. Typical problems:

* Drill does not "own" its cluster, and instead shares resources (such as CPU) 
with other workloads.
* Queries are not short, and instead of running sequentially, they end up 
running in parallel.
* Queries need more resources (CPU or memory) than the user has provided.

h4. CPU Overload

While our recent work focused on memory (with spilling and queue-based memory 
allocation), you correctly point out that we need to turn our attention to CPU.

Let's look a bit deeper at the cause of CPU usage. Drill's parallelization 
follows these rules:

* Create as many "major fragments" (work units) as possible for each query. 
(Shown as different colors in the query plan visualization UI.)
* Create as many "minor fragments" (slices) as possible for each major 
fragment. (Shown as numbered items in the query plan UI tables.)
* By default, create a number of minor fragments equal to 70% of the number of 
CPUs. (If your machine has 20 cores, Drill will create 14 slices per major 
fragment.)
* Every minor fragment is implemented as a Java thread.

A bit of math shows why this is a problem. Assume a typical query with, say, 5 
major fragments, and 24 CPUs. We will create a number of threads equal to:

24 CPUs * 70% * 5 major fragments ≈ 85 threads

If all these threads were busy at the same time, we'd oversubscribe our CPUs by 
a factor of about 3.5, causing intense context switching (which pretty much 
defeats any attempt to optimize for internal CPU caches). Larger queries can 
easily oversubscribe CPUs by a factor of 10 or more.
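
To make the arithmetic concrete, here is a minimal sketch; the 70% factor 
corresponds to the default behind Drill's planner.width.max_per_node option, 
and the class and variable names are illustrative, not Drill APIs:

{code:java}
// Back-of-the-envelope per-node thread count for one query.
public class ThreadMath {
  public static void main(String[] args) {
    int cores = 24;
    double widthFactor = 0.70;   // default fraction of cores per major fragment
    int majorFragments = 5;      // a typical mid-size query

    // Minor fragments (slices) per major fragment on this node.
    int slicesPerFragment = (int) Math.ceil(cores * widthFactor);   // 17
    // Each slice is a Java thread; every major fragment gets its own set.
    int threads = slicesPerFragment * majorFragments;               // 85

    System.out.printf("%d threads on %d cores: %.1fx oversubscribed%n",
        threads, cores, (double) threads / cores);
  }
}
{code}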

Note that each of these threads wants to make use of unlimited memory. It does 
not take too many such queries before Drill thrashes the CPU and runs out of 
memory.

The lesson is that when the workload exceeds the available resources, the 
"assume infinite resources" model no longer works. Some form of throttling is 
needed. Let's discuss that.

Suppose that Drill can make a query faster by using all CPUs. Then, there is, 
by design, no room in the cluster for another query. If we are already using 
300% of CPU, then adding another query simply causes more thrashing, puts more 
pressure on memory, and causes both queries to slow down. In an ideal case, 
both queries will take twice as long as if they ran separately. In extreme 
cases, the slow-down is worse than linear once the OS starts wasting time 
thrashing threads (as shown by the percent system time in, say, the "top" 
command).

In this (simplified) model, we are better off running one query to completion, 
then starting the second. Both make full use of the cluster. Total run time is 
the same, but average completion time improves: query 1 finishes in half the 
time, and query 2 finishes no later than before. Plus, peak memory pressure is 
halved.

In general, some queries need all resources, but many do not. In our own 
testing, we see that, with TPC-H queries, there is some advantage to running 
parallel queries up to maybe three or four concurrent queries. After that, we 
just see a linear slow-down. (We've pushed the system to 20, 30 or more queries 
-- it is like your local freeway at rush hour; everything becomes very slow.)

h4. Throttling

The first challenge is to accept that every cluster has limits. Once Drill 
saturates CPUs, there is nothing more to give; things will just get slower. No 
one likes this truth, but the physics are pretty clear.

This leads to the idea of throttling queries as a whole. To maximize query 
performance, allow a limited number of queries into the system. Hold additional 
queries until resources are available. The goal would be to keep CPU 
utilization at some target (say 90%). Too few concurrent queries and CPU is 
wasted. Too many and the CPUs are overused, leading to excessive context 
switching and memory pressure.

In the end, a throttling model maximizes query throughput for a given cluster 
size. (Of course, we should work to improve Drill efficiency so that each query 
requires less CPU. Even so, in any given release, there is some CPU cost per 
query that we must manage.)
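
As a concrete illustration, the simplest possible throttle is a counting 
semaphore in the Foreman's admission path. This is only a sketch under stated 
assumptions (a fixed concurrency limit, no per-query cost estimate); the class 
and method names are hypothetical, not Drill's actual APIs:

{code:java}
import java.util.concurrent.Semaphore;

// Query-level admission control: at most maxConcurrent queries run at once;
// later arrivals wait in line until a slot frees up.
public class QueryGate {
  private final Semaphore slots;

  public QueryGate(int maxConcurrent) {
    this.slots = new Semaphore(maxConcurrent, true);  // fair: FIFO admission
  }

  public void runQuery(Runnable query) throws InterruptedException {
    slots.acquire();       // hold the query until a slot is available
    try {
      query.run();         // run with full parallelization
    } finally {
      slots.release();     // free the slot for the next waiting query
    }
  }
}
{code}

A real implementation would derive maxConcurrent from a CPU-utilization target 
rather than hard-coding it, and would add per-user or per-pool queues for the 
prioritization discussed next.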

Once one starts rationing resources (which is what throttling is), users want 
some say. The Boss is more important than a worker-bee and so should move to 
the head of the line. ("Per-user prioritization.") Dashboard queries should run 
before batch reports. ("Application prioritization.") Marketing paid for only 
1/3 of the cluster and so should only be able to use that much. ("Resource 
pools.") And so on. Impala, as noted, does a pretty good job here.

h4. Minor Fragment Scheduling

The proposal in this ticket is to schedule at the fragment level. This seems 
like a good idea, but the devil is in the details.

* Deadlock can occur if fragment A depends on B and C, but fragment C is 
blocked waiting for A to complete. Good DAG dependency analysis will help here 
(see the sketch after this list).
* Drill gets its speed from in-memory processing. Spilling data to disk between 
stages simply reinvents Hive. So, while Drill should do spilling when 
absolutely necessary, we wish to minimize extra disk writes.
* Drill is a big data engine: it is designed to deal with large data volumes. 
Typically, insufficient memory is available to buffer the entire data set. 
(Hence the need for spilling in operations such as sort.) Instead, Drill 
attempts to "stream" data from data source, through the DAG, and out to the 
client.
* Increased memory pressure as fragment A holds onto buffers while paused. 
Ideally, fragments with large memory would have a higher priority to finish so 
that they can release memory. But, they may feed into another memory-hungry 
operator.
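
For the deadlock point above, a minimal sketch of the kind of DAG analysis 
that helps: compute a producers-first order over fragments, and never pause a 
producer that an admitted consumer still needs. The graph encoding and names 
are hypothetical, not Drill's planner structures:

{code:java}
import java.util.*;

public class FragmentDag {
  /** consumesFrom.get(f) lists the fragments whose output f reads. */
  public static List<String> schedulingOrder(Map<String, List<String>> consumesFrom) {
    Map<String, Integer> blockers = new HashMap<>();   // unfinished producers per fragment
    Map<String, List<String>> consumers = new HashMap<>();
    for (Map.Entry<String, List<String>> e : consumesFrom.entrySet()) {
      blockers.putIfAbsent(e.getKey(), 0);
      for (String producer : e.getValue()) {
        blockers.merge(e.getKey(), 1, Integer::sum);
        blockers.putIfAbsent(producer, 0);
        consumers.computeIfAbsent(producer, k -> new ArrayList<>()).add(e.getKey());
      }
    }
    // Kahn's algorithm: repeatedly schedule fragments with no unfinished producers.
    Deque<String> runnable = new ArrayDeque<>();
    blockers.forEach((f, n) -> { if (n == 0) runnable.add(f); });
    List<String> order = new ArrayList<>();
    while (!runnable.isEmpty()) {
      String f = runnable.poll();
      order.add(f);
      for (String c : consumers.getOrDefault(f, Collections.emptyList())) {
        if (blockers.merge(c, -1, Integer::sum) == 0) runnable.add(c);
      }
    }
    if (order.size() != blockers.size()) {
      throw new IllegalStateException("Cycle in fragment graph; cannot stage safely");
    }
    return order;  // leaves (scans) first, root (screen) last
  }
}
{code}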

Fragment-level throttling would be helpful if a single query were so large 
that, by itself, it would create far too many threads. In this case, it might 
be a good idea to break the query into "stages" which run sequentially, with 
intermediate results materialized to disk. (That is, be Hive-like for large 
queries.)

h4. Open vs. Closed Loop Scheduling

Let me introduce one other concept. Impala has tried many ways to control load. 
One attempt was to use a "closed-loop" controller: the Impala state store 
attempted to monitor actual cluster load. Impala daemons (equivalent to a 
Foreman) released queries based on this perceived load. The problems were 
exactly what control theory predicts: extreme oscillation. Queries use 
resources over time. Monitor load early and it looks like the cluster is idle, 
so admit more queries. Eventually, all these queries need peak resources at 
the same time, and new queries are blocked. But, by now, excessive resources 
are in use and out-of-memory errors occur. Queries die, load drops, and the 
whole cycle starts again.

Impala moved to the classic YARN-like "open loop" design: queries are assigned 
resources and must (presumably) live within that "resource budget." We wish to 
learn from Impala and start with the classic open-loop, resource allocation 
scheduling model.
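
A minimal sketch of the open-loop idea, assuming each query receives a fixed 
memory budget at admission (the class is illustrative; Drill's actual 
queue-based memory allocation differs in detail):

{code:java}
// Open-loop admission: reserve a budget up front, never monitor live load.
public class BudgetAdmission {
  private final long clusterBudgetBytes;
  private long reservedBytes = 0;

  public BudgetAdmission(long clusterBudgetBytes) {
    this.clusterBudgetBytes = clusterBudgetBytes;
  }

  /** Admit only if the unreserved budget covers the query; never over-commit. */
  public synchronized boolean tryAdmit(long queryBudgetBytes) {
    if (reservedBytes + queryBudgetBytes > clusterBudgetBytes) {
      return false;  // caller queues the query instead
    }
    reservedBytes += queryBudgetBytes;
    return true;
  }

  /** Called when the query completes, successfully or not. */
  public synchronized void release(long queryBudgetBytes) {
    reservedBytes -= queryBudgetBytes;
  }
}
{code}

Because admission depends only on reservations, not on measured load, the 
oscillation described above cannot occur; the cost is that budgets must be 
estimated well.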

h4. Summary

All of this is a way of suggesting that the best next step in throttling is to 
implement a better query-level scheduler. Some goals:

* Better estimate query resource usage to do better resource planning. (Better 
than the crude large/small model that exists today.)
* Revise each operator so that it lives within a defined memory budget. (That 
is, add spilling or other techniques where needed.)
* Find a solution to schedule threads; perhaps limiting slices per major 
fragment (a sketch follows this list).
* Find a solution to the major-fragment count problem.
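
For the slice-limiting goal, the clamp itself is trivial; a hedged sketch, 
where maxPerNode plays the role of Drill's existing planner.width.max_per_node 
option and the class and function names are illustrative:

{code:java}
// Clamp the planner's cost-based width to a per-node cap times node count.
public class WidthClamp {
  static int sliceWidth(int costBasedWidth, int maxPerNode, int nodes) {
    return Math.max(1, Math.min(costBasedWidth, maxPerNode * nodes));
  }
}
{code}

The hard part is choosing maxPerNode dynamically so that the sum of slices 
across all admitted queries stays near the CPU count.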

Once query-level throttling works, then we can look at opportunities to 
optimize at the slice (minor-fragment) level.

> Resource utilization
> --------------------
>
>                 Key: DRILL-5975
>                 URL: https://issues.apache.org/jira/browse/DRILL-5975
>             Project: Apache Drill
>          Issue Type: New Feature
>    Affects Versions: 2.0.0
>            Reporter: weijie.tong
>            Assignee: weijie.tong
>
> h1. Motivation
> Now the resource utilization ratio of Drill's cluster is not good. Most 
> of the cluster's resources are wasted, yet we cannot afford many concurrent 
> queries. Once the system accepts more queries, even at a moderate CPU load, 
> queries which originally were very quick become slower and slower.
> The reason is that Drill does not supply a scheduler. It just assumes all the 
> nodes have enough computation resources. Once a query comes in, it schedules 
> the related fragments to random nodes without caring about each node's load. 
> Some nodes will suffer more CPU context switches to satisfy the coming query. 
> The deeper cause of this is that the runtime minor fragments construct a 
> runtime tree whose nodes are spread across different Drillbits. The runtime 
> tree is a memory pipeline, that is, all the nodes stay alive for the whole 
> lifecycle of a query, sending data to upper nodes successively, even though 
> some nodes could run quickly and quit immediately. What's more, the runtime 
> tree is constructed before actual execution. The scheduling target for Drill 
> becomes the whole set of runtime tree nodes.
> h1. Design
> It will be hard to schedule the runtime tree nodes as a whole. So I try to 
> solve this by breaking up the cascade of runtime nodes. The graph below 
> describes the initial design. 
> !https://raw.githubusercontent.com/wiki/weijietong/drill/images/design.png!   
>  [graph 
> link|https://raw.githubusercontent.com/wiki/weijietong/drill/images/design.png]
> Every Drillbit instance will have a RecordBatchManager which will accept all 
> the RecordBatches written by the senders of the local MinorFragments. The 
> RecordBatchManager will hold the RecordBatches first in memory, then in disk 
> storage. Once the first RecordBatch of a MinorFragment sender of one query 
> arrives, it will notify the FragmentScheduler. The FragmentScheduler is 
> instantiated by the Foreman. It holds the whole PlanFragment execution graph. 
> It will allocate a new corresponding FragmentExecutor to consume the 
> generated RecordBatches. The allocated FragmentExecutor will then notify the 
> corresponding FragmentManager to indicate that it is ready to receive the 
> data. Then the FragmentManager will send out the RecordBatches one by one to 
> the corresponding FragmentExecutor's receiver, throttling the data stream 
> just as the current Sender does.
> What we can gain from this design is:
> a. The computation leaf nodes do not have to wait on the consumer's speed to 
> end their life and release their resources.
> b. The data-sending logic will be isolated from the computation nodes and 
> shared by different FragmentManagers.
> c. We can schedule the MajorFragments according to each Drillbit's actual 
> resource capacity at runtime.
> d. Drill's pipelined data processing characteristic is also retained.
> h1. Plan
> This will be a large PR, so I plan to divide it into some small ones:
> a. implement the RecordBatchManager.
> b. implement a simple random FragmentScheduler and the whole event flow.
> c. implement a more complete FragmentScheduler, which may reference the 
> Sparrow project.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
