[ 
https://issues.apache.org/jira/browse/DRILL-5975?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16258824#comment-16258824
 ] 

weijie.tong edited comment on DRILL-5975 at 11/20/17 5:58 AM:
--------------------------------------------------------------

Thanks for your detail explanation of current model and the reason caused cpu 
overloaded. That matches what I thought. Drill's assumption about resource is 
not real at actual scenario as you pointed out. At Alibaba, we have other teams 
using Flink and Spark, both of the two systems can reach a cpu utilization to 
70%. To Drill, the best score we have achieved is 40%. So far as I know, this 
is also the best result of Mapr .  So Drill really have some works to do about 
the schedule. Here I have some different opinions about some points:

* What's the entity to schedule ?  The entity which consumes the memory and cpu 
is the target. To Drill ,it's the MinorFragment. When the scheduler schedules 
the entity ,it can determine which nodes have enough resource to match the 
requirement to assign work to them. This can also answer the question do we 
need the query-level schedule. Query-level schedule is too coarse to take this 
role. I have investigated what Impala does. I am not optimistic to see better 
improvements. MinorFragment level schedule will let the system works as what 
cpu pipeline acts.

* The RecordBatch will be first inserted into memory , if the memory is not 
enough , it will be flushed into the disk. This the common solution to Flink 
and Spark. The allocated next 'stage' workers will consume the RecordBatch 
soon, maybe just from the memory destination. What's more , the data will also 
work as stream model. The RecordBatchManager will take the Sender's role to 
contact with the receiver and send out the RecordBatch consecutively. By 
leveraging the DataTunnel's sendRecordBatch method, the throttling 
characteristic is also retained. To be clear, this only happens at the exchange 
stage. I don't agree we are reinventing Hive, this model is also what Flink and 
Spark follows.

   The current zk-queue based query level throttling is still needed to  
protect the system from overloaded queries.Some of what you summary will help 
us to do a good scheduler to estimate a more accurate resource requirement to 
allocate.




was (Author: weijie):
Thanks for your detail explanation of current model and the reason caused cpu 
overloaded. That matches what I thought. Drill's assumption about resource is 
not real at actual scenario as you pointed out. At Alibaba, we have other teams 
using Flink and Spark, both of the two systems can reach a cpu utilization to 
70%. To Drill, the best score we have achieved is 40%. So far as I know, this 
is also the best result of Mapr .  So Drill really have some works to do about 
the schedule. Here I have some different opinions about some points:

* What's the entity to schedule ?  The entity which consumes the memory and cpu 
is the target. To Drill ,it's the MinorFragment. When the scheduler schedules 
the entity ,it can determine which nodes have enough resource to match the 
requirement to assign work to them. This can 
also answer the question do we need the query-level schedule. Query-level 
schedule is too coarse to take this role. I have investigated what Impala does. 
I am not optimistic to see better improvements. MinorFragment level schedule 
will let the system works as what cpu pipeline acts.

* The RecordBatch will be first inserted into memory , if the memory is not 
enough , it will be flushed into the disk. This the common solution to Flink 
and Spark. The allocated next 'stage' workers will consume the RecordBatch 
soon, maybe just from the memory destination. What's more , the data will also 
work as stream model. The RecordBatchManager will take the Sender's role to 
contact with the receiver and send out the RecordBatch consecutively. By 
leveraging the DataTunnel's sendRecordBatch method, the throttling 
characteristic is also retained. To be clear, this only happens at the exchange 
stage. I don't agree we are reinventing Hive, this model is also what Flink and 
Spark follows.

   The current zk-queue based query level throttling is still needed to  
protect the system from overloaded queries.Some of what you summary will help 
us to do a good scheduler to estimate a more accurate resource requirement to 
allocate.



> Resource utilization
> --------------------
>
>                 Key: DRILL-5975
>                 URL: https://issues.apache.org/jira/browse/DRILL-5975
>             Project: Apache Drill
>          Issue Type: New Feature
>    Affects Versions: 2.0.0
>            Reporter: weijie.tong
>            Assignee: weijie.tong
>
> h1. Motivation
> Now the resource utilization radio of Drill's cluster is not too good. Most 
> of the cluster resource is wasted. We can not afford too much concurrent 
> queries. Once the system accepted more queries with a not high cpu load, the 
> query which originally is very quick will become slower and slower.
> The reason is Drill does not supply a scheduler . It just assume all the 
> nodes have enough calculation resource. Once a query comes, it will schedule 
> the related fragments to random nodes not caring about the node's load. Some 
> nodes will suffer more cpu context switch to satisfy the coming query. The 
> profound causes to this is that the runtime minor fragments construct a 
> runtime tree whose nodes spread different drillbits. The runtime tree is a 
> memory pipeline that is all the nodes will stay alone the whole lifecycle of 
> a query by sending out data to upper nodes successively, even though some 
> node could run quickly and quit immediately.What's more the runtime tree is 
> constructed before actual running. The schedule target to Drill will become 
> the whole runtime tree nodes.
> h1. Design
> It will be hard to schedule the runtime tree nodes as a whole. So I try to 
> solve this by breaking the runtime cascade nodes. The graph below describes 
> the initial design. 
> !https://raw.githubusercontent.com/wiki/weijietong/drill/images/design.png!   
>  [graph 
> link|https://raw.githubusercontent.com/wiki/weijietong/drill/images/design.png]
> Every Drillbit instance will have a RecordBatchManager which will accept all 
> the RecordBatchs written by the senders of local different MinorFragments. 
> The RecordBatchManager will hold the RecordBatchs in memory firstly then disk 
> storage . Once the first RecordBatch of a MinorFragment sender of one query 
> occurs , it will notice the FragmentScheduler. The FragmentScheduler is 
> instanced by the Foreman.It holds the whole PlanFragment execution graph.It 
> will allocate a new corresponding FragmentExecutor to run the generated 
> RecordBatch. The allocated FragmentExecutor will then notify the 
> corresponding FragmentManager to indicate that I am ready to receive the 
> data. Then the FragmentManger will send out the RecordBatch one by one to the 
> corresponding FragmentExecutor's receiver like what the current Sender does 
> by throttling the data stream.
> What we can gain from this design is :
> a. The computation leaf node does not to wait for the consumer's speed to end 
> its life to release the resource.
> b. The sending data logic will be isolated from the computation nodes and 
> shared by different FragmentManagers.
> c. We can schedule the MajorFragments according to Drillbit's actual resource 
> capacity at runtime.
> d. Drill's pipeline data processing characteristic is also retained.
> h1. Plan
> This will be a large PR ,so I plan to divide it into some small ones.
> a. to implement the RecordManager.
> b. to implement a simple random FragmentScheduler and the whole event flow.
> c. to implement a primitive FragmentScheduler which may reference the Sparrow 
> project.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

Reply via email to