If the control RPC to a drillbit is down, i.e. the drillbit is not responding, ZooKeeper should detect that and notify the other drillbits to remove the dead drillbit from their active lists. Once that happens, the next query that comes in should not even see that drillbit. We need a way to differentiate drillbits based on their resource (i.e. CPU and memory) availability and usage, and to consider that information during planning. We do not have that capability today; we treat all drillbits the same.
Thanks,
Padma

On Aug 20, 2017, at 5:39 AM, weijie tong <tongweijie...@gmail.com> wrote:

Hi all,

Drill's current scheduling policy seems a little simple. The SimpleParallelizer assigns endpoints in a round-robin fashion, which ignores the system's load and other factors. In a critical scenario, some drillbits suffer frequent full GCs, which block their control RPC. The current assignment does not exclude these drillbits when assigning subsequent queries, so the problem gets worse.

I propose adding a ZooKeeper path to hold bad drillbits. The Foreman will recognize bad drillbits by timeout (a timeout waiting for the control response from intermediate fragments) and then update the bad-drillbits path. Subsequent queries will exclude these drillbits from the assignment list.

What do you think, or do you have any suggestions? If this sounds OK, I will file a JIRA and make some contributions.
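
For illustration only, the proposal above could be sketched roughly as follows. This is not Drill's SimpleParallelizer code: the function names, the endpoint strings, and the in-memory set standing in for the proposed ZooKeeper path are all hypothetical, and the fallback to the full endpoint list when every drillbit is marked bad is an assumption the proposal does not specify.

```python
from itertools import cycle


def record_timeout(bad_endpoints, endpoint):
    """Mark an endpoint as bad after a control-response timeout.

    Hypothetical stand-in for the Foreman updating the proposed
    bad-drillbits ZooKeeper path; here it is just an in-memory set.
    """
    bad_endpoints.add(endpoint)


def assign_round_robin(endpoints, bad_endpoints, num_fragments):
    """Assign fragments round robin, skipping endpoints marked bad."""
    healthy = [e for e in endpoints if e not in bad_endpoints]
    if not healthy:
        # Assumption: if every endpoint is marked bad, fall back to the
        # full list rather than failing the query outright.
        healthy = list(endpoints)
    if not healthy:
        raise ValueError("no endpoints available")
    ring = cycle(healthy)
    return [next(ring) for _ in range(num_fragments)]


if __name__ == "__main__":
    bad = set()
    record_timeout(bad, "bit2")  # bit2 timed out on a control response
    print(assign_round_robin(["bit1", "bit2", "bit3"], bad, 4))
```

A real implementation would also need some way to remove a drillbit from the bad list once it recovers; that is left out of this sketch.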