Hi all,
Drill's current scheduling policy seems a little too simple. The SimpleParallelizer assigns endpoints in round-robin fashion, which ignores system load and other factors. In a critical scenario, some drillbits suffer frequent full GCs, which block their control RPC. The current assignment does not exclude these drillbits from incoming queries, so the problem gets worse.

I propose adding a ZooKeeper (zk) path to hold bad drillbits. The Foreman would recognize a bad drillbit by a timeout (no control response from the intermediate fragments within the timeout period) and then update the bad-drillbits path. Subsequent queries would exclude these drillbits from the assignment list.

What do you think about it? Any suggestions? If it sounds OK, I will file a JIRA and contribute to it.
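
To make the idea concrete, here is a rough, language-neutral sketch in Python of round-robin assignment that skips flagged drillbits. It is not Drill code: the function name `assign_round_robin`, the endpoint names, and the fallback behavior when every drillbit is flagged are all assumptions for illustration. In the real proposal the bad set would be read from the ZooKeeper path rather than passed in directly.

```python
from itertools import cycle


def assign_round_robin(endpoints, bad_endpoints, num_fragments):
    """Assign fragments to endpoints in round-robin order, skipping bad ones.

    All names here are hypothetical; this only illustrates the proposal.
    """
    # Keep the original endpoint order, dropping anything flagged as bad.
    healthy = [e for e in endpoints if e not in bad_endpoints]
    if not healthy:
        # Assumed fallback: if every drillbit is flagged, use them all
        # rather than failing the query outright.
        healthy = list(endpoints)
    if not healthy:
        raise ValueError("no endpoints available")
    rotation = cycle(healthy)
    return [next(rotation) for _ in range(num_fragments)]


if __name__ == "__main__":
    endpoints = ["drillbit-1", "drillbit-2", "drillbit-3"]
    bad = {"drillbit-2"}  # stand-in for the contents of the proposed zk path
    print(assign_round_robin(endpoints, bad, 4))
```

One open design question this sketch glosses over is how a drillbit leaves the bad list again, for example after a fixed expiry time or after it answers a health check.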