Hey all,

Paul and I have been discussing the draining of drillbits (quiescent mode):
https://issues.apache.org/jira/browse/DRILL-4286

I'd really like to draw folks' attention to this issue. As an administrator, having this ability is key to maintaining high availability while still patching properly and handling failures. Currently, whether Drill runs as a standalone cluster, or under YARN (the R&D work Paul has been doing) or Mesos (the work I have been doing), it is difficult to handle situations where we need to reboot nodes, move drillbits, or shut down drillbits without affecting running queries.

This can be very difficult on a large cluster. Consider a 100-node Drill cluster in active use 24x7. Queries could be running at any time, and work could be running on any of those 100 nodes. Taking even one drillbit out of operation could result in failed queries. (Per Paul's research, the current SIGTERM method will drain a node of running queries, but will not prevent more queries from being scheduled on it.) This is a nightmare for an administrator: we need to patch, but we don't want to fail user queries.

Resolving this issue is important to enterprise adoption and to the acceptance of Drill as a first-class citizen of a well-run production data center. I look forward to discussion on the topic; check out the JIRA for some of the initial thoughts from Paul and me.

John