Hey all, Paul and I have been discussing the draining of drillbits
(quiescent mode):

https://issues.apache.org/jira/browse/DRILL-4286


I'd really like to draw folks' attention to this issue. For an administrator,
this ability is key to maintaining high availability while properly patching
and handling failures.  Currently, for Drill as a standalone cluster, or in
the R&D work Paul has been doing on YARN and I have been doing on Mesos, it
is difficult to handle situations where we need to reboot nodes, move
drillbits, or shut down drillbits without affecting running queries.  This
can be very difficult on a large cluster...

Consider a 100 node, 24x7 actively used Drill cluster.  Queries could be
running at any time, and work could be running on any of those 100 nodes.
Taking even one drillbit out of operation could cause queries to fail. (Per
Paul's research, the current SIGTERM method will drain a node of running
queries, but will not prevent more queries from being scheduled on it.)

This is a nightmare for an administrator... we need to patch, but we don't
want to fail user queries.  Discussing this topic, and resolving it, is
important to enterprise adoption and acceptance of Drill as a first-class
citizen of a well-run production data center.

I look forward to discussion on the topic. Check out the JIRA for some of
the initial thoughts from Paul and me.

John
