[
https://issues.apache.org/jira/browse/DRILL-4286?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16143227#comment-16143227
]
Paul Rogers commented on DRILL-4286:
------------------------------------
There may be just a bit of confusion about the purpose of this feature. Drill
already provides the means to take down a Drillbit quickly: just kill the
process. (Drill's {{drillbit.sh}} script sends a {{SIGTERM}}, waits a while,
then sends a {{SIGKILL}}.) So, if fast exit is the goal, we already have that.
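
For readers unfamiliar with the pattern, here is a minimal Python sketch of "ask politely, wait, then force." It is not the code in {{drillbit.sh}}; the function name {{stop_process}} and the grace period are assumptions made up for illustration, and the child process is just a stand-in for a Drillbit.

```python
import subprocess
import sys


def stop_process(proc: subprocess.Popen, grace_seconds: float) -> str:
    """Ask a process to exit, then force it if it does not comply in time.

    Returns "term" if the process exited within the grace period,
    or "kill" if it had to be killed.
    """
    proc.terminate()  # SIGTERM on POSIX: a polite request to exit
    try:
        proc.wait(timeout=grace_seconds)
        return "term"
    except subprocess.TimeoutExpired:
        proc.kill()  # SIGKILL on POSIX: cannot be caught or ignored
        proc.wait()  # reap the child so it does not linger as a zombie
        return "kill"


if __name__ == "__main__":
    # Hypothetical stand-in for a Drillbit: a child that would sleep for a minute.
    child = subprocess.Popen([sys.executable, "-c", "import time; time.sleep(60)"])
    print(stop_process(child, grace_seconds=5.0))
```

Either way the process is gone within the grace period, which is exactly why this path is fast and also why it takes in-flight work down with it.
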
The problem, of course, is that such a fast exit causes all in-flight queries
to die. Why? Drill is fully symmetric: all queries launch fragments on all
nodes. Drill uses an "in-memory" DAG model, meaning data flows directly from one
fragment (node) to another with no persistence between fragments (stages). As a
result, Drill cannot restart a failed fragment: there is no way to identify
which data has to be discarded and reread. The only choice is to restart the
entire query.
Drill is designed to assume that end users can retry (short) queries when nodes
fail. Not elegant, but not entirely crazy. (I'm sure the end user does not
consider this an acceptable solution, however.)
When running longer queries, taking down a node causes all progress to be lost.
Say a query has run for an hour. Taking a node offline loses that work.
The graceful shutdown feature avoids the above problems. The "victim" drillbit
stays up as long as needed to complete in-flight queries. Now, in the worst
case, the victim might never shut down because new queries keep arriving. To
avoid that, the change causes all Foreman nodes to stop sending fragments to the
quiescent node. So, eventually the "victim" node drains and shuts down. All
with no disruption to the end users running queries on Drill.
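
The drain behavior described above can be modeled in a few lines. This is a toy simulation, not Drill's implementation: {{State}}, {{Node}}, {{assign_fragments}}, and the method names are all hypothetical, and a real Drillbit tracks fragments rather than whole query ids.

```python
from dataclasses import dataclass, field
from enum import Enum


class State(Enum):
    ONLINE = "online"
    QUIESCENT = "quiescent"  # draining: finishes current work, accepts no new work
    OFFLINE = "offline"


@dataclass
class Node:
    name: str
    state: State = State.ONLINE
    running: set = field(default_factory=set)  # ids of queries with fragments here

    def begin_shutdown(self) -> None:
        """Enter quiescent mode; go offline at once if nothing is running."""
        self.state = State.QUIESCENT
        self._maybe_go_offline()

    def finish(self, query_id: str) -> None:
        """Record that all fragments of a query on this node have completed."""
        self.running.discard(query_id)
        self._maybe_go_offline()

    def _maybe_go_offline(self) -> None:
        if self.state is State.QUIESCENT and not self.running:
            self.state = State.OFFLINE


def assign_fragments(query_id: str, nodes: list) -> list:
    """Foreman side: place fragments only on nodes that are fully online."""
    targets = [n for n in nodes if n.state is State.ONLINE]
    for node in targets:
        node.running.add(query_id)
    return [n.name for n in targets]


if __name__ == "__main__":
    a, b = Node("a"), Node("b")
    assign_fragments("q1", [a, b])  # q1 runs on both nodes
    b.begin_shutdown()              # b starts draining; q1 keeps running
    assign_fragments("q2", [a, b])  # q2 is placed on a only
    b.finish("q1")                  # last in-flight query on b completes
    print(b.state)
```

The key point of the sketch is that the quiescent node is skipped for new work but never interrupts work it already holds, so its set of running queries can only shrink.
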
Now, if you get tired of waiting for a long-running query to complete, then you
can still kill the "victim" drillbit, which will kill the remaining, undrained
queries.
In short, the graceful shutdown is a pretty good compromise to assist both
users and admins given the way Drill works today. We can certainly imagine ways
to improve Drill (such as finding a way to restart individual fragments, or
automatic retry of failed queries), but that requires much more work and is
saved for a later effort.
All that said, within the confines of this change, all improvement suggestions
are welcome. In particular, we don't run a production Drill shop, so we'd love
to hear from those users that do: how might this feature be improved to work
better in a production environment?
> Have an ability to put server in quiescent mode of operation
> ------------------------------------------------------------
>
> Key: DRILL-4286
> URL: https://issues.apache.org/jira/browse/DRILL-4286
> Project: Apache Drill
> Issue Type: New Feature
> Components: Execution - Flow
> Reporter: Victoria Markman
> Assignee: Venkata Jyothsna Donapati
>
> I think Drill will benefit from a mode of operation that is called "quiescent"
> in some databases.
> From IBM Informix server documentation:
> {code}
> Change gracefully from online to quiescent mode
> Take the database server gracefully from online mode to quiescent mode to
> restrict access to the database server without interrupting current
> processing. After you perform this task, the database server sets a flag that
> prevents new sessions from gaining access to the database server. The current
> sessions are allowed to finish processing. After you initiate the mode
> change, it cannot be canceled. During the mode change from online to
> quiescent, the database server is considered to be in Shutdown mode.
> {code}
> This is different from shutdown, when processes are terminated.