Sorabh Hamirwasia created DRILL-5721:
----------------------------------------
Summary: Query with only root fragment and no non-root fragment
hangs when Drillbit to Drillbit Control Connection has network issues
Key: DRILL-5721
URL: https://issues.apache.org/jira/browse/DRILL-5721
Project: Apache Drill
Issue Type: Bug
Reporter: Sorabh Hamirwasia
Recently I found an issue (thanks to [~knguyen] for creating this scenario)
related to fragment status reporting and would like some feedback on it.
When a client submits a query to the Foreman, the Foreman plans it and then
schedules fragments on root and non-root nodes. The Foreman creates a
DrillbitStatusListener and a FragmentStatusListener to track the health of
each Drillbit node and each fragment, respectively. The Foreman sets up root
and non-root fragments differently:
* Root fragments are set up without any communication over the control channel
(since they execute locally on the Foreman).
* Non-root fragments are set up by sending a control message
(REQ_INITIALIZE_FRAGMENTS_VALUE) over the wire. If sending any such control
message fails during query setup (for example, because of a network hiccup),
the query is failed and the client is notified.
Each fragment is executed on its node with the help of a Fragment Executor,
which has an instance of FragmentStatusReporter. FragmentStatusReporter
reports the status of a fragment to the Foreman node over a control tunnel or
connection using an RPC message (REQ_FRAGMENT_STATUS), for both root and
non-root fragments.
In summary, a root fragment is set up locally without any RPC communication,
but its status is reported by the fragment executor over the control connection
using an RPC message. For a non-root fragment, both setup and status updates
use RPC messages over the control connection.
*Issue 1:*
For a simple query that has only one root fragment running on the Foreman
node, setup works fine. But when the fragment tries to create a control
connection for its status update and fails to establish it, the query hangs.
The root fragment completes execution but fails to notify the Foreman, so the
Foreman thinks the query is still running forever.
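The failure mode can be modeled with a small, self-contained sketch. None of
the types below are Drill classes; HangSketch, ForemanView, and
sendStatusOverControlConnection are hypothetical names used only to illustrate
why a lost status update leaves the Foreman's view stuck at RUNNING.

```java
import java.io.IOException;

// Hypothetical model of Issue 1; these are NOT Drill's actual classes.
final class HangSketch {
  enum QueryState { RUNNING, COMPLETED }

  /** Foreman-side view of the query; it only changes when a status update arrives. */
  static final class ForemanView {
    QueryState state = QueryState.RUNNING;
    void onFragmentFinished() { state = QueryState.COMPLETED; }
  }

  /** Stand-in for sending REQ_FRAGMENT_STATUS over the control connection. */
  static void sendStatusOverControlConnection(ForemanView foreman, boolean connectionUp)
      throws IOException {
    if (!connectionUp) {
      throw new IOException("cannot establish control connection");
    }
    foreman.onFragmentFinished();
  }

  /** Runs a lone root fragment to completion and returns what the Foreman believes. */
  static QueryState runRootFragment(boolean connectionUp) {
    ForemanView foreman = new ForemanView();
    // Setup is local (no RPC), so it succeeds regardless of the network.
    // ... the fragment executes and finishes here ...
    try {
      sendStatusOverControlConnection(foreman, connectionUp);
    } catch (IOException e) {
      // The update is lost; nothing else tells the Foreman the fragment finished.
    }
    return foreman.state;
  }
}
```

With connectionUp set to false, runRootFragment returns RUNNING even though the
fragment has finished, which is the hang described above.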
*Proposed Solution:*
Since a root fragment is set up locally without an RPC message, we can do
the same for its status updates. This avoids RPC communication for status
updates of fragments running locally on the Foreman and hence resolves
Issue 1.
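A minimal sketch of the proposed routing, assuming the reporter can tell
whether its fragment runs on the Foreman node. StatusSink, ForemanListener,
BrokenControlTunnel, and StatusReporterSketch are hypothetical names for
illustration, not Drill's actual API; the real change would live in or near
FragmentStatusReporter.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-ins for illustration; NOT Drill's actual classes.
enum FragmentState { RUNNING, FINISHED, FAILED }

/** Anything that can accept a fragment status update. */
interface StatusSink {
  void statusUpdate(String fragmentId, FragmentState state) throws Exception;
}

/** Simulates the Foreman-side listener, invoked directly in-process. */
final class ForemanListener implements StatusSink {
  final List<String> received = new ArrayList<>();

  @Override
  public void statusUpdate(String fragmentId, FragmentState state) {
    received.add(fragmentId + ":" + state);
  }
}

/** Simulates a control tunnel whose connection cannot be established. */
final class BrokenControlTunnel implements StatusSink {
  @Override
  public void statusUpdate(String fragmentId, FragmentState state) throws Exception {
    throw new IOException("control connection unavailable");
  }
}

/** Picks the local path for fragments on the Foreman, the RPC path otherwise. */
final class StatusReporterSketch {
  private final boolean runsOnForeman;
  private final StatusSink local;
  private final StatusSink remote;

  StatusReporterSketch(boolean runsOnForeman, StatusSink local, StatusSink remote) {
    this.runsOnForeman = runsOnForeman;
    this.local = local;
    this.remote = remote;
  }

  /** Returns true if the update was delivered, false if delivery failed. */
  boolean report(String fragmentId, FragmentState state) {
    StatusSink sink = runsOnForeman ? local : remote;
    try {
      sink.statusUpdate(fragmentId, state);
      return true;
    } catch (Exception e) {
      return false;
    }
  }
}
```

In this sketch a root fragment's update reaches the ForemanListener even when
the control tunnel is broken, while a non-root fragment still depends on the
tunnel, matching how setup already behaves for the two kinds of fragment.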
--
This message was sent by Atlassian JIRA
(v6.4.14#64029)