[ https://issues.apache.org/jira/browse/IMPALA-8339?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Michael Ho reassigned IMPALA-8339: ---------------------------------- Assignee: Thomas Tauber-Marshall > Coordinator should be more resilient to fragment instances startup failure > -------------------------------------------------------------------------- > > Key: IMPALA-8339 > URL: https://issues.apache.org/jira/browse/IMPALA-8339 > Project: IMPALA > Issue Type: Improvement > Components: Distributed Exec > Reporter: Michael Ho > Assignee: Thomas Tauber-Marshall > Priority: Major > Labels: Availability, resilience > > Impala currently relies on statestore for cluster membership. When an Impala > executor goes offline, it may take a while for statestore to declare that > node as unavailable and for that information to be propagated to all > coordinator nodes. Within this window, some coordinator nodes may still > attempt to issue RPCs to the faulty node, resulting in RPC failures which > resulted in query failures. In other words, many queries may fail to start > within this window until all coordinator nodes get the latest information on > cluster membership. > Going forward, coordinator may need to fall back to using backup executors > for each fragments in case some of the executors are not available. Moreover, > *coordinator should treat the cluster membership information from statestore > (or any external source of truth e.g. etcd) as hints instead of ground truth* > and adjust the scheduling of fragment instances based on the availability of > the executors from the coordinator's perspective. -- This message was sent by Atlassian JIRA (v7.6.3#76005) --------------------------------------------------------------------- To unsubscribe, e-mail: issues-all-unsubscr...@impala.apache.org For additional commands, e-mail: issues-all-h...@impala.apache.org