[ https://issues.apache.org/jira/browse/IMPALA-9137?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16970433#comment-16970433 ]
Sahil Takiar commented on IMPALA-9137:
--------------------------------------

This improves resilience to network partitions between Impala executors as well.

> Blacklist node if a DataStreamService RPC to the node fails
> -----------------------------------------------------------
>
>                 Key: IMPALA-9137
>                 URL: https://issues.apache.org/jira/browse/IMPALA-9137
>             Project: IMPALA
>          Issue Type: Sub-task
>          Components: Backend
>            Reporter: Sahil Takiar
>            Assignee: Sahil Takiar
>            Priority: Major
>
> If a query fails because an RPC to a specific node failed, the query error
> message will be similar to one of the following:
> * {{ERROR: TransmitData() to 10.65.30.141:27000 failed: Network error: recv
> got EOF from 10.65.30.141:27000 (error 108)}}
> * {{ERROR: TransmitData() to 10.65.29.251:27000 failed: Network error: recv
> error from 0.0.0.0:0: Transport endpoint is not connected (error 107)}}
> * {{ERROR: TransmitData() to 10.65.26.254:27000 failed: Network error: Client
> connection negotiation failed: client connection to 10.65.26.254:27000:
> connect: Connection refused (error 111)}}
> * {{ERROR: EndDataStream() to 127.0.0.1:27002 failed: Network error: recv
> error from 0.0.0.0:0: Transport endpoint is not connected (error 107)}}
>
> RPCs are already retried, so it is likely that something is wrong with the
> target node. Perhaps it crashed or is so overloaded that it can't process RPC
> requests. In any case, the Impala Coordinator should blacklist the target of
> the failed RPC so that future queries don't fail with the same error.
>
> If the node crashed, the statestore will eventually remove the failed node
> from the cluster as well. However, the statestore can take a while to detect
> a failed node because it has a long timeout. The issue is that queries can
> still fail within the timeout window.
>
> This is necessary for transparent query retries because if a node does crash,
> it will take too long for the statestore to remove the crashed node from the
> cluster.
> So any attempt at retrying a query will just fail.
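
The blacklisting idea described in the issue can be sketched roughly as follows. This is only an illustration in Python, not Impala's actual implementation (the backend is C++, and a real coordinator would presumably act on the RPC status rather than parse the error text). The names `failed_rpc_target` and `NodeBlacklist`, the regular expression, and the time-based expiry (`ttl_seconds`) are all assumptions made for this example; the issue itself does not say how long a node should stay blacklisted.

```python
import re
import time

# Matches messages shaped like the examples quoted in the issue, e.g.
# "ERROR: TransmitData() to 10.65.30.141:27000 failed: Network error: ..."
# (hypothetical pattern, written for this sketch only)
RPC_FAILURE_RE = re.compile(
    r"(?P<rpc>\w+)\(\) to (?P<host>[\w.\-]+):(?P<port>\d+) failed: Network error"
)


def failed_rpc_target(error_message):
    """Return (rpc_name, host, port) for a failed-RPC error, or None if no match."""
    match = RPC_FAILURE_RE.search(error_message)
    if match is None:
        return None
    return match.group("rpc"), match.group("host"), int(match.group("port"))


class NodeBlacklist:
    """Hypothetical coordinator-side blacklist; entries lapse after ttl_seconds."""

    def __init__(self, ttl_seconds=60.0, clock=time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock  # injectable so the expiry logic can be tested
        self._expiry = {}    # (host, port) -> time at which the entry lapses

    def record_failure(self, error_message):
        """Blacklist the RPC target named in error_message; return it, or None."""
        target = failed_rpc_target(error_message)
        if target is None:
            return None
        _, host, port = target
        self._expiry[(host, port)] = self._clock() + self._ttl
        return host, port

    def is_blacklisted(self, host, port):
        expiry = self._expiry.get((host, port))
        if expiry is None:
            return False
        if self._clock() >= expiry:
            # Entry has lapsed: let the node be scheduled again.
            del self._expiry[(host, port)]
            return False
        return True

    def schedulable(self, nodes):
        """Filter a list of (host, port) pairs down to non-blacklisted nodes."""
        return [node for node in nodes if not self.is_blacklisted(*node)]
```

In this sketch, a coordinator would call `record_failure()` once the RPC retries are exhausted and then use `schedulable()` when picking executors for the next query (or for a retry of the failed one), so the bad node is avoided well before the statestore timeout would remove it.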