[ https://issues.apache.org/jira/browse/IMPALA-9137?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16970433#comment-16970433 ]

Sahil Takiar commented on IMPALA-9137:
--------------------------------------

This improves resilience to network partitions between Impala executors as well.

> Blacklist node if a DataStreamService RPC to the node fails
> -----------------------------------------------------------
>
>                 Key: IMPALA-9137
>                 URL: https://issues.apache.org/jira/browse/IMPALA-9137
>             Project: IMPALA
>          Issue Type: Sub-task
>          Components: Backend
>            Reporter: Sahil Takiar
>            Assignee: Sahil Takiar
>            Priority: Major
>
> If a query fails because an RPC to a specific node failed, the query error 
> message will be similar to one of the following:
> * {{ERROR: TransmitData() to 10.65.30.141:27000 failed: Network error: recv 
> got EOF from 10.65.30.141:27000 (error 108)}}
> * {{ERROR: TransmitData() to 10.65.29.251:27000 failed: Network error: recv 
> error from 0.0.0.0:0: Transport endpoint is not connected (error 107)}}
> * {{ERROR: TransmitData() to 10.65.26.254:27000 failed: Network error: Client 
> connection negotiation failed: client connection to 10.65.26.254:27000: 
> connect: Connection refused (error 111)}}
> * {{ERROR: EndDataStream() to 127.0.0.1:27002 failed: Network error: recv 
> error from 0.0.0.0:0: Transport endpoint is not connected (error 107)}}
>
> RPCs are already retried, so it is likely that something is wrong with the 
> target node. Perhaps it crashed or is so overloaded that it can't process RPC 
> requests. In any case, the Impala Coordinator should blacklist the target of 
> the failed RPC so that future queries don't fail with the same error (see the 
> sketch at the end of this description).
>
> If the node crashed, the statestore will eventually remove the failed node 
> from the cluster as well. However, the statestore can take a while to detect 
> a failed node because it has a long timeout, and queries can still fail 
> within that timeout window.
>
> Blacklisting is also necessary for transparent query retries: if a node does 
> crash, the statestore will take too long to remove the crashed node from the 
> cluster, so any attempt at retrying the query would just fail again.
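>
> As a rough illustration of the blacklisting described above, here is a minimal, 
> hypothetical sketch (the {{Blacklist}} class and {{HandleRpcResult}} helper are 
> illustrative names only, not Impala's actual implementation) of feeding a failed 
> DataStreamService RPC into a coordinator-side blacklist:
> {code:cpp}
> // Hypothetical sketch only; none of these classes exist in Impala under
> // these names.
> #include <chrono>
> #include <iostream>
> #include <mutex>
> #include <string>
> #include <unordered_map>
>
> // Tracks executors the coordinator should avoid when scheduling new queries.
> class Blacklist {
>  public:
>   // Record that 'address' failed; keep it blacklisted for 'timeout'.
>   void Add(const std::string& address, std::chrono::seconds timeout) {
>     std::lock_guard<std::mutex> l(lock_);
>     expiry_[address] = std::chrono::steady_clock::now() + timeout;
>   }
>
>   // Returns true if 'address' is currently blacklisted.
>   bool Contains(const std::string& address) {
>     std::lock_guard<std::mutex> l(lock_);
>     auto it = expiry_.find(address);
>     if (it == expiry_.end()) return false;
>     if (std::chrono::steady_clock::now() >= it->second) {
>       expiry_.erase(it);  // Entry expired; the node may be scheduled again.
>       return false;
>     }
>     return true;
>   }
>
>  private:
>   std::mutex lock_;
>   std::unordered_map<std::string, std::chrono::steady_clock::time_point> expiry_;
> };
>
> // Simplified stand-in for the outcome of a DataStreamService RPC such as
> // TransmitData() or EndDataStream(), after its internal retries.
> struct RpcStatus {
>   bool ok;
>   std::string error;  // e.g. "Network error: Connection refused (error 111)"
> };
>
> // Coordinator-side handling: if the RPC still failed after retries, the
> // target node is likely down or overloaded, so blacklist it instead of
> // waiting for the statestore's much longer failure-detection timeout.
> void HandleRpcResult(const RpcStatus& status, const std::string& target,
>                      Blacklist* blacklist) {
>   if (status.ok) return;
>   std::cout << "RPC to " << target << " failed: " << status.error
>             << "; blacklisting node\n";
>   blacklist->Add(target, std::chrono::seconds(60));
> }
>
> int main() {
>   Blacklist blacklist;
>   HandleRpcResult({false, "Network error: recv got EOF (error 108)"},
>                   "10.65.30.141:27000", &blacklist);
>   // The scheduler would skip blacklisted nodes when planning the next query.
>   std::cout << "10.65.30.141:27000 blacklisted: " << std::boolalpha
>             << blacklist.Contains("10.65.30.141:27000") << std::endl;
>   return 0;
> }
> {code}
> In the real fix the blacklist duration would presumably be chosen so that it 
> only needs to bridge the gap until the statestore notices the failed node.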


