[jira] [Created] (IMPALA-9137) Blacklist node if a DataStreamService RPC to the node fails

2019-11-08 Thread Sahil Takiar (Jira)
Sahil Takiar created IMPALA-9137:


 Summary: Blacklist node if a DataStreamService RPC to the node fails
 Key: IMPALA-9137
 URL: https://issues.apache.org/jira/browse/IMPALA-9137
 Project: IMPALA
  Issue Type: Sub-task
  Components: Backend
Reporter: Sahil Takiar
Assignee: Sahil Takiar


If a query fails because an RPC to a specific node failed, the query error 
message will be of the form:

{{ERROR: TransmitData() to 10.65.30.141:27000 failed: Network error: recv got 
EOF from 10.65.30.141:27000 (error 108)}}

or

{{ERROR: TransmitData() to 10.65.29.251:27000 failed: Network error: recv error 
from 0.0.0.0:0: Transport endpoint is not connected (error 107)}}

or

{{ERROR: TransmitData() to 10.65.26.254:27000 failed: Network error: Client 
connection negotiation failed: client connection to 10.65.26.254:27000: 
connect: Connection refused (error 111)}}

or

{{ERROR: EndDataStream() to 127.0.0.1:27002 failed: Network error: recv error 
from 0.0.0.0:0: Transport endpoint is not connected (error 107)}}
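
All four messages share the prefix "<RPC name>() to <host>:<port> failed: Network error:". As a minimal sketch for triaging logs (not how the backend itself identifies the target, which would presumably use the RPC status rather than the message text), the failed RPC and its target can be pulled out with a regular expression. The names FailedRpc and parse_failed_rpc are hypothetical and used only for illustration.

```python
import re
from typing import NamedTuple, Optional


class FailedRpc(NamedTuple):
    """Hypothetical record for one failed DataStreamService RPC."""
    rpc: str
    host: str
    port: int


# Matches the common prefix of the error messages quoted above.
_RPC_FAILURE = re.compile(
    r"ERROR: (?P<rpc>\w+)\(\) to (?P<host>[\w.\-]+):(?P<port>\d+) "
    r"failed: Network error:"
)


def parse_failed_rpc(message: str) -> Optional[FailedRpc]:
    """Return the RPC name and target address, or None if the message
    does not look like a failed-RPC network error."""
    # Collapse line wraps so a message split across lines still matches.
    flat = " ".join(message.split())
    match = _RPC_FAILURE.search(flat)
    if match is None:
        return None
    return FailedRpc(match.group("rpc"), match.group("host"),
                     int(match.group("port")))
```

Note that the target is taken from the "to <host>:<port>" part only; the address after "recv error from" can be 0.0.0.0:0 and does not identify the node.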

RPCs are already retried, so it is likely that something is wrong with the 
target node. Perhaps it crashed or is so overloaded that it can't process RPC 
requests. In any case, the Impala Coordinator should blacklist the target of 
the failed RPC so that future queries don't fail with the same error.
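
The proposed behaviour can be sketched roughly as follows. This is an illustration in Python, not the Impala backend code: the class NodeBlacklist, its method names, and the idea that an entry expires after a fixed time are all assumptions made for the example.

```python
import time
from typing import Callable, Dict, Iterable, List


class NodeBlacklist:
    """Hypothetical coordinator-side blacklist keyed by "host:port"."""

    def __init__(self, ttl_seconds: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        # ttl_seconds is an assumed expiry so that a recovered node is
        # eventually scheduled again; the issue does not specify one.
        self._ttl = ttl_seconds
        self._clock = clock
        self._expiry: Dict[str, float] = {}

    def report_rpc_failure(self, address: str) -> None:
        """Blacklist the target of an RPC that failed after its retries."""
        self._expiry[address] = self._clock() + self._ttl

    def is_blacklisted(self, address: str) -> bool:
        expiry = self._expiry.get(address)
        if expiry is None:
            return False
        if self._clock() >= expiry:
            # The entry has expired: forget it and allow scheduling again.
            del self._expiry[address]
            return False
        return True

    def schedulable(self, addresses: Iterable[str]) -> List[str]:
        """Filter out blacklisted nodes before scheduling a new query."""
        return [a for a in addresses if not self.is_blacklisted(a)]
```

With something like this in place, a coordinator would call report_rpc_failure("10.65.30.141:27000") when TransmitData() fails and pass the cluster membership through schedulable() before planning the next query.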

If the node crashed, the statestore will eventually remove the failed node from 
the cluster as well. However, the statestore can take a while to detect a 
failed node because it has a long timeout. The issue is that queries can still 
fail within the timeout window.

Blacklisting is necessary for transparent query retries: if a node does crash, 
the statestore takes too long to remove the crashed node from the cluster, so 
without blacklisting any attempt at retrying a query will just fail.
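
To make the interaction with retries concrete, here is a small self-contained sketch. Everything in it is hypothetical (RpcFailure, run_with_retry, and the run_query callback are names invented for the example); it only shows that a retry can succeed when the target of the failed RPC is excluded before the query is scheduled again.

```python
from typing import Callable, List, Optional, Sequence, Set


class RpcFailure(Exception):
    """Hypothetical error carrying the address of the node whose RPC failed."""

    def __init__(self, address: str) -> None:
        super().__init__(f"RPC to {address} failed")
        self.address = address


def run_with_retry(run_query: Callable[[List[str]], object],
                   nodes: Sequence[str],
                   max_attempts: int = 2) -> object:
    """Run a query, excluding the target of any failed RPC before retrying."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    excluded: Set[str] = set()
    last_error: Optional[RpcFailure] = None
    for _ in range(max_attempts):
        # Schedule only on nodes that have not failed an RPC so far.
        candidates = [n for n in nodes if n not in excluded]
        try:
            return run_query(candidates)
        except RpcFailure as err:
            # Without this step the retry would see the same membership
            # (the statestore has not removed the node yet) and fail again.
            excluded.add(err.address)
            last_error = err
    assert last_error is not None
    raise last_error
```

If the excluded.add() line is dropped, every attempt uses the same node list and a crashed node makes each retry fail in the same way, which is the situation described above.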



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-all-unsubscr...@impala.apache.org
For additional commands, e-mail: issues-all-h...@impala.apache.org


