[ 
https://issues.apache.org/jira/browse/TEZ-965?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Siddharth Seth updated TEZ-965:
-------------------------------
    Target Version/s: 0.7.0

> Tez needs a "circuit-breaker" to avoid mistaking network blips for task/node 
> failures
> ------------------------------------------------------------------------------------
>
>                 Key: TEZ-965
>                 URL: https://issues.apache.org/jira/browse/TEZ-965
>             Project: Apache Tez
>          Issue Type: Bug
>         Environment: Flaky DNS cluster
>            Reporter: Gopal V
>
> If DNS resolution fails for a period of 5-10 seconds, Tez restarts tasks and 
> works backwards through the query, triggering recovery of nearly everything 
> it has run.
> Nodes are getting marked as bad because they can't shuffle (DNS resolution 
> failed for all NMs), which results in log lines like:
> {code}
> attempt_1394928384313_0234_1_25_000654_0 blamed for read error from 
> attempt_1394928384313_0234_1_24_000366_0 
> {code}
> The tasks then restart from an earlier vertex.
> When a large number of such failures happen at once, the tasks shouldn't 
> restart previous vertices; instead they should trip a circuit breaker and 
> back off until the network blip disappears.
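
For illustration only, the behaviour requested above could be sketched roughly as
follows. This is not Tez code: the class name FetchFailureCircuitBreaker, its
methods, and the threshold, window and back-off parameters are all hypothetical.
The sketch assumes that a burst of fetch failures inside a short window indicates
a network blip rather than a bad source task, and that callers pass in the
current time themselves.

```java
import java.util.ArrayDeque;
import java.util.Deque;

/**
 * Hypothetical sketch of a fetch-failure circuit breaker; not part of the Tez API.
 * Isolated failures are blamed on the source attempt as usual. A burst of failures
 * inside the window trips the breaker, and no blame is assigned until it closes.
 */
final class FetchFailureCircuitBreaker {
    private final int failureThreshold;   // failures within the window that trip the breaker
    private final long windowMillis;      // length of the sliding window
    private final long backoffMillis;     // how long the breaker stays open once tripped
    private final Deque<Long> recentFailures = new ArrayDeque<>();
    private long openUntilMillis = Long.MIN_VALUE;

    FetchFailureCircuitBreaker(int failureThreshold, long windowMillis, long backoffMillis) {
        this.failureThreshold = failureThreshold;
        this.windowMillis = windowMillis;
        this.backoffMillis = backoffMillis;
    }

    /** True while the breaker is open, i.e. callers should back off and retry later. */
    synchronized boolean isOpen(long nowMillis) {
        return nowMillis < openUntilMillis;
    }

    /**
     * Records one fetch failure observed at nowMillis.
     * Returns true if the source attempt should be blamed, false if the failure
     * should be treated as part of a network blip.
     */
    synchronized boolean recordFailure(long nowMillis) {
        if (isOpen(nowMillis)) {
            return false; // breaker already open: do not blame anyone
        }
        // Discard failures that have fallen out of the sliding window.
        while (!recentFailures.isEmpty()
                && nowMillis - recentFailures.peekFirst() > windowMillis) {
            recentFailures.pollFirst();
        }
        recentFailures.addLast(nowMillis);
        if (recentFailures.size() >= failureThreshold) {
            // Too many failures at once: trip the breaker and start the back-off.
            openUntilMillis = nowMillis + backoffMillis;
            recentFailures.clear();
            return false;
        }
        return true;
    }
}
```

With a threshold of 3, a 1-second window and a 5-second back-off, the first two
failures in a burst would still be blamed on their source attempts, the third
would trip the breaker, and further failures would be ignored for blame purposes
until the back-off expires. Suitable values would depend on the cluster.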



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
