[ https://issues.apache.org/jira/browse/TEZ-965?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Siddharth Seth updated TEZ-965:
-------------------------------
    Target Version/s: 0.7.0

> Tez needs a "circuit-breaker" to avoid mistaking network blips to task/node failures
> ------------------------------------------------------------------------------------
>
>                 Key: TEZ-965
>                 URL: https://issues.apache.org/jira/browse/TEZ-965
>             Project: Apache Tez
>          Issue Type: Bug
>         Environment: Flaky DNS cluster
>            Reporter: Gopal V
>
> If DNS resolution fails for a period of 5-10 seconds, Tez restarts &
> contra-flows in the query triggering recovery of nearly everything it has run.
> Nodes are getting marked as bad because they can't shuffle (dns resolution
> failed for all NMs), which results in log lines like
> {code}
> attempt_1394928384313_0234_1_25_000654_0 blamed for read error from attempt_1394928384313_0234_1_24_000366_0
> {code}
> And the tasks restart from an earlier vertex.
> When a large number of such failures happen, the tasks shouldn't restart
> previous vertexes, but instead should flip a circuit & back-off till the
> network blip disappears.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
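
The behaviour requested in the description (stop blaming upstream attempts once many fetch failures arrive together, and back off until the blip passes) could look roughly like the sketch below. This is only an illustration of the circuit-breaker idea, not Tez code: the class name `FetchFailureCircuitBreaker`, its methods, and the threshold, window, and back-off parameters are all hypothetical, and the issue does not say where such a check would live in Tez.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.function.LongSupplier;

/**
 * Hypothetical sketch of a fetch-failure circuit breaker.
 * Not part of the Tez code base; all names and parameters are assumptions.
 */
public class FetchFailureCircuitBreaker {
    private final int failureThreshold; // failures inside the window that trip the breaker
    private final long windowMillis;    // sliding window used to count recent failures
    private final long backoffMillis;   // how long the breaker stays open after tripping
    private final LongSupplier clock;   // injectable time source, in milliseconds
    private final Deque<Long> recentFailures = new ArrayDeque<>();
    private long openUntil = Long.MIN_VALUE; // breaker starts closed

    public FetchFailureCircuitBreaker(int failureThreshold, long windowMillis,
                                      long backoffMillis, LongSupplier clock) {
        this.failureThreshold = failureThreshold;
        this.windowMillis = windowMillis;
        this.backoffMillis = backoffMillis;
        this.clock = clock;
    }

    /**
     * Records one fetch failure.
     * Returns true if the breaker is open, meaning the caller should back off
     * and retry later rather than blame the source attempt or node.
     */
    public synchronized boolean recordFailure() {
        long now = clock.getAsLong();
        // Forget failures that have fallen out of the sliding window.
        while (!recentFailures.isEmpty()
                && now - recentFailures.peekFirst() > windowMillis) {
            recentFailures.pollFirst();
        }
        recentFailures.addLast(now);
        // Many failures close together look like a network blip: open the
        // breaker, or extend the back-off if it is already open.
        if (recentFailures.size() >= failureThreshold) {
            openUntil = now + backoffMillis;
        }
        return isOpen();
    }

    /** True while the back-off period is still running. */
    public synchronized boolean isOpen() {
        return clock.getAsLong() < openUntil;
    }
}
```

A caller would construct it with, say, `new FetchFailureCircuitBreaker(3, 10_000, 5_000, System::currentTimeMillis)` and consult `recordFailure()` before reporting a read error upstream. Isolated failures below the threshold still return `false`, so they can be handled as ordinary task or node failures. The numbers here are placeholders, not recommended values.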