Ahmed Hussein created TEZ-4349:
----------------------------------

Summary: DAGClient gets stuck with invalid cached DAGStatus
Key: TEZ-4349
URL: https://issues.apache.org/jira/browse/TEZ-4349
Project: Apache Tez
Issue Type: Bug
Reporter: Ahmed Hussein
Assignee: Ahmed Hussein

I found that some Oozie launchers get stuck waiting for the job to complete.

After investigation, I found that {{dagClient.getDAGStatus(null)}} calls the overload {{dagClient.getDAGStatus(null, 0)}}, which then calls {{getDAGStatusInternal}}, which makes use of the {{cachedDagStatus}} field. The {{cachedDagStatus}} is never updated, so the launcher waits indefinitely.

[https://github.com/apache/tez/blob/master/tez-api/src/main/java/org/apache/tez/dag/api/client/DAGClientImpl.java#L212]

{code:java}
if (!dagCompleted) {
  if (dagStatus != null) {
    cachedDagStatus = dagStatus;
    return dagStatus;
  }
  if (cachedDagStatus != null) {
    // could not get from AM (not reachable/ was killed). return cached status.
    return cachedDagStatus;
  }
}
{code}

+To Fix:+
The {{cachedDagStatus}} should be valid only for a certain amount of time, or for a certain number of retries. When the {{cachedDagStatus}} expires, the DAGClient tries to pull the status from the AM or the RM. If fetching the status from both the AM and the RM fails, null is returned to the caller.
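
To make the proposal concrete, here is a minimal sketch of the time-based variant. It is not the actual {{DAGClientImpl}} code or a patch: the class {{ExpiringStatusCache}}, its method {{getStatus}}, the two supplier parameters and the TTL value are hypothetical names chosen for illustration. It assumes that the AM is asked first and the RM second, that the cached value is used only as a fallback while it is younger than the TTL, and that a fallback hit does not extend the lifetime of the cached value.

```java
import java.util.function.LongSupplier;
import java.util.function.Supplier;

/** Hypothetical sketch of a status cache with an expiry; not part of Tez. */
final class ExpiringStatusCache<S> {
  private final long ttlMillis;     // how long a cached status stays valid (assumed knob)
  private final LongSupplier clock; // injected time source, so expiry can be tested
  private S cached;                 // last status successfully fetched, or null
  private long cachedAtMillis;      // clock reading when 'cached' was stored

  ExpiringStatusCache(long ttlMillis, LongSupplier clock) {
    this.ttlMillis = ttlMillis;
    this.clock = clock;
  }

  /**
   * Asks the AM first and the RM second. If neither answers (null), falls
   * back to the cached status while it is still fresh; otherwise returns null.
   */
  S getStatus(Supplier<S> fromAM, Supplier<S> fromRM) {
    S fresh = fromAM.get();
    if (fresh == null) {
      fresh = fromRM.get();
    }
    long now = clock.getAsLong();
    if (fresh != null) {
      // A successful fetch refreshes the cache and its timestamp.
      cached = fresh;
      cachedAtMillis = now;
      return fresh;
    }
    if (cached != null && now - cachedAtMillis < ttlMillis) {
      // Could not reach AM or RM, but the cached status has not expired yet.
      return cached;
    }
    // Cache is empty or expired: drop it and report the failure to the caller.
    cached = null;
    return null;
  }
}
```

With this shape, a caller that polls in a loop sees null once the AM and RM have both been unreachable for longer than the TTL, and can then decide to fail instead of waiting on a stale status. A retry-count variant would replace the timestamp with a counter of consecutive fallback hits.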