abstractdog commented on a change in pull request #161: URL: https://github.com/apache/tez/pull/161#discussion_r748857804
########## File path: tez-api/src/main/java/org/apache/tez/dag/api/client/DAGClientImpl.java ########## @@ -221,13 +245,14 @@ protected DAGStatus getDAGStatusInternal(@Nullable Set<StatusGetOpts> statusOpti final DAGStatus dagStatus = getDAGStatusViaAM(statusOptions, timeout); if (!dagCompleted) { - if (dagStatus != null) { - cachedDagStatus = dagStatus; + if (dagStatus != null) { // update the cached DAGStatus + cachedDAGStatusRef.setValue(dagStatus); return dagStatus; } - if (cachedDagStatus != null) { + DAGStatus cachedDAG = cachedDAGStatusRef.getValue(); + if (cachedDAG != null) { // could not get from AM (not reachable/ was killed). return cached status. - return cachedDagStatus; + return cachedDAG; Review comment: am I right to assume this is the codepath where the original issue happened? could you please clarify how can we indefinitely stuck here? I mean, we can only hit this part if getDAGStatusViaAM returns null but dagCompleted is not true, so when we hit this again in getDAGStatusViaAM and again: ``` } catch (TezException e) { // can be either due to a n/w issue of due to AM completed. } catch (IOException e) { // can be either due to a n/w issue of due to AM completed. } ``` also getApplicationReportInternal keeps returning null in checkAndSetDagCompletionStatus was it the case for you? if so, does it make sense to put at least debug level log messages to the silent catch branches? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@tez.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org