amahussein commented on a change in pull request #161:
URL: https://github.com/apache/tez/pull/161#discussion_r749556005
##########
File path:
tez-api/src/main/java/org/apache/tez/dag/api/client/DAGClientImpl.java
##########
@@ -221,13 +245,14 @@ protected DAGStatus getDAGStatusInternal(@Nullable
Set<StatusGetOpts> statusOpti
final DAGStatus dagStatus = getDAGStatusViaAM(statusOptions, timeout);
if (!dagCompleted) {
- if (dagStatus != null) {
- cachedDagStatus = dagStatus;
+ if (dagStatus != null) { // update the cached DAGStatus
+ cachedDAGStatusRef.setValue(dagStatus);
return dagStatus;
}
- if (cachedDagStatus != null) {
+ DAGStatus cachedDAG = cachedDAGStatusRef.getValue();
+ if (cachedDAG != null) {
// could not get from AM (not reachable/ was killed). return cached
status.
- return cachedDagStatus;
+ return cachedDAG;
Review comment:
Thanks for the feedback.
Yes, considering the implementation of
[TezJob.run()-Line222](https://github.com/apache/pig/blob/trunk/src/org/apache/pig/backend/hadoop/executionengine/tez/TezJob.java#L222)
in Pig:
Pig polls on the DAGStatus inside an infinite loop:
```java
while (true) {
try {
dagStatus = dagClient.getDAGStatus(null);
} catch (Exception e) {
log.info("Cannot retrieve DAG status", e);
break;
}
if (dagStatus.isCompleted()) {
// do something
// break;
}
sleep(1000);
}
```
Let's assume the following scenario on Tez Side:
- Pig first iteration calls `getDAGStatusViaAM()` which successfully pulls
the DAGStatus and updates the cachedDAGStatus to running.
- Pig sleeps 1000
- second call from Pig calls `getDAGStatusViaAM()` which encounters
`TezException` or `IOException`. The call would return the last cachedDAGStatus
(which is running), instead of null.
- Since the status is `running`, the Pig-thread sleeps
- This will keep going as long as the `getDAGStatusViaAM()` fails, and the
last valid DAGStatus is still cached.
The problem in this corner case is that the Pig client will keep looping
indefinitely as long as it does not receive a null or
dagClient.getDAGStatus(null) does not throw an exception.
From a client perspective, it is better to fail early in order to recover
faster.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]