zhengchenyu commented on issue #1011:
URL: 
https://github.com/apache/incubator-uniffle/issues/1011#issuecomment-1637522258

   > @zhengchenyu First, we can let job fail. Then, we can optimize the logic 
like #477
   
   For 2.4, I prefer to let the job fail first. My initial thought was to enable 
job recovery, then recover the completed tasks and recompute the uncompleted ones. 
https://github.com/apache/incubator-uniffle/issues/477 is also a good approach. 
After this issue, I will reconsider the design.
   
   For 2.1 and 2.2, I think we can set tez.am.maxtaskfailures.per.node to 
Int.max.
   
   For 2.3, I think we can set tez.am.node-unhealthy-reschedule-tasks to false.
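
   A minimal sketch of the two overrides suggested for 2.1, 2.2 and 2.3, assuming 
they are passed as plain key/value pairs (for example through tez-site.xml or the 
job configuration). The class name `TezRssOverrides` and the `build` helper are 
hypothetical and exist only for illustration; only the two property names come 
from the comment above.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper: collects the overrides discussed above as string pairs.
class TezRssOverrides {

    static Map<String, String> build() {
        Map<String, String> conf = new LinkedHashMap<>();
        // 2.1 / 2.2: raise the per-node task failure limit to the largest int value.
        conf.put("tez.am.maxtaskfailures.per.node", String.valueOf(Integer.MAX_VALUE));
        // 2.3: do not reschedule tasks when a node is reported unhealthy.
        conf.put("tez.am.node-unhealthy-reschedule-tasks", "false");
        return conf;
    }

    public static void main(String[] args) {
        // Print the pairs in key=value form so they can be copied into a config file.
        build().forEach((k, v) -> System.out.println(k + "=" + v));
    }
}
```

   Whether these should be forced by the client or only documented as recommended 
settings is still open.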
   
   The work is mainly in the tests, which should simulate all of these conditions.

