zhengchenyu commented on issue #1011: URL: https://github.com/apache/incubator-uniffle/issues/1011#issuecomment-1637522258
> @zhengchenyu First, we can let job fail. Then, we can optimize the logic like #477

For 2.4, I prefer to let the job fail first. My initial thought was to enable job recovery, then recover the completed tasks and recompute the uncompleted ones. https://github.com/apache/incubator-uniffle/issues/477 is also a good way. After this issue, I will reconsider the design.

For 2.1 and 2.2, I think we can set tez.am.maxtaskfailures.per.node to Int.max. For 2.3, I think we can set tez.am.node-unhealthy-reschedule-tasks to false.

The work is mainly about testing; the tests should simulate all of these conditions.
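
As a rough sketch of the two settings proposed for 2.1 to 2.3, the overrides could be collected like this. The class `TezNodeFailureSettings` and its `overrides()` method are hypothetical names used only for illustration; they are not part of Tez or Uniffle. The sketch also assumes that "Int.max" means Java's `Integer.MAX_VALUE`, and it uses a plain map rather than the real Tez configuration classes so that it stays self-contained.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper for illustration only; not part of Tez or Uniffle.
class TezNodeFailureSettings {

    // Returns the two overrides suggested in the comment as plain key/value strings.
    static Map<String, String> overrides() {
        Map<String, String> conf = new LinkedHashMap<>();
        // Cases 2.1 and 2.2: raise the per-node task failure limit to the maximum int
        // (assumption: "Int.max" in the comment means Integer.MAX_VALUE).
        conf.put("tez.am.maxtaskfailures.per.node", String.valueOf(Integer.MAX_VALUE));
        // Case 2.3: set the unhealthy-node rescheduling flag to false, as proposed.
        conf.put("tez.am.node-unhealthy-reschedule-tasks", "false");
        return conf;
    }

    public static void main(String[] args) {
        // Print the overrides as key=value pairs.
        overrides().forEach((key, value) -> System.out.println(key + "=" + value));
    }
}
```

In a real job these keys would presumably be applied to the Tez configuration (for example through tez-site.xml or the job configuration) rather than a standalone map; how they interact with the remote shuffle client still needs the tests mentioned above.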
