Re: [PR] Revert "YARN-11709. NodeManager should be shut down or blacklisted when it ca…" [hadoop]

via GitHub Fri, 06 Sep 2024 00:34:26 -0700


brumi1024 commented on PR #7028:
URL: https://github.com/apache/hadoop/pull/7028#issuecomment-2333429490


   @slfan1989 there is an issue with the current implementation: we catch every 
PrivilegedOperationException - including the ones caused by a user-requested 
application kill - and then proceed to mark the NM unhealthy. This should not 
happen. A bit further down in this class there is a separate exit code handling 
method for the container launch, which throws a config-related exception in 
cases where the error is truly unrecoverable without admin input; I plan to 
reuse it here as well.
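
   As a rough illustration of the distinction described above, here is a 
minimal sketch, not the actual NodeManager code. Every name in it 
(`ExitCodeTriage`, `OperationFailedException`, `Action`, and the two 
`HYPOTHETICAL_*` exit codes) is an assumption made for the example; only the 
128 + signal number convention for signal-terminated processes is standard. 
The idea is to inspect the exit code and flag the node only for failures that 
need admin input, rather than for every exception:

```java
public class ExitCodeTriage {

    /** Hypothetical stand-in for an exception that carries a process exit code. */
    public static class OperationFailedException extends Exception {
        private final int exitCode;

        public OperationFailedException(String message, int exitCode) {
            super(message);
            this.exitCode = exitCode;
        }

        public int getExitCode() {
            return exitCode;
        }
    }

    /** What the caller should do about node health after a failure. */
    public enum Action { IGNORE_FOR_NODE_HEALTH, MARK_NODE_UNHEALTHY }

    // Conventional exit codes for a process terminated by a signal (128 + signal).
    public static final int EXIT_SIGKILL = 128 + 9;
    public static final int EXIT_SIGTERM = 128 + 15;

    // Placeholder values for "unrecoverable without admin input" failures.
    // These are assumptions for the sketch, not taken from the Hadoop source.
    public static final int HYPOTHETICAL_BAD_CONFIG = 24;
    public static final int HYPOTHETICAL_BAD_PERMISSIONS = 22;

    /** Only config-type failures affect node health; kills and the rest do not. */
    public static Action classify(OperationFailedException e) {
        switch (e.getExitCode()) {
            case HYPOTHETICAL_BAD_CONFIG:
            case HYPOTHETICAL_BAD_PERMISSIONS:
                return Action.MARK_NODE_UNHEALTHY;
            default:
                // Includes EXIT_SIGKILL / EXIT_SIGTERM from a user-requested kill.
                return Action.IGNORE_FOR_NODE_HEALTH;
        }
    }

    public static void main(String[] args) {
        int[] codes = { EXIT_SIGKILL, EXIT_SIGTERM, HYPOTHETICAL_BAD_CONFIG };
        for (int code : codes) {
            Action action = classify(new OperationFailedException("failed", code));
            System.out.println("exit " + code + " -> " + action);
        }
    }
}
```

   With this shape, a killed application falls through to the default branch 
and leaves the node's health status alone, while the explicitly listed codes 
are the only ones that can take the node out of service.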
   
   But I'll only have time to work on that next week. Until then I think the 
current state is harmful: after a few application kills most of the NMs will be 
marked unhealthy, requiring a restart.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


---------------------------------------------------------------------
For additional commands, e-mail: common-issues-h...@hadoop.apache.org
