[ https://issues.apache.org/jira/browse/YARN-3797?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Karthik Kambatla updated YARN-3797:
-----------------------------------
    Component/s: nodemanager

> NodeManager not blacklisting the disk (shuffle) with errors
> -----------------------------------------------------------
>
>                 Key: YARN-3797
>                 URL: https://issues.apache.org/jira/browse/YARN-3797
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: nodemanager
>            Reporter: Rajesh Balamohan
>
> In a multi-node environment, one of the disks (where map outputs are written)
> in a node went bad. The errors are given below.
> {noformat}
> Info fld=0x9ad090a
> sd 6:0:5:0: [sdf] Add. Sense: Unrecovered read error
> sd 6:0:5:0: [sdf] CDB: Read(10): 28 00 09 ad 09 08 00 00 08 00
> end_request: critical medium error, dev sdf, sector 162334984
> mpt2sas0: log_info(0x31080000): originator(PL), code(0x08), sub_code(0x0000)
> sd 6:0:5:0: [sdf] Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
> sd 6:0:5:0: [sdf] Sense Key : Medium Error [current]
> Info fld=0x9af8892
> sd 6:0:5:0: [sdf] Add. Sense: Unrecovered read error
> sd 6:0:5:0: [sdf] CDB: Read(10): 28 00 09 af 88 90 00 00 08 00
> end_request: critical medium error, dev sdf, sector 162498704
> mpt2sas0: log_info(0x31080000): originator(PL), code(0x08), sub_code(0x0000)
> mpt2sas0: log_info(0x31080000): originator(PL), code(0x08), sub_code(0x0000)
> sd 6:0:5:0: [sdf] Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
> sd 6:0:5:0: [sdf] Sense Key : Medium Error [current]
> Info fld=0x9af8892
> sd 6:0:5:0: [sdf] Add. Sense: Unrecovered read error
> sd 6:0:5:0: [sdf] CDB: Read(10): 28 00 09 af 88 90 00 00 08 00
> end_request: critical medium error, dev sdf, sector 162498704
> {noformat}
> Diskchecker would pass, because the system still allows directories to be
> created and deleted without issues. But the data being served can be corrupt,
> and fetchers fail during CRC verification, causing unwanted delays and retries.
> Ideally, the NodeManager should detect such errors and blacklist/remove those
> disks from the NM.

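
For illustration only, here is a minimal sketch of the kind of check the description is asking for: instead of only creating and deleting a directory, write a probe file, read it back, and compare CRC32 values. The class name ReadBackDiskProbe and the method readBackOk are hypothetical and are not part of Hadoop or of any patch on this issue. Note also that a read immediately after a write is usually served from the OS page cache, so a probe like this would not by itself catch the medium errors shown in the log; it only shows the shape of a data-level check as opposed to a directory-level one.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Random;
import java.util.zip.CRC32;

// Hypothetical helper for illustration; NOT Hadoop's DiskChecker.
class ReadBackDiskProbe {

    /**
     * Writes a probe file under dir, reads it back, and compares CRC32 values.
     * Returns false on any I/O error or checksum mismatch.
     */
    static boolean readBackOk(Path dir) {
        // Fixed seed so the payload is reproducible; 64 KiB is an arbitrary size.
        byte[] payload = new byte[64 * 1024];
        new Random(42).nextBytes(payload);
        CRC32 expected = new CRC32();
        expected.update(payload);

        Path probe = null;
        try {
            probe = Files.createTempFile(dir, "nm-disk-probe", ".tmp");
            Files.write(probe, payload);

            // Caveat: this read may come from the page cache, not the platter.
            byte[] readBack = Files.readAllBytes(probe);
            CRC32 actual = new CRC32();
            actual.update(readBack);

            return readBack.length == payload.length
                    && actual.getValue() == expected.getValue();
        } catch (IOException | SecurityException e) {
            // Treat any failure to write or read the probe as an unhealthy dir.
            return false;
        } finally {
            if (probe != null) {
                try {
                    Files.deleteIfExists(probe);
                } catch (IOException ignored) {
                    // Best-effort cleanup only.
                }
            }
        }
    }
}
```

A caller could run readBackOk against each configured local directory and treat a false result the same way as a failed directory check. Detecting errors like the ones above reliably would likely need more than this, for example bypassing the cache or feeding shuffle-side read/CRC failures back into the disk health decision.
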
-- This message was sent by Atlassian JIRA (v6.3.4#6332)