[ https://issues.apache.org/jira/browse/HDFS-12703?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Wei-Chiu Chuang updated HDFS-12703:
-----------------------------------
    Fix Version/s: 3.1.3

> Exceptions are fatal to decommissioning monitor
> -----------------------------------------------
>
>                 Key: HDFS-12703
>                 URL: https://issues.apache.org/jira/browse/HDFS-12703
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: namenode
>    Affects Versions: 2.7.0
>            Reporter: Daryn Sharp
>            Assignee: He Xiaoqiao
>            Priority: Critical
>             Fix For: 3.3.0, 3.2.1, 3.1.3
>
>         Attachments: HDFS-12703.001.patch, HDFS-12703.002.patch,
> HDFS-12703.003.patch, HDFS-12703.004.patch, HDFS-12703.005.patch,
> HDFS-12703.006.patch, HDFS-12703.007.patch, HDFS-12703.008.patch,
> HDFS-12703.009.patch, HDFS-12703.010.patch, HDFS-12703.011.patch,
> HDFS-12703.012.patch, HDFS-12703.013.patch
>
>
> The {{DecommissionManager.Monitor}} runs as an executor scheduled task. If
> an exception occurs, all decommissioning ceases until the NN is restarted.
> Per javadoc for {{executor#scheduleAtFixedRate}}: *If any execution of the
> task encounters an exception, subsequent executions are suppressed*. The
> monitor thread is alive but blocked waiting for an executor task that will
> never come. The code currently disposes of the future so the actual
> exception that aborted the task is gone.
> Failover is insufficient since the task is also likely dead on the standby.
> Replication queue init after the transition to active will fix the under
> replication of blocks on currently decommissioning nodes but future nodes
> never decommission. The standby must be bounced prior to failover – and
> hopefully the error condition does not reoccur.
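
The scheduleAtFixedRate behaviour quoted in the description can be reproduced
outside the NameNode. The following is only a minimal sketch and is not taken
from any of the attached patches: the class MonitorSuppressionDemo and the
helpers guarded(), countRuns() and causeOfDeath() are hypothetical names used
for illustration. It shows three things, assuming a plain
java.util.concurrent.ScheduledExecutorService: an unguarded task stops being
rescheduled after its first exception, a task wrapped in a try/catch keeps
running, and the original exception stays reachable if the returned future is
kept rather than discarded.

```java
import java.util.concurrent.ExecutionException;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ScheduledFuture;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

public class MonitorSuppressionDemo {

  /** Hypothetical helper: wraps a task so no exception reaches the executor. */
  static Runnable guarded(Runnable task) {
    return () -> {
      try {
        task.run();
      } catch (RuntimeException e) {
        // Report and swallow, so the next scheduled run still happens.
        System.err.println("monitor run failed, retrying next interval: " + e);
      }
    };
  }

  /** Schedules a task that fails on its first run only; returns the run count. */
  static int countRuns(boolean guard) throws InterruptedException {
    AtomicInteger runs = new AtomicInteger();
    Runnable flaky = () -> {
      if (runs.incrementAndGet() == 1) {
        throw new IllegalStateException("simulated failure");
      }
    };
    ScheduledExecutorService executor =
        Executors.newSingleThreadScheduledExecutor();
    try {
      executor.scheduleAtFixedRate(
          guard ? guarded(flaky) : flaky, 0, 10, TimeUnit.MILLISECONDS);
      Thread.sleep(300); // leave time for many 10 ms periods
    } finally {
      executor.shutdownNow();
    }
    return runs.get();
  }

  /** Keeps the future, so the exception that ended the task can be read. */
  static Throwable causeOfDeath() throws InterruptedException {
    ScheduledExecutorService executor =
        Executors.newSingleThreadScheduledExecutor();
    try {
      ScheduledFuture<?> future = executor.scheduleAtFixedRate(
          () -> { throw new IllegalStateException("simulated failure"); },
          0, 10, TimeUnit.MILLISECONDS);
      try {
        future.get(); // returns only once the periodic task has terminated
        return null;
      } catch (ExecutionException e) {
        return e.getCause();
      }
    } finally {
      executor.shutdownNow();
    }
  }

  public static void main(String[] args) throws InterruptedException {
    System.out.println("unguarded runs: " + countRuns(false));
    System.out.println("guarded runs:   " + countRuns(true));
    System.out.println("cause of death: " + causeOfDeath());
  }
}
```

In this sketch the unguarded task runs exactly once and is never rescheduled,
while the guarded task continues at every interval. Whether the monitor should
swallow, log, or rethrow a given exception is a design decision for the actual
fix and is not settled by this example.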