[ https://issues.apache.org/jira/browse/HDFS-16484?focusedWorklogId=755602&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-755602 ]
ASF GitHub Bot logged work on HDFS-16484: ----------------------------------------- Author: ASF GitHub Bot Created on: 12/Apr/22 05:09 Start Date: 12/Apr/22 05:09 Worklog Time Spent: 10m Work Description: tasanuma commented on code in PR #4032: URL: https://github.com/apache/hadoop/pull/4032#discussion_r847973070 ########## hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/sps/BlockStorageMovementNeeded.java: ########## @@ -248,13 +251,22 @@ public void run() { pendingWorkForDirectory.get(startINode); if (dirPendingWorkInfo != null && dirPendingWorkInfo.isDirWorkDone()) { - ctxt.removeSPSHint(startINode); + try { + ctxt.removeSPSHint(startINode); + } catch (FileNotFoundException e) { + // ignore if the file doesn't already exist + startINode = null; + } pendingWorkForDirectory.remove(startINode); } } startINode = null; // Current inode successfully scanned. } } catch (Throwable t) { + retryCount++; + if (retryCount >= 3) { + startINode = null; + } Review Comment: @liubingxing - Let's define the constant of the max retry count (`private static final int MAX_RETRY_COUNT = 3;`) in `SPSPathIdProcessor`. - How about logging a message when skipping the inode? ```suggestion retryCount++; if (retryCount >= MAX_RETRY_COUNT) { LOG.warn("Skipping this inode {} due to too many retries.", startINode); startINode = null; } ``` - And I think it's better to move the retry logic to the end of the catch block. Issue Time Tracking ------------------- Worklog Id: (was: 755602) Time Spent: 3h 10m (was: 3h) > [SPS]: Fix an infinite loop bug in SPSPathIdProcessor thread > ------------------------------------------------------------- > > Key: HDFS-16484 > URL: https://issues.apache.org/jira/browse/HDFS-16484 > Project: Hadoop HDFS > Issue Type: Sub-task > Reporter: qinyuren > Assignee: qinyuren > Priority: Major > Labels: pull-request-available > Attachments: image-2022-02-25-14-35-42-255.png > > Time Spent: 3h 10m > Remaining Estimate: 0h > > Currently, we ran SPS in our cluster and found this log. The > SPSPathIdProcessor thread enters an infinite loop and prints the same log all > the time. > !image-2022-02-25-14-35-42-255.png|width=682,height=195! > In SPSPathIdProcessor thread, if it get a inodeId which path does not exist, > then the SPSPathIdProcessor thread entry infinite loop and can't work > normally. > The reason is that #ctxt.getNextSPSPath() get a inodeId which path does not > exist. The inodeId will not be set to null, causing the thread hold this > inodeId forever. > {code:java} > public void run() { > LOG.info("Starting SPSPathIdProcessor!."); > Long startINode = null; > while (ctxt.isRunning()) { > try { > if (!ctxt.isInSafeMode()) { > if (startINode == null) { > startINode = ctxt.getNextSPSPath(); > } // else same id will be retried > if (startINode == null) { > // Waiting for SPS path > Thread.sleep(3000); > } else { > ctxt.scanAndCollectFiles(startINode); > // check if directory was empty and no child added to queue > DirPendingWorkInfo dirPendingWorkInfo = > pendingWorkForDirectory.get(startINode); > if (dirPendingWorkInfo != null > && dirPendingWorkInfo.isDirWorkDone()) { > ctxt.removeSPSHint(startINode); > pendingWorkForDirectory.remove(startINode); > } > } > startINode = null; // Current inode successfully scanned. > } > } catch (Throwable t) { > String reClass = t.getClass().getName(); > if (InterruptedException.class.getName().equals(reClass)) { > LOG.info("SPSPathIdProcessor thread is interrupted. Stopping.."); > break; > } > LOG.warn("Exception while scanning file inodes to satisfy the policy", > t); > try { > Thread.sleep(3000); > } catch (InterruptedException e) { > LOG.info("Interrupted while waiting in SPSPathIdProcessor", t); > break; > } > } > } > } {code} > > -- This message was sent by Atlassian Jira (v8.20.1#820001) --------------------------------------------------------------------- To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org