[jira] [Work logged] (HDFS-16484) [SPS]: Fix an infinite loop bug in SPSPathIdProcessor thread

ASF GitHub Bot (Jira) Mon, 11 Apr 2022 22:10:06 -0700


     [ 
https://issues.apache.org/jira/browse/HDFS-16484?focusedWorklogId=755602&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-755602
 ]


ASF GitHub Bot logged work on HDFS-16484:
-----------------------------------------

                Author: ASF GitHub Bot
            Created on: 12/Apr/22 05:09
            Start Date: 12/Apr/22 05:09
    Worklog Time Spent: 10m 
      Work Description: tasanuma commented on code in PR #4032:
URL: https://github.com/apache/hadoop/pull/4032#discussion_r847973070


##########
hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/sps/BlockStorageMovementNeeded.java:
##########
@@ -248,13 +251,22 @@ public void run() {
                   pendingWorkForDirectory.get(startINode);
               if (dirPendingWorkInfo != null
                   && dirPendingWorkInfo.isDirWorkDone()) {
-                ctxt.removeSPSHint(startINode);
+                try {
+                  ctxt.removeSPSHint(startINode);
+                } catch (FileNotFoundException e) {
+                  // ignore if the file doesn't already exist
+                  startINode = null;
+                }
                 pendingWorkForDirectory.remove(startINode);
               }
             }
             startINode = null; // Current inode successfully scanned.
           }
         } catch (Throwable t) {
+          retryCount++;
+          if (retryCount >= 3) {
+            startINode = null;
+          }

Review Comment:
   @liubingxing 
   - Let's define the constant of the max retry count (`private static final 
int MAX_RETRY_COUNT = 3;`) in `SPSPathIdProcessor`.
   - How about logging a message when skipping the inode?
   ```suggestion
             retryCount++;
             if (retryCount >= MAX_RETRY_COUNT) {
               LOG.warn("Skipping this inode {} due to too many retries.", 
startINode);
               startINode = null;
             }
   ```
   
   - And I think it's better to move the retry logic to the end of the catch 
block.
   





Issue Time Tracking
-------------------

    Worklog Id:     (was: 755602)
    Time Spent: 3h 10m  (was: 3h)

> [SPS]: Fix an infinite loop bug in SPSPathIdProcessor thread 
> -------------------------------------------------------------
>
>                 Key: HDFS-16484
>                 URL: https://issues.apache.org/jira/browse/HDFS-16484
>             Project: Hadoop HDFS
>          Issue Type: Sub-task
>            Reporter: qinyuren
>            Assignee: qinyuren
>            Priority: Major
>              Labels: pull-request-available
>         Attachments: image-2022-02-25-14-35-42-255.png
>
>          Time Spent: 3h 10m
>  Remaining Estimate: 0h
>
> Currently, we ran SPS in our cluster and found this log. The 
> SPSPathIdProcessor thread enters an infinite loop and prints the same log all 
> the time.
> !image-2022-02-25-14-35-42-255.png|width=682,height=195!
> In SPSPathIdProcessor thread, if it get a inodeId which path does not exist, 
> then the SPSPathIdProcessor thread entry infinite loop and can't work 
> normally. 
> The reason is that #ctxt.getNextSPSPath() get a inodeId which path does not 
> exist. The inodeId will not be set to null, causing the thread hold this 
> inodeId forever.
> {code:java}
> public void run() {
>   LOG.info("Starting SPSPathIdProcessor!.");
>   Long startINode = null;
>   while (ctxt.isRunning()) {
>     try {
>       if (!ctxt.isInSafeMode()) {
>         if (startINode == null) {
>           startINode = ctxt.getNextSPSPath();
>         } // else same id will be retried
>         if (startINode == null) {
>           // Waiting for SPS path
>           Thread.sleep(3000);
>         } else {
>           ctxt.scanAndCollectFiles(startINode);
>           // check if directory was empty and no child added to queue
>           DirPendingWorkInfo dirPendingWorkInfo =
>               pendingWorkForDirectory.get(startINode);
>           if (dirPendingWorkInfo != null
>               && dirPendingWorkInfo.isDirWorkDone()) {
>             ctxt.removeSPSHint(startINode);
>             pendingWorkForDirectory.remove(startINode);
>           }
>         }
>         startINode = null; // Current inode successfully scanned.
>       }
>     } catch (Throwable t) {
>       String reClass = t.getClass().getName();
>       if (InterruptedException.class.getName().equals(reClass)) {
>         LOG.info("SPSPathIdProcessor thread is interrupted. Stopping..");
>         break;
>       }
>       LOG.warn("Exception while scanning file inodes to satisfy the policy",
>           t);
>       try {
>         Thread.sleep(3000);
>       } catch (InterruptedException e) {
>         LOG.info("Interrupted while waiting in SPSPathIdProcessor", t);
>         break;
>       }
>     }
>   }
> } {code}
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

---------------------------------------------------------------------
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org

[jira] [Work logged] (HDFS-16484) [SPS]: Fix an infinite loop bug in SPSPathIdProcessor thread

Reply via email to