[jira] [Comment Edited] (HDFS-15273) CacheReplicationMonitor hold lock for long time and lead to NN out of service

Karthik Palanisamy (Jira) Wed, 25 Jan 2023 16:36:15 -0800


    [ https://issues.apache.org/jira/browse/HDFS-15273?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17680844#comment-17680844 ]


Karthik Palanisamy edited comment on HDFS-15273 at 1/26/23 12:35 AM:
---------------------------------------------------------------------

We may not be able to bring the NameNode out of safe mode, as leaving safe 
mode takes 1000x longer.

Centralized cache management is not usable without this fix, as we are unable 
to bring up the NameNode.

Please review this JIRA. 


was (Author: kpalanisamy):
We may not be able to bring the NameNode out of safe mode, as leaving safe 
mode takes 1000x longer.

Centralized cache management is not usable without this fix, as we are unable 
to bring up the NameNode.

Please review this JIRA.

> CacheReplicationMonitor hold lock for long time and lead to NN out of service
> -----------------------------------------------------------------------------
>
>                 Key: HDFS-15273
>                 URL: https://issues.apache.org/jira/browse/HDFS-15273
>             Project: Hadoop HDFS
>          Issue Type: Improvement
>          Components: caching, namenode
>            Reporter: Xiaoqiao He
>            Assignee: Xiaoqiao He
>            Priority: Major
>         Attachments: HDFS-15273.001.patch, HDFS-15273.002.patch, 
> HDFS-15273.003.patch
>
>
> CacheReplicationMonitor periodically scans the cache directives and the 
> cached block map. As more and more cache directives are added, 
> CacheReplicationMonitor takes a very long time to rescan all of the cache 
> directives and cached blocks. The scan holds the global write lock, so for 
> the duration of the scan the NameNode cannot process any other requests.
> I think we should warn end users who turn on the CacheManager feature about 
> this risk until the implementation is improved.
> {code:java}
>   private void rescan() throws InterruptedException {
>     scannedDirectives = 0;
>     scannedBlocks = 0;
>     try {
>       namesystem.writeLock();
>       try {
>         lock.lock();
>         if (shutdown) {
>           throw new InterruptedException("CacheReplicationMonitor was " +
>               "shut down.");
>         }
>         curScanCount = completedScanCount + 1;
>       } finally {
>         lock.unlock();
>       }
>       resetStatistics();
>       rescanCacheDirectives();
>       rescanCachedBlockMap();
>       blockManager.getDatanodeManager().resetLastCachingDirectiveSentTime();
>     } finally {
>       namesystem.writeUnlock();
>     }
>   }
> {code}
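
A minimal, self-contained sketch of the effect described above. It is not taken from the attached patches or from the HDFS code base. The class `LockHoldDemo`, the method `probeReadLock`, and the use of a plain `ReentrantReadWriteLock` as a stand-in for the namesystem lock are assumptions made for illustration only. A simulated rescan holds the write lock for its whole duration, and a second thread checks whether it can take the read lock during and after the scan.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.locks.ReentrantReadWriteLock;

final class LockHoldDemo {

  /**
   * Returns {readDuringScan, readAfterScan}: whether a reader could take the
   * read lock while the simulated rescan held the write lock, and afterwards.
   */
  static boolean[] probeReadLock() {
    // Hypothetical stand-in for the namesystem's global lock.
    ReentrantReadWriteLock globalLock = new ReentrantReadWriteLock();
    CountDownLatch scanStarted = new CountDownLatch(1);
    CountDownLatch scanMayFinish = new CountDownLatch(1);

    Thread monitor = new Thread(() -> {
      globalLock.writeLock().lock();
      try {
        scanStarted.countDown();
        // Simulates a long rescan of directives and blocks under the lock.
        scanMayFinish.await();
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
      } finally {
        globalLock.writeLock().unlock();
      }
    });
    monitor.start();

    try {
      scanStarted.await();
      // Non-blocking attempt while the write lock is held by the monitor.
      boolean during = globalLock.readLock().tryLock();
      if (during) {
        globalLock.readLock().unlock();
      }
      // Let the simulated scan finish and release the write lock.
      scanMayFinish.countDown();
      monitor.join();
      boolean after = globalLock.readLock().tryLock();
      if (after) {
        globalLock.readLock().unlock();
      }
      return new boolean[] {during, after};
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      throw new IllegalStateException(e);
    }
  }

  public static void main(String[] args) {
    boolean[] result = probeReadLock();
    System.out.println("read lock during scan: " + result[0]);
    System.out.println("read lock after scan:  " + result[1]);
  }
}
```

In this sketch the read attempt fails while the scan is in progress and succeeds once the write lock is released. In the quoted `rescan()` the write lock is held across both `rescanCacheDirectives()` and `rescanCachedBlockMap()`, so the blocked window grows with the number of directives and cached blocks.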



