[jira] [Commented] (HDFS-15273) CacheReplicationMonitor hold lock for long time and lead to NN out of service

Xiaoqiao He (Jira) Tue, 05 May 2020 00:33:35 -0700


    [ 
https://issues.apache.org/jira/browse/HDFS-15273?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17099603#comment-17099603
 ]


Xiaoqiao He commented on HDFS-15273:
------------------------------------

Hi [~weichiu], here is another set of test data, which tries to cache 
different numbers of directives and blocks. Scan cost is the time that 
CacheReplicationMonitor holds the global write lock. During that time the NN 
is out of service, since no other thread can acquire the lock.
|directives|blocks|scan cost(ms)|
|500|10000|16370|
|500|20000|67513|
|500|30000|111834|
|500|40000|160500|
|1000|10000|16943|
|1000|20000|35461|
|1000|30000|84431|
|1000|40000|152480|
If we scale up the test set, the scan cost increases significantly: with 500 
directives, going from 10000 to 40000 blocks raises it from 16370 ms to 
160500 ms, roughly 10x.
IMO, if there are many directives and blocks to scan, we should release the 
lock once the hold time exceeds a threshold, sleep for a short time, and then 
try to acquire it again.
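
A rough sketch of that release-and-retry idea, just to make the proposal 
concrete. This is not the CacheReplicationMonitor code and not a patch: the 
class YieldingScanner, the parameters maxHoldMs and yieldSleepMs, and the use 
of a plain ReentrantReadWriteLock in place of the namesystem lock are all 
assumptions made for illustration.

```java
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.ReentrantReadWriteLock;

/** Hypothetical sketch; stands in for a scan that runs under a global write lock. */
class YieldingScanner {
  // Stand-in for the namesystem lock (assumption: a fair read/write lock).
  private final ReentrantReadWriteLock fsLock = new ReentrantReadWriteLock(true);
  private final long maxHoldMs;    // assumed threshold for one continuous hold
  private final long yieldSleepMs; // assumed pause that lets other threads in
  private int yieldCount = 0;

  YieldingScanner(long maxHoldMs, long yieldSleepMs) {
    this.maxHoldMs = maxHoldMs;
    this.yieldSleepMs = yieldSleepMs;
  }

  /** Runs perItemWork 'total' times under the write lock; returns items scanned. */
  int scan(int total, Runnable perItemWork) {
    int scanned = 0;
    fsLock.writeLock().lock();
    long acquiredAt = System.nanoTime();
    try {
      while (scanned < total) {
        perItemWork.run();
        scanned++;
        long heldMs =
            TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - acquiredAt);
        if (heldMs >= maxHoldMs && scanned < total) {
          // Held too long: release, sleep briefly, then reacquire.
          fsLock.writeLock().unlock();
          boolean interrupted = false;
          try {
            yieldCount++;
            Thread.sleep(yieldSleepMs);
          } catch (InterruptedException e) {
            interrupted = true;
          } finally {
            // Reacquire so the outer finally always unlocks a held lock.
            fsLock.writeLock().lock();
            acquiredAt = System.nanoTime();
          }
          if (interrupted) {
            Thread.currentThread().interrupt(); // preserve interrupt status
            break;                              // stop the scan early
          }
          // A real implementation would have to re-validate its iteration
          // state here, since other threads may have changed it meanwhile.
        }
      }
    } finally {
      fsLock.writeLock().unlock();
    }
    return scanned;
  }

  int getYieldCount() {
    return yieldCount;
  }

  boolean isWriteLockedByCurrentThread() {
    return fsLock.isWriteLockedByCurrentThread();
  }
}
```

The hard part in the real monitor is probably the re-validation step marked 
in the comment: once the lock is dropped, directives and cached blocks can 
change, so the scan cannot simply resume from where it stopped without 
checking.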

> CacheReplicationMonitor hold lock for long time and lead to NN out of service
> -----------------------------------------------------------------------------
>
>                 Key: HDFS-15273
>                 URL: https://issues.apache.org/jira/browse/HDFS-15273
>             Project: Hadoop HDFS
>          Issue Type: Improvement
>          Components: caching, namenode
>            Reporter: Xiaoqiao He
>            Assignee: Xiaoqiao He
>            Priority: Major
>
> CacheReplicationMonitor scans the Cache Directives and the Cached BlockMap 
> periodically. As more and more cache directives are added, 
> CacheReplicationMonitor takes a very long time to rescan all of the cache 
> directives and cached blocks. Meanwhile, the scan operation holds the global 
> write lock, so during the scan period the NameNode cannot process other 
> requests.
> So I think we should warn end users who turn on the CacheManager feature 
> about this risk before we improve this implementation.
> {code:java}
>   private void rescan() throws InterruptedException {
>     scannedDirectives = 0;
>     scannedBlocks = 0;
>     try {
>       namesystem.writeLock();
>       try {
>         lock.lock();
>         if (shutdown) {
>           throw new InterruptedException("CacheReplicationMonitor was " +
>               "shut down.");
>         }
>         curScanCount = completedScanCount + 1;
>       } finally {
>         lock.unlock();
>       }
>       resetStatistics();
>       rescanCacheDirectives();
>       rescanCachedBlockMap();
>       blockManager.getDatanodeManager().resetLastCachingDirectiveSentTime();
>     } finally {
>       namesystem.writeUnlock();
>     }
>   }
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
