[ 
https://issues.apache.org/jira/browse/HDFS-15963?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17364299#comment-17364299
 ] 

Kihwal Lee commented on HDFS-15963:
-----------------------------------

I've looked at heap dumps and can confirm the analysis by [~zhangshuyan]. 
One failed volume's reference was closed (2^30), but the count never went down 
to 0. As long as this volume is at the head of {{volumesBeingRemoved}}, 
additional volume failures cannot be handled, because the handler threads are 
all stuck looping forever, waiting for this volume to clear.
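
To illustrate the state seen in the heap dump, here is a minimal sketch, 
assuming the "closed" marker is a flag bit at 2^30 stored alongside the live 
reference count. The class and method names below are hypothetical stand-ins, 
not the actual HDFS classes.

```java
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical stand-in for a closeable reference count; not the HDFS class. */
class ClosableRefCountSketch {
  // Assumption: bit 30 marks "closed"; the lower bits hold the live count.
  static final int CLOSED_MASK = 1 << 30;

  private final AtomicInteger status = new AtomicInteger(0);

  /** Takes a reference; refuses once the closed bit is set. */
  boolean reference() {
    while (true) {
      int cur = status.get();
      if ((cur & CLOSED_MASK) != 0) {
        return false;
      }
      if (status.compareAndSet(cur, cur + 1)) {
        return true;
      }
    }
  }

  /** Releases one reference. */
  void unreference() {
    status.decrementAndGet();
  }

  /** Sets the closed bit; outstanding references are left untouched. */
  void setClosed() {
    status.getAndUpdate(s -> s | CLOSED_MASK);
  }

  /** Live reference count with the closed bit masked out. */
  int getReferenceCount() {
    return status.get() & ~CLOSED_MASK;
  }

  /** True only once every outstanding reference has been released. */
  boolean checkClosed() {
    return getReferenceCount() == 0;
  }

  public static void main(String[] args) {
    ClosableRefCountSketch volume = new ClosableRefCountSketch();
    volume.reference();   // reader A
    volume.reference();   // reader B
    volume.unreference(); // reader A releases; reader B leaks its reference
    volume.setClosed();   // the volume fails and is marked closed
    System.out.println(volume.getReferenceCount()); // 1
    System.out.println(volume.checkClosed());       // false
  }
}
```

With one leaked reference the count stays at 1 after the close, so any caller 
that polls {{checkClosed()}} in a loop never sees it return true.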

> Unreleased volume references cause an infinite loop
> ---------------------------------------------------
>
>                 Key: HDFS-15963
>                 URL: https://issues.apache.org/jira/browse/HDFS-15963
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: datanode
>            Reporter: Shuyan Zhang
>            Assignee: Shuyan Zhang
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 3.3.1, 3.4.0, 2.10.2
>
>         Attachments: HDFS-15963.001.patch, HDFS-15963.002.patch, 
> HDFS-15963.003.patch
>
>          Time Spent: 4.5h
>  Remaining Estimate: 0h
>
> When BlockSender throws an exception because the metadata cannot be found, 
> the volume reference obtained by the thread is not released, so the thread 
> trying to remove the volume waits in an infinite loop.
> {code:java}
> boolean checkVolumesRemoved() {
>   Iterator<FsVolumeImpl> it = volumesBeingRemoved.iterator();
>   while (it.hasNext()) {
>     FsVolumeImpl volume = it.next();
>     if (!volume.checkClosed()) {
>       return false;
>     }
>     it.remove();
>   }
>   return true;
> }
> boolean checkClosed() {
>   // With a leaked reference, this condition is always true.
>   if (this.reference.getReferenceCount() > 0) {
>     FsDatasetImpl.LOG.debug("The reference count for {} is {}, wait to be 0.",
>         this, reference.getReferenceCount());
>     return false;
>   }
>   return true;
> }
> {code}
> At the same time, because that thread holds checkDirsLock while removing 
> the volume, other threads trying to acquire the same lock are permanently 
> blocked.
> Similar problems also occur in RamDiskAsyncLazyPersistService and 
> FsDatasetAsyncDiskService.
> This patch releases the three previously unreleased volume references.
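
As a rough illustration of the kind of fix described above (not the actual 
patch), the sketch below releases a volume reference on the exception path 
before rethrowing. All names here ({{VolumeRefReleaseSketch}}, 
{{obtainReference}}, {{openSender}}) are hypothetical and only model the 
pattern.

```java
import java.io.Closeable;
import java.io.FileNotFoundException;
import java.io.IOException;
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical model of releasing a volume reference on failure. */
class VolumeRefReleaseSketch {
  // Stand-in for the volume's live reference count.
  final AtomicInteger refCount = new AtomicInteger();

  /** Takes a reference; closing the returned handle releases it. */
  Closeable obtainReference() {
    refCount.incrementAndGet();
    return () -> refCount.decrementAndGet();
  }

  /**
   * Models opening a sender. On success the caller owns the reference and
   * must close it. On failure the reference is released here, so the
   * exception does not leak it.
   */
  Closeable openSender(boolean metaMissing) throws IOException {
    Closeable ref = obtainReference();
    try {
      if (metaMissing) {
        throw new FileNotFoundException("meta file not found");
      }
      return ref;
    } catch (IOException e) {
      ref.close(); // release before propagating the failure
      throw e;
    }
  }

  public static void main(String[] args) throws IOException {
    VolumeRefReleaseSketch volume = new VolumeRefReleaseSketch();
    try {
      volume.openSender(true);
    } catch (IOException expected) {
      // The failure path has already released the reference.
    }
    System.out.println(volume.refCount.get()); // 0
  }
}
```

Without the {{ref.close()}} call in the catch block, the count would stay at 1 
after the failed open, which is the state that keeps {{checkClosed()}} 
returning false.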



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

