[jira] [Commented] (HDFS-10301) BlockReport retransmissions may lead to storages falsely being declared zombie if storage report processing happens out of order

Vinitha Reddy Gankidi (JIRA) Wed, 20 Jul 2016 17:55:08 -0700

    [ 
https://issues.apache.org/jira/browse/HDFS-10301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15386922#comment-15386922
 ]


Vinitha Reddy Gankidi commented on HDFS-10301:
----------------------------------------------

Thanks for the review [~liuml07]. I have attached a new patch (012) that 
addresses your comments.

> FSImage#isUpgradeFinalized is not volatile and 
> nn.getFSImage().isUpgradeFinalized() is not holding the read lock in 
> NameNodeRpcServer#blockReport(). Is this a problem? This is not very related 
> to this issue though.

My patch does not make any changes to the isUpgradeFinalized method. If this is 
a problem, we should open another JIRA to address it.

> If you’re gonna process exceptions thrown by the task, I think we don’t need 
> to return it explicitly as Callable.call()is permitted to throw checked 
> exceptions

Thanks for the good suggestion! I have modified the Callable.call() to return a 
DataNodeCommand and throw IOException. I don't explicitly catch the exception 
since junit will take care of it.

> I think we need to interpret the return value of the future.get()?

 future.get() returns DataNodeCommand which we don’t take care about and don’t 
need to interpret. 

> do you mean Assert.assertArrayEquals(storageInfos, 
> dnDescriptor.getStorageInfos());

Yes, thanks for that! I have made the change.

> We should add javadoc for STORAGE_REPORT as it’s not that straightforward 
> defined in BlockListAsLongsabstract class.

Added the doc

> assert (blockList.getNumberOfBlocks() == -1); I believe we don’t need to use 
> assert statement along with Assert.asserEquals()?

I changed the assert to Assert.assertEquals. However, the existing test does 
use assert as well {{assert(numBlocksReported >= expectedTotalBlockCount);}}

> Always use slf4j placeholder in the code as you are doing int he latest 
> patch. 

Thanks for the tip! I noticed that placeholders were not used consistently. I 
tried to maintain the logging style that was already used in that particular 
file. I have modified all the log messages in my patch to use placeholders 
wherever possible. Sl4j was not used in some places,  for instance in 
TestNameNodePrunesMissingStorages.

> I see unnecessary blank lines in the v11 patch.I see not addressed long line 
> checkstyle warnings in BlockManager

I noticed two blank lines in TestNameNodePrunesMissingStorages inv11 patch. I 
removed that. I do not see any checkstyle warnings.

> if (nn.getFSImage().isUpgradeFinalized() &&
    context.getTotalRpcs() == context.getCurRpc() + 1) {
  Set<String> storageIDsInBlockReport = new HashSet<>();

Combined as suggested.

> BPServiceActor.java Let’s make cmd final.

Since cmd was not final previously, I have left it unchanged. 



> BlockReport retransmissions may lead to storages falsely being declared 
> zombie if storage report processing happens out of order
> --------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: HDFS-10301
>                 URL: https://issues.apache.org/jira/browse/HDFS-10301
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: namenode
>    Affects Versions: 2.6.1
>            Reporter: Konstantin Shvachko
>            Assignee: Vinitha Reddy Gankidi
>            Priority: Critical
>         Attachments: HDFS-10301.002.patch, HDFS-10301.003.patch, 
> HDFS-10301.004.patch, HDFS-10301.005.patch, HDFS-10301.006.patch, 
> HDFS-10301.007.patch, HDFS-10301.008.patch, HDFS-10301.009.patch, 
> HDFS-10301.01.patch, HDFS-10301.010.patch, HDFS-10301.011.patch, 
> HDFS-10301.sample.patch, zombieStorageLogs.rtf
>
>
> When NameNode is busy a DataNode can timeout sending a block report. Then it 
> sends the block report again. Then NameNode while process these two reports 
> at the same time can interleave processing storages from different reports. 
> This screws up the blockReportId field, which makes NameNode think that some 
> storages are zombie. Replicas from zombie storages are immediately removed, 
> causing missing blocks.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org

[jira] [Commented] (HDFS-10301) BlockReport retransmissions may lead to storages falsely being declared zombie if storage report processing happens out of order

Reply via email to