[ https://issues.apache.org/jira/browse/HDFS-10301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15385120#comment-15385120 ]
Mingliang Liu commented on HDFS-10301:
--------------------------------------

Thanks for the patch, [~redvine]. I'm catching up on all the insightful discussions here and have learned a lot.

1. {{FSImage#isUpgradeFinalized}} is not volatile, and {{nn.getFSImage().isUpgradeFinalized()}} does not hold the read lock in {{NameNodeRpcServer#blockReport()}}. Is this a problem? It's not closely related to this issue, though.

2. {code:title=TestNameNodePrunesMissingStorages.java}
for (Future<IOException> future: futureList) {
  try {
    future.get();
  } catch (Exception e) {
    LOG.error("Processing block report failed due to {}", e);
  }
}
{code}
Do we need to interpret the return value of {{future.get()}}? If the intent is to handle exceptions thrown by the task, we don't need to return them explicitly: {{Callable.call()}} is permitted to throw checked exceptions, and they are propagated back to the calling thread wrapped in an {{ExecutionException}} IIRC (see the sketch after this comment).

3. {code:title=TestNameNodePrunesMissingStorages.java}
DatanodeStorageInfo[] newStorageInfos = dnDescriptor.getStorageInfos();
Assert.assertEquals(storageInfos.length, newStorageInfos.length);
for (int i = 0; i < storageInfos.length; i++) {
  Assert.assertTrue(storageInfos[i] == newStorageInfos[i]);
}
{code}
Do you mean
{code}
Assert.assertArrayEquals(storageInfos, dnDescriptor.getStorageInfos());
{code}

h6. Minor comments:
# We should add javadoc for {{STORAGE_REPORT}}, since its definition in the {{BlockListAsLongs}} abstract class is not that straightforward.
# {{assert (blockList.getNumberOfBlocks() == -1);}} I believe we don't need an assert statement alongside {{Assert.assertEquals()}}?
# Always use slf4j placeholders in the code, as you are doing in the latest patch. Specifically,
{code:title=BlockManager.java}
LOG.debug("Processing RPC with index " + context.getCurRpc()
    + " out of total " + context.getTotalRpcs() + " RPCs in "
    + "processReport 0x" + Long.toHexString(context.getReportId()));
{code}
We MUST use placeholders here to avoid string construction when the log level is INFO or above. Another example: {{LOG.info("Block pool id: " + blockPoolId);}} can be simplified to {{LOG.info("Block pool id: {}", blockPoolId);}}. For exceptions we don't need a placeholder if the exception is the last parameter, so {{LOG.error("Processing block report failed due to {}", e);}} can be {{LOG.error("Processing block report failed due to ", e);}}
# I see unnecessary blank lines in the v11 patch.
# There are unaddressed long-line checkstyle warnings in {{BlockManager}}.
# {code}
if (nn.getFSImage().isUpgradeFinalized()) {
  Set<String> storageIDsInBlockReport = new HashSet<>();
  if (context.getTotalRpcs() == context.getCurRpc() + 1) {
{code}
can be
{code}
if (nn.getFSImage().isUpgradeFinalized() &&
    context.getTotalRpcs() == context.getCurRpc() + 1) {
  Set<String> storageIDsInBlockReport = new HashSet<>();
{code}
# {code:title=BPServiceActor.java}
DatanodeCommand cmd;
if (...) {
  cmd = ...
} else {
  cmd = ...
}
{code}
Let's make {{cmd}} final.
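A minimal, self-contained sketch of the two points above (class name, thread-pool size, and messages are invented for illustration and are not from the patch): a {{Callable}} throws its checked exception directly, the exception surfaces from {{Future.get()}} wrapped in an {{ExecutionException}}, and slf4j parameterized logging defers message construction until the level is enabled.
{code:title=ExceptionPropagationSketch.java}
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

public class ExceptionPropagationSketch {
  private static final Logger LOG =
      LoggerFactory.getLogger(ExceptionPropagationSketch.class);

  public static void main(String[] args) throws InterruptedException {
    ExecutorService executor = Executors.newFixedThreadPool(2);
    List<Future<Void>> futures = new ArrayList<>();

    for (int i = 0; i < 2; i++) {
      final int reportIndex = i;
      // Callable.call() may throw a checked exception directly; there is
      // no need to return it as the task's result.
      futures.add(executor.submit(new Callable<Void>() {
        @Override
        public Void call() throws IOException {
          if (reportIndex == 1) {
            throw new IOException("simulated block report failure");
          }
          return null;
        }
      }));
    }

    for (Future<Void> future : futures) {
      try {
        future.get();
      } catch (ExecutionException e) {
        // The IOException thrown inside call() arrives here as the cause of
        // the ExecutionException. With slf4j, the throwable goes last with
        // no placeholder, so the stack trace is preserved.
        LOG.error("Processing block report failed", e.getCause());
      }
    }

    // Parameterized logging: the message is only assembled if DEBUG is
    // enabled, avoiding string concatenation on hot paths.
    long reportId = 0x1234L;
    LOG.debug("Processing RPC with index {} out of total {} RPCs in "
        + "processReport 0x{}", 3, 7, Long.toHexString(reportId));

    executor.shutdown();
  }
}
{code}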
> BlockReport retransmissions may lead to storages falsely being declared
> zombie if storage report processing happens out of order
> --------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: HDFS-10301
>                 URL: https://issues.apache.org/jira/browse/HDFS-10301
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: namenode
>    Affects Versions: 2.6.1
>            Reporter: Konstantin Shvachko
>            Assignee: Vinitha Reddy Gankidi
>            Priority: Critical
>         Attachments: HDFS-10301.002.patch, HDFS-10301.003.patch, HDFS-10301.004.patch, HDFS-10301.005.patch, HDFS-10301.006.patch, HDFS-10301.007.patch, HDFS-10301.008.patch, HDFS-10301.009.patch, HDFS-10301.01.patch, HDFS-10301.010.patch, HDFS-10301.011.patch, HDFS-10301.sample.patch, zombieStorageLogs.rtf
>
> When the NameNode is busy, a DataNode can time out sending a block report and then sends the block report again. While processing these two reports at the same time, the NameNode can interleave processing of storages from different reports. This screws up the blockReportId field, which makes the NameNode think that some storages are zombie. Replicas from zombie storages are immediately removed, causing missing blocks.
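To illustrate the race in the description above, a hypothetical sketch follows (this is not NameNode code; the class, the per-storage report-id map, and the pruning method are invented stand-ins): when two copies of the same report carry different report IDs and their per-storage processing interleaves, pruning against one report's ID can declare a just-reported storage zombie.
{code:title=ZombieStorageRaceSketch.java}
import java.util.HashMap;
import java.util.Map;

public class ZombieStorageRaceSketch {

  // Stand-in for the per-storage "last block report id" the NameNode tracks.
  static final Map<String, Long> lastReportId = new HashMap<>();

  static void processStorage(String storageId, long reportId) {
    lastReportId.put(storageId, reportId);
  }

  // At the end of a report, any storage not stamped with that report's id
  // is treated as zombie and its replicas would be removed.
  static void pruneZombies(long reportId) {
    lastReportId.forEach((storage, id) -> {
      if (id != reportId) {
        System.out.println("Storage " + storage
            + " falsely declared zombie (stamped with 0x"
            + Long.toHexString(id) + ")");
      }
    });
  }

  public static void main(String[] args) {
    long firstReport = 0x1L;  // original block report
    long retransmit = 0x2L;   // retransmission of the same report

    // Interleaved processing: DS1 is stamped by the retransmission,
    // DS2 by the original report.
    processStorage("DS1", retransmit);
    processStorage("DS2", firstReport);

    // Pruning runs against the retransmitted report's id, so DS2 looks
    // stale even though it was just reported.
    pruneZombies(retransmit);
  }
}
{code}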