[ 
https://issues.apache.org/jira/browse/HDFS-1767?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13012742#comment-13012742
 ] 

Hairong Kuang commented on HDFS-1767:
-------------------------------------

Matt, the committed patch cuts the non-initial block report processing time. 
But it does not remove the cost of sending/receiving the non-initial blcok 
reports as what you did in the first patch. Will you have a follow-up jira for 
removing it?

> Namenode should ignore non-initial block reports from datanodes when in 
> safemode during startup
> -----------------------------------------------------------------------------------------------
>
>                 Key: HDFS-1767
>                 URL: https://issues.apache.org/jira/browse/HDFS-1767
>             Project: Hadoop HDFS
>          Issue Type: Improvement
>          Components: data-node
>    Affects Versions: 0.20.204, 0.22.0, 0.23.0
>            Reporter: Matt Foley
>            Assignee: Matt Foley
>             Fix For: 0.23.0
>
>         Attachments: DelaySecondBR_v1.patch, DelaySecondBR_v2.patch, 
> DelaySecondBR_v3-0.20-security.patch, DelaySecondBR_v3.patch, table.csv, 
> table_tab.csv
>
>
> Consider a large cluster that takes 40 minutes to start up.  The datanodes 
> compete to register and send their Initial Block Reports (IBRs) as fast as 
> they can after startup (subject to a small sub-two-minute random delay, which 
> isn't relevant to this discussion).  
> As each datanode succeeds in sending its IBR, it schedules the starting time 
> for its regular cycle of reports, every hour (or other configured value of 
> dfs.blockreport.intervalMsec). In order to spread the reports evenly across 
> the block report interval, each datanode picks a random fraction of that 
> interval, for the starting point of its regular report cycle.  For example, 
> if a particular datanode ends up randomly selecting 18 minutes after the 
> hour, then that datanode will send a Block Report at 18 minutes after the 
> hour every hour as long as it remains up.  Other datanodes will start their 
> cycles at other randomly selected times.  This code is in 
> DataNode.blockReport() and DataNode.scheduleBlockReport().
> The "second Block Report" (2BR), is the start of these hourly reports.  The 
> problem is that some of these 2BRs get scheduled sooner rather than later, 
> and actually occur within the startup period.  For example, if the cluster 
> takes 40 minutes (2/3 of an hour) to start up, then out of the datanodes that 
> succeed in sending their IBRs during the first 10 minutes, between 1/2 and 
> 2/3 of them will send their 2BR before the 40-minute startup time has 
> completed!
> 2BRs sent within the startup time actually compete with the remaining IBRs, 
> and thereby slow down the overall startup process.  This can be seen in the 
> following data, which shows the startup process for a 3700-node cluster that 
> took about 17 minutes to finish startup:
> {noformat}
>       time    starts  sum   regs  sum   IBR  sum  2nd_BR sum total_BRs/min
> 0   1299799498  3042  3042  1969  1969  151   151          0  151
> 1   1299799558   665  3707  1470  3439  248   399          0  248
> 2   1299799618        3707   224  3663  270   669          0  270
> 3   1299799678        3707    14  3677  261   930    3     3  264
> 4   1299799738        3707    23  3700  288  1218    1     4  289
> 5   1299799798        3707     7  3707  258  1476    3     7  261
> 6   1299799858        3707        3707  317  1793    4    11  321
> 7   1299799918        3707        3707  292  2085    6    17  298
> 8   1299799978        3707        3707  292  2377    8    25  300
> 9   1299800038        3707        3707  272  2649         25  272
> 10  1299800098        3707        3707  280  2929   15    40  295
> 11  1299800158        3707        3707  223  3152   14    54  237
> 12  1299800218        3707        3707  143  3295         54  143
> 13  1299800278        3707        3707  141  3436   20    74  161
> 14  1299800338        3707        3707  195  3631   78   152  273
> 15  1299800398        3707        3707   51  3682  209   361  260
> 16  1299800458        3707        3707   25  3707  369   730  394
> 17  1299800518        3707        3707       3707  166   896  166
> 18  1299800578        3707        3707       3707   72   968   72
> 19  1299800638        3707        3707       3707   67  1035   67
> 20  1299800698        3707        3707       3707   75  1110   75
> 21  1299800758        3707        3707       3707   71  1181   71
> 22  1299800818        3707        3707       3707   67  1248   67
> 23  1299800878        3707        3707       3707   62  1310   62
> 24  1299800938        3707        3707       3707   56  1366   56
> 25  1299800998        3707        3707       3707   60  1426   60
> {noformat}
> This data was harvested from the startup logs of all the datanodes, and 
> correlated into one-minute buckets.  Each row of the table represents the 
> progress during one elapsed minute of clock time.  It seems that every 
> cluster startup is different, but this one showed the effect fairly well.
> The "starts" column shows that all the nodes started up within the first 2 
> minutes, and the "regs" column shows that all succeeded in registering by 
> minute 6.  The IBR column shows a sustained rate of Initial Block Report 
> processing of 250-300/minute for the first 10 minutes.
> The question is why, during minutes 11 through 16, the rate of IBR processing 
> slowed down.  Why didn't the startup just finish?  In the "2nd_BR" column, we 
> see the rate of 2BRs ramping up as more datanodes complete their IBRs.  As 
> the rate increases, they become more effective at competing with the IBRs, 
> and slow down the IBR processing even more.  After the IBRs finally finish in 
> minute 16, the rate of 2BRs settles down to a steady ~60-70/minute.
> In order to decrease competition for locks and other resources, to speed up 
> IBR processing during startup, we propose to delay 2BRs until later into the 
> cycle.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

Reply via email to