[ https://issues.apache.org/jira/browse/HDFS-1767?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hairong Kuang updated HDFS-1767:
--------------------------------

    Description: 
Consider a large cluster that takes 40 minutes to start up.  The datanodes 
compete to register and send their Initial Block Reports (IBRs) as fast as they 
can after startup (subject to a small sub-two-minute random delay, which isn't 
relevant to this discussion).  

As each datanode succeeds in sending its IBR, it schedules the starting time 
for its regular cycle of reports, sent every hour (or whatever value 
dfs.blockreport.intervalMsec is configured to). To spread the reports evenly 
across the block report interval, each datanode picks a random offset within 
that interval as the starting point of its regular report cycle.  For example, 
if a particular datanode randomly selects 18 minutes after the hour, it will 
send a Block Report at 18 minutes after the hour, every hour, for as long as 
it remains up.  Other datanodes will start their cycles at other randomly 
selected times.  This code is in DataNode.blockReport() and 
DataNode.scheduleBlockReport().
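
For readers unfamiliar with that logic, here is a minimal sketch of the 
scheduling behavior described above.  The class and member names are 
illustrative only, not the actual ones in DataNode.java:

{noformat}
import java.util.Random;

// Minimal sketch of the scheduling described above.  Names are
// illustrative; the real logic lives in DataNode.blockReport() and
// DataNode.scheduleBlockReport().
class BlockReportScheduler {
  private final long intervalMs;   // dfs.blockreport.intervalMsec (default 1 hour)
  private final Random rand = new Random();
  private long lastBlockReportMs;  // notional time of the previous report

  BlockReportScheduler(long intervalMs) {
    this.intervalMs = intervalMs;
  }

  // Called once, right after the Initial Block Report succeeds.
  // Back-dating the "last" report by a random fraction of the interval
  // makes the next report fire at a uniformly random point within it,
  // which spreads the cluster's regular reports evenly across the hour.
  void scheduleAfterInitialReport(long nowMs) {
    lastBlockReportMs = nowMs - (long) (rand.nextDouble() * intervalMs);
  }

  // True when the next regular report (the 2BR, then 3BR, ...) is due.
  boolean isBlockReportDue(long nowMs) {
    return nowMs - lastBlockReportMs >= intervalMs;
  }
}
{noformat}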

The "second Block Report" (2BR), is the start of these hourly reports.  The 
problem is that some of these 2BRs get scheduled sooner rather than later, and 
actually occur within the startup period.  For example, if the cluster takes 40 
minutes (2/3 of an hour) to start up, then out of the datanodes that succeed in 
sending their IBRs during the first 10 minutes, between 1/2 and 2/3 of them 
will send their 2BR before the 40-minute startup time has completed!
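
That range is easy to check numerically.  The following self-contained snippet 
(illustrative only, with the 10/40/60-minute parameters of the example above 
hard-coded) simulates the scheduling and counts how many 2BRs land inside the 
startup window:

{noformat}
import java.util.Random;

// Monte Carlo check of the 1/2-to-2/3 claim: IBRs land uniformly in the
// first 10 minutes, and each 2BR follows its IBR by a uniform random
// offset within the 60-minute interval.
public class SecondReportOverlap {
  public static void main(String[] args) {
    final double ibrWindowMin = 10, startupMin = 40, intervalMin = 60;
    final Random r = new Random();
    final int trials = 1_000_000;
    int inStartup = 0;
    for (int i = 0; i < trials; i++) {
      double ibr = r.nextDouble() * ibrWindowMin;          // IBR time
      double second = ibr + r.nextDouble() * intervalMin;  // 2BR time
      if (second < startupMin) inStartup++;
    }
    // Expect ~0.583, the average of 40/60 (IBR at t=0) and 30/60 (t=10).
    System.out.printf("fraction of 2BRs inside startup: %.3f%n",
        (double) inStartup / trials);
  }
}
{noformat}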

2BRs sent within the startup time actually compete with the remaining IBRs, and 
thereby slow down the overall startup process.  This can be seen in the 
following data, which shows the startup process for a 3700-node cluster that 
took about 17 minutes to finish startup:

{noformat}
min   epoch_time  starts  sum   regs  sum   IBR  sum  2nd_BR sum total_BRs/min
0   1299799498  3042  3042  1969  1969  151   151          0  151
1   1299799558   665  3707  1470  3439  248   399          0  248
2   1299799618        3707   224  3663  270   669          0  270
3   1299799678        3707    14  3677  261   930    3     3  264
4   1299799738        3707    23  3700  288  1218    1     4  289
5   1299799798        3707     7  3707  258  1476    3     7  261
6   1299799858        3707        3707  317  1793    4    11  321
7   1299799918        3707        3707  292  2085    6    17  298
8   1299799978        3707        3707  292  2377    8    25  300
9   1299800038        3707        3707  272  2649         25  272
10  1299800098        3707        3707  280  2929   15    40  295
11  1299800158        3707        3707  223  3152   14    54  237
12  1299800218        3707        3707  143  3295         54  143
13  1299800278        3707        3707  141  3436   20    74  161
14  1299800338        3707        3707  195  3631   78   152  273
15  1299800398        3707        3707   51  3682  209   361  260
16  1299800458        3707        3707   25  3707  369   730  394
17  1299800518        3707        3707       3707  166   896  166
18  1299800578        3707        3707       3707   72   968   72
19  1299800638        3707        3707       3707   67  1035   67
20  1299800698        3707        3707       3707   75  1110   75
21  1299800758        3707        3707       3707   71  1181   71
22  1299800818        3707        3707       3707   67  1248   67
23  1299800878        3707        3707       3707   62  1310   62
24  1299800938        3707        3707       3707   56  1366   56
25  1299800998        3707        3707       3707   60  1426   60
{noformat}

This data was harvested from the startup logs of all the datanodes and 
correlated into one-minute buckets.  Each row of the table represents the 
progress during one elapsed minute of clock time.  Every cluster startup is a 
little different, but this one shows the effect fairly well.

The "starts" column shows that all the nodes started up within the first 2 
minutes, and the "regs" column shows that all succeeded in registering by 
minute 6.  The IBR column shows a sustained rate of Initial Block Report 
processing of 250-300/minute for the first 10 minutes.

The question is why, during minutes 11 through 16, the rate of IBR processing 
slowed down.  Why didn't the startup just finish?  In the "2nd_BR" column, we 
see the rate of 2BRs ramping up as more datanodes complete their IBRs.  As the 
rate increases, they become more effective at competing with the IBRs, and slow 
down the IBR processing even more.  After the IBRs finally finish in minute 16, 
the rate of 2BRs settles down to a steady ~60-70/minute.

To reduce competition for locks and other resources, and thereby speed up IBR 
processing during startup, we propose delaying 2BRs until later in the report 
cycle.
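
For illustration, one possible shape of such a delay -- a sketch under stated 
assumptions, not the attached DelaySecondBR_v1.patch -- is to keep the 
randomized offset but bound it away from the IBR, replacing 
scheduleAfterInitialReport() in the sketch above:

{noformat}
// Sketch only; NOT the attached patch.  Draw the 2BR offset from
// [interval/2, interval) instead of [0, interval): the second report
// is still randomized for spread, but can never fire sooner than half
// an interval after the IBR.
void scheduleAfterInitialReport(long nowMs) {
  long offsetMs = intervalMs / 2
      + (long) (rand.nextDouble() * (intervalMs / 2));
  // The next report becomes due at (nowMs + offsetMs).
  lastBlockReportMs = nowMs - (intervalMs - offsetMs);
}
{noformat}

With a one-hour interval, this would guarantee at least 30 minutes between a 
datanode's IBR and its 2BR, which would have kept all 2BRs out of the 
17-minute startup window shown above.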



> Delay second Block Reports until after cluster finishes startup, to improve 
> startup times
> -----------------------------------------------------------------------------------------
>
>                 Key: HDFS-1767
>                 URL: https://issues.apache.org/jira/browse/HDFS-1767
>             Project: Hadoop HDFS
>          Issue Type: Improvement
>          Components: data-node
>    Affects Versions: 0.22.0
>            Reporter: Matt Foley
>            Assignee: Matt Foley
>             Fix For: 0.23.0
>
>         Attachments: DelaySecondBR_v1.patch, table.csv, table_tab.csv
