[ https://issues.apache.org/jira/browse/HDFS-1767?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13012738#comment-13012738 ]
Matt Foley commented on HDFS-1767: ---------------------------------- The findbugs warning is spurious - it is in FSDataset, which was not changed by this patch. Regarding the core and contrib test failures, the same comments apply as above (see https://issues.apache.org/jira/browse/HDFS-1767?focusedCommentId=13012616&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13012616) > Namenode should ignore non-initial block reports from datanodes when in > safemode during startup > ----------------------------------------------------------------------------------------------- > > Key: HDFS-1767 > URL: https://issues.apache.org/jira/browse/HDFS-1767 > Project: Hadoop HDFS > Issue Type: Improvement > Components: data-node > Affects Versions: 0.20.204, 0.22.0, 0.23.0 > Reporter: Matt Foley > Assignee: Matt Foley > Fix For: 0.23.0 > > Attachments: DelaySecondBR_v1.patch, DelaySecondBR_v2.patch, > DelaySecondBR_v3-0.20-security.patch, DelaySecondBR_v3.patch, table.csv, > table_tab.csv > > > Consider a large cluster that takes 40 minutes to start up. The datanodes > compete to register and send their Initial Block Reports (IBRs) as fast as > they can after startup (subject to a small sub-two-minute random delay, which > isn't relevant to this discussion). > As each datanode succeeds in sending its IBR, it schedules the starting time > for its regular cycle of reports, every hour (or other configured value of > dfs.blockreport.intervalMsec). In order to spread the reports evenly across > the block report interval, each datanode picks a random fraction of that > interval, for the starting point of its regular report cycle. For example, > if a particular datanode ends up randomly selecting 18 minutes after the > hour, then that datanode will send a Block Report at 18 minutes after the > hour every hour as long as it remains up. Other datanodes will start their > cycles at other randomly selected times. This code is in > DataNode.blockReport() and DataNode.scheduleBlockReport(). > The "second Block Report" (2BR), is the start of these hourly reports. The > problem is that some of these 2BRs get scheduled sooner rather than later, > and actually occur within the startup period. For example, if the cluster > takes 40 minutes (2/3 of an hour) to start up, then out of the datanodes that > succeed in sending their IBRs during the first 10 minutes, between 1/2 and > 2/3 of them will send their 2BR before the 40-minute startup time has > completed! > 2BRs sent within the startup time actually compete with the remaining IBRs, > and thereby slow down the overall startup process. This can be seen in the > following data, which shows the startup process for a 3700-node cluster that > took about 17 minutes to finish startup: > {noformat} > time starts sum regs sum IBR sum 2nd_BR sum total_BRs/min > 0 1299799498 3042 3042 1969 1969 151 151 0 151 > 1 1299799558 665 3707 1470 3439 248 399 0 248 > 2 1299799618 3707 224 3663 270 669 0 270 > 3 1299799678 3707 14 3677 261 930 3 3 264 > 4 1299799738 3707 23 3700 288 1218 1 4 289 > 5 1299799798 3707 7 3707 258 1476 3 7 261 > 6 1299799858 3707 3707 317 1793 4 11 321 > 7 1299799918 3707 3707 292 2085 6 17 298 > 8 1299799978 3707 3707 292 2377 8 25 300 > 9 1299800038 3707 3707 272 2649 25 272 > 10 1299800098 3707 3707 280 2929 15 40 295 > 11 1299800158 3707 3707 223 3152 14 54 237 > 12 1299800218 3707 3707 143 3295 54 143 > 13 1299800278 3707 3707 141 3436 20 74 161 > 14 1299800338 3707 3707 195 3631 78 152 273 > 15 1299800398 3707 3707 51 3682 209 361 260 > 16 1299800458 3707 3707 25 3707 369 730 394 > 17 1299800518 3707 3707 3707 166 896 166 > 18 1299800578 3707 3707 3707 72 968 72 > 19 1299800638 3707 3707 3707 67 1035 67 > 20 1299800698 3707 3707 3707 75 1110 75 > 21 1299800758 3707 3707 3707 71 1181 71 > 22 1299800818 3707 3707 3707 67 1248 67 > 23 1299800878 3707 3707 3707 62 1310 62 > 24 1299800938 3707 3707 3707 56 1366 56 > 25 1299800998 3707 3707 3707 60 1426 60 > {noformat} > This data was harvested from the startup logs of all the datanodes, and > correlated into one-minute buckets. Each row of the table represents the > progress during one elapsed minute of clock time. It seems that every > cluster startup is different, but this one showed the effect fairly well. > The "starts" column shows that all the nodes started up within the first 2 > minutes, and the "regs" column shows that all succeeded in registering by > minute 6. The IBR column shows a sustained rate of Initial Block Report > processing of 250-300/minute for the first 10 minutes. > The question is why, during minutes 11 through 16, the rate of IBR processing > slowed down. Why didn't the startup just finish? In the "2nd_BR" column, we > see the rate of 2BRs ramping up as more datanodes complete their IBRs. As > the rate increases, they become more effective at competing with the IBRs, > and slow down the IBR processing even more. After the IBRs finally finish in > minute 16, the rate of 2BRs settles down to a steady ~60-70/minute. > In order to decrease competition for locks and other resources, to speed up > IBR processing during startup, we propose to delay 2BRs until later into the > cycle. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira