[ https://issues.apache.org/jira/browse/HADOOP-2012?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12539179 ]
Sameer Paranjpye commented on HADOOP-2012:
------------------------------------------

Why not have a scan period only? The scan period defines a window in which every block that exists at the beginning of the window will be examined (barring blocks that are deleted). A Datanode would construct a schedule for examining blocks in a scan period, with the least recently examined blocks going first. New blocks would be scheduled in the next window.

The schedule could be constructed by dividing a window into _n_ intervals of length _scanperiod/n_, one interval per block, where _n_ is the number of blocks present at the start of the window. A Datanode would determine how much bandwidth it needs to scan a block based on when the next block is scheduled.

This would guarantee that every block that exists at the beginning of a scan period is examined once in that scan period. It would also guarantee an upper bound of 2*scan period between two scans of a given block. This is also an upper bound on the time that elapses before a new block is scanned. In both cases, the elapsed time will, in the average case, be close to one scan period, and will approach 2*scan period if a large number of blocks are added in a window. These seem like reasonable guarantees.

It would make sense to have a reasonable upper bound on the bandwidth used for scanning, and to emit a warning if this is not enough to examine all blocks in a scan period. That way, if someone sets a scan period of 1 minute or something else silly, the Datanode doesn't spend all its time scanning.

> Periodic verification at the Datanode
> -------------------------------------
>
>                 Key: HADOOP-2012
>                 URL: https://issues.apache.org/jira/browse/HADOOP-2012
>             Project: Hadoop
>          Issue Type: New Feature
>          Components: dfs
>            Reporter: Raghu Angadi
>            Assignee: Raghu Angadi
>             Fix For: 0.16.0
>
>         Attachments: HADOOP-2012.patch, HADOOP-2012.patch, HADOOP-2012.patch, HADOOP-2012.patch
>
>
> Currently on-disk data corruption on data blocks is detected only when the data is
> read by the client or by another datanode. These errors would be detected much
> earlier if the datanode could periodically verify the data checksums for the local
> blocks.
> Some of the issues to consider:
> - How often should we check the blocks (no more often than once every couple of
> weeks?)
> - How do we keep track of when a block was last verified (there is a .meta
> file associated with each block).
> - What action to take once a corruption is detected
> - Scanning should be done at a very low priority, with the rest of the datanode
> disk traffic in mind.
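
For illustration only, the scan-period schedule proposed in the comment above could be sketched roughly as follows. This is not code from any HADOOP-2012 patch: the class ScanSchedule, its nested Block and Slot types, and the methods build and exceedsCap are all hypothetical names, and the sketch assumes block sizes and last-scan times are already known at the start of the window.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

/** Hypothetical sketch of the proposed scan-period schedule; not the actual patch. */
final class ScanSchedule {

    /** A block known at the start of the window (hypothetical type). */
    static final class Block {
        final long id;
        final long sizeBytes;
        final long lastScanMillis; // 0 if the block has never been scanned

        Block(long id, long sizeBytes, long lastScanMillis) {
            this.id = id;
            this.sizeBytes = sizeBytes;
            this.lastScanMillis = lastScanMillis;
        }
    }

    /** One schedule entry: which block to scan, when to start, and at what rate. */
    static final class Slot {
        final long blockId;
        final long startMillis;
        final double bytesPerSec;

        Slot(long blockId, long startMillis, double bytesPerSec) {
            this.blockId = blockId;
            this.startMillis = startMillis;
            this.bytesPerSec = bytesPerSec;
        }
    }

    /**
     * Divides the window into n equal intervals of scanPeriod/n, one per block,
     * with the least recently scanned block going first.
     */
    static List<Slot> build(List<Block> blocks, long windowStartMillis, long scanPeriodMillis) {
        List<Block> ordered = new ArrayList<>(blocks);
        ordered.sort(Comparator.comparingLong((Block b) -> b.lastScanMillis));

        List<Slot> slots = new ArrayList<>();
        int n = ordered.size();
        if (n == 0) {
            return slots; // nothing to scan in this window
        }
        double intervalMillis = (double) scanPeriodMillis / n;
        for (int i = 0; i < n; i++) {
            Block b = ordered.get(i);
            long start = windowStartMillis + (long) (i * intervalMillis);
            // Rate needed to finish this block before the next block's slot begins.
            double rate = b.sizeBytes / (intervalMillis / 1000.0);
            slots.add(new Slot(b.id, start, rate));
        }
        return slots;
    }

    /** Returns true if any slot needs more than the cap, i.e. a warning should be emitted. */
    static boolean exceedsCap(List<Slot> slots, double maxBytesPerSec) {
        for (Slot s : slots) {
            if (s.bytesPerSec > maxBytesPerSec) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        List<Block> blocks = Arrays.asList(
                new Block(1, 1000, 300), new Block(2, 2000, 100), new Block(3, 4000, 200));
        List<Slot> slots = build(blocks, 0L, 3000L);
        for (Slot s : slots) {
            System.out.println(s.blockId + " @ " + s.startMillis + " ms, " + s.bytesPerSec + " B/s");
        }
        if (exceedsCap(slots, 3000.0)) {
            System.out.println("WARN: bandwidth cap too low to scan all blocks in one period");
        }
    }
}
```

Blocks added during a window are simply absent from the list passed to build and would be picked up when the next window's schedule is constructed, which is where the 2*scan period bound comes from.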