[ 
https://issues.apache.org/jira/browse/HDFS-13818?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16591765#comment-16591765
 ] 

Adam Antal commented on HDFS-13818:
-----------------------------------

Thanks for the review, [~gabor.bota]. I’ll incorporate the changes into the 
next patch.

As for the broader questions: PBImageTextWriter assumes that the fsimage is 
not corrupted. When a parent INode is not found, the INode is considered to be 
a Reference (so it originates from the INodeReferenceSection). In a corrupted 
image, however, a non-reference INode’s parent can be missing as well, and such 
an INode is mistakenly counted among the snapshots. I tried to overcome this by 
outputting the missing INodes in afterOutput(), after the original output() 
function, but some details (like the parentPath) are not written out, although 
the data is available. This needs further work: the cases where 
IgnoreSnapshotException is thrown probably have to be split to distinguish real 
snapshots from corruptions. After this change we should have a clearer view of 
the functionality, and I can start working on the memory footprint and other 
questions.
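
To make the proposed split concrete, here is a minimal, self-contained sketch 
of the classification idea. It is not the actual PBImageTextWriter code: the 
class name, the Kind enum, the classify() method and its parameters are all 
hypothetical stand-ins for data the OIV already loads (the INode ids, the 
child-to-parent mapping, and the ids listed in the INodeReferenceSection).

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

/** Hypothetical sketch only; none of these names exist in Hadoop. */
public class MissingParentClassifier {

  /** Outcome for an INode whose parent could not be resolved. */
  public enum Kind { SNAPSHOT_REFERENCE, CORRUPT }

  /**
   * Classifies every INode whose parent is missing.
   *
   * @param parentOf     child INode id -> parent INode id (assumed input)
   * @param knownIds     ids of all INodes present in the image (assumed input)
   * @param referenceIds ids taken from the reference section (assumed input)
   * @param rootId       id of the root INode, which legitimately has no parent
   */
  public static Map<Long, Kind> classify(Map<Long, Long> parentOf,
                                         Set<Long> knownIds,
                                         Set<Long> referenceIds,
                                         long rootId) {
    Map<Long, Kind> result = new TreeMap<>();
    for (Long id : knownIds) {
      if (id == rootId) {
        continue; // the root has no parent by definition
      }
      Long parent = parentOf.get(id);
      boolean parentMissing = parent == null || !knownIds.contains(parent);
      if (!parentMissing) {
        continue; // parent resolves normally, nothing to report
      }
      // Only a real reference may be skipped as a snapshot entry;
      // any other INode with a missing parent points to corruption.
      result.put(id, referenceIds.contains(id)
          ? Kind.SNAPSHOT_REFERENCE
          : Kind.CORRUPT);
    }
    return result;
  }

  public static void main(String[] args) {
    Map<Long, Long> parentOf = new HashMap<>();
    parentOf.put(2L, 1L);   // regular child of the root
    parentOf.put(3L, 99L);  // parent id 99 is not in the image
    Set<Long> knownIds = new HashSet<>(Arrays.asList(1L, 2L, 3L, 4L));
    Set<Long> referenceIds = new HashSet<>(Arrays.asList(4L));
    System.out.println(classify(parentOf, knownIds, referenceIds, 1L));
  }
}
```

In this toy input, INode 3 is reported as CORRUPT because its parent is 
missing and it is not a reference, while INode 4 is reported as 
SNAPSHOT_REFERENCE. The real change would have to make the same distinction at 
the places where IgnoreSnapshotException is thrown today.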

Thanks for the offline discussion regarding the tests, [~zvenczel]. As I see 
it, for functional testing it is sufficient to amend the existing tests. Going 
into more detail, though, it would be reasonable to also have tests for 
PBImageDelimitedTextWriter, the existing Delimited processor, since both extend 
the same core. This may require extra work, and another jira issue could 
address it. I’ll add some unit tests anyway, but for the sake of completeness I 
wonder whether we should do this or not.

I’ll start working on the doc as well, once the open points have been cleared 
up. I also uploaded some documentation for the existing functionality, and I 
may extend it when uploading newer patches.

> Extend OIV to detect FSImage corruption
> ---------------------------------------
>
>                 Key: HDFS-13818
>                 URL: https://issues.apache.org/jira/browse/HDFS-13818
>             Project: Hadoop HDFS
>          Issue Type: Improvement
>          Components: hdfs
>            Reporter: Adam Antal
>            Assignee: Adam Antal
>            Priority: Major
>         Attachments: HDFS-13818.001.patch
>
>
> A follow-up Jira for HDFS-13031: an improvement of the OIV is suggested for 
> detecting corruptions like HDFS-13101 in an offline way.
> The reasoning is the following. Apart from a NN startup throwing the error, 
> there is nothing at the customer's disposal that could tell them whether the 
> FSImage is good or corrupted.
> Although real full checking of the FSImage is only possible by the NN, for 
> stack traces associated with the observed corruption cases the solution of 
> putting up a tertiary NN is a little bit of overkill. The OIV would be a 
> handy choice, as it already has functionality like loading the fsimage and 
> constructing the folder structure; we just have to add the option of 
> detecting the null INodes. For example, the Delimited OIV processor can 
> already use an on-disk MetadataMap, which reduces memory consumption. There 
> may also be room for parallelizing: iterating through INodes, for example, 
> could be distributed, increasing efficiency, and we wouldn't need a 
> high-memory, high-CPU setup just for checking the FSImage.
> The suggestion is to add a --detectCorruption option to the OIV which would 
> check the FSImage for consistency.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org