[ 
https://issues.apache.org/jira/browse/HDFS-13818?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16582536#comment-16582536
 ] 

Adam Antal commented on HDFS-13818:
-----------------------------------

Thanks for looking into this, [~arpitagarwal].

Firstly, regarding the justification of the OIV method:

I agree that the easiest way to check whether the NN would fail when loading 
the fsimage is to actually load it. The OIV, by contrast, is an alternative 
that guards against corruption in an _offline_ way, and in particular it lets 
you run the check on a lightweight node, independently of the cluster.

As you wrote, full checking (in an offline way) would have to replicate the 
same code paths that the NN runs during startup. Starting a modified NN 
process or calling modified functions from that path could require a lot of 
work and cause further problems, so I don't see that route as justified; in 
that case the best option is to bring up a new NN. As I see it, the 
OIV-detectCorruption utility should not aim at full checking, but rather at 
looking for the known corruption cases. I came to this conclusion in 
HDFS-13031, where I also added some further points about its practicality.

Secondly, by corruption I mainly mean failures with stack traces like the one 
in HDFS-9406, which occur while the fsimage _is being loaded_, not after it 
has been loaded successfully. Given a bad fsimage, there is currently no way 
to detect the corruption other than starting a NN.

And thirdly, in my opinion you can target any of the following to handle the 
case:
 # The FSImage writer, so that it does not produce a corrupted image
 # The FSImage reader, so that it detects the corruption during read
 # An independent checker that can check a written fsimage at any time

The ideal is to prevent a corrupted image from being written at all (the 
first option), which is what HDFS-13314 went for; my solution is just another 
safety layer, following the third option.
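
As an illustration of the third option, a stand-alone checker could be 
sketched roughly as follows. This is not the actual OIV code: the class and 
method names are hypothetical, and it assumes the image has already been 
parsed into two simple maps (inode id to name, and directory id to child ids). 
It only flags directory entries that refer to an inode id that is not present, 
which is one possible reading of the "null INode" case.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical stand-alone checker; an illustration only, not OIV code. */
class DanglingInodeCheck {

    /**
     * Returns one message for every directory id and every child id that
     * has no matching entry in the inode map.
     *
     * @param inodes   assumed mapping: inode id to inode name
     * @param children assumed mapping: directory inode id to child inode ids
     */
    static List<String> findCorruption(Map<Long, String> inodes,
                                       Map<Long, List<Long>> children) {
        List<String> problems = new ArrayList<>();
        for (Map.Entry<Long, List<Long>> dir : children.entrySet()) {
            long parentId = dir.getKey();
            // A directory entry whose own inode is missing is suspicious.
            if (!inodes.containsKey(parentId)) {
                problems.add("missing parent inode " + parentId);
            }
            // Every referenced child must resolve to a known inode.
            for (long childId : dir.getValue()) {
                if (!inodes.containsKey(childId)) {
                    problems.add("missing child inode " + childId
                            + " under parent " + parentId);
                }
            }
        }
        return problems;
    }

    public static void main(String[] args) {
        Map<Long, String> inodes = new LinkedHashMap<>();
        inodes.put(1L, "/");
        inodes.put(2L, "a");

        // Directory 1 refers to inode 3, which does not exist.
        Map<Long, List<Long>> children = new LinkedHashMap<>();
        children.put(1L, Arrays.asList(2L, 3L));

        for (String problem : findCorruption(inodes, children)) {
            System.out.println(problem);
        }
    }
}
```

A check of this kind only reports known inconsistency patterns; it does not 
replace the full validation that a NN startup performs.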

Although the safest option is the NN startup, I still believe the OIV is 
worth a shot. What is your opinion about this?

> Extend OIV to detect FSImage corruption
> ---------------------------------------
>
>                 Key: HDFS-13818
>                 URL: https://issues.apache.org/jira/browse/HDFS-13818
>             Project: Hadoop HDFS
>          Issue Type: Improvement
>          Components: hdfs
>            Reporter: Adam Antal
>            Assignee: Adam Antal
>            Priority: Major
>
> A follow-up Jira for HDFS-13031: an improvement of the OIV is suggested for 
> detecting corruptions like HDFS-13101 in an offline way.
> The reasoning is the following. Apart from a NN startup throwing the error, 
> there is nothing in the customer's hands that could tell them whether the 
> FSImage is good or corrupted.
> Although a real full check of the FSImage is only possible in the NN, for 
> the stack traces associated with the observed corruption cases, putting up 
> a tertiary NN is a bit of an overkill. The OIV would be a handy choice: it 
> already has functionality like loading the fsimage and constructing the 
> folder structure, so we just have to add the option of detecting the null 
> INodes. For example, the Delimited OIV processor can already use an on-disk 
> MetadataMap, which reduces memory consumption. There may also be room for 
> parallelization: iterating through the INodes, for example, could be 
> distributed, increasing efficiency, and we wouldn't need a high-memory, 
> high-CPU setup just to check the FSImage.
> The suggestion is to add a --detectCorruption option to the OIV which would 
> check the FSImage for consistency.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)