[ https://issues.apache.org/jira/browse/HDFS-13818?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16582536#comment-16582536 ]
Adam Antal commented on HDFS-13818:
-----------------------------------

Thanks for looking into this, [~arpitagarwal].

Firstly, on the justification of the OIV method: I agree that the easiest way to check whether the NN would fail when loading the fsimage is to actually load it. In contrast, the OIV is an alternative that protects against corruption in an _offline_ way, in particular because the check can run on a lightweight node independent of the cluster. As you wrote, full checking (in an offline way) would have to replicate the same code paths that the NN runs during startup. Starting a modified NN process, or calling modified functions from that path, could require a lot of work and cause further problems, so I don't see that track as justified; in that case the best option is to put up a new NN. As I see it, the OIV-detectCorruption utility should not aim at full checking, but rather provide a way to look for the known corruption cases. I came to this conclusion in HDFS-13031, where I also added some other points about its practicality.

Secondly, by corruption I mainly mean failures with stack traces like the one in HDFS-9406: those that occur while the fsimage _is being loaded_, not after it has been loaded successfully. Given a bad fsimage, there is currently no way to detect the corruption other than to start a NN.

And thirdly, in my opinion you can target any of the following to handle the case:
# The FSImage writer, so that it does not produce a corrupted image
# The FSImage reader, so that it detects the corruption during read
# An independent checker that can check a written fsimage at any time

The optimal solution would be to prevent writing a corrupted image (the first option), and HDFS-13314 also went for the first one, but my solution is just another safety layer following the third option. Although the safest option is the NN startup, I still believe the OIV is worth a shot. What is your opinion about this?
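To illustrate the third option, here is a minimal sketch of the kind of consistency check an independent checker might run. It is not the OIV code and does not read a real fsimage: the function `detect_corruption`, the `Corruption` record, and the simplified inputs (a set of INode ids plus a map from directory id to child ids) are all assumptions made for illustration. The idea is only that a directory entry pointing at an INode that does not exist is the sort of known corruption case worth flagging offline.

```python
from collections import namedtuple

# Hypothetical record type for one finding; not part of Hadoop.
Corruption = namedtuple("Corruption", ["kind", "parent_id", "child_id"])


def detect_corruption(inodes, dir_children):
    """Report directory entries that reference missing INodes.

    inodes:       set of INode ids present in the (simplified) image
    dir_children: dict mapping a directory INode id to its child ids
    Both inputs are stand-ins for data a real checker would parse
    from the fsimage; the actual on-disk layout is not modelled here.
    """
    problems = []
    for parent_id, children in sorted(dir_children.items()):
        # The directory itself must exist as an INode.
        if parent_id not in inodes:
            problems.append(Corruption("missing_parent", parent_id, None))
        # Every child listed under the directory must exist as well.
        for child_id in children:
            if child_id not in inodes:
                problems.append(Corruption("missing_child", parent_id, child_id))
    return problems


# Toy image: INode 4 is referenced but absent, and directory 5 does not exist.
inodes = {1, 2, 3}
dir_children = {1: [2, 3, 4], 5: [3]}
report = detect_corruption(inodes, dir_children)
for problem in report:
    print(problem)
```

A check like this needs only the id set and the directory map, so it could run on a small node without the memory a full NN startup requires.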
> Extend OIV to detect FSImage corruption
> ---------------------------------------
>
>          Key: HDFS-13818
>          URL: https://issues.apache.org/jira/browse/HDFS-13818
>      Project: Hadoop HDFS
>   Issue Type: Improvement
>   Components: hdfs
>     Reporter: Adam Antal
>     Assignee: Adam Antal
>     Priority: Major
>
> A follow-up Jira for HDFS-13031: an improvement of the OIV is suggested for
> detecting corruptions like HDFS-13101 in an offline way.
> The reasoning is the following. Apart from a NN startup throwing the error,
> there is nothing in the customer's hands that could reassure him/her whether
> the FSImage is good or corrupted.
> Although real full checking of the FSImage is only possible by the NN, for
> stack traces associated with the observed corruption cases the solution of
> putting up a tertiary NN is a bit of an overkill. The OIV would be a handy
> choice, as it already has functionality like loading the fsimage and
> constructing the folder structure; we just have to add the option of
> detecting the null INodes. For example, the Delimited OIV processor can
> already use an on-disk MetadataMap, which reduces memory consumption. There
> may also be room for parallelizing: iterating through INodes, for example,
> could be done in a distributed way, increasing efficiency, and we wouldn't
> need a high-memory, high-CPU setup just for checking the FSImage.
> The suggestion is to add a --detectCorruption option to the OIV which would
> check the FSImage for consistency.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)