[ 
https://issues.apache.org/jira/browse/HDFS-1073?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12855096#action_12855096
 ] 

Todd Lipcon commented on HDFS-1073:
-----------------------------------

bq. The serial numbering of the files solution requires that checkpoints occur 
only at edits split boundaries.

Yes, but since we can split edits at will, I don't think there's any problem 
with having the backupnode ask the active NN to roll whenever the BN would 
like to do a checkpoint. The nice thing about this is that an image file from 
the BN can be lined up exactly with the corresponding edit logs from the NN.

bq. The transaction ID one does not have that restriction but it does require 
that in order to detect a gap in edits one has to look inside the logs. The 
txId one can avoid that if we are prepared to rename the edits log when you 
split (roll) it (Ugh!)

Agreed re ugh! The renaming is the complexity we're trying to avoid, no?

bq. The txId numbering scheme also has the advantage that multiple backups can 
roll and do checkpoints independently (we DO NOT want to do that as it will 
confuse the operators – but it shows that the design is very robust).

I still think this is possible with sequential numbering. And I agree that not 
confusing operators is a key design goal for this JIRA. The whole image/edit 
log thing in normal operation should be an implementation detail, and when 
operators have to look at it they're usually very stressed out because a 
cluster is corrupt - so we want to make it _very_ clear what's going on, and 
_very_ hard to create any state that is unrecoverable.

--

I've started working on this patch and it's coming along nicely. The NN and 
secondary NN are working great, and I've just started on the BN/Checkpointer. 
Here's a brief overview of the design I'm going with - hopefully it will answer 
the above questions along the way.


h2. Storage contents

The NN storage directories continue to be organized in the same way - either 
edits, images, or both. The difference is that each edits or fsimage file now 
has a suffix indicating its "roll index". For example, a newly formatted NN has 
the following contents:

- fsimage_0 - empty image
- edits_0_inprogress - the edit log currently being appended

When edits are rolled, the current 'edits_N_inprogress' file is "finalized" by 
renaming to simply edits_N. So, if we roll the edits of the above image, we end 
up with:

- fsimage_0 - same empty image
- edits_0 - any edits made before the roll
- edits_1_inprogress

When an image is saved or uploaded via a checkpoint, the validity rule is as 
follows: any fsimage with roll index N must incorporate all edits from logs 
with a roll index less than N. So, if we enter safe mode and call saveNamespace 
on the above example, we end up with:

- fsimage_0 - original empty image
- edits_0 - edits before first roll
- edits_1 - edits before saveNamespace
- fsimage_2 - all edits from edits_0 and edits_1
- edits_2_inprogress - the edit log where new edits will be appended
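
To make the naming and validity rule concrete, here is a minimal Python sketch 
(the patch itself is Java; the helper names parse_name and edits_covered_by are 
hypothetical, only the file-name patterns come from the examples above):

```python
import re

# Only the name patterns are taken from the text; everything else is illustrative.
_NAME = re.compile(r"^(fsimage|edits)_(\d+)(_inprogress)?$")

def parse_name(name):
    """Return (kind, roll_index, in_progress), or None for an unrecognized name."""
    m = _NAME.match(name)
    if m is None or (m.group(1) == "fsimage" and m.group(3)):
        return None  # images are never "in progress" in this scheme
    return m.group(1), int(m.group(2)), m.group(3) is not None

def edits_covered_by(image_index, names):
    """Finalized edit logs that fsimage_<image_index> must already incorporate."""
    covered = []
    for name in names:
        parsed = parse_name(name)
        if parsed and parsed[0] == "edits" and not parsed[2] and parsed[1] < image_index:
            covered.append(name)
    return sorted(covered, key=lambda n: parse_name(n)[1])
```

Applied to the listing above, edits_covered_by(2, ...) selects edits_0 and 
edits_1, which is exactly what fsimage_2 is described as containing.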

h2. Log Rolling Triggers

The following events can trigger a log roll:
- NN startup (see below)
- saveNamespace
- a secondary or backup node wants to begin a checkpoint
- an IOException has occurred on one of the current edit logs
- potentially we may find it useful to expose this as an admin function (e.g. 
MySQL offers a FLUSH LOGS command)

h2. Log rolling behavior

- The current edits_N_inprogress log is closed
- The current edits_N_inprogress log is renamed to edits_N in all valid edits 
directories.
- Any edits directories that previously had problems will be left with 
edits_N_inprogress, since we don't know whether all of the edits made it into 
that log before the roll (in fact, they probably did not)
- The next edits_N+1_inprogress is opened in all directories, including an 
attempt to reopen any failed directories.
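
The roll steps can be sketched roughly as follows. This is a Python 
illustration under simplifying assumptions, not the patch: directories are 
modeled as in-memory sets of file names, roll_edits is a hypothetical name, and 
reopening a failed directory is assumed to always succeed.

```python
def roll_edits(dirs, current, failed):
    """Roll from edits_<current>_inprogress to edits_<current+1>_inprogress.

    dirs:   dict mapping a directory name to the set of file names in it
    failed: set of directory names whose current log hit an error
    Returns the new in-progress index.
    """
    old = f"edits_{current}_inprogress"
    for name, files in dirs.items():
        # Only healthy directories get the finalized name; a failed directory
        # keeps edits_N_inprogress because its contents cannot be trusted.
        if name not in failed and old in files:
            files.remove(old)
            files.add(f"edits_{current}")
        # Every directory, including previously failed ones, gets the new log.
        files.add(f"edits_{current + 1}_inprogress")
    failed.clear()  # assumption: the reopen attempt succeeded everywhere
    return current + 1
```

The leftover edits_N_inprogress in a failed directory is what the startup 
recovery later uses as a signal that the copy is suspect.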

h2. Startup behavior

First we initiate log recovery:

- Across all edits directories, look for any edits_N_inprogress:
-- If one is found, look for a finalized edits_N file in any other log directory
--- If there is at least one finalized edits_N, then the edits_N_inprogress is 
likely corrupt -- rename it to edits_N_corrupt (or delete it if we are less 
cautious)
-- If there are no finalized edits_N files, then the NN crashed while we were 
writing log index N. Initiate recovery process across all edits_N_inprogress:
--- Currently this isn't fancy - I just pick one. However, we could scan each 
of the logs for OP_INVALID and find the longest one, check whether they have 
the same length, etc (eg one log may have missed the last edit, or been 
truncated)
--- This is very simple to do since across all directories (including 
secondaries) edits_M for any M should be identical!
--- After we've determined the correct log(s), finalize it and remove the others
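
A rough Python sketch of this recovery pass, assuming directories are modeled 
as sets of file names. recover_logs is a hypothetical name, and like the 
current code it just picks one in-progress copy rather than comparing lengths:

```python
import re

_INPROGRESS = re.compile(r"^edits_(\d+)_inprogress$")

def recover_logs(dirs):
    """dirs maps a directory name to the set of file names in it; mutated in place."""
    indexes = {int(m.group(1)) for files in dirs.values()
               for m in map(_INPROGRESS.match, files) if m}
    for n in sorted(indexes):
        open_name, final_name = f"edits_{n}_inprogress", f"edits_{n}"
        holders = sorted(d for d, files in dirs.items() if open_name in files)
        if any(final_name in files for files in dirs.values()):
            # A finalized copy exists elsewhere, so the open log is suspect.
            for d in holders:
                dirs[d].remove(open_name)
                dirs[d].add(f"edits_{n}_corrupt")
        else:
            # Crash while writing log n: keep one copy (no length comparison
            # here), finalize it, and drop the rest.
            for i, d in enumerate(holders):
                dirs[d].remove(open_name)
                if i == 0:
                    dirs[d].add(final_name)
```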

Next, find the fsimage_N with the highest N across all image directories.
Then, find the edits_M with the highest M across all edits directories.

For safety, we check that there exists an edits_X for all X between N and M 
inclusive.

We then start up the NN by the following sequence:
- load fsimage_N
- for each X from N through M inclusive, load edits_X
- if we loaded any edits, save fsimage_M+1
- open edits_M+1_inprogress
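
The selection and gap check can be sketched in Python as below. It assumes the 
replay runs from index N up to M and that the new image and log both take index 
M+1, which is what the validity rule under "Storage contents" implies; the case 
where no finalized log has index >= N (nothing to replay, reopen at N) is my 
own assumption. plan_startup is a hypothetical name.

```python
import re

def _indexes(prefix, names):
    """Roll indexes of the finalized files called <prefix>_<index>."""
    matches = (re.fullmatch(rf"{prefix}_(\d+)", name) for name in names)
    return {int(m.group(1)) for m in matches if m}

def plan_startup(image_names, edits_names):
    """Return (image to load, logs to replay, image to save or None, log to open)."""
    n = max(_indexes("fsimage", image_names))   # newest image
    edits = _indexes("edits", edits_names)
    m = max(edits, default=n - 1)                # newest finalized log
    missing = [x for x in range(n, m + 1) if x not in edits]
    if missing:
        raise ValueError(f"gap in edit logs: {missing}")
    replay = [f"edits_{x}" for x in range(n, m + 1)]
    # Assumption: after replaying through edits_M the new image is fsimage_M+1.
    save = f"fsimage_{m + 1}" if replay else None
    next_log = f"edits_{m + 1 if replay else n}_inprogress"
    return f"fsimage_{n}", replay, save, next_log
```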

h2. Checkpoint process

- Checkpoint Signature is modified to include the latest image index and the 
current log index in progress.
- Checkpointing node issues beginCheckpoint to NN
- NN rolls edit logs, and returns a checkpoint signature that includes the 
latest stored fsimage_N, as well as the index of the log it just rolled to
- Image transfer servlet is augmented to allow the downloader to specify which 
image or edits file to download
- Checkpointer downloads fsimage_N and edits_N through edits_M (where M is the 
new finalized edit log from the roll)
- Checkpointer saves local fsimage_M+1, and uploads to NN
- NN validation of the checkpoint signature is much simpler - just needs to 
make sure it came from the same filesystem, check any security tokens, etc. The 
old fstime and editstime constructs are no longer necessary since it's all 
encapsulated in the index numbers. For extra safety we can easily add some 
checksum or log length info to the CheckpointSignature
- NN saves fsimage_M+1 into its local image dirs, but does not need to do any 
log manipulation.
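
As an illustration of how little the checkpointer needs from the signature, 
here is a Python sketch. The class and field names are hypothetical (the text 
only says the signature carries the latest image index and the in-progress log 
index), and namespace_id stands in for whatever identifies the filesystem.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class CheckpointSignature:
    # Hypothetical field names, not the actual CheckpointSignature fields.
    namespace_id: int
    latest_image_index: int      # N: newest fsimage_N stored on the NN
    inprogress_log_index: int    # index of the log the NN just rolled to

def checkpoint_plan(sig):
    """Files the checkpointer downloads and the image it uploads afterwards."""
    n = sig.latest_image_index
    m = sig.inprogress_log_index - 1      # last log finalized by the roll
    downloads = [f"fsimage_{n}"] + [f"edits_{x}" for x in range(n, m + 1)]
    return downloads, f"fsimage_{m + 1}"
```

For example, a signature with image index 2 and in-progress log index 5 leads 
to downloading fsimage_2 plus edits_2 through edits_4, and uploading fsimage_5.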

I'm still working out the backupnode operation, but I think it will actually be 
simplified by this proposal. Rather than having a special journaling mode, I 
think the NN can simply push any log roll events through the edit log stream to 
the BN. This will keep the roll indexes (and log contents) on the BN exactly 
identical to the indexes on the NN, which has good operational advantages and 
also reduces code complexity in the BN.

h2. Handling multiple checkpointers

Note that in the above process there is no state stored on the NN with regard 
to ongoing checkpoint processes. If multiple checkpoint nodes checkpoint 
simultaneously, the NN will simply roll twice and hand a different index to 
each. Each will then upload fsimages with different indexes.
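
A toy Python model of this, assuming the NN tracks nothing but its two indexes. 
NameNodeSketch and its methods are hypothetical names, and accepting uploads in 
any order by keeping the highest index is my assumption, not something the 
design above specifies.

```python
class NameNodeSketch:
    """Tracks only the indexes; no per-checkpointer state is kept."""

    def __init__(self, image_index=0, log_index=0):
        self.image_index = image_index   # newest fsimage_N
        self.log_index = log_index       # current edits_N_inprogress

    def begin_checkpoint(self):
        """Roll the edit log and return (latest image index, new log index)."""
        self.log_index += 1
        return self.image_index, self.log_index

    def upload_image(self, index):
        """Accept fsimage_<index>; assumption: the newest image wins."""
        self.image_index = max(self.image_index, index)
```

Two checkpointers calling begin_checkpoint back to back receive log indexes 1 
and 2, so their uploads are named fsimage_1 and fsimage_2 and cannot collide.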

h2. Image/edits file retention policies

There are a number of policies that should be simple to implement:

- *Number of saved images* - ensure that we have at least N saved images in our 
image directories, can delete any that are more than N versions old. Maintain 
edit logs that have index >= the index of the Nth oldest image.
- *Time* - ensure that we maintain all images within a trailing time window - 
again maintain all edit logs with index >= index of oldest maintained image.
- *Archival* - for audit purposes, the deletion mechanism could very easily be 
augmented to archive the edit logs for later analysis (eg to HDFS, tape, SAN, 
etc)

So long as any fsimage_N and all edits_M where M >= N are retained somewhere, 
they can be copied back into the NN's storage directories and full PITR is 
possible.
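
The "number of saved images" policy might look like this in Python. It is a 
sketch under the assumption that files are identified by their roll indexes 
alone; retention_plan and keep_images are hypothetical names.

```python
def retention_plan(image_indexes, edits_indexes, keep_images):
    """Return (image indexes to delete, edit log indexes to delete), sorted."""
    if keep_images < 1:
        raise ValueError("must keep at least one image")
    if not image_indexes:
        return [], []          # nothing to anchor on, so delete nothing
    kept = sorted(image_indexes)[-keep_images:]
    oldest_kept = kept[0]
    drop_images = sorted(i for i in image_indexes if i not in kept)
    # Edits with index >= the oldest kept image are still needed to roll
    # that image forward, so only older logs are candidates for deletion.
    drop_edits = sorted(e for e in edits_indexes if e < oldest_kept)
    return drop_images, drop_edits
```

An archival policy could take the same two lists and copy the files elsewhere 
before (or instead of) deleting them.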


> Simpler model for Namenode's fs Image and edit Logs 
> ----------------------------------------------------
>
>                 Key: HDFS-1073
>                 URL: https://issues.apache.org/jira/browse/HDFS-1073
>             Project: Hadoop HDFS
>          Issue Type: Improvement
>            Reporter: Sanjay Radia
>            Assignee: Todd Lipcon
>
> The naming and handling of NN's fsImage and edit logs can be significantly 
> improved, resulting in simpler and more robust code.
