[ https://issues.apache.org/jira/browse/HDFS-1984?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13038898#comment-13038898 ]

Todd Lipcon commented on HDFS-1984:
-----------------------------------

bq. Can't these two threads in the test race? I imagine they never would in
practice.

It's OK: delayer.waitForCall() just blocks until the checkpoint thread reaches
the instrumented method, so it doesn't matter which thread gets there first.
This works well in the TestFileAppend4 tests.
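
Why the two threads cannot race can be sketched with a pair of latches. This is a hypothetical illustration, not the delayer used in the HDFS tests: only `waitForCall` comes from the comment above, while `Delayer`, `onCall`, `proceed`, and `DelayerDemo` are assumed names.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.atomic.AtomicBoolean;

/**
 * Hypothetical latch-based delayer (not the actual Hadoop test utility).
 * The instrumented method calls onCall(); the test thread uses
 * waitForCall() and proceed() to control it.
 */
class Delayer {
    private final CountDownLatch called = new CountDownLatch(1);
    private final CountDownLatch released = new CountDownLatch(1);

    /** Runs on the checkpoint thread inside the instrumented method. */
    void onCall() throws InterruptedException {
        called.countDown();  // signal: the hook has been reached
        released.await();    // park until the test calls proceed()
    }

    /** Test thread: block until the hook is reached, however late the other thread starts. */
    void waitForCall() throws InterruptedException {
        called.await();
    }

    /** Test thread: let the checkpoint thread continue past the hook. */
    void proceed() {
        released.countDown();
    }
}

class DelayerDemo {
    /** Returns true if the checkpoint thread was held at the hook until proceed(). */
    static boolean run() throws InterruptedException {
        Delayer delayer = new Delayer();
        AtomicBoolean finished = new AtomicBoolean(false);
        Thread checkpointer = new Thread(() -> {
            try {
                delayer.onCall();   // stands in for the instrumented method
                finished.set(true); // work that happens after the hook
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        checkpointer.start();
        delayer.waitForCall();                  // returns only once the hook is hit
        boolean finishedEarly = finished.get(); // always false: the thread is parked
        delayer.proceed();
        checkpointer.join();
        return !finishedEarly && finished.get();
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println(run()); // true
    }
}
```

Whichever thread the scheduler runs first, waitForCall() returns only after the checkpoint thread has reached the hook, so the test's assertions never see a half-finished checkpoint.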

bq. It should be rare that there's no MD5 file for an image (i.e., this only
happens when the image is from a previous version). Would it therefore make
sense to warn in places like setVerificationHeaders where an MD5 file is not
present?

The same code path is also used for transferring edits, though perhaps we could
add a flag like "requireMd5File". I'll make a note of that as a TODO.
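
A minimal sketch of the "requireMd5File" idea, assuming the digest sits next to the transferred file as `<name>.md5` (a hypothetical convention) and that a required-but-missing digest should only log a warning, as the review suggests, rather than fail the transfer. `Md5Check`, `md5FileFor`, and `canVerify` are hypothetical names, not existing HDFS methods.

```java
import java.io.File;
import java.util.logging.Logger;

/** Hypothetical helper: decides whether a transfer can be MD5-verified. */
class Md5Check {
    private static final Logger LOG = Logger.getLogger(Md5Check.class.getName());

    /** Assumed naming convention: the digest is stored as <file>.md5. */
    static File md5FileFor(File dataFile) {
        return new File(dataFile.getPath() + ".md5");
    }

    /**
     * Returns true if an MD5 file exists and the transfer can be verified.
     * requireMd5File = true (images): a missing digest is unusual, e.g. an
     * image written by a previous version, so log a warning.
     * requireMd5File = false (edits, which share this code path): a missing
     * digest is expected, so skip verification quietly.
     */
    static boolean canVerify(File dataFile, boolean requireMd5File) {
        if (md5FileFor(dataFile).exists()) {
            return true;
        }
        if (requireMd5File) {
            LOG.warning("No MD5 file for " + dataFile + "; transferring without verification");
        }
        return false;
    }
}
```

A stricter variant could throw instead of warning when the flag is set, at the cost of rejecting images from older versions.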

bq. Not your change, but it would be less error-prone if ErrorSimulation used,
e.g., an enum CORRUPT_IMG_XFER instead of "4".

Agreed.
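
A sketch of the enum suggestion. Only CORRUPT_IMG_XFER comes from the review; the other constants and the `ErrorSimulationSketch` class are hypothetical stand-ins, not the existing ErrorSimulation API.

```java
import java.util.EnumSet;
import java.util.Set;

/** Named error points instead of bare integer indices such as "4". */
enum SimulatedError {
    CRASH_BEFORE_UPLOAD, // hypothetical, for illustration only
    CRASH_AFTER_UPLOAD,  // hypothetical, for illustration only
    CORRUPT_IMG_XFER     // from the review comment
}

/** Hypothetical replacement for an index-based error-simulation switchboard. */
class ErrorSimulationSketch {
    private static final Set<SimulatedError> active = EnumSet.noneOf(SimulatedError.class);

    static synchronized void set(SimulatedError e) { active.add(e); }

    static synchronized void clear(SimulatedError e) { active.remove(e); }

    static synchronized boolean isOn(SimulatedError e) { return active.contains(e); }
}
```

A call site would then read `if (ErrorSimulationSketch.isOn(SimulatedError.CORRUPT_IMG_XFER)) { ... }`, and a mistyped constant fails at compile time instead of silently toggling the wrong slot.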

> HDFS-1073: Enable multiple checkpointers to run simultaneously
> --------------------------------------------------------------
>
>                 Key: HDFS-1984
>                 URL: https://issues.apache.org/jira/browse/HDFS-1984
>             Project: Hadoop HDFS
>          Issue Type: Sub-task
>          Components: name-node
>    Affects Versions: Edit log branch (HDFS-1073)
>            Reporter: Todd Lipcon
>            Assignee: Todd Lipcon
>             Fix For: Edit log branch (HDFS-1073)
>
>         Attachments: hdfs-1984.txt
>
>
> One of the motivations of HDFS-1073 is that it decouples the checkpoint 
> process so that multiple checkpoints could be taken at the same time and not 
> interfere with each other.
> Currently on the 1073 branch this doesn't quite work right, since we have 
> some state and validation in FSImage that's tied to a single fsimage_N -- 
> thus if two 2NNs perform a checkpoint at different transaction IDs, only one 
> will succeed.
> As a stress test, we can run two 2NNs, each configured with
> fs.checkpoint.interval set to "0", which causes them to checkpoint
> continuously, as fast as they can.
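
The failure mode described above, and one way around it, can be modeled in a few lines. This is a hypothetical illustration, not the HDFS-1984 patch: `SingleSlotImageState` stands for state tied to a single fsimage_N, `PerTxIdImageState` tracks each in-progress checkpoint by transaction ID, and all transaction IDs are made up.

```java
import java.util.Arrays;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

/** One expected txid shared by every checkpointer (the problem described above). */
class SingleSlotImageState {
    private long expectedTxId = -1;

    synchronized void beginCheckpoint(long txId) { expectedTxId = txId; }

    /** Fails if another checkpointer started in between and overwrote the slot. */
    synchronized boolean finishCheckpoint(long txId) { return expectedTxId == txId; }
}

/** Each in-progress fsimage_N is tracked independently, keyed by txid. */
class PerTxIdImageState {
    private final Set<Long> inProgress = ConcurrentHashMap.newKeySet();

    void beginCheckpoint(long txId) { inProgress.add(txId); }

    boolean finishCheckpoint(long txId) { return inProgress.remove(txId); }
}

class CheckpointDemo {
    /** Checkpointer A starts at txid 100, B at 200, then A finishes first. */
    static boolean[] interleaveSingleSlot() {
        SingleSlotImageState s = new SingleSlotImageState();
        s.beginCheckpoint(100);
        s.beginCheckpoint(200);
        return new boolean[] { s.finishCheckpoint(100), s.finishCheckpoint(200) };
    }

    /** Same interleaving against per-txid state. */
    static boolean[] interleavePerTxId() {
        PerTxIdImageState s = new PerTxIdImageState();
        s.beginCheckpoint(100);
        s.beginCheckpoint(200);
        return new boolean[] { s.finishCheckpoint(100), s.finishCheckpoint(200) };
    }

    /** Two threads checkpoint back to back, like two 2NNs with no interval. */
    static int stressPerTxId(int rounds) throws InterruptedException {
        PerTxIdImageState s = new PerTxIdImageState();
        AtomicInteger failures = new AtomicInteger();
        Thread[] nodes = new Thread[2];
        for (int n = 0; n < nodes.length; n++) {
            final int id = n;
            nodes[n] = new Thread(() -> {
                for (int i = 0; i < rounds; i++) {
                    long txId = 2L * i + id; // disjoint txids per node (even / odd)
                    s.beginCheckpoint(txId);
                    if (!s.finishCheckpoint(txId)) {
                        failures.incrementAndGet();
                    }
                }
            });
            nodes[n].start();
        }
        for (Thread t : nodes) {
            t.join();
        }
        return failures.get();
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println(Arrays.toString(interleaveSingleSlot())); // [false, true]
        System.out.println(Arrays.toString(interleavePerTxId()));    // [true, true]
        System.out.println(stressPerTxId(10_000));                   // 0
    }
}
```

In the single-slot model the checkpoint that started first fails because the second one overwrote the shared expected transaction ID; keying the state by transaction ID lets both complete.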
