[ https://issues.apache.org/jira/browse/HDFS-1994?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Todd Lipcon updated HDFS-1994:
------------------------------

    Attachment: hdfs-1994.txt

Updated patch that does solution "a" above. Also includes a new test case to trigger the interleaving and make sure only one is successful.

If we decide we also want to do "b" above, it could be done in a followup. But it's a very rare race that only really shows up in this kind of stress test, and it doesn't cause any corruption, just a missed checkpoint.

> Fix race conditions when running two rapidly checkpointing 2NNs
> ---------------------------------------------------------------
>
>                 Key: HDFS-1994
>                 URL: https://issues.apache.org/jira/browse/HDFS-1994
>             Project: Hadoop HDFS
>          Issue Type: Sub-task
>          Components: name-node
>    Affects Versions: Edit log branch (HDFS-1073)
>            Reporter: Todd Lipcon
>            Assignee: Todd Lipcon
>             Fix For: Edit log branch (HDFS-1073)
>
>         Attachments: hdfs-1994.txt, hdfs-1994.txt
>
>
> HDFS-1984 added the ability to run two secondary namenodes at the same time.
> However, there were two races I found when stress testing this (by running
> two NNs each checkpointing in a tight loop with no sleep):
> 1) the writing of the seen_txid file was not atomic, so it was at some points
> reading an empty file
> 2) it was possible for two checkpointers to try to take a checkpoint at the
> same transaction ID, which would cause the two image downloads to collide and
> fail

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira
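
For readers unfamiliar with race (1) in the description, the usual way to keep a reader from ever seeing an empty or half-written file is to write the new contents to a temporary file in the same directory and then rename it over the target. The sketch below illustrates that general pattern only; it is not taken from the attached patch, and the class name AtomicTxidFile and its methods are hypothetical. Only the file name seen_txid comes from the issue. Note that Files.move with ATOMIC_MOVE replaces an existing target on typical POSIX file systems, but the Java documentation leaves that behavior implementation specific.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.nio.file.StandardOpenOption;

// Hypothetical helper, not part of Hadoop: illustrates write-to-temp-then-rename.
final class AtomicTxidFile {
    // File name taken from the issue description.
    static final String FILE_NAME = "seen_txid";

    // Writes txid so that a concurrent reader sees either the old contents
    // or the new contents, never an empty or partially written file.
    static void writeTxid(Path dir, long txid) throws IOException {
        // The temp file must live in the same directory (same file system)
        // as the target, otherwise the rename cannot be atomic.
        Path tmp = Files.createTempFile(dir, FILE_NAME, ".tmp");
        try {
            byte[] data = (Long.toString(txid) + "\n").getBytes(StandardCharsets.UTF_8);
            try (FileChannel ch = FileChannel.open(tmp, StandardOpenOption.WRITE)) {
                ByteBuffer buf = ByteBuffer.wrap(data);
                while (buf.hasRemaining()) {
                    ch.write(buf);
                }
                // Flush data and metadata to disk before the file becomes visible.
                ch.force(true);
            }
            // Atomically swap the finished temp file into place.
            Files.move(tmp, dir.resolve(FILE_NAME), StandardCopyOption.ATOMIC_MOVE);
        } finally {
            // No-op after a successful move; cleans up the temp file on failure.
            Files.deleteIfExists(tmp);
        }
    }

    // Reads the transaction ID back from the file.
    static long readTxid(Path dir) throws IOException {
        byte[] raw = Files.readAllBytes(dir.resolve(FILE_NAME));
        return Long.parseLong(new String(raw, StandardCharsets.UTF_8).trim());
    }
}
```

With this pattern a reader that opens seen_txid during an update gets the previous transaction ID rather than an empty file, because the target path only ever points at a fully written file.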