[jira] [Commented] (ZOOKEEPER-2574) PurgeTxnLog can inadvertently delete required txn log files
[ https://issues.apache.org/jira/browse/ZOOKEEPER-2574?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15679486#comment-15679486 ]

Abhishek Rai commented on ZOOKEEPER-2574:
-----------------------------------------

[~hanm] and [~rakeshr], thanks for finding the relation to ZOOKEEPER-2420 and thanks for your guidance. I've created a pull request as per your suggestion with the following changes:
(1) Patch previously uploaded containing fix and tests.
(2) Tests from ZOOKEEPER-2420 and enabling code.
(3) Documentation fixes.

[~rakeshr] great call on documentation review, as I went through it I found multiple inconsistencies about the snapshot-log dependency. I've fixed all that I could find in the docs/ directory.

> PurgeTxnLog can inadvertently delete required txn log files
> -----------------------------------------------------------
>
> Key: ZOOKEEPER-2574
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2574
> Project: ZooKeeper
> Issue Type: Bug
> Components: server
> Affects Versions: 3.4.7, 3.4.8, 3.5.0, 3.5.1, 3.5.2
> Environment: Zookeeper 3.4.8, standalone, and 3-server quorum
> Reporter: Abhishek Rai
> Assignee: Abhishek Rai
> Fix For: 3.4.10, 3.5.3
>
> Attachments: ZOOKEEPER-2574.2.patch, ZOOKEEPER-2574.3.patch, ZOOKEEPER-2574.4.patch, ZOOKEEPER-2574.5.patch, ZOOKEEPER-2574.6.patch, ZOOKEEPER-2574.patch
>
>
> As part of the fix for ZOOKEEPER-1797, the call to FileTxnSnapLog.getSnapshotLogs() was removed from PurgeTxnLog.java. As a result, some old-looking but required txn log files can be deleted, resulting in data corruption or loss.
> For example, consider the following:
> 1. Configuration:
> autopurge.snapRetainCount=3
> 2. Following files exist:
> log.100 spans transactions from zxid=100 till zxid=140 (inclusive)
> snapshot.110 - snapshot as of zxid=110
> snapshot.120 - snapshot as of zxid=120
> snapshot.130 - snapshot as of zxid=130
> Above scenario is possible when snapshotting has happened multiple times but without accompanying log rollover, which is possible if the server was running as a learner.
> 3. PurgeTxnLog retains all snapshots but deletes log.100 because its zxid is older than the zxid of the oldest snapshot (110). This results in loss of transactions in the range 131-140.
> Before the fix for ZOOKEEPER-1797, this was avoided by the call to FileTxnSnapLog.getSnapshotLogs() which finds and retains the newest txn log file with starting zxid < oldest retained snapshot's highest zxid.

--
This message was sent by Atlassian JIRA (v6.3.4#6332)
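The retention rule in the description above can be sketched in a few lines of Java. This is only an illustration of the rule as worded in the issue (keep the newest log whose starting zxid is below the oldest retained snapshot's zxid, plus everything newer); the class and method names are hypothetical and this is not the code in PurgeTxnLog or FileTxnSnapLog.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/** Illustrative sketch only; not the PurgeTxnLog implementation. */
final class PurgeSketch {
    private PurgeSketch() {}

    /**
     * Returns the starting zxids of txn logs that may be deleted.
     *
     * @param logStartZxids  starting zxid of each log file (the suffix of "log.N")
     * @param oldestSnapZxid zxid of the oldest snapshot that is being retained
     */
    static List<Long> logsSafeToDelete(List<Long> logStartZxids, long oldestSnapZxid) {
        // Find the newest log that starts before the oldest retained snapshot.
        // That log may still hold transactions newer than the snapshot.
        boolean found = false;
        long boundary = 0L;
        for (long start : logStartZxids) {
            if (start < oldestSnapZxid && (!found || start > boundary)) {
                boundary = start;
                found = true;
            }
        }
        List<Long> deletable = new ArrayList<>();
        if (!found) {
            return deletable; // every log starts at or after the snapshot: keep all
        }
        // Only logs strictly older than the boundary log are safe to remove.
        for (long start : logStartZxids) {
            if (start < boundary) {
                deletable.add(start);
            }
        }
        Collections.sort(deletable);
        return deletable;
    }
}
```

With the files from the example (only log.100, oldest retained snapshot 110), the method returns an empty list, so log.100 and its transactions 131-140 survive the purge.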
[jira] [Updated] (ZOOKEEPER-1621) ZooKeeper does not recover from crash when disk was full
[ https://issues.apache.org/jira/browse/ZOOKEEPER-1621?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Abhishek Rai updated ZOOKEEPER-1621:
------------------------------------
    Attachment: ZOOKEEPER-1621.2.patch

Based on the discussion with [~mkizner] above, skipping of the truncated txn log file is insufficient, and its deletion is necessary. Otherwise we can run into problems in two places:
- FileTxnLog is required to include the latest txn log before the snapshot that it's loading. If that latest txn log is truncated (and previously skipped), then it can incorrectly satisfy this requirement. Instead, if we delete the truncated file, then we are forced to reach back into the older valid txn log.
- PurgeTxnLog has similar logic about retaining the latest txn log before the last retained snapshot. Therefore, without the deletion, its requirements would similarly be met by a truncated and useless txn log.

I've now updated [~michim]'s patch with two changes and corresponding testing changes:
- Deletion as described here.
- Use a tighter exception (EOFException) instead of IOException.

> ZooKeeper does not recover from crash when disk was full
> --------------------------------------------------------
>
> Key: ZOOKEEPER-1621
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1621
> Project: ZooKeeper
> Issue Type: Bug
> Components: server
> Affects Versions: 3.4.3
> Environment: Ubuntu 12.04, Amazon EC2 instance
> Reporter: David Arthur
> Assignee: Michi Mutsuzaki
> Fix For: 3.5.3, 3.6.0
>
> Attachments: ZOOKEEPER-1621.2.patch, ZOOKEEPER-1621.patch, zookeeper.log.gz
>
>
> The disk that ZooKeeper was using filled up. During a snapshot write, I got the following exception
> 2013-01-16 03:11:14,098 - ERROR [SyncThread:0:SyncRequestProcessor@151] - Severe unrecoverable error, exiting
> java.io.IOException: No space left on device
>     at java.io.FileOutputStream.writeBytes(Native Method)
>     at java.io.FileOutputStream.write(FileOutputStream.java:282)
>     at java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:65)
>     at java.io.BufferedOutputStream.flush(BufferedOutputStream.java:123)
>     at org.apache.zookeeper.server.persistence.FileTxnLog.commit(FileTxnLog.java:309)
>     at org.apache.zookeeper.server.persistence.FileTxnSnapLog.commit(FileTxnSnapLog.java:306)
>     at org.apache.zookeeper.server.ZKDatabase.commit(ZKDatabase.java:484)
>     at org.apache.zookeeper.server.SyncRequestProcessor.flush(SyncRequestProcessor.java:162)
>     at org.apache.zookeeper.server.SyncRequestProcessor.run(SyncRequestProcessor.java:101)
> Then many subsequent exceptions like:
> 2013-01-16 15:02:23,984 - ERROR [main:Util@239] - Last transaction was partial.
> 2013-01-16 15:02:23,985 - ERROR [main:ZooKeeperServerMain@63] - Unexpected exception, exiting abnormally
> java.io.EOFException
>     at java.io.DataInputStream.readInt(DataInputStream.java:375)
>     at org.apache.jute.BinaryInputArchive.readInt(BinaryInputArchive.java:63)
>     at org.apache.zookeeper.server.persistence.FileHeader.deserialize(FileHeader.java:64)
>     at org.apache.zookeeper.server.persistence.FileTxnLog$FileTxnIterator.inStreamCreated(FileTxnLog.java:558)
>     at org.apache.zookeeper.server.persistence.FileTxnLog$FileTxnIterator.createInputArchive(FileTxnLog.java:577)
>     at org.apache.zookeeper.server.persistence.FileTxnLog$FileTxnIterator.goToNextLog(FileTxnLog.java:543)
>     at org.apache.zookeeper.server.persistence.FileTxnLog$FileTxnIterator.next(FileTxnLog.java:625)
>     at org.apache.zookeeper.server.persistence.FileTxnLog$FileTxnIterator.init(FileTxnLog.java:529)
>     at org.apache.zookeeper.server.persistence.FileTxnLog$FileTxnIterator.<init>(FileTxnLog.java:504)
>     at org.apache.zookeeper.server.persistence.FileTxnLog.read(FileTxnLog.java:341)
>     at org.apache.zookeeper.server.persistence.FileTxnSnapLog.restore(FileTxnSnapLog.java:130)
>     at org.apache.zookeeper.server.ZKDatabase.loadDataBase(ZKDatabase.java:223)
>     at org.apache.zookeeper.server.ZooKeeperServer.loadData(ZooKeeperServer.java:259)
>     at org.apache.zookeeper.server.ZooKeeperServer.startdata(ZooKeeperServer.java:386)
>     at org.apache.zookeeper.server.NIOServerCnxnFactory.startup(NIOServerCnxnFactory.java:138)
>     at org.apache.zookeeper.server.ZooKeeperServerMain.runFromConfig(ZooKeeperServerMain.java:112)
>     at org.apache.zookeeper.server.ZooKeeperServerMain.initializeAndRun(ZooKeeperServerMain.java:86)
>     at org.apache.zookeeper.server.ZooKeeperServerMain.main(ZooKeeperServerMain.java:52)
>     at
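The argument for deleting the truncated txn log, rather than only skipping it, can be illustrated with a small Java sketch. It assumes a simplified model in which a log is just a starting zxid plus a flag saying whether its header is truncated; the `TxnLog` type and `newestLogBefore` method are hypothetical and do not correspond to ZooKeeper classes.

```java
import java.util.List;

/** Illustrative sketch only; the types and names here are hypothetical. */
final class LogSelectionSketch {
    private LogSelectionSketch() {}

    /** A txn log reduced to the two facts that matter for this argument. */
    static final class TxnLog {
        final long startZxid;
        final boolean headerTruncated;

        TxnLog(long startZxid, boolean headerTruncated) {
            this.startZxid = startZxid;
            this.headerTruncated = headerTruncated;
        }
    }

    /** Returns the newest log whose starting zxid precedes snapshotZxid, or null if none. */
    static TxnLog newestLogBefore(List<TxnLog> logs, long snapshotZxid) {
        TxnLog best = null;
        for (TxnLog log : logs) {
            if (log.startZxid < snapshotZxid
                    && (best == null || log.startZxid > best.startZxid)) {
                best = log;
            }
        }
        return best;
    }
}
```

With a valid log.100 and a truncated log.120 on disk, a lookup for snapshot 130 selects log.120, which holds no usable transactions. Once log.120 is removed from the list, the same lookup falls back to log.100, which is the behavior the comment argues for.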
[jira] [Commented] (ZOOKEEPER-1621) ZooKeeper does not recover from crash when disk was full
[ https://issues.apache.org/jira/browse/ZOOKEEPER-1621?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15616237#comment-15616237 ]

Abhishek Rai commented on ZOOKEEPER-1621:
-----------------------------------------

Thanks [~mkizner]. Your suggestion of doing this only for the most recent txn log file is sound. Are you also suggesting that we delete this truncated txn log file? I ask because, if we skip it and don't delete it, newer txn log files will get created in the future. The truncated txn log file will then no longer be the latest txn log when we do a purge afterwards. Deletion seems consistent with this approach as well as with PurgeTxnLog's behavior.

> ZooKeeper does not recover from crash when disk was full
> --------------------------------------------------------
>
> Key: ZOOKEEPER-1621
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1621
> Project: ZooKeeper
> Issue Type: Bug
> Components: server
> Affects Versions: 3.4.3
> Environment: Ubuntu 12.04, Amazon EC2 instance
> Reporter: David Arthur
> Assignee: Michi Mutsuzaki
> Fix For: 3.5.3, 3.6.0
>
> Attachments: ZOOKEEPER-1621.patch, zookeeper.log.gz
>
>
> The disk that ZooKeeper was using filled up. During a snapshot write, I got the following exception
> 2013-01-16 03:11:14,098 - ERROR [SyncThread:0:SyncRequestProcessor@151] - Severe unrecoverable error, exiting
> java.io.IOException: No space left on device
>     at java.io.FileOutputStream.writeBytes(Native Method)
>     at java.io.FileOutputStream.write(FileOutputStream.java:282)
>     at java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:65)
>     at java.io.BufferedOutputStream.flush(BufferedOutputStream.java:123)
>     at org.apache.zookeeper.server.persistence.FileTxnLog.commit(FileTxnLog.java:309)
>     at org.apache.zookeeper.server.persistence.FileTxnSnapLog.commit(FileTxnSnapLog.java:306)
>     at org.apache.zookeeper.server.ZKDatabase.commit(ZKDatabase.java:484)
>     at org.apache.zookeeper.server.SyncRequestProcessor.flush(SyncRequestProcessor.java:162)
>     at org.apache.zookeeper.server.SyncRequestProcessor.run(SyncRequestProcessor.java:101)
> Then many subsequent exceptions like:
> 2013-01-16 15:02:23,984 - ERROR [main:Util@239] - Last transaction was partial.
> 2013-01-16 15:02:23,985 - ERROR [main:ZooKeeperServerMain@63] - Unexpected exception, exiting abnormally
> java.io.EOFException
>     at java.io.DataInputStream.readInt(DataInputStream.java:375)
>     at org.apache.jute.BinaryInputArchive.readInt(BinaryInputArchive.java:63)
>     at org.apache.zookeeper.server.persistence.FileHeader.deserialize(FileHeader.java:64)
>     at org.apache.zookeeper.server.persistence.FileTxnLog$FileTxnIterator.inStreamCreated(FileTxnLog.java:558)
>     at org.apache.zookeeper.server.persistence.FileTxnLog$FileTxnIterator.createInputArchive(FileTxnLog.java:577)
>     at org.apache.zookeeper.server.persistence.FileTxnLog$FileTxnIterator.goToNextLog(FileTxnLog.java:543)
>     at org.apache.zookeeper.server.persistence.FileTxnLog$FileTxnIterator.next(FileTxnLog.java:625)
>     at org.apache.zookeeper.server.persistence.FileTxnLog$FileTxnIterator.init(FileTxnLog.java:529)
>     at org.apache.zookeeper.server.persistence.FileTxnLog$FileTxnIterator.<init>(FileTxnLog.java:504)
>     at org.apache.zookeeper.server.persistence.FileTxnLog.read(FileTxnLog.java:341)
>     at org.apache.zookeeper.server.persistence.FileTxnSnapLog.restore(FileTxnSnapLog.java:130)
>     at org.apache.zookeeper.server.ZKDatabase.loadDataBase(ZKDatabase.java:223)
>     at org.apache.zookeeper.server.ZooKeeperServer.loadData(ZooKeeperServer.java:259)
>     at org.apache.zookeeper.server.ZooKeeperServer.startdata(ZooKeeperServer.java:386)
>     at org.apache.zookeeper.server.NIOServerCnxnFactory.startup(NIOServerCnxnFactory.java:138)
>     at org.apache.zookeeper.server.ZooKeeperServerMain.runFromConfig(ZooKeeperServerMain.java:112)
>     at org.apache.zookeeper.server.ZooKeeperServerMain.initializeAndRun(ZooKeeperServerMain.java:86)
>     at org.apache.zookeeper.server.ZooKeeperServerMain.main(ZooKeeperServerMain.java:52)
>     at org.apache.zookeeper.server.quorum.QuorumPeerMain.initializeAndRun(QuorumPeerMain.java:116)
>     at org.apache.zookeeper.server.quorum.QuorumPeerMain.main(QuorumPeerMain.java:78)
> It seems to me that writing the transaction log should be fully atomic to avoid such situations. Is this not the case?
[jira] [Updated] (ZOOKEEPER-2574) PurgeTxnLog can inadvertently delete required txn log files
[ https://issues.apache.org/jira/browse/ZOOKEEPER-2574?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Abhishek Rai updated ZOOKEEPER-2574:
------------------------------------
    Attachment: ZOOKEEPER-2574.6.patch

Thanks [~rakeshr]. I've updated the doc now, please take another look. Thanks

> PurgeTxnLog can inadvertently delete required txn log files
> -----------------------------------------------------------
>
> Key: ZOOKEEPER-2574
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2574
> Project: ZooKeeper
> Issue Type: Bug
> Components: server
> Affects Versions: 3.4.7, 3.4.8, 3.5.0, 3.5.1, 3.5.2
> Environment: Zookeeper 3.4.8, standalone, and 3-server quorum
> Reporter: Abhishek Rai
> Assignee: Abhishek Rai
> Priority: Blocker
> Fix For: 3.4.10, 3.5.3
>
> Attachments: ZOOKEEPER-2574.2.patch, ZOOKEEPER-2574.3.patch, ZOOKEEPER-2574.4.patch, ZOOKEEPER-2574.5.patch, ZOOKEEPER-2574.6.patch, ZOOKEEPER-2574.patch
[jira] [Commented] (ZOOKEEPER-1621) ZooKeeper does not recover from crash when disk was full
[ https://issues.apache.org/jira/browse/ZOOKEEPER-1621?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15560526#comment-15560526 ]

Abhishek Rai commented on ZOOKEEPER-1621:
-----------------------------------------

Reviving this old thread. [~shralex] has a valid concern about trading off consistency for availability. However, for the specific issue being addressed here, we can have both.

The patch skips transaction logs with an incomplete header (the first 16 bytes). Skipping such files should not cause any loss of data as the header is an internal bookkeeping write from Zookeeper and does not contain any user data. This avoids the current behavior of Zookeeper crashing on encountering an incomplete header, which compromises availability.

This has been a recurring problem for us in production because our app's operating environment occasionally causes a Zookeeper server's disk to become full. After that, the server invariably runs into this problem - perhaps because there's something else that deterministically triggers a log rotation when the previous txn log throws an IOException due to disk full?

That said, we can tighten the exception being caught in [~michim]'s patch to EOFException instead of IOException to make sure that the log we are skipping indeed only has a partially written header and nothing else (in FileTxnLog.goToNextLog).

Additionally, I have written a test to verify that EOFException is thrown if and only if the header is truncated. Zookeeper already ignores any other partially written transactions in the txn log. If that's useful, I can upload the test, thanks.

> ZooKeeper does not recover from crash when disk was full
> --------------------------------------------------------
>
> Key: ZOOKEEPER-1621
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1621
> Project: ZooKeeper
> Issue Type: Bug
> Components: server
> Affects Versions: 3.4.3
> Environment: Ubuntu 12.04, Amazon EC2 instance
> Reporter: David Arthur
> Assignee: Michi Mutsuzaki
> Fix For: 3.5.3, 3.6.0
>
> Attachments: ZOOKEEPER-1621.patch, zookeeper.log.gz
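The header check discussed in this comment can be sketched as follows. The sketch assumes the 16-byte header consists of an int magic number, an int version, and a long dbid; that layout, like the class and method names, is an assumption made for illustration, and the code is not taken from FileTxnLog or the attached patch. It relies only on the documented behavior of `DataInputStream`, whose `readInt` and `readLong` throw `EOFException` when the stream ends early.

```java
import java.io.DataInputStream;
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;

/** Illustrative sketch only; not the FileTxnLog implementation. */
final class TxnLogHeaderSketch {
    private TxnLogHeaderSketch() {}

    /** Assumed header size: int magic + int version + long dbid. */
    static final int HEADER_BYTES = 16;

    /**
     * Returns true if a full header can be read from the stream, false if the
     * stream ends first. Any other I/O failure is rethrown unchecked, so that
     * only a truncated header is treated as "skip this file".
     */
    static boolean hasCompleteHeader(InputStream in) {
        DataInputStream data = new DataInputStream(in);
        try {
            data.readInt();  // magic (assumed field)
            data.readInt();  // version (assumed field)
            data.readLong(); // dbid (assumed field)
            return true;
        } catch (EOFException e) {
            return false;    // header is truncated
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Catching `EOFException` specifically, rather than `IOException`, keeps genuine read errors from being mistaken for a partially written header.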
[jira] [Updated] (ZOOKEEPER-2574) PurgeTxnLog can inadvertently delete required txn log files
[ https://issues.apache.org/jira/browse/ZOOKEEPER-2574?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Abhishek Rai updated ZOOKEEPER-2574:
------------------------------------
    Attachment: ZOOKEEPER-2574.5.patch

Thanks [~abrahamfine].

>> I switched logsToPurge from a List to an ArrayList so I can simply use remove(0) to remove the first element in the list on line 239
> I think I must be missing something as all of the lists are ArrayLists. For example, this still passes:

Sorry I was confused about something, fixed the usage of logsToPurge as you suggested, thanks for persisting.

>> Is there a way to achieve both goals, logging and console output (preferably stdout) without any duplication.
> I'm not sure, perhaps system.err?

I tried System.err.println, but then this output comes at the end of the test log under "stderr" section. It may have limited utility in debugging since it's not inline with other related logging.

Thanks

> PurgeTxnLog can inadvertently delete required txn log files
> -----------------------------------------------------------
>
> Key: ZOOKEEPER-2574
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2574
> Project: ZooKeeper
> Issue Type: Bug
> Components: server
> Affects Versions: 3.4.7, 3.4.8, 3.5.0, 3.5.1, 3.5.2
> Environment: Zookeeper 3.4.8, standalone, and 3-server quorum
> Reporter: Abhishek Rai
> Assignee: Abhishek Rai
> Priority: Blocker
> Fix For: 3.4.10, 3.5.3
>
> Attachments: ZOOKEEPER-2574.2.patch, ZOOKEEPER-2574.3.patch, ZOOKEEPER-2574.4.patch, ZOOKEEPER-2574.5.patch, ZOOKEEPER-2574.patch
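The point settled in the logsToPurge exchange above is that `remove(int)` is declared on the `java.util.List` interface, so a variable typed as `List` can drop its first element without being declared as `ArrayList`. A minimal sketch, with a hypothetical helper name and a `File` element type assumed for illustration:

```java
import java.io.File;
import java.util.List;

/** Illustrative sketch only; the helper name is hypothetical. */
final class ListRemoveSketch {
    private ListRemoveSketch() {}

    /** Removes and returns the first element; works for any modifiable List. */
    static File takeFirst(List<File> logsToPurge) {
        return logsToPurge.remove(0);
    }
}
```

The backing list still has to be modifiable: an `ArrayList` works, whereas the fixed-size list returned by `Arrays.asList` throws `UnsupportedOperationException` on `remove`.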
[jira] [Updated] (ZOOKEEPER-2574) PurgeTxnLog can inadvertently delete required txn log files
[ https://issues.apache.org/jira/browse/ZOOKEEPER-2574?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Abhishek Rai updated ZOOKEEPER-2574:
------------------------------------
    Attachment: ZOOKEEPER-2574.4.patch

Thanks for the review [~abrahamfine]. I've applied your comments and uploaded a new patch set, please take another look.

> PurgeTxnTest.java:224 Can we change ArrayList logsToPurge back to List logsToPurge?

I switched logsToPurge from a List to an ArrayList so I can simply use remove(0) to remove the first element in the list on line 239. However, as you pointed out, this is probably not obvious given that all other lists around it are List, so I've added a comment explaining the choice.

> PurgeTxnLog.java:138 Do we need to use the FileFilter here since we do "filtering" on line 142?

Both filters are required. The FileFilter used in lines 134-138 is useful for listing all snapshot and log files with zxid >= leastZxidToBeRetain. The check on line 142 is to skip deletion of the newest log file that comes before the oldest retained snapshot. However, I agree that the logic would be simpler if all filtering were in one place, in MyFileFilter.accept(). I've moved it there now.

> PurgeTxnLog.java:148 We do logging and System.out.println for the same String, do we need both?

My goal here was to capture the output in the log file generated by the ant test run. System.out.println wasn't useful in this context. However, I needed to retain System.out.println because PurgeTxnLog can also be invoked interactively from a console. Is there a way to achieve both goals, logging and console output (preferably stdout), without any duplication?

> PurgeTxnLog can inadvertently delete required txn log files
> -----------------------------------------------------------
>
> Key: ZOOKEEPER-2574
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2574
> Project: ZooKeeper
> Issue Type: Bug
> Components: server
> Affects Versions: 3.4.7, 3.4.8, 3.5.0, 3.5.1, 3.5.2
> Environment: Zookeeper 3.4.8, standalone, and 3-server quorum
> Reporter: Abhishek Rai
> Assignee: Abhishek Rai
> Priority: Blocker
> Fix For: 3.4.10, 3.5.3
>
> Attachments: ZOOKEEPER-2574.2.patch, ZOOKEEPER-2574.3.patch, ZOOKEEPER-2574.4.patch, ZOOKEEPER-2574.patch
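The idea of keeping all filtering in one `FileFilter.accept()` method, as described in the review reply above, might look roughly like the sketch below. It is not the MyFileFilter code from the patch: the class name, constructor, and the choice that `accept` returns true for files that may be purged are assumptions for illustration, as is parsing the file-name suffix as a hexadecimal zxid.

```java
import java.io.File;
import java.io.FileFilter;
import java.util.Set;

/** Illustrative sketch only; accepts files that are candidates for purging. */
final class PurgeCandidateFilter implements FileFilter {
    private final String prefix;              // "log" or "snapshot"
    private final long leastZxidToBeRetained; // files at or above this zxid are kept
    private final Set<File> retainedFiles;    // files that must never be purged

    PurgeCandidateFilter(String prefix, long leastZxidToBeRetained, Set<File> retainedFiles) {
        this.prefix = prefix;
        this.leastZxidToBeRetained = leastZxidToBeRetained;
        this.retainedFiles = retainedFiles;
    }

    @Override
    public boolean accept(File f) {
        String name = f.getName();
        if (!name.startsWith(prefix + ".")) {
            return false; // not the kind of file this filter handles
        }
        if (retainedFiles.contains(f)) {
            return false; // e.g. the newest log before the oldest retained snapshot
        }
        try {
            // Assumption: the suffix after "prefix." is a hexadecimal zxid.
            long zxid = Long.parseLong(name.substring(prefix.length() + 1), 16);
            return zxid < leastZxidToBeRetained;
        } catch (NumberFormatException e) {
            return false; // leave unrecognized files alone
        }
    }
}
```

A filter like this could be passed to `File.listFiles(FileFilter)` so that the caller receives only deletable files and needs no second check before deleting.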
[jira] [Commented] (ZOOKEEPER-2574) PurgeTxnLog can inadvertently delete required txn log files
[ https://issues.apache.org/jira/browse/ZOOKEEPER-2574?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15483721#comment-15483721 ]

Abhishek Rai commented on ZOOKEEPER-2574:
-----------------------------------------

Thanks for the references [~rakeshr].

The learner writes the snapshot in response to the NEWLEADER message received from the leader. Based on my understanding, this is because the leader could be ahead of the learner - meaning that the learner is missing some transactions that the leader has. So receiving a snapshot and committing it locally is a valid option for the learner to catch up and join the quorum. However, going forward it will receive subsequent transactions, which as you mentioned get appended to the existing txn log file.

It seems a log rollover could have been done before snapshotting in the learner, but perhaps changing behaviors at this point is not worth it given the need to support old behavior too?

> PurgeTxnLog can inadvertently delete required txn log files
> -----------------------------------------------------------
>
> Key: ZOOKEEPER-2574
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2574
> Project: ZooKeeper
> Issue Type: Bug
> Components: server
> Affects Versions: 3.4.7, 3.4.8, 3.5.0, 3.5.1, 3.5.2
> Environment: Zookeeper 3.4.8, standalone, and 3-server quorum
> Reporter: Abhishek Rai
> Assignee: Abhishek Rai
> Priority: Blocker
> Fix For: 3.4.10, 3.5.3
>
> Attachments: ZOOKEEPER-2574.2.patch, ZOOKEEPER-2574.3.patch, ZOOKEEPER-2574.patch
[jira] [Updated] (ZOOKEEPER-2574) PurgeTxnLog can inadvertently delete required txn log files
[ https://issues.apache.org/jira/browse/ZOOKEEPER-2574?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Abhishek Rai updated ZOOKEEPER-2574:
------------------------------------
    Attachment: ZOOKEEPER-2574.3.patch

Thanks [~arshad.mohammad] for the review. I've applied your suggestions and uploaded the latest patch.

Also, I noticed that on Hadoop QA, a test is failing (org.apache.zookeeper.test.QuorumTest) but I cannot reproduce this failure locally and it also seems unrelated.

Thanks!

> PurgeTxnLog can inadvertently delete required txn log files
> -----------------------------------------------------------
>
> Key: ZOOKEEPER-2574
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2574
> Project: ZooKeeper
> Issue Type: Bug
> Components: server
> Affects Versions: 3.4.7, 3.4.8, 3.5.0, 3.5.1, 3.5.2
> Environment: Zookeeper 3.4.8, standalone, and 3-server quorum
> Reporter: Abhishek Rai
> Assignee: Abhishek Rai
> Priority: Blocker
> Fix For: 3.4.10, 3.5.3
>
> Attachments: ZOOKEEPER-2574.2.patch, ZOOKEEPER-2574.3.patch, ZOOKEEPER-2574.patch
[jira] [Updated] (ZOOKEEPER-2310) Snapshot files must be synced to prevent inconsistency or data loss
[ https://issues.apache.org/jira/browse/ZOOKEEPER-2310?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Abhishek Rai updated ZOOKEEPER-2310:
------------------------------------
    Attachment: ZOOKEEPER-2310.3.patch

> Snapshot files must be synced to prevent inconsistency or data loss
> -------------------------------------------------------------------
>
> Key: ZOOKEEPER-2310
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2310
> Project: ZooKeeper
> Issue Type: Bug
> Components: server
> Affects Versions: 3.4.6
> Reporter: Abhishek Rai
> Assignee: Abhishek Rai
> Attachments: ZOOKEEPER-2310.3.patch, zookeeper-2310-version-2.patch, zookeeper-2310.patch
>
>
> Today, Zookeeper server syncs transaction log files to disk by default, but does not sync snapshot files. Consequently, an untimely crash may result in a lost or incomplete snapshot file. During recovery, if the server finds a valid older snapshot file, it will load it and replay subsequent log(s), skipping the incomplete snapshot file. It's possible that the skipped file had some transactions which are not present in the replayed transaction logs. Since quorum synchronization is based on last transaction ID of each server, this will never get noticed, resulting in inconsistency between servers and possible data loss.
> Following sequence of events describes a sample scenario where this can happen:
> # Server F is a follower in a Zookeeper ensemble.
> # F's most recent valid snapshot file is named "snapshot.10" containing state up to zxid = 10. F is currently writing to the transaction log file "log.11", with the most recent zxid = 20.
> # Fresh round of election.
> # F receives a few new transactions 21 to 30 from new leader L as the "diff". Current server behavior is to dump current state plus diff to a new snapshot file, "snapshot.30".
> # F finalizes the snapshot file, but file contents are still buffered in OS caches. Zookeeper does not sync snapshot file contents to disk.
> # F receives a new transaction 31 from the leader, which it appends to the existing transaction log file, "log.11" and syncs the file to disk.
> # Server machine crashes or is cold rebooted.
> # After recovery, snapshot file "snapshot.30" may not exist or may be empty. See below for why that may happen.
> # In either case, F looks for the last finalized snapshot file, finds and loads "snapshot.10". It then replays transactions from "log.11". Ultimately, its last seen zxid will be 31, but it would not have replayed transactions 21 to 30 received via the "diff" from the leader.
> # Clients which are connected to F may see different data than clients connected to other members of the ensemble, violating single system image invariant. Also, if F were to become a leader at some point, it could use its state to seed other servers, and they all could lose the writes in the missing interval above.
> *Notes:*
> - Reason why snapshot file may be missing or incomplete:
> -- Zookeeper does not sync the data directory after creating a snapshot file. Even if a newly created file is synced to disk, if the corresponding directory entry is not, then the file will not be visible in the namespace.
> -- Zookeeper does not sync snapshot files. So, they may be empty or incomplete during recovery from an untimely crash.
> - In step (6) above, the server could also have written the new transaction 31 to a new log file, "log.31". The final outcome would still be the same.
> We are able to deterministically reproduce this problem using the following steps:
> # Create a new Zookeeper ensemble on 3 hosts: A, B, and C.
> # Ensure each server has at least one snapshot file in its data dir.
> # Stop Zookeeper process on server A.
> # Slow down disk syncs on server A (see example script below). This ensures that snapshot files written by Zookeeper don't make it to disk spontaneously. Log files will be written to disk as Zookeeper explicitly issues a sync call on such files.
> # Connect to server B and create a new znode /test1.
> # Start Zookeeper process on A, wait for it to write a new snapshot to its datadir. This snapshot would contain /test1 but it won’t be synced to disk yet.
> # Connect to A and verify that /test1 is visible.
> # Connect to B and create another znode /test2. This will cause A’s transaction log to grow further to receive /test2.
> # Cold reboot A.
> # A’s last snapshot is a zero-sized file or is missing altogether since it did not get synced to disk before reboot. We have seen both in different runs.
> # Connect to A and verify that /test1 does not exist. It exists on B and C.
> Slowing down disk syncs:
> {noformat}
> echo 36 | sudo tee /proc/sys/vm/dirty_writeback_centisecs
> echo 36 | sudo tee
[jira] [Updated] (ZOOKEEPER-2574) PurgeTxnLog can inadvertently delete required txn log files
[ https://issues.apache.org/jira/browse/ZOOKEEPER-2574?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Abhishek Rai updated ZOOKEEPER-2574:
    Attachment: (was: ZOOKEEPER-2574.2.patch)

> PurgeTxnLog can inadvertently delete required txn log files
> ---
>
> Key: ZOOKEEPER-2574
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2574
> Project: ZooKeeper
> Issue Type: Bug
> Components: server
> Affects Versions: 3.4.7, 3.4.8, 3.5.0, 3.5.1, 3.5.2
> Environment: Zookeeper 3.4.8, standalone, and 3-server quorum
> Reporter: Abhishek Rai
> Assignee: Abhishek Rai
> Priority: Blocker
> Fix For: 3.4.10, 3.5.3
>
> Attachments: ZOOKEEPER-2574.2.patch, ZOOKEEPER-2574.patch
>
>
> As part of the fix for ZOOKEEPER-1797, the call to
> FileTxnSnapLog.getSnapshotLogs() was removed from PurgeTxnLog.java. As a
> result, some old-looking but required txn log files can be deleted, resulting
> in data corruption or loss.
> For example, consider the following:
> 1. Configuration:
> autopurge.snapRetainCount=3
> 2. Following files exist:
> log.100 spans transactions from zxid=100 till zxid=140 (inclusive)
> snapshot.110 - snapshot as of zxid=110
> snapshot.120 - snapshot as of zxid=120
> snapshot.130 - snapshot as of zxid=130
> Above scenario is possible when snapshotting has happened multiple times but
> without accompanying log rollover, which is possible if the server was
> running as a learner.
> 3. PurgeTxnLog retains all snapshots but deletes log.100 because its zxid is
> older than the zxid of the oldest snapshot (110). This results in loss of
> transactions in the range 131-140.
> Before the fix for ZOOKEEPER-1797, this was avoided by the call to
> FileTxnSnapLog.getSnapshotLogs() which finds and retains the newest txn log
> file with starting zxid < oldest retained snapshot's highest zxid.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
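The retention rule stated at the end of the description above can be sketched in a few lines of Java. This is only an illustration of the rule as worded in the issue, not the code from PurgeTxnLog or FileTxnSnapLog: the class name `PurgeRetentionSketch` and the method `logsToRetain` are hypothetical, and log files are reduced to their starting zxids (the numeric suffix of names such as log.100).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

/** Hypothetical helper illustrating the retention rule; not ZooKeeper's implementation. */
final class PurgeRetentionSketch {

    /**
     * Returns the starting zxids of the txn log files that must survive a purge.
     *
     * @param logStartZxids          starting zxid of every log file on disk
     * @param oldestRetainedSnapZxid zxid of the oldest snapshot that is being kept
     */
    static List<Long> logsToRetain(List<Long> logStartZxids, long oldestRetainedSnapZxid) {
        List<Long> sorted = new ArrayList<>(logStartZxids);
        Collections.sort(sorted);

        // Newest log whose starting zxid is below the oldest retained snapshot.
        // That file may still hold transactions newer than every snapshot.
        Long floor = null;
        for (long start : sorted) {
            if (start < oldestRetainedSnapZxid) {
                floor = start;
            }
        }

        // Keep that log and everything newer; only strictly older logs are purgeable.
        List<Long> keep = new ArrayList<>();
        for (long start : sorted) {
            if (floor == null || start >= floor) {
                keep.add(start);
            }
        }
        return keep;
    }

    public static void main(String[] args) {
        // Scenario from the issue: only log.100 exists, oldest retained snapshot is 110.
        System.out.println(logsToRetain(Arrays.asList(100L), 110L));            // [100]
        // With an additional older and a newer log, only log.50 may be deleted.
        System.out.println(logsToRetain(Arrays.asList(50L, 100L, 150L), 110L)); // [100, 150]
    }
}
```

Comparing only the log's starting zxid against the snapshot zxid, without this "newest log below the snapshot" step, is what lets log.100 be deleted in the example even though it holds transactions 131-140.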
[jira] [Updated] (ZOOKEEPER-2574) PurgeTxnLog can inadvertently delete required txn log files
[ https://issues.apache.org/jira/browse/ZOOKEEPER-2574?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Abhishek Rai updated ZOOKEEPER-2574:
    Attachment: ZOOKEEPER-2574.2.patch

> PurgeTxnLog can inadvertently delete required txn log files
> ---
>
> Key: ZOOKEEPER-2574
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2574
> Project: ZooKeeper
> Issue Type: Bug
> Components: server
> Affects Versions: 3.4.7, 3.4.8, 3.5.0, 3.5.1, 3.5.2
> Environment: Zookeeper 3.4.8, standalone, and 3-server quorum
> Reporter: Abhishek Rai
> Assignee: Abhishek Rai
> Priority: Blocker
> Fix For: 3.4.10, 3.5.3
>
> Attachments: ZOOKEEPER-2574.2.patch, ZOOKEEPER-2574.patch
[jira] [Updated] (ZOOKEEPER-2574) PurgeTxnLog can inadvertently delete required txn log files
[ https://issues.apache.org/jira/browse/ZOOKEEPER-2574?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Abhishek Rai updated ZOOKEEPER-2574:
    Attachment: ZOOKEEPER-2574.2.patch

Uploading patch for trunk, previous patch does not work on trunk (works on 3.4.8 and 3.5.2).

> PurgeTxnLog can inadvertently delete required txn log files
> ---
>
> Key: ZOOKEEPER-2574
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2574
> Project: ZooKeeper
> Issue Type: Bug
> Components: server
> Affects Versions: 3.4.7, 3.4.8, 3.5.0, 3.5.1, 3.5.2
> Environment: Zookeeper 3.4.8, standalone, and 3-server quorum
> Reporter: Abhishek Rai
> Assignee: Abhishek Rai
> Priority: Blocker
> Fix For: 3.4.10, 3.5.3
>
> Attachments: ZOOKEEPER-2574.2.patch, ZOOKEEPER-2574.patch
[jira] [Updated] (ZOOKEEPER-2574) PurgeTxnLog can inadvertently delete required txn log files
[ https://issues.apache.org/jira/browse/ZOOKEEPER-2574?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Abhishek Rai updated ZOOKEEPER-2574:
    Attachment: (was: ZOOKEEPER-2574.patch)

> PurgeTxnLog can inadvertently delete required txn log files
> ---
>
> Key: ZOOKEEPER-2574
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2574
> Project: ZooKeeper
> Issue Type: Bug
> Components: server
> Affects Versions: 3.4.7, 3.4.8, 3.5.0, 3.5.1, 3.5.2
> Environment: Zookeeper 3.4.8, standalone, and 3-server quorum
> Reporter: Abhishek Rai
> Assignee: Abhishek Rai
> Priority: Blocker
> Fix For: 3.4.10, 3.5.3
>
> Attachments: ZOOKEEPER-2574.patch
[jira] [Updated] (ZOOKEEPER-2574) PurgeTxnLog can inadvertently delete required txn log files
[ https://issues.apache.org/jira/browse/ZOOKEEPER-2574?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Abhishek Rai updated ZOOKEEPER-2574:
    Attachment: ZOOKEEPER-2574.patch

> PurgeTxnLog can inadvertently delete required txn log files
> ---
>
> Key: ZOOKEEPER-2574
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2574
> Project: ZooKeeper
> Issue Type: Bug
> Components: server
> Affects Versions: 3.4.7, 3.4.8, 3.5.0, 3.5.1, 3.5.2
> Environment: Zookeeper 3.4.8, standalone, and 3-server quorum
> Reporter: Abhishek Rai
> Assignee: Abhishek Rai
> Priority: Blocker
> Fix For: 3.4.10, 3.5.3
>
> Attachments: ZOOKEEPER-2574.patch, ZOOKEEPER-2574.patch
[jira] [Commented] (ZOOKEEPER-2574) PurgeTxnLog can inadvertently delete required txn log files
[ https://issues.apache.org/jira/browse/ZOOKEEPER-2574?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15481129#comment-15481129 ]

Abhishek Rai commented on ZOOKEEPER-2574:

Thanks [~phunt], I've uploaded a fix and unittest. Without the fix, the unittest fails in the assertion below.
{noformat}
/**
 * Verify that the last znode that was created above exists. This znode's creation was
 * captured by the transaction log which was created before any of the above
 * SNAP_RETAIN_COUNT snapshots were created, but it's not captured in any of these
 * snapshots. So for it to exist, the (only) existing log file should not have been purged.
 */
final String lastZnode = "/snap-" + (unique - 1);
final Stat stat = zk.exists(lastZnode, false);
Assert.assertNotNull("Last znode does not exist: " + lastZnode, stat);
{noformat}

> PurgeTxnLog can inadvertently delete required txn log files
> ---
>
> Key: ZOOKEEPER-2574
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2574
> Project: ZooKeeper
> Issue Type: Bug
> Components: server
> Affects Versions: 3.4.7, 3.4.8, 3.5.0, 3.5.1, 3.5.2
> Environment: Zookeeper 3.4.8, standalone, and 3-server quorum
> Reporter: Abhishek Rai
> Assignee: Abhishek Rai
> Priority: Blocker
> Fix For: 3.4.10, 3.5.3
>
> Attachments: ZOOKEEPER-2574.patch
[jira] [Updated] (ZOOKEEPER-2574) PurgeTxnLog can inadvertently delete required txn log files
[ https://issues.apache.org/jira/browse/ZOOKEEPER-2574?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Abhishek Rai updated ZOOKEEPER-2574:
    Attachment: ZOOKEEPER-2574.patch

Fix and unittest for ZOOKEEPER-2574.

> PurgeTxnLog can inadvertently delete required txn log files
> ---
>
> Key: ZOOKEEPER-2574
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2574
> Project: ZooKeeper
> Issue Type: Bug
> Components: server
> Affects Versions: 3.4.7, 3.4.8, 3.5.0, 3.5.1, 3.5.2
> Environment: Zookeeper 3.4.8, standalone, and 3-server quorum
> Reporter: Abhishek Rai
> Priority: Blocker
> Fix For: 3.4.10, 3.5.3
>
> Attachments: ZOOKEEPER-2574.patch
[jira] [Updated] (ZOOKEEPER-2574) PurgeTxnLog can inadvertently delete required txn log files
[ https://issues.apache.org/jira/browse/ZOOKEEPER-2574?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Abhishek Rai updated ZOOKEEPER-2574:
    Description:
As part of the fix for ZOOKEEPER-1797, the call to FileTxnSnapLog.getSnapshotLogs() was removed from PurgeTxnLog.java. As a result, some old-looking but required txn log files can be deleted, resulting in data corruption or loss.

For example, consider the following:
1. Configuration:
autopurge.snapRetainCount=3
2. Following files exist:
log.100 spans transactions from zxid=100 till zxid=140 (inclusive)
snapshot.110 - snapshot as of zxid=110
snapshot.120 - snapshot as of zxid=120
snapshot.130 - snapshot as of zxid=130
Above scenario is possible when snapshotting has happened multiple times but without accompanying log rollover, which is possible if the server was running as a learner.
3. PurgeTxnLog retains all snapshots but deletes log.100 because its zxid is older than the zxid of the oldest snapshot (110). This results in loss of transactions in the range 131-140.

Before the fix for ZOOKEEPER-1797, this was avoided by the call to FileTxnSnapLog.getSnapshotLogs() which finds and retains the newest txn log file with starting zxid < oldest retained snapshot's highest zxid.

  was:
As part of the fix for ZOOKEEPER-1797, the call to FileTxnSnapLog.getSnapshotLogs() was removed from PurgeTxnLog.java. As a result, some old-looking but required txn log files can be deleted, resulting in data corruption or loss.

For example, consider the following:
1. Configuration:
autopurge.snapRetainCount=3
2. Following files exist:
log.100 spans transactions from zxid=100 till zxid=140 (inclusive)
snapshot.110 - snapshot as of zxid=110
snapshot.120 - snapshot as of zxid=120
snapshot.130 - snapshot as of zxid=130
3. PurgeTxnLog retains all snapshots but deletes log.100 because its zxid is older than the zxid of the oldest snapshot (110). This results in loss of transactions in the range 131-140.

Before the fix for ZOOKEEPER-1797, this was avoided by the call to FileTxnSnapLog.getSnapshotLogs() which finds the newest txn log file with starting zxid < snapshot zxid.

> PurgeTxnLog can inadvertently delete required txn log files
> ---
>
> Key: ZOOKEEPER-2574
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2574
> Project: ZooKeeper
> Issue Type: Bug
> Components: server
> Affects Versions: 3.4.7, 3.4.8, 3.5.0, 3.5.1, 3.5.2
> Environment: Zookeeper 3.4.8, standalone, and 3-server quorum
> Reporter: Abhishek Rai
> Priority: Critical
[jira] [Created] (ZOOKEEPER-2574) PurgeTxnLog can inadvertently delete required txn log files
Abhishek Rai created ZOOKEEPER-2574:
---

Summary: PurgeTxnLog can inadvertently delete required txn log files
Key: ZOOKEEPER-2574
URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2574
Project: ZooKeeper
Issue Type: Bug
Components: server
Affects Versions: 3.5.2, 3.5.1, 3.5.0, 3.4.8, 3.4.7
Environment: Zookeeper 3.4.8, standalone, and 3-server quorum
Reporter: Abhishek Rai
Priority: Critical

As part of the fix for ZOOKEEPER-1797, the call to FileTxnSnapLog.getSnapshotLogs() was removed from PurgeTxnLog.java. As a result, some old-looking but required txn log files can be deleted, resulting in data corruption or loss.

For example, consider the following:
1. Configuration:
autopurge.snapRetainCount=3
2. Following files exist:
log.100 spans transactions from zxid=100 till zxid=140 (inclusive)
snapshot.110 - snapshot as of zxid=110
snapshot.120 - snapshot as of zxid=120
snapshot.130 - snapshot as of zxid=130
3. PurgeTxnLog retains all snapshots but deletes log.100 because its zxid is older than the zxid of the oldest snapshot (110). This results in loss of transactions in the range 131-140.

Before the fix for ZOOKEEPER-1797, this was avoided by the call to FileTxnSnapLog.getSnapshotLogs() which finds the newest txn log file with starting zxid < snapshot zxid.
[jira] [Commented] (ZOOKEEPER-2310) Snapshot files must be synced to prevent inconsistency or data loss
[ https://issues.apache.org/jira/browse/ZOOKEEPER-2310?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15269602#comment-15269602 ]

Abhishek Rai commented on ZOOKEEPER-2310:

Thanks for bringing this up [~zhangyongxyz]. As you pointed out, FileChannel does not provide a way of accomplishing this in Windows. There are conflicting opinions online about whether it's even necessary for Windows based on how it automatically handles updates to folders. I've provided a modified patch (zookeeper-2310-version-2.patch) which skips syncing of directory on Windows. The pattern I used has been used elsewhere in Zookeeper source, so should be safe.

> Snapshot files must be synced to prevent inconsistency or data loss
> ---
>
> Key: ZOOKEEPER-2310
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2310
> Project: ZooKeeper
> Issue Type: Bug
> Components: server
> Affects Versions: 3.4.6
> Reporter: Abhishek Rai
> Assignee: Abhishek Rai
> Attachments: zookeeper-2310-version-2.patch, zookeeper-2310.patch
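For readers without the patch at hand, the approach discussed in the comment above (sync the snapshot file, then sync its directory, skipping the directory sync on Windows) might look roughly like the following. This is a sketch under assumptions, not the contents of zookeeper-2310-version-2.patch: the class `DurableSnapshotSketch`, the method `writeDurably`, and the `os.name` check are all hypothetical choices made for illustration. It relies on the fact that, on Linux, a directory can be opened read-only through `FileChannel.open` and flushed with `force(true)`.

```java
import java.io.FileOutputStream;
import java.io.IOException;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

/** Hypothetical sketch of a durable file write; not the code from the attached patch. */
final class DurableSnapshotSketch {

    /** Assumed OS check for illustration; the patch may detect Windows differently. */
    static boolean isWindows() {
        return System.getProperty("os.name", "").toLowerCase().contains("windows");
    }

    /** Writes the bytes, syncs the file, then syncs the parent directory entry. */
    static void writeDurably(Path file, byte[] contents) throws IOException {
        try (FileOutputStream out = new FileOutputStream(file.toFile())) {
            out.write(contents);
            // Push file data and metadata out of the OS cache to the device.
            out.getChannel().force(true);
        }
        if (!isWindows()) {
            // Without this, the file contents may be on disk while the directory
            // entry naming the file is not, so the file can vanish after a crash.
            Path parent = file.toAbsolutePath().getParent();
            try (FileChannel dir = FileChannel.open(parent, StandardOpenOption.READ)) {
                dir.force(true);
            }
        }
    }
}
```

Whether the directory sync succeeds can depend on the platform and filesystem, so a production version would likely need more careful error handling than this sketch shows.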
[jira] [Updated] (ZOOKEEPER-2310) Snapshot files must be synced to prevent inconsistency or data loss
[ https://issues.apache.org/jira/browse/ZOOKEEPER-2310?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Abhishek Rai updated ZOOKEEPER-2310:
    Attachment: zookeeper-2310-version-2.patch

> Snapshot files must be synced to prevent inconsistency or data loss
> ---
>
> Key: ZOOKEEPER-2310
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2310
> Project: ZooKeeper
> Issue Type: Bug
> Components: server
> Affects Versions: 3.4.6
> Reporter: Abhishek Rai
> Assignee: Abhishek Rai
> Attachments: zookeeper-2310-version-2.patch, zookeeper-2310.patch
[jira] [Created] (ZOOKEEPER-2310) Snapshot files must be synced to prevent inconsistency or data loss
Abhishek Rai created ZOOKEEPER-2310:
---

             Summary: Snapshot files must be synced to prevent inconsistency or data loss
                 Key: ZOOKEEPER-2310
                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2310
             Project: ZooKeeper
          Issue Type: Bug
          Components: server
    Affects Versions: 3.4.6
            Reporter: Abhishek Rai

Today, the ZooKeeper server syncs transaction log files to disk by default, but does not sync snapshot files. Consequently, an untimely crash may result in a lost or incomplete snapshot file. During recovery, if the server finds a valid older snapshot file, it will load it and replay subsequent log(s), skipping the incomplete snapshot file. It's possible that the skipped file had some transactions which are not present in the replayed transaction logs. Since quorum synchronization is based on the last transaction ID of each server, this will never get noticed, resulting in inconsistency between servers and possible data loss.

The following sequence of events describes a sample scenario where this can happen:
# Server F is a follower in a ZooKeeper ensemble.
# F's most recent valid snapshot file is named "snapshot.10", containing state up to zxid = 10. F is currently writing to the transaction log file "log.11", with the most recent zxid = 20.
# Fresh round of election.
# F receives a few new transactions 21 to 30 from the new leader L as the "diff". Current server behavior is to dump current state plus diff to a new snapshot file, "snapshot.30".
# F finalizes the snapshot file, but the file contents are still buffered in OS caches. ZooKeeper does not sync snapshot file contents to disk.
# F receives a new transaction 31 from the leader, which it appends to the existing transaction log file, "log.11", and syncs the file to disk.
# Server machine crashes or is cold rebooted.
# After recovery, snapshot file "snapshot.30" may not exist or may be empty. See below for why that may happen.
# In either case, F looks for the last finalized snapshot file, finds and loads "snapshot.10". It then replays transactions from "log.11". Ultimately, its last seen zxid will be 31, but it would not have replayed transactions 21 to 30 received via the "diff" from the leader.
# Clients which are connected to F may see different data than clients connected to other members of the ensemble, violating the single system image invariant. Also, if F were to become a leader at some point, it could use its state to seed other servers, and they all could lose the writes in the missing interval above.

*Notes:*
- Reasons why the snapshot file may be missing or incomplete:
-- ZooKeeper does not sync the data directory after creating a snapshot file. Even if a newly created file is synced to disk, if the corresponding directory entry is not, then the file will not be visible in the namespace.
-- ZooKeeper does not sync snapshot files. So, they may be empty or incomplete during recovery from an untimely crash.
- In step (6) above, the server could also have written the new transaction 31 to a new log file, "log.31". The final outcome would still be the same.

We are able to deterministically reproduce this problem using the following steps:
# Create a new ZooKeeper ensemble on 3 hosts: A, B, and C.
# Ensure each server has at least one snapshot file in its data dir.
# Stop the ZooKeeper process on server A.
# Slow down disk syncs on server A (see example script below). This ensures that snapshot files written by ZooKeeper don't make it to disk spontaneously. Log files will still be written to disk, as ZooKeeper explicitly issues a sync call on such files.
# Connect to server B and create a new znode /test1.
# Start the ZooKeeper process on A and wait for it to write a new snapshot to its datadir. This snapshot would contain /test1 but it won’t be synced to disk yet.
# Connect to A and verify that /test1 is visible.
# Connect to B and create another znode /test2. This will cause A’s transaction log to grow further to receive /test2.
# Cold reboot A.
# A’s last snapshot is a zero-sized file or is missing altogether, since it did not get synced to disk before the reboot. We have seen both in different runs.
# Connect to A and verify that /test1 does not exist. It exists on B and C.

Slowing down disk syncs:
{noformat}
echo 36 | sudo tee /proc/sys/vm/dirty_writeback_centisecs
echo 36 | sudo tee /proc/sys/vm/dirty_expire_centisecs
echo 99 | sudo tee /proc/sys/vm/dirty_background_ratio
echo 99 | sudo tee /proc/sys/vm/dirty_ratio
{noformat}

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
[jira] [Updated] (ZOOKEEPER-2310) Snapshot files must be synced to prevent inconsistency or data loss
[ https://issues.apache.org/jira/browse/ZOOKEEPER-2310?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Abhishek Rai updated ZOOKEEPER-2310:
    Attachment: zookeeper-2310.patch

Patch for above issue which:
# Syncs snapshot file
# Syncs snapshot directory
# Debug log message about snapshot file once written.

> Snapshot files must be synced to prevent inconsistency or data loss
> ---
>
>                 Key: ZOOKEEPER-2310
>                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2310
>             Project: ZooKeeper
>          Issue Type: Bug
>          Components: server
>    Affects Versions: 3.4.6
>            Reporter: Abhishek Rai
>         Attachments: zookeeper-2310.patch
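For readers without the attachment at hand, the two sync steps that the patch for this issue is described as adding (sync the snapshot file, then sync its parent directory) can be sketched roughly as below. This is only an illustration under stated assumptions, not the attached patch: the class and method names (SnapshotSyncSketch, writeAndSync, trySyncDirectory) are hypothetical, and it uses plain java.io / java.nio calls rather than ZooKeeper's own snapshot code. Opening a directory for reading and calling force() on it works on Linux; other platforms may refuse, so the directory sync is best-effort here.

```java
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.StandardOpenOption;

// Hypothetical sketch: names are illustrative and not taken from ZooKeeper.
final class SnapshotSyncSketch {

    // Writes the bytes and forces file data and metadata to the storage device
    // before the stream is closed, so a crash afterwards cannot leave the file empty.
    static void writeAndSync(File file, byte[] data) throws IOException {
        try (FileOutputStream out = new FileOutputStream(file)) {
            out.write(data);
            out.flush();
            out.getChannel().force(true);
        }
    }

    // Forces the directory entry to disk so the new file stays visible after a crash.
    // Returns false if the platform does not allow syncing a directory this way.
    static boolean trySyncDirectory(File dir) {
        try (FileChannel ch = FileChannel.open(dir.toPath(), StandardOpenOption.READ)) {
            ch.force(true);
            return true;
        } catch (IOException e) {
            return false;
        }
    }

    public static void main(String[] args) throws IOException {
        File dir = Files.createTempDirectory("snapdir").toFile();
        File snapshot = new File(dir, "snapshot.30"); // file name borrowed from the scenario
        writeAndSync(snapshot, new byte[] {1, 2, 3});
        System.out.println("directory synced: " + trySyncDirectory(dir));
    }
}
```

The order matters in this sketch: the file contents are forced first, and the directory second, so that a directory entry never becomes durable while pointing at a file whose contents are still only in OS caches.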
[jira] [Commented] (ZOOKEEPER-2310) Snapshot files must be synced to prevent inconsistency or data loss
[ https://issues.apache.org/jira/browse/ZOOKEEPER-2310?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14984796#comment-14984796 ]

Abhishek Rai commented on ZOOKEEPER-2310:
---

Thanks for your response [~fpj].

I think my claim that the "diff" is present in the snapshot but not in the log is incorrect. When pushing a diff, the leader (LearnerHandler) pushes individual transactions, which the follower writes to its log (Learner.syncWithLeader). The leader eventually sends a "NEWLEADER", and in response the follower takes a snapshot. Ultimately, the diff is visible in both the log and the snapshot.

But consider the case of the leader (LearnerHandler) pushing a full snapshot to the follower. In this case, the follower does not receive the individual transactions contributing to that snapshot. In fact, it's not practical to do so: by design, the snapshot is sent when the diff is too large. Thus, the follower can have a snapshot which reflects some transactions that are not present in the log. After writing the snapshot, the follower continues writing subsequent transactions to the log.

Imagine a crash + recovery is induced at this point, such that the latest snapshot file is incomplete or non-existent. The follower would try to load the preceding healthy snapshot and replay the log from that point on. Since the log does not contain some of the transactions reflected in the missing snapshot file, the follower would never find out about them. This would cause the inconsistency scenario I described above.

Without syncing the snapshot file (and its parent directory) to disk, we cannot guarantee that the snapshot file exists during recovery. And the loss of a finalized snapshot file can result in data loss, since not all transactions may be present in the log.
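The gap described in this comment can be shown with a small toy model. It is a sketch under simplifying assumptions, not ZooKeeper's recovery code: zxids are treated as consecutive integers, a snapshot at zxid S is assumed to cover every transaction up to S, and the names (RecoveryGapSketch, missingAfterRecovery) are hypothetical. The numbers reuse the ones from the issue description: snapshot.10 survives, snapshot.30 arrived as a full snapshot from the leader and was lost, and the log holds 11 to 20 plus 31.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.TreeSet;

// Toy model only: real zxids are not guaranteed to be consecutive.
final class RecoveryGapSketch {

    // Loads the newest surviving snapshot, replays every logged zxid after it,
    // and returns the zxids between the snapshot and the last replayed
    // transaction that neither the snapshot nor the log accounts for.
    static List<Long> missingAfterRecovery(List<Long> validSnapshotZxids, List<Long> loggedZxids) {
        long base = 0L;
        for (long s : validSnapshotZxids) {
            base = Math.max(base, s);
        }
        TreeSet<Long> replayed = new TreeSet<>();
        for (long z : loggedZxids) {
            if (z > base) {
                replayed.add(z);
            }
        }
        List<Long> missing = new ArrayList<>();
        if (replayed.isEmpty()) {
            return missing;
        }
        long last = replayed.last();
        for (long z = base + 1; z < last; z++) {
            if (!replayed.contains(z)) {
                missing.add(z);
            }
        }
        return missing;
    }

    public static void main(String[] args) {
        List<Long> log = new ArrayList<>();
        for (long z = 11; z <= 20; z++) {
            log.add(z);
        }
        log.add(31L);

        // snapshot.30 lost: only snapshot.10 is valid, so 21..30 are never seen.
        System.out.println(missingAfterRecovery(Arrays.asList(10L), log));
        // prints [21, 22, 23, 24, 25, 26, 27, 28, 29, 30]

        // snapshot.30 survived: replay starts at 31 and nothing is missing.
        System.out.println(missingAfterRecovery(Arrays.asList(10L, 30L), log));
        // prints []
    }
}
```

In both runs the follower ends at zxid 31, which is why a comparison based only on the last transaction ID cannot tell the two outcomes apart.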
> Snapshot files must be synced to prevent inconsistency or data loss
> ---
>
>                 Key: ZOOKEEPER-2310
>                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2310
>             Project: ZooKeeper
>          Issue Type: Bug
>          Components: server
>    Affects Versions: 3.4.6
>            Reporter: Abhishek Rai
>            Assignee: Abhishek Rai
>         Attachments: zookeeper-2310.patch