[ https://issues.apache.org/jira/browse/ZOOKEEPER-4744?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Maria Ramos updated ZOOKEEPER-4744:
-----------------------------------
    Description: 
The underlying issue stems from consecutive writes to the log file that are not interleaved with {{fsync}} operations. Without an intervening {{fsync}}, the operating system may persist such writes out of order. This behavior is well documented, and several references address it:
 - [https://www.usenix.org/conference/osdi14/technical-sessions/presentation/pillai]
 - [https://dl.acm.org/doi/pdf/10.1145/2872362.2872406]
 - [https://mariadb.com/kb/en/atomic-write-support/]
 - [https://pages.cs.wisc.edu/~remzi/OSTEP/file-journaling.pdf] (Page 9)

This issue can be replicated using [LazyFS|https://github.com/dsrhaslab/lazyfs], a file system capable of simulating power failures and exhibiting the OS behavior mentioned above, i.e., out-of-order file writes at the disk level. LazyFS persists these writes out of order and then crashes to simulate a power failure.

To reproduce this problem, follow these steps:

*1*. LazyFS is mounted on the directory where ZooKeeper data will be saved, with a specified root directory. Assuming the data path for ZooKeeper is {{/home/data/zk}} and the root directory is {{/home/data/zk-root}}, add the following lines to the default configuration file (located at {{config/default.toml}}):

{{[[injection]]}}
{{type="reorder"}}
{{occurrence=1}}
{{op="write"}}
{{file="/home/data/zk-root/version-2/log.100000001"}}
{{persist=[3]}}

These lines define the fault to be injected. A power failure will be simulated after the third write to the {{/home/data/zk-root/version-2/log.100000001}} file. The {{occurrence}} parameter selects the first group of consecutive writes to this file, as there may be more than one such group.

*2*. Start LazyFS as the underlying file system of a node in the cluster with the following command:

{{./scripts/mount-lazyfs.sh -c config/default.toml -m /home/data/zk -r /home/data/zk-root -f}}

*3*. Start ZooKeeper with the command:

{{apache-zookeeper-3.7.1-bin/bin/zkServer.sh start-foreground}}

*4*. Connect a client to the node that has LazyFS as the underlying file system:

{{apache-zookeeper-3.7.1-bin/bin/zkCli.sh -server 127.0.0.1:2181}}

Immediately after this step, LazyFS will be unmounted, simulating a power failure, and ZooKeeper will keep printing error messages in the terminal, requiring a forced shutdown.

At this point, one can analyze the logs produced by LazyFS to examine the system calls issued up to the moment of the fault. Here is a simplified version of the log:

{'syscall': 'create', 'path': '/home/gsd/data/zk37-root/version-2/log.100000001', 'mode': 'O_TRUNC'}
{'syscall': 'write', 'path': '/home/data/zk37-root/version-2/log.100000001', 'size': '16', 'off': '0'}
{'syscall': 'write', 'path': '/home/data/zk37-root/version-2/log.100000001', 'size': '1', 'off': '67108879'}
{'syscall': 'write', 'path': '/home/data/zk37-root/version-2/log.100000001', 'size': '67108863', 'off': '16'}
{'syscall': 'write', 'path': '/home/data/zk37-root/version-2/log.100000001', 'size': '61', 'off': '16'}

Note that the third write is issued by LazyFS for padding.

*5*. Remove the fault from the configuration file and unmount the file system with:

{{fusermount -uz /home/data/zk}}

*6*. Mount LazyFS again with the previously provided command.

*7*. Attempt to start ZooKeeper (it fails).

By following these steps, one can replicate the issue and analyze the effects of the power failure on ZooKeeper's restart process.
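
The effect of the injected fault on the log file can be estimated from the simplified syscall log above. The following Python sketch is illustrative only and is not part of LazyFS or ZooKeeper. It assumes that {{persist=[3]}} means only the third write of the group reaches disk while the other writes in the group are lost; the names {{Write}}, {{persisted_ranges}} and {{covers}} are hypothetical helpers introduced here.

```python
from typing import List, NamedTuple, Tuple


class Write(NamedTuple):
    size: int  # number of bytes written
    off: int   # file offset of the write


# The four consecutive writes from the simplified LazyFS log, in issue order.
WRITES = [
    Write(size=16, off=0),
    Write(size=1, off=67108879),
    Write(size=67108863, off=16),
    Write(size=61, off=16),
]


def persisted_ranges(writes: List[Write], persist: List[int]) -> List[Tuple[int, int]]:
    """Merge the [start, end) byte ranges of the writes selected by `persist`.

    `persist` holds 1-based positions within the group of consecutive writes.
    Assumption: writes that are not listed never reach disk.
    """
    spans = sorted(
        (writes[i - 1].off, writes[i - 1].off + writes[i - 1].size) for i in persist
    )
    merged: List[Tuple[int, int]] = []
    for start, end in spans:
        if merged and start <= merged[-1][1]:
            # Overlapping or adjacent to the previous range: extend it.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def covers(ranges: List[Tuple[int, int]], start: int, end: int) -> bool:
    """True if a single persisted range fully contains [start, end)."""
    return any(s <= start and end <= e for s, e in ranges)


ranges = persisted_ranges(WRITES, persist=[3])
print(ranges)                 # [(16, 67108879)]
print(covers(ranges, 0, 16))  # False
```

Under that assumption, the only persisted range is bytes 16 through 67108878, so the 16 bytes written at offset 0 by the first write are absent from the file after the simulated crash. Passing {{persist=[1, 2, 3, 4]}} instead yields one contiguous range covering the whole file.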

> Zookeeper fails to start after power failure
> --------------------------------------------
>
>                 Key: ZOOKEEPER-4744
>                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-4744
>             Project: ZooKeeper
>          Issue Type: Bug
>    Affects Versions: 3.7.1
>         Environment: These are the configurations of the ZooKeeper cluster (omitting IPs):
> {{tickTime=2000}}
> {{dataDir=/home/data/zk37}}
> {{clientPort=2181}}
> {{maxClientCnxns=60}}
> {{initLimit=100}}
> {{syncLimit=100}}
> {{server.1=[IP1]:2888:3888}}
> {{server.2=[IP2]:2888:3888}}
> {{server.3=[IP3]:2888:3888}}
>            Reporter: Maria Ramos
>            Priority: Critical
>         Attachments: reported_error.txt

--
This message was sent by Atlassian Jira
(v8.20.10#820010)