Maria Ramos created ZOOKEEPER-4744:
--------------------------------------
Summary: Zookeeper fails to start after power failure
Key: ZOOKEEPER-4744
URL: https://issues.apache.org/jira/browse/ZOOKEEPER-4744
Project: ZooKeeper
Issue Type: Bug
Affects Versions: 3.7.1
Environment: These are the configurations of the ZooKeeper cluster
(omitting IPs):
{{tickTime=2000}}
{{dataDir=/home/data/zk37}}
{{clientPort=2181}}
{{maxClientCnxns=60}}
{{initLimit=100}}
{{syncLimit=100}}
{{server.1=[IP1]:2888:3888}}
{{server.2=[IP2]:2888:3888}}
{{server.3=[IP3]:2888:3888}}
Reporter: Maria Ramos
Attachments: reported_error.txt
The underlying issue stems from consecutive writes to the transaction log that are not
interleaved with {{fsync}} operations: without an intervening {{fsync}}, the operating
system may persist those writes to disk out of order. This is well-documented behavior,
and several references address the problem:
- [https://www.usenix.org/conference/osdi14/technical-sessions/presentation/pillai]
- [https://dl.acm.org/doi/pdf/10.1145/2872362.2872406]
- [https://mariadb.com/kb/en/atomic-write-support/]
- [https://pages.cs.wisc.edu/~remzi/OSTEP/file-journaling.pdf] (page 9)
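As a minimal illustration of the pattern at issue (a Python sketch, not ZooKeeper's actual logging code): an {{fsync}} between consecutive writes acts as a durability barrier, forcing the first write onto disk before the next one is issued; without it, the OS is free to persist the writes in either order.

```python
import os

def append_records(path, records, fsync_between=True):
    """Append records to a log file. Without an fsync between writes,
    the OS may persist them to disk in any order after a power failure."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o644)
    try:
        for rec in records:
            os.write(fd, rec)
            if fsync_between:
                # Durability barrier: 'rec' is on disk before the next write.
                os.fsync(fd)
    finally:
        os.close(fd)

# Hypothetical demo path and record contents, for illustration only.
append_records("/tmp/zk_fsync_demo.log", [b"header\n", b"txn-1\n"])
```

With {{fsync_between=False}} the two writes may both sit in the page cache when power is lost, and the kernel's writeback order is not guaranteed to match issue order.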
This issue can be replicated using
[LazyFS|https://github.com/dsrhaslab/lazyfs], a file system capable of
simulating power failures and exhibiting the OS behavior mentioned above, i.e.,
the out-of-order file writes at the disk level. LazyFS persists these writes
out of order and then crashes to simulate a power failure.
To reproduce this problem, one can follow these steps:
{*}1{*}. Mount LazyFS on the directory where the ZooKeeper data will be saved,
with a specified root directory. Assuming the ZooKeeper data path is
{{/home/data/zk}} and the root directory is {{/home/data/zk-root}}, add the
following lines to the default configuration file, {{config/default.toml}}:
{{[[injection]] }}
{{type="reorder" }}
{{occurrence=1 }}
{{op="write" }}
{{file="/home/data/zk-root/version-2/log.100000001" }}
{{persist=[3]}}
These lines define the fault to inject: a power failure is simulated after the
third write to the {{/home/data/zk-root/version-2/log.100000001}} file. The
{{occurrence}} parameter selects which group of consecutive writes the fault
applies to; it is set to 1 here so that the fault fires on the first such group,
as there may be more than one.
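One way to read the {{reorder}} injection above (a hypothetical Python model of the semantics, not LazyFS's actual implementation): of the targeted group of consecutive writes, only the writes whose 1-based positions appear in {{persist}} are made durable before the simulated power failure.

```python
def apply_reorder_fault(writes, persist):
    """Hypothetical model of a LazyFS 'reorder' injection: 'writes' is the
    targeted group of consecutive writes to the file; only the 1-based
    positions listed in 'persist' reach disk before the simulated crash."""
    return [writes[i - 1] for i in persist]

# With persist=[3], only the third write of the group survives the crash.
surviving = apply_reorder_fault(["w1", "w2", "w3"], persist=[3])
print(surviving)  # ['w3']
```

Under this reading, the file ends up containing a later write without the earlier ones it depends on, which is exactly the out-of-order persistence the cited references describe.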
{*}2{*}. Start LazyFS as the underlying file system of a node in the
cluster with the following command:
{{ ./scripts/mount-lazyfs.sh -c config/default.toml -m /home/data/zk -r
/home/data/zk-root -f}}
{*}3{*}. Start ZooKeeper with the command:
{{ apache-zookeeper-3.7.1-bin/bin/zkServer.sh start-foreground}}
{*}4{*}. Connect a client to the node that has LazyFS as the underlying file
system:
{{apache-zookeeper-3.7.1-bin/bin/zkCli.sh -server 127.0.0.1:2181}}
Immediately after this step, LazyFS will be unmounted, simulating a power
failure, and ZooKeeper will keep printing error messages in the terminal,
requiring a forced shutdown.
At this point, one can analyze the logs produced by LazyFS to examine the
system calls issued up to the moment of the fault. Here is a simplified version
of the log:
{'syscall': 'create', 'path': '/home/data/zk37-root/version-2/log.100000001',
'mode': 'O_TRUNC'}
{'syscall': 'write', 'path': '/home/data/zk37-root/version-2/log.100000001',
'size': '16', 'off': '0'}
{'syscall': 'write', 'path': '/home/data/zk37-root/version-2/log.100000001',
'size': '1', 'off': '67108879'}
{'syscall': 'write', 'path': '/home/data/zk37-root/version-2/log.100000001',
'size': '67108863', 'off': '16'}
{'syscall': 'write', 'path': '/home/data/zk37-root/version-2/log.100000001',
'size': '61', 'off': '16'}
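The sizes and offsets in the log are internally consistent with this interpretation; a quick arithmetic check on the values above:

```python
# Sizes/offsets taken from the simplified LazyFS log above, as (size, offset).
header   = (16, 0)          # 16-byte file header written at offset 0
prealloc = (1, 67108879)    # 1-byte write that extends the file
padding  = (67108863, 16)   # large fill between the header and that byte

# The padding starts immediately after the header ends...
assert padding[1] == header[1] + header[0]
# ...and ends exactly where the 1-byte preallocation write lands.
assert padding[1] + padding[0] == prealloc[1]
print("padding exactly fills the gap:", padding)
```

In other words, the large third write covers precisely the hole left between the end of the 16-byte header and the 1-byte write at offset 67108879.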
Note that the third write, which zero-fills the gap opened by the preceding
one-byte write, is issued by LazyFS for padding.
{*}5{*}. Remove the fault from the configuration file and unmount the file
system with:
{{fusermount -uz /home/data/zk}}
{*}6{*}. Mount LazyFS again with the previously provided command.
{*}7{*}. Attempt to start ZooKeeper (it fails).
By following these steps, one can replicate the issue and analyze the effects
of the power failure on ZooKeeper's restart process.
--
This message was sent by Atlassian Jira
(v8.20.10#820010)