Young Xu created ZOOKEEPER-4624:
-----------------------------------
Summary: ZooKeeper service cannot be restarted after IO injection uses up
the filesystem fds.
Key: ZOOKEEPER-4624
URL: https://issues.apache.org/jira/browse/ZOOKEEPER-4624
Project: ZooKeeper
Issue Type: Bug
Environment: environment: *{color:#FF0000}K8S{color}*
deployment: *{color:#FF0000}statefulset replicas 3{color}*
zookeeper version: *{color:#FF0000}3.8.0{color}*
Reporter: Young Xu
We're running a chaos test with the following scenario:
ZooKeeper pods are deployed on three nodes. We use {color:#FF0000}*IO
injection*{color} to use up the file descriptors on one node (testing one pod),
so that all filesystem operations return "Too many files". After a period of
time, the ZooKeeper service stops running. Then we stopped the injection. When
I manually started the process again, ZooKeeper reported an error.
{code:java}
2022-10-19 02:03:07,876 [myid:3] - INFO [main:o.a.z.s.q.QuorumPeer@2549] - QuorumPeer communication is not secured! (SASL auth disabled)
2022-10-19 02:03:07,876 [myid:3] - INFO [main:o.a.z.s.q.QuorumPeer@2574] - quorum.cnxn.threads.size set to 20
2022-10-19 02:03:07,877 [myid:3] - INFO [main:o.a.z.s.p.FileSnap@85] - Reading snapshot /home/edge/middleware/zookeeper/data/data/version-2/snapshot.1409ce9ac7
2022-10-19 02:03:07,883 [myid:3] - INFO [main:o.a.z.s.DataTree@1705] - The digest in the snapshot has digest version of 2, with zxid as 0x1409ce9acc, and digest value as 81604125765
2022-10-19 02:03:11,662 [myid:3] - ERROR [main:o.a.z.s.q.QuorumPeer@1200] - Unable to load database on disk
java.io.EOFException: null
    at java.base/java.io.DataInputStream.readInt(Unknown Source)
    at org.apache.jute.BinaryInputArchive.readInt(BinaryInputArchive.java:96)
    at org.apache.zookeeper.server.persistence.FileHeader.deserialize(FileHeader.java:67)
    at org.apache.zookeeper.server.persistence.FileTxnLog$FileTxnIterator.inStreamCreated(FileTxnLog.java:707)
    at org.apache.zookeeper.server.persistence.FileTxnLog$FileTxnIterator.createInputArchive(FileTxnLog.java:725)
    at org.apache.zookeeper.server.persistence.FileTxnLog$FileTxnIterator.goToNextLog(FileTxnLog.java:693)
    at org.apache.zookeeper.server.persistence.FileTxnLog$FileTxnIterator.next(FileTxnLog.java:774)
    at org.apache.zookeeper.server.persistence.FileTxnSnapLog.fastForwardFromEdits(FileTxnSnapLog.java:361)
    at org.apache.zookeeper.server.persistence.FileTxnSnapLog.lambda$restore$0(FileTxnSnapLog.java:267)
    at org.apache.zookeeper.server.persistence.FileTxnSnapLog.restore(FileTxnSnapLog.java:312)
    at org.apache.zookeeper.server.ZKDatabase.loadDataBase(ZKDatabase.java:285)
    at org.apache.zookeeper.server.quorum.QuorumPeer.loadDataBase(QuorumPeer.java:1146)
    at org.apache.zookeeper.server.quorum.QuorumPeer.start(QuorumPeer.java:1132)
    at org.apache.zookeeper.server.quorum.QuorumPeerMain.runFromConfig(QuorumPeerMain.java:229)
    at org.apache.zookeeper.server.quorum.QuorumPeerMain.initializeAndRun(QuorumPeerMain.java:137)
    at org.apache.zookeeper.server.quorum.QuorumPeerMain.main(QuorumPeerMain.java:91)
2022-10-19 02:03:11,663 [myid:3] - INFO [main:o.a.z.m.p.PrometheusMetricsProvider@570] - Shutdown executor service with timeout 1000
2022-10-19 02:03:11,739 [myid:3] - INFO [main:o.e.j.s.AbstractConnector@383] - Stopped ServerConnector@5b03b9fe{HTTP/1.1, (http/1.1)}{zookeeper-default-2.zookeeper.default.svc.cluster.local:8080}
2022-10-19 02:03:11,742 [myid:3] - INFO [main:o.e.j.s.h.ContextHandler@1159] - Stopped o.e.j.s.ServletContextHandler@17bffc17{/,null,STOPPED}
2022-10-19 02:03:11,746 [myid:3] - ERROR [main:o.a.z.s.q.QuorumPeerMain@114] - Unexpected exception, exiting abnormally
java.lang.RuntimeException: Unable to run quorum server
    at org.apache.zookeeper.server.quorum.QuorumPeer.loadDataBase(QuorumPeer.java:1201)
    at org.apache.zookeeper.server.quorum.QuorumPeer.start(QuorumPeer.java:1132)
    at org.apache.zookeeper.server.quorum.QuorumPeerMain.runFromConfig(QuorumPeerMain.java:229)
    at org.apache.zookeeper.server.quorum.QuorumPeerMain.initializeAndRun(QuorumPeerMain.java:137)
    at org.apache.zookeeper.server.quorum.QuorumPeerMain.main(QuorumPeerMain.java:91)
Caused by: java.io.EOFException: null
    at java.base/java.io.DataInputStream.readInt(Unknown Source)
    at org.apache.jute.BinaryInputArchive.readInt(BinaryInputArchive.java:96)
    at org.apache.zookeeper.server.persistence.FileHeader.deserialize(FileHeader.java:67)
    at org.apache.zookeeper.server.persistence.FileTxnLog$FileTxnIterator.inStreamCreated(FileTxnLog.java:707)
    at org.apache.zookeeper.server.persistence.FileTxnLog$FileTxnIterator.createInputArchive(FileTxnLog.java:725)
    at org.apache.zookeeper.server.persistence.FileTxnLog$FileTxnIterator.goToNextLog(FileTxnLog.java:693)
    at org.apache.zookeeper.server.persistence.FileTxnLog$FileTxnIterator.next(FileTxnLog.java:774)
    at org.apache.zookeeper.server.persistence.FileTxnSnapLog.fastForwardFromEdits(FileTxnSnapLog.java:361)
    at org.apache.zookeeper.server.persistence.FileTxnSnapLog.lambda$restore$0(FileTxnSnapLog.java:267)
    at org.apache.zookeeper.server.persistence.FileTxnSnapLog.restore(FileTxnSnapLog.java:312)
    at org.apache.zookeeper.server.ZKDatabase.loadDataBase(ZKDatabase.java:285)
    at org.apache.zookeeper.server.quorum.QuorumPeer.loadDataBase(QuorumPeer.java:1146)
    ... 4 common frames omitted
2022-10-19 02:03:11,747 [myid:3] - INFO [main:o.a.z.a.ZKAuditProvider@42] - ZooKeeper audit is disabled.
2022-10-19 02:03:11,749 [myid:3] - ERROR [main:o.a.z.u.ServiceUtils@48] - Exiting JVM with code 1
{code}
I know that deleting the data directory fixes this and gets the service up and
running again, but I don't know why the file is corrupted.
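
The stack trace shows the EOFException being raised in FileHeader.deserialize
while FileTxnIterator moves to the next transaction log, which suggests that at
least one log.* file under version-2 is shorter than a complete file header. A
rough sketch of a check that lists such files is below. It is not part of
ZooKeeper: the class name TxnLogHeaderCheck and the method findTruncatedLogs
are made up for illustration, and the 16-byte header size (int magic, int
version, long dbid) is an assumption about the log format that should be
verified against the 3.8.0 sources.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not a ZooKeeper class.
final class TxnLogHeaderCheck {

    // Assumption: the header is magic (int) + version (int) + dbid (long).
    static final long HEADER_BYTES = 16;

    // Returns the log.* files in dir that are too short to hold a full header.
    static List<Path> findTruncatedLogs(Path dir) throws IOException {
        List<Path> truncated = new ArrayList<>();
        try (DirectoryStream<Path> logs = Files.newDirectoryStream(dir, "log.*")) {
            for (Path log : logs) {
                if (Files.isRegularFile(log) && Files.size(log) < HEADER_BYTES) {
                    truncated.add(log);
                }
            }
        }
        return truncated;
    }

    public static void main(String[] args) throws IOException {
        // Pass the version-2 directory as the first argument.
        Path dir = Paths.get(args.length > 0 ? args[0] : ".");
        for (Path log : findTruncatedLogs(dir)) {
            System.out.println(log + " (" + Files.size(log) + " bytes)");
        }
    }
}
```

Running it against /home/edge/middleware/zookeeper/data/data/version-2 would
print any log.* file smaller than the assumed header size. This only narrows
down which file is affected; it does not explain how the file ended up in that
state.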
--
This message was sent by Atlassian Jira
(v8.20.10#820010)