[
https://issues.apache.org/jira/browse/ZOOKEEPER-4878?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Dharani updated ZOOKEEPER-4878:
-------------------------------
Attachment: zoo.cfg
> Zookeeper servers not running after Chaos mesh IO fault experiment
> ------------------------------------------------------------------
>
> Key: ZOOKEEPER-4878
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-4878
> Project: ZooKeeper
> Issue Type: Bug
> Affects Versions: 3.8.3
> Reporter: Dharani
> Priority: Major
> Attachments: zoo.cfg
>
>
> We are running ZooKeeper in Kubernetes as a StatefulSet with 3 replicas. When
> we performed a Chaos Mesh IO fault experiment using , the ZooKeeper servers
> did not recover.
> {code:java}
> 2024-10-24T09:43:40.896+0000 [myid:] - ERROR [QuorumPeer[myid=1](plain=[0:0:0:0:0:0:0:0]:2181)(secure=[0:0:0:0:0:0:0:0]:2281):o.a.z.s.ZooKeeperServer@552] - Severe unrecoverable error, exiting
> java.io.FileNotFoundException: /var/lib/zookeeper/data/version-2/snapshot.1100000859 (Input/output error)
>     at java.base/java.io.FileOutputStream.open0(Native Method)
>     at java.base/java.io.FileOutputStream.open(FileOutputStream.java:298)
>     at java.base/java.io.FileOutputStream.<init>(FileOutputStream.java:237)
>     at java.base/java.io.FileOutputStream.<init>(FileOutputStream.java:187)
>     at org.apache.zookeeper.server.persistence.SnapStream.getOutputStream(SnapStream.java:133)
>     at org.apache.zookeeper.server.persistence.FileSnap.serialize(FileSnap.java:242)
>     at org.apache.zookeeper.server.persistence.FileTxnSnapLog.save(FileTxnSnapLog.java:481)
>     at org.apache.zookeeper.server.ZooKeeperServer.takeSnapshot(ZooKeeperServer.java:550)
>     at org.apache.zookeeper.server.ZooKeeperServer.takeSnapshot(ZooKeeperServer.java:544)
>     at org.apache.zookeeper.server.ZooKeeperServer.loadData(ZooKeeperServer.java:540)
>     at org.apache.zookeeper.server.quorum.Leader.lead(Leader.java:597)
>     at org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:1552)
> 2024-10-24T09:43:40.898+0000 [myid:] - ERROR [QuorumPeer[myid=1](plain=[0:0:0:0:0:0:0:0]:2181)(secure=[0:0:0:0:0:0:0:0]:2281):o.a.z.u.ServiceUtils@48] - Exiting JVM with code 10
> {code}
> Expectation: When a Chaos Mesh IO fault experiment is run for 60 seconds,
> all ZooKeeper servers should recover on their own without any manual
> intervention. Is it possible to serve partial traffic while the PV is hung?
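
Editorial sketch (not part of the reported issue, and not ZooKeeper code): the trace shows the failure surfacing as a java.io.FileNotFoundException thrown while opening a FileOutputStream for the snapshot file, after which the server exits with code 10. The Java example below only illustrates that exception path and what a bounded retry around the open call could look like. The class name SnapshotOpenRetry, the method openWithRetry, and its parameters are hypothetical; a missing parent directory stands in for the injected IO fault, so the OS reason in the message will differ from "Input/output error".

```java
import java.io.FileNotFoundException;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;

public class SnapshotOpenRetry {

    /**
     * Hypothetical helper: opens {@code file} for writing and retries a fixed
     * number of times when the open fails. FileOutputStream reports OS-level
     * open failures (including EIO) as FileNotFoundException, as in the trace.
     */
    public static OutputStream openWithRetry(Path file, int maxAttempts, long delayMillis)
            throws IOException, InterruptedException {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be >= 1");
        }
        IOException last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return new FileOutputStream(file.toFile());
            } catch (FileNotFoundException e) {
                last = e;
                // Wait before the next attempt, but not after the final one.
                if (attempt < maxAttempts) {
                    Thread.sleep(delayMillis);
                }
            }
        }
        // All attempts failed: rethrow the most recent failure to the caller.
        throw last;
    }

    public static void main(String[] args) throws Exception {
        Path dir = Files.createTempDirectory("snap-demo");

        // Stand-in for the faulty volume: the parent directory does not exist,
        // so every open attempt fails with FileNotFoundException.
        Path broken = dir.resolve("no-such-dir").resolve("snapshot.demo");
        try {
            openWithRetry(broken, 3, 10);
        } catch (FileNotFoundException e) {
            System.out.println("gave up after retries: " + e.getMessage());
        }

        // Healthy path: the first attempt succeeds and the stream is usable.
        Path healthy = dir.resolve("snapshot.demo");
        try (OutputStream out = openWithRetry(healthy, 3, 10)) {
            out.write(1);
        }
        System.out.println("bytes written: " + Files.size(healthy));
    }
}
```

Whether retrying at this point would be safe for a quorum member is a separate design question; the sketch does not address snapshot consistency or leader election.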
--
This message was sent by Atlassian Jira
(v8.20.10#820010)