[ https://issues.apache.org/jira/browse/ZOOKEEPER-4847?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17869601#comment-17869601 ]
Xin Chen edited comment on ZOOKEEPER-4847 at 7/30/24 10:01 AM:
---------------------------------------------------------------
This problem is severe because once the content of the 'currentEpoch' file
becomes letters, zk can never start successfully again if it restarts for any
other reason. I have proposed a fix that both avoids and repairs the issue by
letting the zk process recover automatically on restart: when invalid file
content is found during startup, the file is deleted. That startup attempt
still fails, but the next time zk is pulled up, a new valid file is created by
default.
{code:java}
private long readLongFromFile(String name) throws IOException {
    File file = new File(logFactory.getSnapDir(), name);
    BufferedReader br = new BufferedReader(new FileReader(file));
    String line = "";
    try {
        line = br.readLine();
        return Long.parseLong(line);
    } catch (NumberFormatException e) {
        // Delete the invalid file; log an error if the deletion fails
        if (!file.delete()) {
            LOG.error("Unable to delete wrong file {}", file);
        }
        throw new IOException("Found " + line + " in " + file);
    } finally {
        br.close();
    }
}
{code}
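For anyone who wants to try the idea outside QuorumPeer, here is a minimal standalone sketch. The class name EpochFileSketch and the explicit dir parameter are assumptions for illustration only (the real method uses logFactory.getSnapDir() and LOG). It also closes the reader before deleting, on the assumption that some platforms refuse to delete a file that is still open:

```java
import java.io.BufferedReader;
import java.io.File;
import java.io.FileReader;
import java.io.IOException;

// Hypothetical standalone variant of readLongFromFile, not the actual patch.
class EpochFileSketch {

    static long readLongFromFile(File dir, String name) throws IOException {
        File file = new File(dir, name);
        String line;
        // try-with-resources closes the reader before any delete attempt
        try (BufferedReader br = new BufferedReader(new FileReader(file))) {
            line = br.readLine(); // null if the file is empty
        }
        try {
            // Long.parseLong throws NumberFormatException for null or non-numeric input
            return Long.parseLong(line);
        } catch (NumberFormatException e) {
            // Remove the invalid file so the next startup sees it as missing
            if (!file.delete()) {
                System.err.println("Unable to delete wrong file " + file);
            }
            throw new IOException("Found " + line + " in " + file);
        }
    }
}
```

Calling this on a file that contains "od" throws the same "Found od in ..." IOException as before, but the file is gone afterwards.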
If the file does not exist at startup, it is created with a default value.
This is existing logic:
{code:java}
private void loadDataBase() {
    try {
        zkDb.loadDataBase();
        // load the epochs
        long lastProcessedZxid = zkDb.getDataTree().lastProcessedZxid;
        long epochOfZxid = ZxidUtils.getEpochFromZxid(lastProcessedZxid);
        try {
            currentEpoch = readLongFromFile(CURRENT_EPOCH_FILENAME);
        } catch (FileNotFoundException e) {
            // pick a reasonable epoch number
            // this should only happen once when moving to a
            // new code version
            currentEpoch = epochOfZxid;
            LOG.info(
                "{} not found! Creating with a reasonable default of {}. "
                    + "This should only happen when you are upgrading your installation",
                CURRENT_EPOCH_FILENAME,
                currentEpoch);
            writeLongToFile(CURRENT_EPOCH_FILENAME, currentEpoch);
        }
{code}
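To show how the two pieces combine across two restarts, here is a rough standalone sketch. Everything in it is a simplified stand-in: the class EpochRecoverySketch, the method loadCurrentEpoch, and the plain-file writeLongToFile are hypothetical, and the epoch is taken as the high 32 bits of the zxid, which is my understanding of what ZxidUtils.getEpochFromZxid does:

```java
import java.io.BufferedReader;
import java.io.File;
import java.io.FileNotFoundException;
import java.io.FileReader;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;

// Hypothetical illustration of the recovery sequence, not ZooKeeper code.
class EpochRecoverySketch {
    static final String CURRENT_EPOCH_FILENAME = "currentEpoch";

    // Reduced version of the patched readLongFromFile: delete on invalid content.
    static long readLongFromFile(File dir, String name) throws IOException {
        File file = new File(dir, name);
        String line;
        try (BufferedReader br = new BufferedReader(new FileReader(file))) {
            line = br.readLine();
        }
        try {
            return Long.parseLong(line);
        } catch (NumberFormatException e) {
            if (!file.delete()) {
                System.err.println("Unable to delete wrong file " + file);
            }
            throw new IOException("Found " + line + " in " + file);
        }
    }

    // Simplified stand-in for writeLongToFile: a plain, non-atomic write.
    static void writeLongToFile(File dir, String name, long value) throws IOException {
        Files.write(new File(dir, name).toPath(),
                Long.toString(value).getBytes(StandardCharsets.UTF_8));
    }

    // Simplified stand-in for the epoch-loading part of loadDataBase().
    static long loadCurrentEpoch(File dir, long lastProcessedZxid) throws IOException {
        long epochOfZxid = lastProcessedZxid >> 32L; // assumed: epoch is the high 32 bits
        try {
            return readLongFromFile(dir, CURRENT_EPOCH_FILENAME);
        } catch (FileNotFoundException e) {
            // Missing file: fall back to the epoch of the last processed zxid
            writeLongToFile(dir, CURRENT_EPOCH_FILENAME, epochOfZxid);
            return epochOfZxid;
        }
    }
}
```

With a 'currentEpoch' file containing "od", the first loadCurrentEpoch call fails with an IOException and removes the file; the second call takes the FileNotFoundException branch and recreates the file from the zxid epoch.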
> Found od in /cloud/data/zookeeper/data/version-2/currentEpoch
> -------------------------------------------------------------
>
> Key: ZOOKEEPER-4847
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-4847
> Project: ZooKeeper
> Issue Type: Bug
> Affects Versions: 3.4.14, 3.6.4, 3.9.2
> Reporter: Xin Chen
> Priority: Major
>
> {code:java}
> 2024-07-30 16:32:48,950 [myid:1] - ERROR [main:QuorumPeer@1148] - Unable to load database on disk
> java.io.IOException: Found od in /cloud/data/zookeeper/data/version-2/currentEpoch
>     at org.apache.zookeeper.server.quorum.QuorumPeer.readLongFromFile(QuorumPeer.java:2126)
>     at org.apache.zookeeper.server.quorum.QuorumPeer.loadDataBase(QuorumPeer.java:1100)
>     at org.apache.zookeeper.server.quorum.QuorumPeer.start(QuorumPeer.java:1079)
>     at org.apache.zookeeper.server.quorum.QuorumPeerMain.runFromConfig(QuorumPeerMain.java:227)
>     at org.apache.zookeeper.server.quorum.QuorumPeerMain.initializeAndRun(QuorumPeerMain.java:136)
>     at org.apache.zookeeper.server.quorum.QuorumPeerMain.main(QuorumPeerMain.java:90)
> 2024-07-30 16:32:48,953 [myid:1] - ERROR [main:QuorumPeerMain@113] - Unexpected exception, exiting abnormally
> java.lang.RuntimeException: Unable to run quorum server
>     at org.apache.zookeeper.server.quorum.QuorumPeer.loadDataBase(QuorumPeer.java:1149)
>     at org.apache.zookeeper.server.quorum.QuorumPeer.start(QuorumPeer.java:1079)
>     at org.apache.zookeeper.server.quorum.QuorumPeerMain.runFromConfig(QuorumPeerMain.java:227)
>     at org.apache.zookeeper.server.quorum.QuorumPeerMain.initializeAndRun(QuorumPeerMain.java:136)
>     at org.apache.zookeeper.server.quorum.QuorumPeerMain.main(QuorumPeerMain.java:90)
> Caused by: java.io.IOException: Found od in /cloud/data/zookeeper/data/version-2/currentEpoch
>     at org.apache.zookeeper.server.quorum.QuorumPeer.readLongFromFile(QuorumPeer.java:2126)
>     at org.apache.zookeeper.server.quorum.QuorumPeer.loadDataBase(QuorumPeer.java:1100)
>     ... 4 more
> 2024-07-30 16:32:48,954 [myid:1] - INFO [main:ZKAuditProvider@42] - ZooKeeper audit is disabled.
> 2024-07-30 16:32:48,955 [myid:1] - ERROR [main:ServiceUtils@48] - Exiting JVM with code 1
> {code}
> I ran into this error by chance and found that the currentEpoch file
> contained letters. On restart, the zk process read the contents of this
> file, threw an exception, and exited. zk was unable to recover from this
> on its own.
--
This message was sent by Atlassian Jira
(v8.20.10#820010)