kezhuw commented on code in PR #2141:
URL: https://github.com/apache/zookeeper/pull/2141#discussion_r1758154068
##########
zookeeper-server/src/main/java/org/apache/zookeeper/server/persistence/FileTxnLog.java:
##########
@@ -701,14 +708,27 @@ public long getStorageSize() {
/**
* go to the next logfile
+ *
* @return true if there is one and false if there is no
* new file to be read
- * @throws IOException
*/
private boolean goToNextLog() throws IOException {
- if (storedFiles.size() > 0) {
+ if (!storedFiles.isEmpty()) {
this.logFile = storedFiles.remove(storedFiles.size() - 1);
- ia = createInputArchive(this.logFile);
+ try {
+ ia = createInputArchive(this.logFile);
+ } catch (EOFException ex) {
+ // If this file is the last log file in the database and is empty,
+ // it means that the last time the file was created
+ // before the header was written.
+ if (storedFiles.isEmpty() && this.logFile.length() == 0) {
+ boolean deleted = this.logFile.delete();
Review Comment:
> Delete failures are usually due to permissions or disk failure. For
> example, the zookeeper user only has read access to the data directory, so
> deleting the empty log file will fail. 1) If the disk completes the I/O in less
> time than the I/O timeout, it may be retrying in the blk layer, just slower,
> but it won't fail. 2) When the disk I/O timeout occurs, most filesystems will
> remount as read-only, and zookeeper will suffer with insufficient permissions.
> 3) I/O may not always return properly if the disk is broken from my experience,
> the process is blocked in the uninterruptible state. When delete failed
> (filesystem readonly) and throw I/O exception, the server restarted, and failed
> again. If we didn't throw the exception, the server will fail on the next write
> with fsync (usually writing new epoch files after FLE completed and reaches
> initLimit timeout).

This reasoning is intricate, and it takes too much background knowledge to be
convinced that it is correct.

> If we didn't throw the exception, the server will fail on the next write
> with fsync (usually writing new epoch files after FLE completed and reaches
> initLimit timeout).

That chain of reasoning is much longer than simply throwing an exception here.

> I don't think we can do much more about the deletion failure.

This is the point. So why not report the failure in the first place? Throwing
an exception here is much easier to verify as correct than ignoring the failure
and relying on sophisticated operational experience.
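
To make the suggestion concrete, here is a minimal sketch of the fail-fast variant. It is not the code in the PR: the class `EmptyLogCleanup` and the helper `deleteEmptyLogOrThrow` are hypothetical names used only for illustration, and the exception messages are placeholders. The only point is that a failed `File.delete()` surfaces immediately as an `IOException` instead of being ignored.

```java
import java.io.File;
import java.io.IOException;

// Hypothetical helper, not part of FileTxnLog or the PR.
class EmptyLogCleanup {

    /**
     * Deletes a zero-length log file and reports any failure right away.
     *
     * @throws IOException if the file is not empty or cannot be deleted
     */
    static void deleteEmptyLogOrThrow(File logFile) throws IOException {
        // Only a zero-length file is safe to drop; anything else may hold data.
        if (logFile.length() != 0) {
            throw new IOException("Refusing to delete non-empty log file " + logFile);
        }
        // File.delete() returns false on failure (for example a read-only
        // filesystem or missing permissions) instead of throwing.
        if (!logFile.delete()) {
            throw new IOException("Failed to delete empty log file " + logFile);
        }
    }
}
```

With this shape, the caller does not have to reason about what a later write or fsync would do after a silent failure; the error is raised at the point where it happens.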