[
https://issues.apache.org/jira/browse/IGNITE-12127?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16923405#comment-16923405
]
Andrey Gura commented on IGNITE-12127:
--------------------------------------
[~DmitriyGovorukhin] LGTM. Thanks for contribution! Please merge to master
branch.
> WAL writer may close file IO with unflushed changes when MMAP is disabled
> -------------------------------------------------------------------------
>
> Key: IGNITE-12127
> URL: https://issues.apache.org/jira/browse/IGNITE-12127
> Project: Ignite
> Issue Type: Bug
> Reporter: Dmitriy Govorukhin
> Assignee: Dmitriy Govorukhin
> Priority: Critical
> Fix For: 2.7.6
>
> Time Spent: 10m
> Remaining Estimate: 0h
>
> Most likely the issue manifests itself as the following critical error:
> {code}
> 2019-08-27 14:52:31.286 ERROR 26835 --- [wal-write-worker%null-#447] ROOT :
> Critical system error detected. Will be handled accordingly to configured
> handler [hnd=class o.a.i.failure.StopNodeOrHaltFailureHandler,
> failureCtx=FailureContext [type=CRITICAL_ERROR, err=class
> o.a.i.i.processors.cache.persistence.StorageException: Failed to write
> buffer.]]
> org.apache.ignite.internal.processors.cache.persistence.StorageException:
> Failed to write buffer.
> at
> org.apache.ignite.internal.processors.cache.persistence.wal.FileWriteAheadLogManager$WALWriter.writeBuffer(FileWriteAheadLogManager.java:3444)
> [ignite-core-2.5.7.jar!/:2.5.7]
> at
> org.apache.ignite.internal.processors.cache.persistence.wal.FileWriteAheadLogManager$WALWriter.body(FileWriteAheadLogManager.java:3249)
> [ignite-core-2.5.7.jar!/:2.5.7]
> at
> org.apache.ignite.internal.util.worker.GridWorker.run(GridWorker.java:110)
> [ignite-core-2.5.7.jar!/:2.5.7]
> at java.lang.Thread.run(Thread.java:748) [na:1.8.0_201]
> Caused by: java.nio.channels.ClosedChannelException: null
> at sun.nio.ch.FileChannelImpl.ensureOpen(FileChannelImpl.java:110)
> ~[na:1.8.0_201]
> at sun.nio.ch.FileChannelImpl.position(FileChannelImpl.java:253)
> ~[na:1.8.0_201]
> at
> org.apache.ignite.internal.processors.cache.persistence.file.RandomAccessFileIO.position(RandomAccessFileIO.java:48)
> ~[ignite-core-2.5.7.jar!/:2.5.7]
> at
> org.apache.ignite.internal.processors.cache.persistence.file.FileIODecorator.position(FileIODecorator.java:41)
> ~[ignite-core-2.5.7.jar!/:2.5.7]
> at
> org.apache.ignite.internal.processors.cache.persistence.file.AbstractFileIO.writeFully(AbstractFileIO.java:111)
> ~[ignite-core-2.5.7.jar!/:2.5.7]
> at
> org.apache.ignite.internal.processors.cache.persistence.wal.FileWriteAheadLogManager$WALWriter.writeBuffer(FileWriteAheadLogManager.java:3437)
> [ignite-core-2.5.7.jar!/:2.5.7]
> ... 3 common frames omitted
> {code}
> It appears that there following sequence is possible:
> * Thread A attempts to log a large record which does not fit segment,
> {{addRecord}} fails and the thread A starts segment rollover. I successfully
> runs {{flushOrWait(null)}} and gets de-scheduled before adding switch segment
> record
> * Thread B attempts to log another record, which fits exactly till the end
> of the current segment. The record is added to the buffer
> * Thread A resumes and fails to add the switch segment record. No flush is
> performed and the thread immediately proceeds for wal-writer close
> * WAL writer thread wakes up, sees that there is a CLOSE request, closes the
> file IO and immediately proceeds to write unflushed changes causing the
> exception.
> Unconditional flush after switch segment record write should fix the issue.
> Besides the bug itself, I suggest the following changes to the
> {{FileWriteHandleImpl}} ({{FileWriteAheadLogManager}} in earlier versions):
> * There is an {{fsync(filePtr)}} call inside {{close()}}; however,
> {{fsync()}} checks the {{stop}} flag (which is set inside {{close}}) and
> returns immediately after {{flushOrWait()}} if the flag is set - this is very
> confusing. After all, the {{close()}} itself explicitly calls {{force}} after
> flush
> * There is an ignored IO exception in mmap mode - this should be propagated
> to the failure handler
> * In WAL writer, we check for file CLOSE and then attemp to write to
> (possibly) the same write handle - write should be always before close
> * In WAL writer, there are racy reads of current handle - it would be better
> if we read the current handle once and then operate on it during the whole
> loop iteration
--
This message was sent by Atlassian Jira
(v8.3.2#803003)