[ https://issues.apache.org/jira/browse/IGNITE-12127?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16923412#comment-16923412 ]
Dmitriy Govorukhin commented on IGNITE-12127: --------------------------------------------- Merged to master a13337d94755d7e1cc097c6f00311552fea25ae6 > WAL writer may close file IO with unflushed changes when MMAP is disabled > ------------------------------------------------------------------------- > > Key: IGNITE-12127 > URL: https://issues.apache.org/jira/browse/IGNITE-12127 > Project: Ignite > Issue Type: Bug > Reporter: Dmitriy Govorukhin > Assignee: Dmitriy Govorukhin > Priority: Critical > Fix For: 2.7.6 > > Time Spent: 20m > Remaining Estimate: 0h > > Most likely the issue manifests itself as the following critical error: > {code} > 2019-08-27 14:52:31.286 ERROR 26835 --- [wal-write-worker%null-#447] ROOT : > Critical system error detected. Will be handled accordingly to configured > handler [hnd=class o.a.i.failure.StopNodeOrHaltFailureHandler, > failureCtx=FailureContext [type=CRITICAL_ERROR, err=class > o.a.i.i.processors.cache.persistence.StorageException: Failed to write > buffer.]] > org.apache.ignite.internal.processors.cache.persistence.StorageException: > Failed to write buffer. > at > org.apache.ignite.internal.processors.cache.persistence.wal.FileWriteAheadLogManager$WALWriter.writeBuffer(FileWriteAheadLogManager.java:3444) > [ignite-core-2.5.7.jar!/:2.5.7] > at > org.apache.ignite.internal.processors.cache.persistence.wal.FileWriteAheadLogManager$WALWriter.body(FileWriteAheadLogManager.java:3249) > [ignite-core-2.5.7.jar!/:2.5.7] > at > org.apache.ignite.internal.util.worker.GridWorker.run(GridWorker.java:110) > [ignite-core-2.5.7.jar!/:2.5.7] > at java.lang.Thread.run(Thread.java:748) [na:1.8.0_201] > Caused by: java.nio.channels.ClosedChannelException: null > at sun.nio.ch.FileChannelImpl.ensureOpen(FileChannelImpl.java:110) > ~[na:1.8.0_201] > at sun.nio.ch.FileChannelImpl.position(FileChannelImpl.java:253) > ~[na:1.8.0_201] > at > org.apache.ignite.internal.processors.cache.persistence.file.RandomAccessFileIO.position(RandomAccessFileIO.java:48) > ~[ignite-core-2.5.7.jar!/:2.5.7] > at > org.apache.ignite.internal.processors.cache.persistence.file.FileIODecorator.position(FileIODecorator.java:41) > ~[ignite-core-2.5.7.jar!/:2.5.7] > at > org.apache.ignite.internal.processors.cache.persistence.file.AbstractFileIO.writeFully(AbstractFileIO.java:111) > ~[ignite-core-2.5.7.jar!/:2.5.7] > at > org.apache.ignite.internal.processors.cache.persistence.wal.FileWriteAheadLogManager$WALWriter.writeBuffer(FileWriteAheadLogManager.java:3437) > [ignite-core-2.5.7.jar!/:2.5.7] > ... 3 common frames omitted > {code} > It appears that there following sequence is possible: > * Thread A attempts to log a large record which does not fit segment, > {{addRecord}} fails and the thread A starts segment rollover. I successfully > runs {{flushOrWait(null)}} and gets de-scheduled before adding switch segment > record > * Thread B attempts to log another record, which fits exactly till the end > of the current segment. The record is added to the buffer > * Thread A resumes and fails to add the switch segment record. No flush is > performed and the thread immediately proceeds for wal-writer close > * WAL writer thread wakes up, sees that there is a CLOSE request, closes the > file IO and immediately proceeds to write unflushed changes causing the > exception. > Unconditional flush after switch segment record write should fix the issue. > Besides the bug itself, I suggest the following changes to the > {{FileWriteHandleImpl}} ({{FileWriteAheadLogManager}} in earlier versions): > * There is an {{fsync(filePtr)}} call inside {{close()}}; however, > {{fsync()}} checks the {{stop}} flag (which is set inside {{close}}) and > returns immediately after {{flushOrWait()}} if the flag is set - this is very > confusing. After all, the {{close()}} itself explicitly calls {{force}} after > flush > * There is an ignored IO exception in mmap mode - this should be propagated > to the failure handler > * In WAL writer, we check for file CLOSE and then attemp to write to > (possibly) the same write handle - write should be always before close > * In WAL writer, there are racy reads of current handle - it would be better > if we read the current handle once and then operate on it during the whole > loop iteration -- This message was sent by Atlassian Jira (v8.3.2#803003)