[ 
https://issues.apache.org/jira/browse/IGNITE-12127?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16923412#comment-16923412
 ] 

Dmitriy Govorukhin commented on IGNITE-12127:
---------------------------------------------

Merged to master a13337d94755d7e1cc097c6f00311552fea25ae6

> WAL writer may close file IO with unflushed changes when MMAP is disabled
> -------------------------------------------------------------------------
>
>                 Key: IGNITE-12127
>                 URL: https://issues.apache.org/jira/browse/IGNITE-12127
>             Project: Ignite
>          Issue Type: Bug
>            Reporter: Dmitriy Govorukhin
>            Assignee: Dmitriy Govorukhin
>            Priority: Critical
>             Fix For: 2.7.6
>
>          Time Spent: 20m
>  Remaining Estimate: 0h
>
> Most likely the issue manifests itself as the following critical error:
> {code}
> 2019-08-27 14:52:31.286 ERROR 26835 --- [wal-write-worker%null-#447] ROOT : 
> Critical system error detected. Will be handled accordingly to configured 
> handler [hnd=class o.a.i.failure.StopNodeOrHaltFailureHandler, 
> failureCtx=FailureContext [type=CRITICAL_ERROR, err=class 
> o.a.i.i.processors.cache.persistence.StorageException: Failed to write 
> buffer.]]
> org.apache.ignite.internal.processors.cache.persistence.StorageException: 
> Failed to write buffer.
>     at 
> org.apache.ignite.internal.processors.cache.persistence.wal.FileWriteAheadLogManager$WALWriter.writeBuffer(FileWriteAheadLogManager.java:3444)
>  [ignite-core-2.5.7.jar!/:2.5.7]
>     at 
> org.apache.ignite.internal.processors.cache.persistence.wal.FileWriteAheadLogManager$WALWriter.body(FileWriteAheadLogManager.java:3249)
>  [ignite-core-2.5.7.jar!/:2.5.7]
>     at 
> org.apache.ignite.internal.util.worker.GridWorker.run(GridWorker.java:110) 
> [ignite-core-2.5.7.jar!/:2.5.7]
>     at java.lang.Thread.run(Thread.java:748) [na:1.8.0_201]
> Caused by: java.nio.channels.ClosedChannelException: null
>     at sun.nio.ch.FileChannelImpl.ensureOpen(FileChannelImpl.java:110) 
> ~[na:1.8.0_201]
>     at sun.nio.ch.FileChannelImpl.position(FileChannelImpl.java:253) 
> ~[na:1.8.0_201]
>     at 
> org.apache.ignite.internal.processors.cache.persistence.file.RandomAccessFileIO.position(RandomAccessFileIO.java:48)
>  ~[ignite-core-2.5.7.jar!/:2.5.7]
>     at 
> org.apache.ignite.internal.processors.cache.persistence.file.FileIODecorator.position(FileIODecorator.java:41)
>  ~[ignite-core-2.5.7.jar!/:2.5.7]
>     at 
> org.apache.ignite.internal.processors.cache.persistence.file.AbstractFileIO.writeFully(AbstractFileIO.java:111)
>  ~[ignite-core-2.5.7.jar!/:2.5.7]
>     at 
> org.apache.ignite.internal.processors.cache.persistence.wal.FileWriteAheadLogManager$WALWriter.writeBuffer(FileWriteAheadLogManager.java:3437)
>  [ignite-core-2.5.7.jar!/:2.5.7]
>     ... 3 common frames omitted
> {code}
> It appears that there following sequence is possible:
>  * Thread A attempts to log a large record which does not fit segment, 
> {{addRecord}} fails and the thread A starts segment rollover. I successfully 
> runs {{flushOrWait(null)}} and gets de-scheduled before adding switch segment 
> record
>  * Thread B attempts to log another record, which fits exactly till the end 
> of the current segment. The record is added to the buffer
>  * Thread A resumes and fails to add the switch segment record. No flush is 
> performed and the thread immediately proceeds for wal-writer close
>  * WAL writer thread wakes up, sees that there is a CLOSE request, closes the 
> file IO and immediately proceeds to write unflushed changes causing the 
> exception.
> Unconditional flush after switch segment record write should fix the issue.
> Besides the bug itself, I suggest the following changes to the 
> {{FileWriteHandleImpl}} ({{FileWriteAheadLogManager}} in earlier versions):
>  * There is an {{fsync(filePtr)}} call inside {{close()}}; however, 
> {{fsync()}} checks the {{stop}} flag (which is set inside {{close}}) and 
> returns immediately after {{flushOrWait()}} if the flag is set - this is very 
> confusing. After all, the {{close()}} itself explicitly calls {{force}} after 
> flush
>  * There is an ignored IO exception in mmap mode - this should be propagated 
> to the failure handler
>  * In WAL writer, we check for file CLOSE and then attemp to write to 
> (possibly) the same write handle - write should be always before close
>  * In WAL writer, there are racy reads of current handle - it would be better 
> if we read the current handle once and then operate on it during the whole 
> loop iteration



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

Reply via email to