[ https://issues.apache.org/jira/browse/DERBY-7034?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16856182#comment-16856182 ]
David Sitsky commented on DERBY-7034:
-------------------------------------
Hi Rick.. I think a more accurate summary is that fsync() does obey its contract when no exception is raised. The problem is that if fsync() raises an exception and you call it again, the second call may "succeed", but in actual fact the kernel has silently dropped the write blocks that previously failed - so you end up with corruption. All the recommendations state that you should never retry fsync() but simply panic instead. Going forward, people seem to recommend using direct I/O instead for better control. I tried hard to find out whether Windows has similar issues but was unsuccessful. But logging an error when fsync() fails would be a great change too.

> Derby's sync() handling can lead to database corruption (at least on Linux)
> ---------------------------------------------------------------------------
>
>                 Key: DERBY-7034
>                 URL: https://issues.apache.org/jira/browse/DERBY-7034
>             Project: Derby
>          Issue Type: Bug
>          Components: Store
>    Affects Versions: 10.14.2.0
>            Reporter: David Sitsky
>            Priority: Major
>
> I recently read about "fsyncgate 2018", which the Postgres team raised:
> https://wiki.postgresql.org/wiki/Fsync_Errors.
> https://lwn.net/Articles/752063/ has a good overview of the issue relating to
> fsync() behaviour on Linux. The short summary is that on some versions of
> Linux, if you retry fsync() after it has failed, it will succeed and you will
> end up with corrupted data on disk.
> At a quick glance at the Derby code, I have already seen two places where
> sync() is retried in a loop, which is clearly dangerous. There could be other
> areas too.
> In LogAccessFile:
> {code}
>     /**
>      * Guarantee all writes up to the last call to flushLogAccessFile on disk.
>      * <p>
>      * A call for clients of LogAccessFile to insure that all data written
>      * up to the last call to flushLogAccessFile() are written to disk.
>      * This call will not return until those writes have hit disk.
>      * <p>
>      * Note that this routine may block waiting for I/O to complete so
>      * callers should limit the number of resource held locked while this
>      * operation is called. It is expected that the caller
>      * Note that this routine only "writes" the data to the file, this does not
>      * mean that the data has been synced to disk. The only way to insure that
>      * is to first call switchLogBuffer() and then follow by a call of sync().
>      *
>      **/
>     public void syncLogAccessFile()
>         throws IOException, StandardException
>     {
>         for( int i=0; ; )
>         {
>             // 3311: JVM sync call sometimes fails under high load against
>             // NFS mounted disk. We re-try to do this 20 times.
>             try
>             {
>                 synchronized( this)
>                 {
>                     log.sync();
>                 }
>                 // the sync succeed, so return
>                 break;
>             }
>             catch( SyncFailedException sfe )
>             {
>                 i++;
>                 try
>                 {
>                     // wait for .2 of a second, hopefully I/O is done by now
>                     // we wait a max of 4 seconds before we give up
>                     Thread.sleep( 200 );
>                 }
>                 catch( InterruptedException ie )
>                 {
>                     InterruptStatus.setInterrupted();
>                 }
>                 if( i > 20 )
>                     throw StandardException.newException(
>                         SQLState.LOG_FULL, sfe);
>             }
>         }
>     }
> {code}
> And LogToFile has similar retry code.. but without handling for
> SyncFailedException:
> {code}
>     /**
>      * Utility routine to call sync() on the input file descriptor.
>      * <p>
>      */
>     private void syncFile( StorageRandomAccessFile raf)
>         throws StandardException
>     {
>         for( int i=0; ; )
>         {
>             // 3311: JVM sync call sometimes fails under high load against
>             // NFS mounted disk. We re-try to do this 20 times.
>             try
>             {
>                 raf.sync();
>                 // the sync succeed, so return
>                 break;
>             }
>             catch (IOException ioe)
>             {
>                 i++;
>                 try
>                 {
>                     // wait for .2 of a second, hopefully I/O is done by now
>                     // we wait a max of 4 seconds before we give up
>                     Thread.sleep(200);
>                 }
>                 catch( InterruptedException ie )
>                 {
>                     InterruptStatus.setInterrupted();
>                 }
>                 if( i > 20 )
>                 {
>                     throw StandardException.newException(
>                         SQLState.LOG_FULL, ioe);
>                 }
>             }
>         }
>     }
> {code}
> It seems Postgres, MySQL and MongoDB have already changed their code to
> "panic" if an error comes from an fsync() call.
> There are many more complexities in how fsync() reports errors (if at all).
> It is worth getting into this further, as I am not familiar with Derby's
> internals and how affected it could be by this.
> Interestingly, people have indicated this issue is more likely to happen for
> network filesystems (since write failures are more common due to the network
> going down), and in the past it was easy just to say "NFS is broken".. but in
> actual fact the problem was in some cases with fsync() and how it was called
> in a loop.
> I've been trying to find out if Windows has similar issues, without much luck.
> But given the mysterious corruption issues I have seen in the past with
> Windows/CIFS.. I do wonder if this is related somehow.

-- This message was sent by Atlassian JIRA (v7.6.3#76005)
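To make the recommended alternative concrete: instead of the retry loops quoted above, the comment suggests treating any sync failure as fatal. Below is a minimal, hypothetical sketch of that fail-fast approach (it is not Derby code; the class `FailFastSync` and method `syncOrDie` are invented names for illustration). The key points are that sync is attempted exactly once, and a failure permanently poisons the store so that a later, apparently successful sync cannot mask the earlier loss of dirty pages.

```java
import java.io.File;
import java.io.IOException;
import java.io.RandomAccessFile;

/**
 * Hypothetical sketch of fail-fast sync handling. After a failed
 * fsync() the kernel may have discarded the dirty pages, so a later
 * "successful" sync proves nothing; the only safe reaction is to
 * stop using the store (panic/shutdown) rather than retry.
 */
public class FailFastSync {
    private volatile boolean corrupt = false;

    public void syncOrDie(RandomAccessFile raf) throws IOException {
        if (corrupt) {
            // Refuse further I/O once a sync has ever failed.
            throw new IOException("store marked corrupt after earlier sync failure");
        }
        try {
            raf.getFD().sync();   // single attempt -- no retry loop
        } catch (IOException ioe) {
            corrupt = true;       // poison the store permanently
            throw ioe;            // caller must treat this as fatal
        }
    }

    public boolean isCorrupt() { return corrupt; }

    public static void main(String[] args) throws Exception {
        File f = File.createTempFile("failfast", ".log");
        f.deleteOnExit();
        try (RandomAccessFile raf = new RandomAccessFile(f, "rw")) {
            FailFastSync store = new FailFastSync();
            raf.writeBytes("some log record\n");
            store.syncOrDie(raf);  // succeeds on a healthy filesystem
            System.out.println("synced, corrupt=" + store.isCorrupt());
        }
    }
}
```

Logging the failure before propagating it, as suggested in the comment, would slot naturally into the catch block above.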