Hey all,

Joan sent this along in IRC and it reads bad enough[tm] that we should at least have a few pairs of eyes looking at whether we have to do anything:

http://danluu.com/fsyncgate/

(It is long and dense, you’ll need to read 10-15% of that page to get the main picture.)

tl;dr: on Linux, once an fsync() has reported EIO, that error state is cleared, so a following fsync() can report success even though the data never made it to disk, and there is no way to recover.

There are two ways of handling this correctly:

1. whatever you write() between the last successful fsync() and the fsync() that raised the error, keep around until after the second fsync(), so you can write() it again.
2. if any one fsync() returns EIO, report this back up immediately, so whoever calls you can retry.

* * *

We seem to be doing 2. as per my reading.

Erlang looks like it correctly just raises whatever error fsync() might return:

1. https://github.com/erlang/otp/blob/maint-r14/erts/emulator/drivers/unix/unix_efile.c#L792-L809
2. https://github.com/erlang/otp/blob/maint-r14/erts/emulator/drivers/unix/unix_efile.c#L151-L163

couch_file too:

1. https://github.com/apache/couchdb/blob/master/src/couch/src/couch_file.erl#L215-L223

I glanced at a few paths going up this chain and couldn’t spot a catch where we’d hide that error, but it’d be great to get some confirmation on this.

* * *

Please double-check my understanding of the issue, the correct ways forward and the findings in Erlang and CouchDB.

Best
Jan
--
Professional Support for Apache CouchDB:
https://neighbourhood.ie/couchdb-support/
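
P.S. To make option 2 concrete, here is a minimal sketch in Python. It is not CouchDB or Erlang code, just an illustration of the idea. The names `append_and_sync` and `DurabilityError` and the injectable `fsync` parameter are made up for this example; the only real calls are `os.write` and `os.fsync`.

```python
import errno
import os


class DurabilityError(Exception):
    """fsync() failed; data written since the last good fsync() may be lost."""


def append_and_sync(fd, data, fsync=os.fsync):
    """Write all of `data` to `fd`, then fsync exactly once.

    `fsync` is injectable only so the failure path can be exercised in a
    test (an assumption of this sketch, not an API of any real library).
    """
    offset = 0
    while offset < len(data):
        # os.write may write fewer bytes than requested, so loop.
        offset += os.write(fd, data[offset:])
    try:
        fsync(fd)
    except OSError as exc:
        # Option 2: do NOT call fsync() again here. A second call may
        # report success even though the data never reached disk.
        # Hand the error to the caller, who still has the data.
        raise DurabilityError("fsync failed, caller must retry") from exc


if __name__ == "__main__":
    def failing_fsync(fd):
        # Simulate the kernel reporting an I/O error on writeback.
        raise OSError(errno.EIO, "simulated I/O error")

    fd = os.open(os.devnull, os.O_WRONLY)
    try:
        append_and_sync(fd, b"doc body", fsync=failing_fsync)
    except DurabilityError as err:
        print("propagated:", err, "/ cause:", err.__cause__)
    finally:
        os.close(fd)
```

The point is only that the error travels up untouched and nothing in the middle "helpfully" retries the fsync(); whoever owns the original data decides what to do next.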