Hey all,

Joan sent this along in IRC and it reads bad enough[tm] that we should at least
have a few pairs of eyes checking whether we have to do anything:

    http://danluu.com/fsyncgate/

(It is long and dense; you'll need to read 10-15% of that page to get the main
picture.)

tl;dr: on Linux, calling fsync() again after an fsync() that reported EIO
clears that error state, with no way of recovering it.

There are two ways of handling this correctly:

1. keep whatever you write() between the last successful fsync() and the
fsync() that raised the error until after the second fsync(), so you can
write() it again.

2. if any one fsync() returns EIO, report this back up immediately, so whoever 
calls you can retry.
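
To make the two options concrete, here is a minimal Python sketch, not
anything taken from Erlang or CouchDB. The DurableWriter class, its method
names and the injectable fsync parameter are all made up for illustration.
It keeps data around until an fsync() succeeds (option 1) and lets any
OSError from fsync() reach the caller untouched (option 2). Whether writing
the data again is enough to recover depends on the kernel and filesystem, so
treat this as a way to picture the bookkeeping only.

```python
import os


def _pwrite_all(fd, data, offset):
    """Write all of data at offset, looping over short writes."""
    view = memoryview(data)
    while view:
        written = os.pwrite(fd, view, offset)
        view = view[written:]
        offset += written


class DurableWriter:
    """Hypothetical helper: remembers writes until an fsync() succeeds."""

    def __init__(self, fd, fsync=os.fsync):
        self.fd = fd
        self._fsync = fsync   # injectable so a failure can be simulated
        self.pending = []     # (offset, data) not yet known to be on disk

    def write_at(self, offset, data):
        _pwrite_all(self.fd, data, offset)
        self.pending.append((offset, data))

    def sync(self):
        # Option 2: an OSError such as EIO is not caught here, so the
        # caller sees it immediately and can decide what to do.
        self._fsync(self.fd)
        # Only reached on success; after a failure the data stays in
        # self.pending (option 1).
        self.pending.clear()

    def rewrite_pending(self):
        # Option 1: after a failed sync(), write the retained data again
        # before calling sync() a second time.
        for offset, data in self.pending:
            _pwrite_all(self.fd, data, offset)
```

A caller would wrap sync() in a try/except OSError, then either call
rewrite_pending() and sync() again, or pass the error further up.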

* * *

We seem to be doing 2. as per my reading.

Erlang looks like it correctly just raises whatever error fsync() might return:

1. 
https://github.com/erlang/otp/blob/maint-r14/erts/emulator/drivers/unix/unix_efile.c#L792-L809
2. 
https://github.com/erlang/otp/blob/maint-r14/erts/emulator/drivers/unix/unix_efile.c#L151-L163

couch_file too:

1. 
https://github.com/apache/couchdb/blob/master/src/couch/src/couch_file.erl#L215-L223

I glanced at a few paths going up this chain and couldn’t spot a catch where 
we’d hide that error, but it’d be great to get some confirmation on this.

* * *

Please double-check my understanding of the issue, the correct ways forward and 
the findings in Erlang and CouchDB.

Best
Jan
-- 
Professional Support for Apache CouchDB:
https://neighbourhood.ie/couchdb-support/
