Re: fsync(2) manual and hdd write caching

2010-11-30 Thread David Schultz
On Thu, Oct 28, 2010, per...@pluto.rain.com wrote:
> Ivan Voras  wrote:
> 
> > ... The problem is actually pretty hard - since AFAIK SoftUpdates
> > doesn't have "checkpoints" in the sense that it groups writes and
> > all data "before" can be guaranteed to be on-disk, the problem is
> > *when* to issue BIO_FLUSH requests.
> 
> Seems to me the originally-stated problem -- making fsync(2)
> do what it claims to do -- is not hard at all.  Just issue a
> BIO_FLUSH request as the final step in handling fsync(2).

Yes, for correctness, fsync(2) needs to flush the relevant parts
of the disk's volatile write cache before returning.  If it
doesn't, applications like databases can fail if there is a power
loss.

Unfortunately, this isn't really practical.  First, performance is
poor: you generally can't flush a particular sector without
flushing the entire write cache, and many disks (including all ATA
disks) don't differentiate between volatile and non-volatile
caches.  Second, many disks ignore the command.

So the status quo for all the major Unix variants is apparently to
favor performance over correctness.  However, FlushFileBuffers()
in Windows does the right thing and flushes the disk write cache,
and I've heard that ZFS and ext4 also do the right thing (subject
to the correctness of the disk controller, of course).

So FreeBSD isn't any worse than most of the world here.  FreeBSD
used to turn off disk write caches by default, but many people
complained about FreeBSD being slow.  Far fewer people complain
about corruptions due to power failure.  Usually people who
require stronger reliability guarantees invest in replicated
storage and battery backups anyway.

Note that the "broken" behavior is still protective against kernel
and application crashes -- just not power failures and certain
types of disk faults.

An informative article on the topic is here:

   http://www.postgresql.org/docs/9.0/static/wal-reliability.html

> While we're at it, perhaps do the same in close(2).
> I _hope_ we are already doing it in unmount(2).

close(2) is a different beast; flushes would be too expensive, and
they aren't needed except for NFS.  Apps are expected to use
fsync(2) if they require it.
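
For illustration, the pattern an application would follow might look
like the sketch below. It is a minimal example in C and is not taken
from the thread: write_durably() is a hypothetical helper name, and the
file mode is an arbitrary choice. As noted above, a successful fsync(2)
here still does not guarantee that the data has left the drive's
volatile write cache.

```c
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/*
 * write_durably() is a hypothetical helper, not an existing API.
 * It writes len bytes to path and calls fsync(2) before close(2).
 * Returns 0 on success, -1 on failure (errno is set by the failing call).
 */
int
write_durably(const char *path, const void *buf, size_t len)
{
	const char *p = buf;
	int fd, saved;

	fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
	if (fd == -1)
		return (-1);

	/* write(2) may be short or interrupted, so loop until done. */
	while (len > 0) {
		ssize_t n = write(fd, p, len);
		if (n == -1) {
			if (errno == EINTR)
				continue;
			goto fail;
		}
		p += n;
		len -= (size_t)n;
	}

	/*
	 * Ask the kernel to push the file's data and attributes to the
	 * device.  Whether the drive's own write cache is also flushed
	 * depends on the OS and the hardware.
	 */
	if (fsync(fd) == -1)
		goto fail;

	/* close(2) does not flush by itself, but it can still report errors. */
	return (close(fd));

fail:
	saved = errno;
	(void)close(fd);
	errno = saved;
	return (-1);
}
```

A caller would check the return value and treat -1 as "the data may not
be on stable storage", rather than relying on close(2) alone.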
___
freebsd-hackers@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-hackers
To unsubscribe, send any mail to "freebsd-hackers-unsubscr...@freebsd.org"


Re: fsync(2) manual and hdd write caching

2010-10-28 Thread perryh
Ivan Voras  wrote:

> ... The problem is actually pretty hard - since AFAIK SoftUpdates
> doesn't have "checkpoints" in the sense that it groups writes and
> all data "before" can be guaranteed to be on-disk, the problem is
> *when* to issue BIO_FLUSH requests.

Seems to me the originally-stated problem -- making fsync(2)
do what it claims to do -- is not hard at all.  Just issue a
BIO_FLUSH request as the final step in handling fsync(2).

While we're at it, perhaps do the same in close(2).
I _hope_ we are already doing it in unmount(2).


Re: fsync(2) manual and hdd write caching

2010-10-27 Thread Ivan Voras
On 10/27/10 12:11, Bruce Cran wrote:
> On Wed, 27 Oct 2010 02:00:51 -0700
> per...@pluto.rain.com wrote:
> 
>> Short of mounting synchronously, with the attendant performance
>> hit, would it not make sense for fsync(2) to issue ATA_FLUSHCACHE
>> or SCSI "SYNCHRONIZE CACHE" after it has finished writing data
>> to the drive?  Surely the low-level capability to issue those
>> commands must already exist, else we would have no way to safely
>> prepare for power off.
> 
> mounting synchronously won't help, will it? As I understand it that
> just makes sure that data is sent straight to disk and not left in
> memory; the data will still be stored in the HDD cache for a
> while.

Correct. The problem is actually pretty hard - since AFAIK SoftUpdates
doesn't have "checkpoints" in the sense that it groups writes and all
data "before" can be guaranteed to be on-disk, the problem is *when* to
issue BIO_FLUSH requests. One possible solution is to simply decide on a
heuristic like: "ok, doing BIO_FLUSH all the time will destroy
performance, we will only do it for every metadata write". Possibly with
a sysctl tunable or per-mount option.
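
The heuristic could be pictured roughly as follows. This is a userspace
sketch only, not FreeBSD kernel code: the flush_policy enum and
should_flush() are hypothetical names standing in for the proposed
sysctl tunable or mount option, and no actual BIO_FLUSH request is
issued.

```c
#include <stdbool.h>

/* Hypothetical policy values; they stand in for a sysctl or mount option. */
enum flush_policy {
	FLUSH_NEVER,	/* status quo: never ask the drive to flush */
	FLUSH_METADATA,	/* flush only after metadata writes */
	FLUSH_ALWAYS	/* flush after every write */
};

/* Decide whether a completed write should be followed by a cache flush. */
bool
should_flush(enum flush_policy policy, bool is_metadata)
{
	switch (policy) {
	case FLUSH_ALWAYS:
		return (true);
	case FLUSH_METADATA:
		return (is_metadata);
	case FLUSH_NEVER:
	default:
		return (false);
	}
}
```

Under FLUSH_METADATA, plain data writes never trigger a flush, so the
cost is paid only on metadata updates.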



Re: fsync(2) manual and hdd write caching

2010-10-27 Thread Bruce Cran
On Wed, 27 Oct 2010 02:00:51 -0700
per...@pluto.rain.com wrote:

> Short of mounting synchronously, with the attendant performance
> hit, would it not make sense for fsync(2) to issue ATA_FLUSHCACHE
> or SCSI "SYNCHRONIZE CACHE" after it has finished writing data
> to the drive?  Surely the low-level capability to issue those
> commands must already exist, else we would have no way to safely
> prepare for power off.

mounting synchronously won't help, will it? As I understand it that
just makes sure that data is sent straight to disk and not left in
memory; the data will still be stored in the HDD cache for a
while.

-- 
Bruce Cran


Re: fsync(2) manual and hdd write caching

2010-10-27 Thread perryh
Ivan Voras  wrote:
> fsync(2) actually does behave as advertised, "causes all modified
> data and attributes of fd to be moved to a permanent storage
> device". It is the problem of the "permanent storage device"
> if it caches this data further.

IMO, volatile RAM without battery backup cannot reasonably be
considered a "permanent storage device", regardless of where
it is physically located.

Short of mounting synchronously, with the attendant performance
hit, would it not make sense for fsync(2) to issue ATA_FLUSHCACHE
or SCSI "SYNCHRONIZE CACHE" after it has finished writing data
to the drive?  Surely the low-level capability to issue those
commands must already exist, else we would have no way to safely
prepare for power off.


Re: fsync(2) manual and hdd write caching

2010-10-26 Thread Garrett Cooper
On Tue, Oct 26, 2010 at 4:40 PM, Alexander Best  wrote:
> On Wed Oct 27 10, Bruce Cran wrote:
>> On Tue, 26 Oct 2010 21:36:18 +
>> Alexander Best  wrote:
>>
>> > since there's a thread on freebsd-questions@ concerning fsync(2) and
>> > the fact that hdd write caching can cause this syscall to basically
>> > be a no op, could somebody please copy the BUGS section from sync(2)
>> > to fsync(2)?
>>
>> Shouldn't the BUGS section of sync(2) be removed?
>>
>> "The sync() system call may return before the buffers are completely
>>      flushed."
>>
>> But from
>> http://www.opengroup.org/onlinepubs/009695399/functions/sync.html :
>>
>> "The writing, although scheduled, is not necessarily complete upon
>> return from sync()."
>>
>> That would suggest it's not actually a bug.
>
> well...you are right on the one hand. but still this should be documented imo.
> how about turning BUGS into a CAVEATS section and then adding that section to
> fsync(2)?
>
> the reason posix mentions this sync/fsync behavior is probably the fact that
> they know that this cannot be avoided. so that statement seems itself to be a
> caveat rather than a feature. ;)

Just a sidenote, but that's POSIX 2004[.6?] spec, not POSIX 2008.1
(which is the most current spec -- http://www.unix.org/2008edition/ ).
I double checked and the wording didn't differ for the fsync(2) system
interface, but it could differ in others.
HTH,
-Garrett


Re: fsync(2) manual and hdd write caching

2010-10-26 Thread Alexander Best
On Wed Oct 27 10, Bruce Cran wrote:
> On Tue, 26 Oct 2010 21:36:18 +
> Alexander Best  wrote:
> 
> > since there's a thread on freebsd-questions@ concerning fsync(2) and
> > the fact that hdd write caching can cause this syscall to basically
> > be a no op, could somebody please copy the BUGS section from sync(2)
> > to fsync(2)?
> 
> Shouldn't the BUGS section of sync(2) be removed?
> 
> "The sync() system call may return before the buffers are completely
>  flushed."
> 
> But from
> http://www.opengroup.org/onlinepubs/009695399/functions/sync.html : 
> 
> "The writing, although scheduled, is not necessarily complete upon
> return from sync()."
> 
> That would suggest it's not actually a bug.

well...you are right on the one hand. but still this should be documented imo.
how about turning BUGS into a CAVEATS section and then adding that section to
fsync(2)?

the reason posix mentions this sync/fsync behavior is probably the fact that
they know that this cannot be avoided. so that statement seems itself to be a
caveat rather than a feature. ;)

cheers.
alex

> 
> -- 
> Bruce Cran

-- 
a13x


Re: fsync(2) manual and hdd write caching

2010-10-26 Thread Bruce Cran
On Tue, 26 Oct 2010 21:36:18 +
Alexander Best  wrote:

> since there's a thread on freebsd-questions@ concerning fsync(2) and
> the fact that hdd write caching can cause this syscall to basically
> be a no op, could somebody please copy the BUGS section from sync(2)
> to fsync(2)?

Shouldn't the BUGS section of sync(2) be removed?

"The sync() system call may return before the buffers are completely
 flushed."

But from
http://www.opengroup.org/onlinepubs/009695399/functions/sync.html : 

"The writing, although scheduled, is not necessarily complete upon
return from sync()."

That would suggest it's not actually a bug.

-- 
Bruce Cran


Re: fsync(2) manual and hdd write caching

2010-10-26 Thread Bruce Cran
On Wed, 27 Oct 2010 01:19:18 +0200
Ivan Voras  wrote:

> fsync(2) actually does behave as advertised, "causes all modified data
> and attributes of fd to be moved to a permanent storage device". It
> is the problem of the "permanent storage device" if it caches this
> data further.

http://www.opengroup.org/onlinepubs/009695399/functions/fsync.html at
first suggests it should flush write caches, but does allow for
implementations that don't:

"The fsync() function is intended to force a physical write of data
from the buffer cache, and to assure that after a system crash or other
failure that all data up to the time of the fsync() call is recorded on
the disk."

...

"In the middle ground between these extremes, fsync() might or might
not actually cause data to be written where it is safe from a power
failure."

-- 
Bruce Cran


Re: fsync(2) manual and hdd write caching

2010-10-26 Thread Ivan Voras

On 10/26/10 23:36, Alexander Best wrote:

> hi there,
>
> since there's a thread on freebsd-questions@ concerning fsync(2) and the fact
> that hdd write caching can cause this syscall to basically be a no op, could
> somebody please copy the BUGS section from sync(2) to fsync(2)?


I don't think they are the same.

The "buffers" of sync(2) are not those from the discussion on fsync(2)
safety. Or more correctly, they are, but the two calls work at
different scopes.


fsync(2) actually does behave as advertised, "causes all modified data
and attributes of fd to be moved to a permanent storage device". It is 
the problem of the "permanent storage device" if it caches this data 
further.




fsync(2) manual and hdd write caching

2010-10-26 Thread Alexander Best
hi there,

since there's a thread on freebsd-questions@ concerning fsync(2) and the fact
that hdd write caching can cause this syscall to basically be a no op, could
somebody please copy the BUGS section from sync(2) to fsync(2)?

cheers.
alex

-- 
a13x