On Wed, Dec 18, 2013 at 9:31 PM, Cameron Simpson <c...@zip.com.au> wrote: > On 18Dec2013 14:35, Chris Angelico <ros...@gmail.com> wrote: >> An SQL database *is* a different form of storage. It's storing tabular >> data, not a stream of bytes in a file. You're supposed to be able to >> treat it as an efficient way to locate a particular tuple based on a >> set of rules, not a different way to format a file on the disk. > > Shrug. It's all just data to me. I don't _care_ about the particular > internal storage format.
Then use a file, because you want file semantics. That's why you have both options available. > Commit() is a logical operation saying this SQL changeset is now > part of the global state. The global state is defined by what's on the disk. Specifically, by what would be read if the power failed right at that moment. In the case of PostgreSQL, a commit doesn't actually write the table pages - it just writes the WAL (Write-Ahead Log), which is used to recreate the transaction. If something fails hard, the WAL replay will apply the change perfectly. That's the global state. It's not there till the WAL's been fsync'd. >> Also: the filesystem layer doesn't guarantee integrity. If you don't >> fsync() or fdatasync() or some other equivalent [1], it's not on the >> disk yet, so you can't trust it. > > Course I can. There's plenty of scope within the disc physical layer > (buffering, caching, RAID card buffering) for an fsync() to return > _before_ the data are written to ferrous oxide (or whatever) because > the OS DOES NOT KNOW. The theory of fsync is that it's actually written. If it's been written to a battery-backed cache that will be flushed to platters successfully even if the power fails, then it's been fsync'd. That's not a problem. It *is* a problem if it's been written to a volatile cache on an SSD and there's more than can be written in the event of a power failure. That's why there are only two lines of SSD (Intel 320 and 710 series) that are recommended for use with PGSQL. > All that has happened after an fsync() is that the OS taken your > SQL changeset that you commited to the OS data abstraction and > pushed it one layer lower into the "disk" abstraction. There's more > going on in there. Not just pushed it one layer lower; the point of fsync is that it's been pushed all the way down. See its man page [1]: """fsync() transfers ("flushes") all modified in-core data ... to the disk ... so that all changed information can be retrieved even after the system crashed or was rebooted.""" It's fundamentally about crash recovery, not about "passing it to a lower abstraction". Of course, the OS isn't always *able* to guarantee things (NFS shares are notoriously hard to pin down), but the intention of fsync is that it won't return (and therefore the COMMIT operation won't finish) until the data can be read back reliably even in the event of a major failure. Databases protect against that. If you want that protection, use a database. If you don't, use a file. There's nothing wrong with either option. ChrisA [1] on the web here, for those who don't have them handy: http://linux.die.net/man/2/fsync -- https://mail.python.org/mailman/listinfo/python-list