Andrew McNamara wrote:
> Note that ext3 effectively does the same thing as ZFS on fsync() - because
> the journal layer is block based and does not know which block belongs
> to which file, the entire journal must be applied to the filesystem to
> achieve the expected fsync() semantics (at least, with data=ordered,
> it does).

Well, "does not know which block belongs to which file" sounds weird. :)

With data=ordered, the journal holds only metadata. If you fsync() a file, "ordered" means that ext3 syncs the data blocks first (with no overhead, just like any other filesystem; of course it knows which blocks to write), then the journal. Now, yes, the journal possibly contains metadata updates for other files too, and the "ordered" semantics require the data blocks of those files to be synced as well, before the journal sync. I'm not sure whether an fsync() flushes the whole journal or just as much as necessary (that is, up to the last update on the file you're fsync()ing).

data=writeback is what some (most) other journalled filesystems do. Metadata updates are allowed to hit the disk _before_ data updates. So, on fsync(), the FS writes all data blocks of that file (still required by fsync() semantics), then the journal (or part of it), but if metadata updates for other files are included in the journal sync, there's no need to write the corresponding data blocks. They'll be written later, and they'll hit the disk _after_ the metadata changes. If power fails in between, you can have a file whose size/time is updated, but whose contents are not. That's the problem with data=writeback, but it should be noted that's pretty normal for other journalled filesystems, too. It applies only to files that were not fsync()'ed.

I think that if you're running into performance problems, and your system is doing a lot of fsync(), data=ordered is the worst option.

data=journal is fsync()-friendly in one sense: it does write *everything* out, but in one nice sequential (thus extremely fast) shot. Later, data blocks will be written again to the right places. It doubles the I/O bandwidth requirements, but if you have a lot of bandwidth, it may be a win. We're talking sequential write bandwidth, which is hardly a problem.

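Since the whole comparison hinges on what an fsync()-heavy application actually asks of the filesystem, here is a minimal sketch of such a write path in Python. The function name durable_append and the file name used in the demo are made up for illustration; this is not how any particular mail server does it, just the generic write-then-fsync() pattern.

```python
import os
import tempfile


def durable_append(path, payload):
    """Append payload (bytes) to path; return only after fsync() succeeds.

    Hypothetical helper for illustration. Returns the number of bytes written.
    """
    # O_APPEND makes every write land at the current end of the file.
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o600)
    try:
        written = 0
        # os.write() may write fewer bytes than requested, so loop.
        while written < len(payload):
            written += os.write(fd, payload[written:])
        # Ask the kernel to push this file's data and metadata to stable
        # storage. How much *other* dirty data gets flushed along the way
        # is up to the filesystem and its journalling mode.
        os.fsync(fd)
    finally:
        os.close(fd)
    return written


if __name__ == "__main__":
    # Demo against a throwaway directory; the file name is arbitrary.
    with tempfile.TemporaryDirectory() as tmpdir:
        target = os.path.join(tmpdir, "message.tmp")
        durable_append(target, b"first message\n")
        durable_append(target, b"second message\n")
        print(os.path.getsize(target), "bytes on disk")
```

Every call pays for one fsync(), so a server that handles many small deliveries this way is exactly the workload where the choice of data= mode shows up.
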
data=writeback is fsync()-friendly in the sense that it writes only the data blocks of the fsync()'ed file plus (all) metadata. It's the lowest-overhead option. If you have heavy sustained write traffic _and_ lots of fsync()'s, then data=writeback may be the only option. I think some people are scared by data=writeback, but they don't realize it's just what other journalled FS do. I'm not familiar with ReiserFS; I think it's metadata-only as well.

data=ordered is good for general purpose systems. For any application that uses fsync(), it's useless overhead.

I've never hit performance problems. My numbers are 200 users with 2000 messages/day delivered to lmtp; _any_ decent PC handles that load easily, and I've never considered switching my filesystems from data=ordered to data=writeback.

Now that I think about it, I've also forgotten to set noatime after the last HW upgrade (what a luxury!). /me fires vi on /etc/fstab and adds 'noatime'

.TM.
----
Cyrus Home Page: http://cyrusimap.web.cmu.edu/
Cyrus Wiki/FAQ: http://cyrusimap.web.cmu.edu/twiki
List Archives/Info: http://asg.web.cmu.edu/cyrus/mailing-list.html