Re: [zfs-discuss] Implementing fbarrier() on ZFS
> > Given ZFS's copy-on-write transactional model, would it not be almost
> > trivial to implement fbarrier()? Basically just choose to wrap up the
> > transaction at the point of fbarrier() and that's it.
> >
> > Am I missing something?
>
> How do you guarantee that the disk driver and/or the disk firmware doesn't
> reorder writes? The only guarantee of in-order writes, at the actual
> storage level, is to complete the outstanding ones before issuing new ones.
>
> Or am _I_ now missing something :)
>
> FrankH.

As Jeff said, ZFS guarantees that write(2) calls are ordered, in the sense
that they either show up in the order supplied or not at all. So as the
transaction group closes, we can issue all the I/Os we want in whatever order
we choose (more or less), then flush the caches. Up to this point none of the
I/O would actually be visible after a reboot. Then we update the uberblock,
flush the caches again, and we're done. All writes associated with a
transaction group show up at once in the main tree.

-r
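To make that application-visible guarantee concrete, here is a small
crash-test sketch. It assumes only what is stated above -- that each write(2)
either shows up in issue order or not at all -- and the file name, record
format, and record count are invented for illustration:

#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define NRECORDS 100000

/* Writer: append numbered records with no fsync() at all.  Cut power (or
 * kill the machine) at an arbitrary point while this runs. */
static void writer(const char *path)
{
    int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0)
        return;
    char rec[32];
    for (int i = 0; i < NRECORDS; i++) {
        int len = snprintf(rec, sizeof(rec), "%08d\n", i);
        if (write(fd, rec, len) < 0)
            break;
    }
    close(fd);
}

/* Checker (run after reboot): if writes appear in issue order or not at all,
 * the surviving records must form an unbroken prefix 0..k -- record i+1 is
 * never visible without record i. */
static int check(const char *path)
{
    FILE *f = fopen(path, "r");
    int expect = 0, got;
    while (f && fscanf(f, "%d", &got) == 1) {
        if (got != expect)
            return -1;                 /* ordering violated */
        expect++;
    }
    if (f)
        fclose(f);
    printf("prefix intact: %d records survived\n", expect);
    return 0;
}

int main(int argc, char **argv)
{
    const char *path = "ordered.dat";
    if (argc > 1 && strcmp(argv[1], "check") == 0)
        return check(path) == 0 ? 0 : 1;
    writer(path);
    return 0;
}

Run the writer, pull power partway through, and after reboot the checker
should always report an intact prefix of records.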
Re: [zfs-discuss] Implementing fbarrier() on ZFS
> > That is interesting. Could this account for disproportionate kernel
> > CPU usage for applications that perform I/O one byte at a time, as
> > compared to other filesystems? (Never mind that the application
> > shouldn't do that to begin with.)
>
> I just quickly measured this (overwriting files in CHUNKS); this is a
> software benchmark (I/O is a non-factor):
>
>   CHUNK   ZFS vs. UFS
>
>   1B      4x slower
>   1K      2x slower
>   8K      25% slower
>   32K     equal
>   64K     30% faster
>
> Quick and dirty, but I think it paints a picture.
> I can't really answer your question though.

I should probably have said "other filesystems on other platforms"; I did not
really compare properly on the Solaris box. In this case it was actually
BitTorrent (the official Python client) that was completely CPU-bound in
kernel space, and tracing showed single-byte I/O.

Regardless, the above stats are interesting and, I suppose, consistent with
what one might expect from previous discussion on this list.

--
/ Peter Schuller

PGP userID: 0xE9758B7D or 'Peter Schuller <[EMAIL PROTECTED]>'
Key retrieval: Send an E-Mail to [EMAIL PROTECTED]
E-Mail: [EMAIL PROTECTED]  Web: http://www.scode.org
Re: [zfs-discuss] Implementing fbarrier() on ZFS
Peter Schuller writes:
> > I agree about the usefulness of fbarrier() vs. fsync(), BTW. The cool
> > thing is that on ZFS, fbarrier() is a no-op. It's implicit after
> > every system call.
>
> That is interesting. Could this account for disproportionate kernel
> CPU usage for applications that perform I/O one byte at a time, as
> compared to other filesystems? (Never mind that the application
> shouldn't do that to begin with.)

I just quickly measured this (overwriting files in CHUNKS); this is a
software benchmark (I/O is a non-factor):

  CHUNK   ZFS vs. UFS

  1B      4x slower
  1K      2x slower
  8K      25% slower
  32K     equal
  64K     30% faster

Quick and dirty, but I think it paints a picture.
I can't really answer your question though.

-r
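For reference, this kind of measurement can be reproduced with something as
simple as the sketch below. The original benchmark's code isn't shown, so the
file size, chunk sizes, and timing method here are assumptions; the idea is
just to overwrite a small file in fixed-size chunks, once on a ZFS mount and
once on a UFS mount, and compare CPU time:

#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>
#include <unistd.h>

#define FILE_SIZE (4 * 1024 * 1024)   /* 4 MB working file (assumed size) */

/* Overwrite the file in `chunk`-sized write(2) calls and return the CPU
 * seconds consumed; with a file this small, I/O is not the limiting factor. */
static double overwrite(const char *path, size_t chunk)
{
    char *buf = malloc(chunk);
    memset(buf, 'x', chunk);

    int fd = open(path, O_WRONLY | O_CREAT, 0644);
    if (fd < 0) {
        perror("open");
        exit(1);
    }

    clock_t start = clock();
    for (size_t off = 0; off < FILE_SIZE; off += chunk) {
        if (write(fd, buf, chunk) < 0) {
            perror("write");
            exit(1);
        }
    }
    double cpu = (double)(clock() - start) / CLOCKS_PER_SEC;

    close(fd);
    free(buf);
    return cpu;
}

int main(int argc, char **argv)
{
    const char *path = argc > 1 ? argv[1] : "chunktest.dat";
    size_t chunks[] = { 1, 1024, 8 * 1024, 32 * 1024, 64 * 1024 };

    for (size_t i = 0; i < sizeof(chunks) / sizeof(chunks[0]); i++)
        printf("%6zu-byte writes: %.2f s CPU\n",
               chunks[i], overwrite(path, chunks[i]));
    return 0;
}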
Re: [zfs-discuss] Implementing fbarrier() on ZFS
> That is interesting. Could this account for disproportionate kernel
> CPU usage for applications that perform I/O one byte at a time, as
> compared to other filesystems? (Never mind that the application
> shouldn't do that to begin with.)

No, this is entirely a matter of CPU efficiency in the current code.
There are two issues; we know what they are, and we're fixing them.

The first is that as we translate from znode to dnode, we throw away
information along the way -- we go from znode to object number (fast), but
then we have to do an object lookup to get from object number to dnode (slow
by comparison -- or, more to the point, slow relative to the cost of writing
a single byte). But this is just stupid, since we already have a dnode
pointer sitting right there in the znode. We just need to fix our internal
interfaces to expose it.

The second problem is that we're not very fast at partial-block updates.
Again, this is entirely a matter of code efficiency, not anything fundamental.

> I still would love to see something like fbarrier() defined by some
> standard (de facto or otherwise) to make the distinction between ordered
> writes and guaranteed persistence more easily exploited in the general
> case by applications, and to encourage filesystems/storage systems to
> optimize for that case (i.e., not have fbarrier() simply call fsync()).

Totally agree.

Jeff
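A simplified illustration of the first issue follows; the structures and the
lookup helper are invented stand-ins for this sketch, not the actual ZFS
interfaces:

#include <stdint.h>

struct dnode;
struct objset;

/* Hypothetical helper: find a dnode by object number (a lookup that costs
 * real work on every call). */
struct dnode *objset_lookup_dnode(struct objset *os, uint64_t object);

struct znode {
    struct objset *z_objset;   /* object set this file lives in        */
    uint64_t       z_object;   /* object number of the file            */
    struct dnode  *z_dnode;    /* dnode backing the file, already held */
};

/* Current path, as described above: go znode -> object number -> lookup,
 * paying a lookup on every write -- expensive relative to writing one byte. */
struct dnode *dnode_for_write_slow(struct znode *zp)
{
    return objset_lookup_dnode(zp->z_objset, zp->z_object);
}

/* The described fix: expose the dnode pointer the znode already carries, so
 * the per-write translation becomes a pointer dereference. */
struct dnode *dnode_for_write_fast(struct znode *zp)
{
    return zp->z_dnode;
}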
Re: [zfs-discuss] Implementing fbarrier() on ZFS
> I agree about the usefulness of fbarrier() vs. fsync(), BTW. The cool
> thing is that on ZFS, fbarrier() is a no-op. It's implicit after
> every system call.

That is interesting. Could this account for disproportionate kernel
CPU usage for applications that perform I/O one byte at a time, as
compared to other filesystems? (Never mind that the application
shouldn't do that to begin with.)

But the fact that you effectively have an fbarrier() is extremely nice.
I guess that is yet another reason to prefer ZFS for certain (granted,
very specific) cases.

I still would love to see something like fbarrier() defined by some
standard (de facto or otherwise) to make the distinction between ordered
writes and guaranteed persistence more easily exploited in the general
case by applications, and to encourage filesystems/storage systems to
optimize for that case (i.e., not have fbarrier() simply call fsync()).

--
/ Peter Schuller

PGP userID: 0xE9758B7D or 'Peter Schuller <[EMAIL PROTECTED]>'
Key retrieval: Send an E-Mail to [EMAIL PROTECTED]
E-Mail: [EMAIL PROTECTED]  Web: http://www.scode.org
Re: [zfs-discuss] Implementing fbarrier() on ZFS
> That said, actually implementing the underlying mechanisms may not be
> worth the trouble. It is only a matter of time before disks have fast
> non-volatile memory like PRAM or MRAM, and then the need to do
> explicit cache management basically disappears.

I meant fbarrier() as a syscall exposed to userland, like fsync(), so that
userland applications can achieve ordered semantics without synchronous
writes. Whether or not ZFS in turn manages to eliminate synchronous writes
through some feature of the underlying storage mechanism is a separate issue.
But even if not, an fbarrier() exposes to userland an asynchronous way of
ensuring the relative order of I/O operations, which is often useful.

--
/ Peter Schuller

PGP userID: 0xE9758B7D or 'Peter Schuller <[EMAIL PROTECTED]>'
Key retrieval: Send an E-Mail to [EMAIL PROTECTED]
E-Mail: [EMAIL PROTECTED]  Web: http://www.scode.org
Re: [zfs-discuss] Implementing fbarrier() on ZFS
> Do you agree that there is a major tradeoff in "builds up a wad of
> transactions in memory"?

I don't think so. We trigger a transaction group commit when we have lots of
dirty data or when 5 seconds elapse, whichever comes first. In other words,
we don't let updates get stale.

Jeff
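In other words, the commit trigger amounts to something like the following
sketch. The names and the dirty-data threshold are invented here; only the
5-second figure comes from the message above:

#include <stdint.h>

#define TXG_TIMEOUT_SECONDS   5
#define TXG_DIRTY_LIMIT_BYTES (64ULL << 20)   /* assumed threshold */

struct txg_state {
    uint64_t dirty_bytes;    /* data buffered in the currently open txg */
    uint64_t open_seconds;   /* how long the current txg has been open  */
};

/* Commit the open transaction group when either limit is hit, whichever
 * comes first -- so buffered updates never get stale. */
static int txg_should_commit(const struct txg_state *ts)
{
    return ts->dirty_bytes  >= TXG_DIRTY_LIMIT_BYTES ||
           ts->open_seconds >= TXG_TIMEOUT_SECONDS;
}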
Re: [zfs-discuss] Implementing fbarrier() on ZFS
Jeff Bonwick,

Do you agree that there is a major tradeoff in "builds up a wad of
transactions in memory"? We lose the changes if we have an unstable
environment. Thus, I don't quite understand why a 2-phase approach to
commits isn't done.

First, take the transactions as they come and do a minimal amount of delayed
writing. If the number of transactions builds up, then convert to the delayed
write scheme. The assumption is that not all ZFS environments are write-heavy
as opposed to write-once, read-many type access. My assumption is that
attribute/metadata reading outweighs all other accesses. Wouldn't this
approach allow minimal outstanding transactions and favor read access? Yes,
the assumption is that once the "wad" is started, the amount of writing could
be substantial and thus the bandwidth available for reading is reduced. This
would then also make more intermediate (N) states available. Right?

Second, there are multiple uses of "then" (then pushes, then flushes all disk
..., then writes the new uberblock, then flushes the caches again), which
seems to allow some level of parallelism that should reduce the latency from
the start to the final write. Or did you just say that for simplicity's sake?

Mitchell Erblich
---

Jeff Bonwick wrote:
>
> Toby Thain wrote:
> > I'm no guru, but would not ZFS already require strict ordering for its
> > transactions ... which property Peter was exploiting to get "fbarrier()"
> > for free?
>
> Exactly. Even if you disable the intent log, the transactional nature
> of ZFS ensures preservation of event ordering. Note that disk caches
> don't come into it: ZFS builds up a wad of transactions in memory,
> then pushes them out as a transaction group. That entire group will
> either commit or not. ZFS writes all the new data to new locations,
> then flushes all disk write caches, then writes the new uberblock,
> then flushes the caches again. Thus you can lose power at any point
> in the middle of committing transaction group N, and you're guaranteed
> that upon reboot, everything will either be at state N or state N-1.
>
> I agree about the usefulness of fbarrier() vs. fsync(), BTW. The cool
> thing is that on ZFS, fbarrier() is a no-op. It's implicit after
> every system call.
>
> Jeff
Re: [zfs-discuss] Implementing fbarrier() on ZFS
Toby Thain wrote:
> I'm no guru, but would not ZFS already require strict ordering for its
> transactions ... which property Peter was exploiting to get "fbarrier()"
> for free?

Exactly. Even if you disable the intent log, the transactional nature
of ZFS ensures preservation of event ordering. Note that disk caches
don't come into it: ZFS builds up a wad of transactions in memory,
then pushes them out as a transaction group. That entire group will
either commit or not. ZFS writes all the new data to new locations,
then flushes all disk write caches, then writes the new uberblock,
then flushes the caches again. Thus you can lose power at any point
in the middle of committing transaction group N, and you're guaranteed
that upon reboot, everything will either be at state N or state N-1.

I agree about the usefulness of fbarrier() vs. fsync(), BTW. The cool
thing is that on ZFS, fbarrier() is a no-op. It's implicit after
every system call.

Jeff
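The ordering described above can be summarized in a short sketch. This is
illustrative pseudocode, not actual ZFS source; the types and helper
functions are invented stand-ins:

struct pool;
struct txg;

void write_new_blocks(struct txg *txg);            /* COW writes to fresh locations */
void flush_disk_write_caches(struct pool *pool);   /* e.g. a synchronize-cache cmd  */
void write_uberblock(struct pool *pool, struct txg *txg);

void txg_commit(struct pool *pool, struct txg *txg)
{
    /* 1. All new data and metadata go to previously unused locations, so
     *    nothing reachable from the current uberblock is overwritten and
     *    these writes may be issued in any order.                          */
    write_new_blocks(txg);

    /* 2. Make sure every one of those writes is on stable storage before
     *    the uberblock is allowed to reference them.                       */
    flush_disk_write_caches(pool);

    /* 3. The single "atomic" step: the new uberblock makes the entire
     *    transaction group visible at once.                                */
    write_uberblock(pool, txg);

    /* 4. Flush again so the uberblock itself is durable.  Losing power
     *    anywhere in this sequence leaves the pool at state N or N-1.      */
    flush_disk_write_caches(pool);
}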
Re: [zfs-discuss] Implementing fbarrier() on ZFS
2007/2/12, Frank Hofmann <[EMAIL PROTECTED]>:
> On Mon, 12 Feb 2007, Chris Csanady wrote:
> > This is true for NCQ with SATA, but SCSI also supports ordered tags,
> > so it should not be necessary.
> >
> > At least, that is my understanding.
>
> Except that ZFS doesn't talk SCSI, it talks to a target driver. And that
> one may or may not treat async I/O requests dispatched via its strategy()
> entry point as strictly ordered / non-coalescible / non-cancellable.
> See e.g. disksort(9F).

Yes; however, this functionality could be exposed through the target driver.
While the implementation does not (yet) take full advantage of ordered tags,
Linux does provide an interface to do this:

http://www.mjmwired.net/kernel/Documentation/block/barrier.txt

From a correctness standpoint, the interface seems worthwhile, even if the
mechanisms are never implemented. It just feels wrong to execute a
synchronize-cache command from ZFS when often that is not the intention.
The changes to ZFS itself would be very minor.

That said, actually implementing the underlying mechanisms may not be worth
the trouble. It is only a matter of time before disks have fast non-volatile
memory like PRAM or MRAM, and then the need to do explicit cache management
basically disappears.

Chris
Re: [zfs-discuss] Implementing fbarrier() on ZFS
On Mon, 12 Feb 2007, Toby Thain wrote:

[ ... ]

> I'm no guru, but would not ZFS already require strict ordering for its
> transactions ... which property Peter was exploiting to get "fbarrier()"
> for free?

It achieves this by flushing the disk write cache when there's a need to
barrier, which completes the outstanding writes. A "perfect fsync()" for ZFS
shouldn't need to do much more; that it currently does is something that, as
I understand it, is being worked on.

FrankH.
Re: [zfs-discuss] Implementing fbarrier() on ZFS
On 12-Feb-07, at 5:55 PM, Frank Hofmann wrote:
> On Mon, 12 Feb 2007, Peter Schuller wrote:
> > Hello,
> >
> > Often fsync() is used not because one cares that some piece of data is
> > on stable storage, but because one wants to ensure the subsequent I/O
> > operations are performed after previous I/O operations are on stable
> > storage. In these cases the latency introduced by an fsync() is
> > completely unnecessary. An fbarrier() or similar would be extremely
> > useful to get the proper semantics while still allowing for better
> > performance than what you get with fsync().
> >
> > My assumption has been that this has not been traditionally implemented
> > for reasons of implementation complexity.
> >
> > Given ZFS's copy-on-write transactional model, would it not be almost
> > trivial to implement fbarrier()? Basically just choose to wrap up the
> > transaction at the point of fbarrier() and that's it.
> >
> > Am I missing something?
>
> How do you guarantee that the disk driver and/or the disk firmware doesn't
> reorder writes? The only guarantee of in-order writes, at the actual
> storage level, is to complete the outstanding ones before issuing new ones.
>
> Or am _I_ now missing something :)

I'm no guru, but would not ZFS already require strict ordering for its
transactions ... which property Peter was exploiting to get "fbarrier()"
for free?

--Toby
Re: [zfs-discuss] Implementing fbarrier() on ZFS
Peter Schuller wrote:
> Hello,
>
> Often fsync() is used not because one cares that some piece of data is on
> stable storage, but because one wants to ensure the subsequent I/O
> operations are performed after previous I/O operations are on stable
> storage. In these cases the latency introduced by an fsync() is completely
> unnecessary. An fbarrier() or similar would be extremely useful to get the
> proper semantics while still allowing for better performance than what you
> get with fsync().
>
> My assumption has been that this has not been traditionally implemented
> for reasons of implementation complexity.
>
> Given ZFS's copy-on-write transactional model, would it not be almost
> trivial to implement fbarrier()? Basically just choose to wrap up the
> transaction at the point of fbarrier() and that's it.
>
> Am I missing something?
>
> (I do not actually have a use case for this on ZFS, since my experience
> with ZFS is thus far limited to my home storage server. But I have wished
> for an fbarrier() many many times over the past few years...)

Hmmm... is store ordering what you're looking for? That is, make sure that in
the case of power failure, all previous writes will be visible after reboot
if any subsequent write is visible.

- Bart

--
Bart Smaalders              Solaris Kernel Performance
[EMAIL PROTECTED]           http://blogs.sun.com/barts
Re: [zfs-discuss] Implementing fbarrier() on ZFS
On Mon, 12 Feb 2007, Chris Csanady wrote:

[ ... ]

> > > Am I missing something?
> >
> > How do you guarantee that the disk driver and/or the disk firmware
> > doesn't reorder writes? The only guarantee of in-order writes, at the
> > actual storage level, is to complete the outstanding ones before issuing
> > new ones.
>
> This is true for NCQ with SATA, but SCSI also supports ordered tags,
> so it should not be necessary.
>
> At least, that is my understanding.

Except that ZFS doesn't talk SCSI, it talks to a target driver. And that one
may or may not treat async I/O requests dispatched via its strategy() entry
point as strictly ordered / non-coalescible / non-cancellable. See e.g.
disksort(9F).

FrankH.

> Chris
Re: [zfs-discuss] Implementing fbarrier() on ZFS
2007/2/12, Frank Hofmann <[EMAIL PROTECTED]>:
> On Mon, 12 Feb 2007, Peter Schuller wrote:
> > Hello,
> >
> > Often fsync() is used not because one cares that some piece of data is
> > on stable storage, but because one wants to ensure the subsequent I/O
> > operations are performed after previous I/O operations are on stable
> > storage. In these cases the latency introduced by an fsync() is
> > completely unnecessary. An fbarrier() or similar would be extremely
> > useful to get the proper semantics while still allowing for better
> > performance than what you get with fsync().
> >
> > My assumption has been that this has not been traditionally implemented
> > for reasons of implementation complexity.
> >
> > Given ZFS's copy-on-write transactional model, would it not be almost
> > trivial to implement fbarrier()? Basically just choose to wrap up the
> > transaction at the point of fbarrier() and that's it.
> >
> > Am I missing something?
>
> How do you guarantee that the disk driver and/or the disk firmware doesn't
> reorder writes? The only guarantee of in-order writes, at the actual
> storage level, is to complete the outstanding ones before issuing new ones.

This is true for NCQ with SATA, but SCSI also supports ordered tags, so it
should not be necessary.

At least, that is my understanding.

Chris
Re: [zfs-discuss] Implementing fbarrier() on ZFS
On Mon, 12 Feb 2007, Peter Schuller wrote:
> Hello,
>
> Often fsync() is used not because one cares that some piece of data is on
> stable storage, but because one wants to ensure the subsequent I/O
> operations are performed after previous I/O operations are on stable
> storage. In these cases the latency introduced by an fsync() is completely
> unnecessary. An fbarrier() or similar would be extremely useful to get the
> proper semantics while still allowing for better performance than what you
> get with fsync().
>
> My assumption has been that this has not been traditionally implemented
> for reasons of implementation complexity.
>
> Given ZFS's copy-on-write transactional model, would it not be almost
> trivial to implement fbarrier()? Basically just choose to wrap up the
> transaction at the point of fbarrier() and that's it.
>
> Am I missing something?

How do you guarantee that the disk driver and/or the disk firmware doesn't
reorder writes? The only guarantee of in-order writes, at the actual storage
level, is to complete the outstanding ones before issuing new ones.

Or am _I_ now missing something :)

FrankH.

> (I do not actually have a use case for this on ZFS, since my experience
> with ZFS is thus far limited to my home storage server. But I have wished
> for an fbarrier() many many times over the past few years...)
>
> --
> / Peter Schuller
>
> PGP userID: 0xE9758B7D or 'Peter Schuller <[EMAIL PROTECTED]>'
> Key retrieval: Send an E-Mail to [EMAIL PROTECTED]
> E-Mail: [EMAIL PROTECTED]  Web: http://www.scode.org
[zfs-discuss] Implementing fbarrier() on ZFS
Hello,

Often fsync() is used not because one cares that some piece of data is on
stable storage, but because one wants to ensure the subsequent I/O operations
are performed after previous I/O operations are on stable storage. In these
cases the latency introduced by an fsync() is completely unnecessary. An
fbarrier() or similar would be extremely useful to get the proper semantics
while still allowing for better performance than what you get with fsync().

My assumption has been that this has not been traditionally implemented for
reasons of implementation complexity.

Given ZFS's copy-on-write transactional model, would it not be almost trivial
to implement fbarrier()? Basically just choose to wrap up the transaction at
the point of fbarrier() and that's it.

Am I missing something?

(I do not actually have a use case for this on ZFS, since my experience with
ZFS is thus far limited to my home storage server. But I have wished for an
fbarrier() many many times over the past few years...)

--
/ Peter Schuller

PGP userID: 0xE9758B7D or 'Peter Schuller <[EMAIL PROTECTED]>'
Key retrieval: Send an E-Mail to [EMAIL PROTECTED]
E-Mail: [EMAIL PROTECTED]  Web: http://www.scode.org
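To make the request concrete, here is a minimal sketch of the pattern being
described. fbarrier() is hypothetical -- it does not exist as a system call --
and is shown only to contrast the desired ordering-only semantics with what
fsync() is used for today; the file name and record contents are invented:

#include <fcntl.h>
#include <string.h>
#include <unistd.h>

static int append_two_records(int fd)
{
    const char *first  = "record A\n";   /* must reach stable storage first */
    const char *second = "record B\n";   /* must never be visible without A */

    if (write(fd, first, strlen(first)) < 0)
        return -1;

    /* Today: pay for full synchronous persistence just to get ordering. */
    if (fsync(fd) < 0)
        return -1;

    /* Wished for: a cheap ordering-only barrier -- "nothing issued after
     * this point becomes visible on disk before everything issued above",
     * with no requirement that anything be durable right now.
     *
     *     if (fbarrier(fd) < 0)      // hypothetical call
     *         return -1;
     */

    return write(fd, second, strlen(second)) < 0 ? -1 : 0;
}

int main(void)
{
    int fd = open("journal.dat", O_WRONLY | O_CREAT | O_APPEND, 0644);
    if (fd < 0)
        return 1;
    int rc = append_two_records(fd);
    close(fd);
    return rc == 0 ? 0 : 1;
}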