I am talking about having a write queue which points to ready-to-write full stripes.
Ready-to-write full stripes would be ones where:
* The last byte of the full stripe has been updated; or
* The file has been closed for writing (an exception to the rule above).

I believe there is now a scheduler in ZFS to handle conflicts between reads and writes. For example, on a large multi-gigabyte NVRAM array, the only big considerations are how big the Fibre Channel pipe is and the limit on outstanding I/Os. But on SATA off the motherboard, how much RAM cache each disk has is a consideration as well, along with the speed of the SATA connection and the number of outstanding I/Os.

When it comes time to do a txg, some of the record blocks (most of the full 128k ones) will have been written out already. If we have only written out full record blocks, then there has been no performance loss. A txg is eventually going to happen, and these full writes will eventually need to happen anyway; if we can choose a less busy time for them, all the better.

For example, on a raidz with 5 disks, if I have 4x128k worth of data to write, let's write it. On a mirror, if I have 128k worth to write, let's write it (record size 128k). Or let it be a per-zpool tunable, as some arrays (RAID5) like to be given larger chunks of data. Why wait for the txg if the disks are not being pressured for reads, rather than taking a pause every 30 seconds?

Bob wrote (I may not have explained it well enough):

> It is not true that there is "no cost" though. Since ZFS uses COW,
> this approach requires that new blocks be allocated and written at a
> much higher rate. There is also an "opportunity cost" in that if a
> read comes in while these continuous writes are occurring, the read
> will be delayed.

At some stage a write needs to happen. **Full** writes have a very small COW cost compared with small writes. As I said above, I am talking about a write of 4x128k on a 5-disk raidz before the write would happen early.

> There are many applications which continually write/overwrite file
> content, or which update a file at a slow pace.
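To make the trigger concrete, here is a rough sketch of the idea in Python. This is hypothetical code, not actual ZFS internals; the function names and the assumption of single-parity raidz (4 data disks + 1 parity on a 5-disk set) are my own.

```python
# Hypothetical sketch of the proposed early-write heuristic (not real ZFS
# code): dirty data qualifies for an early full-stripe write once a full
# stripe's worth of record blocks has accumulated for the vdev layout.

RECORDSIZE = 128 * 1024  # default ZFS recordsize (128k)

def early_write_threshold(vdev_type, ndisks, recordsize=RECORDSIZE):
    """Bytes of dirty data needed before an early full-stripe write.

    Layouts as described in the post: a mirror flushes per record
    block; a 5-disk raidz (assumed single parity, so 4 data disks)
    wants 4 full record blocks, i.e. 4x128k.
    """
    if vdev_type == "mirror":
        return recordsize
    if vdev_type == "raidz":
        data_disks = ndisks - 1  # assume single parity
        return data_disks * recordsize
    raise ValueError("unknown vdev type: %s" % vdev_type)

def ready_for_early_write(dirty_bytes, vdev_type, ndisks):
    """True once enough dirty data exists to write a full stripe."""
    return dirty_bytes >= early_write_threshold(vdev_type, ndisks)

# The examples from the post:
#   mirror           -> 131072 bytes (128k)
#   5-disk raidz     -> 524288 bytes (4 x 128k)
```

A per-zpool tunable, as suggested above, could simply override the computed threshold with a fixed byte count for arrays that prefer larger chunks.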
> For example, log
> files are typically updated at a slow rate. Updating a block requires
> reading it first (if it is not already cached in the ARC), which can
> be quite expensive. By waiting a bit longer, there is a much better
> chance that the whole block is overwritten, so zfs can discard the
> existing block on disk without bothering to re-read it.

Apps which update at a slow pace will not trigger the early write until they have written at least a recordsize worth of data. An application that writes less than 128k (recordsize) in 30 seconds will never trigger the early write on a mirrored disk, let alone on a raidz setup. What this will catch is the big writer: files greater than 128k (recordsize) on mirrored disks, and files larger than 4x128k on 5-disk raidz sets. So commands like dd if=x of=y bs=512k will not cause issues (pauses/delays) when the txg times out.

PS: I have already set zfs:zfs_write_limit_override, and I would not recommend anyone set it very low to get the above effect. It's just an idea on how to prevent the delay effect; it may not be practical.

--
This message posted from opensolaris.org
_______________________________________________
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss