Hi,

I've been poking around in the WAPBL sources and some of the email threads, and have also read the doc/roadmaps comments, so I'm aware of some of the sentiment.

I think it would still be useful to get WAPBL safe to enable by default
again in NetBSD. Neither lfs64 nor the Harvard journalling fs is
currently in tree, so it's unknown when they would be stable enough to
replace ffs by default. Also, I think it is useful to keep some kind of
generic[*] journalling code, perhaps also for use with ext2fs or maybe
xfs one day. In either case, IMO it is good to also do some generic
system improvements usable by any journalling solution.

I see the following groups of useful changes. For the -8 timeframe, IMO
only group 1 really needs to be resolved to safely enable wapbl
journalling by default.

1. critical fixes for WAPBL
2. less critical fixes for WAPBL
3. performance improvements for WAPBL
4. disk subsystem and journalling-related improvements

1. Critical fixes for WAPBL

1.1 kern/47146 kernel panic when many files are unlinked
1.2 kern/50725 discard handling
1.3 kern/49175 degenerate truncate() case - too embarrassing to leave in

2. Less critical fixes for WAPBL

2.1 kern/45676 flush semantics

2.2 (no PR) make group descriptor updates part of change transaction

The transaction that changed the group descriptor should also contain
the cg block write. Currently the group descriptor blocks are written to
disk during filesystem sync via a separate transaction, so quite often
they do not survive a crash that happens before the sync. Normally fsck
fixes these easily using inode metadata, but fsck is skipped for
journalled filesystems. IMO this can lead to incorrect block allocation
until fsck is actually run.

2.3 file data leaks on crashes

File data blocks are written asynchronously, and some of them can make
it to the disk before the journal is committed, hence blocks can end up
in a different file after a system crash. FFS always had this, even with
softdep, albeit more limited.

2.4 buffer blocks kept in memory until commit

Buffer cache bufs are kept in memory with the B_LOCKED flag by wapbl,
starving the buffer cache subsystem.

3. WAPBL performance fixes

3.1 checksum journal data for commit

Avoid one of the two DIOCCACHESYNC calls by computing a checksum over
the data and storing it in the commit record; there is even a field for
it already, so it is a matter of implementation. There may be a CPU use
concern, however. crc32c is a good candidate; do we need to have hash
alternatives?

This seems to be reasonably simple to implement; it just needs some
hooks into the journal write and journal replay logic.

3.2 use FUA (Force Unit Access) for commit record write

This avoids the need to issue even the second DIOCCACHESYNC. Flushing
the disk cache is not really all that useful anyway; I like the thread
over at:

http://yarchive.net/comp/linux/drive_caches.html

Slightly less controversially, this would allow the rest of the journal
records to be written asynchronously, leaving them to execute even after
the commit if so desired. It may be useful to have this behaviour
optional. However, I lean towards skipping the disk cache flush as the
default behaviour if we implement a write barrier for the commit record
(see below).

WAPBL would need to deal with drives without FUA, i.e. fall back to a
cache flush.

3.3 async, or 'group sync' writes

Submit all the journal block writes to the drive at once, instead of
writing the blocks synchronously one by one. We could even have the
journal block writes completely async if we have the commit record
checksum.

Implementing 'group sync' writes would be quite simple. Making them
fully async is more difficult and actually not very useful for
journalling, since the commit would force those writes to the disk drive
anyway if it is a write barrier (see below).

4. disk subsystem and journalling-related improvements

4.1 write barriers

The current DIOCCACHESYNC has a problem in that it can quite easily be
I/O starved if the drive is heavily loaded. Normally, the drive firmware
flushes the disk buffer very soon (within milliseconds, i.e. once it has
a full track of data), but concurrent disk activity might prevent it
from doing so soon enough. A more serious NetBSD kernel problem,
however, is that DIOCCACHESYNC bypasses bufq, so if there are any queued
writes, DIOCCACHESYNC sends the command to the disk before those writes
are sent to the drive.

To avoid both problems, it would be good to have a way to mark a buf as
a barrier. bufq and/or the disk routines would be changed to drain the
write queue before the barrier write is sent to the drive, and any later
writes would wait until the barrier write completes. On sane hardware
like SCSI/SAS, this could be almost completely offloaded to the
controller by just using ORDERED tags, without the need to drain the
queue.

This would be semi-hard to implement, especially if it requires changes
to disk drivers.

4.2 scsipi default to ORDERED tags, change to SIMPLE

From a quick scsipi_base.c inspection, it seems we use an ordered tag
if none was specified in the request. This seems like a waste. It
probably assumes disksort() does a miracle job, but bufq disksort can't
account for e.g. head positions, so this is actually a misoptimization
even for spinning rust, and not useful at all for SSDs. We should change
the default to SIMPLE and rely on the disk firmware to do its job.

This is very simple to do.

4.3 generic FUA flag support

In order to avoid a full cache sync after the journal commit, it would
be useful to mark certain writes (like the journal commit record write)
to bypass the disk write cache. There is a FUA bit in SCSI/SAS, in NCQ
for SATA, and in NVMe. This could be as simple as a struct buf flag
which the disk driver would act upon. It is very simple to do for SCSI
and NVMe since the support for tags is already there; for SATA we'd
need to first implement NCQ support :D

This is quite easy to do; it's just a struct buf flag and tweaks to the
scsipi/nvme code.

4.4 NCQ support for AHCI

We need NCQ to support the FUA flag. We have AHCI support, but without
NCQ. FreeBSD has support for AHCI NCQ, and OpenBSD seems to have some
kind of support also, so they can be used as a reference besides the
official AHCI specification. Most recent motherboards support AHCI mode.
Some non-AHCI PCI SATA controllers support NCQ too, but those would be
out of scope for now.

This is semi-hard to do.

I plan to start on group 1, followed by 3.1 checksum, 4.1 write barrier,
4.3 generic FUA support, and finally 3.2 FUA usage.

Comments are welcome :)

Jaromir

[*] WAPBL is of course not so generic since it forces the on-disk format
right now, but it could eventually be made more flexible.
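
To make the 3.1 idea a bit more concrete, here is a minimal userland sketch of a checksummed commit record. It is an illustration only, not WAPBL code: struct commit_rec, commit_seal() and commit_valid() are hypothetical names, the layout is not the WAPBL on-disk format, and the bitwise crc32c() is the slow textbook variant (a kernel would presumably use a table-driven or hardware-assisted one).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * CRC-32C (Castagnoli), reflected polynomial 0x82F63B78.
 * Bitwise and slow; pass 0 as the initial crc, or a previous
 * result to continue over the next chunk of data.
 */
static uint32_t
crc32c(uint32_t crc, const void *buf, size_t len)
{
	const uint8_t *p = buf;

	crc = ~crc;
	while (len--) {
		crc ^= *p++;
		for (int k = 0; k < 8; k++)
			crc = (crc >> 1) ^ (0x82F63B78u & (0u - (crc & 1u)));
	}
	return ~crc;
}

/* Hypothetical commit record; NOT the WAPBL on-disk layout. */
struct commit_rec {
	uint64_t generation;	/* transaction generation number */
	uint32_t datalen;	/* bytes of journal data covered */
	uint32_t checksum;	/* crc32c over those bytes */
};

/* Writer side: checksum the journal data and store it in the record. */
static void
commit_seal(struct commit_rec *c, const void *data, uint32_t len)
{
	assert(c != NULL && data != NULL);
	c->datalen = len;
	c->checksum = crc32c(0, data, len);
}

/*
 * Replay side: returns nonzero only if the journal data matches the
 * commit record. A mismatch means some journal blocks never reached
 * the disk, so the transaction must be ignored.
 */
static int
commit_valid(const struct commit_rec *c, const void *data)
{
	return crc32c(0, data, c->datalen) == c->checksum;
}
```

With something like this, replay can detect a commit record that reached the disk ahead of its data, which is what lets the cache flush between the journal data and the commit record be dropped.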