Module Name:    src
Committed By:   riastradh
Date:           Thu Mar 26 21:38:49 UTC 2015

Added Files:
        src/share/man/man9: wapbl.9

Log Message:
Add wapbl(9) man page.


To generate a diff of this commit:
cvs rdiff -u -r0 -r1.1 src/share/man/man9/wapbl.9

Please note that diffs are not public domain; they are subject to the
copyright notices on the relevant files.

Added files:

Index: src/share/man/man9/wapbl.9
diff -u /dev/null src/share/man/man9/wapbl.9:1.1
--- /dev/null	Thu Mar 26 21:38:49 2015
+++ src/share/man/man9/wapbl.9	Thu Mar 26 21:38:49 2015
@@ -0,0 +1,442 @@
+.\"	$NetBSD: wapbl.9,v 1.1 2015/03/26 21:38:49 riastradh Exp $
+.\"
+.\" Copyright (c) 2015 The NetBSD Foundation, Inc.
+.\" All rights reserved.
+.\"
+.\" This code is derived from software contributed to The NetBSD Foundation
+.\" by Taylor R. Campbell.
+.\"
+.\" Redistribution and use in source and binary forms, with or without
+.\" modification, are permitted provided that the following conditions
+.\" are met:
+.\" 1. Redistributions of source code must retain the above copyright
+.\"    notice, this list of conditions and the following disclaimer.
+.\" 2. Redistributions in binary form must reproduce the above copyright
+.\"    notice, this list of conditions and the following disclaimer in the
+.\"    documentation and/or other materials provided with the distribution.
+.\"
+.\" THIS SOFTWARE IS PROVIDED BY THE NETBSD FOUNDATION, INC. AND CONTRIBUTORS
+.\" ``AS IS'' AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED
+.\" TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
+.\" PURPOSE ARE DISCLAIMED.  IN NO EVENT SHALL THE FOUNDATION OR CONTRIBUTORS
+.\" BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR
+.\" CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF
+.\" SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS
+.\" INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN
+.\" CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE)
+.\" ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE
+.\" POSSIBILITY OF SUCH DAMAGE.
+.\"
+.Dd March 26, 2015
+.Dt WAPBL 9
+.Os
+.Sh NAME
+.Nm WAPBL ,
+.Nm wapbl_start ,
+.Nm wapbl_stop ,
+.Nm wapbl_begin ,
+.Nm wapbl_end ,
+.Nm wapbl_flush ,
+.Nm wapbl_discard ,
+.Nm wapbl_add_buf ,
+.Nm wapbl_remove_buf ,
+.Nm wapbl_resize_buf ,
+.Nm wapbl_register_inode ,
+.Nm wapbl_unregister_inode ,
+.Nm wapbl_register_deallocation ,
+.Nm wapbl_jlock_assert ,
+.Nm wapbl_junlock_assert
+.Nd write-ahead physical block logging for file systems
+.Sh SYNOPSIS
+.In sys/wapbl.h
+.Vt typedef void (*wapbl_flush_fn_t)(struct mount *, daddr_t *, int *, int) ;
+.Ft int
+.Fn wapbl_start "struct wapbl **wlp" "struct mount *mp" "struct vnode *devvp" \
+        "daddr_t off" "size_t count" "size_t blksize" \
+        "struct wapbl_replay *wr" \
+        "wapbl_flush_fn_t flushfn" "wapbl_flush_fn_t flushabortfn"
+.Ft int
+.Fn wapbl_stop "struct wapbl *wl" "int force"
+.Ft int
+.Fn wapbl_begin "struct wapbl *wl" "const char *file" "int line"
+.Ft void
+.Fn wapbl_end "struct wapbl *wl"
+.Ft int
+.Fn wapbl_flush "struct wapbl *wl" "int wait"
+.Ft void
+.Fn wapbl_discard "struct wapbl *wl"
+.Ft void
+.Fn wapbl_add_buf "struct wapbl *wl" "struct buf *bp"
+.Ft void
+.Fn wapbl_remove_buf "struct wapbl *wl" "struct buf *bp"
+.Ft void
+.Fn wapbl_resize_buf "struct wapbl *wl" "struct buf *bp" "long oldsz" \
+       "long oldcnt"
+.Ft void
+.Fn wapbl_register_inode "struct wapbl *wl" "ino_t ino" "mode_t mode"
+.Ft void
+.Fn wapbl_unregister_inode "struct wapbl *wl" "ino_t ino" "mode_t mode"
+.Ft void
+.Fn wapbl_register_deallocation "struct wapbl *wl" "daddr_t blk" "int len"
+.Ft void
+.Fn wapbl_jlock_assert "struct wapbl *wl"
+.Ft void
+.Fn wapbl_junlock_assert "struct wapbl *wl"
+.Sh DESCRIPTION
+.Nm ,
+or
+.Em write-ahead physical block logging ,
+is an abstraction for file systems to write physical blocks in the
+.Xr buffercache 9
+to a bounded-size log first before their real destinations on disk.
+The name means:
+.Bl -tag -width "physical block" -offset abcd
+.It logging
+batches of writes are issued atomically via a log
+.It physical block
+only physical blocks, not logical file system operations, are stored in
+the log
+.It write-ahead
+blocks are written to the log before being written to the disk
+.El
+.Pp
+When a file system using
+.Nm
+issues writes (as in
+.Xr bwrite 9
+or
+.Xr bdwrite 9 Ns ),
+they are grouped in batches called
+.Em transactions
+in memory, which are serialized to be consistent with program order
+before
+.Nm
+submits them to disk atomically.
+.Pp
+Thus, within a transaction, after one write, another write need not
+wait for disk I/O, and if the system is interrupted, e.g. by a crash or
+by power failure, either both writes will appear on disk, or neither
+will.
+.Pp
+When a transaction is full, it is written to a circular buffer on
+disk called the
+.Em log .
+When the transaction has been written to disk, every write in the
+transaction is submitted to disk asynchronously.
+Finally, the file system may issue new writes via
+.Nm
+once enough writes submitted to disk have completed.
+.Pp
+After interruption, such as a crash or power failure, some writes
+issued by the file system may not have completed.
+However, the log is written consistently with program order and before
+file system writes are submitted to disk.
+Hence a consistent program-order view of the file system can be
+attained by resubmitting the writes that were successfully stored in
+the log using
+.Xr wapbl_replay 9 .
+This may not be the same state just before interruption -- writes in
+transactions that did not reach the disk will be excluded.
+.Pp
+For a file system to use
+.Nm ,
+its
+.Xr VFS_MOUNT 9
+method should first replay any journal on disk using
+.Xr wapbl_replay 9 ,
+and then, if the mount is read/write, initialize
+.Nm
+for the mount by calling
+.Fn wapbl_start .
+The
+.Xr VFS_MOUNT 9
+method should call
+.Fn wapbl_stop .
+.Pp
+Before issuing any
+.Xr buffercache 9
+writes, the file system must lock the current
+.Nm
+transaction with
+.Fn wapbl_begin ,
+which may sleep until there is room in the transaction for new writes.
+After issuing the writes, the file system must unlock the transaction
+with
+.Fn wapbl_end .
+Either all writes issued between
+.Fn wapbl_begin
+and
+.Fn wapbl_end
+will complete, or none of them will.
+File systems can assert that the transaction should be locked with
+.Fn wapbl_jlock_assert ,
+or unlocked, with
+.Fn wapbl_junlock_assert .
+.Pp
+If a file system requires multiple transactions to initialize an
+inode, and needs to destroy partially initialized inodes during replay,
+it can register them by
+.Vt ino_t
+inode number before initialization with
+.Fn wapbl_register_inode
+and unregister them with
+.Fn wapbl_unregister_inode
+once initialization is complete.
+.Nm
+does not actually concern itself whether the objects identified by
+.Vt ino_t
+values are
+.Sq inodes
+or
+.Sq quaggas
+or anything else -- file systems may use this to list any objects keyed
+by
+.Vt ino_t
+value in the log.
+.Pp
+When a file system frees resources on disk and issues writes to reflect
+the fact, it cannot then reuse the resources until the writes have
+reached the disk.
+However, as far as the
+.Xr buffercache 9
+is concerned, as soon as the file system issues the writes, they will
+appear to have been written.
+So the file system must not attempt to reuse the resource until the
+current
+.Nm
+transaction has been flushed to disk.
+.Pp
+The file system can defer freeing a resource by calling
+.Fn wapbl_register_deallocation
+to record the disk address of the resource and length in bytes of the
+resource.
+Then, when
+.Nm
+next flushes the transaction to disk, it will pass an array of the disk
+addresses and lengths in bytes to a file-system-supplied callback.
+(Again,
+.Nm
+does not care whether the
+.Sq disk address
+or
+.Sq length in bytes
+is actually that; it will pass along
+.Vt daddr_t
+and
+.Vt int
+values.)
+.Sh FUNCTIONS
+.Bl -tag -width abcd
+.It Fn wapbl_start wlp mp devvp off count blksize wr flushfn flushabortfn
+Start using
+.Nm
+for the file system mounted at
+.Fa mp ,
+storing a log of
+.Fa count
+disk sectors at disk address
+.Fa off
+on the block device
+.Fa devvp
+writing blocks in units of
+.Fa blksize
+bytes.
+On success, stores an opaque
+.Vt "struct wapbl *"
+cookie in
+.Li * Ns Fa wlp
+for use with the other
+.Nm
+routines and returns zero.
+On failure, returns an error number.
+.Pp
+If the file system had replayed the log with
+.Xr wapbl_replay 9 ,
+then
+.Fa wr
+must be the
+.Vt "struct wapbl_replay *"
+cookie used to replay it, and
+.Fn wapbl_start
+will register any inodes that were in the log as if with
+.Fn wapbl_register_inode ;
+otherwise
+.Fa wr
+must be
+.Dv NULL .
+.Pp
+.Fa flushfn
+is a callback that
+.Nm
+will invoke as
+.Fa flushfn Ns Li ( Fa mp Ns Li , Fa deallocblks Ns Li , Fa dealloclens Ns Li , Fa dealloccnt Ns Li )
+just before it flushes a transaction to disk, with the transaction
+locked exclusively, where
+.Fa mp
+is the mount point passed to
+.Fn wapbl_start ,
+.Fa deallocblks
+is an array of
+.Fa dealloccnt
+disk addresses, and
+.Fa dealloclens
+is an array of
+.Fa dealloccnt
+lengths, corresponding to the addresses and lengths the file system
+passed to
+.Fn wapbl_register_deallocation .
+If flushing the transaction to disk fails,
+.Nm
+will call
+.Fa flushabortfn
+with the same arguments to undo any effects that
+.Fa flushfn
+had.
+.It Fn wapbl_stop wl force
+Flush the current transaction to disk and stop using
+.Nm .
+If flushing the transaction fails and
+.Fa force
+is zero,
+return error.
+If flushing the transaction fails and
+.Fa force
+is nonzero, discard the transaction, permanently losing any writes in
+it.
+If flushing the transaction is successful or if
+.Fa force
+is nonzero,
+free memory associated with
+.Fa wl
+and return zero.
+.It Fn wapbl_begin wl file line
+Wait for space in the current transaction for new writes, flushing it
+if necessary, and lock it.
+.Pp
+The lock is not exclusive: other threads may lock the transaction too.
+However, if there is not enough space, another thread will obtain an
+exclusive lock in order to flush the transaction.
+.Pp
+May sleep.
+.Pp
+.Fa file
+and
+.Fa line
+are the file name and line number of the caller for debugging
+purposes.
+.It Fn wapbl_end wl
+Unlock the transaction.
+.It Fn wapbl_flush wl wait
+Flush the current transaction to disk.
+If
+.Fa wait
+is nonzero, wait for all writes in the current transaction to
+complete.
+.It Fn wapbl_discard wl
+Discard the current transaction, permanently losing any writes in it.
+.It Fn wapbl_add_buf wl bp
+Add the buffer
+.Fa bp
+to the current transaction, which must be locked, because someone has
+asked to write it.
+.Pp
+This is meant to be called by
+.Xr bwrite 9
+or
+.Xr bdwrite 9 ,
+not by file systems directly.
+.It Fn wapbl_remove_buf wl bp
+Remove the buffer
+.Fa bp ,
+which must have been added using
+.Fa wapbl_add_buf ,
+from the current transaction, which must be locked, because it has been
+invalidated (or XXX ???).
+.Pp
+This is meant to be called from within
+.Xr buffercache 9 ,
+not by file systems directly.
+.It Fn wapbl_resize_buf wl bp oldsz oldcnt
+Note that the buffer
+.Fa bp ,
+which must have been added using
+.Fa wapbl_add_buf ,
+has changed size, where
+.Fa oldsz
+is the previous allocated size in bytes and
+.Fa oldcnt
+is the previous number of valid bytes in
+.Fa bp .
+.Pp
+This is meant to be called from within
+.Xr buffercache 9 ,
+not by file systems directly.
+.It Fn wapbl_register_inode wl ino mode
+Register
+.Fa ino
+with the mode
+.Fa mode
+as commencing initialization.
+.It Fn wapbl_unregister_inode wl ino mode
+Unregister
+.Fa ino ,
+which must have previously been registered with
+.Fa wapbl_register_inode
+using the same
+.Fa mode ,
+now that its initialization has completed.
+.It Fn wapbl_register_deallocation wl blk len
+Register
+.Fa len
+bytes at the disk address
+.Fa blk
+as ready for deallocation, so that they will be passed to the
+.Fa flushfn
+that was given to
+.Fn wapbl_start .
+.It Fn wapbl_jlock_assert wl
+Assert that the current transaction is locked.
+.It Fn wapbl_junlock_assert wl
+Assert that the current transaction is unlocked.
+.El
+.Sh CODE REFERENCES
+The
+.Nm
+subsystem is implemented in
+.Pa sys/kern/vfs_wapbl.c ,
+with hooks in
+.Pa sys/kern/vfs_bio.c .
+.Sh SEE ALSO
+.Xr buffercache 9 ,
+.Xr vfsops 9 ,
+.Xr wapbl_replay 9
+.Sh BUGS
+.Nm
+is intended only for file system metadata managed via the
+.Xr buffercache 9 ,
+and provides no way to log writes via the page cache, as in
+.Xr VOP_GETPAGES 9 ,
+.Xr VOP_PUTPAGES 9 ,
+and
+.Xr ubc_uiomove 9 ,
+which is normally used for file data.
+.Pp
+There is only one
+.Nm
+transaction for each file system at any given time, and only one
+.Nm
+log on disk.
+Consequently, all writes are serialized.
+Extending
+.Nm
+to support multiple logs per file system, partitioned according to an
+appropriate scheme, is left as an exercise for the reader.
+.Pp
+There is no reason for
+.Nm
+to require its own hooks in
+.Xr buffercache 9 .
+.Pp
+The on-disk format used by
+.Nm
+is undocumented.

Reply via email to