Re: svn commit: r304180 - head/sys/ufs/ffs

2016-08-17 Thread Bruce Evans

On Wed, 17 Aug 2016, Konstantin Belousov wrote:


On Tue, Aug 16, 2016 at 11:53:55PM +0200, Jilles Tjoelker wrote:

Hmm, some people interpret POSIX differently than I do, but I think it
is clear that XBD 3.384 Synchronized I/O Data Integrity Completion
requires the indirect blocks to be written (because "all file system
information required to retrieve the data is successfully transferred").
The Linux man page matches this interpretation except that an fsync of
the parent directory may be required.


All related directories must be synced.  It is almost useless to sync
the data if the data cannot be found later using normal file system
operations.  Not quite useless if the file can be recovered by fsck
and saved in lost+found, but the metadata needed to guarantee that
that works probably requires a lot of writes anyway.


If the whole file is being considered, then any change to indirect
blocks is needed to access some data (because the file was extended, a
hole was filled or snapshotted data was overwritten). For other
overwrites, there is no change to indirect blocks and no conflict with
fdatasync's performance objectives.

Note that the same argument is applicable to the inode block:  the di_db
and di_ib pointers to the direct blocks and to root indirect blocks are
required to retrieve the written data.


I would never have guessed that you left out either.


...


Ideally, a changed i_size would also cause the inode to be written, but
I don't know how to determine that (there is no i_flag for it). Always
writing the inode would defeat the point of fdatasync.


My version does a bcmp() of the in-core copy with the disk copy to avoid
null changes.  This turned out to be not very useful.  It was intended
mainly to avoid writing null changes for timestamps.  By I already
avoid most timestamp changes by mounting with -noatime and not having
large data.  I use only seconds granularity for timestamps.  With
nanoseconds granularity, any change to a file is guaranteed to change
its in-core inode, so the bcmp() wouldn't work.  But timestamps are about
the only metadata that obviously doesn't need syncing.  Especially atimes.
POSIX doesn't allow turning off atimes, but it also doesn't require
syncing them before the system crashes except probably using FSYNC of
metadata.

There is an i_flag for critical metadata like i_size.  It is IN_MODIFIED.
This flag is often mismanaged.  Many places set only IN_CHANGE after making
a critical change.  ufs_chown() is one such place.  Ownerships are fairly
critical metadata.  They are not considered critical enough to sync
immediately, but they should be synced by FSYNC of metadata.  But the
implementation is to set only IN_CHANGE.  i_ctime is usually updated much
later and IN_MODIFIED is set then to say that the relatively unimportant
i_ctime has been modified.  This inhibits the optimization of syncing for
chown() but not syncing for ctime changes.


Which was my reasoning behind omitting the indirect block flushing.
There is no practical difference between inode block and indirect blocks
for metadata consistency.


Ugh.


Note that the affected case is only the async mounts (not sync mounts,
not any variants of softdeps).


I fixed the non-syncing of inodes for the async mount case ~15 years ago
in my version, and finally merged the changes to my version of -current
a few months OK.  It was stupid that ordinary fsync() synced the indirect
blocks but not the more important inode.  Directory metadata is still not
synced by fsync() in the async mount case.


I can restore indirect block flushing and wait for them for DATA_ONLY sync,
but then I should also add ffs_update() call.


Here are my ffs fixes for -current.  The are mostly to sprinkle missing
DOINGASYNC() checks and finally fix ffs_update():

X Index: ffs/ffs_balloc.c
X ===
X --- ffs/ffs_balloc.c  (revision 304069)
X +++ ffs/ffs_balloc.c  (working copy)
X @@ -155,4 +155,6 @@
X   if (flags & IO_SYNC)
X   bwrite(bp);
X + else if (DOINGASYNC(vp))
X + bdwrite(bp);
X   else
X   bawrite(bp);
X @@ -265,12 +267,10 @@
X   newb, 0, fs->fs_bsize, 0, bp);
X   bdwrite(bp);
X + } else if ((flags & IO_SYNC) == 0 && DOINGASYNC(vp)) {
X + if (bp->b_bufsize == fs->fs_bsize)
X + bp->b_flags |= B_CLUSTEROK;
X + bdwrite(bp);

This restructures the code a little to fix the DOINGASYNC() check.  IO_SYNC
was not checked.

X   } else {
X - /*
X -  * Write synchronously so that indirect blocks
X -  * never point at garbage.
X -  */
X - if (DOINGASYNC(vp))
X - bdwrite(bp);
X -  

Re: svn commit: r304180 - head/sys/ufs/ffs

2016-08-17 Thread Konstantin Belousov
On Tue, Aug 16, 2016 at 11:53:55PM +0200, Jilles Tjoelker wrote:
> Hmm, some people interpret POSIX differently than I do, but I think it
> is clear that XBD 3.384 Synchronized I/O Data Integrity Completion
> requires the indirect blocks to be written (because "all file system
> information required to retrieve the data is successfully transferred").
> The Linux man page matches this interpretation except that an fsync of
> the parent directory may be required.
> 
> If the whole file is being considered, then any change to indirect
> blocks is needed to access some data (because the file was extended, a
> hole was filled or snapshotted data was overwritten). For other
> overwrites, there is no change to indirect blocks and no conflict with
> fdatasync's performance objectives.
Note that the same argument is applicable to the inode block:  the di_db
and di_ib pointers to the direct blocks and to root indirect blocks are
required to retrieve the written data.
...

> Ideally, a changed i_size would also cause the inode to be written, but
> I don't know how to determine that (there is no i_flag for it). Always
> writing the inode would defeat the point of fdatasync.
Which was my reasoning behind omitting the indirect block flushing.
There is no practical difference between inode block and indirect blocks
for metadata consistency.

Note that the affected case is only the async mounts (not sync mounts,
not any variants of softdeps).

I can restore indirect block flushing and wait for them for DATA_ONLY sync,
but then I should also add ffs_update() call.
___
svn-src-head@freebsd.org mailing list
https://lists.freebsd.org/mailman/listinfo/svn-src-head
To unsubscribe, send any mail to "svn-src-head-unsubscr...@freebsd.org"


Re: svn commit: r304180 - head/sys/ufs/ffs

2016-08-16 Thread Jilles Tjoelker
On Mon, Aug 15, 2016 at 07:22:24PM +, Konstantin Belousov wrote:
> [snip]
> @@ -254,15 +259,23 @@ loop:
>   if ((bp->b_vflags & BV_SCANNED) != 0)
>   continue;
>   bp->b_vflags |= BV_SCANNED;
> - /* Flush indirects in order. */
> + /*
> +  * Flush indirects in order, if requested.
> +  *
> +  * Note that if only datasync is requested, we can
> +  * skip indirect blocks when softupdates are not
> +  * active.  Otherwise we must flush them with data,
> +  * since dependencies prevent data block writes.
> +  */
>   if (waitfor == MNT_WAIT && bp->b_lblkno <= -NDADDR &&
> - lbn_level(bp->b_lblkno) >= passes)
> + (lbn_level(bp->b_lblkno) >= passes ||
> + ((flags & DATA_ONLY) != 0 && !DOINGSOFTDEP(vp
>   continue;
>   if (bp->b_lblkno > lbn)
>   panic("ffs_syncvnode: syncing truncated data.");
>   if (BUF_LOCK(bp, LK_EXCLUSIVE | LK_NOWAIT, NULL) == 0) {
>   BO_UNLOCK(bo);
> - } else if (wait != 0) {
> + } else if (wait) {
>   if (BUF_LOCK(bp,
>   LK_EXCLUSIVE | LK_SLEEPFAIL | LK_INTERLOCK,
>   BO_LOCKPTR(bo)) != 0) {

Hmm, some people interpret POSIX differently than I do, but I think it
is clear that XBD 3.384 Synchronized I/O Data Integrity Completion
requires the indirect blocks to be written (because "all file system
information required to retrieve the data is successfully transferred").
The Linux man page matches this interpretation except that an fsync of
the parent directory may be required.

If the whole file is being considered, then any change to indirect
blocks is needed to access some data (because the file was extended, a
hole was filled or snapshotted data was overwritten). For other
overwrites, there is no change to indirect blocks and no conflict with
fdatasync's performance objectives.

> @@ -330,31 +343,59 @@ next:
>* these will be done with one sync and one async pass.
>*/
>   if (bo->bo_dirty.bv_cnt > 0) {
> - /* Write the inode after sync passes to flush deps. */
> - if (wait && DOINGSOFTDEP(vp) && (flags & NO_INO_UPDT) == 0) {
> - BO_UNLOCK(bo);
> - ffs_update(vp, 1);
> - BO_LOCK(bo);
> + if ((flags & DATA_ONLY) == 0) {
> + still_dirty = true;
> + } else {
> + /*
> +  * For data-only sync, dirty indirect buffers
> +  * are ignored.
> +  */
> + still_dirty = false;
> + TAILQ_FOREACH(bp, &bo->bo_dirty.bv_hd, b_bobufs) {
> + if (bp->b_lblkno > -NDADDR) {
> + still_dirty = true;
> + break;
> + }
> + }
>   }
> - /* switch between sync/async. */
> - wait = !wait;
> - if (wait == 1 || ++passes < NIADDR + 2)
> - goto loop;
> +
> + if (still_dirty) {
> + /* Write the inode after sync passes to flush deps. */
> + if (wait && DOINGSOFTDEP(vp) &&
> + (flags & NO_INO_UPDT) == 0) {
> + BO_UNLOCK(bo);
> + ffs_update(vp, 1);
> + BO_LOCK(bo);
> + }
> + /* switch between sync/async. */
> + wait = !wait;
> + if (wait || ++passes < NIADDR + 2)
> + goto loop;
>  #ifdef INVARIANTS
> - if (!vn_isdisk(vp, NULL))
> - vn_printf(vp, "ffs_fsync: dirty ");
> + if (!vn_isdisk(vp, NULL))
> + vn_printf(vp, "ffs_fsync: dirty ");
>  #endif
> + }
>   }
>   BO_UNLOCK(bo);
>   error = 0;
> - if ((flags & NO_INO_UPDT) == 0)
> - error = ffs_update(vp, 1);
> - if (DOINGSUJ(vp))
> - softdep_journal_fsync(VTOI(vp));
> + if ((flags & DATA_ONLY) == 0) {
> + if ((flags & NO_INO_UPDT) == 0)
> + error = ffs_update(vp, 1);
> + if (DOINGSUJ(vp))
> + softdep_journal_fsync(VTOI(vp));
> + }
>   return (error);
>  }

Ideally, a changed i_size would also cause the inode to be written, but
I don't know how to determine that (there is no i_flag for it). Always
writing the inode would defeat the point of fdatasync.

-- 
Jilles Tjoelker
___
svn-src-head@freebsd.org maili

svn commit: r304180 - head/sys/ufs/ffs

2016-08-15 Thread Konstantin Belousov
Author: kib
Date: Mon Aug 15 19:22:23 2016
New Revision: 304180
URL: https://svnweb.freebsd.org/changeset/base/304180

Log:
  Implement VOP_FDATASYNC() for UFS.
  
  Reviewed by:  mckusick
  Tested by:pho
  Sponsored by: The FreeBSD Foundation
  MFC after:2 weeks
  Differential revision:https://reviews.freebsd.org/D7471

Modified:
  head/sys/ufs/ffs/ffs_extern.h
  head/sys/ufs/ffs/ffs_vnops.c

Modified: head/sys/ufs/ffs/ffs_extern.h
==
--- head/sys/ufs/ffs/ffs_extern.h   Mon Aug 15 19:18:10 2016
(r304179)
+++ head/sys/ufs/ffs/ffs_extern.h   Mon Aug 15 19:22:23 2016
(r304180)
@@ -174,6 +174,11 @@ void   softdep_freework(struct workhead *)
  * deadlock when flushing snapshot inodes while holding snaplk.
  */
 #defineNO_INO_UPDT 0x0001
+/*
+ * Request data sync only from ffs_syncvnode(), not touching even more
+ * metadata than NO_INO_UPDT.
+ */
+#defineDATA_ONLY   0x0002
 
 intffs_rdonly(struct inode *);
 

Modified: head/sys/ufs/ffs/ffs_vnops.c
==
--- head/sys/ufs/ffs/ffs_vnops.cMon Aug 15 19:18:10 2016
(r304179)
+++ head/sys/ufs/ffs/ffs_vnops.cMon Aug 15 19:22:23 2016
(r304180)
@@ -103,6 +103,7 @@ __FBSDID("$FreeBSD$");
 extern int ffs_rawread(struct vnode *vp, struct uio *uio, int *workdone);
 #endif
 static vop_fsync_t ffs_fsync;
+static vop_fdatasync_t ffs_fdatasync;
 static vop_lock1_t ffs_lock;
 static vop_read_t  ffs_read;
 static vop_write_t ffs_write;
@@ -123,6 +124,7 @@ static vop_vptofh_t ffs_vptofh;
 struct vop_vector ffs_vnodeops1 = {
.vop_default =  &ufs_vnodeops,
.vop_fsync =ffs_fsync,
+   .vop_fdatasync =ffs_fdatasync,
.vop_getpages = vnode_pager_local_getpages,
.vop_getpages_async =   vnode_pager_local_getpages_async,
.vop_lock1 =ffs_lock,
@@ -135,6 +137,7 @@ struct vop_vector ffs_vnodeops1 = {
 struct vop_vector ffs_fifoops1 = {
.vop_default =  &ufs_fifoops,
.vop_fsync =ffs_fsync,
+   .vop_fdatasync =ffs_fdatasync,
.vop_reallocblks =  ffs_reallocblks, /* XXX: really ??? */
.vop_vptofh =   ffs_vptofh,
 };
@@ -143,6 +146,7 @@ struct vop_vector ffs_fifoops1 = {
 struct vop_vector ffs_vnodeops2 = {
.vop_default =  &ufs_vnodeops,
.vop_fsync =ffs_fsync,
+   .vop_fdatasync =ffs_fdatasync,
.vop_getpages = vnode_pager_local_getpages,
.vop_getpages_async =   vnode_pager_local_getpages_async,
.vop_lock1 =ffs_lock,
@@ -161,6 +165,7 @@ struct vop_vector ffs_vnodeops2 = {
 struct vop_vector ffs_fifoops2 = {
.vop_default =  &ufs_fifoops,
.vop_fsync =ffs_fsync,
+   .vop_fdatasync =ffs_fdatasync,
.vop_lock1 =ffs_lock,
.vop_reallocblks =  ffs_reallocblks,
.vop_strategy = ffsext_strategy,
@@ -216,10 +221,10 @@ ffs_syncvnode(struct vnode *vp, int wait
 {
struct inode *ip;
struct bufobj *bo;
-   struct buf *bp;
-   struct buf *nbp;
+   struct buf *bp, *nbp;
ufs_lbn_t lbn;
-   int error, wait, passes;
+   int error, passes;
+   bool still_dirty, wait;
 
ip = VTOI(vp);
ip->i_flag &= ~IN_NEEDSYNC;
@@ -238,7 +243,7 @@ ffs_syncvnode(struct vnode *vp, int wait
 */
error = 0;
passes = 0;
-   wait = 0;   /* Always do an async pass first. */
+   wait = false;   /* Always do an async pass first. */
lbn = lblkno(ip->i_fs, (ip->i_size + ip->i_fs->fs_bsize - 1));
BO_LOCK(bo);
 loop:
@@ -254,15 +259,23 @@ loop:
if ((bp->b_vflags & BV_SCANNED) != 0)
continue;
bp->b_vflags |= BV_SCANNED;
-   /* Flush indirects in order. */
+   /*
+* Flush indirects in order, if requested.
+*
+* Note that if only datasync is requested, we can
+* skip indirect blocks when softupdates are not
+* active.  Otherwise we must flush them with data,
+* since dependencies prevent data block writes.
+*/
if (waitfor == MNT_WAIT && bp->b_lblkno <= -NDADDR &&
-   lbn_level(bp->b_lblkno) >= passes)
+   (lbn_level(bp->b_lblkno) >= passes ||
+   ((flags & DATA_ONLY) != 0 && !DOINGSOFTDEP(vp
continue;
if (bp->b_lblkno > lbn)
panic("ffs_syncvnode: syncing truncated data.");
if (BUF_LOCK(bp, LK_EXCLUSIVE | LK_NOWAIT, NULL) == 0) {
BO_UNLOCK(bo);
-