Re: [Cluster-devel] [GFS2 PATCH v3 00/19] gfs2: misc recovery patch collection

2019-04-30 Thread Steven Whitehouse

Hi,

On 01/05/2019 00:03, Bob Peterson wrote:

Here is version 3 of the patch set I posted on 23 April. It is revised
based on bugs I found testing with xfstests.

This is a collection of patches I've been using to address the myriad
of recovery problems I've found. I'm still finding them, so the battle
is not won yet. I'm not convinced we need all of these but I thought
I'd send them anyway and get feedback. Previously I sent out a version
of the patch "gfs2: Force withdraw to replay journals and wait for it
to finish" that was too big and complex. So I broke it up into four
patches, starting with "move check_journal_clean to util.c for future
use". So those four need to be a set. There aren't many other dependencies
between patches, so the others could probably be taken or rejected
individually. There are more patches I still need to perfect, but maybe
a few of the safer ones can be pushed to for-next.


There is quite a lot of good stuff here, but it would be good to split 
it up and make it clearer what is bug fixes, and what are new features,


Steve.



Bob Peterson (19):
   gfs2: kthread and remount improvements
   gfs2: eliminate tr_num_revoke_rm
   gfs2: log which portion of the journal is replayed
   gfs2: Warn when a journal replay overwrites a rgrp with buffers
   gfs2: Introduce concept of a pending withdraw
   gfs2: log error reform
   gfs2: Only complain the first time an io error occurs in quota or log
   gfs2: Stop ail1 wait loop when withdrawn
   gfs2: Ignore recovery attempts if gfs2 has io error or is withdrawn
   gfs2: move check_journal_clean to util.c for future use
   gfs2: Allow some glocks to be used during withdraw
   gfs2: Don't loop forever in gfs2_freeze if withdrawn
   gfs2: Make secondary withdrawers wait for first withdrawer
   gfs2: Don't write log headers after file system withdraw
   gfs2: Force withdraw to replay journals and wait for it to finish
   gfs2: simply gfs2_freeze by removing case
   gfs2: Add verbose option to check_journal_clean
   gfs2: Check for log write errors before telling dlm to unlock
   gfs2: Do log_flush in gfs2_ail_empty_gl even if ail list is empty

  fs/gfs2/aops.c   |   4 +-
  fs/gfs2/file.c   |   2 +-
  fs/gfs2/glock.c  |  82 +++--
  fs/gfs2/glock.h  |   1 +
  fs/gfs2/glops.c  | 100 +++--
  fs/gfs2/glops.h  |   3 +-
  fs/gfs2/incore.h |  21 -
  fs/gfs2/inode.c  |  14 ++-
  fs/gfs2/lock_dlm.c   |  50 +++
  fs/gfs2/log.c|  55 +++-
  fs/gfs2/log.h|   1 +
  fs/gfs2/lops.c   |  27 +-
  fs/gfs2/meta_io.c|   6 +-
  fs/gfs2/ops_fstype.c |  65 --
  fs/gfs2/quota.c  |  10 ++-
  fs/gfs2/recovery.c   |   3 +-
  fs/gfs2/super.c  |  88 +++---
  fs/gfs2/super.h  |   1 +
  fs/gfs2/sys.c|  14 ++-
  fs/gfs2/trans.c  |   6 +-
  fs/gfs2/util.c   | 206 ---
  fs/gfs2/util.h   |  15 
  22 files changed, 617 insertions(+), 157 deletions(-)





Re: [Cluster-devel] [GFS2 PATCH v3 09/19] gfs2: Ignore recovery attempts if gfs2 has io error or is withdrawn

2019-04-30 Thread Steven Whitehouse

Hi,

On 01/05/2019 00:03, Bob Peterson wrote:

This patch addresses various problems with gfs2/dlm recovery.

For example, suppose a node with a bunch of gfs2 mounts suddenly
reboots due to kernel panic, and dlm determines it should perform
recovery. DLM does so from a pseudo-state machine calling various
callbacks into lock_dlm to perform a sequence of steps. It uses
generation numbers and recover bits in dlm "control" lock lvbs.

Now suppose another node tries to recover the failed node's
journal, but in so doing, encounters an IO error or withdraws
due to unforeseen circumstances, such as an hba driver failure.
In these cases, the recovery would eventually bail out, but it
would still update its generation number in the lvb. The other
nodes would all see the newer generation number and think they
don't need to do recovery because the generation number is newer
than the last one they saw, and therefore someone else has already
taken care of it.

If the file system has an io error or is withdrawn, it cannot
safely replay any journals (its own or others) but someone else
still needs to do it. Therefore we don't want it messing with
the journal recovery generation numbers: the local generation
numbers eventually get put into the lvb generation numbers to be
seen by all nodes.

This patch adds checks to many of the callbacks used by dlm
in its recovery state machine so that the functions are ignored
and skipped if an io error has occurred or if the file system
was withdraw.

Signed-off-by: Bob Peterson 


These should probably propagate the error back to the caller of the 
recovery request. We do have a proper notification system for failed 
recovery via uevents,


Steve.


---
  fs/gfs2/lock_dlm.c | 18 ++
  fs/gfs2/util.c | 15 +++
  2 files changed, 25 insertions(+), 8 deletions(-)

diff --git a/fs/gfs2/lock_dlm.c b/fs/gfs2/lock_dlm.c
index 31df26ed7854..9329f86ffcbe 100644
--- a/fs/gfs2/lock_dlm.c
+++ b/fs/gfs2/lock_dlm.c
@@ -1081,6 +1081,10 @@ static void gdlm_recover_prep(void *arg)
struct gfs2_sbd *sdp = arg;
struct lm_lockstruct *ls = >sd_lockstruct;
  
+	if (gfs2_withdrawn(sdp)) {

+   fs_err(sdp, "recover_prep ignored due to withdraw.\n");
+   return;
+   }
spin_lock(>ls_recover_spin);
ls->ls_recover_block = ls->ls_recover_start;
set_bit(DFL_DLM_RECOVERY, >ls_recover_flags);
@@ -1103,6 +1107,11 @@ static void gdlm_recover_slot(void *arg, struct dlm_slot 
*slot)
struct lm_lockstruct *ls = >sd_lockstruct;
int jid = slot->slot - 1;
  
+	if (gfs2_withdrawn(sdp)) {

+   fs_err(sdp, "recover_slot jid %d ignored due to withdraw.\n",
+  jid);
+   return;
+   }
spin_lock(>ls_recover_spin);
if (ls->ls_recover_size < jid + 1) {
fs_err(sdp, "recover_slot jid %d gen %u short size %d\n",
@@ -1127,6 +1136,10 @@ static void gdlm_recover_done(void *arg, struct dlm_slot 
*slots, int num_slots,
struct gfs2_sbd *sdp = arg;
struct lm_lockstruct *ls = >sd_lockstruct;
  
+	if (gfs2_withdrawn(sdp)) {

+   fs_err(sdp, "recover_done ignored due to withdraw.\n");
+   return;
+   }
/* ensure the ls jid arrays are large enough */
set_recover_size(sdp, slots, num_slots);
  
@@ -1154,6 +1167,11 @@ static void gdlm_recovery_result(struct gfs2_sbd *sdp, unsigned int jid,

  {
struct lm_lockstruct *ls = >sd_lockstruct;
  
+	if (gfs2_withdrawn(sdp)) {

+   fs_err(sdp, "recovery_result jid %d ignored due to withdraw.\n",
+  jid);
+   return;
+   }
if (test_bit(DFL_NO_DLM_OPS, >ls_recover_flags))
return;
  
diff --git a/fs/gfs2/util.c b/fs/gfs2/util.c

index 0a814ccac41d..7eaea6dfe1cf 100644
--- a/fs/gfs2/util.c
+++ b/fs/gfs2/util.c
@@ -259,14 +259,13 @@ void gfs2_io_error_bh_i(struct gfs2_sbd *sdp, struct 
buffer_head *bh,
const char *function, char *file, unsigned int line,
bool withdraw)
  {
-   if (!test_bit(SDF_SHUTDOWN, >sd_flags))
-   fs_err(sdp,
-  "fatal: I/O error\n"
-  "  block = %llu\n"
-  "  function = %s, file = %s, line = %u\n",
-  (unsigned long long)bh->b_blocknr,
-  function, file, line);
+   if (gfs2_withdrawn(sdp))
+   return;
+
+   fs_err(sdp, "fatal: I/O error\n"
+  "  block = %llu\n"
+  "  function = %s, file = %s, line = %u\n",
+  (unsigned long long)bh->b_blocknr, function, file, line);
if (withdraw)
gfs2_lm_withdraw(sdp, NULL);
  }
-




Re: [Cluster-devel] [PATCH v2] gfs2: fix race between gfs2_freeze_func and unmount

2019-04-30 Thread Andreas Gruenbacher
On Tue, 30 Apr 2019 at 23:54, Abhi Das  wrote:
> As part of the freeze operation, gfs2_freeze_func() is left blocking
> on a request to hold the sd_freeze_gl in SH. This glock is held in EX
> by the gfs2_freeze() code.
>
> A subsequent call to gfs2_unfreeze() releases the EXclusively held
> sd_freeze_gl, which allows gfs2_freeze_func() to acquire it in SH and
> resume its operation.
>
> gfs2_unfreeze(), however, doesn't wait for gfs2_freeze_func() to complete.
> If a umount is issued right after unfreeze, it could result in an
> inconsistent filesystem because some journal data (statfs update) isn't
> written out.
>
> Refer to commit 24972557b12c for a more detailed explanation of how
> freeze/unfreeze work.
>
> This patch causes gfs2_unfreeze() to wait for gfs2_freeze_func() to
> complete before returning to the user.
>
> Signed-off-by: Abhi Das 

Thanks, pushed to for-next.

Andreas

> ---
>  fs/gfs2/incore.h | 1 +
>  fs/gfs2/super.c  | 8 +---
>  2 files changed, 6 insertions(+), 3 deletions(-)
>
> diff --git a/fs/gfs2/incore.h b/fs/gfs2/incore.h
> index 78c8e761b321..b15755068593 100644
> --- a/fs/gfs2/incore.h
> +++ b/fs/gfs2/incore.h
> @@ -621,6 +621,7 @@ enum {
> SDF_SKIP_DLM_UNLOCK = 8,
> SDF_FORCE_AIL_FLUSH = 9,
> SDF_AIL1_IO_ERROR   = 10,
> +   SDF_FS_FROZEN   = 11,
>  };
>
>  enum gfs2_freeze_state {
> diff --git a/fs/gfs2/super.c b/fs/gfs2/super.c
> index a6a325b2a78b..ceec631efa49 100644
> --- a/fs/gfs2/super.c
> +++ b/fs/gfs2/super.c
> @@ -973,8 +973,7 @@ void gfs2_freeze_func(struct work_struct *work)
> if (error) {
> printk(KERN_INFO "GFS2: couldn't get freeze lock : %d\n", 
> error);
> gfs2_assert_withdraw(sdp, 0);
> -   }
> -   else {
> +   } else {
> atomic_set(>sd_freeze_state, SFS_UNFROZEN);
> error = thaw_super(sb);
> if (error) {
> @@ -987,6 +986,8 @@ void gfs2_freeze_func(struct work_struct *work)
> gfs2_glock_dq_uninit(_gh);
> }
> deactivate_super(sb);
> +   clear_bit_unlock(SDF_FS_FROZEN, >sd_flags);
> +   wake_up_bit(>sd_flags, SDF_FS_FROZEN);
> return;
>  }
>
> @@ -1029,6 +1030,7 @@ static int gfs2_freeze(struct super_block *sb)
> msleep(1000);
> }
> error = 0;
> +   set_bit(SDF_FS_FROZEN, >sd_flags);
>  out:
> mutex_unlock(>sd_freeze_mutex);
> return error;
> @@ -1053,7 +1055,7 @@ static int gfs2_unfreeze(struct super_block *sb)
>
> gfs2_glock_dq_uninit(>sd_freeze_gh);
> mutex_unlock(>sd_freeze_mutex);
> -   return 0;
> +   return wait_on_bit(>sd_flags, SDF_FS_FROZEN, TASK_INTERRUPTIBLE);
>  }
>
>  /**
> --
> 2.20.1
>



[Cluster-devel] [GFS2 PATCH v3 02/19] gfs2: eliminate tr_num_revoke_rm

2019-04-30 Thread Bob Peterson
For its journal processing, gfs2 kept track of the number of buffers
added and removed on a per-transaction basis. These values are used
to calculate space needed in the journal. But while these calculations
make sense for the number of buffers, they make no sense for revokes.
Revokes are managed in their own list, linked from the superblock.
So it's entirely unnecessary to keep separate per-transaction counts
for revokes added and removed. A single count will do the same job.
Therefore, this patch combines the transaction revokes into a single
count.

Signed-off-by: Bob Peterson 
---
 fs/gfs2/incore.h | 1 -
 fs/gfs2/log.c| 3 +--
 fs/gfs2/trans.c  | 6 +++---
 3 files changed, 4 insertions(+), 6 deletions(-)

diff --git a/fs/gfs2/incore.h b/fs/gfs2/incore.h
index 78c8e761b321..c54800385298 100644
--- a/fs/gfs2/incore.h
+++ b/fs/gfs2/incore.h
@@ -507,7 +507,6 @@ struct gfs2_trans {
unsigned int tr_num_buf_rm;
unsigned int tr_num_databuf_rm;
unsigned int tr_num_revoke;
-   unsigned int tr_num_revoke_rm;
 
struct list_head tr_list;
struct list_head tr_databuf;
diff --git a/fs/gfs2/log.c b/fs/gfs2/log.c
index a7febb4bd400..76da6f046379 100644
--- a/fs/gfs2/log.c
+++ b/fs/gfs2/log.c
@@ -887,7 +887,6 @@ static void gfs2_merge_trans(struct gfs2_trans *old, struct 
gfs2_trans *new)
old->tr_num_buf_rm  += new->tr_num_buf_rm;
old->tr_num_databuf_rm  += new->tr_num_databuf_rm;
old->tr_num_revoke  += new->tr_num_revoke;
-   old->tr_num_revoke_rm   += new->tr_num_revoke_rm;
 
list_splice_tail_init(>tr_databuf, >tr_databuf);
list_splice_tail_init(>tr_buf, >tr_buf);
@@ -909,7 +908,7 @@ static void log_refund(struct gfs2_sbd *sdp, struct 
gfs2_trans *tr)
set_bit(TR_ATTACHED, >tr_flags);
}
 
-   sdp->sd_log_commited_revoke += tr->tr_num_revoke - tr->tr_num_revoke_rm;
+   sdp->sd_log_commited_revoke += tr->tr_num_revoke;
reserved = calc_reserved(sdp);
maxres = sdp->sd_log_blks_reserved + tr->tr_reserved;
gfs2_assert_withdraw(sdp, maxres >= reserved);
diff --git a/fs/gfs2/trans.c b/fs/gfs2/trans.c
index d328da7cde36..7ebc0f99deb1 100644
--- a/fs/gfs2/trans.c
+++ b/fs/gfs2/trans.c
@@ -80,10 +80,10 @@ static void gfs2_print_trans(struct gfs2_sbd *sdp, const 
struct gfs2_trans *tr)
fs_warn(sdp, "blocks=%u revokes=%u reserved=%u touched=%u\n",
tr->tr_blocks, tr->tr_revokes, tr->tr_reserved,
test_bit(TR_TOUCHED, >tr_flags));
-   fs_warn(sdp, "Buf %u/%u Databuf %u/%u Revoke %u/%u\n",
+   fs_warn(sdp, "Buf %u/%u Databuf %u/%u Revoke %u\n",
tr->tr_num_buf_new, tr->tr_num_buf_rm,
tr->tr_num_databuf_new, tr->tr_num_databuf_rm,
-   tr->tr_num_revoke, tr->tr_num_revoke_rm);
+   tr->tr_num_revoke);
 }
 
 void gfs2_trans_end(struct gfs2_sbd *sdp)
@@ -266,7 +266,7 @@ void gfs2_trans_remove_revoke(struct gfs2_sbd *sdp, u64 
blkno, unsigned int len)
gfs2_assert_withdraw(sdp, sdp->sd_log_num_revoke);
sdp->sd_log_num_revoke--;
kmem_cache_free(gfs2_bufdata_cachep, bd);
-   tr->tr_num_revoke_rm++;
+   tr->tr_num_revoke--;
if (--n == 0)
break;
}
-- 
2.20.1



[Cluster-devel] [GFS2 PATCH v3 04/19] gfs2: Warn when a journal replay overwrites a rgrp with buffers

2019-04-30 Thread Bob Peterson
This patch adds some instrumentation in gfs2's journal replay that
indicates when we're about to overwrite a rgrp for which we already
have a valid buffer_head.

Signed-off-by: Bob Peterson 
---
 fs/gfs2/lops.c | 22 --
 1 file changed, 20 insertions(+), 2 deletions(-)

diff --git a/fs/gfs2/lops.c b/fs/gfs2/lops.c
index 6af6a3cea967..2e8c6d02e112 100644
--- a/fs/gfs2/lops.c
+++ b/fs/gfs2/lops.c
@@ -564,9 +564,27 @@ static int buf_lo_scan_elements(struct gfs2_jdesc *jd, u32 
start,
 
if (gfs2_meta_check(sdp, bh_ip))
error = -EIO;
-   else
+   else {
+   struct gfs2_meta_header *mh =
+   (struct gfs2_meta_header *)bh_ip->b_data;
+
+   if (mh->mh_type == cpu_to_be32(GFS2_METATYPE_RG)) {
+   struct gfs2_rgrpd *rgd;
+
+   rgd = gfs2_blk2rgrpd(sdp, blkno, false);
+   if (rgd && rgd->rd_addr == blkno &&
+   rgd->rd_bits && rgd->rd_bits->bi_bh) {
+   fs_info(sdp, "Replaying 0x%llx but we "
+   "already have a bh!\n",
+   (unsigned long long)blkno);
+   fs_info(sdp, "busy:%d, pinned:%d\n",
+   
buffer_busy(rgd->rd_bits->bi_bh) ? 1 : 0,
+   
buffer_pinned(rgd->rd_bits->bi_bh));
+   gfs2_dump_glock(NULL, rgd->rd_gl);
+   }
+   }
mark_buffer_dirty(bh_ip);
-
+   }
brelse(bh_log);
brelse(bh_ip);
 
-- 
2.20.1



[Cluster-devel] [GFS2 PATCH v3 13/19] gfs2: Make secondary withdrawers wait for first withdrawer

2019-04-30 Thread Bob Peterson
Before this patch, if a process encountered an error and decided to
withdraw, if another process was already in the process of withdrawing,
the secondary withdraw would be silently ignored, which set it free
to proceed with its processing, unlock any locks, etc. That's correct
behavior if the original withdrawer encounters further errors down
the road. However, second withdrawers need to wait for the first
withdrawer to finish its withdraw before proceeding. If we don't wait
we could end up assuming everything is alright, unlock glocks and
telling other nodes they can have the glock, despite the fact that
a withdraw is still ongoing and may require a journal replay before
any locks are released. For example, if an rgrp glock is freed
by a process that didn't wait for the withdraw, a journal replay
could introduce file system corruption by replaying a rgrp block
that has already been granted to another node.

This patch makes secondary withdrawers wait until the primary
withdrawer is finished with its processing before proceeding.

Signed-off-by: Bob Peterson 
---
 fs/gfs2/incore.h |  3 +++
 fs/gfs2/util.c   | 21 +++--
 2 files changed, 22 insertions(+), 2 deletions(-)

diff --git a/fs/gfs2/incore.h b/fs/gfs2/incore.h
index b16b82e5b85a..0b283bbd20ea 100644
--- a/fs/gfs2/incore.h
+++ b/fs/gfs2/incore.h
@@ -628,6 +628,7 @@ enum {
SDF_SKIP_DLM_UNLOCK = 8,
SDF_FORCE_AIL_FLUSH = 9,
SDF_WITHDRAW= 10, /* Will withdraw eventually */
+   SDF_WITHDRAW_IN_PROG= 11, /* Withdraw is in progress */
 };
 
 enum gfs2_freeze_state {
@@ -837,6 +838,8 @@ struct gfs2_sbd {
struct bio *sd_log_bio;
wait_queue_head_t sd_log_flush_wait;
int sd_log_error; /* First log error */
+   atomic_t sd_withdrawer;
+   wait_queue_head_t sd_withdraw_wait;
 
atomic_t sd_reserving_log;
wait_queue_head_t sd_reserving_log_wait;
diff --git a/fs/gfs2/util.c b/fs/gfs2/util.c
index 398d3cedb1a4..ad56f934ff0f 100644
--- a/fs/gfs2/util.c
+++ b/fs/gfs2/util.c
@@ -89,9 +89,23 @@ int gfs2_lm_withdraw(struct gfs2_sbd *sdp, const char *fmt, 
...)
struct va_format vaf;
 
if (sdp->sd_args.ar_errors == GFS2_ERRORS_WITHDRAW &&
-   test_and_set_bit(SDF_SHUTDOWN, >sd_flags))
-   return 0;
+   test_and_set_bit(SDF_SHUTDOWN, >sd_flags)) {
+   if (!test_bit(SDF_WITHDRAW_IN_PROG, >sd_flags))
+   return -1;
+
+   fs_warn(sdp, "Pid %d waiting for process %d to withdraw.\n",
+   pid_nr(task_pid(current)),
+   atomic_read(>sd_withdrawer));
+   wait_on_bit(>sd_flags, SDF_WITHDRAW_IN_PROG,
+   TASK_UNINTERRUPTIBLE);
+   fs_warn(sdp, "Pid %d done waiting for process %d.\n",
+   pid_nr(task_pid(current)),
+   atomic_read(>sd_withdrawer));
+   return -1;
+   }
 
+   set_bit(SDF_WITHDRAW_IN_PROG, >sd_flags);
+   atomic_set(>sd_withdrawer, pid_nr(task_pid(current)));
if (fmt) {
va_start(args, fmt);
 
@@ -119,6 +133,9 @@ int gfs2_lm_withdraw(struct gfs2_sbd *sdp, const char *fmt, 
...)
set_bit(SDF_SKIP_DLM_UNLOCK, >sd_flags);
fs_err(sdp, "withdrawn\n");
dump_stack();
+   clear_bit(SDF_WITHDRAW_IN_PROG, >sd_flags);
+   smp_mb__after_atomic();
+   wake_up_bit(>sd_flags, SDF_WITHDRAW_IN_PROG);
}
 
if (sdp->sd_args.ar_errors == GFS2_ERRORS_PANIC)
-- 
2.20.1



[Cluster-devel] [GFS2 PATCH v3 01/19] gfs2: kthread and remount improvements

2019-04-30 Thread Bob Peterson
Before this patch, gfs2 saved the pointers to the two daemon threads
(logd and quotad) in the superblock, but they were never cleared,
even if the threads were stopped (e.g. on remount -o ro). That meant
that certain error conditions (like a withdrawn file system) could
race. For example, xfstests generic/361 caused an IO error during
remount -o ro, which caused the kthreads to be stopped, then the
error flagged. Later, when the test unmounted the file system, it
would try to stop the threads a second time with kthread_stop.

This patch does two things: First, every time it stops the threads
it zeroes out the thread pointer, and also checks whether it's NULL
before trying to stop it. Second, in function gfs2_remount_fs, it
was returning if an error was logged by either of the two functions
for gfs2_make_fs_ro and _rw, which caused it to bypass the online
uevent at the bottom of the function. This removes that bypass in
favor of just running the whole function, then returning the error.
That way, unmounts and remounts won't hang forever.

Signed-off-by: Bob Peterson 
---
 fs/gfs2/super.c | 21 ++---
 1 file changed, 14 insertions(+), 7 deletions(-)

diff --git a/fs/gfs2/super.c b/fs/gfs2/super.c
index a6a325b2a78b..dc4ec1ca3d4e 100644
--- a/fs/gfs2/super.c
+++ b/fs/gfs2/super.c
@@ -396,6 +396,7 @@ static int init_threads(struct gfs2_sbd *sdp)
 
 fail:
kthread_stop(sdp->sd_logd_process);
+   sdp->sd_logd_process = NULL;
return error;
 }
 
@@ -453,8 +454,12 @@ int gfs2_make_fs_rw(struct gfs2_sbd *sdp)
freeze_gh.gh_flags |= GL_NOCACHE;
gfs2_glock_dq_uninit(_gh);
 fail_threads:
-   kthread_stop(sdp->sd_quotad_process);
-   kthread_stop(sdp->sd_logd_process);
+   if (sdp->sd_quotad_process)
+   kthread_stop(sdp->sd_quotad_process);
+   sdp->sd_quotad_process = NULL;
+   if (sdp->sd_logd_process)
+   kthread_stop(sdp->sd_logd_process);
+   sdp->sd_logd_process = NULL;
return error;
 }
 
@@ -855,8 +860,12 @@ static int gfs2_make_fs_ro(struct gfs2_sbd *sdp)
return error;
 
flush_workqueue(gfs2_delete_workqueue);
-   kthread_stop(sdp->sd_quotad_process);
-   kthread_stop(sdp->sd_logd_process);
+   if (sdp->sd_quotad_process)
+   kthread_stop(sdp->sd_quotad_process);
+   sdp->sd_quotad_process = NULL;
+   if (sdp->sd_logd_process)
+   kthread_stop(sdp->sd_logd_process);
+   sdp->sd_logd_process = NULL;
 
gfs2_quota_sync(sdp->sd_vfs, 0);
gfs2_statfs_sync(sdp->sd_vfs, 0);
@@ -1273,8 +1282,6 @@ static int gfs2_remount_fs(struct super_block *sb, int 
*flags, char *data)
error = gfs2_make_fs_ro(sdp);
else
error = gfs2_make_fs_rw(sdp);
-   if (error)
-   return error;
}
 
sdp->sd_args = args;
@@ -1300,7 +1307,7 @@ static int gfs2_remount_fs(struct super_block *sb, int 
*flags, char *data)
spin_unlock(>gt_spin);
 
gfs2_online_uevent(sdp);
-   return 0;
+   return error;
 }
 
 /**
-- 
2.20.1



[Cluster-devel] [GFS2 PATCH v3 00/19] gfs2: misc recovery patch collection

2019-04-30 Thread Bob Peterson
Here is version 3 of the patch set I posted on 23 April. It is revised
based on bugs I found testing with xfstests.

This is a collection of patches I've been using to address the myriad
of recovery problems I've found. I'm still finding them, so the battle
is not won yet. I'm not convinced we need all of these but I thought
I'd send them anyway and get feedback. Previously I sent out a version
of the patch "gfs2: Force withdraw to replay journals and wait for it
to finish" that was too big and complex. So I broke it up into four
patches, starting with "move check_journal_clean to util.c for future
use". So those four need to be a set. There aren't many other dependencies
between patches, so the others could probably be taken or rejected
individually. There are more patches I still need to perfect, but maybe
a few of the safer ones can be pushed to for-next.

Bob Peterson (19):
  gfs2: kthread and remount improvements
  gfs2: eliminate tr_num_revoke_rm
  gfs2: log which portion of the journal is replayed
  gfs2: Warn when a journal replay overwrites a rgrp with buffers
  gfs2: Introduce concept of a pending withdraw
  gfs2: log error reform
  gfs2: Only complain the first time an io error occurs in quota or log
  gfs2: Stop ail1 wait loop when withdrawn
  gfs2: Ignore recovery attempts if gfs2 has io error or is withdrawn
  gfs2: move check_journal_clean to util.c for future use
  gfs2: Allow some glocks to be used during withdraw
  gfs2: Don't loop forever in gfs2_freeze if withdrawn
  gfs2: Make secondary withdrawers wait for first withdrawer
  gfs2: Don't write log headers after file system withdraw
  gfs2: Force withdraw to replay journals and wait for it to finish
  gfs2: simply gfs2_freeze by removing case
  gfs2: Add verbose option to check_journal_clean
  gfs2: Check for log write errors before telling dlm to unlock
  gfs2: Do log_flush in gfs2_ail_empty_gl even if ail list is empty

 fs/gfs2/aops.c   |   4 +-
 fs/gfs2/file.c   |   2 +-
 fs/gfs2/glock.c  |  82 +++--
 fs/gfs2/glock.h  |   1 +
 fs/gfs2/glops.c  | 100 +++--
 fs/gfs2/glops.h  |   3 +-
 fs/gfs2/incore.h |  21 -
 fs/gfs2/inode.c  |  14 ++-
 fs/gfs2/lock_dlm.c   |  50 +++
 fs/gfs2/log.c|  55 +++-
 fs/gfs2/log.h|   1 +
 fs/gfs2/lops.c   |  27 +-
 fs/gfs2/meta_io.c|   6 +-
 fs/gfs2/ops_fstype.c |  65 --
 fs/gfs2/quota.c  |  10 ++-
 fs/gfs2/recovery.c   |   3 +-
 fs/gfs2/super.c  |  88 +++---
 fs/gfs2/super.h  |   1 +
 fs/gfs2/sys.c|  14 ++-
 fs/gfs2/trans.c  |   6 +-
 fs/gfs2/util.c   | 206 ---
 fs/gfs2/util.h   |  15 
 22 files changed, 617 insertions(+), 157 deletions(-)

-- 
2.20.1



[Cluster-devel] [GFS2 PATCH v3 08/19] gfs2: Stop ail1 wait loop when withdrawn

2019-04-30 Thread Bob Peterson
Before this patch, function gfs2_log_flush could get into an infinite
loop trying to clear out its ail1 list. If the file system was
withdrawn (or pending withdraw) due to a problem with writing the ail1
list, it would never clear out the list, and therefore, would loop
infinitely. This patch changes function gfs2_log_flush so that it
does while (!gfs2_withdraw(sdp)) rather than while (;;).

Signed-off-by: Bob Peterson 
---
 fs/gfs2/log.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/gfs2/log.c b/fs/gfs2/log.c
index 33ef2cb570e2..6169276aa9e6 100644
--- a/fs/gfs2/log.c
+++ b/fs/gfs2/log.c
@@ -854,7 +854,7 @@ void gfs2_log_flush(struct gfs2_sbd *sdp, struct gfs2_glock 
*gl, u32 flags)
 
if (!(flags & GFS2_LOG_HEAD_FLUSH_NORMAL)) {
if (!sdp->sd_log_idle) {
-   for (;;) {
+   while (!gfs2_withdrawn(sdp)) {
gfs2_ail1_start(sdp);
gfs2_ail1_wait(sdp);
if (gfs2_ail1_empty(sdp))
-- 
2.20.1



[Cluster-devel] [GFS2 PATCH v3 03/19] gfs2: log which portion of the journal is replayed

2019-04-30 Thread Bob Peterson
When a journal is replayed, gfs2 logs a message similar to:

jid=X: Replaying journal...

This patch adds the tail and block number so that the range of the
replayed block is also printed. These values will match the values
shown if the journal is dumped with gfs2_edit -p journalX. The
resulting output looks something like this:

jid=1: Replaying journal...0x28b7 to 0x2beb

This will allow us to better debug file system corruption problems.

Signed-off-by: Bob Peterson 
---
 fs/gfs2/recovery.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/fs/gfs2/recovery.c b/fs/gfs2/recovery.c
index fa575d1676b9..16422fdef2b9 100644
--- a/fs/gfs2/recovery.c
+++ b/fs/gfs2/recovery.c
@@ -514,7 +514,8 @@ void gfs2_recover_func(struct work_struct *work)
}
 
t_tlck = ktime_get();
-   fs_info(sdp, "jid=%u: Replaying journal...\n", jd->jd_jid);
+   fs_info(sdp, "jid=%u: Replaying journal...0x%x to 0x%x\n",
+   jd->jd_jid, head.lh_tail, head.lh_blkno);
 
for (pass = 0; pass < 2; pass++) {
lops_before_scan(jd, , pass);
-- 
2.20.1



[Cluster-devel] [GFS2 PATCH v3 12/19] gfs2: Don't loop forever in gfs2_freeze if withdrawn

2019-04-30 Thread Bob Peterson
Before this patch, function gfs2_freeze would loop forever if the
file system trying to be frozen is withdrawn. That's because function
gfs2_lock_fs_check_clean tries to enqueue the glock of the journal
and the gfs2_glock returns -EIO because you can't enqueue a journaled
glock after a withdraw.

This patch moves the check for file system withdraw inside the loop
so that the loop can end when withdraw occurs.

Signed-off-by: Bob Peterson 
---
 fs/gfs2/super.c | 10 +-
 1 file changed, 5 insertions(+), 5 deletions(-)

diff --git a/fs/gfs2/super.c b/fs/gfs2/super.c
index 378519beb6c3..4ae23c338034 100644
--- a/fs/gfs2/super.c
+++ b/fs/gfs2/super.c
@@ -1014,12 +1014,12 @@ static int gfs2_freeze(struct super_block *sb)
if (atomic_read(>sd_freeze_state) != SFS_UNFROZEN)
goto out;
 
-   if (gfs2_withdrawn(sdp)) {
-   error = -EINVAL;
-   goto out;
-   }
-
for (;;) {
+   if (gfs2_withdrawn(sdp)) {
+   error = -EINVAL;
+   goto out;
+   }
+
error = gfs2_lock_fs_check_clean(sdp, >sd_freeze_gh);
if (!error)
break;
-- 
2.20.1



[Cluster-devel] [GFS2 PATCH v3 17/19] gfs2: Add verbose option to check_journal_clean

2019-04-30 Thread Bob Peterson
Before this patch, function check_journal_clean would give messages
related to journal recovery. That's fine for mount time, but when a
node withdraws and forces replay that way, we don't want all those
distracting and misleading messages. This patch adds a new parameter
to make those messages optional.

Signed-off-by: Bob Peterson 
---
 fs/gfs2/ops_fstype.c |  2 +-
 fs/gfs2/util.c   | 23 ---
 fs/gfs2/util.h   |  4 +++-
 3 files changed, 20 insertions(+), 9 deletions(-)

diff --git a/fs/gfs2/ops_fstype.c b/fs/gfs2/ops_fstype.c
index c4222c1fa735..d8bca92f34a1 100644
--- a/fs/gfs2/ops_fstype.c
+++ b/fs/gfs2/ops_fstype.c
@@ -690,7 +690,7 @@ static int init_journal(struct gfs2_sbd *sdp, int undo)
struct gfs2_jdesc *jd = gfs2_jdesc_find(sdp, x);
 
if (sdp->sd_args.ar_spectator) {
-   error = check_journal_clean(sdp, jd);
+   error = check_journal_clean(sdp, jd, true);
if (error)
goto fail_jinode_gh;
continue;
diff --git a/fs/gfs2/util.c b/fs/gfs2/util.c
index 5e39859d58d8..2dd3fb8975a7 100644
--- a/fs/gfs2/util.c
+++ b/fs/gfs2/util.c
@@ -48,7 +48,8 @@ void gfs2_assert_i(struct gfs2_sbd *sdp)
  *
  * Returns: 0 if the journal is clean or locked, else an error
  */
-int check_journal_clean(struct gfs2_sbd *sdp, struct gfs2_jdesc *jd)
+int check_journal_clean(struct gfs2_sbd *sdp, struct gfs2_jdesc *jd,
+   bool verbose)
 {
int error;
struct gfs2_holder j_gh;
@@ -59,23 +60,31 @@ int check_journal_clean(struct gfs2_sbd *sdp, struct 
gfs2_jdesc *jd)
error = gfs2_glock_nq_init(ip->i_gl, LM_ST_SHARED, LM_FLAG_NOEXP |
   GL_EXACT | GL_NOCACHE, _gh);
if (error) {
-   fs_err(sdp, "Error locking journal for spectator mount.\n");
+   if (verbose)
+   fs_err(sdp, "Error %d locking journal for spectator "
+  "mount.\n", error);
return -EPERM;
}
error = gfs2_jdesc_check(jd);
if (error) {
-   fs_err(sdp, "Error checking journal for spectator mount.\n");
+   if (verbose)
+   fs_err(sdp, "Error checking journal for spectator "
+  "mount.\n");
goto out_unlock;
}
error = gfs2_find_jhead(jd, );
if (error) {
-   fs_err(sdp, "Error parsing journal for spectator mount.\n");
+   if (verbose)
+   fs_err(sdp, "Error parsing journal for spectator "
+  "mount.\n");
goto out_unlock;
}
if (!(head.lh_flags & GFS2_LOG_HEAD_UNMOUNT)) {
error = -EPERM;
-   fs_err(sdp, "jid=%u: Journal is dirty, so the first mounter "
-  "must not be a spectator.\n", jd->jd_jid);
+   if (verbose)
+   fs_err(sdp, "jid=%u: Journal is dirty, so the first "
+  "mounter must not be a spectator.\n",
+  jd->jd_jid);
}
 
 out_unlock:
@@ -174,7 +183,7 @@ static void signal_our_withdraw(struct gfs2_sbd *sdp)
 
/* Now wait until recovery complete. */
for (tries = 0; tries < 10; tries++) {
-   ret = check_journal_clean(sdp, sdp->sd_jdesc);
+   ret = check_journal_clean(sdp, sdp->sd_jdesc, false);
if (!ret)
break;
msleep(HZ);
diff --git a/fs/gfs2/util.h b/fs/gfs2/util.h
index 49d46865dceb..345b906a6142 100644
--- a/fs/gfs2/util.h
+++ b/fs/gfs2/util.h
@@ -131,7 +131,9 @@ static inline void gfs2_metatype_set(struct buffer_head 
*bh, u16 type,
 
 int gfs2_io_error_i(struct gfs2_sbd *sdp, const char *function,
char *file, unsigned int line);
-int check_journal_clean(struct gfs2_sbd *sdp, struct gfs2_jdesc *jd);
+
+extern int check_journal_clean(struct gfs2_sbd *sdp, struct gfs2_jdesc *jd,
+  bool verbose);
 
 #define gfs2_io_error(sdp) \
 gfs2_io_error_i((sdp), __func__, __FILE__, __LINE__);
-- 
2.20.1



[Cluster-devel] [GFS2 PATCH v3 18/19] gfs2: Check for log write errors before telling dlm to unlock

2019-04-30 Thread Bob Peterson
Before this patch, function do_xmote just assumed all the writes
submitted to the journal were finished and successful, and it
called the go_unlock function to release the dlm lock. But if
they're not, and a revoke failed to make its way to the journal,
a journal replay on another node will cause corruption if we
let the go_inval function continue and tell dlm to release the
glock to another node. This patch adds a couple checks for errors
in do_xmote after the calls to go_sync and go_inval. If an error
is found, we cannot withdraw yet, because the withdraw itself
uses glocks to make the file system read-only. Instead, we flag
the error. Later, asserts should cause another node to replay
the journal before continuing, thus protecting rgrp and dinode
glocks and maintaining the integrity of the metadata. Note that
we only need to do this for journaled glocks. System glocks
should be able to progress even under withdrawn conditions.

Signed-off-by: Bob Peterson 
---
 fs/gfs2/glock.c | 44 +---
 1 file changed, 41 insertions(+), 3 deletions(-)

diff --git a/fs/gfs2/glock.c b/fs/gfs2/glock.c
index 3d1b9bdfd0de..f4129305a815 100644
--- a/fs/gfs2/glock.c
+++ b/fs/gfs2/glock.c
@@ -570,8 +570,47 @@ __acquires(>gl_lockref.lock)
spin_unlock(>gl_lockref.lock);
if (glops->go_sync)
glops->go_sync(gl);
+   /**
+* Check for an error encountered since we called go_sync. If so, we
+* can't withdraw from the glock code because the withdraw code itself
+* uses glocks (see function signal_our_withdraw) to change the mount
+* to read-only. Most importantly, we must not call dlm to unlock the
+* glock until the journal is in a known good state (after journal
+* replay) otherwise other nodes may use the object (rgrp or dinode)
+* and then later, journal replay will corrupt the file system. The
+* best we can do now is wait for the logd daemon to see sd_log_error
+* and withdraw, and in the meantime, requeue the work for later.
+*/
+   if ((glops->go_flags & GLOF_JOURNALED) && sdp->sd_log_error) {
+   gfs2_glock_hold(gl);
+   finish_xmote(gl, LM_OUT_ERROR | LM_ST_UNLOCKED);
+   gfs2_glock_queue_work(gl, GL_GLOCK_DFT_HOLD);
+   goto out;
+   }
if (test_bit(GLF_INVALIDATE_IN_PROGRESS, >gl_flags))
glops->go_inval(gl, target == LM_ST_DEFERRED ? 0 : 
DIO_METADATA);
+   /**
+* If the go_inval left some items on the ail, that's a problem. We
+* need to withdraw, but we can't, as explained above. Warn about the
+* error, dump the glock, then fall through and wait for logd to do
+* the withdraw for us.
+*/
+   if (atomic_read(>gl_ail_count) != 0) {
+   if (!cmpxchg(>sd_log_error, 0, -EIO)) {
+   gfs2_assert_warn(sdp, !atomic_read(>gl_ail_count));
+   gfs2_dump_glock(NULL, gl);
+   }
+   }
+   /**
+* Check for an error encountered while we were doing go_inval, and
+* handle it the same way we did for the go_sync case.
+*/
+   if ((glops->go_flags & GLOF_JOURNALED) && sdp->sd_log_error) {
+   gfs2_glock_hold(gl);
+   finish_xmote(gl, LM_OUT_ERROR | LM_ST_UNLOCKED);
+   gfs2_glock_queue_work(gl, GL_GLOCK_DFT_HOLD);
+   goto out;
+   }
clear_bit(GLF_INVALIDATE_IN_PROGRESS, >gl_flags);
 
gfs2_glock_hold(gl);
@@ -583,8 +622,7 @@ __acquires(>gl_lockref.lock)
test_bit(SDF_SKIP_DLM_UNLOCK, >sd_flags)) {
finish_xmote(gl, target);
gfs2_glock_queue_work(gl, 0);
-   }
-   else if (ret) {
+   } else if (ret) {
fs_err(sdp, "lm_lock ret %d\n", ret);
GLOCK_BUG_ON(gl, !gfs2_withdrawn(sdp));
}
@@ -592,7 +630,7 @@ __acquires(>gl_lockref.lock)
finish_xmote(gl, target);
gfs2_glock_queue_work(gl, 0);
}
-
+out:
spin_lock(>gl_lockref.lock);
 }
 
-- 
2.20.1



[Cluster-devel] [GFS2 PATCH v3 19/19] gfs2: Do log_flush in gfs2_ail_empty_gl even if ail list is empty

2019-04-30 Thread Bob Peterson
Before this patch, if gfs2_ail_empty_gl saw there was nothing on
the ail list, it would return and not flush the log. The problem
is that there could still be a revoke for the rgrp sitting on the
sd_log_le_revoke list that's been recently taken off the ail list.
But that revoke still needs to be written, and the rgrp_go_inval
still needs to call log_flush_wait to ensure the revokes are all
properly written to the journal before we relinquish control of
the glock to another node. If we give the glock to another node
before we have this knowledge, the node might crash and its journal
replayed, in which case the missing revoke would allow the journal
replay to replay the rgrp over top of the rgrp we already gave to
another node, thus overwriting its changes and corrupting the
file system.

This patch makes gfs2_ail_empty_gl still call gfs2_log_flush rather
than returning.

Signed-off-by: Bob Peterson 
---
 fs/gfs2/glops.c | 26 +-
 fs/gfs2/log.c   |  2 +-
 fs/gfs2/log.h   |  1 +
 3 files changed, 27 insertions(+), 2 deletions(-)

diff --git a/fs/gfs2/glops.c b/fs/gfs2/glops.c
index ff50013c6aa9..1cd2a3c69d63 100644
--- a/fs/gfs2/glops.c
+++ b/fs/gfs2/glops.c
@@ -93,8 +93,31 @@ static void gfs2_ail_empty_gl(struct gfs2_glock *gl)
INIT_LIST_HEAD(_databuf);
tr.tr_revokes = atomic_read(>gl_ail_count);
 
-   if (!tr.tr_revokes)
+   if (!tr.tr_revokes) {
+   /**
+* We have nothing on the ail, but there could be revokes on
+* the sdp revoke queue, in which case, we still want to flush
+* the log and wait for it to finish.
+*
+* If the sdp revoke list is empty too, we might still have an
+* io outstanding for writing revokes, so we should wait for
+* it before returning.
+*
+* If none of these conditions are true, our revokes are all
+* flushed and we can return.
+*/
+   gfs2_log_lock(sdp);
+   if (test_bit(GLF_REVOKES, >gl_flags)) {
+   gfs2_log_unlock(sdp);
+   goto flush;
+   } else if (atomic_read(>sd_log_in_flight)) {
+   gfs2_log_unlock(sdp);
+   log_flush_wait(sdp);
+   } else {
+   gfs2_log_unlock(sdp);
+   }
return;
+   }
 
/* A shortened, inline version of gfs2_trans_begin()
  * tr->alloced is not set since the transaction structure is
@@ -109,6 +132,7 @@ static void gfs2_ail_empty_gl(struct gfs2_glock *gl)
__gfs2_ail_flush(gl, 0, tr.tr_revokes);
 
gfs2_trans_end(sdp);
+flush:
gfs2_log_flush(sdp, NULL, GFS2_LOG_HEAD_FLUSH_NORMAL |
   GFS2_LFC_AIL_EMPTY_GL);
 }
diff --git a/fs/gfs2/log.c b/fs/gfs2/log.c
index 732b585f33be..a6c83cdf4e1b 100644
--- a/fs/gfs2/log.c
+++ b/fs/gfs2/log.c
@@ -522,7 +522,7 @@ static void log_pull_tail(struct gfs2_sbd *sdp, unsigned 
int new_tail)
 }
 
 
-static void log_flush_wait(struct gfs2_sbd *sdp)
+void log_flush_wait(struct gfs2_sbd *sdp)
 {
DEFINE_WAIT(wait);
 
diff --git a/fs/gfs2/log.h b/fs/gfs2/log.h
index 7a34a3234266..cd1581bff293 100644
--- a/fs/gfs2/log.h
+++ b/fs/gfs2/log.h
@@ -76,6 +76,7 @@ extern void gfs2_log_flush(struct gfs2_sbd *sdp, struct 
gfs2_glock *gl,
   u32 type);
 extern void gfs2_log_commit(struct gfs2_sbd *sdp, struct gfs2_trans *trans);
 extern void gfs2_ail1_flush(struct gfs2_sbd *sdp, struct writeback_control 
*wbc);
+extern void log_flush_wait(struct gfs2_sbd *sdp);
 
 extern void gfs2_log_shutdown(struct gfs2_sbd *sdp);
 extern int gfs2_logd(void *data);
-- 
2.20.1



[Cluster-devel] [GFS2 PATCH v3 11/19] gfs2: Allow some glocks to be used during withdraw

2019-04-30 Thread Bob Peterson
Before this patch, when a file system was withdrawn, all further
attempts to enqueue or promote glocks were rejected and returned
-EIO. This is only important for media-backed glocks like inode
and rgrp glocks. All other glocks may be safely used because there
is no potential for metadata corruption. This patch allows some
glocks to be used even after the file system is withdrawn. This
is accomplished with a new glops flag, GLOF_JOURNALED, which tells
us which inodes cannot be safely manipulated after withdraw. This
facilitates future patches that enhance fs withdraw.

Signed-off-by: Bob Peterson 
---
 fs/gfs2/glock.c  |  7 +--
 fs/gfs2/glops.c  | 16 +---
 fs/gfs2/glops.h  |  3 ++-
 fs/gfs2/incore.h |  8 
 fs/gfs2/inode.c  | 14 ++
 fs/gfs2/ops_fstype.c | 14 +++---
 fs/gfs2/sys.c| 12 ++--
 7 files changed, 59 insertions(+), 15 deletions(-)

diff --git a/fs/gfs2/glock.c b/fs/gfs2/glock.c
index fe9f9a22426e..eeb84d8d21f3 100644
--- a/fs/gfs2/glock.c
+++ b/fs/gfs2/glock.c
@@ -548,7 +548,7 @@ __acquires(>gl_lockref.lock)
int ret;
 
if (unlikely(gfs2_withdrawn(sdp)) &&
-   target != LM_ST_UNLOCKED)
+   (glops->go_flags & GLOF_JOURNALED) && target != LM_ST_UNLOCKED)
return;
lck_flags &= (LM_FLAG_TRY | LM_FLAG_TRY_1CB | LM_FLAG_NOEXP |
  LM_FLAG_PRIORITY);
@@ -1096,7 +1096,8 @@ int gfs2_glock_nq(struct gfs2_holder *gh)
struct gfs2_sbd *sdp = gl->gl_name.ln_sbd;
int error = 0;
 
-   if (unlikely(gfs2_withdrawn(sdp)))
+   if (unlikely(gfs2_withdrawn(sdp)) &&
+   (gl->gl_ops->go_flags & GLOF_JOURNALED))
return -EIO;
 
if (test_bit(GLF_LRU, >gl_flags))
@@ -1760,6 +1761,8 @@ static const char *gflags2str(char *buf, const struct 
gfs2_glock *gl)
*p++ = 'o';
if (test_bit(GLF_BLOCKING, gflags))
*p++ = 'b';
+   if (gl->gl_ops->go_flags & GLOF_JOURNALED)
+   *p++ = 'j';
*p = 0;
return buf;
 }
diff --git a/fs/gfs2/glops.c b/fs/gfs2/glops.c
index b9f36a85a4d4..d2b44bdca17e 100644
--- a/fs/gfs2/glops.c
+++ b/fs/gfs2/glops.c
@@ -584,7 +584,7 @@ const struct gfs2_glock_operations gfs2_meta_glops = {
.go_type = LM_TYPE_META,
 };
 
-const struct gfs2_glock_operations gfs2_inode_glops = {
+const struct gfs2_glock_operations gfs2_sys_inode_glops = {
.go_sync = inode_go_sync,
.go_inval = inode_go_inval,
.go_demote_ok = inode_go_demote_ok,
@@ -594,6 +594,16 @@ const struct gfs2_glock_operations gfs2_inode_glops = {
.go_flags = GLOF_ASPACE | GLOF_LRU,
 };
 
+const struct gfs2_glock_operations gfs2_user_inode_glops = {
+   .go_sync = inode_go_sync,
+   .go_inval = inode_go_inval,
+   .go_demote_ok = inode_go_demote_ok,
+   .go_lock = inode_go_lock,
+   .go_dump = inode_go_dump,
+   .go_type = LM_TYPE_INODE,
+   .go_flags = GLOF_ASPACE | GLOF_LRU | GLOF_JOURNALED,
+};
+
 const struct gfs2_glock_operations gfs2_rgrp_glops = {
.go_sync = rgrp_go_sync,
.go_inval = rgrp_go_inval,
@@ -601,7 +611,7 @@ const struct gfs2_glock_operations gfs2_rgrp_glops = {
.go_unlock = gfs2_rgrp_go_unlock,
.go_dump = gfs2_rgrp_dump,
.go_type = LM_TYPE_RGRP,
-   .go_flags = GLOF_LVB,
+   .go_flags = GLOF_LVB | GLOF_JOURNALED,
 };
 
 const struct gfs2_glock_operations gfs2_freeze_glops = {
@@ -637,7 +647,7 @@ const struct gfs2_glock_operations gfs2_journal_glops = {
 
 const struct gfs2_glock_operations *gfs2_glops_list[] = {
[LM_TYPE_META] = _meta_glops,
-   [LM_TYPE_INODE] = _inode_glops,
+   [LM_TYPE_INODE] = _user_inode_glops,
[LM_TYPE_RGRP] = _rgrp_glops,
[LM_TYPE_IOPEN] = _iopen_glops,
[LM_TYPE_FLOCK] = _flock_glops,
diff --git a/fs/gfs2/glops.h b/fs/gfs2/glops.h
index 8ed1857c1a8d..63130a5959e1 100644
--- a/fs/gfs2/glops.h
+++ b/fs/gfs2/glops.h
@@ -15,7 +15,8 @@
 extern struct workqueue_struct *gfs2_freeze_wq;
 
 extern const struct gfs2_glock_operations gfs2_meta_glops;
-extern const struct gfs2_glock_operations gfs2_inode_glops;
+extern const struct gfs2_glock_operations gfs2_user_inode_glops;
+extern const struct gfs2_glock_operations gfs2_sys_inode_glops;
 extern const struct gfs2_glock_operations gfs2_rgrp_glops;
 extern const struct gfs2_glock_operations gfs2_freeze_glops;
 extern const struct gfs2_glock_operations gfs2_iopen_glops;
diff --git a/fs/gfs2/incore.h b/fs/gfs2/incore.h
index e16ab4c98072..b16b82e5b85a 100644
--- a/fs/gfs2/incore.h
+++ b/fs/gfs2/incore.h
@@ -250,6 +250,14 @@ struct gfs2_glock_operations {
 #define GLOF_ASPACE 1
 #define GLOF_LVB2
 #define GLOF_LRU4
+/**
+ * The GLOF_JOURNALED flag marks objects that are journaled, which means we
+ * need to treat them specially when a file system withdraw occurs. Inodes
+ * like the journal inode itself are okay to manipulate after withdraw, but
+ * 

[Cluster-devel] [GFS2 PATCH v3 05/19] gfs2: Introduce concept of a pending withdraw

2019-04-30 Thread Bob Peterson
File system withdraws can be delayed when inconsistencies are
discovered when we cannot withdraw immediately, for example, when
critical spin_locks are held. But delaying the withdraw can cause
gfs2 to ignore the error and keep running for a short period of time.
For example, an rgrp glock may be dequeued and demoted while there
are still buffers that haven't been properly revoked, due to io
errors writing to the journal.

This patch introduces a new concept of a pending withdraw, which
means an inconsistency has been discovered and we need to withdraw
at the earliest possible opportunity. In these cases, we aren't
quite withdrawn yet, but we still need to not dequeue glocks and
other critical things. If we dequeue the glocks and the withdraw
results in our journal being replayed, the replay could overwrite
data that's been modified by a different node that acquired the
glock in the meantime.

Signed-off-by: Bob Peterson 
---
 fs/gfs2/aops.c   |  4 ++--
 fs/gfs2/file.c   |  2 +-
 fs/gfs2/glock.c  |  7 +++
 fs/gfs2/glops.c  |  2 +-
 fs/gfs2/incore.h |  1 +
 fs/gfs2/log.c| 20 
 fs/gfs2/meta_io.c|  6 +++---
 fs/gfs2/ops_fstype.c |  3 +--
 fs/gfs2/quota.c  |  2 +-
 fs/gfs2/super.c  |  6 +++---
 fs/gfs2/sys.c|  2 +-
 fs/gfs2/util.h   | 12 
 12 files changed, 37 insertions(+), 30 deletions(-)

diff --git a/fs/gfs2/aops.c b/fs/gfs2/aops.c
index 05dd78f4b2b3..2bacce032a34 100644
--- a/fs/gfs2/aops.c
+++ b/fs/gfs2/aops.c
@@ -521,7 +521,7 @@ static int __gfs2_readpage(void *file, struct page *page)
error = mpage_readpage(page, gfs2_block_map);
}
 
-   if (unlikely(test_bit(SDF_SHUTDOWN, >sd_flags)))
+   if (unlikely(gfs2_withdrawn(sdp)))
return -EIO;
 
return error;
@@ -638,7 +638,7 @@ static int gfs2_readpages(struct file *file, struct 
address_space *mapping,
gfs2_glock_dq();
 out_uninit:
gfs2_holder_uninit();
-   if (unlikely(test_bit(SDF_SHUTDOWN, >sd_flags)))
+   if (unlikely(gfs2_withdrawn(sdp)))
ret = -EIO;
return ret;
 }
diff --git a/fs/gfs2/file.c b/fs/gfs2/file.c
index 58a768e59712..bba39c73a43c 100644
--- a/fs/gfs2/file.c
+++ b/fs/gfs2/file.c
@@ -1169,7 +1169,7 @@ static int gfs2_lock(struct file *file, int cmd, struct 
file_lock *fl)
cmd = F_SETLK;
fl->fl_type = F_UNLCK;
}
-   if (unlikely(test_bit(SDF_SHUTDOWN, >sd_flags))) {
+   if (unlikely(gfs2_withdrawn(sdp))) {
if (fl->fl_type == F_UNLCK)
locks_lock_file_wait(file, fl);
return -EIO;
diff --git a/fs/gfs2/glock.c b/fs/gfs2/glock.c
index 15c605cfcfc8..fe9f9a22426e 100644
--- a/fs/gfs2/glock.c
+++ b/fs/gfs2/glock.c
@@ -547,7 +547,7 @@ __acquires(>gl_lockref.lock)
unsigned int lck_flags = (unsigned int)(gh ? gh->gh_flags : 0);
int ret;
 
-   if (unlikely(test_bit(SDF_SHUTDOWN, >sd_flags)) &&
+   if (unlikely(gfs2_withdrawn(sdp)) &&
target != LM_ST_UNLOCKED)
return;
lck_flags &= (LM_FLAG_TRY | LM_FLAG_TRY_1CB | LM_FLAG_NOEXP |
@@ -584,8 +584,7 @@ __acquires(>gl_lockref.lock)
}
else if (ret) {
fs_err(sdp, "lm_lock ret %d\n", ret);
-   GLOCK_BUG_ON(gl, !test_bit(SDF_SHUTDOWN,
-  >sd_flags));
+   GLOCK_BUG_ON(gl, !gfs2_withdrawn(sdp));
}
} else { /* lock_nolock */
finish_xmote(gl, target);
@@ -1097,7 +1096,7 @@ int gfs2_glock_nq(struct gfs2_holder *gh)
struct gfs2_sbd *sdp = gl->gl_name.ln_sbd;
int error = 0;
 
-   if (unlikely(test_bit(SDF_SHUTDOWN, >sd_flags)))
+   if (unlikely(gfs2_withdrawn(sdp)))
return -EIO;
 
if (test_bit(GLF_LRU, >gl_flags))
diff --git a/fs/gfs2/glops.c b/fs/gfs2/glops.c
index 78510ab91835..b9f36a85a4d4 100644
--- a/fs/gfs2/glops.c
+++ b/fs/gfs2/glops.c
@@ -538,7 +538,7 @@ static int freeze_go_xmote_bh(struct gfs2_glock *gl, struct 
gfs2_holder *gh)
gfs2_consist(sdp);
 
/*  Initialize some head of the log stuff  */
-   if (!test_bit(SDF_SHUTDOWN, >sd_flags)) {
+   if (!gfs2_withdrawn(sdp)) {
sdp->sd_log_sequence = head.lh_sequence + 1;
gfs2_log_pointers_init(sdp, head.lh_blkno);
}
diff --git a/fs/gfs2/incore.h b/fs/gfs2/incore.h
index c54800385298..003d9da937b4 100644
--- a/fs/gfs2/incore.h
+++ b/fs/gfs2/incore.h
@@ -620,6 +620,7 @@ enum {
SDF_SKIP_DLM_UNLOCK = 8,
SDF_FORCE_AIL_FLUSH = 9,
SDF_AIL1_IO_ERROR   = 10,
+   SDF_WITHDRAW= 11, /* Will withdraw eventually */
 };
 
 enum gfs2_freeze_state {
diff --git a/fs/gfs2/log.c b/fs/gfs2/log.c
index 76da6f046379..944aaf3d1816 100644

[Cluster-devel] [GFS2 PATCH v3 06/19] gfs2: log error reform

2019-04-30 Thread Bob Peterson
Before this patch, gfs2 kept track of journal io errors in two
places sd_log_error and the SDF_AIL1_IO_ERROR flag in sd_flags.
This patch consolidates the two into sd_log_error so that it
reflects the first error encountered writing to the journal.
In future patches, we will take advantage of this by checking
this value rather than having to check both when reacting to
io errors.

In addition, this fixes a tight loop in unmount: If buffers
get on the ail1 list and an io error occurs elsewhere, the
ail1 list would never be cleared because they were always busy.
So unmount would hang, waiting for the ail1 list to empty.

Signed-off-by: Bob Peterson 
---
 fs/gfs2/incore.h |  5 ++---
 fs/gfs2/log.c| 20 +++-
 fs/gfs2/quota.c  |  2 +-
 3 files changed, 18 insertions(+), 9 deletions(-)

diff --git a/fs/gfs2/incore.h b/fs/gfs2/incore.h
index 003d9da937b4..e16ab4c98072 100644
--- a/fs/gfs2/incore.h
+++ b/fs/gfs2/incore.h
@@ -619,8 +619,7 @@ enum {
SDF_RORECOVERY  = 7, /* read only recovery */
SDF_SKIP_DLM_UNLOCK = 8,
SDF_FORCE_AIL_FLUSH = 9,
-   SDF_AIL1_IO_ERROR   = 10,
-   SDF_WITHDRAW= 11, /* Will withdraw eventually */
+   SDF_WITHDRAW= 10, /* Will withdraw eventually */
 };
 
 enum gfs2_freeze_state {
@@ -829,7 +828,7 @@ struct gfs2_sbd {
atomic_t sd_log_in_flight;
struct bio *sd_log_bio;
wait_queue_head_t sd_log_flush_wait;
-   int sd_log_error;
+   int sd_log_error; /* First log error */
 
atomic_t sd_reserving_log;
wait_queue_head_t sd_reserving_log_wait;
diff --git a/fs/gfs2/log.c b/fs/gfs2/log.c
index 944aaf3d1816..33ef2cb570e2 100644
--- a/fs/gfs2/log.c
+++ b/fs/gfs2/log.c
@@ -108,8 +108,7 @@ __acquires(>sd_ail_lock)
 
if (!buffer_busy(bh)) {
if (!buffer_uptodate(bh) &&
-   !test_and_set_bit(SDF_AIL1_IO_ERROR,
- >sd_flags)) {
+   !cmpxchg(>sd_log_error, 0, -EIO)) {
gfs2_io_error_bh(sdp, bh);
set_bit(SDF_WITHDRAW, >sd_flags);
}
@@ -203,10 +202,21 @@ static void gfs2_ail1_empty_one(struct gfs2_sbd *sdp, 
struct gfs2_trans *tr)
 bd_ail_st_list) {
bh = bd->bd_bh;
gfs2_assert(sdp, bd->bd_tr == tr);
-   if (buffer_busy(bh))
+   /**
+* If another process flagged an io error, e.g. writing to the
+* journal, error all other bhs and move them off the ail1 to
+* prevent a tight loop when unmount tries to flush ail1,
+* regardless of whether they're still busy. If no outside
+* errors were found and the buffer is busy, move to the next.
+* If the ail buffer is not busy and caught an error, flag it
+* for others.
+*/
+   if (sdp->sd_log_error) {
+   gfs2_io_error_bh(sdp, bh);
+   } else if (buffer_busy(bh)) {
continue;
-   if (!buffer_uptodate(bh) &&
-   !test_and_set_bit(SDF_AIL1_IO_ERROR, >sd_flags)) {
+   } else if (!buffer_uptodate(bh) &&
+  !cmpxchg(>sd_log_error, 0, -EIO)) {
gfs2_io_error_bh(sdp, bh);
set_bit(SDF_WITHDRAW, >sd_flags);
}
diff --git a/fs/gfs2/quota.c b/fs/gfs2/quota.c
index a8dfc86fd682..8871fca9102f 100644
--- a/fs/gfs2/quota.c
+++ b/fs/gfs2/quota.c
@@ -1480,7 +1480,7 @@ static void quotad_error(struct gfs2_sbd *sdp, const char 
*msg, int error)
return;
if (!gfs2_withdrawn(sdp)) {
fs_err(sdp, "gfs2_quotad: %s error %d\n", msg, error);
-   sdp->sd_log_error = error;
+   cmpxchg(>sd_log_error, 0, error);
wake_up(>sd_logd_waitq);
}
 }
-- 
2.20.1



[Cluster-devel] [GFS2 PATCH v3 16/19] gfs2: simply gfs2_freeze by removing case

2019-04-30 Thread Bob Peterson
Function gfs2_freeze had a case statement that simply checked the
error code, but the break statements just made the logic hard to
read. This patch simplifies the logic in favor of a simple if.

Signed-off-by: Bob Peterson 
---
 fs/gfs2/super.c | 10 ++
 1 file changed, 2 insertions(+), 8 deletions(-)

diff --git a/fs/gfs2/super.c b/fs/gfs2/super.c
index 1582cb11866e..7ffeb4379222 100644
--- a/fs/gfs2/super.c
+++ b/fs/gfs2/super.c
@@ -1043,20 +1043,14 @@ static int gfs2_freeze(struct super_block *sb)
if (!error)
break;
 
-   switch (error) {
-   case -EBUSY:
+   if (error == -EBUSY)
fs_err(sdp, "waiting for recovery before freeze\n");
-   break;
-
-   default:
+   else
fs_err(sdp, "error freezing FS: %d\n", error);
-   break;
-   }
 
fs_err(sdp, "retrying...\n");
msleep(1000);
}
-   error = 0;
 out:
mutex_unlock(>sd_freeze_mutex);
return error;
-- 
2.20.1



[Cluster-devel] [GFS2 PATCH v3 09/19] gfs2: Ignore recovery attempts if gfs2 has io error or is withdrawn

2019-04-30 Thread Bob Peterson
This patch addresses various problems with gfs2/dlm recovery.

For example, suppose a node with a bunch of gfs2 mounts suddenly
reboots due to kernel panic, and dlm determines it should perform
recovery. DLM does so from a pseudo-state machine calling various
callbacks into lock_dlm to perform a sequence of steps. It uses
generation numbers and recover bits in dlm "control" lock lvbs.

Now suppose another node tries to recover the failed node's
journal, but in so doing, encounters an IO error or withdraws
due to unforeseen circumstances, such as an hba driver failure.
In these cases, the recovery would eventually bail out, but it
would still update its generation number in the lvb. The other
nodes would all see the newer generation number and think they
don't need to do recovery because the generation number is newer
than the last one they saw, and therefore someone else has already
taken care of it.

If the file system has an io error or is withdrawn, it cannot
safely replay any journals (its own or others) but someone else
still needs to do it. Therefore we don't want it messing with
the journal recovery generation numbers: the local generation
numbers eventually get put into the lvb generation numbers to be
seen by all nodes.

This patch adds checks to many of the callbacks used by dlm
in its recovery state machine so that the functions are ignored
and skipped if an io error has occurred or if the file system
was withdraw.

Signed-off-by: Bob Peterson 
---
 fs/gfs2/lock_dlm.c | 18 ++
 fs/gfs2/util.c | 15 +++
 2 files changed, 25 insertions(+), 8 deletions(-)

diff --git a/fs/gfs2/lock_dlm.c b/fs/gfs2/lock_dlm.c
index 31df26ed7854..9329f86ffcbe 100644
--- a/fs/gfs2/lock_dlm.c
+++ b/fs/gfs2/lock_dlm.c
@@ -1081,6 +1081,10 @@ static void gdlm_recover_prep(void *arg)
struct gfs2_sbd *sdp = arg;
struct lm_lockstruct *ls = >sd_lockstruct;
 
+   if (gfs2_withdrawn(sdp)) {
+   fs_err(sdp, "recover_prep ignored due to withdraw.\n");
+   return;
+   }
spin_lock(>ls_recover_spin);
ls->ls_recover_block = ls->ls_recover_start;
set_bit(DFL_DLM_RECOVERY, >ls_recover_flags);
@@ -1103,6 +1107,11 @@ static void gdlm_recover_slot(void *arg, struct dlm_slot 
*slot)
struct lm_lockstruct *ls = >sd_lockstruct;
int jid = slot->slot - 1;
 
+   if (gfs2_withdrawn(sdp)) {
+   fs_err(sdp, "recover_slot jid %d ignored due to withdraw.\n",
+  jid);
+   return;
+   }
spin_lock(>ls_recover_spin);
if (ls->ls_recover_size < jid + 1) {
fs_err(sdp, "recover_slot jid %d gen %u short size %d\n",
@@ -1127,6 +1136,10 @@ static void gdlm_recover_done(void *arg, struct dlm_slot 
*slots, int num_slots,
struct gfs2_sbd *sdp = arg;
struct lm_lockstruct *ls = >sd_lockstruct;
 
+   if (gfs2_withdrawn(sdp)) {
+   fs_err(sdp, "recover_done ignored due to withdraw.\n");
+   return;
+   }
/* ensure the ls jid arrays are large enough */
set_recover_size(sdp, slots, num_slots);
 
@@ -1154,6 +1167,11 @@ static void gdlm_recovery_result(struct gfs2_sbd *sdp, 
unsigned int jid,
 {
struct lm_lockstruct *ls = >sd_lockstruct;
 
+   if (gfs2_withdrawn(sdp)) {
+   fs_err(sdp, "recovery_result jid %d ignored due to withdraw.\n",
+  jid);
+   return;
+   }
if (test_bit(DFL_NO_DLM_OPS, >ls_recover_flags))
return;
 
diff --git a/fs/gfs2/util.c b/fs/gfs2/util.c
index 0a814ccac41d..7eaea6dfe1cf 100644
--- a/fs/gfs2/util.c
+++ b/fs/gfs2/util.c
@@ -259,14 +259,13 @@ void gfs2_io_error_bh_i(struct gfs2_sbd *sdp, struct 
buffer_head *bh,
const char *function, char *file, unsigned int line,
bool withdraw)
 {
-   if (!test_bit(SDF_SHUTDOWN, >sd_flags))
-   fs_err(sdp,
-  "fatal: I/O error\n"
-  "  block = %llu\n"
-  "  function = %s, file = %s, line = %u\n",
-  (unsigned long long)bh->b_blocknr,
-  function, file, line);
+   if (gfs2_withdrawn(sdp))
+   return;
+
+   fs_err(sdp, "fatal: I/O error\n"
+  "  block = %llu\n"
+  "  function = %s, file = %s, line = %u\n",
+  (unsigned long long)bh->b_blocknr, function, file, line);
if (withdraw)
gfs2_lm_withdraw(sdp, NULL);
 }
-
-- 
2.20.1



[Cluster-devel] [GFS2 PATCH v3 15/19] gfs2: Force withdraw to replay journals and wait for it to finish

2019-04-30 Thread Bob Peterson
When a node withdraws from a file system, it often leaves its journal
in an incomplete state. This is especially true when the withdraw is
caused by io errors writing to the journal. Before this patch, a
withdraw would try to write a "shutdown" record to the journal, tell
dlm it's done with the file system, and none of the other nodes
know about the problem. Later, when the problem is fixed and the
withdrawn node is rebooted, it would then discover that its own
journal was incomplete, and replay it. However, replaying it at this
point is almost guaranteed to introduce corruption because the other
nodes are likely to have used affected resource groups that appeared
in the journal since the time of the withdraw. Replaying the journal
later will overwrite any changes made, and not through any fault of
dlm, which was instructed during the withdraw to release those
resources.

This patch makes file system withdraws seen by the entire cluster.
Withdrawing nodes dequeue their journal glock to allow recovery.

The remaining nodes check all the journals to see if they are
clean or in need of replay. They try to replay dirty journals, but
only the journals of withdrawn nodes will be "not busy" and
therefore available for replay.

Until the journal replay is complete, no i/o related glocks may be
given out, to ensure that the replay does not cause the
aforementioned corruption: We cannot allow any journal replay to
overwrite blocks associated with a glock once it is held. The
glocks not affected by a withdraw are permitted to be passed
around as normal during a withdraw. A new glops flag, called
GLOF_OK_AT_WITHDRAW, indicates glocks that may be passed around
freely while a withdraw is taking place.

One such glock is the "live" glock which is now used to signal when
a withdraw occurs. When a withdraw occurs, the node signals its
withdraw by dequeueing the "live" glock and trying to enqueue it
in EX mode, thus forcing the other nodes to all see a demote
request, by way of a "1CB" (one callback) try lock. The "live"
glock is not granted in EX; the callback is only just used to
indicate a withdraw has occurred.

Note that all nodes in the cluster must wait for the recovering
node to finish replaying the withdrawing node's journal before
continuing. To this end, it checks that the journals are clean
multiple times in a retry loop.

Signed-off-by: Bob Peterson 
---
 fs/gfs2/glock.c  |  32 ++--
 fs/gfs2/glock.h  |   1 +
 fs/gfs2/glops.c  |  56 +
 fs/gfs2/incore.h |   5 ++
 fs/gfs2/lock_dlm.c   |  32 
 fs/gfs2/log.c|   1 +
 fs/gfs2/meta_io.c|   2 +-
 fs/gfs2/ops_fstype.c |   4 +-
 fs/gfs2/quota.c  |   4 ++
 fs/gfs2/super.c  |  47 --
 fs/gfs2/super.h  |   1 +
 fs/gfs2/util.c   | 116 ++-
 12 files changed, 281 insertions(+), 20 deletions(-)

diff --git a/fs/gfs2/glock.c b/fs/gfs2/glock.c
index eeb84d8d21f3..3d1b9bdfd0de 100644
--- a/fs/gfs2/glock.c
+++ b/fs/gfs2/glock.c
@@ -247,7 +247,7 @@ static void __gfs2_glock_put(struct gfs2_glock *gl)
gfs2_glock_remove_from_lru(gl);
spin_unlock(>gl_lockref.lock);
GLOCK_BUG_ON(gl, !list_empty(>gl_holders));
-   GLOCK_BUG_ON(gl, mapping && mapping->nrpages);
+   GLOCK_BUG_ON(gl, mapping && mapping->nrpages && !gfs2_withdrawn(sdp));
trace_gfs2_glock_put(gl);
sdp->sd_lockstruct.ls_ops->lm_put_lock(gl);
 }
@@ -548,7 +548,9 @@ __acquires(>gl_lockref.lock)
int ret;
 
if (unlikely(gfs2_withdrawn(sdp)) &&
-   (glops->go_flags & GLOF_JOURNALED) && target != LM_ST_UNLOCKED)
+   (glops->go_flags & GLOF_JOURNALED) &&
+   (gh && !(LM_FLAG_NOEXP & gh->gh_flags)) &&
+   target != LM_ST_UNLOCKED)
return;
lck_flags &= (LM_FLAG_TRY | LM_FLAG_TRY_1CB | LM_FLAG_NOEXP |
  LM_FLAG_PRIORITY);
@@ -1096,7 +1098,7 @@ int gfs2_glock_nq(struct gfs2_holder *gh)
struct gfs2_sbd *sdp = gl->gl_name.ln_sbd;
int error = 0;
 
-   if (unlikely(gfs2_withdrawn(sdp)) &&
+   if (unlikely(gfs2_withdrawn(sdp)) && !(LM_FLAG_NOEXP & gh->gh_flags) &&
(gl->gl_ops->go_flags & GLOF_JOURNALED))
return -EIO;
 
@@ -1141,11 +1143,30 @@ int gfs2_glock_poll(struct gfs2_holder *gh)
 void gfs2_glock_dq(struct gfs2_holder *gh)
 {
struct gfs2_glock *gl = gh->gh_gl;
+   struct gfs2_sbd *sdp = gl->gl_name.ln_sbd;
const struct gfs2_glock_operations *glops = gl->gl_ops;
unsigned delay = 0;
int fast_path = 0;
 
spin_lock(>gl_lockref.lock);
+   /**
+* If we're in the process of file system withdraw, we cannot just
+* dequeue any glocks until our journal is recovered, lest we
+* introduce file system corruption. We need two exceptions to this
+* rule: We need to allow unlocking of nondisk glocks and the glock
+* for our own journal that 

[Cluster-devel] [GFS2 PATCH v3 07/19] gfs2: Only complain the first time an io error occurs in quota or log

2019-04-30 Thread Bob Peterson
Before this patch, all io errors received by the quota daemon or the
logd daemon would cause a complaint message to be issued, such as:

   gfs2: fsid=dm-13.0: Error 10 writing to journal, jid=0

This patch changes it so that the error message is only issued the
first time the error is encountered.

Also, before this patch function gfs2_end_log_write did not set the
sd_log_error value, so log errors would not cause the file system to
be withdrawn. This patch sets the error code so the file system is
properly withdrawn if an io error is encountered writing to the journal.

WARNING: This change in function breaks check xfstests generic/441
and causes it to fail: io errors writing to the log should cause a
file system to be withdrawn, and no further operations are tolerated.

Signed-off-by: Bob Peterson 
---
 fs/gfs2/lops.c  | 5 +++--
 fs/gfs2/quota.c | 4 ++--
 2 files changed, 5 insertions(+), 4 deletions(-)

diff --git a/fs/gfs2/lops.c b/fs/gfs2/lops.c
index 2e8c6d02e112..065867b5f7c8 100644
--- a/fs/gfs2/lops.c
+++ b/fs/gfs2/lops.c
@@ -211,8 +211,9 @@ static void gfs2_end_log_write(struct bio *bio)
struct bvec_iter_all iter_all;
 
if (bio->bi_status) {
-   fs_err(sdp, "Error %d writing to journal, jid=%u\n",
-  bio->bi_status, sdp->sd_jdesc->jd_jid);
+   if (!cmpxchg(>sd_log_error, 0, bio->bi_status))
+   fs_err(sdp, "Error %d writing to journal, jid=%u\n",
+  bio->bi_status, sdp->sd_jdesc->jd_jid);
wake_up(>sd_logd_waitq);
}
 
diff --git a/fs/gfs2/quota.c b/fs/gfs2/quota.c
index 8871fca9102f..acb6cbdd1431 100644
--- a/fs/gfs2/quota.c
+++ b/fs/gfs2/quota.c
@@ -1479,8 +1479,8 @@ static void quotad_error(struct gfs2_sbd *sdp, const char 
*msg, int error)
if (error == 0 || error == -EROFS)
return;
if (!gfs2_withdrawn(sdp)) {
-   fs_err(sdp, "gfs2_quotad: %s error %d\n", msg, error);
-   cmpxchg(>sd_log_error, 0, error);
+   if (!cmpxchg(>sd_log_error, 0, error))
+   fs_err(sdp, "gfs2_quotad: %s error %d\n", msg, error);
wake_up(>sd_logd_waitq);
}
 }
-- 
2.20.1



[Cluster-devel] [PATCH v2] gfs2: fix race between gfs2_freeze_func and unmount

2019-04-30 Thread Abhi Das
As part of the freeze operation, gfs2_freeze_func() is left blocking
on a request to hold the sd_freeze_gl in SH. This glock is held in EX
by the gfs2_freeze() code.

A subsequent call to gfs2_unfreeze() releases the EXclusively held
sd_freeze_gl, which allows gfs2_freeze_func() to acquire it in SH and
resume its operation.

gfs2_unfreeze(), however, doesn't wait for gfs2_freeze_func() to complete.
If a umount is issued right after unfreeze, it could result in an
inconsistent filesystem because some journal data (statfs update) isn't
written out.

Refer to commit 24972557b12c for a more detailed explanation of how
freeze/unfreeze work.

This patch causes gfs2_unfreeze() to wait for gfs2_freeze_func() to
complete before returning to the user.

Signed-off-by: Abhi Das 
---
 fs/gfs2/incore.h | 1 +
 fs/gfs2/super.c  | 8 +---
 2 files changed, 6 insertions(+), 3 deletions(-)

diff --git a/fs/gfs2/incore.h b/fs/gfs2/incore.h
index 78c8e761b321..b15755068593 100644
--- a/fs/gfs2/incore.h
+++ b/fs/gfs2/incore.h
@@ -621,6 +621,7 @@ enum {
SDF_SKIP_DLM_UNLOCK = 8,
SDF_FORCE_AIL_FLUSH = 9,
SDF_AIL1_IO_ERROR   = 10,
+   SDF_FS_FROZEN   = 11,
 };
 
 enum gfs2_freeze_state {
diff --git a/fs/gfs2/super.c b/fs/gfs2/super.c
index a6a325b2a78b..ceec631efa49 100644
--- a/fs/gfs2/super.c
+++ b/fs/gfs2/super.c
@@ -973,8 +973,7 @@ void gfs2_freeze_func(struct work_struct *work)
if (error) {
printk(KERN_INFO "GFS2: couldn't get freeze lock : %d\n", 
error);
gfs2_assert_withdraw(sdp, 0);
-   }
-   else {
+   } else {
atomic_set(>sd_freeze_state, SFS_UNFROZEN);
error = thaw_super(sb);
if (error) {
@@ -987,6 +986,8 @@ void gfs2_freeze_func(struct work_struct *work)
gfs2_glock_dq_uninit(_gh);
}
deactivate_super(sb);
+   clear_bit_unlock(SDF_FS_FROZEN, >sd_flags);
+   wake_up_bit(>sd_flags, SDF_FS_FROZEN);
return;
 }
 
@@ -1029,6 +1030,7 @@ static int gfs2_freeze(struct super_block *sb)
msleep(1000);
}
error = 0;
+   set_bit(SDF_FS_FROZEN, >sd_flags);
 out:
mutex_unlock(>sd_freeze_mutex);
return error;
@@ -1053,7 +1055,7 @@ static int gfs2_unfreeze(struct super_block *sb)
 
gfs2_glock_dq_uninit(>sd_freeze_gh);
mutex_unlock(>sd_freeze_mutex);
-   return 0;
+   return wait_on_bit(>sd_flags, SDF_FS_FROZEN, TASK_INTERRUPTIBLE);
 }
 
 /**
-- 
2.20.1



Re: [Cluster-devel] [PATCH] gfs2: fix race between gfs2_freeze_func and unmount

2019-04-30 Thread Abhijith Das
NACK. Andreas mentioned that the description could be more descriptive and
that we should be using clear_bit_unlock() instead of clear_bit(). I'll
post a v2 shortly with these changes.

Cheers!
--Abhi

On Tue, Apr 30, 2019 at 12:48 PM Abhi Das  wrote:

> gfs2_unfreee() doesn't wait for gfs2_freeze_func() to complete. If a
> umount is issued right after unfreeze, it could result in an
> inconsistent filesystem because some journal data (statfs update)
> wasn't written out.
>
> This patch causes gfs2_unfreeze() to wait for gfs2_freeze_func() to
> complete before returning to the user.
>
> Signed-off-by: Abhi Das 
> ---
>  fs/gfs2/incore.h | 1 +
>  fs/gfs2/super.c  | 8 +---
>  2 files changed, 6 insertions(+), 3 deletions(-)
>
> diff --git a/fs/gfs2/incore.h b/fs/gfs2/incore.h
> index 78c8e761b321..b15755068593 100644
> --- a/fs/gfs2/incore.h
> +++ b/fs/gfs2/incore.h
> @@ -621,6 +621,7 @@ enum {
> SDF_SKIP_DLM_UNLOCK = 8,
> SDF_FORCE_AIL_FLUSH = 9,
> SDF_AIL1_IO_ERROR   = 10,
> +   SDF_FS_FROZEN   = 11,
>  };
>
>  enum gfs2_freeze_state {
> diff --git a/fs/gfs2/super.c b/fs/gfs2/super.c
> index a6a325b2a78b..a81d7a5afe39 100644
> --- a/fs/gfs2/super.c
> +++ b/fs/gfs2/super.c
> @@ -973,8 +973,7 @@ void gfs2_freeze_func(struct work_struct *work)
> if (error) {
> printk(KERN_INFO "GFS2: couldn't get freeze lock : %d\n",
> error);
> gfs2_assert_withdraw(sdp, 0);
> -   }
> -   else {
> +   } else {
> atomic_set(>sd_freeze_state, SFS_UNFROZEN);
> error = thaw_super(sb);
> if (error) {
> @@ -987,6 +986,8 @@ void gfs2_freeze_func(struct work_struct *work)
> gfs2_glock_dq_uninit(_gh);
> }
> deactivate_super(sb);
> +   clear_bit(SDF_FS_FROZEN, >sd_flags);
> +   wake_up_bit(>sd_flags, SDF_FS_FROZEN);
> return;
>  }
>
> @@ -1029,6 +1030,7 @@ static int gfs2_freeze(struct super_block *sb)
> msleep(1000);
> }
> error = 0;
> +   set_bit(SDF_FS_FROZEN, >sd_flags);
>  out:
> mutex_unlock(>sd_freeze_mutex);
> return error;
> @@ -1053,7 +1055,7 @@ static int gfs2_unfreeze(struct super_block *sb)
>
> gfs2_glock_dq_uninit(>sd_freeze_gh);
> mutex_unlock(>sd_freeze_mutex);
> -   return 0;
> +   return wait_on_bit(>sd_flags, SDF_FS_FROZEN,
> TASK_INTERRUPTIBLE);
>  }
>
>  /**
> --
> 2.20.1
>
>


Re: [Cluster-devel] [PATCH v7 0/5] iomap and gfs2 fixes

2019-04-30 Thread Dave Chinner
On Mon, Apr 29, 2019 at 07:50:28PM -0700, Darrick J. Wong wrote:
> On Tue, Apr 30, 2019 at 12:09:29AM +0200, Andreas Gruenbacher wrote:
> > Here's another update of this patch queue, hopefully with all wrinkles
> > ironed out now.
> > 
> > Darrick, I think Linus would be unhappy seeing the first four patches in
> > the gfs2 tree; could you put them into the xfs tree instead like we did
> > some time ago already?
> 
> Sure.  When I'm done reviewing them I'll put them in the iomap tree,
> though, since we now have a separate one. :)

I'd just keep the iomap stuff in the xfs tree as a separate set of
branches and merge them into the XFS for-next when composing it.
That way it still gets plenty of test coverage from all the XFS
devs and linux next without anyone having to think about.

You really only need to send separate pull requests for the iomap
and XFS branches - IMO, there's no really need to have a complete
new tree for it...

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com



[Cluster-devel] [PATCH] gfs2: fix race between gfs2_freeze_func and unmount

2019-04-30 Thread Abhi Das
gfs2_unfreee() doesn't wait for gfs2_freeze_func() to complete. If a
umount is issued right after unfreeze, it could result in an
inconsistent filesystem because some journal data (statfs update)
wasn't written out.

This patch causes gfs2_unfreeze() to wait for gfs2_freeze_func() to
complete before returning to the user.

Signed-off-by: Abhi Das 
---
 fs/gfs2/incore.h | 1 +
 fs/gfs2/super.c  | 8 +---
 2 files changed, 6 insertions(+), 3 deletions(-)

diff --git a/fs/gfs2/incore.h b/fs/gfs2/incore.h
index 78c8e761b321..b15755068593 100644
--- a/fs/gfs2/incore.h
+++ b/fs/gfs2/incore.h
@@ -621,6 +621,7 @@ enum {
SDF_SKIP_DLM_UNLOCK = 8,
SDF_FORCE_AIL_FLUSH = 9,
SDF_AIL1_IO_ERROR   = 10,
+   SDF_FS_FROZEN   = 11,
 };
 
 enum gfs2_freeze_state {
diff --git a/fs/gfs2/super.c b/fs/gfs2/super.c
index a6a325b2a78b..a81d7a5afe39 100644
--- a/fs/gfs2/super.c
+++ b/fs/gfs2/super.c
@@ -973,8 +973,7 @@ void gfs2_freeze_func(struct work_struct *work)
if (error) {
printk(KERN_INFO "GFS2: couldn't get freeze lock : %d\n", 
error);
gfs2_assert_withdraw(sdp, 0);
-   }
-   else {
+   } else {
atomic_set(>sd_freeze_state, SFS_UNFROZEN);
error = thaw_super(sb);
if (error) {
@@ -987,6 +986,8 @@ void gfs2_freeze_func(struct work_struct *work)
gfs2_glock_dq_uninit(_gh);
}
deactivate_super(sb);
+   clear_bit(SDF_FS_FROZEN, >sd_flags);
+   wake_up_bit(>sd_flags, SDF_FS_FROZEN);
return;
 }
 
@@ -1029,6 +1030,7 @@ static int gfs2_freeze(struct super_block *sb)
msleep(1000);
}
error = 0;
+   set_bit(SDF_FS_FROZEN, >sd_flags);
 out:
mutex_unlock(>sd_freeze_mutex);
return error;
@@ -1053,7 +1055,7 @@ static int gfs2_unfreeze(struct super_block *sb)
 
gfs2_glock_dq_uninit(>sd_freeze_gh);
mutex_unlock(>sd_freeze_mutex);
-   return 0;
+   return wait_on_bit(>sd_flags, SDF_FS_FROZEN, TASK_INTERRUPTIBLE);
 }
 
 /**
-- 
2.20.1



Re: [Cluster-devel] [PATCH v7 5/5] gfs2: Fix iomap write page reclaim deadlock

2019-04-30 Thread Andreas Grünbacher
Am Di., 30. Apr. 2019 um 17:48 Uhr schrieb Darrick J. Wong
:
> Ok, I'll take the first four patches through the iomap branch and cc you
> on the pull request.

Ok great, thanks.

Andreas



Re: [Cluster-devel] [PATCH v7 5/5] gfs2: Fix iomap write page reclaim deadlock

2019-04-30 Thread Darrick J. Wong
On Tue, Apr 30, 2019 at 05:39:28PM +0200, Andreas Gruenbacher wrote:
> On Tue, 30 Apr 2019 at 17:33, Darrick J. Wong  wrote:
> > On Tue, Apr 30, 2019 at 12:09:34AM +0200, Andreas Gruenbacher wrote:
> > > Since commit 64bc06bb32ee ("gfs2: iomap buffered write support"), gfs2 is 
> > > doing
> > > buffered writes by starting a transaction in iomap_begin, writing a range 
> > > of
> > > pages, and ending that transaction in iomap_end.  This approach suffers 
> > > from
> > > two problems:
> > >
> > >   (1) Any allocations necessary for the write are done in iomap_begin, so 
> > > when
> > >   the data aren't journaled, there is no need for keeping the transaction 
> > > open
> > >   until iomap_end.
> > >
> > >   (2) Transactions keep the gfs2 log flush lock held.  When
> > >   iomap_file_buffered_write calls balance_dirty_pages, this can end up 
> > > calling
> > >   gfs2_write_inode, which will try to flush the log.  This requires 
> > > taking the
> > >   log flush lock which is already held, resulting in a deadlock.
> >
> > /me wonders how holding the log flush lock doesn't seriously limit
> > performance, but gfs2 isn't my fight so I'll set that aside and assume
> > that a patch S-o-B'd by both maintainers is ok. :)
> 
> This only affects inline and journaled data, not standard writes, so
> it's not quite as bad as it looks.

Ah, ok.

> > How should we merge this patch #5?  It doesn't touch fs/iomap.c itself,
> > so do you want me to pull it into the iomap branch along with the
> > previous four patches?  That would be fine with me (and easier than a
> > multi-tree merge mess)...
> 
> I'd prefer to get this merged via the gfs2 tree once the iomap fixes
> have been pulled.

Ok, I'll take the first four patches through the iomap branch and cc you
on the pull request.

--D

> 
> Thanks,
> Andreas



Re: [Cluster-devel] [PATCH v7 5/5] gfs2: Fix iomap write page reclaim deadlock

2019-04-30 Thread Andreas Gruenbacher
On Tue, 30 Apr 2019 at 17:33, Darrick J. Wong  wrote:
> On Tue, Apr 30, 2019 at 12:09:34AM +0200, Andreas Gruenbacher wrote:
> > Since commit 64bc06bb32ee ("gfs2: iomap buffered write support"), gfs2 is 
> > doing
> > buffered writes by starting a transaction in iomap_begin, writing a range of
> > pages, and ending that transaction in iomap_end.  This approach suffers from
> > two problems:
> >
> >   (1) Any allocations necessary for the write are done in iomap_begin, so 
> > when
> >   the data aren't journaled, there is no need for keeping the transaction 
> > open
> >   until iomap_end.
> >
> >   (2) Transactions keep the gfs2 log flush lock held.  When
> >   iomap_file_buffered_write calls balance_dirty_pages, this can end up 
> > calling
> >   gfs2_write_inode, which will try to flush the log.  This requires taking 
> > the
> >   log flush lock which is already held, resulting in a deadlock.
>
> /me wonders how holding the log flush lock doesn't seriously limit
> performance, but gfs2 isn't my fight so I'll set that aside and assume
> that a patch S-o-B'd by both maintainers is ok. :)

This only affects inline and journaled data, not standard writes, so
it's not quite as bad as it looks.

> How should we merge this patch #5?  It doesn't touch fs/iomap.c itself,
> so do you want me to pull it into the iomap branch along with the
> previous four patches?  That would be fine with me (and easier than a
> multi-tree merge mess)...

I'd prefer to get this merged via the gfs2 tree once the iomap fixes
have been pulled.

Thanks,
Andreas



Re: [Cluster-devel] [PATCH v7 5/5] gfs2: Fix iomap write page reclaim deadlock

2019-04-30 Thread Darrick J. Wong
On Tue, Apr 30, 2019 at 12:09:34AM +0200, Andreas Gruenbacher wrote:
> Since commit 64bc06bb32ee ("gfs2: iomap buffered write support"), gfs2 is 
> doing
> buffered writes by starting a transaction in iomap_begin, writing a range of
> pages, and ending that transaction in iomap_end.  This approach suffers from
> two problems:
> 
>   (1) Any allocations necessary for the write are done in iomap_begin, so when
>   the data aren't journaled, there is no need for keeping the transaction open
>   until iomap_end.
> 
>   (2) Transactions keep the gfs2 log flush lock held.  When
>   iomap_file_buffered_write calls balance_dirty_pages, this can end up calling
>   gfs2_write_inode, which will try to flush the log.  This requires taking the
>   log flush lock which is already held, resulting in a deadlock.

/me wonders how holding the log flush lock doesn't seriously limit
performance, but gfs2 isn't my fight so I'll set that aside and assume
that a patch S-o-B'd by both maintainers is ok. :)

How should we merge this patch #5?  It doesn't touch fs/iomap.c itself,
so do you want me to pull it into the iomap branch along with the
previous four patches?  That would be fine with me (and easier than a
multi-tree merge mess)...

--D

> 
> Fix both of these issues by not keeping transactions open from iomap_begin to
> iomap_end.  Instead, start a small transaction in page_prepare and end it in
> page_done when necessary.
> 
> Reported-by: Edwin Török 
> Fixes: 64bc06bb32ee ("gfs2: iomap buffered write support")
> Signed-off-by: Andreas Gruenbacher 
> Signed-off-by: Bob Peterson 
> ---
>  fs/gfs2/aops.c | 14 +---
>  fs/gfs2/bmap.c | 88 +++---
>  2 files changed, 58 insertions(+), 44 deletions(-)
> 
> diff --git a/fs/gfs2/aops.c b/fs/gfs2/aops.c
> index 05dd78f4b2b3..6210d4429d84 100644
> --- a/fs/gfs2/aops.c
> +++ b/fs/gfs2/aops.c
> @@ -649,7 +649,7 @@ static int gfs2_readpages(struct file *file, struct 
> address_space *mapping,
>   */
>  void adjust_fs_space(struct inode *inode)
>  {
> - struct gfs2_sbd *sdp = inode->i_sb->s_fs_info;
> + struct gfs2_sbd *sdp = GFS2_SB(inode);
>   struct gfs2_inode *m_ip = GFS2_I(sdp->sd_statfs_inode);
>   struct gfs2_inode *l_ip = GFS2_I(sdp->sd_sc_inode);
>   struct gfs2_statfs_change_host *m_sc = >sd_statfs_master;
> @@ -657,10 +657,13 @@ void adjust_fs_space(struct inode *inode)
>   struct buffer_head *m_bh, *l_bh;
>   u64 fs_total, new_free;
>  
> + if (gfs2_trans_begin(sdp, 2 * RES_STATFS, 0) != 0)
> + return;
> +
>   /* Total up the file system space, according to the latest rindex. */
>   fs_total = gfs2_ri_total(sdp);
>   if (gfs2_meta_inode_buffer(m_ip, _bh) != 0)
> - return;
> + goto out;
>  
>   spin_lock(>sd_statfs_spin);
>   gfs2_statfs_change_in(m_sc, m_bh->b_data +
> @@ -675,11 +678,14 @@ void adjust_fs_space(struct inode *inode)
>   gfs2_statfs_change(sdp, new_free, new_free, 0);
>  
>   if (gfs2_meta_inode_buffer(l_ip, _bh) != 0)
> - goto out;
> + goto out2;
>   update_statfs(sdp, m_bh, l_bh);
>   brelse(l_bh);
> -out:
> +out2:
>   brelse(m_bh);
> +out:
> + sdp->sd_rindex_uptodate = 0;
> + gfs2_trans_end(sdp);
>  }
>  
>  /**
> diff --git a/fs/gfs2/bmap.c b/fs/gfs2/bmap.c
> index aa014725f84a..27c82f4aaf32 100644
> --- a/fs/gfs2/bmap.c
> +++ b/fs/gfs2/bmap.c
> @@ -991,17 +991,28 @@ static void gfs2_write_unlock(struct inode *inode)
>   gfs2_glock_dq_uninit(>i_gh);
>  }
>  
> +static int gfs2_iomap_page_prepare(struct inode *inode, loff_t pos,
> +unsigned len, struct iomap *iomap)
> +{
> + struct gfs2_sbd *sdp = GFS2_SB(inode);
> +
> + return gfs2_trans_begin(sdp, RES_DINODE + (len >> inode->i_blkbits), 0);
> +}
> +
>  static void gfs2_iomap_page_done(struct inode *inode, loff_t pos,
>unsigned copied, struct page *page,
>struct iomap *iomap)
>  {
>   struct gfs2_inode *ip = GFS2_I(inode);
> + struct gfs2_sbd *sdp = GFS2_SB(inode);
>  
> - if (page)
> + if (page && !gfs2_is_stuffed(ip))
>   gfs2_page_add_databufs(ip, page, offset_in_page(pos), copied);
> + gfs2_trans_end(sdp);
>  }
>  
>  static const struct iomap_page_ops gfs2_iomap_page_ops = {
> + .page_prepare = gfs2_iomap_page_prepare,
>   .page_done = gfs2_iomap_page_done,
>  };
>  
> @@ -1057,31 +1068,45 @@ static int gfs2_iomap_begin_write(struct inode 
> *inode, loff_t pos,
>   if (alloc_required)
>   rblocks += gfs2_rg_blocks(ip, data_blocks + ind_blocks);
>  
> - ret = gfs2_trans_begin(sdp, rblocks, iomap->length >> inode->i_blkbits);
> - if (ret)
> - goto out_trans_fail;
> + if (unstuff || iomap->type == IOMAP_HOLE) {
> + struct gfs2_trans *tr;
>  
> - if (unstuff) {
> - ret = gfs2_unstuff_dinode(ip, NULL);
> +

Re: [Cluster-devel] [PATCH v7 4/5] iomap: Add a page_prepare callback

2019-04-30 Thread Darrick J. Wong
On Tue, Apr 30, 2019 at 12:09:33AM +0200, Andreas Gruenbacher wrote:
> Move the page_done callback into a separate iomap_page_ops structure and
> add a page_prepare calback to be called before the next page is written
> to.  In gfs2, we'll want to start a transaction in page_prepare and end
> it in page_done.  Other filesystems that implement data journaling will
> require the same kind of mechanism.
> 
> Signed-off-by: Andreas Gruenbacher 
> Reviewed-by: Christoph Hellwig 
> Reviewed-by: Jan Kara 

Looks fine, assuming you've done the appropriate QA/testing on the gfs2
side. :)

Reviewed-by: Darrick J. Wong 

--D

> ---
>  fs/gfs2/bmap.c| 15 ++-
>  fs/iomap.c| 36 ++--
>  include/linux/iomap.h | 22 +-
>  3 files changed, 53 insertions(+), 20 deletions(-)
> 
> diff --git a/fs/gfs2/bmap.c b/fs/gfs2/bmap.c
> index 5da4ca9041c0..aa014725f84a 100644
> --- a/fs/gfs2/bmap.c
> +++ b/fs/gfs2/bmap.c
> @@ -991,15 +991,20 @@ static void gfs2_write_unlock(struct inode *inode)
>   gfs2_glock_dq_uninit(>i_gh);
>  }
>  
> -static void gfs2_iomap_journaled_page_done(struct inode *inode, loff_t pos,
> - unsigned copied, struct page *page,
> - struct iomap *iomap)
> +static void gfs2_iomap_page_done(struct inode *inode, loff_t pos,
> +  unsigned copied, struct page *page,
> +  struct iomap *iomap)
>  {
>   struct gfs2_inode *ip = GFS2_I(inode);
>  
> - gfs2_page_add_databufs(ip, page, offset_in_page(pos), copied);
> + if (page)
> + gfs2_page_add_databufs(ip, page, offset_in_page(pos), copied);
>  }
>  
> +static const struct iomap_page_ops gfs2_iomap_page_ops = {
> + .page_done = gfs2_iomap_page_done,
> +};
> +
>  static int gfs2_iomap_begin_write(struct inode *inode, loff_t pos,
> loff_t length, unsigned flags,
> struct iomap *iomap,
> @@ -1077,7 +1082,7 @@ static int gfs2_iomap_begin_write(struct inode *inode, 
> loff_t pos,
>   }
>   }
>   if (!gfs2_is_stuffed(ip) && gfs2_is_jdata(ip))
> - iomap->page_done = gfs2_iomap_journaled_page_done;
> + iomap->page_ops = _iomap_page_ops;
>   return 0;
>  
>  out_trans_end:
> diff --git a/fs/iomap.c b/fs/iomap.c
> index 62e3461704ce..a3ffc83134ee 100644
> --- a/fs/iomap.c
> +++ b/fs/iomap.c
> @@ -665,6 +665,7 @@ static int
>  iomap_write_begin(struct inode *inode, loff_t pos, unsigned len, unsigned 
> flags,
>   struct page **pagep, struct iomap *iomap)
>  {
> + const struct iomap_page_ops *page_ops = iomap->page_ops;
>   pgoff_t index = pos >> PAGE_SHIFT;
>   struct page *page;
>   int status = 0;
> @@ -674,9 +675,17 @@ iomap_write_begin(struct inode *inode, loff_t pos, 
> unsigned len, unsigned flags,
>   if (fatal_signal_pending(current))
>   return -EINTR;
>  
> + if (page_ops && page_ops->page_prepare) {
> + status = page_ops->page_prepare(inode, pos, len, iomap);
> + if (status)
> + return status;
> + }
> +
>   page = grab_cache_page_write_begin(inode->i_mapping, index, flags);
> - if (!page)
> - return -ENOMEM;
> + if (!page) {
> + status = -ENOMEM;
> + goto out_no_page;
> + }
>  
>   if (iomap->type == IOMAP_INLINE)
>   iomap_read_inline_data(inode, page, iomap);
> @@ -684,15 +693,21 @@ iomap_write_begin(struct inode *inode, loff_t pos, 
> unsigned len, unsigned flags,
>   status = __block_write_begin_int(page, pos, len, NULL, iomap);
>   else
>   status = __iomap_write_begin(inode, pos, len, page, iomap);
> - if (unlikely(status)) {
> - unlock_page(page);
> - put_page(page);
> - page = NULL;
>  
> - iomap_write_failed(inode, pos, len);
> - }
> + if (unlikely(status))
> + goto out_unlock;
>  
>   *pagep = page;
> + return 0;
> +
> +out_unlock:
> + unlock_page(page);
> + put_page(page);
> + iomap_write_failed(inode, pos, len);
> +
> +out_no_page:
> + if (page_ops && page_ops->page_done)
> + page_ops->page_done(inode, pos, 0, NULL, iomap);
>   return status;
>  }
>  
> @@ -766,6 +781,7 @@ static int
>  iomap_write_end(struct inode *inode, loff_t pos, unsigned len,
>   unsigned copied, struct page *page, struct iomap *iomap)
>  {
> + const struct iomap_page_ops *page_ops = iomap->page_ops;
>   int ret;
>  
>   if (iomap->type == IOMAP_INLINE) {
> @@ -778,8 +794,8 @@ iomap_write_end(struct inode *inode, loff_t pos, unsigned 
> len,
>   }
>  
>   __generic_write_end(inode, pos, ret, page);
> - if (iomap->page_done)
> - iomap->page_done(inode, pos, copied, page, iomap);
> + if (page_ops && 

Re: [Cluster-devel] [PATCH v7 3/5] iomap: Fix use-after-free error in page_done callback

2019-04-30 Thread Darrick J. Wong
On Tue, Apr 30, 2019 at 12:09:32AM +0200, Andreas Gruenbacher wrote:
> In iomap_write_end, we're not holding a page reference anymore when
> calling the page_done callback, but the callback needs that reference to
> access the page.  To fix that, move the put_page call in
> __generic_write_end into the callers of __generic_write_end.  Then, in
> iomap_write_end, put the page after calling the page_done callback.
> 
> Reported-by: Jan Kara 
> Fixes: 63899c6f8851 ("iomap: add a page_done callback")
> Signed-off-by: Andreas Gruenbacher 
> Reviewed-by: Jan Kara 
> Reviewed-by: Christoph Hellwig 

Looks ok,
Reviewed-by: Darrick J. Wong 

--D

> ---
>  fs/buffer.c | 2 +-
>  fs/iomap.c  | 1 +
>  2 files changed, 2 insertions(+), 1 deletion(-)
> 
> diff --git a/fs/buffer.c b/fs/buffer.c
> index e0d4c6a5e2d2..0faa41fb4c88 100644
> --- a/fs/buffer.c
> +++ b/fs/buffer.c
> @@ -2104,7 +2104,6 @@ void __generic_write_end(struct inode *inode, loff_t 
> pos, unsigned copied,
>   }
>  
>   unlock_page(page);
> - put_page(page);
>  
>   if (old_size < pos)
>   pagecache_isize_extended(inode, old_size, pos);
> @@ -2160,6 +2159,7 @@ int generic_write_end(struct file *file, struct 
> address_space *mapping,
>  {
>   copied = block_write_end(file, mapping, pos, len, copied, page, fsdata);
>   __generic_write_end(mapping->host, pos, copied, page);
> + put_page(page);
>   return copied;
>  }
>  EXPORT_SYMBOL(generic_write_end);
> diff --git a/fs/iomap.c b/fs/iomap.c
> index f8c9722d1a97..62e3461704ce 100644
> --- a/fs/iomap.c
> +++ b/fs/iomap.c
> @@ -780,6 +780,7 @@ iomap_write_end(struct inode *inode, loff_t pos, unsigned 
> len,
>   __generic_write_end(inode, pos, ret, page);
>   if (iomap->page_done)
>   iomap->page_done(inode, pos, copied, page, iomap);
> + put_page(page);
>  
>   if (ret < len)
>   iomap_write_failed(inode, pos, len);
> -- 
> 2.20.1
> 



Re: [Cluster-devel] [PATCH v7 2/5] fs: Turn __generic_write_end into a void function

2019-04-30 Thread Darrick J. Wong
On Tue, Apr 30, 2019 at 12:09:31AM +0200, Andreas Gruenbacher wrote:
> The VFS-internal __generic_write_end helper always returns the value of
> its @copied argument.  This can be confusing, and it isn't very useful
> anyway, so turn __generic_write_end into a function returning void
> instead.

(Also weird that @copied is unsigned but the return value is signed...)

> Signed-off-by: Andreas Gruenbacher 

Reviewed-by: Darrick J. Wong 

--D

> ---
>  fs/buffer.c   | 6 +++---
>  fs/internal.h | 2 +-
>  fs/iomap.c| 2 +-
>  3 files changed, 5 insertions(+), 5 deletions(-)
> 
> diff --git a/fs/buffer.c b/fs/buffer.c
> index ce357602f471..e0d4c6a5e2d2 100644
> --- a/fs/buffer.c
> +++ b/fs/buffer.c
> @@ -2085,7 +2085,7 @@ int block_write_begin(struct address_space *mapping, 
> loff_t pos, unsigned len,
>  }
>  EXPORT_SYMBOL(block_write_begin);
>  
> -int __generic_write_end(struct inode *inode, loff_t pos, unsigned copied,
> +void __generic_write_end(struct inode *inode, loff_t pos, unsigned copied,
>   struct page *page)
>  {
>   loff_t old_size = inode->i_size;
> @@ -2116,7 +2116,6 @@ int __generic_write_end(struct inode *inode, loff_t 
> pos, unsigned copied,
>*/
>   if (i_size_changed)
>   mark_inode_dirty(inode);
> - return copied;
>  }
>  
>  int block_write_end(struct file *file, struct address_space *mapping,
> @@ -2160,7 +2159,8 @@ int generic_write_end(struct file *file, struct 
> address_space *mapping,
>   struct page *page, void *fsdata)
>  {
>   copied = block_write_end(file, mapping, pos, len, copied, page, fsdata);
> - return __generic_write_end(mapping->host, pos, copied, page);
> + __generic_write_end(mapping->host, pos, copied, page);
> + return copied;
>  }
>  EXPORT_SYMBOL(generic_write_end);
>  
> diff --git a/fs/internal.h b/fs/internal.h
> index 6a8b71643af4..530587fdf5d8 100644
> --- a/fs/internal.h
> +++ b/fs/internal.h
> @@ -44,7 +44,7 @@ static inline int __sync_blockdev(struct block_device 
> *bdev, int wait)
>  extern void guard_bio_eod(int rw, struct bio *bio);
>  extern int __block_write_begin_int(struct page *page, loff_t pos, unsigned 
> len,
>   get_block_t *get_block, struct iomap *iomap);
> -int __generic_write_end(struct inode *inode, loff_t pos, unsigned copied,
> +void __generic_write_end(struct inode *inode, loff_t pos, unsigned copied,
>   struct page *page);
>  
>  /*
> diff --git a/fs/iomap.c b/fs/iomap.c
> index 2344c662e6fc..f8c9722d1a97 100644
> --- a/fs/iomap.c
> +++ b/fs/iomap.c
> @@ -777,7 +777,7 @@ iomap_write_end(struct inode *inode, loff_t pos, unsigned 
> len,
>   ret = __iomap_write_end(inode, pos, len, copied, page, iomap);
>   }
>  
> - ret = __generic_write_end(inode, pos, ret, page);
> + __generic_write_end(inode, pos, ret, page);
>   if (iomap->page_done)
>   iomap->page_done(inode, pos, copied, page, iomap);
>  
> -- 
> 2.20.1
> 



Re: [Cluster-devel] [PATCH v7 1/5] iomap: Clean up __generic_write_end calling

2019-04-30 Thread Darrick J. Wong
On Tue, Apr 30, 2019 at 12:09:30AM +0200, Andreas Gruenbacher wrote:
> From: Christoph Hellwig 
> 
> Move the call to __generic_write_end into iomap_write_end instead of
> duplicating it in each of the three branches.  This requires open coding
> the generic_write_end for the buffer_head case.
> 
> Signed-off-by: Christoph Hellwig 
> Signed-off-by: Andreas Gruenbacher 
> Reviewed-by: Jan Kara 

Looks ok,
Reviewed-by: Darrick J. Wong 

--D

> ---
>  fs/iomap.c | 18 --
>  1 file changed, 8 insertions(+), 10 deletions(-)
> 
> diff --git a/fs/iomap.c b/fs/iomap.c
> index 97cb9d486a7d..2344c662e6fc 100644
> --- a/fs/iomap.c
> +++ b/fs/iomap.c
> @@ -738,13 +738,11 @@ __iomap_write_end(struct inode *inode, loff_t pos, 
> unsigned len,
>* uptodate page as a zero-length write, and force the caller to redo
>* the whole thing.
>*/
> - if (unlikely(copied < len && !PageUptodate(page))) {
> - copied = 0;
> - } else {
> - iomap_set_range_uptodate(page, offset_in_page(pos), len);
> - iomap_set_page_dirty(page);
> - }
> - return __generic_write_end(inode, pos, copied, page);
> + if (unlikely(copied < len && !PageUptodate(page)))
> + return 0;
> + iomap_set_range_uptodate(page, offset_in_page(pos), len);
> + iomap_set_page_dirty(page);
> + return copied;
>  }
>  
>  static int
> @@ -761,7 +759,6 @@ iomap_write_end_inline(struct inode *inode, struct page 
> *page,
>   kunmap_atomic(addr);
>  
>   mark_inode_dirty(inode);
> - __generic_write_end(inode, pos, copied, page);
>   return copied;
>  }
>  
> @@ -774,12 +771,13 @@ iomap_write_end(struct inode *inode, loff_t pos, 
> unsigned len,
>   if (iomap->type == IOMAP_INLINE) {
>   ret = iomap_write_end_inline(inode, page, iomap, pos, copied);
>   } else if (iomap->flags & IOMAP_F_BUFFER_HEAD) {
> - ret = generic_write_end(NULL, inode->i_mapping, pos, len,
> - copied, page, NULL);
> + ret = block_write_end(NULL, inode->i_mapping, pos, len, copied,
> + page, NULL);
>   } else {
>   ret = __iomap_write_end(inode, pos, len, copied, page, iomap);
>   }
>  
> + ret = __generic_write_end(inode, pos, ret, page);
>   if (iomap->page_done)
>   iomap->page_done(inode, pos, copied, page, iomap);
>  
> -- 
> 2.20.1
> 



Re: [Cluster-devel] [PATCH v7 2/5] fs: Turn __generic_write_end into a void function

2019-04-30 Thread Christoph Hellwig
On Tue, Apr 30, 2019 at 12:09:31AM +0200, Andreas Gruenbacher wrote:
> The VFS-internal __generic_write_end helper always returns the value of
> its @copied argument.  This can be confusing, and it isn't very useful
> anyway, so turn __generic_write_end into a function returning void
> instead.
> 
> Signed-off-by: Andreas Gruenbacher 

Looks good,

Reviewed-by: Christoph Hellwig