Re: dm-zoned-tools: add zoned disk udev rules for scheduler / dmsetup

2018-06-14 Thread Luis R. Rodriguez
On Thu, Jun 14, 2018 at 07:37:19PM +0200, Luis R. Rodriguez wrote:
> 
> Ie a file describing which
> disk serial gets deadline and which one gets mq-deadline.
> Anyway, let's assume this is done in the kernel, which one would use deadline,
> which one would use mq-deadline?

Nevermind this, deadline will work, Bart already stated why.

 Luis


Re: [PATCH] dm-zoned-tools: add zoned disk udev rules for scheduler / dmsetup

2018-06-14 Thread Luis R. Rodriguez
On Thu, Jun 14, 2018 at 04:19:23PM +, Bart Van Assche wrote:
> On Wed, 2018-06-13 at 17:11 -0700, Luis R. Rodriguez wrote:
> > This tries to put a bit of this tribal knowledge into an initial udev
> > rule for development with the hopes Linux distributions can later
> > deploy. Three rule are added. One rule is optional for now, it should be
> > extended later to be more distribution-friendly and then I think this
> > may be ready for consideration for integration on distributions.
> > 
> > 1) scheduler setup
> > 2) backlist f2fs devices
> > 3) run dmsetup for the rest of devices
> 
> Hello Luis,
> 
> I think it is wrong to package the zoned block device scheduler rule in the
> dm-zoned-tools package. That udev rule should be activated whether or not the
> dm-zoned-tools package has been installed. Have you considered to submit the
> zoned block device scheduler rule to the systemd project since today that
> project includes all base udev rules?

Nope, this is a udev rule intended for developers wishing to brush up on our
state of affairs. All I wanted to do is to get my own drive to work at home in
a reasonably reliable and generic form, which also allowed me to hop around
kernels without a fatal issue.

Clearly there is much to discuss still and a roadmap to illustrate. We should
update the documentation to reflect this first, and based on that enable then
distributions based on that discussion. We should have short term and long term
plans. That discussion may have taken place already so forgive me for not
knowing about it, I did try to read as much from the code and existing tools
and just didn't find anything else to be clear on it.

> > +# Zoned disks can only work with the deadline or mq-deadline scheduler. 
> > This is
> > +# mandated for all SMR drives since v4.16. It has been determined this 
> > must be
> > +# done through a udev rule, and the kernel should not set this up for 
> > disks.
> > +# This magic will have to live for *all* zoned disks.
> > +# XXX: what about distributions that want mq-deadline ? Probably easy for 
> > now
> > +#  to assume deadline and later have a mapping file to enable
> > +#  mq-deadline for specific serial devices?
> > +ACTION=="add|change", KERNEL=="sd*[!0-9]", 
> > ATTRS{queue/zoned}=="host-managed", \
> > +   ATTR{queue/scheduler}="deadline"
> 
> I think it is wrong to limit this rule to SCSI disks only. Work is ongoing to
> add zoned block device support to the null_blk driver. That is a block driver
> and not a SCSI driver. I think the above udev rule should apply to that block
> driver too.

Sure patches welcomed :)

> Regarding blk-mq, from the mq-deadline source code:
>   .elevator_alias = "deadline",
> 
> In other words, the name "deadline" should work both for legacy and for blk-mq
> block devices.

Groovy that helps.

  Luis


Re: dm-zoned-tools: add zoned disk udev rules for scheduler / dmsetup

2018-06-14 Thread Luis R. Rodriguez
On Thu, Jun 14, 2018 at 08:38:06AM -0400, Mike Snitzer wrote:
> On Wed, Jun 13 2018 at  8:11pm -0400,
> Luis R. Rodriguez  wrote:
> 
> > Setting up a zoned disks in a generic form is not so trivial. There
> > is also quite a bit of tribal knowledge with these devices which is not
> > easy to find.
> > 
> > The currently supplied demo script works but it is not generic enough to be
> > practical for Linux distributions or even developers which often move
> > from one kernel to another.
> > 
> > This tries to put a bit of this tribal knowledge into an initial udev
> > rule for development with the hopes Linux distributions can later
> > deploy. Three rule are added. One rule is optional for now, it should be
> > extended later to be more distribution-friendly and then I think this
> > may be ready for consideration for integration on distributions.
> > 
> > 1) scheduler setup
> 
> This is wrong.. if zoned devices are so dependent on deadline or
> mq-deadline then the kernel should allow them to be hardcoded.  I know
> Jens removed the API to do so but the fact that drivers need to rely on
> hacks like this udev rule to get a functional device is proof we need to
> allow drivers to impose the scheduler used.

This is the point to the patch as well, I actually tend to agree with you,
and I had tried to draw up a patch to do just that, however its *not* possible
today to do this and would require some consensus. So from what I can tell
we *have* to live with this one or a form of it. Ie a file describing which
disk serial gets deadline and which one gets mq-deadline.

Jens?

Anyway, let's assume this is done in the kernel, which one would use deadline,
which one would use mq-deadline?

> > 2) backlist f2fs devices
> 
> There should porbably be support in dm-zoned for detecting whether a
> zoned device was formatted with f2fs (assuming there is a known f2fs
> superblock)?

Not sure what you mean. Are you suggesting we always setup dm-zoned for
all zoned disks and just make an excemption on dm-zone code to somehow
use the disk directly if a filesystem supports zoned disks directly somehow?

f2fs does not require dm-zoned. What would be required is a bit more complex
given one could dedicate portions of the disk to f2fs and other portions to
another filesystem, which would require dm-zoned.

Also filesystems which *do not* support zoned disks should *not* be allowing
direct setup. Today that's all filesystems other than f2fs, in the future
that may change. Those are bullets we are allowing to trigger for users
just waiting to shot themselves on the foot with.

So who's going to work on all the above?

The point of the udev script is to illustrate the pains to properly deploy
zoned disks on distributions today and without a roadmap... this is what
at least I need on my systems today to reasonably deploy these disks for
my own development.

Consensus is indeed needed for a broader picture.

> > 3) run dmsetup for the rest of devices
> 
> automagically running dmsetup directly from udev to create a dm-zoned
> target is very much wrong.  It just gets in the way of proper support
> that should be add to appropriate tools that admins use to setup their
> zoned devices.  For instance, persistent use of dm-zoned target should
> be made reliable with a volume manager..

Ah yes, but who's working on that? How long will it take?

I agree it is odd to expect one to use dmsetup and then use a volume manager on
top of it, if we can just add proper support onto the volume manager... then
that's a reasonable way to go.

But *we're not there* yet, and as-is today, what is described in the udev
script is the best we can do for a generic setup.

> In general this udev script is unwelcome and makes things way worse for
> the long-term success of zoned devices.

dm-zoned-tools does not acknowledge in any way a roadmap, and just provides
a script, which IMHO is less generic and less distribution friendly. Having
a udev rule in place to demonstrate the current state of affairs IMHO is
more scalable demonstrates the issues better than the script.

If we have an agreed upon long term strategy lets document that. But from
what I gather we are not even in consensus with regards to the scheduler
stuff. If we have consensus on the other stuff lets document that as
dm-zoned-tools is the only place I think folks could find to reasonably
deploy these things.

> I don't dispute there is an obvious void for how to properly setup zoned
> devices, but this script is _not_ what should fill that void.

Good to know! Again, consider it as an alternative to the script.

I'm happy to adapt the language and supply it only as an example script
developers can use, but we can't leave users hanging as well. Let's at
least come up with a plan which we seem to agree on and document that.

  Luis


[PATCH] dm-zoned-tools: add zoned disk udev rules for scheduler / dmsetup

2018-06-13 Thread Luis R. Rodriguez
Setting up a zoned disks in a generic form is not so trivial. There
is also quite a bit of tribal knowledge with these devices which is not
easy to find.

The currently supplied demo script works but it is not generic enough to be
practical for Linux distributions or even developers which often move
from one kernel to another.

This tries to put a bit of this tribal knowledge into an initial udev
rule for development with the hopes Linux distributions can later
deploy. Three rule are added. One rule is optional for now, it should be
extended later to be more distribution-friendly and then I think this
may be ready for consideration for integration on distributions.

1) scheduler setup
2) backlist f2fs devices
3) run dmsetup for the rest of devices

Note that this udev rule will not work well if you want to use a disk
with f2fs on part of the disk and another filesystem on another part of
the disk. That setup will require manual love so these setups can use
the same backlist on rule 2).

Its not widely known for instance that as of v4.16 it is mandated to use
either deadline or the mq-deadline scheduler for *all* SMR drivers. Its
also been determined that the Linux kernel is not the place to set this up,
so a udev rule *is required* as per latest discussions. This is the
first rule we add.

Furthermore if you are *not* using f2fs you always have to run dmsetup.
dmsetups do not persist, so you currently *always* have to run a custom
sort of script, which is not ideal for Linux distributions. We can invert
this logic into a udev rule to enable users to blacklist disks they know they
want to use f2fs for. This the second optional rule. This blacklisting
can be generalized further in the future with an exception list file, for
instance using INPUT{db} or the like.

The third and final rule added then runs dmsetup for the rest of the disks
using the disk serial number for the new device mapper name.

Note that it is currently easy for users to make a mistake and run mkfs
on the the original disk, not the /dev/mapper/ device for non f2fs
arrangements. If that is done experience shows things can easily fall
apart with alignment *eventually*. We have no generic way today to
error out on this condition and proactively prevent this.

Signed-off-by: Luis R. Rodriguez 
---
 README| 10 +-
 udev/99-zoned-disks.rules | 78 +++
 2 files changed, 87 insertions(+), 1 deletion(-)
 create mode 100644 udev/99-zoned-disks.rules

diff --git a/README b/README
index 65e96c34fd04..f49541eaabc8 100644
--- a/README
+++ b/README
@@ -168,7 +168,15 @@ Options:
  reclaiming random zones if the percentage of
  free random data zones falls below .
 
-V. Example scripts
+V. Udev zone disk deployment
+
+
+A udev rule is provided which enables you to set the IO scheduler, blacklist
+driver to run dmsetup, and runs dmsetup for the rest of the zone drivers.
+If you use this udev rule the below script is not needed. Be sure to mkfs only
+on the resulting /dev/mapper/zone-$serial device you end up with.
+
+VI. Example scripts
 ==
 
 [[
diff --git a/udev/99-zoned-disks.rules b/udev/99-zoned-disks.rules
new file mode 100644
index ..e19b738dcc0e
--- /dev/null
+++ b/udev/99-zoned-disks.rules
@@ -0,0 +1,78 @@
+# To use a zone disks first thing you need to:
+#
+# 1) Enable zone disk support in your kernel
+# 2) Use the deadline or mq-deadline scheduler for it - mandated as of v4.16
+# 3) Blacklist devices dedicated for f2fs as of v4.10
+# 4) Run dmsetup other disks
+# 5) Create the filesystem -- NOTE: use mkfs /dev/mapper/zone-serial if
+#you enabled use dmsetup on the disk.
+# 6) Consider using nofail mount option in case you run an supported kernel
+#
+# You can use this udev rules file for 2) 3) and 4). Further details below.
+#
+# 1) Enable zone disk support in your kernel
+#
+#o CONFIG_BLK_DEV_ZONED
+#o CONFIG_DM_ZONED
+#
+# This will let the kernel actually see these devices, ie, via fdisk /dev/sda
+# for instance. Run:
+#
+#  dmzadm --format /dev/sda
+
+# 2) Set deadline or mq-deadline for all disks which are zoned
+#
+# Zoned disks can only work with the deadline or mq-deadline scheduler. This is
+# mandated for all SMR drives since v4.16. It has been determined this must be
+# done through a udev rule, and the kernel should not set this up for disks.
+# This magic will have to live for *all* zoned disks.
+# XXX: what about distributions that want mq-deadline ? Probably easy for now
+#  to assume deadline and later have a mapping file to enable
+#  mq-deadline for specific serial devices?
+ACTION=="add|change", KERNEL=="sd*[!0-9]", ATTRS{queue/zoned}=="host-managed", 
\
+   ATTR{queue/scheduler}="deadline"
+
+# 3) Blacklist f2fs devices as of v4.10
+# We don't have to run dmsetup on on disks where you want to 

Re: [PATCH dm-zoned-tools 2/2] README: fix example script

2018-04-30 Thread Luis R. Rodriguez
On Mon, Apr 30, 2018 at 08:36:09PM +, Damien Le Moal wrote:
> On 2018/04/30 12:45, Luis R. Rodriguez wrote:
> > The target type should be "zoned", not "dm-zoned".
> > 
> > Signed-off-by: Luis R. Rodriguez 
> > ---
> >  README | 2 +-
> >  1 file changed, 1 insertion(+), 1 deletion(-)
> > 
> > The README does note explain if one should use dmzadm to format a drive
> > if using dm zone or only if one is to use a filesystem which supports
> > zones, such as f2fs. Or if we should *always* use dmzadm format regardless
> > of the circumstance.
> 
> dmzadm is only for dm-zoned device mapper. If the file system natively support
> zoned block devices, it is the FS problem to format the drive accordingly to
> ZBC/ZAC constraints.

The README could be update to reflect this then, just to be sure. Only now
that I realize it has the dm prefix did I realize this was an obvious
requirement :P

> And for these files systems, dm-zoned is useless.

Sure, make sense.

  Luis


[PATCH dm-zoned-tools 2/2] README: fix example script

2018-04-30 Thread Luis R. Rodriguez
The target type should be "zoned", not "dm-zoned".

Signed-off-by: Luis R. Rodriguez 
---
 README | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

The README does note explain if one should use dmzadm to format a drive
if using dm zone or only if one is to use a filesystem which supports
zones, such as f2fs. Or if we should *always* use dmzadm format regardless
of the circumstance.

Could this be clarfied? What are the downsides to not using dmzadm
format on a drive and say a user just going straight to use the dm zone
and then mkfs on top of it?

diff --git a/README b/README
index 7762694..9aea28e 100644
--- a/README
+++ b/README
@@ -184,7 +184,7 @@ options="$@"
 
 modprobe dm-zoned
 
-echo "0 `blockdev --getsize ${dev}` dm-zoned ${dev} ${options}" | \
+echo "0 `blockdev --getsize ${dev}` zoned ${dev} ${options}" | \
 dmsetup create zoned-`basename ${dev}`
 ]]
 
-- 
2.17.0



[PATCH dm-zoned-tools 1/2] README: fix small typo host-manmaged --> host-managed

2018-04-30 Thread Luis R. Rodriguez
Small typo fix.

Signed-off-by: Luis R. Rodriguez 
---
 README | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/README b/README
index 61830dd..7762694 100644
--- a/README
+++ b/README
@@ -23,7 +23,7 @@ http://www.t13.org/Documents/UploadedDocuments/docs2015/
 di537r05-Zoned_Device_ATA_Command_Set_ZAC.pdf
 
 dm-zoned implementation focused on simplicity and on minimizing overhead (CPU,
-memory and storage overhead). For a 10TB host-manmaged disk with 256 MB zones,
+memory and storage overhead). For a 10TB host-managed disk with 256 MB zones,
 dm-zoned memory usage per disk instance is at most 4.5 MB and as little as 5
 zones will be used internally for storing metadata and performaing reclaim
 operations.
-- 
2.17.0



Re: [PATCH 2/2] xfs: add 'discard_sync' mount flag

2018-04-30 Thread Luis R. Rodriguez
On Mon, Apr 30, 2018 at 12:07:31PM -0600, Jens Axboe wrote:
> On 4/30/18 11:19 AM, Brian Foster wrote:
> > On Mon, Apr 30, 2018 at 09:32:52AM -0600, Jens Axboe wrote:
> >> XFS recently added support for async discards. While this can be
> >> a win for some workloads and devices, there are also cases where
> >> async bursty discard will severly harm the latencies of reads
> >> and writes.
> >>
> >> Add a 'discard_sync' mount flag to revert to using sync discard,
> >> issuing them one at the time and waiting for each one. This fixes
> >> a big performance regression we had moving to kernels that include
> >> the XFS async discard support.
> >>
> >> Signed-off-by: Jens Axboe 
> >> ---
> > 
> > Hm, I figured the async discard stuff would have been a pretty clear win
> > all around, but then again I'm not terribly familiar with what happens
> > with discards beneath the fs. I do know that the previous behavior would
> > cause fs level latencies due to holding up log I/O completion while
> > discards completed one at a time. My understanding is that this lead to
> > online discard being pretty much universally "not recommended" in favor
> > of fstrim.
> 
> It's not a secret that most devices suck at discard.

How can we know if a device sucks at discard?

> While the async
> discard is nifty and I bet works well for some cases, it can also cause
> a flood of discards on the device side which does not work well for
> other cases.

Shouldn't async then be only enabled if the device used supports it well?
Or should a blacklist per device be more suitable? Which is more popular?

  Luis


Re: [PATCH 03/11] fs: add frozen sb state helpers

2018-04-21 Thread Luis R. Rodriguez
On Thu, Nov 30, 2017 at 06:13:10PM +0100, Jan Kara wrote:
> On Wed 29-11-17 15:23:48, Luis R. Rodriguez wrote:
> > The question of whether or not a superblock is frozen needs to be
> > augmented in the future to account for differences between a user
> > initiated freeze and a kernel initiated freeze done automatically
> > on behalf of the kernel.
> > 
> > Provide helpers so that these can be used instead so that we don't
> > have to expand checks later in these same call sites as we expand
> > the definition of a frozen superblock.
> > 
> > Signed-off-by: Luis R. Rodriguez 
> 
> So helpers are fine but...
> 
> > +/**
> > + * sb_is_frozen_by_user - is superblock frozen by a user call
> > + * @sb: the super to check
> > + *
> > + * Returns true if the super freeze was initiated by userspace, for 
> > instance,
> > + * an ioctl call.
> > + */
> > +static inline bool sb_is_frozen_by_user(struct super_block *sb)
> > +{
> > +   return sb->s_writers.frozen == SB_FREEZE_COMPLETE;
> > +}
> 
> ... I dislike the _by_user() suffix as there may be different places that
> call freeze_super() (e.g. device mapper does this during some operations).
> Clearly we need to distinguish "by system suspend" and "the other" cases.
> So please make this clear in the naming.
> 
> In fact, what might be a cleaner solution is to introduce a 'freeze_count'
> for superblock freezing (we already do have this for block devices). Then
> you don't need to differentiate these two cases - but you'd still need to
> properly handle cleanup if freezing of all superblocks fails in the middle.
> So I'm not 100% this works out nicely in the end. But it's certainly worth
> a consideration.

Seems reasonable.

  Luis


Re: [PATCH 03/11] fs: add frozen sb state helpers

2018-04-21 Thread Luis R. Rodriguez
On Sun, Apr 22, 2018 at 01:53:23AM +0200, Jan Kara wrote:
> On Fri 20-04-18 11:49:32, Luis R. Rodriguez wrote:
> > On Tue, Apr 17, 2018 at 05:59:36PM -0700, Luis R. Rodriguez wrote:
> > > On Thu, Dec 21, 2017 at 12:03:29PM +0100, Jan Kara wrote:
> > > > Hello,
> > > > 
> > > > I think I owe you a reply here... Sorry that it took so long.
> > > 
> > > Took me just as long :)
> > > 
> > > > On Fri 01-12-17 22:13:27, Luis R. Rodriguez wrote:
> > > > > 
> > > > > I'll note that its still not perfectly clear if really the semantics 
> > > > > behind
> > > > > freeze_bdev() match what I described above fully. That still needs to 
> > > > > be
> > > > > vetted for. For instance, does thaw_bdev() keep a superblock frozen 
> > > > > if we
> > > > > an ioctl initiated freeze had occurred before? If so then great. 
> > > > > Otherwise
> > > > > I think we'll need to distinguish the ioctl interface. Worst possible 
> > > > > case
> > > > > is that bdev semantics and in-kernel semantics differ somehow, then 
> > > > > that
> > > > > will really create a holy fucking mess.
> > > > 
> > > > I believe nobody really thought about mixing those two interfaces to fs
> > > > freezing and so the behavior is basically defined by the implementation.
> > > > That is:
> > > > 
> > > > freeze_bdev() on sb frozen by ioctl_fsfreeze() -> EBUSY
> > > > freeze_bdev() on sb frozen by freeze_bdev() -> success
> > > > ioctl_fsfreeze() on sb frozen by freeze_bdev() -> EBUSY
> > > > ioctl_fsfreeze() on sb frozen by ioctl_fsfreeze() -> EBUSY
> > > > 
> > > > thaw_bdev() on sb frozen by ioctl_fsfreeze() -> EINVAL
> > > 
> > > Phew, so this is what we want for the in-kernel freezing so we're good
> > > and *can* combine these then.
> > 
> > I double checked, and I don't see where you get EINVAL for this case.
> > We *do* keep the sb frozen though, which is good, and the worst fear
> > I had was that we did not. However we return 0 if there was already
> > a prior freeze_bdev() or ioctl_fsfreeze() other than the context that
> > started the prior freeze (--bdev->bd_fsfreeze_count > 0).
> > 
> > The -EINVAL is only returned currently if there were no freezers.
> > 
> > int thaw_bdev(struct block_device *bdev, struct super_block *sb)
> > {
> > int error = -EINVAL;
> > 
> > mutex_lock(&bdev->bd_fsfreeze_mutex);
> > if (!bdev->bd_fsfreeze_count)
> > goto out;
> 
> But this is precisely where we'd bail if we freeze sb by ioctl_fsfreeze()
> but try to thaw by thaw_bdev(). ioctl_fsfreeze() does not touch
> bd_fsfreeze_count...

Ah, yes, I see that now, thanks!

  Luis


Re: [PATCH 1/3] fs: move documentation for thaw_super() where appropriate

2018-04-20 Thread Luis R. Rodriguez
On Sat, Apr 21, 2018 at 12:01:31AM +, Bart Van Assche wrote:
> On Fri, 2018-04-20 at 16:59 -0700, Luis R. Rodriguez wrote:
> > +/**
> > + * thaw_super -- unlock filesystem
> > + * @sb: the super to thaw
> > + *
> > + * Unlocks the filesystem and marks it writeable again after 
> > freeze_super().
> > + */
> 
> Have you verified the output generated by scripts/kernel-doc? Last
> time I checked the output for headers containing a double dash looked
> weird.

No, I just moved the block of kdoc crap. Randy would have this memorized I'm
sure though so I should just fix the kdoc if its bad while at it.

  Luis


Re: [PATCH 0/3] fs: minor fs thaw fixes and adjustments

2018-04-20 Thread Luis R. Rodriguez
On Fri, Apr 20, 2018 at 04:59:01PM -0700, Luis R. Rodriguez wrote:
> Here's a few minor fs thaw fixes and adjustments I ran accross
> as I started to refresh my development for to use freeze_fs on
> suspend/hibernation [0]. These are just prep commits for the real
> work, I'm still reviewing feedback and adjusting the code for
> that.

And I forgot to provide the URL, for those that missed the old series
I was working on:

[0] https://lkml.kernel.org/r/20171129232356.28296-1-mcg...@kernel.org

  Luis
> 
> I've tested the patches with generic/085 on xfs and found no regressions.
> 
> Luis R. Rodriguez (3):
>   fs: move documentation for thaw_super() where appropriate
>   fs: make thaw_super_locked() really just a helper
>   fs: fix corner case race on freeze_bdev() when sb disappears
> 
>  fs/block_dev.c |  4 +++-
>  fs/super.c | 39 ++-
>  2 files changed, 29 insertions(+), 14 deletions(-)
> 
> -- 
> 2.16.3
> 
> 

-- 
Do not panic


[PATCH 3/3] fs: fix corner case race on freeze_bdev() when sb disappears

2018-04-20 Thread Luis R. Rodriguez
freeze_bdev() will bail but leave the bd_fsfreeze_count incremented
if the get_active_super() does not find the superblock on our
super_blocks list to match.

This issue has been present since v2.6.29 during the introduction of the
ioctl_fsfreeze() and ioctl_fsthaw() via commit fcccf502540e3 ("filesystem
freeze: implement generic freeze feature").

I am not aware of any existing races which have triggered this
situation, however, if it does trigger it could mean leaving a
superblock with bd_fsfreeze_count always positive.

Fixes: fcccf502540e3 ("filesystem freeze: implement generic freeze feature")
Signed-off-by: Luis R. Rodriguez 
---
 fs/block_dev.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/fs/block_dev.c b/fs/block_dev.c
index b54966679833..7a532aa58c07 100644
--- a/fs/block_dev.c
+++ b/fs/block_dev.c
@@ -507,8 +507,10 @@ struct super_block *freeze_bdev(struct block_device *bdev)
}
 
sb = get_active_super(bdev);
-   if (!sb)
+   if (!sb) {
+   bdev->bd_fsfreeze_count--;
goto out;
+   }
if (sb->s_op->freeze_super)
error = sb->s_op->freeze_super(sb);
else
-- 
2.16.3



[PATCH 2/3] fs: make thaw_super_locked() really just a helper

2018-04-20 Thread Luis R. Rodriguez
thaw_super_locked() was added via commit 08fdc8a0138a ("buffer.c: call
thaw_super during emergency thaw") merged on v4.17 to help with the
ability so that the caller can take charge of handling the s_umount lock,
however, it has left all* of the failure handling including unlocking
lock of s_umount inside thaw_super_locked().

This does not make thaw_super_locked() flexible. For instance we may
later want to use it with the abilty to handle unfolding of the locks
ourselves.

Change thaw_super_locked() to really just be a helper, and let the
callers deal with all the error handling.

This commit introeuces no functional changes.

Signed-off-by: Luis R. Rodriguez 
---
 fs/super.c | 27 ---
 1 file changed, 20 insertions(+), 7 deletions(-)

diff --git a/fs/super.c b/fs/super.c
index 9d0eb5e20a1f..82bc74a16f06 100644
--- a/fs/super.c
+++ b/fs/super.c
@@ -937,10 +937,15 @@ void emergency_remount(void)
 
 static void do_thaw_all_callback(struct super_block *sb)
 {
+   int error;
+
down_write(&sb->s_umount);
if (sb->s_root && sb->s_flags & MS_BORN) {
emergency_thaw_bdev(sb);
-   thaw_super_locked(sb);
+   error = thaw_super_locked(sb);
+   if (error)
+   up_write(&sb->s_umount);
+   deactivate_locked_super(sb);
} else {
up_write(&sb->s_umount);
}
@@ -1532,14 +1537,13 @@ int freeze_super(struct super_block *sb)
 }
 EXPORT_SYMBOL(freeze_super);
 
+/* Caller takes the sb->s_umount rw_semaphore lock and handles active count */
 static int thaw_super_locked(struct super_block *sb)
 {
int error;
 
-   if (sb->s_writers.frozen != SB_FREEZE_COMPLETE) {
-   up_write(&sb->s_umount);
+   if (sb->s_writers.frozen != SB_FREEZE_COMPLETE)
return -EINVAL;
-   }
 
if (sb_rdonly(sb)) {
sb->s_writers.frozen = SB_UNFROZEN;
@@ -1554,7 +1558,6 @@ static int thaw_super_locked(struct super_block *sb)
printk(KERN_ERR
"VFS:Filesystem thaw failed\n");
lockdep_sb_freeze_release(sb);
-   up_write(&sb->s_umount);
return error;
}
}
@@ -1563,7 +1566,6 @@ static int thaw_super_locked(struct super_block *sb)
sb_freeze_unlock(sb);
 out:
wake_up(&sb->s_writers.wait_unfrozen);
-   deactivate_locked_super(sb);
return 0;
 }
 
@@ -1575,7 +1577,18 @@ static int thaw_super_locked(struct super_block *sb)
  */
 int thaw_super(struct super_block *sb)
 {
+   int error;
+
down_write(&sb->s_umount);
-   return thaw_super_locked(sb);
+   error = thaw_super_locked(sb);
+   if (error) {
+   up_write(&sb->s_umount);
+   goto out;
+   }
+
+   deactivate_locked_super(sb);
+
+out:
+   return error;
 }
 EXPORT_SYMBOL(thaw_super);
-- 
2.16.3



[PATCH 0/3] fs: minor fs thaw fixes and adjustments

2018-04-20 Thread Luis R. Rodriguez
Here's a few minor fs thaw fixes and adjustments I ran accross
as I started to refresh my development for to use freeze_fs on
suspend/hibernation [0]. These are just prep commits for the real
work, I'm still reviewing feedback and adjusting the code for
that.

I've tested the patches with generic/085 on xfs and found no regressions.

Luis R. Rodriguez (3):
  fs: move documentation for thaw_super() where appropriate
  fs: make thaw_super_locked() really just a helper
  fs: fix corner case race on freeze_bdev() when sb disappears

 fs/block_dev.c |  4 +++-
 fs/super.c | 39 ++-
 2 files changed, 29 insertions(+), 14 deletions(-)

-- 
2.16.3



[PATCH 1/3] fs: move documentation for thaw_super() where appropriate

2018-04-20 Thread Luis R. Rodriguez
On commit 08fdc8a0138a ("buffer.c: call thaw_super during emergency thaw")
Mateusz added thaw_super_locked() and made thaw_super() use it, but
forgot to move the documentation.

Signed-off-by: Luis R. Rodriguez 
---
 fs/super.c | 12 ++--
 1 file changed, 6 insertions(+), 6 deletions(-)

diff --git a/fs/super.c b/fs/super.c
index 5fa9a8d8d865..9d0eb5e20a1f 100644
--- a/fs/super.c
+++ b/fs/super.c
@@ -1532,12 +1532,6 @@ int freeze_super(struct super_block *sb)
 }
 EXPORT_SYMBOL(freeze_super);
 
-/**
- * thaw_super -- unlock filesystem
- * @sb: the super to thaw
- *
- * Unlocks the filesystem and marks it writeable again after freeze_super().
- */
 static int thaw_super_locked(struct super_block *sb)
 {
int error;
@@ -1573,6 +1567,12 @@ static int thaw_super_locked(struct super_block *sb)
return 0;
 }
 
+/**
+ * thaw_super -- unlock filesystem
+ * @sb: the super to thaw
+ *
+ * Unlocks the filesystem and marks it writeable again after freeze_super().
+ */
 int thaw_super(struct super_block *sb)
 {
down_write(&sb->s_umount);
-- 
2.16.3



Re: [PATCH 03/11] fs: add frozen sb state helpers

2018-04-20 Thread Luis R. Rodriguez
On Tue, Apr 17, 2018 at 05:59:36PM -0700, Luis R. Rodriguez wrote:
> On Thu, Dec 21, 2017 at 12:03:29PM +0100, Jan Kara wrote:
> > Hello,
> > 
> > I think I owe you a reply here... Sorry that it took so long.
> 
> Took me just as long :)
> 
> > On Fri 01-12-17 22:13:27, Luis R. Rodriguez wrote:
> > > 
> > > I'll note that its still not perfectly clear if really the semantics 
> > > behind
> > > freeze_bdev() match what I described above fully. That still needs to be
> > > vetted for. For instance, does thaw_bdev() keep a superblock frozen if we
> > > an ioctl initiated freeze had occurred before? If so then great. Otherwise
> > > I think we'll need to distinguish the ioctl interface. Worst possible case
> > > is that bdev semantics and in-kernel semantics differ somehow, then that
> > > will really create a holy fucking mess.
> > 
> > I believe nobody really thought about mixing those two interfaces to fs
> > freezing and so the behavior is basically defined by the implementation.
> > That is:
> > 
> > freeze_bdev() on sb frozen by ioctl_fsfreeze() -> EBUSY
> > freeze_bdev() on sb frozen by freeze_bdev() -> success
> > ioctl_fsfreeze() on sb frozen by freeze_bdev() -> EBUSY
> > ioctl_fsfreeze() on sb frozen by ioctl_fsfreeze() -> EBUSY
> > 
> > thaw_bdev() on sb frozen by ioctl_fsfreeze() -> EINVAL
> 
> Phew, so this is what we want for the in-kernel freezing so we're good
> and *can* combine these then.

I double checked, and I don't see where you get EINVAL for this case.
We *do* keep the sb frozen though, which is good, and the worst fear
I had was that we did not. However we return 0 if there was already
a prior freeze_bdev() or ioctl_fsfreeze() other than the context that
started the prior freeze (--bdev->bd_fsfreeze_count > 0).

The -EINVAL is only returned currently if there were no freezers.

int thaw_bdev(struct block_device *bdev, struct super_block *sb)
{
int error = -EINVAL;

mutex_lock(&bdev->bd_fsfreeze_mutex);
if (!bdev->bd_fsfreeze_count)
goto out;

error = 0;
if (--bdev->bd_fsfreeze_count > 0)
goto out;
...
out:
mutex_unlock(&bdev->bd_fsfreeze_mutex);
return error;
}

  Luis


Re: [PATCH 03/11] fs: add frozen sb state helpers

2018-04-17 Thread Luis R. Rodriguez
On Thu, Dec 21, 2017 at 12:03:29PM +0100, Jan Kara wrote:
> Hello,
> 
> I think I owe you a reply here... Sorry that it took so long.

Took me just as long :)

> On Fri 01-12-17 22:13:27, Luis R. Rodriguez wrote:
> > 
> > I'll note that its still not perfectly clear if really the semantics behind
> > freeze_bdev() match what I described above fully. That still needs to be
> > vetted for. For instance, does thaw_bdev() keep a superblock frozen if we
> > an ioctl initiated freeze had occurred before? If so then great. Otherwise
> > I think we'll need to distinguish the ioctl interface. Worst possible case
> > is that bdev semantics and in-kernel semantics differ somehow, then that
> > will really create a holy fucking mess.
> 
> I believe nobody really thought about mixing those two interfaces to fs
> freezing and so the behavior is basically defined by the implementation.
> That is:
> 
> freeze_bdev() on sb frozen by ioctl_fsfreeze() -> EBUSY

Note below as well on your *future* freeze_super() implementation.

> freeze_bdev() on sb frozen by freeze_bdev() -> success
> ioctl_fsfreeze() on sb frozen by freeze_bdev() -> EBUSY
> ioctl_fsfreeze() on sb frozen by ioctl_fsfreeze() -> EBUSY
> 
> thaw_bdev() on sb frozen by ioctl_fsfreeze() -> EINVAL

Phew, so this is what we want for the in-kernel freezing so we're good
and *can* combine these then.

> ioctl_fsthaw() on sb frozen by freeze_bdev() -> success
> 
> What I propose is the following API:
> 
> freeze_super_excl()
>   - freezes superblock, returns EBUSY if the superblock is already frozen
> (either by another freeze_super_excl() or by freeze_super())
> freeze_super()
>   - this function will make sure superblock is frozen when the function
> returns with success. 

That's straight forward.

> It can be nested with other freeze_super() or
> freeze_super_excl() calls 

This is where it can get hairy. More below.

> (this second part is different from how
> freeze_bdev() behaves currently but AFAICT this behavior is actually
> what all current users of freeze_bdev() really want - just make sure
> fs cannot be written to)

If we can agree to this, then sure. However there are two types of
possible nested calls to consider, one where the sb was already frozen
by an IOCTL, and the other where it was initiated by either another
freeze_super_excl() or another freeze_super() call which is currently
being processed. For the first type, its easy to say the device is
already frozen as such return success. If the freezing is ongoing,
we may want to wait or not wait, and this will depend on our current
use cases for freeze_bdev().

As you noted above, freeze_bdev() currently returns EBUSY if we had
the sb already frozen by ioctl_fsfreeze(). It may be a welcomed
enhancement to correct the semantics first to address the first case,
but keep the EBUSY for the other case. A secondary patch could then
add a completion mechanism and let callers decide to either wait or not.
*Iff* the caller did not opt-in to wait we keep the EBUSY return.

Seem reasonable?

I'll address the rest of the mail later.

  Luis


Re: [PATCH 00/11] fs: use freeze_fs on suspend/hibernate

2017-12-19 Thread Luis R. Rodriguez
On Wed, Dec 13, 2017 at 02:09:49AM +0100, Rafael J. Wysocki wrote:
> I'm assuming an update of this to be posted due to the comments from Jan
> on patch [3/11] and patch [7/11] probably.
> 
> Is there anything else that needs to be addressed?

I was waiting on Jan Kara's feedback on how he'd like to proceed with the
unthawing on error given his point on that the device mapper API seems to match
the in kernel automatic freezing just that I didn't use that same interface.

0-day did come back with one RCU issue which I also have to address:

[  422.919958] kernel BUG at kernel/rcu/sync.c:228! 

  
[  422.920115] invalid opcode:  [#1] SMP

  
[  422.920212] Modules linked in: rpcsec_gss_krb5 auth_rpcgss nfsv4 
dns_resolver netconsole sr_mod cdrom sd_mod sg snd_hda_codec_idt 
snd_hda_codec_generic intel_rapl x86_pkg_temp_thermal
+intel_powerclamp coretemp kvm irqbypass crct10dif_pclmul snd_hda_intel 
crc32_pclmul crc32c_intel snd_hda_codec snd_hda_core snd_hwdep i915 
ghash_clmulni_intel cryptd snd_pcm pcspkr 
+drm_kms_helper snd_timer ahci libahci syscopyarea sysfillrect sysimgblt 
fb_sys_fops snd libata shpchp soundcore drm video ip_tables 
 
[  422.921168] CPU: 2 PID: 237 Comm: kworker/2:3 Not tainted 
4.15.0-rc1-00030-gf95c16a #1
 
[  422.921347] Hardware name: Hewlett-Packard p6-1451cx/2ADA, BIOS 8.15 
02/05/2013  
  
[  422.921515] Workqueue: events destroy_super_work 

  
[  422.921628] task: 8801bfcd task.stack: c90001128000  

  
[  422.921768] RIP: 0010:rcu_sync_dtor+0x65/0x70

  
[  422.921874] RSP: :c9000112be60 EFLAGS: 00010286  

  
[  422.921985] RAX: 0008 RBX: 8801c00793d8 RCX: 
0001fece
  
[  422.922133] RDX: fff6 RSI: 0282 RDI: 
8801c00793d8
  
[  422.922283] RBP: 880212f1b6c0 R08:  R09: 
009c
  
[  422.922432] R10:  R11:  R12: 
880212f1f800
  
[  422.922579] R13:  R14: 8801c036db40 R15: 
8801c00795b0
  
[  422.922728] FS:  () GS:880212f0() 
knlGS:  
 
[  422.922931] CS:  0010 DS:  ES:  CR0: 80050033

  
[  422.923054] CR2: 7ffe2f820ff8 CR3: 01e09002 CR4: 
001606e0
  
[  422.923203] Call Trace:  

  
[  422.923266]  percpu_free_rwsem+0x15/0x30 

  
[  422.923357]  destroy_super_work+0x3d/0x50

  
[  422.923449]  process_one_work+0x18f/0x3e0

  
[  422.923540]

Re: [PATCH 03/11] fs: add frozen sb state helpers

2017-12-01 Thread Luis R. Rodriguez
On Fri, Dec 01, 2017 at 12:47:24PM +0100, Jan Kara wrote:
> On Thu 30-11-17 20:05:48, Luis R. Rodriguez wrote:
> > On Thu, Nov 30, 2017 at 06:13:10PM +0100, Jan Kara wrote:
> > > ... I dislike the _by_user() suffix as there may be different places that
> > > call freeze_super() (e.g. device mapper does this during some operations).
> > > Clearly we need to distinguish "by system suspend" and "the other" cases.
> > > So please make this clear in the naming.
> > 
> > Ah. How about sb_frozen_by_cb() ?
> 
> And what does 'cb' stand for? :)

Callback. But let me think about bdev usage a bit and we can worry about the
bikeshedding later.

> > > In fact, what might be a cleaner solution is to introduce a 'freeze_count'
> > > for superblock freezing (we already do have this for block devices). Then
> > > you don't need to differentiate these two cases - but you'd still need to
> > > properly handle cleanup if freezing of all superblocks fails in the 
> > > middle.
> > > So I'm not 100% this works out nicely in the end. But it's certainly worth
> > > a consideration.
> > 
> > Ah, there are three important reasons for doing it the way I did it which 
> > are
> > easy to miss, unless you read the commit log message very carefully.
> > 
> > 0) The ioctl interface causes a failure to be sent back to userspace if
> > you issue two consecutive freezes, or two thaws. Ie, once a filesystem is
> > frozen, a secondary call will result in an error. Likewise for thaw.
> 
> Yep. But also note that there's *another* interface to filesystem freezing
> which behaves differently - freeze_bdev() (used internally by dm). That
> interface uses the counter and freezing of already frozen device succeeds.

Ah... so freeze_bdev() semantics matches the same semantics I was looking
for.

> IOW it is a mess.

To say the least.

> We cannot change the behavior of the ioctl but we could
> still provide an in-kernel interface to freeze_super() with the same
> semantics as freeze_bdev() which might be easier to use by suspend - maybe
> we could call it 'exclusive' (for the current freeze_super() semantics) and
> 'non-exclusive' (for the freeze_bdev() semantics) since this is very much
> like O_EXCL open of block devices...

Sure, now typically I see we make exclusive calls with the postfix _excl() so
I take it you'd be OK in renaming freeze_super() freeze_super_excl() eventually
then?

I totally missed freeze_bdev() otherwise I think I would have picked up on the
shared semantics stuff and I would have just made a helper out of what
freeze_bdev() uses, and then have both in-kernel and freeze_bdev() use it.

I'll note that its still not perfectly clear if really the semantics behind
freeze_bdev() match what I described above fully. That still needs to be
vetted for. For instance, does thaw_bdev() keep a superblock frozen if we
an ioctl initiated freeze had occurred before? If so then great. Otherwise
I think we'll need to distinguish the ioctl interface. Worst possible case
is that bdev semantics and in-kernel semantics differ somehow, then that
will really create a holy fucking mess.

> > 1) The new iterate supers stuff I added bail on the first error and return 
> > that
> > error. If we kept the ioctl() interface error scheme we'd be erroring out
> > if on suspend if userspace had already frozen a filesystem. Clearly that'd
> > be silly so we need to distinguish between the automatic kernel freezing
> > and the old userspace ioctl initiated interface, so that we can keep the
> > old behaviour but allow in-kernel auto freeze on suspend to work properly.
> 
> This would work fine with the non-exclusive semantics I believe.

Groovy.

> > 2) If we fail to suspend we need to then thaw up all filesystems. The order
> > in which we try to freeze is in reverse order on the super_block list. If we
> > fail though we iterate in proper order on the super_block list and thaw. If
> > you had two filesystems this means that if a failure happened on freezing
> > the first filesystem, we'd first thaw the other filesystem -- and because of
> > 0) if we don't distinguish between the ioctl interface or auto freezing, 
> > we'd
> > also fail on thaw'ing given the other superblock wouldn't have been frozen.
> > 
> > So we need to keep two separate approaches. The count stuff would not 
> > suffice
> > to distinguish origin of source for freeze call.
> > 
> > Come to think of it, I think I forgot to avoid thaw if the freeze was ioctl
> > initiated..
> > 
> > thaw_unlocked(bool

Re: [PATCH 00/11] fs: use freeze_fs on suspend/hibernate

2017-11-30 Thread Luis R. Rodriguez
On Thu, Nov 30, 2017 at 10:51:57PM +0100, Pavel Machek wrote:
> Hi!
> 
> > Proposed solution:
> > 
> > Instead of fixing such semantics and trying to get all filesystems to do it
> > right, we can easily do away with all freezing calls if the filesystem
> > implements a proper freeze_fs() callback. The following 9 filesystems have
> > freeze_fs() implemented as such we can let the kernel issue the callback 
> > upon
> > suspend and thaw on resume automatically on our behalf.
> > 
> >   o xfs
> >   o reiserfs
> >   o nilfs2
> >   o jfs
> >   o f2fs
> >   o ext4
> >   o ext2
> >   o btrfs
> > 
> > Of these, the following have freezer helpers, which can then be removed
> > after the kernel automaticaly calls freeze_fs for us on suspend:
> > 
> >   o xfs
> >   o nilfs2
> >   o jfs
> >   o f2fs
> >   o ext4
> > 
> > I've tested this on a system with ext4 and XFS, and have let 0-day go at
> > without issues. I have branches availabe for linux-next [3] and Linus'
> > latest tree [4].
> 
> Was hibernation tested? uswsusp?

I had not done a test before but I just did and it worked, in fact I was able
to keep my ssh connection to my qemu guest after resume from hibernation with
this.

root@piggy:~# echo shutdown > /sys/power/disk; echo disk > /sys/power/state

[   87.930446] PM: hibernation entry
[   87.936294] firmware_class: device_cache_fw_images
[   87.936363] PM: Syncing filesystems ... 
[   87.979960] PM: done.
[   87.980594] Freezing user space processes ... (elapsed 0.001 seconds) done.
[   87.983839] Freezing filesystems ... 
[   87.983844] xfs (sdb1): freezing
[   88.013313] ext4 (sda1): freezing
[   88.057635] done.
[   88.058242] OOM killer disabled.
...
[   88.145364] Freezing remaining freezable tasks ... (elapsed 0.001 seconds) 
done.
[   88.221583] Disabling non-boot CPUs ...
[   88.241231] Unregister pv shared memory for cpu 1
[   88.244139] smpboot: CPU 1 is now offline
[   88.273290] Unregister pv shared memory for cpu 2
[   88.276170] smpboot: CPU 2 is now offline
[   88.297296] Unregister pv shared memory for cpu 3
[   88.21] smpboot: CPU 3 is now offline
[   88.302665] PM: Creating hibernation image:
[   88.305633] PM: Need to copy 107866 pages
[   88.305633] PM: Normal pages needed: 107866 + 1024, available pages: 940488
...

At this point my qemu session ends. I start it up again.

   88.305633] Enabling non-boot CPUs ...
[   88.305633] x86: Booting SMP configuration:
[   88.305633] smpboot: Booting Node 0 Processor 1 APIC 0x1
[   88.244097] kvm-clock: cpu 1, msr 1:3ffef041, secondary cpu clock
[   88.334899] KVM setup async PF for cpu 1
[   88.335174] kvm-stealtime: cpu 1, msr 13fc8d9c0
...
[   88.424852] sd 0:0:0:0: [sda] Starting disk
[   88.424893] sd 0:0:1:0: [sdb] Starting disk
[   88.424928] sd 1:0:0:0: [sdc] Starting disk
[   88.473888] PM: Basic memory bitmaps freed
[   88.473890] OOM killer enabled.
[   88.474876] ext4 (sda1): thawing
[   88.585233] ata2.01: NODEV after polling detection
[   88.587805] ata2.00: configured for MWDMA2
[   88.589436] ata1.00: configured for MWDMA2
[   88.592421] ata1.01: configured for MWDMA2
[   88.601141] xfs (sdb1): thawing
[   88.602430] Restarting tasks ... done.
[   88.611055] PM: hibernation exit
[   90.492926] e1000: ens3 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: 
RX
[   98.780159] firmware_class: device_uncache_fw_images

And ssh is working after this, without requiring to initiate another connection.

Let me know if you'd like me to test something else.

  Luis


Re: [PATCH 00/11] fs: use freeze_fs on suspend/hibernate

2017-11-30 Thread Luis R. Rodriguez
On Thu, Nov 30, 2017 at 05:01:13PM +, Bart Van Assche wrote:
> It's great to see that you are making progress with this work :-) However,
> what's not clear to me is what your (long-term) plan is for freezing
> filesystems that e.g. exist on top of a md RAID1 block device?

The original approach of doing a full swing kill on kthread freezing helpers
in one shot was too much. We must therefore approach this turtle style and
carefully. This patch series currently only addresses filesystems which
have a freeze_fs() callback implemented, that's it. The rest of the kernel
needs more work.

So this work is far as I've gotten so far.

Note, it does *not* remove the kthread freezing API calls on any other drivers,
as such that should still work.

> The md resync
> thread must be stopped before a system is frozen. Today the md driver uses
> the kthread freezing mechanism for that purpose. Do you have a plan for
> handling the more complicated scenarios, e.g. a filesystem that exists on top
> of an md device where the md device uses one or more files as backing store
> and with the loop driver between the md device and the files?

Nope not yet. It seems you have given this some thought though so you're 
help here is greatly appreciated. In fact the way we should see the long
term 'kill kthread freezing' effort should be a collaborative one. I've
never touched md, so folks more familiar with md should give this some
thought.

Can for instance md register_pm_notifier() and register_syscore_ops()
and do handy work to pause some work to replace kthread freezing?
Note that I believe a pm notifier is needed in case syscore_suspend()
is not called, say on a suspend fail.

> How about filesystems implemented in user space using the FUSE driver?

Not sure, someone more familiar with FUSE drivers should chime in. Right
now all this does is leverage filesystems which *do* implement freeze_fs()
only, and removes the kthread freezing calls after that. That's all. It
should not affect any other filesystems

Likewise consider filesystems that implement freeze_super() instead (like gfs2)
which don't hold sb->s_umount -- see commit 48b6bca6b7b8309 ("fs: add
freeze_super/thaw_super fs hooks"). If its desirable to address those as well,
it would seem a pair of non-locking iters somehow are needed. More work to
be done.

This all needs to be considered for the future as well.

> Patch 6/11 of this series freezes user space processes before freezing
> filesystems. Will that work for FUSE filesystems?

It doesn't add new code which should negatively affect FUSE drivers.
Long term if we wanted to address FUSE, it may require other work.

In the future then fs_suspend_freeze() may need to work for different
types of filesystems. Right now only filesystems which implement freeze_fs()
are addressed.

  Luis


Re: [PATCH 03/11] fs: add frozen sb state helpers

2017-11-30 Thread Luis R. Rodriguez
On Thu, Nov 30, 2017 at 06:13:10PM +0100, Jan Kara wrote:
> ... I dislike the _by_user() suffix as there may be different places that
> call freeze_super() (e.g. device mapper does this during some operations).
> Clearly we need to distinguish "by system suspend" and "the other" cases.
> So please make this clear in the naming.

Ah. How about sb_frozen_by_cb() ?

> In fact, what might be a cleaner solution is to introduce a 'freeze_count'
> for superblock freezing (we already do have this for block devices). Then
> you don't need to differentiate these two cases - but you'd still need to
> properly handle cleanup if freezing of all superblocks fails in the middle.
> So I'm not 100% this works out nicely in the end. But it's certainly worth
> a consideration.

Ah, there are three important reasons for doing it the way I did it which are
easy to miss, unless you read the commit log message very carefully.

0) The ioctl interface causes a failure to be sent back to userspace if
you issue two consecutive freezes, or two thaws. Ie, once a filesystem is
frozen, a secondary call will result in an error. Likewise for thaw.

1) The new iterate supers stuff I added bail on the first error and return that
error. If we kept the ioctl() interface error scheme we'd be erroring out
if on suspend if userspace had already frozen a filesystem. Clearly that'd
be silly so we need to distinguish between the automatic kernel freezing
and the old userspace ioctl initiated interface, so that we can keep the
old behaviour but allow in-kernel auto freeze on suspend to work properly.

2) If we fail to suspend we need to then thaw up all filesystems. The order
in which we try to freeze is in reverse order on the super_block list. If we
fail though we iterate in proper order on the super_block list and thaw. If
you had two filesystems this means that if a failure happened on freezing
the first filesystem, we'd first thaw the other filesystem -- and because of
0) if we don't distinguish between the ioctl interface or auto freezing, we'd
also fail on thaw'ing given the other superblock wouldn't have been frozen.

So we need to keep two separate approaches. The count stuff would not suffice
to distinguish origin of source for freeze call.

Come to think of it, I think I forgot to avoid thaw if the freeze was ioctl
initiated..

thaw_unlocked(bool cb_call)
{
  if (sb_frozen_by_cb(sb) && !cb_call)
return 0; /* skip as the user wanted to keep this fs frozen */
  ...
}

Even though the kernel auto call is new I think we need to keep ioctl initiated
frozen filesystems frozen to not break old userspace assumptions.

So, keeping all this in mind, does a count method still suffice?

  Luis


Re: [PATCH 05/11] fs: add iterate_supers_excl() and iterate_supers_reverse_excl()

2017-11-29 Thread Luis R. Rodriguez
On Thu, Nov 30, 2017 at 12:48:15AM +0100, Rafael J. Wysocki wrote:
> On Thu, Nov 30, 2017 at 12:23 AM, Luis R. Rodriguez  wrote:
> > +int iterate_supers_excl(int (*f)(struct super_block *, void *), void *arg)
> > +{
> > +   struct super_block *sb, *p = NULL;
> > +   int error = 0;
> > +
> > +   spin_lock(&sb_lock);
> > +   list_for_each_entry(sb, &super_blocks, s_list) {
> > +   if (hlist_unhashed(&sb->s_instances))
> > +   continue;
> > +   sb->s_count++;
> > +   spin_unlock(&sb_lock);
> 
> Can anything bad happen if the list is modified at this point by a
> concurrent thread?

The race is theoretical and applies to all users of iterate_supers() as well.

Its certainly worth considering, however given other code is implicated its
not a *new* issue or race. Its the best we can do with the current design.

That said, as I looked at all this code I considered that perhaps super_blocks
deserves its own RCU lock to enable us to do full swap operations on the list,
without having to do these nasty releases in between.

If that's possible / desirable I'd consider it a welcomed future optimization,
and I could give it a shot, however its unclear if this is a requirement for
this feature at this time.

  Luis


[PATCH 00/11] fs: use freeze_fs on suspend/hibernate

2017-11-29 Thread Luis R. Rodriguez
This is a followup from the original RFC which proposed to start
to kill kthread freezing all together [0]. Instead of going straight
out to the jugular for kthread freezing this series only addresses
killing freezer calls on filesystems which implement freeze_fs, after
we let the kernel freeze these filesystems for us on suspend.

This approach puts on a slow but steady path towards the original goal
though. Each subsystem could look for similar solutions. Even with
filesystems we're not all done yet, after this we'll still have to
decide what to do about filesystems which do not implement freeze_fs().

Motivation and problem:

kthreads have some semantics for freezing, which helps the kernel
freeze them when a system is going to suspend or hibernation. These
semantics are not well defined though, and it actually turns out
pretty hard to get it right.

Without a proper solution suspend and hibernation are fragile on filesystems,
it can easily break suspend and fixing such issues are in no way trivial [1]
[2].

Proposed solution:

Instead of fixing such semantics and trying to get all filesystems to do it
right, we can easily do away with all freezing calls if the filesystem
implements a proper freeze_fs() callback. The following 9 filesystems have
freeze_fs() implemented as such we can let the kernel issue the callback upon
suspend and thaw on resume automatically on our behalf.

  o xfs
  o reiserfs
  o nilfs2
  o jfs
  o f2fs
  o ext4
  o ext2
  o btrfs

Of these, the following have freezer helpers, which can then be removed
after the kernel automaticaly calls freeze_fs for us on suspend:

  o xfs
  o nilfs2
  o jfs
  o f2fs
  o ext4

I've tested this on a system with ext4 and XFS, and have let 0-day go at
without issues. I have branches availabe for linux-next [3] and Linus'
latest tree [4].

Further testing, thoughts, reviews, flames are all equally appreciated.

[0] https://lkml.kernel.org/r/20171003185313.1017-1-mcg...@kernel.org
[1] https://bugzilla.suse.com/show_bug.cgi?id=1043449
[2] https://lkml.kernel.org/r/20171113103139.ga18...@yu-chen.sh.intel.com
[3] 
https://git.kernel.org/pub/scm/linux/kernel/git/mcgrof/linux-next.git/log/?h=20171129-fs-freeze-cleanup
[4] 
https://git.kernel.org/pub/scm/linux/kernel/git/mcgrof/linux.git/log/?h=20171129-fs-freeze-cleanup

Luis R. Rodriguez (11):
  fs: provide unlocked helper for freeze_super()
  fs: provide unlocked helper thaw_super()
  fs: add frozen sb state helpers
  fs: distinguish between user initiated freeze and kernel initiated
freeze
  fs: add iterate_supers_excl() and iterate_supers_reverse_excl()
  fs: freeze on suspend and thaw on resume
  xfs: remove not needed freezing calls
  ext4: remove not needed freezing calls
  f2fs: remove not needed freezing calls
  nilfs2: remove not needed freezing calls
  jfs: remove not needed freezing calls

 fs/ext4/ext4_jbd2.c|   2 +-
 fs/ext4/super.c|   2 -
 fs/f2fs/gc.c   |   5 +-
 fs/f2fs/segment.c  |   6 +-
 fs/jfs/jfs_logmgr.c|  11 +-
 fs/jfs/jfs_txnmgr.c|  31 ++---
 fs/nilfs2/segment.c|  48 
 fs/super.c | 320 -
 fs/xfs/xfs_trans.c |   2 +-
 fs/xfs/xfs_trans_ail.c |   7 +-
 include/linux/fs.h |  63 +-
 kernel/power/process.c |  15 ++-
 12 files changed, 378 insertions(+), 134 deletions(-)

-- 
2.15.0



[PATCH 03/11] fs: add frozen sb state helpers

2017-11-29 Thread Luis R. Rodriguez
The question of whether or not a superblock is frozen needs to be
augmented in the future to account for differences between a user
initiated freeze and a kernel initiated freeze done automatically
on behalf of the kernel.

Provide helpers so that these can be used instead so that we don't
have to expand checks later in these same call sites as we expand
the definition of a frozen superblock.

Signed-off-by: Luis R. Rodriguez 
---
 fs/ext4/ext4_jbd2.c |  2 +-
 fs/super.c  |  4 ++--
 fs/xfs/xfs_trans.c  |  2 +-
 include/linux/fs.h  | 33 +
 4 files changed, 37 insertions(+), 4 deletions(-)

diff --git a/fs/ext4/ext4_jbd2.c b/fs/ext4/ext4_jbd2.c
index 2d593201cf7a..090b8cd4551a 100644
--- a/fs/ext4/ext4_jbd2.c
+++ b/fs/ext4/ext4_jbd2.c
@@ -50,7 +50,7 @@ static int ext4_journal_check_start(struct super_block *sb)
 
if (sb_rdonly(sb))
return -EROFS;
-   WARN_ON(sb->s_writers.frozen == SB_FREEZE_COMPLETE);
+   WARN_ON(sb_is_frozen(sb));
journal = EXT4_SB(sb)->s_journal;
/*
 * Special case here: if the journal has aborted behind our
diff --git a/fs/super.c b/fs/super.c
index cecc279beecd..e8f5a7139b8f 100644
--- a/fs/super.c
+++ b/fs/super.c
@@ -1392,7 +1392,7 @@ static int freeze_locked_super(struct super_block *sb)
 {
int ret;
 
-   if (sb->s_writers.frozen != SB_UNFROZEN)
+   if (!sb_is_unfrozen(sb))
return -EBUSY;
 
if (!(sb->s_flags & SB_BORN))
@@ -1498,7 +1498,7 @@ static int thaw_locked_super(struct super_block *sb)
 {
int error;
 
-   if (sb->s_writers.frozen != SB_FREEZE_COMPLETE)
+   if (!sb_is_frozen(sb))
return -EINVAL;
 
if (sb_rdonly(sb)) {
diff --git a/fs/xfs/xfs_trans.c b/fs/xfs/xfs_trans.c
index a87f657f59c9..b1180c26d902 100644
--- a/fs/xfs/xfs_trans.c
+++ b/fs/xfs/xfs_trans.c
@@ -241,7 +241,7 @@ xfs_trans_alloc(
if (!(flags & XFS_TRANS_NO_WRITECOUNT))
sb_start_intwrite(mp->m_super);
 
-   WARN_ON(mp->m_super->s_writers.frozen == SB_FREEZE_COMPLETE);
+   WARN_ON(sb_is_frozen(mp->m_super));
atomic_inc(&mp->m_active_trans);
 
tp = kmem_zone_zalloc(xfs_trans_zone,
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 511fbaabf624..1e10239c1d3b 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1589,6 +1589,39 @@ static inline void sb_start_intwrite(struct super_block 
*sb)
__sb_start_write(sb, SB_FREEZE_FS, true);
 }
 
+/**
+ * sb_is_frozen_by_user - is superblock frozen by a user call
+ * @sb: the super to check
+ *
+ * Returns true if the super freeze was initiated by userspace, for instance,
+ * an ioctl call.
+ */
+static inline bool sb_is_frozen_by_user(struct super_block *sb)
+{
+   return sb->s_writers.frozen == SB_FREEZE_COMPLETE;
+}
+
+/**
+ * sb_is_frozen - is superblock frozen
+ * @sb: the super to check
+ *
+ * Returns true if the super is frozen.
+ */
+static inline bool sb_is_frozen(struct super_block *sb)
+{
+   return sb_is_frozen_by_user(sb);
+}
+
+/**
+ * sb_is_unfrozen - is superblock unfrozen
+ * @sb: the super to check
+ *
+ * Returns true if the super is unfrozen.
+ */
+static inline bool sb_is_unfrozen(struct super_block *sb)
+{
+   return sb->s_writers.frozen == SB_UNFROZEN;
+}
 
 extern bool inode_owner_or_capable(const struct inode *inode);
 
-- 
2.15.0



[PATCH 04/11] fs: distinguish between user initiated freeze and kernel initiated freeze

2017-11-29 Thread Luis R. Rodriguez
Userspace can initiate a freeze call using ioctls. If the kernel decides
to freeze a filesystem later it must be able to distinguish if userspace
had initiated the freeze, so that it does not unfreeze it later
automatically on resume.

Likewise if the kernel is initiating a freeze on its own it should *not*
fail to freeze a filesystem if a user had already frozen it on our behalf.
This same concept applies to thawing, even if its not possible for
userspace to beat the kernel in thawing a filesystem. This logic however
has never applied to userspace freezing and thawing, two consecutive
userspace freeze calls will results in only the first one succeeding, so
we must retain the same behaviour in userspace.

This doesn't implement yet kernel initiated filesystem freeze calls,
this will be done in subsequent calls. This change should introduce
no functional changes, it just extends the definitions a frozen
filesystem to account for future kernel initiated filesystem freeze.

Signed-off-by: Luis R. Rodriguez 
---
 fs/super.c | 19 ++-
 include/linux/fs.h | 17 +++--
 2 files changed, 29 insertions(+), 7 deletions(-)

diff --git a/fs/super.c b/fs/super.c
index e8f5a7139b8f..a63513d187e8 100644
--- a/fs/super.c
+++ b/fs/super.c
@@ -1388,10 +1388,13 @@ static void sb_freeze_unlock(struct super_block *sb)
 }
 
 /* Caller takes lock and handles active count */
-static int freeze_locked_super(struct super_block *sb)
+static int freeze_locked_super(struct super_block *sb, bool usercall)
 {
int ret;
 
+   if (!usercall && sb_is_frozen(sb))
+   return 0;
+
if (!sb_is_unfrozen(sb))
return -EBUSY;
 
@@ -1436,7 +1439,10 @@ static int freeze_locked_super(struct super_block *sb)
 * For debugging purposes so that fs can warn if it sees write activity
 * when frozen is set to SB_FREEZE_COMPLETE, and for thaw_super().
 */
-   sb->s_writers.frozen = SB_FREEZE_COMPLETE;
+   if (usercall)
+   sb->s_writers.frozen = SB_FREEZE_COMPLETE;
+   else
+   sb->s_writers.frozen = SB_FREEZE_COMPLETE_AUTO;
lockdep_sb_freeze_release(sb);
return 0;
 }
@@ -1481,7 +1487,7 @@ int freeze_super(struct super_block *sb)
atomic_inc(&sb->s_active);
 
down_write(&sb->s_umount);
-   error = freeze_locked_super(sb);
+   error = freeze_locked_super(sb, true);
if (error) {
deactivate_locked_super(sb);
goto out;
@@ -1494,10 +1500,13 @@ int freeze_super(struct super_block *sb)
 EXPORT_SYMBOL(freeze_super);
 
 /* Caller takes lock and handles active count */
-static int thaw_locked_super(struct super_block *sb)
+static int thaw_locked_super(struct super_block *sb, bool usercall)
 {
int error;
 
+   if (!usercall && sb_is_unfrozen(sb))
+   return 0;
+
if (!sb_is_frozen(sb))
return -EINVAL;
 
@@ -1536,7 +1545,7 @@ int thaw_super(struct super_block *sb)
int error;
 
down_write(&sb->s_umount);
-   error = thaw_locked_super(sb);
+   error = thaw_locked_super(sb, true);
if (error) {
up_write(&sb->s_umount);
goto out;
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 1e10239c1d3b..107725b20fad 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1324,9 +1324,10 @@ enum {
SB_FREEZE_FS = 3,   /* For internal FS use (e.g. to stop
 * internal threads if needed) */
SB_FREEZE_COMPLETE = 4, /* ->freeze_fs finished successfully */
+   SB_FREEZE_COMPLETE_AUTO = 5,/* same but initiated automatically */
 };
 
-#define SB_FREEZE_LEVELS (SB_FREEZE_COMPLETE - 1)
+#define SB_FREEZE_LEVELS (SB_FREEZE_COMPLETE_AUTO - 1)
 
 struct sb_writers {
int frozen; /* Is sb frozen? */
@@ -1601,6 +1602,18 @@ static inline bool sb_is_frozen_by_user(struct 
super_block *sb)
return sb->s_writers.frozen == SB_FREEZE_COMPLETE;
 }
 
+/**
+ * sb_is_frozen_by_kernel - is superblock frozen by the kernel automatically
+ * @sb: the super to check
+ *
+ * Returns true if the super freeze was initiated by the kernel, automatically,
+ * for instance during system sleep or hibernation.
+ */
+static inline bool sb_is_frozen_by_kernel(struct super_block *sb)
+{
+   return sb->s_writers.frozen == SB_FREEZE_COMPLETE_AUTO;
+}
+
 /**
  * sb_is_frozen - is superblock frozen
  * @sb: the super to check
@@ -1609,7 +1622,7 @@ static inline bool sb_is_frozen_by_user(struct 
super_block *sb)
  */
 static inline bool sb_is_frozen(struct super_block *sb)
 {
-   return sb_is_frozen_by_user(sb);
+   return sb_is_frozen_by_user(sb) || sb_is_frozen_by_kernel(sb);
 }
 
 /**
-- 
2.15.0



[PATCH 01/11] fs: provide unlocked helper for freeze_super()

2017-11-29 Thread Luis R. Rodriguez
freeze_super() holds a write lock, however we wish to also enable
callers which already hold the write lock. To do this provide a helper
and make freeze_super() use it. This way, all that freeze_super() does
now is lock handling and active count management.

This change has no functional changes.

Suggested-by: Dave Chinner 
Signed-off-by: Luis R. Rodriguez 
---
 fs/super.c | 100 +
 1 file changed, 55 insertions(+), 45 deletions(-)

diff --git a/fs/super.c b/fs/super.c
index d4e33e8f1e6f..a7650ff22f0e 100644
--- a/fs/super.c
+++ b/fs/super.c
@@ -1387,59 +1387,20 @@ static void sb_freeze_unlock(struct super_block *sb)
percpu_up_write(sb->s_writers.rw_sem + level);
 }
 
-/**
- * freeze_super - lock the filesystem and force it into a consistent state
- * @sb: the super to lock
- *
- * Syncs the super to make sure the filesystem is consistent and calls the fs's
- * freeze_fs.  Subsequent calls to this without first thawing the fs will 
return
- * -EBUSY.
- *
- * During this function, sb->s_writers.frozen goes through these values:
- *
- * SB_UNFROZEN: File system is normal, all writes progress as usual.
- *
- * SB_FREEZE_WRITE: The file system is in the process of being frozen.  New
- * writes should be blocked, though page faults are still allowed. We wait for
- * all writes to complete and then proceed to the next stage.
- *
- * SB_FREEZE_PAGEFAULT: Freezing continues. Now also page faults are blocked
- * but internal fs threads can still modify the filesystem (although they
- * should not dirty new pages or inodes), writeback can run etc. After waiting
- * for all running page faults we sync the filesystem which will clean all
- * dirty pages and inodes (no new dirty pages or inodes can be created when
- * sync is running).
- *
- * SB_FREEZE_FS: The file system is frozen. Now all internal sources of fs
- * modification are blocked (e.g. XFS preallocation truncation on inode
- * reclaim). This is usually implemented by blocking new transactions for
- * filesystems that have them and need this additional guard. After all
- * internal writers are finished we call ->freeze_fs() to finish filesystem
- * freezing. Then we transition to SB_FREEZE_COMPLETE state. This state is
- * mostly auxiliary for filesystems to verify they do not modify frozen fs.
- *
- * sb->s_writers.frozen is protected by sb->s_umount.
- */
-int freeze_super(struct super_block *sb)
+/* Caller takes lock and handles active count */
+static int freeze_locked_super(struct super_block *sb)
 {
int ret;
 
-   atomic_inc(&sb->s_active);
-   down_write(&sb->s_umount);
-   if (sb->s_writers.frozen != SB_UNFROZEN) {
-   deactivate_locked_super(sb);
+   if (sb->s_writers.frozen != SB_UNFROZEN)
return -EBUSY;
-   }
 
-   if (!(sb->s_flags & SB_BORN)) {
-   up_write(&sb->s_umount);
+   if (!(sb->s_flags & SB_BORN))
return 0;   /* sic - it's "nothing to do" */
-   }
 
if (sb_rdonly(sb)) {
/* Nothing to do really... */
sb->s_writers.frozen = SB_FREEZE_COMPLETE;
-   up_write(&sb->s_umount);
return 0;
}
 
@@ -1468,7 +1429,6 @@ int freeze_super(struct super_block *sb)
sb->s_writers.frozen = SB_UNFROZEN;
sb_freeze_unlock(sb);
wake_up(&sb->s_writers.wait_unfrozen);
-   deactivate_locked_super(sb);
return ret;
}
}
@@ -1478,9 +1438,59 @@ int freeze_super(struct super_block *sb)
 */
sb->s_writers.frozen = SB_FREEZE_COMPLETE;
lockdep_sb_freeze_release(sb);
-   up_write(&sb->s_umount);
return 0;
 }
+
+/**
+ * freeze_super - lock the filesystem and force it into a consistent state
+ * @sb: the super to lock
+ *
+ * Syncs the super to make sure the filesystem is consistent and calls the fs's
+ * freeze_fs.  Subsequent calls to this without first thawing the fs will 
return
+ * -EBUSY.
+ *
+ * During this function, sb->s_writers.frozen goes through these values:
+ *
+ * SB_UNFROZEN: File system is normal, all writes progress as usual.
+ *
+ * SB_FREEZE_WRITE: The file system is in the process of being frozen.  New
+ * writes should be blocked, though page faults are still allowed. We wait for
+ * all writes to complete and then proceed to the next stage.
+ *
+ * SB_FREEZE_PAGEFAULT: Freezing continues. Now also page faults are blocked
+ * but internal fs threads can still modify the filesystem (although they
+ * should not dirty new pages or inodes), writeback can run etc. After waiting
+ * for all running page faults we sync the filesystem which will clean all
+ * dirty pages and inodes (no new dirty pages or inodes can be 

[PATCH 02/11] fs: provide unlocked helper thaw_super()

2017-11-29 Thread Luis R. Rodriguez
thaw_super() hold a write lock, however we wish to also enable
callers which already hold the write lock. To do this provide a helper
and make thaw_super() use it. This way, all that thaw_super() does
now is lock handling and active count management.

This change has no functional changes.

Suggested-by: Dave Chinner 
Signed-off-by: Luis R. Rodriguez 
---
 fs/super.c | 39 ++-
 1 file changed, 26 insertions(+), 13 deletions(-)

diff --git a/fs/super.c b/fs/super.c
index a7650ff22f0e..cecc279beecd 100644
--- a/fs/super.c
+++ b/fs/super.c
@@ -1493,21 +1493,13 @@ int freeze_super(struct super_block *sb)
 }
 EXPORT_SYMBOL(freeze_super);
 
-/**
- * thaw_super -- unlock filesystem
- * @sb: the super to thaw
- *
- * Unlocks the filesystem and marks it writeable again after freeze_super().
- */
-int thaw_super(struct super_block *sb)
+/* Caller takes lock and handles active count */
+static int thaw_locked_super(struct super_block *sb)
 {
int error;
 
-   down_write(&sb->s_umount);
-   if (sb->s_writers.frozen != SB_FREEZE_COMPLETE) {
-   up_write(&sb->s_umount);
+   if (sb->s_writers.frozen != SB_FREEZE_COMPLETE)
return -EINVAL;
-   }
 
if (sb_rdonly(sb)) {
sb->s_writers.frozen = SB_UNFROZEN;
@@ -1522,7 +1514,6 @@ int thaw_super(struct super_block *sb)
printk(KERN_ERR
"VFS:Filesystem thaw failed\n");
lockdep_sb_freeze_release(sb);
-   up_write(&sb->s_umount);
return error;
}
}
@@ -1531,7 +1522,29 @@ int thaw_super(struct super_block *sb)
sb_freeze_unlock(sb);
 out:
wake_up(&sb->s_writers.wait_unfrozen);
-   deactivate_locked_super(sb);
return 0;
 }
+
+/**
+ * thaw_super -- unlock filesystem
+ * @sb: the super to thaw
+ *
+ * Unlocks the filesystem and marks it writeable again after freeze_super().
+ */
+int thaw_super(struct super_block *sb)
+{
+   int error;
+
+   down_write(&sb->s_umount);
+   error = thaw_locked_super(sb);
+   if (error) {
+   up_write(&sb->s_umount);
+   goto out;
+   }
+
+   deactivate_locked_super(sb);
+
+out:
+   return error;
+}
 EXPORT_SYMBOL(thaw_super);
-- 
2.15.0



[PATCH 05/11] fs: add iterate_supers_excl() and iterate_supers_reverse_excl()

2017-11-29 Thread Luis R. Rodriguez
There are use cases where we wish to traverse the superblock list
but also capture errors, and in which case we want to avoid having
our callers issue a lock themselves since we can do the locking for
the callers. Provide a iterate_supers_excl() which calls a function
with the write lock held. If an error occurs we capture it and
propagate it.

Likewise there are use cases where we wish to traverse the superblock
list but in reverse order. The new iterate_supers_reverse_excl() helpers
does this but also also captures any errors encountered.

Signed-off-by: Luis R. Rodriguez 
---
 fs/super.c | 91 ++
 include/linux/fs.h |  2 ++
 2 files changed, 93 insertions(+)

diff --git a/fs/super.c b/fs/super.c
index a63513d187e8..885711c1d35b 100644
--- a/fs/super.c
+++ b/fs/super.c
@@ -605,6 +605,97 @@ void iterate_supers(void (*f)(struct super_block *, void 
*), void *arg)
spin_unlock(&sb_lock);
 }
 
+/**
+ * iterate_supers_excl - exclusively call func for all active superblocks
+ * @f: function to call
+ * @arg: argument to pass to it
+ *
+ * Scans the superblock list and calls given function, passing it
+ * locked superblock and given argument. Returns 0 unless an error
+ * occurred on calling the function on any superblock.
+ */
+int iterate_supers_excl(int (*f)(struct super_block *, void *), void *arg)
+{
+   struct super_block *sb, *p = NULL;
+   int error = 0;
+
+   spin_lock(&sb_lock);
+   list_for_each_entry(sb, &super_blocks, s_list) {
+   if (hlist_unhashed(&sb->s_instances))
+   continue;
+   sb->s_count++;
+   spin_unlock(&sb_lock);
+
+   down_write(&sb->s_umount);
+   if (sb->s_root && (sb->s_flags & SB_BORN)) {
+   error = f(sb, arg);
+   if (error) {
+   up_write(&sb->s_umount);
+   spin_lock(&sb_lock);
+   __put_super(sb);
+   break;
+   }
+   }
+   up_write(&sb->s_umount);
+
+   spin_lock(&sb_lock);
+   if (p)
+   __put_super(p);
+   p = sb;
+   }
+   if (p)
+   __put_super(p);
+   spin_unlock(&sb_lock);
+
+   return error;
+}
+
+/**
+ * iterate_supers_reverse_excl - exclusively calls func in reverse order
+ * @f: function to call
+ * @arg: argument to pass to it
+ *
+ * Scans the superblock list and calls given function, passing it
+ * locked superblock and given argument, in reverse order, and holding
+ * the s_umount write lock. Returns if an error occurred.
+ */
+int iterate_supers_reverse_excl(int (*f)(struct super_block *, void *),
+void *arg)
+{
+   struct super_block *sb, *p = NULL;
+   int error = 0;
+
+   spin_lock(&sb_lock);
+   list_for_each_entry_reverse(sb, &super_blocks, s_list) {
+   if (hlist_unhashed(&sb->s_instances))
+   continue;
+   sb->s_count++;
+   spin_unlock(&sb_lock);
+
+   down_write(&sb->s_umount);
+   if (sb->s_root && (sb->s_flags & SB_BORN)) {
+   error = f(sb, arg);
+   if (error) {
+   up_write(&sb->s_umount);
+   spin_lock(&sb_lock);
+   __put_super(sb);
+   break;
+   }
+   }
+   up_write(&sb->s_umount);
+
+   spin_lock(&sb_lock);
+   if (p)
+   __put_super(p);
+   p = sb;
+   }
+   if (p)
+   __put_super(p);
+   spin_unlock(&sb_lock);
+
+   return error;
+}
+
 /**
  * iterate_supers_type - call function for superblocks of given type
  * @type: fs type
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 107725b20fad..fe90b6542697 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -3164,6 +3164,8 @@ extern struct super_block *get_active_super(struct 
block_device *bdev);
 extern void drop_super(struct super_block *sb);
 extern void drop_super_exclusive(struct super_block *sb);
 extern void iterate_supers(void (*)(struct super_block *, void *), void *);
+extern int iterate_supers_excl(int (*f)(struct super_block *, void *), void 
*arg);
+extern int iterate_supers_reverse_excl(int (*)(struct super_block *, void *), 
void *);
 extern void iterate_supers_type(struct file_system_type *,
void (*)(struct super_block *, void *), void *);
 
-- 
2.15.0



[PATCH 07/11] xfs: remove not needed freezing calls

2017-11-29 Thread Luis R. Rodriguez
This removes superflous freezer calls as they are no longer needed
as the VFS now performs filesystem freezing/thaw if the filesystem has
support for it.

The following Coccinelle rule was used as follows:

spatch --sp-file fs-freeze-cleanup.cocci --in-place fs/$FS/

@ has_freeze_fs @
identifier super_ops;
expression freeze_op;
@@

struct super_operations super_ops = {
.freeze_fs = freeze_op,
};

@ remove_set_freezable depends on has_freeze_fs @
expression time;
statement S, S2, S3;
expression task;
@@

(
-   set_freezable();
|
-   if (try_to_freeze())
-   continue;
|
-   try_to_freeze();
|
-   freezable_schedule();
+   schedule();
|
-   freezable_schedule_timeout(time);
+   schedule_timeout(time);
|
-   if (freezing(task)) { S }
|
-   if (freezing(task)) { S }
-   else
{ S2 }
|
-   freezing(current)
)

Generated-by: Coccinelle SmPL
Signed-off-by: Luis R. Rodriguez 
---
 fs/xfs/xfs_trans_ail.c | 7 ++-
 1 file changed, 2 insertions(+), 5 deletions(-)

diff --git a/fs/xfs/xfs_trans_ail.c b/fs/xfs/xfs_trans_ail.c
index cef89f7127d3..1f3dd10a9d00 100644
--- a/fs/xfs/xfs_trans_ail.c
+++ b/fs/xfs/xfs_trans_ail.c
@@ -513,7 +513,6 @@ xfsaild(
longtout = 0;   /* milliseconds */
 
current->flags |= PF_MEMALLOC;
-   set_freezable();
 
while (1) {
if (tout && tout <= 20)
@@ -551,19 +550,17 @@ xfsaild(
if (!xfs_ail_min(ailp) &&
ailp->xa_target == ailp->xa_target_prev) {
spin_unlock(&ailp->xa_lock);
-   freezable_schedule();
+   schedule();
tout = 0;
continue;
}
spin_unlock(&ailp->xa_lock);
 
if (tout)
-   freezable_schedule_timeout(msecs_to_jiffies(tout));
+   schedule_timeout(msecs_to_jiffies(tout));
 
__set_current_state(TASK_RUNNING);
 
-   try_to_freeze();
-
tout = xfsaild_push(ailp);
}
 
-- 
2.15.0



[PATCH 06/11] fs: freeze on suspend and thaw on resume

2017-11-29 Thread Luis R. Rodriguez
This uses the existing filesystem freeze and thaw callbacks to
freeze each filesystem on suspend/hibernation and thaw upon resume.

This is needed so that we properly really stop IO in flight without
races after userspace has been frozen. Without this we rely on
kthread freezing and its semantics are loose and error prone.
For instance, even though a kthread may use try_to_freeze() and end
up being frozen we have no way of being sure that everything that
has been spawned asynchronously from it (such as timers) have also
been stopped as well.

A long term advantage of also adding filesystem freeze / thawing
supporting durign suspend / hibernation is that long term we may
be able to eventually drop the kernel's thread freezing completely
as it was originally added to stop disk IO in flight as we hibernate
or suspend.

This also implies that many kthread users exist which have been
adding freezer semantics onto its kthreads without need. These also
will need to be reviewed later.

This is based on prior work originally by Rafael Wysocki and later by
Jiri Kosina.

Signed-off-by: Luis R. Rodriguez 
---
 fs/super.c | 85 ++
 include/linux/fs.h | 13 
 kernel/power/process.c | 15 -
 3 files changed, 112 insertions(+), 1 deletion(-)

diff --git a/fs/super.c b/fs/super.c
index 885711c1d35b..c3a2842e5690 100644
--- a/fs/super.c
+++ b/fs/super.c
@@ -1648,3 +1648,88 @@ int thaw_super(struct super_block *sb)
return error;
 }
 EXPORT_SYMBOL(thaw_super);
+
+#ifdef CONFIG_PM_SLEEP
+static bool super_should_freeze(struct super_block *sb)
+{
+   if (!sb->s_root)
+   return false;
+   if (!(sb->s_flags & MS_BORN))
+   return false;
+   /*
+* We don't freeze virtual filesystems, we skip those filesystems with
+* no backing device.
+*/
+   if (sb->s_bdi == &noop_backing_dev_info)
+   return false;
+   /* No need to freeze read-only filesystems */
+   if (sb->s_flags & MS_RDONLY)
+   return false;
+
+   return true;
+}
+
+static int fs_suspend_freeze_sb(struct super_block *sb, void *priv)
+{
+   int error = 0;
+
+   spin_lock(&sb_lock);
+   if (!super_should_freeze(sb))
+   goto out;
+
+   pr_info("%s (%s): freezing\n", sb->s_type->name, sb->s_id);
+
+   spin_unlock(&sb_lock);
+
+   atomic_inc(&sb->s_active);
+   error = freeze_locked_super(sb, false);
+   if (error)
+   atomic_dec(&sb->s_active);
+
+   spin_lock(&sb_lock);
+   if (error && error != -EBUSY)
+   pr_notice("%s (%s): Unable to freeze, error=%d",
+ sb->s_type->name, sb->s_id, error);
+
+out:
+   spin_unlock(&sb_lock);
+   return error;
+}
+
+int fs_suspend_freeze(void)
+{
+   return iterate_supers_reverse_excl(fs_suspend_freeze_sb, NULL);
+}
+
+static int fs_suspend_thaw_sb(struct super_block *sb, void *priv)
+{
+   int error = 0;
+
+   spin_lock(&sb_lock);
+   if (!super_should_freeze(sb))
+   goto out;
+
+   pr_info("%s (%s): thawing\n", sb->s_type->name, sb->s_id);
+
+   spin_unlock(&sb_lock);
+
+   error = thaw_locked_super(sb, false);
+   if (!error)
+   atomic_dec(&sb->s_active);
+
+   spin_lock(&sb_lock);
+   if (error && error != -EBUSY)
+   pr_notice("%s (%s): Unable to unfreeze, error=%d",
+ sb->s_type->name, sb->s_id, error);
+
+out:
+   spin_unlock(&sb_lock);
+   return error;
+}
+
+int fs_resume_unfreeze(void)
+{
+   return iterate_supers_excl(fs_suspend_thaw_sb, NULL);
+}
+
+#endif
diff --git a/include/linux/fs.h b/include/linux/fs.h
index fe90b6542697..dbaa69c3a4cf 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -2237,6 +2237,19 @@ extern int user_statfs(const char __user *, struct 
kstatfs *);
 extern int fd_statfs(int, struct kstatfs *);
 extern int freeze_super(struct super_block *super);
 extern int thaw_super(struct super_block *super);
+#ifdef CONFIG_PM_SLEEP
+int fs_suspend_freeze(void);
+int fs_resume_unfreeze(void);
+#else
+static inline int fs_suspend_freeze(void)
+{
+   return 0;
+}
+static inline int fs_resume_unfreeze(void)
+{
+   return 0;
+}
+#endif
 extern bool our_mnt(struct vfsmount *mnt);
 extern __printf(2, 3)
 int super_setup_bdi_name(struct super_block *sb, char *fmt, ...);
diff --git a/kernel/power/process.c b/kernel/power/process.c
index c326d7235c5f..7a44f8310968 100644
--- a/kernel/power/process.c
+++ b/kernel/power/process.c
@@ -145,6 +145,16 @@ int freeze_processes(void)
pr_cont("\n");
BUG_ON(in_atomic());
 
+   pr_info("Freezing filesystems ... ");
+   error = fs_suspend_freez

[PATCH 09/11] f2fs: remove not needed freezing calls

2017-11-29 Thread Luis R. Rodriguez
This removes superflous freezer calls as they are no longer needed
as the VFS now performs filesystem freezing/thaw if the filesystem has
support for it.

The following Coccinelle rule was used as follows:

spatch --sp-file fs-freeze-cleanup.cocci --in-place fs/$FS/

@ has_freeze_fs @
identifier super_ops;
expression freeze_op;
@@

struct super_operations super_ops = {
.freeze_fs = freeze_op,
};

@ remove_set_freezable depends on has_freeze_fs @
expression time;
statement S, S2, S3;
expression task;
@@

(
-   set_freezable();
|
-   if (try_to_freeze())
-   continue;
|
-   try_to_freeze();
|
-   freezable_schedule();
+   schedule();
|
-   freezable_schedule_timeout(time);
+   schedule_timeout(time);
|
-   if (freezing(task)) { S }
|
-   if (freezing(task)) { S }
-   else
{ S2 }
|
-   freezing(current)
)

Generated-by: Coccinelle SmPL
Signed-off-by: Luis R. Rodriguez 
---
 fs/f2fs/gc.c  | 5 +
 fs/f2fs/segment.c | 6 +-
 2 files changed, 2 insertions(+), 9 deletions(-)

diff --git a/fs/f2fs/gc.c b/fs/f2fs/gc.c
index d844dcb80570..1032d6aa1756 100644
--- a/fs/f2fs/gc.c
+++ b/fs/f2fs/gc.c
@@ -32,10 +32,9 @@ static int gc_thread_func(void *data)
 
wait_ms = gc_th->min_sleep_time;
 
-   set_freezable();
do {
wait_event_interruptible_timeout(*wq,
-   kthread_should_stop() || freezing(current) ||
+   kthread_should_stop() ||
gc_th->gc_wake,
msecs_to_jiffies(wait_ms));
 
@@ -43,8 +42,6 @@ static int gc_thread_func(void *data)
if (gc_th->gc_wake)
gc_th->gc_wake = 0;
 
-   if (try_to_freeze())
-   continue;
if (kthread_should_stop())
break;
 
diff --git a/fs/f2fs/segment.c b/fs/f2fs/segment.c
index c117e0913f2a..a55e456e67ee 100644
--- a/fs/f2fs/segment.c
+++ b/fs/f2fs/segment.c
@@ -1382,18 +1382,14 @@ static int issue_discard_thread(void *data)
unsigned int wait_ms = DEF_MIN_DISCARD_ISSUE_TIME;
int issued;
 
-   set_freezable();
-
do {
init_discard_policy(&dpolicy, DPOLICY_BG,
dcc->discard_granularity);
 
wait_event_interruptible_timeout(*q,
-   kthread_should_stop() || freezing(current) ||
+   kthread_should_stop() ||
dcc->discard_wake,
msecs_to_jiffies(wait_ms));
-   if (try_to_freeze())
-   continue;
if (kthread_should_stop())
return 0;
 
-- 
2.15.0



[PATCH 08/11] ext4: remove not needed freezing calls

2017-11-29 Thread Luis R. Rodriguez
This removes superflous freezer calls as they are no longer needed
as the VFS now performs filesystem freezing/thaw if the filesystem has
support for it.

The following Coccinelle rule was used as follows:

spatch --sp-file fs-freeze-cleanup.cocci --in-place fs/$FS/

@ has_freeze_fs @
identifier super_ops;
expression freeze_op;
@@

struct super_operations super_ops = {
.freeze_fs = freeze_op,
};

@ remove_set_freezable depends on has_freeze_fs @
expression time;
statement S, S2, S3;
expression task;
@@

(
-   set_freezable();
|
-   if (try_to_freeze())
-   continue;
|
-   try_to_freeze();
|
-   freezable_schedule();
+   schedule();
|
-   freezable_schedule_timeout(time);
+   schedule_timeout(time);
|
-   if (freezing(task)) { S }
|
-   if (freezing(task)) { S }
-   else
{ S2 }
|
-   freezing(current)
)

Generated-by: Coccinelle SmPL
Signed-off-by: Luis R. Rodriguez 
---
 fs/ext4/super.c | 2 --
 1 file changed, 2 deletions(-)

diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index 57a8fa451d3e..8a510b1c2d92 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -2980,8 +2980,6 @@ static int ext4_lazyinit_thread(void *arg)
}
mutex_unlock(&eli->li_list_mtx);
 
-   try_to_freeze();
-
cur = jiffies;
if ((time_after_eq(cur, next_wakeup)) ||
(MAX_JIFFY_OFFSET == next_wakeup)) {
-- 
2.15.0



[PATCH 10/11] nilfs2: remove not needed freezing calls

2017-11-29 Thread Luis R. Rodriguez
This removes superflous freezer calls as they are no longer needed
as the VFS now performs filesystem freezing/thaw if the filesystem has
support for it.

The following Coccinelle rule was used as follows:

spatch --sp-file fs-freeze-cleanup.cocci --in-place fs/$FS/

@ has_freeze_fs @
identifier super_ops;
expression freeze_op;
@@

struct super_operations super_ops = {
.freeze_fs = freeze_op,
};

@ remove_set_freezable depends on has_freeze_fs @
expression time;
statement S, S2, S3;
expression task;
@@

(
-   set_freezable();
|
-   if (try_to_freeze())
-   continue;
|
-   try_to_freeze();
|
-   freezable_schedule();
+   schedule();
|
-   freezable_schedule_timeout(time);
+   schedule_timeout(time);
|
-   if (freezing(task)) { S }
|
-   if (freezing(task)) { S }
-   else
{ S2 }
|
-   freezing(current)
)

Generated-by: Coccinelle SmPL
Signed-off-by: Luis R. Rodriguez 
---
 fs/nilfs2/segment.c | 48 
 1 file changed, 20 insertions(+), 28 deletions(-)

diff --git a/fs/nilfs2/segment.c b/fs/nilfs2/segment.c
index 9f3ffba41533..407e12a60145 100644
--- a/fs/nilfs2/segment.c
+++ b/fs/nilfs2/segment.c
@@ -2543,6 +2543,8 @@ static int nilfs_segctor_thread(void *arg)
struct nilfs_sc_info *sci = (struct nilfs_sc_info *)arg;
struct the_nilfs *nilfs = sci->sc_super->s_fs_info;
int timeout = 0;
+   DEFINE_WAIT(wait);
+   int should_sleep = 1;
 
sci->sc_timer_task = current;
 
@@ -2574,38 +2576,28 @@ static int nilfs_segctor_thread(void *arg)
timeout = 0;
}
 
+   prepare_to_wait(&sci->sc_wait_daemon, &wait,
+   TASK_INTERRUPTIBLE);
 
-   if (freezing(current)) {
+   if (sci->sc_seq_request != sci->sc_seq_done)
+   should_sleep = 0;
+   else if (sci->sc_flush_request)
+   should_sleep = 0;
+   else if (sci->sc_state & NILFS_SEGCTOR_COMMIT)
+   should_sleep = time_before(jiffies,
+   sci->sc_timer.expires);
+
+   if (should_sleep) {
spin_unlock(&sci->sc_state_lock);
-   try_to_freeze();
+   schedule();
spin_lock(&sci->sc_state_lock);
-   } else {
-   DEFINE_WAIT(wait);
-   int should_sleep = 1;
-
-   prepare_to_wait(&sci->sc_wait_daemon, &wait,
-   TASK_INTERRUPTIBLE);
-
-   if (sci->sc_seq_request != sci->sc_seq_done)
-   should_sleep = 0;
-   else if (sci->sc_flush_request)
-   should_sleep = 0;
-   else if (sci->sc_state & NILFS_SEGCTOR_COMMIT)
-   should_sleep = time_before(jiffies,
-   sci->sc_timer.expires);
-
-   if (should_sleep) {
-   spin_unlock(&sci->sc_state_lock);
-   schedule();
-   spin_lock(&sci->sc_state_lock);
-   }
-   finish_wait(&sci->sc_wait_daemon, &wait);
-   timeout = ((sci->sc_state & NILFS_SEGCTOR_COMMIT) &&
-  time_after_eq(jiffies, sci->sc_timer.expires));
-
-   if (nilfs_sb_dirty(nilfs) && nilfs_sb_need_update(nilfs))
-   set_nilfs_discontinued(nilfs);
}
+   finish_wait(&sci->sc_wait_daemon, &wait);
+   timeout = ((sci->sc_state & NILFS_SEGCTOR_COMMIT) &&
+  time_after_eq(jiffies, sci->sc_timer.expires));
+
+   if (nilfs_sb_dirty(nilfs) && nilfs_sb_need_update(nilfs))
+   set_nilfs_discontinued(nilfs);
goto loop;
 
  end_thread:
-- 
2.15.0



[PATCH 11/11] jfs: remove not needed freezing calls

2017-11-29 Thread Luis R. Rodriguez
This removes superflous freezer calls as they are no longer needed
as the VFS now performs filesystem freezing/thaw if the filesystem has
support for it.

The following Coccinelle rule was used as follows:

spatch --sp-file fs-freeze-cleanup.cocci --in-place fs/$FS/

@ has_freeze_fs @
identifier super_ops;
expression freeze_op;
@@

struct super_operations super_ops = {
.freeze_fs = freeze_op,
};

@ remove_set_freezable depends on has_freeze_fs @
expression time;
statement S, S2, S3;
expression task;
@@

(
-   set_freezable();
|
-   if (try_to_freeze())
-   continue;
|
-   try_to_freeze();
|
-   freezable_schedule();
+   schedule();
|
-   freezable_schedule_timeout(time);
+   schedule_timeout(time);
|
-   if (freezing(task)) { S }
|
-   if (freezing(task)) { S }
-   else
{ S2 }
|
-   freezing(current)
)

Generated-by: Coccinelle SmPL
Signed-off-by: Luis R. Rodriguez 
---
 fs/jfs/jfs_logmgr.c | 11 +++
 fs/jfs/jfs_txnmgr.c | 31 +--
 2 files changed, 12 insertions(+), 30 deletions(-)

diff --git a/fs/jfs/jfs_logmgr.c b/fs/jfs/jfs_logmgr.c
index 0e5d412c0b01..fa5a95d8fba8 100644
--- a/fs/jfs/jfs_logmgr.c
+++ b/fs/jfs/jfs_logmgr.c
@@ -2344,14 +2344,9 @@ int jfsIOWait(void *arg)
spin_lock_irq(&log_redrive_lock);
}
 
-   if (freezing(current)) {
-   spin_unlock_irq(&log_redrive_lock);
-   try_to_freeze();
-   } else {
-   set_current_state(TASK_INTERRUPTIBLE);
-   spin_unlock_irq(&log_redrive_lock);
-   schedule();
-   }
+   set_current_state(TASK_INTERRUPTIBLE);
+   spin_unlock_irq(&log_redrive_lock);
+   schedule();
} while (!kthread_should_stop());
 
jfs_info("jfsIOWait being killed!");
diff --git a/fs/jfs/jfs_txnmgr.c b/fs/jfs/jfs_txnmgr.c
index 4d973524c887..a313300c4651 100644
--- a/fs/jfs/jfs_txnmgr.c
+++ b/fs/jfs/jfs_txnmgr.c
@@ -2747,6 +2747,7 @@ int jfs_lazycommit(void *arg)
struct tblock *tblk;
unsigned long flags;
struct jfs_sb_info *sbi;
+   DECLARE_WAITQUEUE(wq, current);
 
do {
LAZY_LOCK(flags);
@@ -2793,19 +2794,11 @@ int jfs_lazycommit(void *arg)
}
/* In case a wakeup came while all threads were active */
jfs_commit_thread_waking = 0;
-
-   if (freezing(current)) {
-   LAZY_UNLOCK(flags);
-   try_to_freeze();
-   } else {
-   DECLARE_WAITQUEUE(wq, current);
-
-   add_wait_queue(&jfs_commit_thread_wait, &wq);
-   set_current_state(TASK_INTERRUPTIBLE);
-   LAZY_UNLOCK(flags);
-   schedule();
-   remove_wait_queue(&jfs_commit_thread_wait, &wq);
-   }
+   add_wait_queue(&jfs_commit_thread_wait, &wq);
+   set_current_state(TASK_INTERRUPTIBLE);
+   LAZY_UNLOCK(flags);
+   schedule();
+   remove_wait_queue(&jfs_commit_thread_wait, &wq);
} while (!kthread_should_stop());
 
if (!list_empty(&TxAnchor.unlock_queue))
@@ -2982,15 +2975,9 @@ int jfs_sync(void *arg)
}
/* Add anon_list2 back to anon_list */
list_splice_init(&TxAnchor.anon_list2, &TxAnchor.anon_list);
-
-   if (freezing(current)) {
-   TXN_UNLOCK();
-   try_to_freeze();
-   } else {
-   set_current_state(TASK_INTERRUPTIBLE);
-   TXN_UNLOCK();
-   schedule();
-   }
+   set_current_state(TASK_INTERRUPTIBLE);
+   TXN_UNLOCK();
+   schedule();
} while (!kthread_should_stop());
 
jfs_info("jfs_sync being killed");
-- 
2.15.0



Re: [RFC 5/5] pm: remove kernel thread freezing

2017-11-29 Thread Luis R. Rodriguez
On Wed, Oct 04, 2017 at 01:03:54AM +, Bart Van Assche wrote:
> On Wed, 2017-10-04 at 02:47 +0200, Luis R. Rodriguez wrote:
> >   3) Lookup for kthreads which seem to generate IO -- address / review if
> >  removal of the freezer API can be done somehow with a quescing. This
> >  is currently for example being done on SCSI / md.
> >  4) Only after all the above is done should we consider this patch or some
> > form of it.
> 
> After having given this more thought, I think we should omit these last two
> steps. Modifying the md driver such that it does not submit I/O requests while
> processes are frozen requires either to use the freezer API or to open-code 
> it.
> I think there is general agreement in the kernel community that open-coding a
> single mechanism in multiple drivers is wrong. Does this mean that instead of
> removing the freezer API we should keep it and review all its users instead?

Agreed.

  Luis


Re: [PATCH v5 2/8] md: Neither resync nor reshape while the system is frozen

2017-10-03 Thread Luis R. Rodriguez
On Mon, Oct 02, 2017 at 03:52:12PM -0700, Bart Van Assche wrote:
> diff --git a/drivers/md/md.c b/drivers/md/md.c
> index 3f7426120a3b..a2cf2a93b0cb 100644
> --- a/drivers/md/md.c
> +++ b/drivers/md/md.c
> @@ -8961,6 +8963,37 @@ static void md_stop_all_writes(void)
>   mdelay(1000*1);
>  }
>  
> +/*
> + * Ensure that neither resyncing nor reshaping occurs while the system is
> + * frozen.
> + */
> +static int md_notify_pm(struct notifier_block *bl, unsigned long state,
> + void *unused)
> +{
> + struct mddev *mddev;
> + struct list_head *tmp;
> +
> + pr_debug("%s: state = %ld; system_freezing_cnt = %d\n", __func__, state,
> +  atomic_read(&system_freezing_cnt));
> +
> + switch (state) {
> + case PM_HIBERNATION_PREPARE:

Hm, why not also include and use this for PM_SUSPEND_PREPARE and/or
a PM_RESTORE_PREPARE.

case PM_HIBERNATION_PREPARE:
case PM_SUSPEND_PREPARE:
case PM_RESTORE_PREPARE:

> + md_stop_all_writes();
> + break;
> + case PM_POST_HIBERNATION:

Likewise here:

case PM_POST_SUSPEND:
case PM_POST_HIBERNATION:
case PM_POST_RESTORE:

I have revised the kernel suspend ordering and indeed things issued
with the pm notifier should suffice to be processed given we actually
call the pm ops (dpm_prepare()) for device drivers *after* the notifier
is called and then after userspace is frozen. That is:

   pm_suspend() --> enter_state() -->
 sys_sync()
 suspend_prepare() -->
 __pm_notifier_call_chain(PM_SUSPEND_PREPARE, ...);
 suspend_freeze_processes() -->
 freeze_processes() -->
 __usermodehelper_set_disable_depth(UMH_DISABLED);
 freeze all tasks ...
 freeze_kernel_threads()
 suspend_devices_and_enter() -->
 dpm_suspend_start() -->
 dpm_prepare()
 dpm_suspend()
 suspend_enter()  -->
 platform_suspend_prepare()
 dpm_suspend_late()
 freeze_enter()
 syscore_suspend()

On our way back up:

 enter_state() -->
 suspend_devices_and_enter() --> (bail from loop)
 dpm_resume_end() -->
 dpm_resume()
 dpm_complete()
 suspend_finish() -->
 suspend_thaw_processes() -->
 thaw_processes() -->
 __usermodehelper_set_disable_depth(UMH_FREEZING);
 thaw_workqueues();
 thaw all processes ...
 usermodehelper_enable();
 pm_notifier_call_chain(PM_POST_SUSPEND);

So notifier would indeed be the right tooling to use here.

  Luis


Re: [PATCH v5 1/8] md: Introduce md_stop_all_writes()

2017-10-03 Thread Luis R. Rodriguez
On Mon, Oct 02, 2017 at 03:52:11PM -0700, Bart Van Assche wrote:
> Introduce md_stop_all_writes() because the next patch will add
> a second caller for this function. Rename md_notifier into
> md_reboot_notifier to avoid that the name of this notifier will
> become confusing due to the next patch. This patch does not
> change any functionality.
> 
> Signed-off-by: Bart Van Assche 
> Cc: Shaohua Li 
> Cc: linux-r...@vger.kernel.org
> Cc: Ming Lei 
> Cc: Christoph Hellwig 
> Cc: Hannes Reinecke 
> Cc: Johannes Thumshirn 

Reviewed-by: Luis R. Rodriguez 

  Luis


Re: [RFC 5/5] pm: remove kernel thread freezing

2017-10-03 Thread Luis R. Rodriguez
On Tue, Oct 03, 2017 at 11:15:07PM +0200, Rafael J. Wysocki wrote:
> On Tuesday, October 3, 2017 8:59:00 PM CEST Rafael J. Wysocki wrote:
> > On Tuesday, October 3, 2017 8:53:13 PM CEST Luis R. Rodriguez wrote:
> > > Now that all filesystems which used to rely on kthread
> > > freezing have been converted to filesystem freeze/thawing
> > > we can remove the kernel kthread freezer.
> > > 
> > > Signed-off-by: Luis R. Rodriguez 
> > 
> > I like this one. :-)
> 
> However, suspend_freeze/thaw_processes() require some more work.
> 
> In particular, the freezing of workqueues is being removed here
> without a replacement.

Hrm, where do you see that? freeze_workqueues_busy() is still called on
try_to_freeze_tasks(). Likewise thaw_processes() also calls thaw_workqueues().
I did forget to nuke pm_nosig_freezing though.

Also as I have noted a few times now we have yet to determine if we can remove
all freezer calls on kthreads without causing a regression. Granted I'm being
overly careful here, I still would not be surprised to learn about a stupid use
case where this will be hard to remove now.

Only once this is done should this patch be considered. This will take time. So
I'd like more general consensus on long term approach:

  1) first address each fs to use its freezer calls on susend/hibernate / and 
thaw
 on resume. While at it, remove freezer API calls on their respective
 kthreads.
  2) In parallel address removing freezer API calls on non-IO kthreads, these
 should be trivial, but who knows what surprises lurk here
  3) Lookup for kthreads which seem to generate IO -- address / review if
 removal of the freezer API can be done somehow with a quescing. This
 is currently for example being done on SCSI / md.
  4) Only after all the above is done should we consider this patch or some
 form of it.

  Luis


Re: [RFC 5/5] pm: remove kernel thread freezing

2017-10-03 Thread Luis R. Rodriguez
On Tue, Oct 03, 2017 at 03:09:53PM -0600, Shuah Khan wrote:
> On Tue, Oct 3, 2017 at 3:00 PM, Jiri Kosina  wrote:
> > On Tue, 3 Oct 2017, Pavel Machek wrote:
> >
> >> > Again, I agree that the (rare) kthreads that are actually "creating" new
> >> > I/O have to be somehow frozen and require special care.
> >>
> >> Agreed. Was any effort made to identify those special kernel threads?
> >
> > I don't think there is any other way than just inspecting all the
> > try_to_freeze() instances in the kernel, and understanding what that
> > particular kthread is doing.
> >
> > I've cleaned up most of the low-hanging fruit already, where the
> > try_to_freeze() was obviously completely pointless, but a lot more time
> > needs to be invested into this.
> >
> 
> There are about 36 drivers that call try_to_freeze() and half (18 ) of
> those are media drivers. Maybe it is easier handle sub-system by
> sub-system basis for a review of which one of these usages could be
> removed. cc'ing Mauro and linux-media

Yes :)

I guess no one reads cover letters, but indeed. To be clear, this last
patch should only go in after a few kernels from now all kthreads
are vetted for piece-meal wise.

This patch would be the nail on the kthread freezer coffin. It should
go in last, who knows how many years from now, and if ever.

  Luis


Re: [RFC 2/5] fs: freeze on suspend and thaw on resume

2017-10-03 Thread Luis R. Rodriguez
On Wed, Oct 04, 2017 at 07:58:41AM +1100, Dave Chinner wrote:
> On Tue, Oct 03, 2017 at 11:53:10AM -0700, Luis R. Rodriguez wrote:
> > diff --git a/fs/super.c b/fs/super.c
> > index d45e92d9a38f..ce8da8b187b1 100644
> > --- a/fs/super.c
> > +++ b/fs/super.c
> > @@ -1572,3 +1572,82 @@ int thaw_super(struct super_block *sb)
> > return 0;
> >  }
> >  EXPORT_SYMBOL(thaw_super);
> > +
> > +#ifdef CONFIG_PM_SLEEP
> > +static bool super_allows_freeze(struct super_block *sb)
> > +{
> > +   return !!(sb->s_type->fs_flags & FS_FREEZE_ON_SUSPEND);
> > +}
> 
> That's a completely misleading function name. All superblocks can be
> frozen - freeze_super() is filesystem independent. And given that, I
> don't see why these super_should_freeze() hoops need to be jumped
> through...

I added this given ext4's current thaw implementation stalls on resume 
as it implicates a bio_submit() and this never completes. So I refactored
ext4 thaw skip a sync on thaw. This requires more eyeballs. This may be an
underlying issue elsewhere.  If its not an bug elsewhere or on ordering, then
there may be some restrictions on thaw when used on resume. Refer to my notes
to Ted on patch #4 for ext4.

> > +int fs_suspend_freeze_sb(struct super_block *sb, void *priv)
> > +{
> > +   int error = 0;
> > +
> > +   spin_lock(&sb_lock);
> > +   if (!super_should_freeze(sb))
> > +   goto out;
> > +
> > +   up_read(&sb->s_umount);
> > +   pr_info("%s (%s): freezing\n", sb->s_type->name, sb->s_id);
> > +   error = freeze_super(sb);
> > +   down_read(&sb->s_umount);
> > +out:
> > +   if (error && error != -EBUSY)
> > +   pr_notice("%s (%s): Unable to freeze, error=%d",
> > + sb->s_type->name, sb->s_id, error);
> > +   spin_unlock(&sb_lock);
> > +   return error;
> > +}
> 
> I don't think this was ever tested.  Calling freeze_super() with a
> spinlock held with through "sleeping in atomic" errors all over the
> place.

No, I run time tested it with a rootfs with ext4 and xfs with my
development tree and hammering on both while I loop on suspend
and resume.

> Also, the s_umount lock juggling is nasty. Your new copy+pasted
> iterate_supers_reverse() takes the lock in read mode, yet all the
> freeze/thaw callers here want to take it in write mode.

Yeap, yuk!

> So, really,
> iterate_supers_reverse() needs to be iterate_supers_reverse_excl()
> and take the write lock, and freeze_super/thaw_super need to be
> factored into locked and unlocked versions.
> 
> In which case, we end up with:
> 
> int fs_suspend_freeze_sb(struct super_block *sb, void *priv)
> {
>   return freeze_locked_super(sb);
> }
> 
> int fs_suspend_thaw_sb(struct super_block *sb, void *priv)
> {
>   return thaw_locked_super(sb);
> }

Groovy, thanks, I suspected that locking was convoluted and
we could come up with something better. Will do it this way.

Its really what I hoped we could do :) I just needed to get
it in writing.

  Luis


Re: [RFC 5/5] pm: remove kernel thread freezing

2017-10-03 Thread Luis R. Rodriguez
On Wed, Oct 04, 2017 at 08:04:49AM +1100, Dave Chinner wrote:
> On Tue, Oct 03, 2017 at 11:53:13AM -0700, Luis R. Rodriguez wrote:
> > Now that all filesystems which used to rely on kthread
> > freezing have been converted to filesystem freeze/thawing
> > we can remove the kernel kthread freezer.
> 
> Really? There's no other subsystem that relies on kernel thread
> and workqueue freezing to function correctly on suspend?

On my cover letter I noted this patch should not be consideredd
until *all* kthreads are vetted. So I agree vehemently with you.
The patch set only addressed 2 filesystem, so two kthreads. Much
other work would be needed before this lat patch could *ever*
be considered.

  Luis


Re: [RFC 0/5] fs: replace kthread freezing with filesystem freeze/thaw

2017-10-03 Thread Luis R. Rodriguez
On Tue, Oct 03, 2017 at 01:47:55PM -0700, Matthew Wilcox wrote:
> On Tue, Oct 03, 2017 at 10:05:11PM +0200, Luis R. Rodriguez wrote:
> > On Wed, Oct 04, 2017 at 03:33:01AM +0800, Ming Lei wrote:
> > > Even though this patch can make kthread to not do I/O during
> > > suspend/resume, the SCSI quiesce still can cause similar issue
> > > in other case, like when sending SCSI domain validation
> > > to transport_spi, which happens in revalidate path, nothing
> > > to do with suspend/resume.
> > 
> > Are you saying that the SCSI layer can generate IO even without the 
> > filesystem
> > triggering it?
> 
> The SCSI layer can send SCSI commands; they aren't I/Os in the sense that
> they do reads and writes to media,

Then there should be no issue waiting for the device to respond?

> but they are block requests.  Maybe those should be allowed even to frozen
> devices?

I'm afraid *maybe* won't suffice here, we need a clear explanation as
to what happens with frozen devices on the block layer *if* the device
driver is already suspended. Does the block layer just queue these?
If the goal is to just get queued but not processed, then why even
send these quiesce commands? Wouldn't they only be processed *after*
resume?

  Luis


Re: [RFC 5/5] pm: remove kernel thread freezing

2017-10-03 Thread Luis R. Rodriguez
On Tue, Oct 03, 2017 at 10:12:04PM +0200, Pavel Machek wrote:
> On Tue 2017-10-03 11:53:13, Luis R. Rodriguez wrote:
> > Now that all filesystems which used to rely on kthread
> > freezing have been converted to filesystem freeze/thawing
> > we can remove the kernel kthread freezer.
> 
> Are you surely-sure? You mentioned other in kernel sources of writes;
> what about those?

You perhaps did not read the cover letter, it describes this patch will very
likely take time to *ever* apply. We must cover tons of grounds first, to
address precisely what you say.

In fact other than kthreads that generate IO we may have now even crazy stupid
kthreads using the freezer API which *do not generate IO* which are totally
bogus but now acts as "features". We'll need to carefully carve out all freezer 
API uses on all kthreads. This should be done atomically to avoid regressions
and ensure bisectability.

  Luis


Re: [RFC 2/5] fs: freeze on suspend and thaw on resume

2017-10-03 Thread Luis R. Rodriguez
On Tue, Oct 03, 2017 at 08:32:39PM +, Bart Van Assche wrote:
> On Tue, 2017-10-03 at 22:23 +0200, Luis R. Rodriguez wrote:
> > On Tue, Oct 03, 2017 at 08:02:22PM +, Bart Van Assche wrote:
> > > On Tue, 2017-10-03 at 11:53 -0700, Luis R. Rodriguez wrote:
> > > > +static bool super_allows_freeze(struct super_block *sb)
> > > > +{
> > > > +   return !!(sb->s_type->fs_flags & FS_FREEZE_ON_SUSPEND);
> > > > +}
> > > 
> > > A minor comment: if "!!" would be left out the compiler will perform the
> > > conversion from int to bool implicitly
> > 
> > For all compilers?
> 
> Let's have a look at the output of the following commands:
> 
> $ PAGER= git grep 'typedef.*[[:blank:]]bool;' include
> include/linux/types.h:typedef _Bool bool;
> $ PAGER= git grep std= Makefile
> Makefile:   -fomit-frame-pointer -std=gnu89 $(HOST_LFS_CFLAGS)
> Makefile:  -std=gnu89 $(call cc-option,-fno-PIE)
> 
> From 
> https://gcc.gnu.org/onlinedocs/gcc-7.2.0/gcc/C-Dialect-Options.html#C-Dialect-Options:
> ‘gnu89’
> GNU dialect of ISO C90 (including some C99 features).
> 
> I think this means that the Linux kernel tree can only be compiled correctly
> by compilers that support the C11 type _Bool.

:*) beautiful, thanks.

> > > Anyway, I agree with the approach of this patch and I think
> > > that freezing filesystems before processes are frozen would be a big step
> > > forward.
> > 
> > Great! But please note, the current implementation calls fs_suspend_freeze()
> > *after* try_to_freeze_tasks(), ie: this implementation freezes userspace and
> > only after then filesystems.
> 
> What will the impact be of that choice on filesystems implemented in 
> userspace?

Depends on what kernel hooks those use? Also now is a good time for those 
working
on userspace filesystems to chime in :) Its why I am stating -- I am not saying
I have found the right order, I have find the right order that works for me, and
we need consensus on what the right order should be.

  Luis


Re: [RFC 5/5] pm: remove kernel thread freezing

2017-10-03 Thread Luis R. Rodriguez
On Tue, Oct 03, 2017 at 08:21:42PM +, Bart Van Assche wrote:
> On Tue, 2017-10-03 at 22:17 +0200, Jiri Kosina wrote:
> > On Tue, 3 Oct 2017, Bart Van Assche wrote:
> > > What about the many drivers outside filesystems that use the 
> > > set_freezable() / try_to_freeze() / wait_event_freezable() API?
> > 
> > Many/most of them are just completely bogus and pointless. I've killed a 
> > lot of those in the past, but the copy/paste programming is just too 
> > strong enemy to fight against.
> 
> If just a single driver would use that API to prevent that I/O occurs while
> processes are frozen then this patch will break that driver.

Yes! And although as Jiri points out, its debatable where this is being used,
but as you suggest there may be *valid* reasons for it *now* even though
originally it *may* have been bogus...

Its why I believe this now can only be done piecemeal wise, slowly but steady.
To avoid regressions, and make this effort bisectable.

  Luis


Re: [RFC 2/5] fs: freeze on suspend and thaw on resume

2017-10-03 Thread Luis R. Rodriguez
On Tue, Oct 03, 2017 at 08:02:22PM +, Bart Van Assche wrote:
> On Tue, 2017-10-03 at 11:53 -0700, Luis R. Rodriguez wrote:
> > +static bool super_allows_freeze(struct super_block *sb)
> > +{
> > +   return !!(sb->s_type->fs_flags & FS_FREEZE_ON_SUSPEND);
> > +}
> 
> A minor comment: if "!!" would be left out the compiler will perform the
> conversion from int to bool implicitly

For all compilers?

> so I propose to leave out the "!!" and parentheses.

OK!

> Anyway, I agree with the approach of this patch and I think
> that freezing filesystems before processes are frozen would be a big step
> forward.

Great! But please note, the current implementation calls fs_suspend_freeze()
*after* try_to_freeze_tasks(), ie: this implementation freezes userspace and
only after then filesystems.

Order will be *critical* here to get right, so we should definitely figure
out if this is definitely the right place (TM) to call fs_suspend_freeze().

Lastly, a final minor implementation note:

I think using a PM notifier would have been much cleaner, in fact it was my the
way I originally implemented this orthogonally to Jiri's work, however to get
this right the semantics of __pm_notifier_call_chain() would need to be
expanded with another state, perhaps PM_USERSPACE_FROZEN, for example. I
decided in the end a new state was not worth it give we would just have one
user: fs freezing. So to be clear using a notifier to wrap this code into
the fs code and not touching kernel/power/process.c is not possible with
today's semantics nor do I think its worth it to expand on these semantics.

This approach is explicit about order and requirements for those that should
care: those that will maintain kernel/power/process.c and friends. Having
this in a notifier would shift this implicitly.

  Luis


Re: [RFC 4/5] ext4: add fs freezing support on suspend/hibernation

2017-10-03 Thread Luis R. Rodriguez
On Tue, Oct 03, 2017 at 03:59:30PM -0400, Theodore Ts'o wrote:
> On Tue, Oct 03, 2017 at 11:53:12AM -0700, Luis R. Rodriguez wrote:
> > @@ -4926,7 +4926,7 @@ static int ext4_unfreeze(struct super_block *sb)
> > ext4_set_feature_journal_needs_recovery(sb);
> > }
> >  
> > -   ext4_commit_super(sb, 1);
> > +   ext4_commit_super(sb, 0);
> > return 0;
> >  }
> >
> 
> After we remove add the NEEDS_RECOVERY flag, we need to make sure
> recovery flag is pushed out to disk before any other changes are
> allowed to be pushed out to disk.  That's why we originally did the
> update synchronously.

I see. I had to try to dig through linux-history to see why this was done, and
I actually could not find an exact commit which explained what you just did.
Thanks!

Hrm, so on freeze we issue the same commit as well, so is a sync *really* needed
on thaw?

> There are other ways we could fulfill this requirements, but doing a
> synchronous update is the simplest way to handle this.  Was it
> necessary to change this given the other changes you are making the fs
> freeze implementation?

No, it was am empirical evaluation done at testing, I observed bio_submit()
stalls, we never get completion from the device. Even if we call thaw at the
very end of resume, after the devices should be up, we still end up in the same
situation. Given how I order freezing filesystems after freezing userspace I do
believe we should thaw filesystems prior unfreezing userspace, its why I placed
the call where it is now.

So this is something I was hoping others could help grok/iron out. I'd prefer
if we could live with thaw'ing unchanged, but this requires to figure this
issue out. Why we can't implicate bio_submit() on fs thaw as-is in this
implementation.

  Luis


Re: [RFC 0/5] fs: replace kthread freezing with filesystem freeze/thaw

2017-10-03 Thread Luis R. Rodriguez
On Wed, Oct 04, 2017 at 03:33:01AM +0800, Ming Lei wrote:
> On Tue, Oct 03, 2017 at 11:53:08AM -0700, Luis R. Rodriguez wrote:
> > INFO: task kworker/u8:8:1320 blocked for more than 10 seconds.
> >   Tainted: GE   4.13.0-next-20170907+ #88
> > "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> > kworker/u8:8D0  1320  2 0x8000
> > Workqueue: events_unbound async_run_entry_fn
> > Call Trace:
> >  __schedule+0x2ec/0x7a0
> >  schedule+0x36/0x80
> >  io_schedule+0x16/0x40
> >  get_request+0x278/0x780
> >  ? remove_wait_queue+0x70/0x70
> >  blk_get_request+0x9c/0x110
> >  scsi_execute+0x7a/0x310 [scsi_mod]
> >  sd_sync_cache+0xa3/0x190 [sd_mod]
> >  ? blk_run_queue+0x3f/0x50
> >  sd_suspend_common+0x7b/0x130 [sd_mod]
> >  ? scsi_print_result+0x270/0x270 [scsi_mod]
> >  sd_suspend_system+0x13/0x20 [sd_mod]
> >  do_scsi_suspend+0x1b/0x30 [scsi_mod]
> >  scsi_bus_suspend_common+0xb1/0xd0 [scsi_mod]
> >  ? device_for_each_child+0x69/0x90
> >  scsi_bus_suspend+0x15/0x20 [scsi_mod]
> >  dpm_run_callback+0x56/0x140
> >  ? scsi_bus_freeze+0x20/0x20 [scsi_mod]
> >  __device_suspend+0xf1/0x340
> >  async_suspend+0x1f/0xa0
> >  async_run_entry_fn+0x38/0x160
> >  process_one_work+0x191/0x380
> >  worker_thread+0x4e/0x3c0
> >  kthread+0x109/0x140
> >  ? process_one_work+0x380/0x380
> >  ? kthread_create_on_node+0x70/0x70
> >  ret_from_fork+0x25/0x30
> 
> Actually we are trying to fix this issue inside block layer/SCSI, please
> see the following link:
> 
> https://marc.info/?l=linux-scsi&m=150703947029304&w=2
> 
> Even though this patch can make kthread to not do I/O during
> suspend/resume, the SCSI quiesce still can cause similar issue
> in other case, like when sending SCSI domain validation
> to transport_spi, which happens in revalidate path, nothing
> to do with suspend/resume.

Are you saying that the SCSI layer can generate IO even without the filesystem
triggering it?

If so then by all means these are certainly other areas we should address
quiescing as I noted in my email.

Also, *iff* the generated IO is triggered on the SCSI suspend callback, then
clearly the next question is if this is truly needed. If so then yes, it
should be quiesced and all restrictions should be considered.

Note that device pm ops get called first, then later the notifiers are
processed, and only later is userspace frozen. Its this gap this patch
set addresses, and its also where where I saw the issue creep in. Depending on
the questions above we may or not need more work in other layers.

So I am not saying this patch set is sufficient to address all IO quiescing,
quite the contrary I acknowledged that each subsystem should vet if they have
non-FS generated IO (seems you and Bart are doing  great job at doing this
analysis on SCSI). This patchset however should help with odd corner cases
which *are* triggered by the FS and the spaghetti code requirements of the
kthread freezing clearly does not suffice.

> So IMO the root cause is in SCSI's quiesce.
> 
> You can find the similar description in above link:
> 
>   Once SCSI device is put into QUIESCE, no new request except for
>   RQF_PREEMPT can be dispatched to SCSI successfully, and
>   scsi_device_quiesce() just simply waits for completion of I/Os
>   dispatched to SCSI stack. It isn't enough at all.

I see so the race here is *on* the pm ops of SCSI we have generated IO
to QUIESCE.

> 
>   Because new request still can be coming, but all the allocated
>   requests can't be dispatched successfully, so request pool can be
>   consumed up easily. Then RQF_PREEMPT can't be allocated, and
>   hang forever, just like the stack trace you posted.
> 

I see. Makes sense. So SCSI quiesce has restrictions and they're being
violated.

Anyway, don't think of this as a replacement for yours or Bart's work then, but
rather supplemental.

Are you saying we should not move forward with this patch set, or simply that
the above splat is rather properly fixed with SCSI quiescing? Given you're
explanation I'd have to agree. But even with this considered and accepted, from
a theoretical perspective -- why would this patch set actually seem to fix the
same issue? Is it, that it just *seems* to fix it?

  Luis


[RFC 0/5] fs: replace kthread freezing with filesystem freeze/thaw

2017-10-03 Thread Luis R. Rodriguez
ll the above then I'd like to move *carefully* with the kthread freezer
replacement with filesystem freeze/thawing instead of doing it all in one shot,
as Jiri had originally intended.

Filesystmes which have been vetted to work properly can use the super block
FS_FREEZE_ON_SUSPEND flag when things are ready as a transitional flag. I hope
we can get to the point we can rely on all filesystmes supporting it, so that
once all filesystem code is vetted for, and all non-IO bound kthread callers
have been cleared of the freezer call we can just remove the
FS_FREEZE_ON_SUSPEND flag and then kill the kthread freezer completely (the
last patch in this series).

The way this would be merged then is, only the first four patches would
be considered for upstream at first. We'd then move on to do the same with
other filesystems. Then or in parallel move on to tackle vetting removal
of all disk non-IO bounds kthread freezer callers. Only once this is all
done should we consider the last patch to put final nail on the kthread
freezer coffin.

[0] https://lwn.net/Articles/662703/
[1] https://bugzilla.suse.com/show_bug.cgi?id=1043449
[2] http://alpss.at/

Thoughts? Rants?

Luis R. Rodriguez (5):
  fs: add iterate_supers_reverse()
  fs: freeze on suspend and thaw on resume
  xfs: allow fs freeze on suspend/hibernation
  ext4: add fs freezing support on suspend/hibernation
  pm: remove kernel thread freezing

 Documentation/power/freezing-of-tasks.txt |   9 ---
 drivers/xen/manage.c  |   6 --
 fs/ext4/super.c   |  13 ++--
 fs/super.c| 122 ++
 fs/xfs/xfs_super.c|   3 +-
 fs/xfs/xfs_trans_ail.c|   7 +-
 include/linux/freezer.h   |   4 -
 include/linux/fs.h|  14 
 kernel/power/hibernate.c  |  10 +--
 kernel/power/power.h  |  20 +
 kernel/power/process.c|  61 ---
 kernel/power/user.c   |   9 ---
 tools/power/pm-graph/analyze_suspend.py   |   1 -
 13 files changed, 163 insertions(+), 116 deletions(-)

-- 
2.14.0



[RFC 4/5] ext4: add fs freezing support on suspend/hibernation

2017-10-03 Thread Luis R. Rodriguez
This also removes the superflous freezer calls as they are no longer
needed. We need to avoid sync call on thaw as otherwise we end up with
a stall on bio_submit().

Signed-off-by: Luis R. Rodriguez 
---
 fs/ext4/super.c | 13 +++--
 1 file changed, 7 insertions(+), 6 deletions(-)

diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index 3cc5635d00cc..03d3bbd27dae 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -119,7 +119,8 @@ static struct file_system_type ext2_fs_type = {
.name   = "ext2",
.mount  = ext4_mount,
.kill_sb= kill_block_super,
-   .fs_flags   = FS_REQUIRES_DEV,
+   .fs_flags   = FS_REQUIRES_DEV |
+ FS_FREEZE_ON_SUSPEND,
 };
 MODULE_ALIAS_FS("ext2");
 MODULE_ALIAS("ext2");
@@ -134,7 +135,8 @@ static struct file_system_type ext3_fs_type = {
.name   = "ext3",
.mount  = ext4_mount,
.kill_sb= kill_block_super,
-   .fs_flags   = FS_REQUIRES_DEV,
+   .fs_flags   = FS_REQUIRES_DEV |
+ FS_FREEZE_ON_SUSPEND,
 };
 MODULE_ALIAS_FS("ext3");
 MODULE_ALIAS("ext3");
@@ -2979,8 +2981,6 @@ static int ext4_lazyinit_thread(void *arg)
}
mutex_unlock(&eli->li_list_mtx);
 
-   try_to_freeze();
-
cur = jiffies;
if ((time_after_eq(cur, next_wakeup)) ||
(MAX_JIFFY_OFFSET == next_wakeup)) {
@@ -4926,7 +4926,7 @@ static int ext4_unfreeze(struct super_block *sb)
ext4_set_feature_journal_needs_recovery(sb);
}
 
-   ext4_commit_super(sb, 1);
+   ext4_commit_super(sb, 0);
return 0;
 }
 
@@ -5771,7 +5771,8 @@ static struct file_system_type ext4_fs_type = {
.name   = "ext4",
.mount  = ext4_mount,
.kill_sb= kill_block_super,
-   .fs_flags   = FS_REQUIRES_DEV,
+   .fs_flags   = FS_REQUIRES_DEV |
+ FS_FREEZE_ON_SUSPEND,
 };
 MODULE_ALIAS_FS("ext4");
 
-- 
2.14.0



[RFC 3/5] xfs: allow fs freeze on suspend/hibernation

2017-10-03 Thread Luis R. Rodriguez
This also removes the freezer calls on the XFS kthread as they are
no longer needed.

Signed-off-by: Luis R. Rodriguez 
---
 fs/xfs/xfs_super.c | 3 ++-
 fs/xfs/xfs_trans_ail.c | 7 ++-
 2 files changed, 4 insertions(+), 6 deletions(-)

diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
index 64013b2db8e0..75c3415657eb 100644
--- a/fs/xfs/xfs_super.c
+++ b/fs/xfs/xfs_super.c
@@ -1775,7 +1775,8 @@ static struct file_system_type xfs_fs_type = {
.name   = "xfs",
.mount  = xfs_fs_mount,
.kill_sb= kill_block_super,
-   .fs_flags   = FS_REQUIRES_DEV,
+   .fs_flags   = FS_REQUIRES_DEV |
+ FS_FREEZE_ON_SUSPEND,
 };
 MODULE_ALIAS_FS("xfs");
 
diff --git a/fs/xfs/xfs_trans_ail.c b/fs/xfs/xfs_trans_ail.c
index 354368a906e5..8cebda88e91a 100644
--- a/fs/xfs/xfs_trans_ail.c
+++ b/fs/xfs/xfs_trans_ail.c
@@ -512,7 +512,6 @@ xfsaild(
longtout = 0;   /* milliseconds */
 
current->flags |= PF_MEMALLOC;
-   set_freezable();
 
while (!kthread_should_stop()) {
if (tout && tout <= 20)
@@ -535,19 +534,17 @@ xfsaild(
if (!xfs_ail_min(ailp) &&
ailp->xa_target == ailp->xa_target_prev) {
spin_unlock(&ailp->xa_lock);
-   freezable_schedule();
+   schedule();
tout = 0;
continue;
}
spin_unlock(&ailp->xa_lock);
 
if (tout)
-   freezable_schedule_timeout(msecs_to_jiffies(tout));
+   schedule_timeout(msecs_to_jiffies(tout));
 
__set_current_state(TASK_RUNNING);
 
-   try_to_freeze();
-
tout = xfsaild_push(ailp);
}
 
-- 
2.14.0



[RFC 2/5] fs: freeze on suspend and thaw on resume

2017-10-03 Thread Luis R. Rodriguez
This uses the existing filesystem freeze and thaw callbacks to
freeze each filesystem on suspend/hibernation and thaw upon resume.

This is needed so that we properly really stop IO in flight without
races after userspace has been frozen. Without this we rely on
kthread freezing and its semantics are loose and error prone.
For instance, even though a kthread may use try_to_freeze() and end
up being frozen we have no way of being sure that everything that
has been spawned asynchronously from it (such as timers) have also
been stopped as well.

A long term advantage of also adding filesystem freeze / thawing
supporting durign suspend / hibernation is that long term we may
be able to eventually drop the kernel's thread freezing completely
as it was originally added to stop disk IO in flight as we hibernate
or suspend.

This also implies that many kthread users exist which have been
adding freezer semantics onto its kthreads without need. These also
will need to be reviewed later.

This is based on prior work originally by Rafael Wysocki and later by
Jiri Kosina.

Signed-off-by: Luis R. Rodriguez 
---
 fs/super.c | 79 ++
 include/linux/fs.h | 13 +
 kernel/power/process.c | 14 -
 3 files changed, 105 insertions(+), 1 deletion(-)

diff --git a/fs/super.c b/fs/super.c
index d45e92d9a38f..ce8da8b187b1 100644
--- a/fs/super.c
+++ b/fs/super.c
@@ -1572,3 +1572,82 @@ int thaw_super(struct super_block *sb)
return 0;
 }
 EXPORT_SYMBOL(thaw_super);
+
+#ifdef CONFIG_PM_SLEEP
+static bool super_allows_freeze(struct super_block *sb)
+{
+   return !!(sb->s_type->fs_flags & FS_FREEZE_ON_SUSPEND);
+}
+
+static bool super_should_freeze(struct super_block *sb)
+{
+   if (!sb->s_root)
+   return false;
+   if (!(sb->s_flags & MS_BORN))
+   return false;
+   /*
+* We don't freeze virtual filesystems, we skip those filesystems with
+* no backing device.
+*/
+   if (sb->s_bdi == &noop_backing_dev_info)
+   return false;
+   /* No need to freeze read-only filesystems */
+   if (sb->s_flags & MS_RDONLY)
+   return false;
+
+   if (!super_allows_freeze(sb))
+   return false;
+
+   return true;
+}
+
+int fs_suspend_freeze_sb(struct super_block *sb, void *priv)
+{
+   int error = 0;
+
+   spin_lock(&sb_lock);
+   if (!super_should_freeze(sb))
+   goto out;
+
+   up_read(&sb->s_umount);
+   pr_info("%s (%s): freezing\n", sb->s_type->name, sb->s_id);
+   error = freeze_super(sb);
+   down_read(&sb->s_umount);
+out:
+   if (error && error != -EBUSY)
+   pr_notice("%s (%s): Unable to freeze, error=%d",
+ sb->s_type->name, sb->s_id, error);
+   spin_unlock(&sb_lock);
+   return error;
+}
+
+int fs_suspend_freeze(void)
+{
+   return iterate_supers_reverse(fs_suspend_freeze_sb, NULL);
+}
+
+void fs_resume_unfreeze_sb(struct super_block *sb, void *priv)
+{
+   int error = 0;
+
+   spin_lock(&sb_lock);
+   if (!super_should_freeze(sb))
+   goto out;
+
+   up_read(&sb->s_umount);
+   pr_info("%s (%s): thawing\n", sb->s_type->name, sb->s_id);
+   error = thaw_super(sb);
+   down_read(&sb->s_umount);
+out:
+   if (error && error != -EBUSY)
+   pr_notice("%s (%s): Unable to unfreeze, error=%d",
+ sb->s_type->name, sb->s_id, error);
+   spin_unlock(&sb_lock);
+}
+
+void fs_resume_unfreeze(void)
+{
+   iterate_supers(fs_resume_unfreeze_sb, NULL);
+}
+
+#endif
diff --git a/include/linux/fs.h b/include/linux/fs.h
index cd084792cf39..7754230d6afc 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -2071,6 +2071,7 @@ struct file_system_type {
 #define FS_BINARY_MOUNTDATA2
 #define FS_HAS_SUBTYPE 4
 #define FS_USERNS_MOUNT8   /* Can be mounted by userns 
root */
+#define FS_FREEZE_ON_SUSPEND   16  /* should filesystem freeze on suspend? 
*/
 #define FS_RENAME_DOES_D_MOVE  32768   /* FS will handle d_move() during 
rename() internally. */
struct dentry *(*mount) (struct file_system_type *, int,
   const char *, void *);
@@ -2172,6 +2173,18 @@ extern int fd_statfs(int, struct kstatfs *);
 extern int vfs_ustat(dev_t, struct kstatfs *);
 extern int freeze_super(struct super_block *super);
 extern int thaw_super(struct super_block *super);
+#ifdef CONFIG_PM_SLEEP
+int fs_suspend_freeze(void);
+void fs_resume_unfreeze(void);
+#else
+static inline int fs_suspend_freeze(void)
+{
+   return 0;
+}
+static inline void fs_resume_unfreeze(void)
+{
+}
+#endif
 extern bool our_mnt(struct vfsmount *mnt);
 extern __printf(2, 3)
 int s

[RFC 5/5] pm: remove kernel thread freezing

2017-10-03 Thread Luis R. Rodriguez
Now that all filesystems which used to rely on kthread
freezing have been converted to filesystem freeze/thawing
we can remove the kernel kthread freezer.

Signed-off-by: Luis R. Rodriguez 
---
 Documentation/power/freezing-of-tasks.txt |  9 --
 drivers/xen/manage.c  |  6 
 include/linux/freezer.h   |  4 ---
 kernel/power/hibernate.c  | 10 ++-
 kernel/power/power.h  | 20 +
 kernel/power/process.c| 47 ---
 kernel/power/user.c   |  9 --
 tools/power/pm-graph/analyze_suspend.py   |  1 -
 8 files changed, 3 insertions(+), 103 deletions(-)

diff --git a/Documentation/power/freezing-of-tasks.txt 
b/Documentation/power/freezing-of-tasks.txt
index af005770e767..477559f57c95 100644
--- a/Documentation/power/freezing-of-tasks.txt
+++ b/Documentation/power/freezing-of-tasks.txt
@@ -71,15 +71,6 @@ Rationale behind the functions dealing with freezing and 
thawing of tasks:
 freeze_processes():
   - freezes only userspace tasks
 
-freeze_kernel_threads():
-  - freezes all tasks (including kernel threads) because we can't freeze
-kernel threads without freezing userspace tasks
-
-thaw_kernel_threads():
-  - thaws only kernel threads; this is particularly useful if we need to do
-anything special in between thawing of kernel threads and thawing of
-userspace tasks, or if we want to postpone the thawing of userspace tasks
-
 thaw_processes():
   - thaws all tasks (including kernel threads) because we can't thaw userspace
 tasks without thawing kernel threads
diff --git a/drivers/xen/manage.c b/drivers/xen/manage.c
index c425d03d37d2..8ca0e0c9a7d5 100644
--- a/drivers/xen/manage.c
+++ b/drivers/xen/manage.c
@@ -109,12 +109,6 @@ static void do_suspend(void)
goto out;
}
 
-   err = freeze_kernel_threads();
-   if (err) {
-   pr_err("%s: freeze kernel threads failed %d\n", __func__, err);
-   goto out_thaw;
-   }
-
err = dpm_suspend_start(PMSG_FREEZE);
if (err) {
pr_err("%s: dpm_suspend_start %d\n", __func__, err);
diff --git a/include/linux/freezer.h b/include/linux/freezer.h
index dd03e837ebb7..037ef3f16173 100644
--- a/include/linux/freezer.h
+++ b/include/linux/freezer.h
@@ -43,9 +43,7 @@ extern void __thaw_task(struct task_struct *t);
 
 extern bool __refrigerator(bool check_kthr_stop);
 extern int freeze_processes(void);
-extern int freeze_kernel_threads(void);
 extern void thaw_processes(void);
-extern void thaw_kernel_threads(void);
 
 /*
  * DO NOT ADD ANY NEW CALLERS OF THIS FUNCTION
@@ -263,9 +261,7 @@ static inline void __thaw_task(struct task_struct *t) {}
 
 static inline bool __refrigerator(bool check_kthr_stop) { return false; }
 static inline int freeze_processes(void) { return -ENOSYS; }
-static inline int freeze_kernel_threads(void) { return -ENOSYS; }
 static inline void thaw_processes(void) {}
-static inline void thaw_kernel_threads(void) {}
 
 static inline bool try_to_freeze_nowarn(void) { return false; }
 static inline bool try_to_freeze(void) { return false; }
diff --git a/kernel/power/hibernate.c b/kernel/power/hibernate.c
index a5c36e9c56a6..7c3af084b10a 100644
--- a/kernel/power/hibernate.c
+++ b/kernel/power/hibernate.c
@@ -352,10 +352,6 @@ int hibernation_snapshot(int platform_mode)
if (error)
goto Close;
 
-   error = freeze_kernel_threads();
-   if (error)
-   goto Cleanup;
-
if (hibernation_test(TEST_FREEZER)) {
 
/*
@@ -363,13 +359,13 @@ int hibernation_snapshot(int platform_mode)
 * successful freezer test.
 */
freezer_test_done = true;
-   goto Thaw;
+   goto Cleanup;
}
 
error = dpm_prepare(PMSG_FREEZE);
if (error) {
dpm_complete(PMSG_RECOVER);
-   goto Thaw;
+   goto Cleanup;
}
 
suspend_console();
@@ -405,8 +401,6 @@ int hibernation_snapshot(int platform_mode)
platform_end(platform_mode);
return error;
 
- Thaw:
-   thaw_kernel_threads();
  Cleanup:
swsusp_free();
goto Close;
diff --git a/kernel/power/power.h b/kernel/power/power.h
index 1d2d761e3c25..333bde062e42 100644
--- a/kernel/power/power.h
+++ b/kernel/power/power.h
@@ -253,25 +253,7 @@ extern int pm_test_level;
 #ifdef CONFIG_SUSPEND_FREEZER
 static inline int suspend_freeze_processes(void)
 {
-   int error;
-
-   error = freeze_processes();
-   /*
-* freeze_processes() automatically thaws every task if freezing
-* fails. So we need not do anything extra upon error.
-*/
-   if (error)
-   return error;
-
-   error = freeze_kernel_threads();
-   /*
-* freeze_kernel_threads() thaws only kernel threads upon free

[RFC 1/5] fs: add iterate_supers_reverse()

2017-10-03 Thread Luis R. Rodriguez
There are use cases where we wish to traverse the superblock list
in reverse order. This particular implementation will also enable
to capture errors.

Signed-off-by: Luis R. Rodriguez 
---
 fs/super.c | 43 +++
 include/linux/fs.h |  1 +
 2 files changed, 44 insertions(+)

diff --git a/fs/super.c b/fs/super.c
index 02da00410de8..d45e92d9a38f 100644
--- a/fs/super.c
+++ b/fs/super.c
@@ -609,6 +609,49 @@ void iterate_supers(void (*f)(struct super_block *, void 
*), void *arg)
spin_unlock(&sb_lock);
 }
 
+/**
+ * iterate_supers_reverse - call function for active superblocks in reverse
+ * @f: function to call
+ * @arg: argument to pass to it
+ *
+ * Scans the superblock list and calls given function, passing it
+ * locked superblock and given argument, in reverse order. Returns if
+ * an error occurred.
+ */
+int iterate_supers_reverse(int (*f)(struct super_block *, void *), void *arg)
+{
+   struct super_block *sb, *p = NULL;
+   int error = 0;
+
+   spin_lock(&sb_lock);
+   list_for_each_entry_reverse(sb, &super_blocks, s_list) {
+   if (hlist_unhashed(&sb->s_instances))
+   continue;
+   sb->s_count++;
+   spin_unlock(&sb_lock);
+
+   down_read(&sb->s_umount);
+   if (sb->s_root && (sb->s_flags & SB_BORN)) {
+   error = f(sb, arg);
+   if (error) {
+   spin_lock(&sb_lock);
+   break;
+   }
+   }
+   up_read(&sb->s_umount);
+
+   spin_lock(&sb_lock);
+   if (p)
+   __put_super(p);
+   p = sb;
+   }
+   if (p)
+   __put_super(p);
+   spin_unlock(&sb_lock);
+
+   return error;
+}
+
 /**
  * iterate_supers_type - call function for superblocks of given type
  * @type: fs type
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 46796b2f956b..cd084792cf39 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -3097,6 +3097,7 @@ extern struct super_block *get_active_super(struct 
block_device *bdev);
 extern void drop_super(struct super_block *sb);
 extern void drop_super_exclusive(struct super_block *sb);
 extern void iterate_supers(void (*)(struct super_block *, void *), void *);
+extern int iterate_supers_reverse(int (*)(struct super_block *, void *), void 
*);
 extern void iterate_supers_type(struct file_system_type *,
void (*)(struct super_block *, void *), void *);
 
-- 
2.14.0