Re: [PATCH v9 1/2] Add a "nosymfollow" mount option.
On Thu, Aug 27, 2020 at 2:25 PM Ross Zwisler wrote:
> On Thu, Aug 27, 2020 at 09:10:15PM +0100, Al Viro wrote:
> > On Thu, Aug 27, 2020 at 09:08:01PM +0100, Al Viro wrote:
> > Applied (to -rc1) and pushed
>
> Many thanks!

(apologies for the resend, the previous one had HTML and was rejected by
the lists)

Just FYI, here is the related commit in upstream util-linux:

https://github.com/karelzak/util-linux/commit/50a531f667c31d54fbb920d394e6008df89ae636

and the thread to linux-man, which I will ping when the v5.10 merge window
closes:

https://lore.kernel.org/linux-man/CAKgNAkiAkyUjd=cUvASaT2tyhaCdiMF48KA3Ov_1mQf0=j2...@mail.gmail.com/
Re: [PATCH v9 1/2] Add a "nosymfollow" mount option.
On Thu, Aug 27, 2020 at 09:10:15PM +0100, Al Viro wrote:
> On Thu, Aug 27, 2020 at 09:08:01PM +0100, Al Viro wrote:
> > AFAICS, it applies clean to -rc1; what was the rebase about?

Oh, sorry if that was confusing, I just wanted to make sure that it still
applied cleanly to the latest -rc so that you didn't hit a merge conflict.
Yes, these patches apply cleanly to both -rc1 and -rc2.

> Applied (to -rc1) and pushed

Many thanks!
[PATCH v9 2/2] selftests: mount: add nosymfollow tests
Add tests for the new 'nosymfollow' mount option. We test to make sure that
symlink traversal fails with ELOOP when 'nosymfollow' is set, but that
readlink(2) and realpath(3) still work as expected. We also verify that
statfs(2) correctly returns ST_NOSYMFOLLOW when we are mounted with the
'nosymfollow' option.

Signed-off-by: Ross Zwisler
---
 tools/testing/selftests/mount/.gitignore      |   1 +
 tools/testing/selftests/mount/Makefile        |   4 +-
 .../selftests/mount/nosymfollow-test.c        | 218 ++
 .../selftests/mount/run_nosymfollow.sh        |   4 +
 ...n_tests.sh => run_unprivileged_remount.sh} |   0
 5 files changed, 225 insertions(+), 2 deletions(-)
 create mode 100644 tools/testing/selftests/mount/nosymfollow-test.c
 create mode 100755 tools/testing/selftests/mount/run_nosymfollow.sh
 rename tools/testing/selftests/mount/{run_tests.sh => run_unprivileged_remount.sh} (100%)

diff --git a/tools/testing/selftests/mount/.gitignore b/tools/testing/selftests/mount/.gitignore
index 0bc64a6d4c181..17f2d84151622 100644
--- a/tools/testing/selftests/mount/.gitignore
+++ b/tools/testing/selftests/mount/.gitignore
@@ -1,2 +1,3 @@
 # SPDX-License-Identifier: GPL-2.0-only
 unprivileged-remount-test
+nosymfollow-test
diff --git a/tools/testing/selftests/mount/Makefile b/tools/testing/selftests/mount/Makefile
index 026890744215b..2d9454841644a 100644
--- a/tools/testing/selftests/mount/Makefile
+++ b/tools/testing/selftests/mount/Makefile
@@ -3,7 +3,7 @@
 CFLAGS = -Wall \
          -O2
 
-TEST_PROGS := run_tests.sh
-TEST_GEN_FILES := unprivileged-remount-test
+TEST_PROGS := run_unprivileged_remount.sh run_nosymfollow.sh
+TEST_GEN_FILES := unprivileged-remount-test nosymfollow-test
 
 include ../lib.mk
diff --git a/tools/testing/selftests/mount/nosymfollow-test.c b/tools/testing/selftests/mount/nosymfollow-test.c
new file mode 100644
index 0..650d6d80a1d27
--- /dev/null
+++ b/tools/testing/selftests/mount/nosymfollow-test.c
@@ -0,0 +1,218 @@
+// SPDX-License-Identifier: GPL-2.0
+#define _GNU_SOURCE
+#include
+#include
+#include
+#include
+#include
+#include
+#include
+#include
+#include
+#include
+#include
+#include
+#include
+#include
+
+#ifndef MS_NOSYMFOLLOW
+# define MS_NOSYMFOLLOW 256 /* Do not follow symlinks */
+#endif
+
+#ifndef ST_NOSYMFOLLOW
+# define ST_NOSYMFOLLOW 0x2000 /* Do not follow symlinks */
+#endif
+
+#define DATA "/tmp/data"
+#define LINK "/tmp/symlink"
+#define TMP "/tmp"
+
+static void die(char *fmt, ...)
+{
+	va_list ap;
+
+	va_start(ap, fmt);
+	vfprintf(stderr, fmt, ap);
+	va_end(ap);
+	exit(EXIT_FAILURE);
+}
+
+static void vmaybe_write_file(bool enoent_ok, char *filename, char *fmt,
+		va_list ap)
+{
+	ssize_t written;
+	char buf[4096];
+	int buf_len;
+	int fd;
+
+	buf_len = vsnprintf(buf, sizeof(buf), fmt, ap);
+	if (buf_len < 0)
+		die("vsnprintf failed: %s\n", strerror(errno));
+
+	if (buf_len >= sizeof(buf))
+		die("vsnprintf output truncated\n");
+
+	fd = open(filename, O_WRONLY);
+	if (fd < 0) {
+		if ((errno == ENOENT) && enoent_ok)
+			return;
+		die("open of %s failed: %s\n", filename, strerror(errno));
+	}
+
+	written = write(fd, buf, buf_len);
+	if (written != buf_len) {
+		if (written >= 0) {
+			die("short write to %s\n", filename);
+		} else {
+			die("write to %s failed: %s\n",
+				filename, strerror(errno));
+		}
+	}
+
+	if (close(fd) != 0)
+		die("close of %s failed: %s\n", filename, strerror(errno));
+}
+
+static void maybe_write_file(char *filename, char *fmt, ...)
+{
+	va_list ap;
+
+	va_start(ap, fmt);
+	vmaybe_write_file(true, filename, fmt, ap);
+	va_end(ap);
+}
+
+static void write_file(char *filename, char *fmt, ...)
+{
+	va_list ap;
+
+	va_start(ap, fmt);
+	vmaybe_write_file(false, filename, fmt, ap);
+	va_end(ap);
+}
+
+static void create_and_enter_ns(void)
+{
+	uid_t uid = getuid();
+	gid_t gid = getgid();
+
+	if (unshare(CLONE_NEWUSER) != 0)
+		die("unshare(CLONE_NEWUSER) failed: %s\n", strerror(errno));
+
+	maybe_write_file("/proc/self/setgroups", "deny");
+	write_file("/proc/self/uid_map", "0 %d 1", uid);
+	write_file("/proc/self/gid_map", "0 %d 1", gid);
+
+	if (setgid(0) != 0)
+		die("setgid(0) failed %s\n", strerror(errno));
+	if (setuid(0) != 0)
+		die("setuid(0) failed %s\n", strerror(errno));
+
[PATCH v9 1/2] Add a "nosymfollow" mount option.
From: Mattias Nissler

For mounts that have the new "nosymfollow" option, don't follow symlinks
when resolving paths. The new option is similar in spirit to the
existing "nodev", "noexec", and "nosuid" options, as well as to the
LOOKUP_NO_SYMLINKS resolve flag in the openat2(2) syscall. Various BSD
variants have been supporting the "nosymfollow" mount option for a long
time with equivalent implementations.

Note that symlinks may still be created on file systems mounted with
the "nosymfollow" option present. readlink() remains functional, so
user space code that is aware of symlinks can still choose to follow
them explicitly.

Setting the "nosymfollow" mount option helps prevent privileged
writers from modifying files unintentionally in case there is an
unexpected link along the accessed path. The "nosymfollow" option is
thus useful as a defensive measure for systems that need to deal with
untrusted file systems in privileged contexts.

More information on the history and motivation for this patch can be
found here:

https://sites.google.com/a/chromium.org/dev/chromium-os/chromiumos-design-docs/hardening-against-malicious-stateful-data#TOC-Restricting-symlink-traversal

Signed-off-by: Mattias Nissler
Signed-off-by: Ross Zwisler
Reviewed-by: Aleksa Sarai
---
Changes since v8 [1]:
 * Look for MNT_NOSYMFOLLOW in link->mnt->mnt_flags so we are testing
   the link itself rather than the directory holding the link. (Al Viro)
 * Rebased onto v5.9-rc2.

After this lands I will upstream changes to util-linux [2] and man-pages [3].

[1]: https://patchwork.kernel.org/patch/11724607/
[2]: https://github.com/rzwisler/util-linux/commit/7f8771acd85edb70d97921c026c55e1e724d4e15
[3]: https://github.com/rzwisler/man-pages/commit/b8fe8079f64b5068940c0144586e580399a71668
---
 fs/namei.c                 | 3 ++-
 fs/namespace.c             | 2 ++
 fs/proc_namespace.c        | 1 +
 fs/statfs.c                | 2 ++
 include/linux/mount.h      | 3 ++-
 include/linux/statfs.h     | 1 +
 include/uapi/linux/mount.h | 1 +
 7 files changed, 11 insertions(+), 2 deletions(-)

diff --git a/fs/namei.c b/fs/namei.c
index e99e2a9da0f7d..33e8c79bc761e 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -1626,7 +1626,8 @@ static const char *pick_link(struct nameidata *nd, struct path *link,
 		return ERR_PTR(error);
 	}
 
-	if (unlikely(nd->flags & LOOKUP_NO_SYMLINKS))
+	if (unlikely(nd->flags & LOOKUP_NO_SYMLINKS) ||
+	    unlikely(link->mnt->mnt_flags & MNT_NOSYMFOLLOW))
 		return ERR_PTR(-ELOOP);
 
 	if (!(nd->flags & LOOKUP_RCU)) {
diff --git a/fs/namespace.c b/fs/namespace.c
index bae0e95b3713a..6408788a649e1 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -3160,6 +3160,8 @@ int path_mount(const char *dev_name, struct path *path,
 		mnt_flags &= ~(MNT_RELATIME | MNT_NOATIME);
 	if (flags & MS_RDONLY)
 		mnt_flags |= MNT_READONLY;
+	if (flags & MS_NOSYMFOLLOW)
+		mnt_flags |= MNT_NOSYMFOLLOW;
 
 	/* The default atime for remount is preservation */
 	if ((flags & MS_REMOUNT) &&
diff --git a/fs/proc_namespace.c b/fs/proc_namespace.c
index 3059a9394c2d6..e59d4bb3a89e4 100644
--- a/fs/proc_namespace.c
+++ b/fs/proc_namespace.c
@@ -70,6 +70,7 @@ static void show_mnt_opts(struct seq_file *m, struct vfsmount *mnt)
 		{ MNT_NOATIME, ",noatime" },
 		{ MNT_NODIRATIME, ",nodiratime" },
 		{ MNT_RELATIME, ",relatime" },
+		{ MNT_NOSYMFOLLOW, ",nosymfollow" },
 		{ 0, NULL }
 	};
 	const struct proc_fs_opts *fs_infop;
diff --git a/fs/statfs.c b/fs/statfs.c
index 2616424012ea7..59f33752c1311 100644
--- a/fs/statfs.c
+++ b/fs/statfs.c
@@ -29,6 +29,8 @@ static int flags_by_mnt(int mnt_flags)
 		flags |= ST_NODIRATIME;
 	if (mnt_flags & MNT_RELATIME)
 		flags |= ST_RELATIME;
+	if (mnt_flags & MNT_NOSYMFOLLOW)
+		flags |= ST_NOSYMFOLLOW;
 	return flags;
 }
diff --git a/include/linux/mount.h b/include/linux/mount.h
index de657bd211fa6..aaf343b38671c 100644
--- a/include/linux/mount.h
+++ b/include/linux/mount.h
@@ -30,6 +30,7 @@ struct fs_context;
 #define MNT_NODIRATIME	0x10
 #define MNT_RELATIME	0x20
 #define MNT_READONLY	0x40	/* does the user want this to be r/o? */
+#define MNT_NOSYMFOLLOW	0x80
 
 #define MNT_SHRINKABLE	0x100
 #define MNT_WRITE_HOLD	0x200
@@ -46,7 +47,7 @@ struct fs_context;
 #define MNT_SHARED_MASK	(MNT_UNBINDABLE)
 #define MNT_USER_SETTABLE_MASK	(MNT_NOSUID | MNT_NODEV | MNT_NOEXEC \
 				 | MNT_NOATIME | MNT_NODIRATIME | MNT_RELATIME \
-				 | MNT_READONLY)
+|
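As the fs/statfs.c hunk above shows, the option is surfaced to user space
through the new ST_NOSYMFOLLOW bit in statfs(2)'s f_flags. A minimal sketch
of how a program could detect the option — the helper name is mine, and the
local ST_NOSYMFOLLOW fallback definition mirrors what the selftest in patch
2/2 does for libc headers that predate the flag:

```c
#include <stdio.h>
#include <sys/vfs.h>

#ifndef ST_NOSYMFOLLOW
# define ST_NOSYMFOLLOW 0x2000	/* value from include/linux/statfs.h */
#endif

/*
 * Returns 1 if the filesystem containing 'path' is mounted with
 * "nosymfollow", 0 if it is not, and -1 if statfs() fails.
 */
static int mount_is_nosymfollow(const char *path)
{
	struct statfs buf;

	if (statfs(path, &buf) != 0)
		return -1;
	return (buf.f_flags & ST_NOSYMFOLLOW) ? 1 : 0;
}
```

For example, `mount_is_nosymfollow("/")` will normally report 0 unless the
root filesystem itself was mounted with the option.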
Re: [PATCH v8 1/2] Add a "nosymfollow" mount option.
On Fri, Aug 28, 2020 at 01:41:39AM +1000, Aleksa Sarai wrote:
> On 2020-08-27, Al Viro wrote:
> > On Wed, Aug 26, 2020 at 02:48:19PM -0600, Ross Zwisler wrote:
> > > Al, now that the changes to fs/namei.c have landed and we're past the
> > > merge window for v5.9, what are your thoughts on this patch and the
> > > associated test?
> >
> > Humm... should that be nd->path.mnt->mnt_flags or link->mnt->mnt_flags?
> > Usually it's the same thing, but they might differ. IOW, is that about
> > the directory we'd found it in, or is it about the link itself?
>
> Now that you mention it, I think link->mnt->mnt_flags makes more sense.
> The restriction should apply in the context of whatever filesystem
> contains the symlink, and that would match FreeBSD's semantics (at
> least as far as I can tell from a quick look at sys/kern/vfs_lookup.c).

Yep, changing this to link->mnt->mnt_flags makes sense to me, as you're
right that we care about the link itself and not the link's parent
directory. Thank you for the review, and I'll send out v9 momentarily.
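To make the v8→v9 change concrete, here is a small userspace model of the
condition pick_link() ends up with in v9. The struct and flag names mirror
the kernel; MNT_NOSYMFOLLOW's 0x80 is the value from the patch, but the
LOOKUP_NO_SYMLINKS value below is illustrative only, and the function is a
sketch, not kernel code. The point of the change is which vfsmount gets
consulted: the one containing the link itself (link->mnt), not the one the
lookup is currently walking (nd->path.mnt) — the two can differ during a
lookup, which is exactly the case Al asks about.

```c
#include <stdbool.h>

#define LOOKUP_NO_SYMLINKS	0x1	/* illustrative value for this sketch */
#define MNT_NOSYMFOLLOW		0x80	/* from include/linux/mount.h in the patch */

struct vfsmount { unsigned int mnt_flags; };
struct path { struct vfsmount *mnt; };

/*
 * Model of the v9 check in pick_link(): traversal is refused (the kernel
 * returns -ELOOP) if the caller asked for no symlinks, or if the mount
 * containing the link itself is nosymfollow.
 */
static bool symlink_traversal_refused(unsigned int nd_flags,
				      const struct path *link)
{
	return (nd_flags & LOOKUP_NO_SYMLINKS) ||
	       (link->mnt->mnt_flags & MNT_NOSYMFOLLOW);
}
```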
Re: [PATCH v8 1/2] Add a "nosymfollow" mount option.
On Wed, Aug 19, 2020 at 10:43:16AM -0600, Ross Zwisler wrote:
> From: Mattias Nissler
>
> For mounts that have the new "nosymfollow" option, don't follow symlinks
> when resolving paths. The new option is similar in spirit to the
> existing "nodev", "noexec", and "nosuid" options, as well as to the
> LOOKUP_NO_SYMLINKS resolve flag in the openat2(2) syscall. Various BSD
> variants have been supporting the "nosymfollow" mount option for a long
> time with equivalent implementations.
>
> Note that symlinks may still be created on file systems mounted with
> the "nosymfollow" option present. readlink() remains functional, so
> user space code that is aware of symlinks can still choose to follow
> them explicitly.
>
> Setting the "nosymfollow" mount option helps prevent privileged
> writers from modifying files unintentionally in case there is an
> unexpected link along the accessed path. The "nosymfollow" option is
> thus useful as a defensive measure for systems that need to deal with
> untrusted file systems in privileged contexts.
>
> More information on the history and motivation for this patch can be
> found here:
>
> https://sites.google.com/a/chromium.org/dev/chromium-os/chromiumos-design-docs/hardening-against-malicious-stateful-data#TOC-Restricting-symlink-traversal
>
> Signed-off-by: Mattias Nissler
> Signed-off-by: Ross Zwisler
> Reviewed-by: Aleksa Sarai
> ---
> Changes since v7 [1]:
>  * Rebased onto v5.9-rc1.
>  * Added selftest in second patch.
>  * Added Aleksa's Reviewed-By tag. Thank you for the review!
>
> After this lands I will upstream changes to util-linux [2] and man-pages [3].
>
> [1]: https://lkml.org/lkml/2020/8/11/896
> [2]: https://github.com/rzwisler/util-linux/commit/7f8771acd85edb70d97921c026c55e1e724d4e15
> [3]: https://github.com/rzwisler/man-pages/commit/b8fe8079f64b5068940c0144586e580399a71668
> ---

Friendly ping on this. Al, now that the changes to fs/namei.c have landed
and we're past the merge window for v5.9, what are your thoughts on this
patch and the associated test?
[PATCH v8 2/2] selftests: mount: add nosymfollow tests
Add tests for the new 'nosymfollow' mount option. We test to make sure that
symlink traversal fails with ELOOP when 'nosymfollow' is set, but that
readlink(2) and realpath(3) still work as expected. We also verify that
statfs(2) correctly returns ST_NOSYMFOLLOW when we are mounted with the
'nosymfollow' option.

Signed-off-by: Ross Zwisler
---
 tools/testing/selftests/mount/.gitignore      |   1 +
 tools/testing/selftests/mount/Makefile        |   4 +-
 .../selftests/mount/nosymfollow-test.c        | 218 ++
 .../selftests/mount/run_nosymfollow.sh        |   4 +
 ...n_tests.sh => run_unprivileged_remount.sh} |   0
 5 files changed, 225 insertions(+), 2 deletions(-)
 create mode 100644 tools/testing/selftests/mount/nosymfollow-test.c
 create mode 100755 tools/testing/selftests/mount/run_nosymfollow.sh
 rename tools/testing/selftests/mount/{run_tests.sh => run_unprivileged_remount.sh} (100%)

diff --git a/tools/testing/selftests/mount/.gitignore b/tools/testing/selftests/mount/.gitignore
index 0bc64a6d4c181..17f2d84151622 100644
--- a/tools/testing/selftests/mount/.gitignore
+++ b/tools/testing/selftests/mount/.gitignore
@@ -1,2 +1,3 @@
 # SPDX-License-Identifier: GPL-2.0-only
 unprivileged-remount-test
+nosymfollow-test
diff --git a/tools/testing/selftests/mount/Makefile b/tools/testing/selftests/mount/Makefile
index 026890744215b..2d9454841644a 100644
--- a/tools/testing/selftests/mount/Makefile
+++ b/tools/testing/selftests/mount/Makefile
@@ -3,7 +3,7 @@
 CFLAGS = -Wall \
          -O2
 
-TEST_PROGS := run_tests.sh
-TEST_GEN_FILES := unprivileged-remount-test
+TEST_PROGS := run_unprivileged_remount.sh run_nosymfollow.sh
+TEST_GEN_FILES := unprivileged-remount-test nosymfollow-test
 
 include ../lib.mk
diff --git a/tools/testing/selftests/mount/nosymfollow-test.c b/tools/testing/selftests/mount/nosymfollow-test.c
new file mode 100644
index 0..650d6d80a1d27
--- /dev/null
+++ b/tools/testing/selftests/mount/nosymfollow-test.c
@@ -0,0 +1,218 @@
+// SPDX-License-Identifier: GPL-2.0
+#define _GNU_SOURCE
+#include
+#include
+#include
+#include
+#include
+#include
+#include
+#include
+#include
+#include
+#include
+#include
+#include
+#include
+
+#ifndef MS_NOSYMFOLLOW
+# define MS_NOSYMFOLLOW 256 /* Do not follow symlinks */
+#endif
+
+#ifndef ST_NOSYMFOLLOW
+# define ST_NOSYMFOLLOW 0x2000 /* Do not follow symlinks */
+#endif
+
+#define DATA "/tmp/data"
+#define LINK "/tmp/symlink"
+#define TMP "/tmp"
+
+static void die(char *fmt, ...)
+{
+	va_list ap;
+
+	va_start(ap, fmt);
+	vfprintf(stderr, fmt, ap);
+	va_end(ap);
+	exit(EXIT_FAILURE);
+}
+
+static void vmaybe_write_file(bool enoent_ok, char *filename, char *fmt,
+		va_list ap)
+{
+	ssize_t written;
+	char buf[4096];
+	int buf_len;
+	int fd;
+
+	buf_len = vsnprintf(buf, sizeof(buf), fmt, ap);
+	if (buf_len < 0)
+		die("vsnprintf failed: %s\n", strerror(errno));
+
+	if (buf_len >= sizeof(buf))
+		die("vsnprintf output truncated\n");
+
+	fd = open(filename, O_WRONLY);
+	if (fd < 0) {
+		if ((errno == ENOENT) && enoent_ok)
+			return;
+		die("open of %s failed: %s\n", filename, strerror(errno));
+	}
+
+	written = write(fd, buf, buf_len);
+	if (written != buf_len) {
+		if (written >= 0) {
+			die("short write to %s\n", filename);
+		} else {
+			die("write to %s failed: %s\n",
+				filename, strerror(errno));
+		}
+	}
+
+	if (close(fd) != 0)
+		die("close of %s failed: %s\n", filename, strerror(errno));
+}
+
+static void maybe_write_file(char *filename, char *fmt, ...)
+{
+	va_list ap;
+
+	va_start(ap, fmt);
+	vmaybe_write_file(true, filename, fmt, ap);
+	va_end(ap);
+}
+
+static void write_file(char *filename, char *fmt, ...)
+{
+	va_list ap;
+
+	va_start(ap, fmt);
+	vmaybe_write_file(false, filename, fmt, ap);
+	va_end(ap);
+}
+
+static void create_and_enter_ns(void)
+{
+	uid_t uid = getuid();
+	gid_t gid = getgid();
+
+	if (unshare(CLONE_NEWUSER) != 0)
+		die("unshare(CLONE_NEWUSER) failed: %s\n", strerror(errno));
+
+	maybe_write_file("/proc/self/setgroups", "deny");
+	write_file("/proc/self/uid_map", "0 %d 1", uid);
+	write_file("/proc/self/gid_map", "0 %d 1", gid);
+
+	if (setgid(0) != 0)
+		die("setgid(0) failed %s\n", strerror(errno));
+	if (setuid(0) != 0)
+		die("setuid(0) failed %s\n", strerror(errno));
+
[PATCH v8 1/2] Add a "nosymfollow" mount option.
From: Mattias Nissler

For mounts that have the new "nosymfollow" option, don't follow symlinks
when resolving paths. The new option is similar in spirit to the
existing "nodev", "noexec", and "nosuid" options, as well as to the
LOOKUP_NO_SYMLINKS resolve flag in the openat2(2) syscall. Various BSD
variants have been supporting the "nosymfollow" mount option for a long
time with equivalent implementations.

Note that symlinks may still be created on file systems mounted with
the "nosymfollow" option present. readlink() remains functional, so
user space code that is aware of symlinks can still choose to follow
them explicitly.

Setting the "nosymfollow" mount option helps prevent privileged
writers from modifying files unintentionally in case there is an
unexpected link along the accessed path. The "nosymfollow" option is
thus useful as a defensive measure for systems that need to deal with
untrusted file systems in privileged contexts.

More information on the history and motivation for this patch can be
found here:

https://sites.google.com/a/chromium.org/dev/chromium-os/chromiumos-design-docs/hardening-against-malicious-stateful-data#TOC-Restricting-symlink-traversal

Signed-off-by: Mattias Nissler
Signed-off-by: Ross Zwisler
Reviewed-by: Aleksa Sarai
---
Changes since v7 [1]:
 * Rebased onto v5.9-rc1.
 * Added selftest in second patch.
 * Added Aleksa's Reviewed-By tag. Thank you for the review!

After this lands I will upstream changes to util-linux [2] and man-pages [3].

[1]: https://lkml.org/lkml/2020/8/11/896
[2]: https://github.com/rzwisler/util-linux/commit/7f8771acd85edb70d97921c026c55e1e724d4e15
[3]: https://github.com/rzwisler/man-pages/commit/b8fe8079f64b5068940c0144586e580399a71668
---
 fs/namei.c                 | 3 ++-
 fs/namespace.c             | 2 ++
 fs/proc_namespace.c        | 1 +
 fs/statfs.c                | 2 ++
 include/linux/mount.h      | 3 ++-
 include/linux/statfs.h     | 1 +
 include/uapi/linux/mount.h | 1 +
 7 files changed, 11 insertions(+), 2 deletions(-)

diff --git a/fs/namei.c b/fs/namei.c
index e99e2a9da0f7d..12d92af2e2ca7 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -1626,7 +1626,8 @@ static const char *pick_link(struct nameidata *nd, struct path *link,
 		return ERR_PTR(error);
 	}
 
-	if (unlikely(nd->flags & LOOKUP_NO_SYMLINKS))
+	if (unlikely(nd->flags & LOOKUP_NO_SYMLINKS) ||
+	    unlikely(nd->path.mnt->mnt_flags & MNT_NOSYMFOLLOW))
 		return ERR_PTR(-ELOOP);
 
 	if (!(nd->flags & LOOKUP_RCU)) {
diff --git a/fs/namespace.c b/fs/namespace.c
index bae0e95b3713a..6408788a649e1 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -3160,6 +3160,8 @@ int path_mount(const char *dev_name, struct path *path,
 		mnt_flags &= ~(MNT_RELATIME | MNT_NOATIME);
 	if (flags & MS_RDONLY)
 		mnt_flags |= MNT_READONLY;
+	if (flags & MS_NOSYMFOLLOW)
+		mnt_flags |= MNT_NOSYMFOLLOW;
 
 	/* The default atime for remount is preservation */
 	if ((flags & MS_REMOUNT) &&
diff --git a/fs/proc_namespace.c b/fs/proc_namespace.c
index 3059a9394c2d6..e59d4bb3a89e4 100644
--- a/fs/proc_namespace.c
+++ b/fs/proc_namespace.c
@@ -70,6 +70,7 @@ static void show_mnt_opts(struct seq_file *m, struct vfsmount *mnt)
 		{ MNT_NOATIME, ",noatime" },
 		{ MNT_NODIRATIME, ",nodiratime" },
 		{ MNT_RELATIME, ",relatime" },
+		{ MNT_NOSYMFOLLOW, ",nosymfollow" },
 		{ 0, NULL }
 	};
 	const struct proc_fs_opts *fs_infop;
diff --git a/fs/statfs.c b/fs/statfs.c
index 2616424012ea7..59f33752c1311 100644
--- a/fs/statfs.c
+++ b/fs/statfs.c
@@ -29,6 +29,8 @@ static int flags_by_mnt(int mnt_flags)
 		flags |= ST_NODIRATIME;
 	if (mnt_flags & MNT_RELATIME)
 		flags |= ST_RELATIME;
+	if (mnt_flags & MNT_NOSYMFOLLOW)
+		flags |= ST_NOSYMFOLLOW;
 	return flags;
 }
diff --git a/include/linux/mount.h b/include/linux/mount.h
index de657bd211fa6..aaf343b38671c 100644
--- a/include/linux/mount.h
+++ b/include/linux/mount.h
@@ -30,6 +30,7 @@ struct fs_context;
 #define MNT_NODIRATIME	0x10
 #define MNT_RELATIME	0x20
 #define MNT_READONLY	0x40	/* does the user want this to be r/o? */
+#define MNT_NOSYMFOLLOW	0x80
 
 #define MNT_SHRINKABLE	0x100
 #define MNT_WRITE_HOLD	0x200
@@ -46,7 +47,7 @@ struct fs_context;
 #define MNT_SHARED_MASK	(MNT_UNBINDABLE)
 #define MNT_USER_SETTABLE_MASK	(MNT_NOSUID | MNT_NODEV | MNT_NOEXEC \
 				 | MNT_NOATIME | MNT_NODIRATIME | MNT_RELATIME \
-				 | MNT_READONLY)
+				 | MNT_READONLY | MNT_NOSYMFOLLOW)
 
 #define MNT_ATIME_MAS
Re: [PATCH v7] Add a "nosymfollow" mount option.
On Wed, Aug 12, 2020 at 12:36 PM Matthew Wilcox wrote:
> On Tue, Aug 11, 2020 at 04:28:03PM -0600, Ross Zwisler wrote:
> > diff --git a/include/uapi/linux/mount.h b/include/uapi/linux/mount.h
> > index 96a0240f23fed..dd8306ea336c1 100644
> > --- a/include/uapi/linux/mount.h
> > +++ b/include/uapi/linux/mount.h
> > @@ -16,6 +16,7 @@
> >  #define MS_REMOUNT	32	/* Alter flags of a mounted FS */
> >  #define MS_MANDLOCK	64	/* Allow mandatory locks on an FS */
> >  #define MS_DIRSYNC	128	/* Directory modifications are synchronous */
> > +#define MS_NOSYMFOLLOW	256	/* Do not follow symlinks */
> >  #define MS_NOATIME	1024	/* Do not update access times. */
> >  #define MS_NODIRATIME	2048	/* Do not update directory access times */
> >  #define MS_BIND	4096
>
> Does this need to be added to MS_RMT_MASK too?

I don't think so, because IIUC that is only for "superblock flags", i.e.
flags which are part of the sb_flags definition in do_mount()/path_mount()?

https://github.com/torvalds/linux/blob/master/fs/namespace.c#L3172

With the current code I'm able to remount and flip nosymfollow both
directions without issue. Similarly, MS_NOEXEC can be turned on and off at
will, and it's not part of MS_RMT_MASK nor sb_flags.

Interestingly, though, if you change MS_RMT_MASK to be 0, I would expect us
to be unable to twiddle all the flags which are currently part of it, but
that isn't the case. I was still able to change between RO/RW, and turn on
lazytime. So, I think this flag is working as expected, but that there is
probably a bug in there somewhere WRT the handling of MS_RMT_MASK.
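The reason remount works without touching MS_RMT_MASK can be sketched as a
reduced userspace model of the MS_* to MNT_* translation that path_mount()
performs (the function name and the reduced flag set here are mine; the
constant values are the ones visible in the patch hunks): per-mount MNT_*
flags are recomputed from the request bits on every mount(2) call, including
remounts, so a per-mount flag like nosymfollow does not need to be in
MS_RMT_MASK, which only gates superblock flags.

```c
#define MS_RDONLY	1	/* from include/uapi/linux/mount.h */
#define MS_NOSYMFOLLOW	256	/* from include/uapi/linux/mount.h */

#define MNT_READONLY	0x40	/* from include/linux/mount.h */
#define MNT_NOSYMFOLLOW	0x80	/* from include/linux/mount.h */

/*
 * Reduced model of the MS_* -> MNT_* translation in path_mount():
 * the per-mount flags are derived afresh from the request flags,
 * which is why a remount can flip nosymfollow in either direction.
 */
static unsigned int ms_to_mnt_flags(unsigned long ms_flags)
{
	unsigned int mnt_flags = 0;

	if (ms_flags & MS_RDONLY)
		mnt_flags |= MNT_READONLY;
	if (ms_flags & MS_NOSYMFOLLOW)
		mnt_flags |= MNT_NOSYMFOLLOW;
	return mnt_flags;
}
```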
Re: [PATCH v7] Add a "nosymfollow" mount option.
On Tue, Aug 11, 2020 at 7:43 PM Aleksa Sarai wrote:
> On 2020-08-11, Ross Zwisler wrote:
> > From: Mattias Nissler
> >
> > For mounts that have the new "nosymfollow" option, don't follow symlinks
> > when resolving paths. The new option is similar in spirit to the
> > existing "nodev", "noexec", and "nosuid" options, as well as to the
> > LOOKUP_NO_SYMLINKS resolve flag in the openat2(2) syscall. Various BSD
> > variants have been supporting the "nosymfollow" mount option for a long
> > time with equivalent implementations.
> >
> > Note that symlinks may still be created on file systems mounted with
> > the "nosymfollow" option present. readlink() remains functional, so
> > user space code that is aware of symlinks can still choose to follow
> > them explicitly.
> >
> > Setting the "nosymfollow" mount option helps prevent privileged
> > writers from modifying files unintentionally in case there is an
> > unexpected link along the accessed path. The "nosymfollow" option is
> > thus useful as a defensive measure for systems that need to deal with
> > untrusted file systems in privileged contexts.
> >
> > More information on the history and motivation for this patch can be
> > found here:
> >
> > https://sites.google.com/a/chromium.org/dev/chromium-os/chromiumos-design-docs/hardening-against-malicious-stateful-data#TOC-Restricting-symlink-traversal
>
> Looks good. Did you plan to add an in-tree test for this (you could
> shove it in tools/testing/selftests/mount)?

Sure, that sounds like a good idea. I'll take a look.

> Reviewed-by: Aleksa Sarai

Thank you for the review.
[PATCH v7] Add a "nosymfollow" mount option.
From: Mattias Nissler

For mounts that have the new "nosymfollow" option, don't follow symlinks
when resolving paths. The new option is similar in spirit to the
existing "nodev", "noexec", and "nosuid" options, as well as to the
LOOKUP_NO_SYMLINKS resolve flag in the openat2(2) syscall. Various BSD
variants have been supporting the "nosymfollow" mount option for a long
time with equivalent implementations.

Note that symlinks may still be created on file systems mounted with
the "nosymfollow" option present. readlink() remains functional, so
user space code that is aware of symlinks can still choose to follow
them explicitly.

Setting the "nosymfollow" mount option helps prevent privileged
writers from modifying files unintentionally in case there is an
unexpected link along the accessed path. The "nosymfollow" option is
thus useful as a defensive measure for systems that need to deal with
untrusted file systems in privileged contexts.

More information on the history and motivation for this patch can be
found here:

https://sites.google.com/a/chromium.org/dev/chromium-os/chromiumos-design-docs/hardening-against-malicious-stateful-data#TOC-Restricting-symlink-traversal

Signed-off-by: Mattias Nissler
Signed-off-by: Ross Zwisler
---
Changes since v6 [1]:
 * Rebased onto v5.8.
 * Another round of testing including readlink(1), readlink(2),
   realpath(1), realpath(3), statfs(2) and mount(2) to make sure
   everything still works.

After this lands I will upstream changes to util-linux [2] and man-pages [3].

[1]: https://lkml.org/lkml/2020/3/4/770
[2]: https://github.com/rzwisler/util-linux/commit/7f8771acd85edb70d97921c026c55e1e724d4e15
[3]: https://github.com/rzwisler/man-pages/commit/b8fe8079f64b5068940c0144586e580399a71668
---
 fs/namei.c                 | 3 ++-
 fs/namespace.c             | 2 ++
 fs/proc_namespace.c        | 1 +
 fs/statfs.c                | 2 ++
 include/linux/mount.h      | 3 ++-
 include/linux/statfs.h     | 1 +
 include/uapi/linux/mount.h | 1 +
 7 files changed, 11 insertions(+), 2 deletions(-)

diff --git a/fs/namei.c b/fs/namei.c
index 72d4219c93acb..ed68478fb1fb6 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -1626,7 +1626,8 @@ static const char *pick_link(struct nameidata *nd, struct path *link,
 		return ERR_PTR(error);
 	}
 
-	if (unlikely(nd->flags & LOOKUP_NO_SYMLINKS))
+	if (unlikely(nd->flags & LOOKUP_NO_SYMLINKS) ||
+	    unlikely(nd->path.mnt->mnt_flags & MNT_NOSYMFOLLOW))
 		return ERR_PTR(-ELOOP);
 
 	if (!(nd->flags & LOOKUP_RCU)) {
diff --git a/fs/namespace.c b/fs/namespace.c
index 4a0f600a33285..1cbbf5a9b954f 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -3167,6 +3167,8 @@ long do_mount(const char *dev_name, const char __user *dir_name,
 		mnt_flags &= ~(MNT_RELATIME | MNT_NOATIME);
 	if (flags & MS_RDONLY)
 		mnt_flags |= MNT_READONLY;
+	if (flags & MS_NOSYMFOLLOW)
+		mnt_flags |= MNT_NOSYMFOLLOW;
 
 	/* The default atime for remount is preservation */
 	if ((flags & MS_REMOUNT) &&
diff --git a/fs/proc_namespace.c b/fs/proc_namespace.c
index 3059a9394c2d6..e59d4bb3a89e4 100644
--- a/fs/proc_namespace.c
+++ b/fs/proc_namespace.c
@@ -70,6 +70,7 @@ static void show_mnt_opts(struct seq_file *m, struct vfsmount *mnt)
 		{ MNT_NOATIME, ",noatime" },
 		{ MNT_NODIRATIME, ",nodiratime" },
 		{ MNT_RELATIME, ",relatime" },
+		{ MNT_NOSYMFOLLOW, ",nosymfollow" },
 		{ 0, NULL }
 	};
 	const struct proc_fs_opts *fs_infop;
diff --git a/fs/statfs.c b/fs/statfs.c
index 2616424012ea7..59f33752c1311 100644
--- a/fs/statfs.c
+++ b/fs/statfs.c
@@ -29,6 +29,8 @@ static int flags_by_mnt(int mnt_flags)
 		flags |= ST_NODIRATIME;
 	if (mnt_flags & MNT_RELATIME)
 		flags |= ST_RELATIME;
+	if (mnt_flags & MNT_NOSYMFOLLOW)
+		flags |= ST_NOSYMFOLLOW;
 	return flags;
 }
diff --git a/include/linux/mount.h b/include/linux/mount.h
index de657bd211fa6..aaf343b38671c 100644
--- a/include/linux/mount.h
+++ b/include/linux/mount.h
@@ -30,6 +30,7 @@ struct fs_context;
 #define MNT_NODIRATIME	0x10
 #define MNT_RELATIME	0x20
 #define MNT_READONLY	0x40	/* does the user want this to be r/o? */
+#define MNT_NOSYMFOLLOW	0x80
 
 #define MNT_SHRINKABLE	0x100
 #define MNT_WRITE_HOLD	0x200
@@ -46,7 +47,7 @@ struct fs_context;
 #define MNT_SHARED_MASK	(MNT_UNBINDABLE)
 #define MNT_USER_SETTABLE_MASK	(MNT_NOSUID | MNT_NODEV | MNT_NOEXEC \
 				 | MNT_NOATIME | MNT_NODIRATIME | MNT_RELATIME \
-				 | MNT_READONLY)
+				 | MNT_READONLY | MNT_
[tip:x86/urgent] Revert "x86/build: Move _etext to actual end of .text"
Commit-ID:  013c66edf207ddb78422b8b636f56c87939c9e34
Gitweb:     https://git.kernel.org/tip/013c66edf207ddb78422b8b636f56c87939c9e34
Author:     Ross Zwisler
AuthorDate: Mon, 1 Jul 2019 09:52:08 -0600
Committer:  Ingo Molnar
CommitDate: Tue, 9 Jul 2019 13:57:31 +0200

Revert "x86/build: Move _etext to actual end of .text"

This reverts commit 392bef709659abea614abfe53cf228e7a59876a4.

Per the discussion here:

  https://lkml.kernel.org/r/201906201042.3BF5CD6@keescook

the above referenced commit breaks kernel compilation with old GCC
toolchains as well as current versions of the Gold linker. Revert it to
fix the regression and to keep the ability to compile the kernel with
these tools.

Signed-off-by: Ross Zwisler
Signed-off-by: Thomas Gleixner
Reviewed-by: Guenter Roeck
Cc:
Cc: "H. Peter Anvin"
Cc: Borislav Petkov
Cc: Kees Cook
Cc: Johannes Hirte
Cc: Klaus Kusche
Cc: samitolva...@google.com
Cc: Guenter Roeck
Link: https://lkml.kernel.org/r/20190701155208.211815-1-zwis...@google.com
Signed-off-by: Ingo Molnar
---
 arch/x86/kernel/vmlinux.lds.S | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/arch/x86/kernel/vmlinux.lds.S b/arch/x86/kernel/vmlinux.lds.S
index 0850b5149345..4d1517022a14 100644
--- a/arch/x86/kernel/vmlinux.lds.S
+++ b/arch/x86/kernel/vmlinux.lds.S
@@ -141,10 +141,10 @@ SECTIONS
 		*(.text.__x86.indirect_thunk)
 		__indirect_thunk_end = .;
 #endif
-	} :text = 0x9090
 
-	/* End of text section */
-	_etext = .;
+	/* End of text section */
+	_etext = .;
+	} :text = 0x9090
 
 	NOTES :text :note
[tip:x86/urgent] Revert "x86/build: Move _etext to actual end of .text"
Commit-ID:  77a1619947ab31564aed54621d5b1e34af9b395d
Gitweb:     https://git.kernel.org/tip/77a1619947ab31564aed54621d5b1e34af9b395d
Author:     Ross Zwisler
AuthorDate: Mon, 1 Jul 2019 09:52:08 -0600
Committer:  Thomas Gleixner
CommitDate: Tue, 2 Jul 2019 21:09:44 +0200

Revert "x86/build: Move _etext to actual end of .text"

This reverts commit 392bef709659abea614abfe53cf228e7a59876a4.

Per the discussion here:

  https://lkml.kernel.org/r/201906201042.3BF5CD6@keescook

the above referenced commit breaks kernel compilation with old GCC
toolchains as well as current versions of the Gold linker. Revert it to
fix the regression and to keep the ability to compile the kernel with
these tools.

Signed-off-by: Ross Zwisler
Signed-off-by: Thomas Gleixner
Reviewed-by: Guenter Roeck
Cc: "H. Peter Anvin"
Cc: Borislav Petkov
Cc: Kees Cook
Cc: Johannes Hirte
Cc: Klaus Kusche
Cc: samitolva...@google.com
Cc: Guenter Roeck
Link: https://lkml.kernel.org/r/20190701155208.211815-1-zwis...@google.com
---
 arch/x86/kernel/vmlinux.lds.S | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/arch/x86/kernel/vmlinux.lds.S b/arch/x86/kernel/vmlinux.lds.S
index 0850b5149345..4d1517022a14 100644
--- a/arch/x86/kernel/vmlinux.lds.S
+++ b/arch/x86/kernel/vmlinux.lds.S
@@ -141,10 +141,10 @@ SECTIONS
 		*(.text.__x86.indirect_thunk)
 		__indirect_thunk_end = .;
 #endif
-	} :text = 0x9090
 
-	/* End of text section */
-	_etext = .;
+	/* End of text section */
+	_etext = .;
+	} :text = 0x9090
 
 	NOTES :text :note
[PATCH] Revert "x86/build: Move _etext to actual end of .text"
This reverts commit 392bef709659abea614abfe53cf228e7a59876a4.

Per the discussion here:

  https://lkml.org/lkml/2019/6/20/830

the above referenced commit breaks kernel compilation with old GCC
toolchains as well as current versions of the Gold linker. Revert it so
we don't regress and lose the ability to compile the kernel with these
tools.

Signed-off-by: Ross Zwisler
---
 arch/x86/kernel/vmlinux.lds.S | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/arch/x86/kernel/vmlinux.lds.S b/arch/x86/kernel/vmlinux.lds.S
index 0850b5149345..4d1517022a14 100644
--- a/arch/x86/kernel/vmlinux.lds.S
+++ b/arch/x86/kernel/vmlinux.lds.S
@@ -141,10 +141,10 @@ SECTIONS
 		*(.text.__x86.indirect_thunk)
 		__indirect_thunk_end = .;
 #endif
-	} :text = 0x9090
-	/* End of text section */
-	_etext = .;
+	/* End of text section */
+	_etext = .;
+	} :text = 0x9090

 	NOTES :text :note
--
2.20.1
Re: [PATCH v2 3/3] ext4: use jbd2_inode dirty range scoping
On Mon, Jun 24, 2019 at 02:54:49AM +0800, kbuild test robot wrote:
> Hi Ross,
>
> Thank you for the patch! Yet something to improve:
>
> [auto build test ERROR on linus/master]
> [also build test ERROR on v5.2-rc6 next-20190621]
> [if your patch is applied to the wrong git tree, please drop us a note to
> help improve the system]
>
> url:    https://github.com/0day-ci/linux/commits/Ross-Zwisler/mm-add-filemap_fdatawait_range_keep_errors/20190623-181603
> config: x86_64-rhel-7.6 (attached as .config)
> compiler: gcc-7 (Debian 7.3.0-1) 7.3.0
> reproduce:
>         # save the attached .config to linux build tree
>         make ARCH=x86_64
>
> If you fix the issue, kindly add following tag
> Reported-by: kbuild test robot
>
> All errors (new ones prefixed by >>):
>
> >> ERROR: "jbd2_journal_inode_ranged_wait" [fs/ext4/ext4.ko] undefined!
> >> ERROR: "jbd2_journal_inode_ranged_write" [fs/ext4/ext4.ko] undefined!

Yep, this is caused by the lack of EXPORT_SYMBOL() calls for these two
new jbd2 functions.  Ted also pointed this out and fixed this up when
he was committing:

https://patchwork.kernel.org/patch/11007139/#22717091

Thank you for the report!
Re: [PATCH v2 2/3] jbd2: introduce jbd2_inode dirty range scoping
On Thu, Jun 20, 2019 at 3:25 PM Theodore Ts'o wrote:
> On Thu, Jun 20, 2019 at 09:18:38AM -0600, Ross Zwisler wrote:
> > diff --git a/include/linux/jbd2.h b/include/linux/jbd2.h
> > index 5c04181b7c6d8..0e0393e7f41a4 100644
> > --- a/include/linux/jbd2.h
> > +++ b/include/linux/jbd2.h
> > @@ -1397,6 +1413,12 @@
> >  extern int jbd2_journal_force_commit(journal_t *);
> >  extern int jbd2_journal_force_commit_nested(journal_t *);
> >  extern int jbd2_journal_inode_add_write(handle_t *handle, struct jbd2_inode *inode);
> >  extern int jbd2_journal_inode_add_wait(handle_t *handle, struct jbd2_inode *inode);
> > +extern int jbd2_journal_inode_ranged_write(handle_t *handle,
> > +		struct jbd2_inode *inode, loff_t start_byte,
> > +		loff_t length);
> > +extern int jbd2_journal_inode_ranged_wait(handle_t *handle,
> > +		struct jbd2_inode *inode, loff_t start_byte,
> > +		loff_t length);
> >  extern int jbd2_journal_begin_ordered_truncate(journal_t *journal,
> >  		struct jbd2_inode *inode, loff_t new_size);
> >  extern void jbd2_journal_init_jbd_inode(struct jbd2_inode *jinode,
> >  		struct inode *inode);
>
> You're adding two new functions that are called from outside the jbd2
> subsystem.  To support compiling jbd2 as a module, we also need to add
> EXPORT_SYMBOL declarations for these two functions.
>
> I'll take care of this when applying this change.

Ah, yep, great catch. Thanks!
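[Editor's note] The fix Ted describes is just an `EXPORT_SYMBOL()` next to each new function's definition, so that ext4 built as a module can link against jbd2 built as a module. A schematic kernel-style fragment, not standalone-buildable (the function body and surrounding code are elided):

```
/* fs/jbd2/transaction.c (sketch) */
int jbd2_journal_inode_ranged_write(handle_t *handle,
		struct jbd2_inode *jinode, loff_t start_byte, loff_t length)
{
	/* ... implementation as in the patch ... */
}
/* Without this, modpost reports the symbol undefined in ext4.ko,
 * exactly as in the kbuild robot's error output. */
EXPORT_SYMBOL(jbd2_journal_inode_ranged_write);
```

The exact placement Ted used when committing may differ; this only illustrates the mechanism.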
[PATCH v2 2/3] jbd2: introduce jbd2_inode dirty range scoping
Currently both journal_submit_inode_data_buffers() and
journal_finish_inode_data_buffers() operate on the entire address space
of each of the inodes associated with a given journal entry.  The
consequence of this is that if we have an inode where we are constantly
appending dirty pages we can end up waiting for an indefinite amount of
time in journal_finish_inode_data_buffers() while we wait for all the
pages under writeback to be written out.

The easiest way to cause this type of workload is to just dd from
/dev/zero to a file until it fills the entire filesystem.  This can
cause journal_finish_inode_data_buffers() to wait for the duration of
the entire dd operation.

We can improve this situation by scoping each of the inode dirty ranges
associated with a given transaction.  We do this via the jbd2_inode
structure so that the scoping is contained within jbd2 and so that it
follows the lifetime and locking rules for that structure.

This allows us to limit the writeback & wait in
journal_submit_inode_data_buffers() and
journal_finish_inode_data_buffers() respectively to the dirty range for
a given struct jbd2_inode, keeping us from waiting forever if the inode
in question is still being appended to.

Signed-off-by: Ross Zwisler
Reviewed-by: Jan Kara
Cc: sta...@vger.kernel.org
---
 fs/jbd2/commit.c      | 23 ++--
 fs/jbd2/journal.c     |  2 ++
 fs/jbd2/transaction.c | 49 ---
 include/linux/jbd2.h  | 22 +++
 4 files changed, 69 insertions(+), 27 deletions(-)

diff --git a/fs/jbd2/commit.c b/fs/jbd2/commit.c
index efd0ce9489ae9..668f9021cf115 100644
--- a/fs/jbd2/commit.c
+++ b/fs/jbd2/commit.c
@@ -187,14 +187,15 @@ static int journal_wait_on_commit_record(journal_t *journal,
  * use writepages() because with dealyed allocation we may be doing
  * block allocation in writepages().
  */
-static int journal_submit_inode_data_buffers(struct address_space *mapping)
+static int journal_submit_inode_data_buffers(struct address_space *mapping,
+		loff_t dirty_start, loff_t dirty_end)
 {
 	int ret;
 	struct writeback_control wbc = {
 		.sync_mode =  WB_SYNC_ALL,
 		.nr_to_write = mapping->nrpages * 2,
-		.range_start = 0,
-		.range_end = i_size_read(mapping->host),
+		.range_start = dirty_start,
+		.range_end = dirty_end,
 	};

 	ret = generic_writepages(mapping, &wbc);
@@ -218,6 +219,9 @@ static int journal_submit_data_buffers(journal_t *journal,
 	spin_lock(&journal->j_list_lock);
 	list_for_each_entry(jinode, &commit_transaction->t_inode_list, i_list) {
+		loff_t dirty_start = jinode->i_dirty_start;
+		loff_t dirty_end = jinode->i_dirty_end;
+
 		if (!(jinode->i_flags & JI_WRITE_DATA))
 			continue;
 		mapping = jinode->i_vfs_inode->i_mapping;
@@ -230,7 +234,8 @@ static int journal_submit_data_buffers(journal_t *journal,
 		 * only allocated blocks here.
 		 */
 		trace_jbd2_submit_inode_data(jinode->i_vfs_inode);
-		err = journal_submit_inode_data_buffers(mapping);
+		err = journal_submit_inode_data_buffers(mapping, dirty_start,
+				dirty_end);
 		if (!ret)
 			ret = err;
 		spin_lock(&journal->j_list_lock);
@@ -257,12 +262,16 @@ static int journal_finish_inode_data_buffers(journal_t *journal,
 	/* For locking, see the comment in journal_submit_data_buffers() */
 	spin_lock(&journal->j_list_lock);
 	list_for_each_entry(jinode, &commit_transaction->t_inode_list, i_list) {
+		loff_t dirty_start = jinode->i_dirty_start;
+		loff_t dirty_end = jinode->i_dirty_end;
+
 		if (!(jinode->i_flags & JI_WAIT_DATA))
 			continue;
 		jinode->i_flags |= JI_COMMIT_RUNNING;
 		spin_unlock(&journal->j_list_lock);
-		err = filemap_fdatawait_keep_errors(
-				jinode->i_vfs_inode->i_mapping);
+		err = filemap_fdatawait_range_keep_errors(
+				jinode->i_vfs_inode->i_mapping, dirty_start,
+				dirty_end);
 		if (!ret)
 			ret = err;
 		spin_lock(&journal->j_list_lock);
@@ -282,6 +291,8 @@ static int journal_finish_inode_data_buffers(journal_t *journal,
 				&jinode->i_transaction->t_inode_list);
 		} else {
 			jinode->i_transaction = NULL;
+			jinode->i_dirty_start = 0;
+			jinode->i_dirty_end = 0;
 		}
 	}
 	spin_unloc
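[Editor's note] The bookkeeping this patch adds to struct jbd2_inode is simple to model in userspace: each recorded write widens a single [dirty_start, dirty_end] interval, so commit-time writeback and waiting can be bounded to that interval instead of the whole address space. A minimal sketch under the assumption of one range per inode (the names `jinode_model` / `jinode_add_write` are hypothetical stand-ins, not the kernel API):

```c
/*
 * Userspace model of jbd2 dirty-range scoping: track the union of all
 * byte ranges written under a transaction as a single [start, end]
 * interval, widening it on each new write.
 */
struct jinode_model {
	long long i_dirty_start;
	long long i_dirty_end;
	int dirty;	/* has any range been recorded yet? */
};

/* Record a write of `length` bytes starting at `start_byte`. */
void jinode_add_write(struct jinode_model *j,
		      long long start_byte, long long length)
{
	long long end_byte = start_byte + length - 1; /* inclusive */

	if (!j->dirty) {
		j->i_dirty_start = start_byte;
		j->i_dirty_end = end_byte;
		j->dirty = 1;
		return;
	}
	/* Widen the existing interval to cover the new write. */
	if (start_byte < j->i_dirty_start)
		j->i_dirty_start = start_byte;
	if (end_byte > j->i_dirty_end)
		j->i_dirty_end = end_byte;
}
```

An appending workload (like the dd case in the commit message) only ever grows `i_dirty_end`, so a commit that snapshots the range waits on a bounded interval even while the append continues past it.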
[PATCH v2 0/3] Add dirty range scoping to jbd2
Changes from v1:
 - Relocated the code which resets dirty range upon transaction
   completion. (Jan)
 - Cc'd sta...@vger.kernel.org because we see this issue with v4.14 and
   v4.19 stable kernels in the field.

---

This patch series fixes the issue I described here:

https://www.spinics.net/lists/linux-block/msg38274.html

Essentially the issue is that journal_finish_inode_data_buffers()
operates on the entire address space of each of the inodes associated
with a given journal entry.  This means that if we have an inode where
we are constantly appending dirty pages we can end up waiting for an
indefinite amount of time in journal_finish_inode_data_buffers().

This series improves this situation in ext4 by scoping each of the
inode dirty ranges associated with a given transaction.  Other users of
jbd2 which don't (yet?) take advantage of this scoping (ocfs2) will
continue to have the old behavior.

Ross Zwisler (3):
  mm: add filemap_fdatawait_range_keep_errors()
  jbd2: introduce jbd2_inode dirty range scoping
  ext4: use jbd2_inode dirty range scoping

 fs/ext4/ext4_jbd2.h   | 12 +--
 fs/ext4/inode.c       | 13 +---
 fs/ext4/move_extent.c |  3 ++-
 fs/jbd2/commit.c      | 23 ++--
 fs/jbd2/journal.c     |  2 ++
 fs/jbd2/transaction.c | 49 ---
 include/linux/fs.h    |  2 ++
 include/linux/jbd2.h  | 22 +++
 mm/filemap.c          | 22 +++
 9 files changed, 111 insertions(+), 37 deletions(-)

--
2.22.0.410.gd8fdbe21b5-goog
[PATCH v2 3/3] ext4: use jbd2_inode dirty range scoping
Use the newly introduced jbd2_inode dirty range scoping to prevent us
from waiting forever when trying to complete a journal transaction.

Signed-off-by: Ross Zwisler
Reviewed-by: Jan Kara
Cc: sta...@vger.kernel.org
---
 fs/ext4/ext4_jbd2.h   | 12 ++--
 fs/ext4/inode.c       | 13 ++---
 fs/ext4/move_extent.c |  3 ++-
 3 files changed, 18 insertions(+), 10 deletions(-)

diff --git a/fs/ext4/ext4_jbd2.h b/fs/ext4/ext4_jbd2.h
index 75a5309f22315..ef8fcf7d0d3b3 100644
--- a/fs/ext4/ext4_jbd2.h
+++ b/fs/ext4/ext4_jbd2.h
@@ -361,20 +361,20 @@ static inline int ext4_journal_force_commit(journal_t *journal)
 }

 static inline int ext4_jbd2_inode_add_write(handle_t *handle,
-		struct inode *inode)
+		struct inode *inode, loff_t start_byte, loff_t length)
 {
 	if (ext4_handle_valid(handle))
-		return jbd2_journal_inode_add_write(handle,
-				EXT4_I(inode)->jinode);
+		return jbd2_journal_inode_ranged_write(handle,
+				EXT4_I(inode)->jinode, start_byte, length);
 	return 0;
 }

 static inline int ext4_jbd2_inode_add_wait(handle_t *handle,
-		struct inode *inode)
+		struct inode *inode, loff_t start_byte, loff_t length)
 {
 	if (ext4_handle_valid(handle))
-		return jbd2_journal_inode_add_wait(handle,
-				EXT4_I(inode)->jinode);
+		return jbd2_journal_inode_ranged_wait(handle,
+				EXT4_I(inode)->jinode, start_byte, length);
 	return 0;
 }

diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index c7f77c6430085..27fec5c594459 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -731,10 +731,16 @@ int ext4_map_blocks(handle_t *handle, struct inode *inode,
 		    !(flags & EXT4_GET_BLOCKS_ZERO) &&
 		    !ext4_is_quota_file(inode) &&
 		    ext4_should_order_data(inode)) {
+			loff_t start_byte =
+				(loff_t)map->m_lblk << inode->i_blkbits;
+			loff_t length = (loff_t)map->m_len << inode->i_blkbits;
+
 			if (flags & EXT4_GET_BLOCKS_IO_SUBMIT)
-				ret = ext4_jbd2_inode_add_wait(handle, inode);
+				ret = ext4_jbd2_inode_add_wait(handle, inode,
+						start_byte, length);
 			else
-				ret = ext4_jbd2_inode_add_write(handle, inode);
+				ret = ext4_jbd2_inode_add_write(handle, inode,
+						start_byte, length);
 			if (ret)
 				return ret;
 		}
@@ -4085,7 +4091,8 @@ static int __ext4_block_zero_page_range(handle_t *handle,
 		err = 0;
 		mark_buffer_dirty(bh);
 		if (ext4_should_order_data(inode))
-			err = ext4_jbd2_inode_add_write(handle, inode);
+			err = ext4_jbd2_inode_add_write(handle, inode, from,
+					length);
 	}

 unlock:
diff --git a/fs/ext4/move_extent.c b/fs/ext4/move_extent.c
index 1083a9f3f16a1..c7ded4e2adff5 100644
--- a/fs/ext4/move_extent.c
+++ b/fs/ext4/move_extent.c
@@ -390,7 +390,8 @@ move_extent_per_page(struct file *o_filp, struct inode *donor_inode,
 	/* Even in case of data=writeback it is reasonable to pin
 	 * inode to transaction, to prevent unexpected data loss */
-	*err = ext4_jbd2_inode_add_write(handle, orig_inode);
+	*err = ext4_jbd2_inode_add_write(handle, orig_inode,
+			(loff_t)orig_page_offset << PAGE_SHIFT, replaced_size);

 unlock_pages:
 	unlock_page(pagep[0]);
--
2.22.0.410.gd8fdbe21b5-goog
[PATCH v2 1/3] mm: add filemap_fdatawait_range_keep_errors()
In the spirit of filemap_fdatawait_range() and
filemap_fdatawait_keep_errors(), introduce
filemap_fdatawait_range_keep_errors() which both takes a range upon
which to wait and does not clear errors from the address space.

Signed-off-by: Ross Zwisler
Reviewed-by: Jan Kara
Cc: sta...@vger.kernel.org
---
 include/linux/fs.h |  2 ++
 mm/filemap.c       | 22 ++++++++++++++++++++++
 2 files changed, 24 insertions(+)

diff --git a/include/linux/fs.h b/include/linux/fs.h
index f7fdfe93e25d3..79fec8a8413f4 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -2712,6 +2712,8 @@ extern int filemap_flush(struct address_space *);
 extern int filemap_fdatawait_keep_errors(struct address_space *mapping);
 extern int filemap_fdatawait_range(struct address_space *, loff_t lstart,
 				   loff_t lend);
+extern int filemap_fdatawait_range_keep_errors(struct address_space *mapping,
+		loff_t start_byte, loff_t end_byte);

 static inline int filemap_fdatawait(struct address_space *mapping)
 {
diff --git a/mm/filemap.c b/mm/filemap.c
index df2006ba0cfa5..e87252ca0835a 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -553,6 +553,28 @@ int filemap_fdatawait_range(struct address_space *mapping, loff_t start_byte,
 }
 EXPORT_SYMBOL(filemap_fdatawait_range);

+/**
+ * filemap_fdatawait_range_keep_errors - wait for writeback to complete
+ * @mapping:	address space structure to wait for
+ * @start_byte:	offset in bytes where the range starts
+ * @end_byte:	offset in bytes where the range ends (inclusive)
+ *
+ * Walk the list of under-writeback pages of the given address space in the
+ * given range and wait for all of them.  Unlike filemap_fdatawait_range(),
+ * this function does not clear error status of the address space.
+ *
+ * Use this function if callers don't handle errors themselves.  Expected
+ * call sites are system-wide / filesystem-wide data flushers: e.g. sync(2),
+ * fsfreeze(8)
+ */
+int filemap_fdatawait_range_keep_errors(struct address_space *mapping,
+		loff_t start_byte, loff_t end_byte)
+{
+	__filemap_fdatawait_range(mapping, start_byte, end_byte);
+	return filemap_check_and_keep_errors(mapping);
+}
+EXPORT_SYMBOL(filemap_fdatawait_range_keep_errors);
+
 /**
  * file_fdatawait_range - wait for writeback to complete
  * @file: file pointing to address space structure to wait for
--
2.22.0.410.gd8fdbe21b5-goog
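[Editor's note] The `_keep_errors` semantics can be modeled in a few lines of userspace C: the address space carries a sticky writeback error, the plain wait reports it and consumes it, while the keep_errors variant reports it but leaves it visible for a later caller such as fsync(). The struct and function names below are hypothetical simplifications, not the kernel's API:

```c
/*
 * Userspace model of "keep errors": a mapping has one sticky writeback
 * error.  wait_range() reports and clears it; wait_range_keep_errors()
 * reports it without clearing, so a later fsync()-style caller still
 * sees the failure.
 */
struct mapping_model {
	int wb_error;	/* sticky writeback error, 0 = none */
};

/* like filemap_fdatawait_range(): report and consume the error */
int wait_range(struct mapping_model *m)
{
	int err = m->wb_error;

	m->wb_error = 0;
	return err;
}

/* like filemap_fdatawait_range_keep_errors(): report, don't consume */
int wait_range_keep_errors(struct mapping_model *m)
{
	return m->wb_error;
}
```

This is why the commit message recommends the keep_errors form for system-wide flushers: a journal commit that swallowed the error would hide it from the application's subsequent fsync().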
Re: [PATCH 2/3] jbd2: introduce jbd2_inode dirty range scoping
On Thu, Jun 20, 2019 at 01:04:54PM +0200, Jan Kara wrote:
> On Wed 19-06-19 11:21:55, Ross Zwisler wrote:
> > Currently both journal_submit_inode_data_buffers() and
> > journal_finish_inode_data_buffers() operate on the entire address space
> > of each of the inodes associated with a given journal entry.  The
> > consequence of this is that if we have an inode where we are constantly
> > appending dirty pages we can end up waiting for an indefinite amount of
> > time in journal_finish_inode_data_buffers() while we wait for all the
> > pages under writeback to be written out.
> >
> > The easiest way to cause this type of workload is do just dd from
> > /dev/zero to a file until it fills the entire filesystem.  This can
> > cause journal_finish_inode_data_buffers() to wait for the duration of
> > the entire dd operation.
> >
> > We can improve this situation by scoping each of the inode dirty ranges
> > associated with a given transaction.  We do this via the jbd2_inode
> > structure so that the scoping is contained within jbd2 and so that it
> > follows the lifetime and locking rules for that structure.
> >
> > This allows us to limit the writeback & wait in
> > journal_submit_inode_data_buffers() and
> > journal_finish_inode_data_buffers() respectively to the dirty range for
> > a given struct jdb2_inode, keeping us from waiting forever if the inode
> > in question is still being appended to.
> >
> > Signed-off-by: Ross Zwisler
>
> The patch looks good to me.  I was thinking whether we should not have
> separate ranges for current and the next transaction but I guess it is
> not worth it at least for now.  So just one nit below.  With that
> applied feel free to add:
>
> Reviewed-by: Jan Kara

We could definitely keep separate dirty ranges for each of the current
and next transaction.

I think the case where you would see a difference would be if you had
multiple transactions in a row which grew the dirty range for a given
jbd2_inode, and then had a random I/O workload which kept dirtying
pages inside that enlarged dirty range.

I'm not sure how often this type of workload would be a problem.  For
the workloads I've been testing which purely append to the inode,
having a single dirty range per jbd2_inode is sufficient.

I guess for now this single range seems simpler, but if later we find
that someone would benefit from separate tracking for each of the
current and next transactions, I'll take a shot at adding it.

Thank you for the review!

> > @@ -257,15 +262,24 @@ static int journal_finish_inode_data_buffers(journal_t *journal,
> >  	/* For locking, see the comment in journal_submit_data_buffers() */
> >  	spin_lock(&journal->j_list_lock);
> >  	list_for_each_entry(jinode, &commit_transaction->t_inode_list, i_list) {
> > +		loff_t dirty_start = jinode->i_dirty_start;
> > +		loff_t dirty_end = jinode->i_dirty_end;
> > +
> >  		if (!(jinode->i_flags & JI_WAIT_DATA))
> >  			continue;
> >  		jinode->i_flags |= JI_COMMIT_RUNNING;
> >  		spin_unlock(&journal->j_list_lock);
> > -		err = filemap_fdatawait_keep_errors(
> > -				jinode->i_vfs_inode->i_mapping);
> > +		err = filemap_fdatawait_range_keep_errors(
> > +				jinode->i_vfs_inode->i_mapping, dirty_start,
> > +				dirty_end);
> >  		if (!ret)
> >  			ret = err;
> >  		spin_lock(&journal->j_list_lock);
> > +
> > +		if (!jinode->i_next_transaction) {
> > +			jinode->i_dirty_start = 0;
> > +			jinode->i_dirty_end = 0;
> > +		}
>
> This would be more logical in the next loop that moves jinode into the
> next transaction.

Yep, agreed, this is much better.  Fixed in v2.
Re: [PATCH] x86/build: Move _etext to actual end of .text
On Sun, Jun 9, 2019 at 1:00 PM Johannes Hirte wrote:
> On 2019 Jun 09, Klaus Kusche wrote:
> > Hello,
> >
> > Same problem for linux 5.1.7:
> > Kernel building fails with the same relocation error.
> >
> > 5.1.5 does not have the problem, builds fine for me.
> >
> > Is there anything I can do to investigate the problem?
>
> Please try linux 5.1.8. The problematic patch was reverted there.

I'm having this same issue with v5.2-rc5 using an older version of gcc
(4.9.2).  If I use a more recent version of gcc (7.3.0) it works fine.
Reverting this patch allows gcc v4.9.2 to build kernel v5.2-rc5
successfully.

You said in this chain that you were reverting this patch in stable
kernels.  Are you going to revert it in tip-of-tree as well?

- Ross
[PATCH 0/3] Add dirty range scoping to jbd2
This patch series fixes the issue I described here:

https://www.spinics.net/lists/linux-block/msg38274.html

Essentially the issue is that journal_finish_inode_data_buffers()
operates on the entire address space of each of the inodes associated
with a given journal entry.  This means that if we have an inode where
we are constantly appending dirty pages we can end up waiting for an
indefinite amount of time in journal_finish_inode_data_buffers().

This series improves this situation in ext4 by scoping each of the
inode dirty ranges associated with a given transaction.  Other users of
jbd2 which don't (yet?) take advantage of this scoping (ocfs2) will
continue to have the old behavior.

Ross Zwisler (3):
  mm: add filemap_fdatawait_range_keep_errors()
  jbd2: introduce jbd2_inode dirty range scoping
  ext4: use jbd2_inode dirty range scoping

 fs/ext4/ext4_jbd2.h   | 12 +--
 fs/ext4/inode.c       | 13 +---
 fs/ext4/move_extent.c |  3 ++-
 fs/jbd2/commit.c      | 26 +--
 fs/jbd2/journal.c     |  2 ++
 fs/jbd2/transaction.c | 49 ---
 include/linux/fs.h    |  2 ++
 include/linux/jbd2.h  | 22 +++
 mm/filemap.c          | 22 +++
 9 files changed, 114 insertions(+), 37 deletions(-)

--
2.22.0.410.gd8fdbe21b5-goog
[PATCH 3/3] ext4: use jbd2_inode dirty range scoping
Use the newly introduced jbd2_inode dirty range scoping to prevent us
from waiting forever when trying to complete a journal transaction.

Signed-off-by: Ross Zwisler
---
 fs/ext4/ext4_jbd2.h   | 12 ++--
 fs/ext4/inode.c       | 13 ++---
 fs/ext4/move_extent.c |  3 ++-
 3 files changed, 18 insertions(+), 10 deletions(-)

diff --git a/fs/ext4/ext4_jbd2.h b/fs/ext4/ext4_jbd2.h
index 75a5309f22315..ef8fcf7d0d3b3 100644
--- a/fs/ext4/ext4_jbd2.h
+++ b/fs/ext4/ext4_jbd2.h
@@ -361,20 +361,20 @@ static inline int ext4_journal_force_commit(journal_t *journal)
 }

 static inline int ext4_jbd2_inode_add_write(handle_t *handle,
-		struct inode *inode)
+		struct inode *inode, loff_t start_byte, loff_t length)
 {
 	if (ext4_handle_valid(handle))
-		return jbd2_journal_inode_add_write(handle,
-				EXT4_I(inode)->jinode);
+		return jbd2_journal_inode_ranged_write(handle,
+				EXT4_I(inode)->jinode, start_byte, length);
 	return 0;
 }

 static inline int ext4_jbd2_inode_add_wait(handle_t *handle,
-		struct inode *inode)
+		struct inode *inode, loff_t start_byte, loff_t length)
 {
 	if (ext4_handle_valid(handle))
-		return jbd2_journal_inode_add_wait(handle,
-				EXT4_I(inode)->jinode);
+		return jbd2_journal_inode_ranged_wait(handle,
+				EXT4_I(inode)->jinode, start_byte, length);
 	return 0;
 }

diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index c7f77c6430085..27fec5c594459 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -731,10 +731,16 @@ int ext4_map_blocks(handle_t *handle, struct inode *inode,
 		    !(flags & EXT4_GET_BLOCKS_ZERO) &&
 		    !ext4_is_quota_file(inode) &&
 		    ext4_should_order_data(inode)) {
+			loff_t start_byte =
+				(loff_t)map->m_lblk << inode->i_blkbits;
+			loff_t length = (loff_t)map->m_len << inode->i_blkbits;
+
 			if (flags & EXT4_GET_BLOCKS_IO_SUBMIT)
-				ret = ext4_jbd2_inode_add_wait(handle, inode);
+				ret = ext4_jbd2_inode_add_wait(handle, inode,
+						start_byte, length);
 			else
-				ret = ext4_jbd2_inode_add_write(handle, inode);
+				ret = ext4_jbd2_inode_add_write(handle, inode,
+						start_byte, length);
 			if (ret)
 				return ret;
 		}
@@ -4085,7 +4091,8 @@ static int __ext4_block_zero_page_range(handle_t *handle,
 		err = 0;
 		mark_buffer_dirty(bh);
 		if (ext4_should_order_data(inode))
-			err = ext4_jbd2_inode_add_write(handle, inode);
+			err = ext4_jbd2_inode_add_write(handle, inode, from,
+					length);
 	}

 unlock:
diff --git a/fs/ext4/move_extent.c b/fs/ext4/move_extent.c
index 1083a9f3f16a1..c7ded4e2adff5 100644
--- a/fs/ext4/move_extent.c
+++ b/fs/ext4/move_extent.c
@@ -390,7 +390,8 @@ move_extent_per_page(struct file *o_filp, struct inode *donor_inode,
 	/* Even in case of data=writeback it is reasonable to pin
 	 * inode to transaction, to prevent unexpected data loss */
-	*err = ext4_jbd2_inode_add_write(handle, orig_inode);
+	*err = ext4_jbd2_inode_add_write(handle, orig_inode,
+			(loff_t)orig_page_offset << PAGE_SHIFT, replaced_size);

 unlock_pages:
 	unlock_page(pagep[0]);
--
2.22.0.410.gd8fdbe21b5-goog
[PATCH 1/3] mm: add filemap_fdatawait_range_keep_errors()
In the spirit of filemap_fdatawait_range() and
filemap_fdatawait_keep_errors(), introduce
filemap_fdatawait_range_keep_errors() which both takes a range upon
which to wait and does not clear errors from the address space.

Signed-off-by: Ross Zwisler
---
 include/linux/fs.h |  2 ++
 mm/filemap.c       | 22 ++++++++++++++++++++++
 2 files changed, 24 insertions(+)

diff --git a/include/linux/fs.h b/include/linux/fs.h
index f7fdfe93e25d3..79fec8a8413f4 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -2712,6 +2712,8 @@ extern int filemap_flush(struct address_space *);
 extern int filemap_fdatawait_keep_errors(struct address_space *mapping);
 extern int filemap_fdatawait_range(struct address_space *, loff_t lstart,
 				   loff_t lend);
+extern int filemap_fdatawait_range_keep_errors(struct address_space *mapping,
+		loff_t start_byte, loff_t end_byte);

 static inline int filemap_fdatawait(struct address_space *mapping)
 {
diff --git a/mm/filemap.c b/mm/filemap.c
index df2006ba0cfa5..e87252ca0835a 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -553,6 +553,28 @@ int filemap_fdatawait_range(struct address_space *mapping, loff_t start_byte,
 }
 EXPORT_SYMBOL(filemap_fdatawait_range);

+/**
+ * filemap_fdatawait_range_keep_errors - wait for writeback to complete
+ * @mapping:	address space structure to wait for
+ * @start_byte:	offset in bytes where the range starts
+ * @end_byte:	offset in bytes where the range ends (inclusive)
+ *
+ * Walk the list of under-writeback pages of the given address space in the
+ * given range and wait for all of them.  Unlike filemap_fdatawait_range(),
+ * this function does not clear error status of the address space.
+ *
+ * Use this function if callers don't handle errors themselves.  Expected
+ * call sites are system-wide / filesystem-wide data flushers: e.g. sync(2),
+ * fsfreeze(8)
+ */
+int filemap_fdatawait_range_keep_errors(struct address_space *mapping,
+		loff_t start_byte, loff_t end_byte)
+{
+	__filemap_fdatawait_range(mapping, start_byte, end_byte);
+	return filemap_check_and_keep_errors(mapping);
+}
+EXPORT_SYMBOL(filemap_fdatawait_range_keep_errors);
+
 /**
  * file_fdatawait_range - wait for writeback to complete
  * @file: file pointing to address space structure to wait for
--
2.22.0.410.gd8fdbe21b5-goog
[PATCH 2/3] jbd2: introduce jbd2_inode dirty range scoping
Currently both journal_submit_inode_data_buffers() and
journal_finish_inode_data_buffers() operate on the entire address space
of each of the inodes associated with a given journal entry.  The
consequence of this is that if we have an inode where we are constantly
appending dirty pages we can end up waiting for an indefinite amount of
time in journal_finish_inode_data_buffers() while we wait for all the
pages under writeback to be written out.

The easiest way to cause this type of workload is to just dd from
/dev/zero to a file until it fills the entire filesystem.  This can
cause journal_finish_inode_data_buffers() to wait for the duration of
the entire dd operation.

We can improve this situation by scoping each of the inode dirty ranges
associated with a given transaction.  We do this via the jbd2_inode
structure so that the scoping is contained within jbd2 and so that it
follows the lifetime and locking rules for that structure.

This allows us to limit the writeback & wait in
journal_submit_inode_data_buffers() and
journal_finish_inode_data_buffers() respectively to the dirty range for
a given struct jbd2_inode, keeping us from waiting forever if the inode
in question is still being appended to.

Signed-off-by: Ross Zwisler
---
 fs/jbd2/commit.c      | 26 +--
 fs/jbd2/journal.c     |  2 ++
 fs/jbd2/transaction.c | 49 ---
 include/linux/jbd2.h  | 22 +++
 4 files changed, 72 insertions(+), 27 deletions(-)

diff --git a/fs/jbd2/commit.c b/fs/jbd2/commit.c
index efd0ce9489ae9..b4b99ea6e8700 100644
--- a/fs/jbd2/commit.c
+++ b/fs/jbd2/commit.c
@@ -187,14 +187,15 @@ static int journal_wait_on_commit_record(journal_t *journal,
  * use writepages() because with dealyed allocation we may be doing
  * block allocation in writepages().
  */
-static int journal_submit_inode_data_buffers(struct address_space *mapping)
+static int journal_submit_inode_data_buffers(struct address_space *mapping,
+		loff_t dirty_start, loff_t dirty_end)
 {
 	int ret;
 	struct writeback_control wbc = {
 		.sync_mode =  WB_SYNC_ALL,
 		.nr_to_write = mapping->nrpages * 2,
-		.range_start = 0,
-		.range_end = i_size_read(mapping->host),
+		.range_start = dirty_start,
+		.range_end = dirty_end,
 	};

 	ret = generic_writepages(mapping, &wbc);
@@ -218,6 +219,9 @@ static int journal_submit_data_buffers(journal_t *journal,
 	spin_lock(&journal->j_list_lock);
 	list_for_each_entry(jinode, &commit_transaction->t_inode_list, i_list) {
+		loff_t dirty_start = jinode->i_dirty_start;
+		loff_t dirty_end = jinode->i_dirty_end;
+
 		if (!(jinode->i_flags & JI_WRITE_DATA))
 			continue;
 		mapping = jinode->i_vfs_inode->i_mapping;
@@ -230,7 +234,8 @@ static int journal_submit_data_buffers(journal_t *journal,
 		 * only allocated blocks here.
 		 */
 		trace_jbd2_submit_inode_data(jinode->i_vfs_inode);
-		err = journal_submit_inode_data_buffers(mapping);
+		err = journal_submit_inode_data_buffers(mapping, dirty_start,
+				dirty_end);
 		if (!ret)
 			ret = err;
 		spin_lock(&journal->j_list_lock);
@@ -257,15 +262,24 @@ static int journal_finish_inode_data_buffers(journal_t *journal,
 	/* For locking, see the comment in journal_submit_data_buffers() */
 	spin_lock(&journal->j_list_lock);
 	list_for_each_entry(jinode, &commit_transaction->t_inode_list, i_list) {
+		loff_t dirty_start = jinode->i_dirty_start;
+		loff_t dirty_end = jinode->i_dirty_end;
+
 		if (!(jinode->i_flags & JI_WAIT_DATA))
 			continue;
 		jinode->i_flags |= JI_COMMIT_RUNNING;
 		spin_unlock(&journal->j_list_lock);
-		err = filemap_fdatawait_keep_errors(
-				jinode->i_vfs_inode->i_mapping);
+		err = filemap_fdatawait_range_keep_errors(
+				jinode->i_vfs_inode->i_mapping, dirty_start,
+				dirty_end);
 		if (!ret)
 			ret = err;
 		spin_lock(&journal->j_list_lock);
+
+		if (!jinode->i_next_transaction) {
+			jinode->i_dirty_start = 0;
+			jinode->i_dirty_end = 0;
+		}
 		jinode->i_flags &= ~JI_COMMIT_RUNNING;
 		smp_mb();
 		wake_up_bit(&jinode->i_flags, __JI_COMMIT_RUNNING);
diff --git a/fs/jbd2/journal.c b/fs/jbd2/journal.c
index 43df0c943229c..288b8e7cf21c7 100644
--- a/f
Re: [linux-4.4.y PATCH] ASoC: Intel: avoid Oops if DMA setup fails
On Thu, May 09, 2019 at 07:41:47PM +0200, Greg KH wrote:
> On Sun, May 05, 2019 at 03:15:53PM +0200, Greg KH wrote:
> > On Fri, May 03, 2019 at 01:45:03PM -0600, Ross Zwisler wrote:
> > > From: Ross Zwisler
> > >
> > > commit 0efa3334d65b7f421ba12382dfa58f6ff5bf83c4 upstream.
> >
> > This commit id is not in Linus's tree :(
>
> Sorry for the noise, I didn't read your note :(

No worries, sorry for sending early.  I'll wait until after it's merged
next time. :)

Thanks for pulling it in.
[linux-4.4.y PATCH] ASoC: Intel: avoid Oops if DMA setup fails
From: Ross Zwisler

commit 0efa3334d65b7f421ba12382dfa58f6ff5bf83c4 upstream.

Currently in sst_dsp_new() if we get an error return from sst_dma_new()
we just print an error message and then still complete the function
successfully.  This means that we are trying to run without sst->dma
properly set up, which will result in NULL pointer dereference when
sst->dma is later used.  This was happening for me in
sst_dsp_dma_get_channel():

	struct sst_dma *dma = dsp->dma;
	...
	dma->ch = dma_request_channel(mask, dma_chan_filter, dsp);

This resulted in:

	BUG: unable to handle kernel NULL pointer dereference at 0018
	IP: sst_dsp_dma_get_channel+0x4f/0x125 [snd_soc_sst_firmware]

Fix this by adding proper error handling for the case where we fail to
set up DMA.  This change only affects Haswell and Broadwell systems.
Baytrail systems explicitly opt-out of DMA via
sst->pdata->resindex_dma_base being set to -1.

Signed-off-by: Ross Zwisler
Cc: sta...@vger.kernel.org
Acked-by: Pierre-Louis Bossart
Signed-off-by: Mark Brown
---
The upstream patch applied cleanly to all stable trees except
linux-4.4.y and linux-3.18.y.  This is the backport for linux-4.4.y,
and the code I'm fixing was introduced in v4.0 so there is no need for
a linux-3.18.y backport.

The upstream patch is currently in Mark Brown's tree:

https://git.kernel.org/pub/scm/linux/kernel/git/broonie/sound.git/log/?h=for-next

Is that good enough, or should I resend after it's been merged in the
v5.2 merge window?
---
 sound/soc/intel/common/sst-dsp.c | 8 ++--
 1 file changed, 6 insertions(+), 2 deletions(-)

diff --git a/sound/soc/intel/common/sst-dsp.c b/sound/soc/intel/common/sst-dsp.c
index c9452e02e0dda..c0a50ecb6dbda 100644
--- a/sound/soc/intel/common/sst-dsp.c
+++ b/sound/soc/intel/common/sst-dsp.c
@@ -463,11 +463,15 @@ struct sst_dsp *sst_dsp_new(struct device *dev,
 		goto irq_err;

 	err = sst_dma_new(sst);
-	if (err)
-		dev_warn(dev, "sst_dma_new failed %d\n", err);
+	if (err) {
+		dev_err(dev, "sst_dma_new failed %d\n", err);
+		goto dma_err;
+	}

 	return sst;

+dma_err:
+	free_irq(sst->irq, sst);
 irq_err:
 	if (sst->ops->free)
 		sst->ops->free(sst);
--
2.21.0.1020.gf2820cf01a-goog
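[Editor's note] The fix above follows the kernel's standard goto-unwind pattern: each successful setup step gains a label whose teardown runs when a later step fails, so resources are released in reverse acquisition order. A standalone userspace sketch of the pattern (the `ctx`/`setup` names and malloc stand-ins are illustrative, not the driver's code):

```c
#include <stdlib.h>

/* Two-stage setup with goto-based unwinding, modeled on sst_dsp_new():
 * stage 1 stands in for request_irq(), stage 2 for sst_dma_new(). */
struct ctx {
	void *irq_res;
	void *dma_res;
};

int setup(struct ctx *c, int fail_dma)
{
	c->irq_res = malloc(16);	/* stage 1: acquire "irq" */
	if (!c->irq_res)
		goto irq_err;

	if (fail_dma)			/* stage 2: "sst_dma_new()" fails */
		goto dma_err;
	c->dma_res = malloc(16);
	if (!c->dma_res)
		goto dma_err;

	return 0;			/* success: both resources held */

dma_err:
	free(c->irq_res);		/* undo stage 1, like free_irq() */
	c->irq_res = NULL;
irq_err:
	return -1;
}
```

The bug being fixed is exactly the absence of this unwinding: the original code warned on the stage-2 failure but still returned success, leaving the half-initialized object to crash later.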
Re: [PATCH] MAINTAINERS: update git tree for sound entries
On Fri, May 3, 2019 at 2:17 AM Takashi Iwai wrote:
>
> On Thu, 02 May 2019 19:27:00 +0200, Ross Zwisler wrote:
> >
> > Several sound related entries in MAINTAINERS refer to the old git tree
> > at "git://git.alsa-project.org/alsa-kernel.git".  This is no longer
> > used for development, and Takashi Iwai's kernel.org tree is used
> > instead.
> >
> > Signed-off-by: Ross Zwisler
>
> Ross, would you like to keep @chromium.org as your author address
> while the sign-off is with @google.com?  It's OK and some people even
> prefer giving different addresses, but it's often an overlook.  So,
> just for confirmation.

Hi Takashi,

It's fine to leave the author address as @chromium.org with the
sign-off @google.com.  Thank you for checking.

- Ross
[PATCH] MAINTAINERS: update git tree for sound entries
Several sound related entries in MAINTAINERS refer to the old git tree at "git://git.alsa-project.org/alsa-kernel.git". This is no longer used for development, and Takashi Iwai's kernel.org tree is used instead. Signed-off-by: Ross Zwisler --- MAINTAINERS | 13 ++--- 1 file changed, 6 insertions(+), 7 deletions(-) diff --git a/MAINTAINERS b/MAINTAINERS index e17ebf70b5480..d373d976a9317 100644 --- a/MAINTAINERS +++ b/MAINTAINERS @@ -3351,7 +3351,7 @@ F:include/uapi/linux/bsg.h BT87X AUDIO DRIVER M: Clemens Ladisch L: alsa-de...@alsa-project.org (moderated for non-subscribers) -T: git git://git.alsa-project.org/alsa-kernel.git +T: git git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound.git S: Maintained F: Documentation/sound/cards/bt87x.rst F: sound/pci/bt87x.c @@ -3404,7 +3404,7 @@ F:drivers/scsi/FlashPoint.* C-MEDIA CMI8788 DRIVER M: Clemens Ladisch L: alsa-de...@alsa-project.org (moderated for non-subscribers) -T: git git://git.alsa-project.org/alsa-kernel.git +T: git git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound.git S: Maintained F: sound/pci/oxygen/ @@ -5696,7 +5696,7 @@ F:drivers/edac/qcom_edac.c EDIROL UA-101/UA-1000 DRIVER M: Clemens Ladisch L: alsa-de...@alsa-project.org (moderated for non-subscribers) -T: git git://git.alsa-project.org/alsa-kernel.git +T: git git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound.git S: Maintained F: sound/usb/misc/ua101.c @@ -6036,7 +6036,7 @@ F:include/linux/f75375s.h FIREWIRE AUDIO DRIVERS M: Clemens Ladisch L: alsa-de...@alsa-project.org (moderated for non-subscribers) -T: git git://git.alsa-project.org/alsa-kernel.git +T: git git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound.git S: Maintained F: sound/firewire/ @@ -11593,7 +11593,7 @@ F: Documentation/devicetree/bindings/opp/ OPL4 DRIVER M: Clemens Ladisch L: alsa-de...@alsa-project.org (moderated for non-subscribers) -T: git git://git.alsa-project.org/alsa-kernel.git +T: git git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound.git S: 
Maintained F: sound/drivers/opl4/ @@ -14490,7 +14490,6 @@ M: Takashi Iwai L: alsa-de...@alsa-project.org (moderated for non-subscribers) W: http://www.alsa-project.org/ T: git git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound.git -T: git git://git.alsa-project.org/alsa-kernel.git Q: http://patchwork.kernel.org/project/alsa-devel/list/ S: Maintained F: Documentation/sound/ @@ -16100,7 +16099,7 @@ F: drivers/usb/storage/ USB MIDI DRIVER M: Clemens Ladisch L: alsa-de...@alsa-project.org (moderated for non-subscribers) -T: git git://git.alsa-project.org/alsa-kernel.git +T: git git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound.git S: Maintained F: sound/usb/midi.* -- 2.21.0.593.g511ec345e18-goog
Re: [PATCH] ASoC: Intel: avoid Oops if DMA setup fails
On Fri, Apr 26, 2019 at 06:52:45PM +, Sasha Levin wrote: > Hi, > > [This is an automated email] > > This commit has been processed because it contains a -stable tag. > The stable tag indicates that it's relevant for the following trees: all > > The bot has tested the following trees: v5.0.9, v4.19.36, v4.14.113, > v4.9.170, v4.4.178, v3.18.138. > > v5.0.9: Build OK! > v4.19.36: Build OK! > v4.14.113: Build OK! > v4.9.170: Build OK! > v4.4.178: Failed to apply! Possible dependencies: > 2bd5bd15a518 ("ASoC: Intel: add bytct-rt5651 machine driver") > 2dcffcee23a2 ("ASoC: Intel: Create independent acpi match module") > 595788e475d0 ("ASoC: Intel: tag byt-rt5640 machine driver as deprecated") > 95f098014815 ("ASoC: Intel: Move apci find machine routines") > a395bdd6b24b ("ASoC: intel: Fix sst-dsp dependency on dw stuff") > a92ea59b74e2 ("ASoC: Intel: sst: only select sst-firmware when DW DMAC is > built-in") > cfffcc66a89a ("ASoC: Intel: Load the atom DPCM driver only") > > v3.18.138: Failed to apply! 
Possible dependencies: > 0d2135ecadb0 ("ASoC: Intel: Work around to fix HW D3 potential crash > issue") > 13735d1cecec ("ASoC: intel - kconfig: remove SND_SOC_INTEL_SST prompt") > 161aa49ef1b9 ("ASoC: Intel: Add new dependency for Haswell machine") > 2106241a6803 ("ASoC: Intel: create common folder and move common files > in") > 282a331fe25c ("ASoC: Intel: Add new dependency for Broadwell machine") > 2e4f75919e5a ("ASoC: Intel: Add PM support to HSW/BDW PCM driver") > 34084a436703 ("ASoC: intel: Remove superfluous backslash in Kconfig") > 544c55c810a5 ("ASoC: Intel: Delete an unnecessary check before the > function call "sst_dma_free"") > 63ae1fe7739e ("ASoC: Intel: Add dependency on DesignWare DMA controller") > 7dd6bd8926f3 ("ASoC: intel: kconfig - Move DW_DMAC_CORE dependency to > machines") > 85b88a8dd0c7 ("ASoC: Intel: Store the entry_point read from FW file") > 9449d39b990d ("ASoC: Intel: add function to load firmware image") > a395bdd6b24b ("ASoC: intel: Fix sst-dsp dependency on dw stuff") > a92ea59b74e2 ("ASoC: Intel: sst: only select sst-firmware when DW DMAC is > built-in") > aed3c7b77c85 ("ASoC: Intel: Add PM support to HSW/BDW IPC driver") > d96c53a193dd ("ASoC: Intel: Add generic support for DSP wake, sleep and > stall") > e9600bc166d5 ("ASoC: Intel: Make ADSP memory block allocation more > generic") > > > How should we proceed with this patch? After reviews I'll send backport patches for v4.4.X and v3.18.X as necessary.
[PATCH v2] ASoC: Intel: avoid Oops if DMA setup fails
Currently in sst_dsp_new() if we get an error return from sst_dma_new() we just print an error message and then still complete the function successfully. This means that we are trying to run without sst->dma properly set up, which will result in NULL pointer dereference when sst->dma is later used. This was happening for me in sst_dsp_dma_get_channel(): struct sst_dma *dma = dsp->dma; ... dma->ch = dma_request_channel(mask, dma_chan_filter, dsp); This resulted in: BUG: unable to handle kernel NULL pointer dereference at 0018 IP: sst_dsp_dma_get_channel+0x4f/0x125 [snd_soc_sst_firmware] Fix this by adding proper error handling for the case where we fail to set up DMA. This change only affects Haswell and Broadwell systems. Baytrail systems explicitly opt out of DMA via sst->pdata->resindex_dma_base being set to -1. Signed-off-by: Ross Zwisler Cc: sta...@vger.kernel.org --- Changes in v2: - Upgraded the sst_dma_new() failure message from dev_warn() to dev_err() (Pierre-Louis). - Noted in the changelog that this change only affects Haswell and Broadwell (Pierre-Louis). --- sound/soc/intel/common/sst-firmware.c | 8 ++-- 1 file changed, 6 insertions(+), 2 deletions(-) diff --git a/sound/soc/intel/common/sst-firmware.c b/sound/soc/intel/common/sst-firmware.c index 1e067504b6043..f830e59f93eaa 100644 --- a/sound/soc/intel/common/sst-firmware.c +++ b/sound/soc/intel/common/sst-firmware.c @@ -1251,11 +1251,15 @@ struct sst_dsp *sst_dsp_new(struct device *dev, goto irq_err; err = sst_dma_new(sst); - if (err) - dev_warn(dev, "sst_dma_new failed %d\n", err); + if (err) { + dev_err(dev, "sst_dma_new failed %d\n", err); + goto dma_err; + } return sst; +dma_err: + free_irq(sst->irq, sst); irq_err: if (sst->ops->free) sst->ops->free(sst); -- 2.21.0.593.g511ec345e18-goog
Re: [PATCH] ASoC: Intel: avoid Oops if DMA setup fails
On Fri, Apr 26, 2019 at 04:03:47PM -0500, Pierre-Louis Bossart wrote: > On 4/26/19 11:47 AM, Ross Zwisler wrote: > > Currently in sst_dsp_new() if we get an error return from sst_dma_new() > > we just print an error message and then still complete the function > > successfully. This means that we are trying to run without sst->dma > > properly set up, which will result in NULL pointer dereference when > > sst->dma is later used. This was happening for me in > > sst_dsp_dma_get_channel(): > > > > struct sst_dma *dma = dsp->dma; > > ... > > dma->ch = dma_request_channel(mask, dma_chan_filter, dsp); > > > > This resulted in: > > > > BUG: unable to handle kernel NULL pointer dereference at > > 0018 > > IP: sst_dsp_dma_get_channel+0x4f/0x125 [snd_soc_sst_firmware] > > > > Fix this by adding proper error handling for the case where we fail to > > set up DMA. > > > > Signed-off-by: Ross Zwisler > > Cc: sta...@vger.kernel.org > > --- > > sound/soc/intel/common/sst-firmware.c | 6 +- > > 1 file changed, 5 insertions(+), 1 deletion(-) > > > > diff --git a/sound/soc/intel/common/sst-firmware.c > > b/sound/soc/intel/common/sst-firmware.c > > index 1e067504b6043..9be3a793a55e3 100644 > > --- a/sound/soc/intel/common/sst-firmware.c > > +++ b/sound/soc/intel/common/sst-firmware.c > > @@ -1251,11 +1251,15 @@ struct sst_dsp *sst_dsp_new(struct device *dev, > > goto irq_err; > > err = sst_dma_new(sst); > > - if (err) > > + if (err) { > > dev_warn(dev, "sst_dma_new failed %d\n", err); > > + goto dma_err; > > + } > > Thanks for the patch. > The fix looks correct, but does it make sense to keep a dev_warn() here? > Should it be changed to dev_err() instead since as you mentioned it's fatal > to keep going. > Also you may want to mention in the commit message that this should only > impact Broadwell and maybe the legacy Baytrail driver. IIRC we don't use the > DMAs in other cases. Sure, I'll address both of these in a v2. Thank you for the quick review.
[PATCH] ASoC: Intel: avoid Oops if DMA setup fails
Currently in sst_dsp_new() if we get an error return from sst_dma_new() we just print an error message and then still complete the function successfully. This means that we are trying to run without sst->dma properly set up, which will result in NULL pointer dereference when sst->dma is later used. This was happening for me in sst_dsp_dma_get_channel(): struct sst_dma *dma = dsp->dma; ... dma->ch = dma_request_channel(mask, dma_chan_filter, dsp); This resulted in: BUG: unable to handle kernel NULL pointer dereference at 0018 IP: sst_dsp_dma_get_channel+0x4f/0x125 [snd_soc_sst_firmware] Fix this by adding proper error handling for the case where we fail to set up DMA. Signed-off-by: Ross Zwisler Cc: sta...@vger.kernel.org --- sound/soc/intel/common/sst-firmware.c | 6 +- 1 file changed, 5 insertions(+), 1 deletion(-) diff --git a/sound/soc/intel/common/sst-firmware.c b/sound/soc/intel/common/sst-firmware.c index 1e067504b6043..9be3a793a55e3 100644 --- a/sound/soc/intel/common/sst-firmware.c +++ b/sound/soc/intel/common/sst-firmware.c @@ -1251,11 +1251,15 @@ struct sst_dsp *sst_dsp_new(struct device *dev, goto irq_err; err = sst_dma_new(sst); - if (err) + if (err) { dev_warn(dev, "sst_dma_new failed %d\n", err); + goto dma_err; + } return sst; +dma_err: + free_irq(sst->irq, sst); irq_err: if (sst->ops->free) sst->ops->free(sst); -- 2.21.0.593.g511ec345e18-goog
Re: [PATCH v1] mmc: core: Verify SD bus width
On Mon, Apr 15, 2019 at 03:00:31PM -0600, Raul E Rangel wrote: > The SD Physical Layer Spec says the following: Since the SD Memory Card > shall support at least the two bus modes 1-bit or 4-bit width, then any SD > Card shall set at least bits 0 and 2 (SD_BUS_WIDTH="0101"). > > This change verifies the card has specified a bus width. > > verified it didn't mount. > > Signed-off-by: Raul E Rangel > --- > AMD SDHC Device 7806 can get into a bad state after a card disconnect > where anything transferred via the DATA lines will always result in a > zero filled buffer. Currently the driver will continue without error if > the HC is in this condition. A block device will be created, but reading > from it will result in a zero buffer. This makes it seem like the SD > device has been erased, when in actuality the data is never getting > copied from the DATA lines to the data buffer. > > SCR is the first command in the SD initialization sequence that uses the > DATA lines. By checking that the response was invalid, we can abort > mounting the card. > > Here is an example of a bad trace: https://pastebin.com/TY2cF9n0 > Look for sd_scr and sd_ssr. Personally I think that all the info you have here below the --- should be in the actual commit message. It provides context for the bug that the patch is fixing, what the old and new behaviors are and how you tested. The text below the --- is for ephemeral things that don't matter once the code is committed: what revision of the patch set you are on, noting that you've addressed someone's comments, etc.
Re: [PATCH] cros_ec: Add trace event to trace EC commands
On Wed, Apr 10, 2019 at 02:24:08PM -0600, Raul E Rangel wrote: > This is useful to see which EC commands are being executed and when. > > To enable: > > echo 'cros_ec:*' >> /sys/kernel/debug/tracing/set_event > > Example: > > /* cros_ec_cmd: version: 0, command: GET_VERSION */ > /* cros_ec_cmd: version: 0, command: GET_PROTOCOL_INFO */ > /* cros_ec_cmd: version: 1, command: GET_CMD_VERSIONS */ > /* cros_ec_cmd: version: 1, command: USB_PD_CONTROL */ > > Signed-off-by: Raul E Rangel Reviewed-by: Ross Zwisler
Re: [PATCH] MAINTAINERS: Update filesystem-dax and NVDIMM entries
On Thu, Jan 24, 2019 at 3:13 PM Keith Busch wrote: > On Thu, Jan 24, 2019 at 11:17:59AM -0800, Dan Williams wrote: > > Ross has moved on to other areas. > > > > Matthew and Jan are trusted reviewers for DAX. > > > > Dan is now coordinating filesystem-dax pull requests. > > > > Ira and Keith are now involved with the NVDIMM and Device-DAX > > sub-systems. > > > > The linux-nvdimm mailing hosts a patchwork instance for both DAX and > > NVDIMM patches. > > > > Cc: Jan Kara > > Cc: Ira Weiny > > Cc: Ross Zwisler > > Cc: Keith Busch > > Cc: Matthew Wilcox > > Signed-off-by: Dan Williams > > Acked-by: Keith Busch Acked-by: Ross Zwisler
Re: [PATCH] ASoC: rt5677: use more of the volume range from DACs
On Thu, Dec 20, 2018 at 04:25:55PM -0700, Ross Zwisler wrote: > From: Dylan Reid > > The DACs volume can go over 0, both according to the data sheet and > real world testing. The control can go up to +30dB. > > This was tested by playing audio at full volume on a samus chromebook. > > Signed-off-by: Dylan Reid > Reviewed-by: Hsinyu Chao > Signed-off-by: Ross Zwisler Ping on this patch.
Re: [PATCH] drm/i915/psr: simplify enable_psr handling
On Fri, Dec 21, 2018 at 11:23:07AM -0800, Dhinakaran Pandiyan wrote: > On Fri, 2018-12-21 at 10:23 -0700, Ross Zwisler wrote: > > The following commit: > > > > commit 2bdd045e3a30 ("drm/i915/psr: Check if VBT says PSR can be > > enabled.") > > > > added some code with no usable functionality. Regardless of how the > > psr > > default is set up in the BDB_DRIVER_FEATURES section, if the > > enable_psr > > module parameter isn't specified it defaults to 0. > Right, that was intentional and the commit message even makes a note of > it > " Note: The feature currently remains disabled by default for all > platforms irrespective of what VBT says." > > > Anyway, we've enabled the feature by default now and the current code > should take into account the VBT flag if the module parameter is left > to a default value. Please check git://anongit.freedesktop.org/drm-tip > drm-tip. Fair enough. It's a bad pattern to introduce dead code as a placeholder for some future work, though. This code has been in the tree for three major kernel releases (v4.{18,19,20}) without providing any useful functionality.
[PATCH] drm/i915/psr: simplify enable_psr handling
The following commit: commit 2bdd045e3a30 ("drm/i915/psr: Check if VBT says PSR can be enabled.") added some code with no usable functionality. Regardless of how the psr default is set up in the BDB_DRIVER_FEATURES section, if the enable_psr module parameter isn't specified it defaults to 0. Remove this dead code, simplify the way that enable_psr is handled and update the module parameter string to match the actual functionality. Cc: Dhinakaran Pandiyan Cc: Rodrigo Vivi Signed-off-by: Ross Zwisler --- drivers/gpu/drm/i915/i915_drv.h| 1 - drivers/gpu/drm/i915/i915_params.c | 4 +--- drivers/gpu/drm/i915/i915_params.h | 2 +- drivers/gpu/drm/i915/intel_bios.c | 1 - drivers/gpu/drm/i915/intel_psr.c | 7 --- 5 files changed, 2 insertions(+), 13 deletions(-) diff --git a/drivers/gpu/drm/i915/i915_drv.h b/drivers/gpu/drm/i915/i915_drv.h index 872a2e159a5f9..b4c50ba0b22a6 100644 --- a/drivers/gpu/drm/i915/i915_drv.h +++ b/drivers/gpu/drm/i915/i915_drv.h @@ -1115,7 +1115,6 @@ struct intel_vbt_data { } edp; struct { - bool enable; bool full_link; bool require_aux_wakeup; int idle_frames; diff --git a/drivers/gpu/drm/i915/i915_params.c b/drivers/gpu/drm/i915/i915_params.c index 295e981e4a398..80ce8758c3c69 100644 --- a/drivers/gpu/drm/i915/i915_params.c +++ b/drivers/gpu/drm/i915/i915_params.c @@ -87,9 +87,7 @@ i915_param_named_unsafe(enable_ppgtt, int, 0400, "(-1=auto [default], 0=disabled, 1=aliasing, 2=full, 3=full with extended address space)"); i915_param_named_unsafe(enable_psr, int, 0600, - "Enable PSR " - "(0=disabled, 1=enabled) " - "Default: -1 (use per-chip default)"); + "Enable PSR (default: false)"); i915_param_named_unsafe(alpha_support, bool, 0400, "Enable alpha quality driver support for latest hardware. 
" diff --git a/drivers/gpu/drm/i915/i915_params.h b/drivers/gpu/drm/i915/i915_params.h index 6c4d4a21474b5..144572f17a83d 100644 --- a/drivers/gpu/drm/i915/i915_params.h +++ b/drivers/gpu/drm/i915/i915_params.h @@ -42,7 +42,7 @@ struct drm_printer; param(int, enable_dc, -1) \ param(int, enable_fbc, -1) \ param(int, enable_ppgtt, -1) \ - param(int, enable_psr, -1) \ + param(int, enable_psr, 0) \ param(int, disable_power_well, -1) \ param(int, enable_ips, 1) \ param(int, invert_brightness, 0) \ diff --git a/drivers/gpu/drm/i915/intel_bios.c b/drivers/gpu/drm/i915/intel_bios.c index 1faa494e2bc91..d676d483d5cf1 100644 --- a/drivers/gpu/drm/i915/intel_bios.c +++ b/drivers/gpu/drm/i915/intel_bios.c @@ -551,7 +551,6 @@ parse_driver_features(struct drm_i915_private *dev_priv, */ if (!driver->drrs_enabled) dev_priv->vbt.drrs_type = DRRS_NOT_SUPPORTED; - dev_priv->vbt.psr.enable = driver->psr_enabled; } static void diff --git a/drivers/gpu/drm/i915/intel_psr.c b/drivers/gpu/drm/i915/intel_psr.c index b6838b525502e..26e7eb318cf07 100644 --- a/drivers/gpu/drm/i915/intel_psr.c +++ b/drivers/gpu/drm/i915/intel_psr.c @@ -1065,13 +1065,6 @@ void intel_psr_init(struct drm_i915_private *dev_priv) if (!dev_priv->psr.sink_support) return; - if (i915_modparams.enable_psr == -1) { - i915_modparams.enable_psr = dev_priv->vbt.psr.enable; - - /* Per platform default: all disabled. */ - i915_modparams.enable_psr = 0; - } - /* Set link_standby x link_off defaults */ if (IS_HASWELL(dev_priv) || IS_BROADWELL(dev_priv)) /* HSW and BDW require workarounds that we don't implement. */ -- 2.20.1.415.g653613c723-goog
[PATCH] ASoC: rt5677: use more of the volume range from DACs
From: Dylan Reid The DACs volume can go over 0, both according to the data sheet and real world testing. The control can go up to +30dB. This was tested by playing audio at full volume on a samus chromebook. Signed-off-by: Dylan Reid Reviewed-by: Hsinyu Chao Signed-off-by: Ross Zwisler --- sound/soc/codecs/rt5677.c | 8 1 file changed, 4 insertions(+), 4 deletions(-) diff --git a/sound/soc/codecs/rt5677.c b/sound/soc/codecs/rt5677.c index 9b7a1833d3316..513fec86301e8 100644 --- a/sound/soc/codecs/rt5677.c +++ b/sound/soc/codecs/rt5677.c @@ -832,13 +832,13 @@ static const struct snd_kcontrol_new rt5677_snd_controls[] = { /* DAC Digital Volume */ SOC_DOUBLE_TLV("DAC1 Playback Volume", RT5677_DAC1_DIG_VOL, - RT5677_L_VOL_SFT, RT5677_R_VOL_SFT, 87, 0, dac_vol_tlv), + RT5677_L_VOL_SFT, RT5677_R_VOL_SFT, 127, 0, dac_vol_tlv), SOC_DOUBLE_TLV("DAC2 Playback Volume", RT5677_DAC2_DIG_VOL, - RT5677_L_VOL_SFT, RT5677_R_VOL_SFT, 87, 0, dac_vol_tlv), + RT5677_L_VOL_SFT, RT5677_R_VOL_SFT, 127, 0, dac_vol_tlv), SOC_DOUBLE_TLV("DAC3 Playback Volume", RT5677_DAC3_DIG_VOL, - RT5677_L_VOL_SFT, RT5677_R_VOL_SFT, 87, 0, dac_vol_tlv), + RT5677_L_VOL_SFT, RT5677_R_VOL_SFT, 127, 0, dac_vol_tlv), SOC_DOUBLE_TLV("DAC4 Playback Volume", RT5677_DAC4_DIG_VOL, - RT5677_L_VOL_SFT, RT5677_R_VOL_SFT, 87, 0, dac_vol_tlv), + RT5677_L_VOL_SFT, RT5677_R_VOL_SFT, 127, 0, dac_vol_tlv), /* IN1/IN2 Control */ SOC_SINGLE_TLV("IN1 Boost", RT5677_IN1, RT5677_BST_SFT1, 8, 0, bst_tlv), -- 2.20.1.415.g653613c723-goog
Re: [PATCH] thermal: add ratelimited thermal and power logging
On Wed, Oct 24, 2018 at 1:22 AM Viresh Kumar wrote: > On 22-10-18, 14:29, Ross Zwisler wrote: > > From: Ricky Liang > > > > Add thermal logs in devfreq_cooling and cpu_cooling. > > Why should we add them ? > > > Also add logging to > > power_allocator when it starts to control power. > > > > These changes can lead to excessive log spam when running up against > > thermal limits, so have this logging ratelimited to allow only 1 log each > > 30 seconds from each of those subsystems. > > What's the use of these logs when we are going to print them only once every > 30 > seconds ? > > I recently extended thermal sysfs support to share more stats. > > commit 8ea229511e06 ("thermal: Add cooling device's statistics in sysfs") > > Will that be helpful in your case ? > > Otherwise we should probably add trace points instead. Thank you for the review. Basically we use these prints to get a notification when a system is having thermal issues. It's easy to look in dmesg and see the prints and know that something temperature related is going on. However, I agree that the current solution is a bit hacky, and in looking at it a bit further we don't even cover all the paths that we need to. The processor_set_cur_state() function in drivers/acpi/processor_thermal.c, for example, is used on the x86_64 systems I'm testing with and wasn't augmented with prints. I'm going to take a step back and try and find another solution. The info you added to sysfs looks very promising, thank you for pointing it out.
[PATCH 2/2] gsmi: Log event for critical thermal thresholds
From: Duncan Laurie This registers a notifier for the new thermal notifier introduced in a previous commit and logs a kernel shutdown event when the notifier is called for crossing the THERMAL_TRIP_CRITICAL threshold. To test: 1) Modify critical shutdown threshold in the BIOS 2) Generate a load on the system to increase temperature 3) Wait until system powers off 4) Power back on and read the event log: 4 | 2012-07-18 10:47:02 | Kernel Event | Critical Thermal Threshold 5 | 2012-07-18 10:47:06 | Kernel Event | Clean Shutdown 6 | 2012-07-18 10:47:06 | ACPI Enter | S5 Signed-off-by: Duncan Laurie Reviewed-by: Vincent Palatin Reviewed-by: Olof Johansson [ rez: updated changelog for upstream ] Signed-off-by: Ross Zwisler --- drivers/firmware/google/gsmi.c | 16 1 file changed, 16 insertions(+) diff --git a/drivers/firmware/google/gsmi.c b/drivers/firmware/google/gsmi.c index c8f169bf2e27d..ee0c611b83e99 100644 --- a/drivers/firmware/google/gsmi.c +++ b/drivers/firmware/google/gsmi.c @@ -29,6 +29,7 @@ #include #include #include +#include #define GSMI_SHUTDOWN_CLEAN0 /* Clean Shutdown */ /* TODO(mi...@google.com): Tie in HARDLOCKUP_DETECTOR with NMIWDT */ @@ -40,6 +41,7 @@ #define GSMI_SHUTDOWN_SOFTWDT 6 /* Software Watchdog */ #define GSMI_SHUTDOWN_MBE 7 /* Uncorrected ECC */ #define GSMI_SHUTDOWN_TRIPLE 8 /* Triple Fault */ +#define GSMI_SHUTDOWN_THERMAL 9 /* Critical Thermal Threshold */ #define DRIVER_VERSION "1.0" #define GSMI_GUID_SIZE 16 @@ -670,6 +672,18 @@ static struct notifier_block gsmi_panic_notifier = { .notifier_call = gsmi_panic_callback, }; +static int gsmi_thermal_callback(struct notifier_block *nb, +unsigned long reason, void *arg) +{ + if (reason == THERMAL_TRIP_CRITICAL) + gsmi_shutdown_reason(GSMI_SHUTDOWN_THERMAL); + return NOTIFY_DONE; +} + +static struct notifier_block gsmi_thermal_notifier = { + .notifier_call = gsmi_thermal_callback +}; + /* * This hash function was blatantly copied from include/linux/hash.h. 
* It is used by this driver to obfuscate a board name that requires a @@ -892,6 +906,7 @@ static __init int gsmi_init(void) goto out_remove_sysfs_files; } + register_thermal_notifier(&gsmi_thermal_notifier); register_reboot_notifier(&gsmi_reboot_notifier); register_die_notifier(&gsmi_die_notifier); atomic_notifier_chain_register(&panic_notifier_list, @@ -918,6 +933,7 @@ static __init int gsmi_init(void) static void __exit gsmi_exit(void) { + unregister_thermal_notifier(&gsmi_thermal_notifier); unregister_reboot_notifier(&gsmi_reboot_notifier); unregister_die_notifier(&gsmi_die_notifier); atomic_notifier_chain_unregister(&panic_notifier_list, -- 2.19.1.568.g152ad8e336-goog
[PATCH 1/2] thermal: Add notifier call chain for hot/critical events
From: Duncan Laurie This will allow drivers to register a callback for important thermal events and log critical thresholds that cause the system to shut down. There are other places this might work, but after consideration I think it makes sense to have the chain at this level: The ACPI thermal driver has an existing notify function that is eventually called into, but that would limit the notifier to only working on systems that use ACPI. The cpufreq driver is already getting a notify callback executed in this path (tz->ops->notify) but the threshold info is not passed to the cpu_cooling notifier chain so it is not useful for logging. Signed-off-by: Duncan Laurie Reviewed-by: Olof Johansson Reviewed-by: Vincent Palatin [ rez: updated changelog for upstream ] Signed-off-by: Ross Zwisler --- drivers/thermal/thermal_core.c | 38 ++ include/linux/thermal.h| 4 2 files changed, 42 insertions(+) diff --git a/drivers/thermal/thermal_core.c b/drivers/thermal/thermal_core.c index 6ab982309e6a0..e1f8764b3d9f9 100644 --- a/drivers/thermal/thermal_core.c +++ b/drivers/thermal/thermal_core.c @@ -15,6 +15,7 @@ #include #include #include +#include #include #include #include @@ -313,12 +314,46 @@ static void monitor_thermal_zone(struct thermal_zone_device *tz) mutex_unlock(&tz->lock); } +static BLOCKING_NOTIFIER_HEAD(thermal_notifier_list); + +/** + * register_thermal_notifier - Register function to be called for + * critical thermal events. + * + * @nb: Info about notifier function to be called + * + * Currently always returns zero, as blocking_notifier_chain_register() + * always returns zero. + */ +int register_thermal_notifier(struct notifier_block *nb) +{ + return blocking_notifier_chain_register(&thermal_notifier_list, nb); +} +EXPORT_SYMBOL(register_thermal_notifier); + +/** + * unregister_thermal_notifier - Unregister thermal notifier + * + * @nb: Hook to be unregistered + * + * Returns zero on success, or %-ENOENT on failure. 
+ */ +int unregister_thermal_notifier(struct notifier_block *nb) +{ + return blocking_notifier_chain_unregister(&thermal_notifier_list, nb); +} +EXPORT_SYMBOL(unregister_thermal_notifier); + + static void handle_non_critical_trips(struct thermal_zone_device *tz, int trip, enum thermal_trip_type trip_type) { tz->governor ? tz->governor->throttle(tz, trip) : def_governor->throttle(tz, trip); + + blocking_notifier_call_chain(&thermal_notifier_list, +trip_type, NULL); } /** @@ -385,6 +420,9 @@ static void handle_critical_trips(struct thermal_zone_device *tz, if (tz->ops->notify) tz->ops->notify(tz, trip, trip_type); + blocking_notifier_call_chain(&thermal_notifier_list, +trip_type, NULL); + if (trip_type == THERMAL_TRIP_CRITICAL) { dev_emerg(&tz->device, "critical temperature reached (%d C), shutting down\n", diff --git a/include/linux/thermal.h b/include/linux/thermal.h index 5f4705f46c2f9..b948344d55cab 100644 --- a/include/linux/thermal.h +++ b/include/linux/thermal.h @@ -13,6 +13,7 @@ #include #include #include +#include #include #include #include @@ -542,4 +543,7 @@ static inline int thermal_generate_netlink_event(struct thermal_zone_device *tz, } #endif +extern int register_thermal_notifier(struct notifier_block *); +extern int unregister_thermal_notifier(struct notifier_block *); + #endif /* __THERMAL_H__ */ -- 2.19.1.568.g152ad8e336-goog
[PATCH] thermal: add ratelimited thermal and power logging
From: Ricky Liang Add thermal logs in devfreq_cooling and cpu_cooling. Also add logging to power_allocator when it starts to control power. These changes can lead to excessive log spam when running up against thermal limits, so have this logging ratelimited to allow only 1 log each 30 seconds from each of those subsystems. Signed-off-by: Ricky Liang Co-authored-by: Stephen Barber Signed-off-by: Stephen Barber Reviewed-by: Daniel Kurtz [ rez: squashed initial implementation & fixes, updated changelog for upstream. ] Signed-off-by: Ross Zwisler --- drivers/thermal/cpu_cooling.c | 16 drivers/thermal/devfreq_cooling.c | 12 drivers/thermal/power_allocator.c | 17 + 3 files changed, 45 insertions(+) diff --git a/drivers/thermal/cpu_cooling.c b/drivers/thermal/cpu_cooling.c index dfd23245f778a..d8d1855d7d991 100644 --- a/drivers/thermal/cpu_cooling.c +++ b/drivers/thermal/cpu_cooling.c @@ -31,9 +31,17 @@ #include #include #include +#include #include +static DEFINE_RATELIMIT_STATE(cpu_cooling_ratelimit_state, 30 * HZ, 1); + +static int cpu_cooling_ratelimit(void) +{ + return __ratelimit(&cpu_cooling_ratelimit_state); +} + /* * Cooling state <-> CPUFreq frequency * @@ -389,6 +397,7 @@ static int cpufreq_set_cur_state(struct thermal_cooling_device *cdev, { struct cpufreq_cooling_device *cpufreq_cdev = cdev->devdata; unsigned int clip_freq; + struct device *cpu_dev; /* Request state should be less than max_level */ if (WARN_ON(state > cpufreq_cdev->max_level)) @@ -404,6 +413,13 @@ static int cpufreq_set_cur_state(struct thermal_cooling_device *cdev, cpufreq_update_policy(cpufreq_cdev->policy->cpu); + if (cpu_cooling_ratelimit()) { + cpu_dev = get_cpu_device(cpufreq_cdev->policy->cpu); + dev_info(cpu_dev, +"Cooling state set to %lu. 
New max freq = %u\n", +state, clip_freq); + } + return 0; } diff --git a/drivers/thermal/devfreq_cooling.c b/drivers/thermal/devfreq_cooling.c index ef59256887ff6..f95c7f513f05a 100644 --- a/drivers/thermal/devfreq_cooling.c +++ b/drivers/thermal/devfreq_cooling.c @@ -25,6 +25,7 @@ #include #include #include +#include #include @@ -32,6 +33,13 @@ static DEFINE_IDA(devfreq_ida); +static DEFINE_RATELIMIT_STATE(devfreq_cooling_ratelimit_state, 30 * HZ, 1); + +static int devfreq_cooling_ratelimit(void) +{ + return __ratelimit(&devfreq_cooling_ratelimit_state); +} + /** * struct devfreq_cooling_device - Devfreq cooling device * @id:unique integer value corresponding to each @@ -150,6 +158,10 @@ static int devfreq_cooling_set_cur_state(struct thermal_cooling_device *cdev, dfc->cooling_state = state; + if (devfreq_cooling_ratelimit()) + dev_info(dev, "Cooling state set to %lu. New max freq = %u\n", +state, dfc->freq_table[state]); + return 0; } diff --git a/drivers/thermal/power_allocator.c b/drivers/thermal/power_allocator.c index 3055f9a12a170..5140a07fe60aa 100644 --- a/drivers/thermal/power_allocator.c +++ b/drivers/thermal/power_allocator.c @@ -15,6 +15,7 @@ #define pr_fmt(fmt) "Power allocator: " fmt +#include #include #include #include @@ -30,6 +31,13 @@ #define int_to_frac(x) ((x) << FRAC_BITS) #define frac_to_int(x) ((x) >> FRAC_BITS) +static DEFINE_RATELIMIT_STATE(power_allocator_ratelimit_state, 30 * HZ, 1); + +static int power_allocator_ratelimit(void) +{ + return __ratelimit(&power_allocator_ratelimit_state); +} + /** * mul_frac() - multiply two fixed-point numbers * @x: first multiplicand @@ -443,6 +451,15 @@ static int allocate_power(struct thermal_zone_device *tz, max_allocatable_power, tz->temperature, control_temp - tz->temperature); + if (total_granted_power < total_req_power && + power_allocator_ratelimit()) { + dev_info(&tz->device, "Controlling power: control_temp=%d " +"last_temp=%d, curr_temp=%d total_requested_power=%d " 
+"total_granted_power=%d\n", control_temp, +tz->last_temperature, tz->temperature, +total_req_power, total_granted_power); + } + kfree(req_power); unlock: mutex_unlock(&tz->lock); -- 2.19.1.568.g152ad8e336-goog
Re: [PATCH] pstore/ram: Clarify resource reservation labels
On Thu, Oct 18, 2018 at 2:31 PM Kees Cook wrote: > > On Thu, Oct 18, 2018 at 8:33 AM, Dan Williams > wrote: > > [ add Ross ] > > Hi Ross! :) > > > On Thu, Oct 18, 2018 at 12:15 AM Kees Cook wrote: > >> As for nvdimm specifically, yes, I'd love to get pstore hooked up > >> correctly to nvdimm. How do the namespaces work? Right now pstore > >> depends one of platform driver data, device tree specification, or > >> manual module parameters. > > > > From the userspace side we have the ndctl utility to wrap > > personalities on top of namespaces. So for example, I envision we > > would be able to do: > > > > ndctl create-namespace --mode=pstore --size=128M > > > > ...and create a small namespace that will register with the pstore > > sub-system. > > > > On the kernel side this would involve registering a 'pstore_dev' child > > / seed device under each region device. The 'seed-device' sysfs scheme > > is described in our documentation [1]. The short summary is ndctl > > finds a seed device assigns a namespace to it and then binding that > > device to a driver causes it to be initialized by the kernel. > > > > [1]: https://www.kernel.org/doc/Documentation/nvdimm/nvdimm.txt > > Interesting! > > Really, this would be a way to configure "ramoops" (the persistent RAM > backend to pstore), rather than pstore itself (pstore is just the > framework). From reading the ndctl man page it sounds like there isn't > a way to store configuration information beyond just size? Ramoops needs a start (mem_address), size (mem_size) and mapping type (mem_type), right? I think we get the first two for free based on the size of the namespace, so really we'd just be looking for a way to switch between cacheable/noncached memory? > ramoops will auto-configure itself and fill available space using its > default parameters, but it might be nice to have a way to store that > somewhere (traditionally it's part of device tree or platform data). 
> ramoops could grow a "header", but normally the regions are very small > so I've avoided that. Several of the other modes (BTT and DAX) have space for additional metadata in their namespaces. If we just need a single bit, though, maybe we can grab that out of the "flags" field of the namespace label. http://pmem.io/documents/NVDIMM_Namespace_Spec.pdf section 2.2.3. Dan, is this workable or is there a better option? Is it a useful feature to have other types of namespaces be able to control their caching attributes in this way? > I'm not sure I understand the right way to glue ramoops_probe() to the > "seed-device" stuff. (It needs to be probed VERY early to catch early > crashes -- ramoops uses postcore_initcall() normally.)
[PATCH 3/4] gsmi: Remove autoselected dependency on EFI and EFI_VARS
From: Duncan Laurie Instead of selecting EFI and EFI_VARS automatically when GSMI is enabled let that portion of the driver be conditionally compiled if EFI and EFI_VARS are enabled. This allows the rest of the driver (specifically event log) to be used if EFI_VARS is not enabled. To test: 1) verify that EFI_VARS is not automatically selected when CONFIG_GOOGLE_GSMI is enabled 2) verify that the kernel boots on Link and that GSMI event log is still available and functional 3) specifically boot the kernel on Alex to ensure it does not try to load efivars and that gsmi also does not load because it is not in the supported DMI table Signed-off-by: Duncan Laurie Reviewed-by: Olof Johansson Signed-off-by: Benson Leung Signed-off-by: Ben Zhang Signed-off-by: Filipe Brandenburger Signed-off-by: Furquan Shaikh Tested-by: Furquan Shaikh Reviewed-by: Aaron Durbin [zwisler: update changelog for upstream] Signed-off-by: Ross Zwisler --- drivers/firmware/google/Kconfig | 6 +++--- drivers/firmware/google/gsmi.c | 16 2 files changed, 15 insertions(+), 7 deletions(-) diff --git a/drivers/firmware/google/Kconfig b/drivers/firmware/google/Kconfig index a456a48b..9ac36762e503 100644 --- a/drivers/firmware/google/Kconfig +++ b/drivers/firmware/google/Kconfig @@ -10,12 +10,12 @@ if GOOGLE_FIRMWARE config GOOGLE_SMI tristate "SMI interface for Google platforms" - depends on X86 && ACPI && DMI && EFI - select EFI_VARS + depends on X86 && ACPI && DMI help Say Y here if you want to enable SMI callbacks for Google platforms. This provides an interface for writing to and - clearing the EFI event log and reading and writing NVRAM + clearing the event log. If EFI_VARS is also enabled this + driver provides an interface for reading and writing NVRAM variables. 
config GOOGLE_COREBOOT_TABLE diff --git a/drivers/firmware/google/gsmi.c b/drivers/firmware/google/gsmi.c index 252884787266..edab00cc6bba 100644 --- a/drivers/firmware/google/gsmi.c +++ b/drivers/firmware/google/gsmi.c @@ -289,6 +289,10 @@ static int gsmi_exec(u8 func, u8 sub) return rc; } +#ifdef CONFIG_EFI_VARS + +static struct efivars efivars; + static efi_status_t gsmi_get_variable(efi_char16_t *name, efi_guid_t *vendor, u32 *attr, unsigned long *data_size, @@ -466,6 +470,8 @@ static const struct efivar_operations efivar_ops = { .get_next_variable = gsmi_get_next_variable, }; +#endif /* CONFIG_EFI_VARS */ + static ssize_t eventlog_write(struct file *filp, struct kobject *kobj, struct bin_attribute *bin_attr, char *buf, loff_t pos, size_t count) @@ -767,7 +773,6 @@ static __init int gsmi_system_valid(void) } static struct kobject *gsmi_kobj; -static struct efivars efivars; static const struct platform_device_info gsmi_dev_info = { .name = "gsmi", @@ -891,11 +896,14 @@ static __init int gsmi_init(void) goto out_remove_bin_file; } +#ifdef CONFIG_EFI_VARS ret = efivars_register(&efivars, &efivar_ops, gsmi_kobj); if (ret) { printk(KERN_INFO "gsmi: Failed to register efivars\n"); - goto out_remove_sysfs_files; + sysfs_remove_files(gsmi_kobj, gsmi_attrs); + goto out_remove_bin_file; } +#endif register_reboot_notifier(&gsmi_reboot_notifier); register_die_notifier(&gsmi_die_notifier); @@ -906,8 +914,6 @@ static __init int gsmi_init(void) return 0; -out_remove_sysfs_files: - sysfs_remove_files(gsmi_kobj, gsmi_attrs); out_remove_bin_file: sysfs_remove_bin_file(gsmi_kobj, &eventlog_bin_attr); out_err: @@ -927,7 +933,9 @@ static void __exit gsmi_exit(void) unregister_die_notifier(&gsmi_die_notifier); atomic_notifier_chain_unregister(&panic_notifier_list, &gsmi_panic_notifier); +#ifdef CONFIG_EFI_VARS efivars_unregister(&efivars); +#endif sysfs_remove_files(gsmi_kobj, gsmi_attrs); sysfs_remove_bin_file(gsmi_kobj, &eventlog_bin_attr); -- 2.19.0.605.g01d371f741-goog
[PATCH 1/4] gsmi: Fix bug in append_to_eventlog sysfs handler
From: Duncan Laurie The sysfs handler should return the number of bytes consumed, which in the case of a successful write is the entire buffer. Also fix a bug where param.data_len was being set to (count - (2 * sizeof(u32))) instead of just (count - sizeof(u32)). The latter is correct because we skip over the leading u32 which is our param.type, but we were also incorrectly subtracting sizeof(u32) on the line where we were actually setting param.data_len: param.data_len = count - sizeof(u32); This meant that for our example event.kernel_software_watchdog with total length 10 bytes, param.data_len was just 2 prior to this change. To test, successfully append an event to the log with gsmi sysfs. This sample event is for a "Kernel Software Watchdog" > xxd -g 1 event.kernel_software_watchdog 000: 01 00 00 00 ad de 06 00 00 00 > cat event.kernel_software_watchdog > /sys/firmware/gsmi/append_to_eventlog > mosys eventlog list | tail -1 14 | 2012-06-25 10:14:14 | Kernl Event | Software Watchdog Signed-off-by: Duncan Laurie Reviewed-by: Vadim Bendebury Reviewed-by: Stefan Reinauer Signed-off-by: Furquan Shaikh Tested-by: Furquan Shaikh Reviewed-by: Aaron Durbin Reviewed-by: Justin TerAvest [zwisler: updated changelog for 2nd bug fix and upstream] Signed-off-by: Ross Zwisler --- drivers/firmware/google/gsmi.c | 5 ++--- 1 file changed, 2 insertions(+), 3 deletions(-) diff --git a/drivers/firmware/google/gsmi.c b/drivers/firmware/google/gsmi.c index c8f169bf2e27..62337be07afc 100644 --- a/drivers/firmware/google/gsmi.c +++ b/drivers/firmware/google/gsmi.c @@ -480,11 +480,10 @@ static ssize_t eventlog_write(struct file *filp, struct kobject *kobj, if (count < sizeof(u32)) return -EINVAL; param.type = *(u32 *)buf; - count -= sizeof(u32); buf += sizeof(u32); /* The remaining buffer is the data payload */ - if (count > gsmi_dev.data_buf->length) + if ((count - sizeof(u32)) > gsmi_dev.data_buf->length) return -EINVAL; param.data_len = count - sizeof(u32); @@ -504,7 +503,7 @@ static 
ssize_t eventlog_write(struct file *filp, struct kobject *kobj, spin_unlock_irqrestore(&gsmi_dev.lock, flags); - return rc; + return (rc == 0) ? count : rc; } -- 2.19.0.605.g01d371f741-goog
[PATCH 0/4] gsmi: Google specific firmware patches
This series contains some Google specific firmware patches that we've been carrying out of tree. I've updated the changelog for each so that it is suitable for upstream, and I've retested them to make sure I know what the patches are doing. Duncan Laurie (3): gsmi: Fix bug in append_to_eventlog sysfs handler gsmi: Add coreboot to list of matching BIOS vendors gsmi: Remove autoselected dependency on EFI and EFI_VARS Furquan Shaikh (1): gsmi: Add GSMI commands to log S0ix info drivers/firmware/google/Kconfig | 6 +- drivers/firmware/google/gsmi.c | 120 +--- 2 files changed, 115 insertions(+), 11 deletions(-) -- 2.19.0.605.g01d371f741-goog
[PATCH 2/4] gsmi: Add coreboot to list of matching BIOS vendors
From: Duncan Laurie In order to use this coreboot needs board support for: CONFIG_ELOG=y CONFIG_ELOG_GSMI=y And the kernel driver needs enabled: CONFIG_GOOGLE_GSMI=y To test, verify that clean shutdown event is added to the log: > mosys eventlog list | grep 'Clean Shutdown' 11 | 2012-06-25 09:49:24 | Kernl Event | Clean Shutdown Signed-off-by: Duncan Laurie Reviewed-by: Vadim Bendebury Reviewed-by: Stefan Reinauer Signed-off-by: Furquan Shaikh Tested-by: Furquan Shaikh Reviewed-by: Aaron Durbin Reviewed-by: Justin TerAvest [zwisler: update changelog for upstream] Signed-off-by: Ross Zwisler --- drivers/firmware/google/gsmi.c | 6 ++ 1 file changed, 6 insertions(+) diff --git a/drivers/firmware/google/gsmi.c b/drivers/firmware/google/gsmi.c index 62337be07afc..252884787266 100644 --- a/drivers/firmware/google/gsmi.c +++ b/drivers/firmware/google/gsmi.c @@ -715,6 +715,12 @@ static const struct dmi_system_id gsmi_dmi_table[] __initconst = { DMI_MATCH(DMI_BOARD_VENDOR, "Google, Inc."), }, }, + { + .ident = "Coreboot Firmware", + .matches = { + DMI_MATCH(DMI_BIOS_VENDOR, "coreboot"), + }, + }, {} }; MODULE_DEVICE_TABLE(dmi, gsmi_dmi_table); -- 2.19.0.605.g01d371f741-goog
[PATCH 4/4] gsmi: Add GSMI commands to log S0ix info
From: Furquan Shaikh Add new GSMI commands (GSMI_CMD_LOG_S0IX_SUSPEND = 0xa, GSMI_CMD_LOG_S0IX_RESUME = 0xb) that allow firmware to log any information during S0ix suspend/resume paths. Traditional ACPI suspend S3 involves BIOS both during the suspend and the resume paths. However, modern suspend type like S0ix does not involve firmware on either of the paths. This command gives the firmware an opportunity to log any required information about the suspend and resume operations e.g. wake sources. Additionally, this change adds a module parameter to allow platforms to specifically enable S0ix logging if required. This prevents any other platforms from unnecessarily making a GSMI call which could have any side-effects. Tested by verifying that wake sources are correctly logged in eventlog. Signed-off-by: Furquan Shaikh Reviewed-by: Aaron Durbin Reviewed-by: Rajat Jain Signed-off-by: Furquan Shaikh Tested-by: Furquan Shaikh Reviewed-by: Aaron Durbin [zwisler: update changelog for upstream] Signed-off-by: Ross Zwisler --- drivers/firmware/google/gsmi.c | 93 +- 1 file changed, 92 insertions(+), 1 deletion(-) diff --git a/drivers/firmware/google/gsmi.c b/drivers/firmware/google/gsmi.c index edab00cc6bba..7ee25ce0e318 100644 --- a/drivers/firmware/google/gsmi.c +++ b/drivers/firmware/google/gsmi.c @@ -29,6 +29,7 @@ #include #include #include +#include #define GSMI_SHUTDOWN_CLEAN0 /* Clean Shutdown */ /* TODO(mi...@google.com): Tie in HARDLOCKUP_DETECTOR with NMIWDT */ @@ -70,6 +71,8 @@ #define GSMI_CMD_SET_NVRAM_VAR 0x03 #define GSMI_CMD_SET_EVENT_LOG 0x08 #define GSMI_CMD_CLEAR_EVENT_LOG 0x09 +#define GSMI_CMD_LOG_S0IX_SUSPEND 0x0a +#define GSMI_CMD_LOG_S0IX_RESUME 0x0b #define GSMI_CMD_CLEAR_CONFIG 0x20 #define GSMI_CMD_HANDSHAKE_TYPE0xC1 @@ -122,7 +125,6 @@ struct gsmi_log_entry_type_1 { u32 instance; } __packed; - /* * Some platforms don't have explicit SMI handshake * and need to wait for SMI to complete. 
@@ -133,6 +135,15 @@ module_param(spincount, uint, 0600); MODULE_PARM_DESC(spincount, "The number of loop iterations to use when using the spin handshake."); +/* + * Platforms might not support S0ix logging in their GSMI handlers. In order to + * avoid any side-effects of generating an SMI for S0ix logging, use the S0ix + * related GSMI commands only for those platforms that explicitly enable this + * option. + */ +static bool s0ix_logging_enable; +module_param(s0ix_logging_enable, bool, 0600); + static struct gsmi_buf *gsmi_buf_alloc(void) { struct gsmi_buf *smibuf; @@ -781,6 +792,78 @@ static const struct platform_device_info gsmi_dev_info = { .dma_mask = DMA_BIT_MASK(32), }; +#ifdef CONFIG_PM +static void gsmi_log_s0ix_info(u8 cmd) +{ + unsigned long flags; + + /* +* If platform has not enabled S0ix logging, then no action is +* necessary. +*/ + if (!s0ix_logging_enable) + return; + + spin_lock_irqsave(&gsmi_dev.lock, flags); + + memset(gsmi_dev.param_buf->start, 0, gsmi_dev.param_buf->length); + + gsmi_exec(GSMI_CALLBACK, cmd); + + spin_unlock_irqrestore(&gsmi_dev.lock, flags); +} + +static int gsmi_log_s0ix_suspend(struct device *dev) +{ + /* +* If system is not suspending via firmware using the standard ACPI Sx +* types, then make a GSMI call to log the suspend info. +*/ + if (!pm_suspend_via_firmware()) + gsmi_log_s0ix_info(GSMI_CMD_LOG_S0IX_SUSPEND); + + /* +* Always return success, since we do not want suspend +* to fail just because of logging failure. +*/ + return 0; +} + +static int gsmi_log_s0ix_resume(struct device *dev) +{ + /* +* If system did not resume via firmware, then make a GSMI call to log +* the resume info and wake source. +*/ + if (!pm_resume_via_firmware()) + gsmi_log_s0ix_info(GSMI_CMD_LOG_S0IX_RESUME); + + /* +* Always return success, since we do not want resume +* to fail just because of logging failure. 
+*/ + return 0; +} + +static const struct dev_pm_ops gsmi_pm_ops = { + .suspend_noirq = gsmi_log_s0ix_suspend, + .resume_noirq = gsmi_log_s0ix_resume, +}; + +static int gsmi_platform_driver_probe(struct platform_device *dev) +{ + return 0; +} + +static struct platform_driver gsmi_driver_info = { + .driver = { + .name = "gsmi", + .pm = &gsmi_pm_ops, + }, + .probe = gsmi_platform_driver_probe, +}; +#endif + static __init int gsmi_init(void) { unsigned long flags; @@ -792,6 +875,14 @@ static __init int gsmi_init(void) gsmi_dev.smi_cmd = acpi_gbl_FADT.smi_command; +#ifdef CONFIG_PM + ret = platform_driver_regist
Re: [PATCH v14 00/74] Convert page cache to XArray
On Wed, Jul 25, 2018 at 02:03:23PM -0700, Matthew Wilcox wrote: > On Wed, Jun 27, 2018 at 01:44:38PM -0600, Ross Zwisler wrote: > > On Wed, Jun 27, 2018 at 04:05:29AM -0700, Matthew Wilcox wrote: > > > On Tue, Jun 19, 2018 at 10:16:38AM -0700, Matthew Wilcox wrote: > > > > I think I see a bug. No idea if it's the one you're hitting ;-) > > > > > > > > I had been intending to not use the 'entry' to decide whether we were > > > > waiting on a 2MB or 4kB page, but rather the xas. I shelved that idea, > > > > but not before dropping the DAX_PMD flag being passed from the PMD > > > > pagefault caller. So if I put that back ... > > > > > > Did you get a chance to test this? > > > > With this patch it doesn't deadlock, but the test dies with a SIGBUS and we > > hit a WARN_ON in the DAX code: > > > > WARNING: CPU: 5 PID: 1678 at fs/dax.c:226 get_unlocked_entry+0xf7/0x120 > > > > I don't have a lot of time this week to debug further. The quickest path to > > victory is probably for you to get this reproducing in your test setup. > > Does > > XFS + DAX + generic/340 pass for you? > > I now have generic/340 passing. I've pushed a new version to > git://git.infradead.org/users/willy/linux-dax.git xarray Thanks, I'll throw it in my test setup.
Re: [PATCH v2 0/6] kaddr and pfn can be NULL to ->direct_access()
On Thu, Jul 26, 2018 at 12:28:43AM +0800, Huaisheng Ye wrote: > From: Huaisheng Ye > > Changes since v1 [1]: > * Involve the previous patches for pfn can be NULL. > * Reword the patch descriptions according to Christian's comment. > * According to Ross's suggestion, replace local pointer dummy_addr > with NULL within md/dm-writecache for direct_access. > > [1]: https://lkml.org/lkml/2018/7/24/199 > > Some functions within fs/dax, dax/super and md/dm-writecache don't > need to get local pointer kaddr or variable pfn from direct_access. > Assigning NULL to kaddr or pfn to ->direct_access() is more > straightforward and simple than offering a useless local pointer or > variable. > > So all ->direct_access() need to check the validity of pointer kaddr > and pfn for NULL assignment. If either of them is equal to NULL, that > is to say callers may have no need for kaddr or pfn, so this series of > patch are prepared for allowing them to pass in NULL instead of having > to pass in a local pointer or variable that they then just throw away. Looks good. For the series: Reviewed-by: Ross Zwisler
Re: [PATCH] mm/sparse: Make sparse_init_one_section void and remove check
On Mon, Jul 2, 2018 at 12:48 PM Pavel Tatashin wrote: > > On Mon, Jul 2, 2018 at 11:43 AM wrote: > > > > From: Oscar Salvador > > > > sparse_init_one_section() is being called from two sites: > > sparse_init() and sparse_add_one_section(). > > The former calls it from a for_each_present_section_nr() loop, > > and the latter marks the section as present before calling it. > > This means that when sparse_init_one_section() gets called, we already know > > that the section is present. > > So there is no point to double check that in the function. > > > > This removes the check and makes the function void. > > > > Signed-off-by: Oscar Salvador > > Thank you Oscar. > > Reviewed-by: Pavel Tatashin It looks like this change breaks "fsdax" mode namespaces in next-20180705. The offending commit is: commit 054620849110 ("mm/sparse.c: make sparse_init_one_section void and remove check") Here is the stack trace I get when converting a raw mode namespace to fsdax mode, and from then on during each reboot as the namespace is being initialized: [6.067166] BUG: unable to handle kernel paging request at ea000580 [6.068084] PGD 13ffdd067 P4D 13ffdd067 PUD 13ffdc067 PMD 0 [6.068771] Oops: 0002 [#1] PREEMPT SMP PTI [6.069262] CPU: 11 PID: 180 Comm: kworker/u24:2 Not tainted 4.18.0-rc3-00193-g054620849110 #1 [6.070440] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.11.1-0-g0551a4be2c-prebuilt.qemu-project.org 04/01/2014 [6.071689] Workqueue: events_unbound async_run_entry_fn [6.072261] RIP: 0010:memmap_init_zone+0x154/0x1cf [6.072882] Code: 48 89 c3 48 c1 eb 0c e9 82 00 00 00 48 89 da 48 b8 00 00 00 00 00 ea ff ff b9 10 00 00 00 48 c1 e2 06 48 01 c2 31 c0 48 89 d7 ab 48 b8 ff ff ff ff ff ff 7f 00 48 23 45 c0 c7 42 34 01 00 00 [6.075396] RSP: 0018:c900024bfa70 EFLAGS: 00010246 [6.076052] RAX: RBX: 00140002 RCX: 0010 [6.076845] RDX: ea000580 RSI: RDI: ea000580 [6.077604] RBP: c900024bfab0 R08: 0001 R09: 88010eb50d38 [6.078394] R10: R11: R12: [6.079331] R13: 0004 R14: 
0001 R15: [6.080274] FS: () GS:880115a0() knlGS: [6.081337] CS: 0010 DS: ES: CR0: 80050033 [6.082092] CR2: ea000580 CR3: 02824000 CR4: 06e0 [6.083032] Call Trace: [6.083371] move_pfn_range_to_zone+0x168/0x180 [6.083965] devm_memremap_pages+0x29b/0x480 [6.084550] pmem_attach_disk+0x1ae/0x6c0 [nd_pmem] [6.085204] ? devm_memremap+0x79/0xb0 [6.085714] nd_pmem_probe+0x7e/0xa0 [nd_pmem] [6.086320] nvdimm_bus_probe+0x6e/0x160 [libnvdimm] [6.086977] driver_probe_device+0x310/0x480 [6.087543] __device_attach_driver+0x86/0x100 [6.088136] ? __driver_attach+0x110/0x110 [6.088681] bus_for_each_drv+0x6e/0xb0 [6.089190] __device_attach+0xe2/0x160 [6.089705] device_initial_probe+0x13/0x20 [6.090266] bus_probe_device+0xa6/0xc0 [6.090772] device_add+0x41b/0x660 [6.091249] ? lock_acquire+0xa3/0x210 [6.091743] nd_async_device_register+0x12/0x40 [libnvdimm] [6.092398] async_run_entry_fn+0x3e/0x170 [6.092921] process_one_work+0x230/0x680 [6.093455] worker_thread+0x3f/0x3b0 [6.093930] kthread+0x12f/0x150 [6.094362] ? process_one_work+0x680/0x680 [6.094903] ? kthread_create_worker_on_cpu+0x70/0x70 [6.095574] ret_from_fork+0x3a/0x50 [6.096069] Modules linked in: nd_pmem nd_btt dax_pmem device_dax nfit libnvdimm [6.097179] CR2: ea000580 [6.097764] ---[ end trace a5b8bd6a5500b68c ]--- - Ross
Re: [PATCH] x86/asm/memcpy_mcsafe: Fix copy_to_user_mcsafe() exception handling
On Mon, Jul 02, 2018 at 02:16:10PM -0700, Dan Williams wrote: > All copy_to_user() implementations need to be prepared to handle faults > accessing userspace. The __memcpy_mcsafe() implementation handles both > mmu-faults on the user destination and machine-check-exceptions on the > source buffer. However, the memcpy_mcsafe() wrapper may silently > fallback to memcpy() depending on build options and cpu-capabilities. > > Force copy_to_user_mcsafe() to always use __memcpy_mcsafe() when > available, and otherwise disable all of the copy_to_user_mcsafe() > infrastructure when __memcpy_mcsafe() is not available, i.e. > CONFIG_X86_MCE=n. > > This fixes crashes of the form: > run fstests generic/323 at 2018-07-02 12:46:23 > BUG: unable to handle kernel paging request at 7f0d50001000 > RIP: 0010:__memcpy+0x12/0x20 > [..] > Call Trace: > copyout_mcsafe+0x3a/0x50 > _copy_to_iter_mcsafe+0xa1/0x4a0 > ? dax_alive+0x30/0x50 > dax_iomap_actor+0x1f9/0x280 > ? dax_iomap_rw+0x100/0x100 > iomap_apply+0xba/0x130 > ? dax_iomap_rw+0x100/0x100 > dax_iomap_rw+0x95/0x100 > ? dax_iomap_rw+0x100/0x100 > xfs_file_dax_read+0x7b/0x1d0 [xfs] > xfs_file_read_iter+0xa7/0xc0 [xfs] > aio_read+0x11c/0x1a0 > > Fixes: 8780356ef630 ("x86/asm/memcpy_mcsafe: Define copy_to_iter_mcsafe()") > Cc: Al Viro > Cc: Andrew Morton > Cc: Andy Lutomirski > Cc: Borislav Petkov > Cc: Linus Torvalds > Cc: Peter Zijlstra > Cc: Thomas Gleixner > Cc: Tony Luck > Reported-by: Ross Zwisler > Signed-off-by: Dan Williams Tested-by: Ross Zwisler
Re: [PATCH] lib/iov_iter: Fix pipe handling in _copy_to_iter_mcsafe
On Sun, Jul 01, 2018 at 08:52:20AM -0700, Dan Williams wrote: > By mistake the ITER_PIPE early-exit / warning from copy_from_iter() was > cargo-culted in _copy_to_iter_mcsafe() rather than a machine-check-safe > version of copy_to_iter_pipe(). > > Implement copy_pipe_to_iter_mcsafe() being careful to return the > indication of short copies due to a CPU exception. > > Without this regression-fix all splice reads to dax-mode files fail. > > Fixes: 8780356ef630 ("x86/asm/memcpy_mcsafe: Define copy_to_iter_mcsafe()") > Cc: Al Viro > Cc: Andrew Morton > Cc: Andy Lutomirski > Cc: Borislav Petkov > Cc: Linus Torvalds > Cc: Peter Zijlstra > Cc: Thomas Gleixner > Cc: Tony Luck > Reported-by: Ross Zwisler > Signed-off-by: Dan Williams Tested-by: Ross Zwisler
Re: [PATCH] lib/iov_iter: Fix pipe handling in _copy_to_iter_mcsafe
On Sun, Jul 01, 2018 at 08:52:20AM -0700, Dan Williams wrote: > By mistake the ITER_PIPE early-exit / warning from copy_from_iter() was > cargo-culted in _copy_to_iter_mcsafe() rather than a machine-check-safe > version of copy_to_iter_pipe(). > > Implement copy_pipe_to_iter_mcsafe() being careful to return the > indication of short copies due to a CPU exception. > > Without this regression-fix all splice reads to dax-mode files fail. > > Fixes: 8780356ef630 ("x86/asm/memcpy_mcsafe: Define copy_to_iter_mcsafe()") > Cc: Al Viro > Cc: Andrew Morton > Cc: Andy Lutomirski > Cc: Borislav Petkov > Cc: Linus Torvalds > Cc: Peter Zijlstra > Cc: Thomas Gleixner > Cc: Tony Luck > Reported-by: Ross Zwisler > Signed-off-by: Dan Williams > --- > Hi Ingo, > > I'm submitting this fix back through the tip tree since the regression > originated through tip/x86/dax. > > lib/iov_iter.c | 37 + > 1 file changed, 33 insertions(+), 4 deletions(-) Hey Dan, I retested the current linux/master with this patch applied, and XFS + DAX + generic/323 still dies for me: run fstests generic/323 at 2018-07-02 10:51:35 BUG: unable to handle kernel paging request at 7f16dc001000 PGD 8000bb71a067 P4D 8000bb71a067 PUD bb71b067 PMD bb6e8067 PTE 0 Oops: 0002 [#1] PREEMPT SMP PTI CPU: 1 PID: 1598 Comm: aio-last-ref-he Not tainted 4.18.0-rc3-1-g5174f2f2b6e5 #2 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.11.1-0-g0551a4be2c-prebuilt.qemu-project.org 04/01/2014 RIP: 0010:__memcpy+0x12/0x20 Code: c3 e8 42 fb ff ff 48 8b 43 60 48 2b 43 50 88 43 4e 5b 5d c3 90 90 90 90 0f 1f 44 00 00 48 89 f8 48 89 d1 48 c1 e9 03 83 e2 07 48 a5 89 d1 f3 a4 c3 66 0f 1f 44 00 00 48 89 f8 48 89 d1 f3 a4 RSP: 0018:c90002783a60 EFLAGS: 00010246 RAX: 7f16dc001000 RBX: 880151229000 RCX: 2000 RDX: RSI: 880151219000 RDI: 7f16dc001000 RBP: c90002783a68 R08: 004227a4083c R09: 880151219000 R10: c90002783d40 R11: R12: R13: 0001 R14: c90002783d18 R15: 0001 FS: 7f16f1ec5700() GS:88011460() knlGS: CS: 0010 DS: ES: CR0: 
80050033 CR2: 7f16dc001000 CR3: 35508000 CR4: 06e0 Call Trace: ? copyout_mcsafe+0x3e/0x60 _copy_to_iter_mcsafe+0x9e/0x4c0 ? __lock_is_held+0x65/0xb0 pmem_copy_to_iter+0x17/0x20 [nd_pmem] dax_copy_to_iter+0x49/0x70 dax_iomap_actor+0x1f8/0x280 ? dax_iomap_rw+0x100/0x100 iomap_apply+0xb5/0x130 ? dax_iomap_rw+0x100/0x100 dax_iomap_rw+0x95/0x100 ? dax_iomap_rw+0x100/0x100 xfs_file_dax_read+0x83/0x1f0 xfs_file_read_iter+0xac/0xc0 aio_read+0x11f/0x1a0 ? __might_fault+0x3e/0x90 io_submit_one+0x39d/0x5f0 ? io_submit_one+0x39d/0x5f0 __x64_sys_io_submit+0xa1/0x280 do_syscall_64+0x65/0x220 ? do_syscall_64+0x65/0x220 entry_SYSCALL_64_after_hwframe+0x49/0xbe This failure looks identical to what I was hitting with the original bug report. - Ross
Re: [PATCH v3 1/3] pmem: only set QUEUE_FLAG_DAX for fsdax mode
On Thu, Jun 28, 2018 at 05:42:34PM +, Kani, Toshi wrote: > On Tue, 2018-06-26 at 16:04 -0600, Ross Zwisler wrote: > > On Tue, Jun 26, 2018 at 02:51:52PM -0700, Dan Williams wrote: > > > On Tue, Jun 26, 2018 at 2:31 PM, Kani, Toshi wrote: > > > > On Tue, 2018-06-26 at 14:28 -0700, Dan Williams wrote: > > > > > On Tue, Jun 26, 2018 at 2:23 PM, Kani, Toshi > > > > > wrote: > > > > > > On Tue, 2018-06-26 at 14:02 -0700, Dan Williams wrote: > > > > > > > On Tue, Jun 26, 2018 at 1:54 PM, Kani, Toshi > > > > > > > wrote: > > > > > > > > > > [..] > > > > > > > > When this dm change was made, the pmem driver supported DAX for > > > > > > > > both raw > > > > > > > > and memory modes (note: sector mode does not use the pmem > > > > > > > > driver). I > > > > > > > > think the issue was introduced when we dropped DAX support from > > > > > > > > raw > > > > > > > > mode. > > > > > > > > > > > > > > Still DAX with raw mode never really worked any way. It was also > > > > > > > something that was broken from day one. So what happens to > > > > > > > someone who > > > > > > > happened to avoid all the problems with page-less DAX and enabled > > > > > > > device-mapper on top? That failure mode detail needs to be added > > > > > > > to > > > > > > > this changelog if we want to propose this for -stable. > > > > > > > > > > > > My point is that the behavior should be consistent between pmem and > > > > > > device-mapper. When -o dax succeeds on a pmem, then it should > > > > > > succeed > > > > > > on a device-mapper on top of that pmem. > > > > > > > > > > > > Has the drop of dax support from raw mode made to -stable back to > > > > > > the > > > > > > baseline accepted 545ed20e6df6? It will introduce inconsistency, > > > > > > otherwise. > > > > > > > > > > That commit, 569d0365f571 "dax: require 'struct page' by default for > > > > > filesystem dax", has not been tagged for -stable. 
> > > > > > > > Then, Fixes tag should be set to 569d0365f571 to keep the behavior > > > > consistent. > > > > > > Sure, and the failure mode is...? I'm thinking the commit log should say: > > > > > > "Starting with commit 569d0365f571 "dax: require 'struct page' by > > > default for filesystem dax", dax is no longer supported for page-less > > > configurations. However, device-mapper sees the QUEUE_FLAG_DAX still > > > being set and falsely assumes that DAX is enabled, this leads to > > > " > > > > Dan is correct that there is no user visible change for this. It is the > > right > > thing to do for consistency and sanity, but it doesn't actually have user > > visible behavior that needs to be backported to stable. > > > > Toshi is correct that this change is only for raw mode namespaces, not btt > > namespaces. > > > > I'll adjust the changelog and remove the stable flag for v5, and I'll add a > > Fixes: tag for patch 2. > > Hi Ross, > > Your patches look good. But I am still not clear about the Fixes & > stable handling. Talking about user visible behavior, I do not think we > had any issue until dax support was dropped from raw mode. Until then, > the pmem driver supported dax for all modes, and the check for > direct_access worked. I agree that the fsdax + raw mode failure mode I mentioned in my cover letter only started when we restricted filesystem DAX to having struct page, but I think that the other failure mode, fsdax + some random block driver (I used brd) was present in DM from the beginning. In any case, I think both are fixed with the patches, and I think it's fine that all 3 get thrown at stable. Thanks, Mike, for the help.
[PATCH v2 2/7] dax: change bdev_dax_supported() to support boolean returns
From: Dave Jiang The function return values are confusing with the way the function is named. We expect a true or false return value but it actually returns 0/-errno. This makes the code very confusing. Changing the return values to return a bool where if DAX is supported then return true and no DAX support returns false. Signed-off-by: Dave Jiang Signed-off-by: Ross Zwisler --- drivers/dax/super.c | 16 fs/ext2/super.c | 3 +-- fs/ext4/super.c | 3 +-- fs/xfs/xfs_ioctl.c | 4 ++-- fs/xfs/xfs_super.c | 12 ++-- include/linux/dax.h | 8 6 files changed, 22 insertions(+), 24 deletions(-) diff --git a/drivers/dax/super.c b/drivers/dax/super.c index 3943feb9a090..1d7bd96511f0 100644 --- a/drivers/dax/super.c +++ b/drivers/dax/super.c @@ -80,9 +80,9 @@ EXPORT_SYMBOL_GPL(fs_dax_get_by_bdev); * This is a library function for filesystems to check if the block device * can be mounted with dax option. * - * Return: negative errno if unsupported, 0 if supported. + * Return: true if supported, false if unsupported */ -int __bdev_dax_supported(struct block_device *bdev, int blocksize) +bool __bdev_dax_supported(struct block_device *bdev, int blocksize) { struct dax_device *dax_dev; pgoff_t pgoff; @@ -95,21 +95,21 @@ int __bdev_dax_supported(struct block_device *bdev, int blocksize) if (blocksize != PAGE_SIZE) { pr_debug("%s: error: unsupported blocksize for dax\n", bdevname(bdev, buf)); - return -EINVAL; + return false; } err = bdev_dax_pgoff(bdev, 0, PAGE_SIZE, &pgoff); if (err) { pr_debug("%s: error: unaligned partition for dax\n", bdevname(bdev, buf)); - return err; + return false; } dax_dev = dax_get_by_host(bdev->bd_disk->disk_name); if (!dax_dev) { pr_debug("%s: error: device does not support dax\n", bdevname(bdev, buf)); - return -EOPNOTSUPP; + return false; } id = dax_read_lock(); @@ -121,7 +121,7 @@ int __bdev_dax_supported(struct block_device *bdev, int blocksize) if (len < 1) { pr_debug("%s: error: dax access failed (%ld)\n", bdevname(bdev, buf), len); - return len < 0 ? 
len : -EIO; + return false; } if (IS_ENABLED(CONFIG_FS_DAX_LIMITED) && pfn_t_special(pfn)) { @@ -139,10 +139,10 @@ int __bdev_dax_supported(struct block_device *bdev, int blocksize) } else { pr_debug("%s: error: dax support not enabled\n", bdevname(bdev, buf)); - return -EOPNOTSUPP; + return false; } - return 0; + return true; } EXPORT_SYMBOL_GPL(__bdev_dax_supported); #endif diff --git a/fs/ext2/super.c b/fs/ext2/super.c index 9627c3054b5c..c09289a42dc5 100644 --- a/fs/ext2/super.c +++ b/fs/ext2/super.c @@ -961,8 +961,7 @@ static int ext2_fill_super(struct super_block *sb, void *data, int silent) blocksize = BLOCK_SIZE << le32_to_cpu(sbi->s_es->s_log_block_size); if (sbi->s_mount_opt & EXT2_MOUNT_DAX) { - err = bdev_dax_supported(sb->s_bdev, blocksize); - if (err) { + if (!bdev_dax_supported(sb->s_bdev, blocksize)) { ext2_msg(sb, KERN_ERR, "DAX unsupported by block device. Turning off DAX."); sbi->s_mount_opt &= ~EXT2_MOUNT_DAX; diff --git a/fs/ext4/super.c b/fs/ext4/super.c index 089170e99895..2e1622907f4a 100644 --- a/fs/ext4/super.c +++ b/fs/ext4/super.c @@ -3732,8 +3732,7 @@ static int ext4_fill_super(struct super_block *sb, void *data, int silent) " that may contain inline data"); sbi->s_mount_opt &= ~EXT4_MOUNT_DAX; } - err = bdev_dax_supported(sb->s_bdev, blocksize); - if (err) { + if (!bdev_dax_supported(sb->s_bdev, blocksize)) { ext4_msg(sb, KERN_ERR, "DAX unsupported by block device. Turning off DAX."); sbi->s_mount_opt &= ~EXT4_MOUNT_DAX; diff --git a/fs/xfs/xfs_ioctl.c b/fs/xfs/xfs_ioctl.c index 0effd46b965f..2c70a0a4f59f 100644 --- a/fs/xfs/xfs_ioctl.c +++ b/fs/xfs/xfs_ioctl.c @@ -1103,8 +1103,8 @@ xfs_ioctl_setattr_dax_invalidate( if (fa->fsx_xflags & FS_XFLAG_DAX) { if (!(S_ISREG(inode->i_mode) || S_ISDIR(inode->i_mode))) return -EINVAL; - if (bdev_dax_sup
[PATCH resend 1/7] fs: allow per-device dax status checking for filesystems
From: "Darrick J. Wong" Remove __bdev_dax_supported and change to bdev_dax_supported that takes a bdev parameter. This enables multi-device filesystems like xfs to check that a dax device can work for the particular filesystem. Once that's in place, actually fix all the parts of XFS where we need to be able to distinguish between datadev and rtdev. This patch fixes the problem where we screw up the dax support checking in xfs if the datadev and rtdev have different dax capabilities. Signed-off-by: Darrick J. Wong Signed-off-by: Ross Zwisler --- drivers/dax/super.c | 30 +++--- fs/ext2/super.c | 2 +- fs/ext4/super.c | 2 +- fs/xfs/xfs_ioctl.c | 3 ++- fs/xfs/xfs_iops.c | 30 +- fs/xfs/xfs_super.c | 10 -- include/linux/dax.h | 10 +++--- 7 files changed, 55 insertions(+), 32 deletions(-) diff --git a/drivers/dax/super.c b/drivers/dax/super.c index 2b2332b605e4..9206539c8330 100644 --- a/drivers/dax/super.c +++ b/drivers/dax/super.c @@ -73,8 +73,8 @@ EXPORT_SYMBOL_GPL(fs_dax_get_by_bdev); #endif /** - * __bdev_dax_supported() - Check if the device supports dax for filesystem - * @sb: The superblock of the device + * bdev_dax_supported() - Check if the device supports dax for filesystem + * @bdev: block device to check * @blocksize: The block size of the device * * This is a library function for filesystems to check if the block device @@ -82,33 +82,33 @@ EXPORT_SYMBOL_GPL(fs_dax_get_by_bdev); * * Return: negative errno if unsupported, 0 if supported. 
*/ -int __bdev_dax_supported(struct super_block *sb, int blocksize) +int bdev_dax_supported(struct block_device *bdev, int blocksize) { - struct block_device *bdev = sb->s_bdev; struct dax_device *dax_dev; pgoff_t pgoff; int err, id; void *kaddr; pfn_t pfn; long len; + char buf[BDEVNAME_SIZE]; if (blocksize != PAGE_SIZE) { - pr_debug("VFS (%s): error: unsupported blocksize for dax\n", - sb->s_id); + pr_debug("%s: error: unsupported blocksize for dax\n", + bdevname(bdev, buf)); return -EINVAL; } err = bdev_dax_pgoff(bdev, 0, PAGE_SIZE, &pgoff); if (err) { - pr_debug("VFS (%s): error: unaligned partition for dax\n", - sb->s_id); + pr_debug("%s: error: unaligned partition for dax\n", + bdevname(bdev, buf)); return err; } dax_dev = dax_get_by_host(bdev->bd_disk->disk_name); if (!dax_dev) { - pr_debug("VFS (%s): error: device does not support dax\n", - sb->s_id); + pr_debug("%s: error: device does not support dax\n", + bdevname(bdev, buf)); return -EOPNOTSUPP; } @@ -119,8 +119,8 @@ int __bdev_dax_supported(struct super_block *sb, int blocksize) put_dax(dax_dev); if (len < 1) { - pr_debug("VFS (%s): error: dax access failed (%ld)\n", - sb->s_id, len); + pr_debug("%s: error: dax access failed (%ld)\n", + bdevname(bdev, buf), len); return len < 0 ? 
len : -EIO; } @@ -137,14 +137,14 @@ int __bdev_dax_supported(struct super_block *sb, int blocksize) } else if (pfn_t_devmap(pfn)) { /* pass */; } else { - pr_debug("VFS (%s): error: dax support not enabled\n", - sb->s_id); + pr_debug("%s: error: dax support not enabled\n", + bdevname(bdev, buf)); return -EOPNOTSUPP; } return 0; } -EXPORT_SYMBOL_GPL(__bdev_dax_supported); +EXPORT_SYMBOL_GPL(bdev_dax_supported); #endif enum dax_device_flags { diff --git a/fs/ext2/super.c b/fs/ext2/super.c index de1694512f1f..9627c3054b5c 100644 --- a/fs/ext2/super.c +++ b/fs/ext2/super.c @@ -961,7 +961,7 @@ static int ext2_fill_super(struct super_block *sb, void *data, int silent) blocksize = BLOCK_SIZE << le32_to_cpu(sbi->s_es->s_log_block_size); if (sbi->s_mount_opt & EXT2_MOUNT_DAX) { - err = bdev_dax_supported(sb, blocksize); + err = bdev_dax_supported(sb->s_bdev, blocksize); if (err) { ext2_msg(sb, KERN_ERR, "DAX unsupported by block device. Turning off DAX."); diff --git a/fs/ext4/super.c b/fs/ext4/super.c index eb104e8476f0..089170e99895 100644 --- a/fs/ext4/super.c +++ b/fs/ext4/super.c @@ -3732,7 +3732,7
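The per-device check this patch enables can be sketched in a minimal userspace model (not kernel code — `bdev_model`, `fs_can_mount_dax`, and the 4096-byte page size are illustrative stand-ins). The point is that `bdev_dax_supported()` now names the specific block device, so a multi-device filesystem like XFS can probe its datadev and rtdev independently instead of assuming one answer per superblock:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Illustrative stand-in for a block device; dax_capable models a
 * working dax_direct_access() path on that particular device. */
struct bdev_model {
	bool dax_capable;
};

/* models bdev_dax_supported(bdev, blocksize), here with a boolean result */
static bool bdev_dax_supported_model(const struct bdev_model *bdev,
				     int blocksize)
{
	if (blocksize != 4096)	/* DAX requires PAGE_SIZE blocks */
		return false;
	return bdev->dax_capable;
}

/* an XFS-like mount-time decision: every device that will see DAX I/O
 * must pass its own probe, so a DAX datadev paired with a non-DAX rtdev
 * is caught instead of silently half-working */
static bool fs_can_mount_dax(const struct bdev_model *datadev,
			     const struct bdev_model *rtdev)
{
	if (!bdev_dax_supported_model(datadev, 4096))
		return false;
	return rtdev == NULL || bdev_dax_supported_model(rtdev, 4096);
}
```

With the old superblock-based interface only `sb->s_bdev` (the datadev) could be checked, which is exactly the mismatch case the commit message describes.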
[PATCH resend 6/7] dm-snap: remove unnecessary direct_access() stub
This stub was added so that we could use dm-snap with DM_TYPE_DAX_BIO_BASED mode devices. That mode and the transition issues associated with it no longer exist, so we can remove this dead code. Signed-off-by: Ross Zwisler --- drivers/md/dm-snap.c | 8 1 file changed, 8 deletions(-) diff --git a/drivers/md/dm-snap.c b/drivers/md/dm-snap.c index 216035be5661..0143b158d52d 100644 --- a/drivers/md/dm-snap.c +++ b/drivers/md/dm-snap.c @@ -2305,13 +2305,6 @@ static int origin_map(struct dm_target *ti, struct bio *bio) return do_origin(o->dev, bio); } -static long origin_dax_direct_access(struct dm_target *ti, pgoff_t pgoff, - long nr_pages, void **kaddr, pfn_t *pfn) -{ - DMWARN("device does not support dax."); - return -EIO; -} - /* * Set the target "max_io_len" field to the minimum of all the snapshots' * chunk sizes. @@ -2371,7 +2364,6 @@ static struct target_type origin_target = { .postsuspend = origin_postsuspend, .status = origin_status, .iterate_devices = origin_iterate_devices, - .direct_access = origin_dax_direct_access, }; static struct target_type snapshot_target = { -- 2.14.3
[PATCH resend 0/7] Fix DM DAX handling
No code changes from v1. Just CCing the xfs mailing list & adding one Reviewed-by from Darrick. --- This series fixes a few issues that I found with DM's handling of DAX devices: * We can create a dm-stripe or dm-linear device which is made up of an fsdax PMEM namespace and a raw PMEM namespace but which can hold a filesystem mounted with the -o dax mount option. DAX operations to the raw PMEM namespace part lack struct page and can fail in interesting/unexpected ways when doing things like fork(), examining memory with gdb, etc. * We can create a dm-stripe or dm-linear device which is made up of an fsdax PMEM namespace and a BRD ramdisk which can hold a filesystem mounted with the -o dax mount option. All I/O to this filesystem will fail. * In DM you can't transition a dm target which could possibly support DAX (mode DM_TYPE_DAX_BIO_BASED) to one which can't support DAX (mode DM_TYPE_BIO_BASED), even if you never use DAX. The first 2 patches in this series are prep work from Darrick and Dave which improve bdev_dax_supported(). The last 5 patches fix the above-mentioned problems in DM. I feel that this series simplifies the handling of DAX devices in DM, and the last 5 DM-related patches have a net code reduction of 50 lines. Darrick J. 
Wong (1): fs: allow per-device dax status checking for filesystems Dave Jiang (1): dax: change bdev_dax_supported() to support boolean returns Ross Zwisler (5): dm: fix test for DAX device support dm: prevent DAX mounts if not supported dm: remove DM_TYPE_DAX_BIO_BASED dm_queue_mode dm-snap: remove unnecessary direct_access() stub dm-error: remove unnecessary direct_access() stub drivers/dax/super.c | 44 +-- drivers/md/dm-ioctl.c | 16 ++-- drivers/md/dm-snap.c | 8 drivers/md/dm-table.c | 29 +++- drivers/md/dm-target.c| 7 --- drivers/md/dm.c | 7 ++- fs/ext2/super.c | 3 +-- fs/ext4/super.c | 3 +-- fs/xfs/xfs_ioctl.c| 3 ++- fs/xfs/xfs_iops.c | 30 - fs/xfs/xfs_super.c| 10 -- include/linux/dax.h | 12 include/linux/device-mapper.h | 8 ++-- 13 files changed, 88 insertions(+), 92 deletions(-) -- 2.14.3
[PATCH resend 2/7] dax: change bdev_dax_supported() to support boolean returns
From: Dave Jiang The function return values are confusing with the way the function is named. We expect a true or false return value but it actually returns 0/-errno. This makes the code very confusing. Changing the return values to return a bool where if DAX is supported then return true and no DAX support returns false. Signed-off-by: Dave Jiang Signed-off-by: Ross Zwisler Reviewed-by: Darrick J. Wong --- drivers/dax/super.c | 16 fs/ext2/super.c | 3 +-- fs/ext4/super.c | 3 +-- fs/xfs/xfs_ioctl.c | 4 ++-- fs/xfs/xfs_super.c | 12 ++-- include/linux/dax.h | 6 +++--- 6 files changed, 21 insertions(+), 23 deletions(-) diff --git a/drivers/dax/super.c b/drivers/dax/super.c index 9206539c8330..e5447eddecf8 100644 --- a/drivers/dax/super.c +++ b/drivers/dax/super.c @@ -80,9 +80,9 @@ EXPORT_SYMBOL_GPL(fs_dax_get_by_bdev); * This is a library function for filesystems to check if the block device * can be mounted with dax option. * - * Return: negative errno if unsupported, 0 if supported. + * Return: true if supported, false if unsupported */ -int bdev_dax_supported(struct block_device *bdev, int blocksize) +bool bdev_dax_supported(struct block_device *bdev, int blocksize) { struct dax_device *dax_dev; pgoff_t pgoff; @@ -95,21 +95,21 @@ int bdev_dax_supported(struct block_device *bdev, int blocksize) if (blocksize != PAGE_SIZE) { pr_debug("%s: error: unsupported blocksize for dax\n", bdevname(bdev, buf)); - return -EINVAL; + return false; } err = bdev_dax_pgoff(bdev, 0, PAGE_SIZE, &pgoff); if (err) { pr_debug("%s: error: unaligned partition for dax\n", bdevname(bdev, buf)); - return err; + return false; } dax_dev = dax_get_by_host(bdev->bd_disk->disk_name); if (!dax_dev) { pr_debug("%s: error: device does not support dax\n", bdevname(bdev, buf)); - return -EOPNOTSUPP; + return false; } id = dax_read_lock(); @@ -121,7 +121,7 @@ int bdev_dax_supported(struct block_device *bdev, int blocksize) if (len < 1) { pr_debug("%s: error: dax access failed (%ld)\n", bdevname(bdev, 
buf), len); - return len < 0 ? len : -EIO; + return false; } if (IS_ENABLED(CONFIG_FS_DAX_LIMITED) && pfn_t_special(pfn)) { @@ -139,10 +139,10 @@ int bdev_dax_supported(struct block_device *bdev, int blocksize) } else { pr_debug("%s: error: dax support not enabled\n", bdevname(bdev, buf)); - return -EOPNOTSUPP; + return false; } - return 0; + return true; } EXPORT_SYMBOL_GPL(bdev_dax_supported); #endif diff --git a/fs/ext2/super.c b/fs/ext2/super.c index 9627c3054b5c..c09289a42dc5 100644 --- a/fs/ext2/super.c +++ b/fs/ext2/super.c @@ -961,8 +961,7 @@ static int ext2_fill_super(struct super_block *sb, void *data, int silent) blocksize = BLOCK_SIZE << le32_to_cpu(sbi->s_es->s_log_block_size); if (sbi->s_mount_opt & EXT2_MOUNT_DAX) { - err = bdev_dax_supported(sb->s_bdev, blocksize); - if (err) { + if (!bdev_dax_supported(sb->s_bdev, blocksize)) { ext2_msg(sb, KERN_ERR, "DAX unsupported by block device. Turning off DAX."); sbi->s_mount_opt &= ~EXT2_MOUNT_DAX; diff --git a/fs/ext4/super.c b/fs/ext4/super.c index 089170e99895..2e1622907f4a 100644 --- a/fs/ext4/super.c +++ b/fs/ext4/super.c @@ -3732,8 +3732,7 @@ static int ext4_fill_super(struct super_block *sb, void *data, int silent) " that may contain inline data"); sbi->s_mount_opt &= ~EXT4_MOUNT_DAX; } - err = bdev_dax_supported(sb->s_bdev, blocksize); - if (err) { + if (!bdev_dax_supported(sb->s_bdev, blocksize)) { ext4_msg(sb, KERN_ERR, "DAX unsupported by block device. Turning off DAX."); sbi->s_mount_opt &= ~EXT4_MOUNT_DAX; diff --git a/fs/xfs/xfs_ioctl.c b/fs/xfs/xfs_ioctl.c index 0effd46b965f..2c70a0a4f59f 100644 --- a/fs/xfs/xfs_ioctl.c +++ b/fs/xfs/xfs_ioctl.c @@ -1103,8 +1103,8 @@ xfs_ioctl_setattr_dax_invalidate( if (fa->fsx_xflags & FS_XFLAG_DAX) { if (!(S_ISREG(inode->i_mode) || S_ISDIR(inode->i_mode))) return -EINVAL; - if (bdev_dax_
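The confusion this patch removes can be shown with a small userspace model (illustrative only — `dax_supported_old`/`dax_supported_new` are hypothetical names, not the kernel functions). Under the old 0/-errno convention, "supported" is the falsy value 0, so reading the function name as a predicate inverts its meaning; the bool convention makes call sites like `if (!bdev_dax_supported(...))` read naturally:

```c
#include <assert.h>
#include <stdbool.h>

#define MODEL_EOPNOTSUPP 95	/* stand-in errno value for the sketch */

/* old convention: 0 on success, negative errno on failure */
static int dax_supported_old(bool supported)
{
	return supported ? 0 : -MODEL_EOPNOTSUPP;
}

/* new convention: a plain predicate, true means DAX works */
static bool dax_supported_new(bool supported)
{
	return supported;
}
```

This is why the ext2/ext4/xfs call sites in the diff change from `err = bdev_dax_supported(...); if (err)` to the direct `if (!bdev_dax_supported(...))` form.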
[PATCH resend 7/7] dm-error: remove unnecessary direct_access() stub
This stub was added so that we could use dm-error with DM_TYPE_DAX_BIO_BASED mode devices. That mode and the transition issues associated with it no longer exist, so we can remove this dead code. Signed-off-by: Ross Zwisler --- drivers/md/dm-target.c | 7 --- 1 file changed, 7 deletions(-) diff --git a/drivers/md/dm-target.c b/drivers/md/dm-target.c index 314d17ca6466..c4dbc15f7862 100644 --- a/drivers/md/dm-target.c +++ b/drivers/md/dm-target.c @@ -140,12 +140,6 @@ static void io_err_release_clone_rq(struct request *clone) { } -static long io_err_dax_direct_access(struct dm_target *ti, pgoff_t pgoff, - long nr_pages, void **kaddr, pfn_t *pfn) -{ - return -EIO; -} - static struct target_type error_target = { .name = "error", .version = {1, 5, 0}, @@ -155,7 +149,6 @@ static struct target_type error_target = { .map = io_err_map, .clone_and_map_rq = io_err_clone_and_map_rq, .release_clone_rq = io_err_release_clone_rq, - .direct_access = io_err_dax_direct_access, }; int __init dm_target_init(void) -- 2.14.3
[PATCH resend 3/7] dm: fix test for DAX device support
Currently device_supports_dax() just checks to see if the QUEUE_FLAG_DAX flag is set on the device's request queue to decide whether or not the device supports filesystem DAX. This is insufficient because there are devices like PMEM namespaces in raw mode which have QUEUE_FLAG_DAX set but which don't actually support DAX. This means that you could create a dm-linear device, for example, where the first part of the dm-linear device was a PMEM namespace in fsdax mode and the second part was a PMEM namespace in raw mode. Both DM and the filesystem you put on that dm-linear device would think the whole device supports DAX, which would lead to bad behavior once your raw PMEM namespace part using DAX needed struct page for something. Fix this by using bdev_dax_supported() like filesystems do at mount time. This checks for raw mode and also performs other tests like checking to make sure the dax_direct_access() path works. Signed-off-by: Ross Zwisler Fixes: commit 545ed20e6df6 ("dm: add infrastructure for DAX support") --- drivers/md/dm-table.c | 4 +--- 1 file changed, 1 insertion(+), 3 deletions(-) diff --git a/drivers/md/dm-table.c b/drivers/md/dm-table.c index 0589a4da12bb..5bb994b012ca 100644 --- a/drivers/md/dm-table.c +++ b/drivers/md/dm-table.c @@ -885,9 +885,7 @@ EXPORT_SYMBOL_GPL(dm_table_set_type); static int device_supports_dax(struct dm_target *ti, struct dm_dev *dev, sector_t start, sector_t len, void *data) { - struct request_queue *q = bdev_get_queue(dev->bdev); - - return q && blk_queue_dax(q); + return bdev_dax_supported(dev->bdev, PAGE_SIZE); } static bool dm_table_supports_dax(struct dm_table *t) -- 2.14.3
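The difference between the old flag-only check and the fixed probe-based check can be sketched as a userspace model (hypothetical names — `member_dev`, `queue_flag_dax`, and `probe_ok` are illustrative stand-ins for the request-queue flag and the full `bdev_dax_supported()` probe):

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* one member device of a dm table */
struct member_dev {
	bool queue_flag_dax;	/* QUEUE_FLAG_DAX: raw-mode PMEM sets this too */
	bool probe_ok;		/* result of a full bdev_dax_supported()-style probe */
};

/* old check: trust QUEUE_FLAG_DAX on every member */
static bool table_supports_dax_old(const struct member_dev *devs, size_t n)
{
	for (size_t i = 0; i < n; i++)
		if (!devs[i].queue_flag_dax)
			return false;
	return n > 0;
}

/* fixed check: run the full probe on every member, catching devices
 * that advertise the flag but cannot actually do DAX (raw mode) */
static bool table_supports_dax_new(const struct member_dev *devs, size_t n)
{
	for (size_t i = 0; i < n; i++)
		if (!devs[i].probe_ok)
			return false;
	return n > 0;
}
```

A dm-linear table of one fsdax namespace plus one raw-mode namespace fools the old check but is correctly rejected by the new one.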
[PATCH resend 5/7] dm: remove DM_TYPE_DAX_BIO_BASED dm_queue_mode
The DM_TYPE_DAX_BIO_BASED dm_queue_mode was introduced to prevent DM devices that could possibly support DAX from transitioning into DM devices that cannot support DAX. For example, the following transition will currently fail: dm-linear: [fsdax pmem][fsdax pmem] => [fsdax pmem][fsdax raw] DM_TYPE_DAX_BIO_BASED DM_TYPE_BIO_BASED but these will both succeed: dm-linear: [fsdax pmem][brd ramdisk] => [fsdax pmem][fsdax raw] DM_TYPE_DAX_BIO_BASED DM_TYPE_BIO_BASED dm-linear: [fsdax pmem][fsdax raw] => [fsdax pmem][fsdax pmem] DM_TYPE_BIO_BASED DM_TYPE_DAX_BIO_BASED This seems arbitrary, as really the choice on whether to use DAX happens at filesystem mount time. There's no guarantee that in the first case (double fsdax pmem) we were using the dax mount option with our file system. Instead, get rid of DM_TYPE_DAX_BIO_BASED and all the special casing around it, and make the request queue's QUEUE_FLAG_DAX our one source of truth. If this is set, we can use DAX, and if not, not. We keep this up to date in table_load() as the table changes. As with regular block devices, the filesystem will then know at mount time whether DAX is a supported mount option or not. 
Signed-off-by: Ross Zwisler --- drivers/md/dm-ioctl.c | 16 ++-- drivers/md/dm-table.c | 25 ++--- drivers/md/dm.c | 2 -- include/linux/device-mapper.h | 8 ++-- 4 files changed, 22 insertions(+), 29 deletions(-) diff --git a/drivers/md/dm-ioctl.c b/drivers/md/dm-ioctl.c index 5acf77de5945..d1f86d0bb2d0 100644 --- a/drivers/md/dm-ioctl.c +++ b/drivers/md/dm-ioctl.c @@ -1292,15 +1292,6 @@ static int populate_table(struct dm_table *table, return dm_table_complete(table); } -static bool is_valid_type(enum dm_queue_mode cur, enum dm_queue_mode new) -{ - if (cur == new || - (cur == DM_TYPE_BIO_BASED && new == DM_TYPE_DAX_BIO_BASED)) - return true; - - return false; -} - static int table_load(struct file *filp, struct dm_ioctl *param, size_t param_size) { int r; @@ -1343,12 +1334,17 @@ static int table_load(struct file *filp, struct dm_ioctl *param, size_t param_si DMWARN("unable to set up device queue for new table."); goto err_unlock_md_type; } - } else if (!is_valid_type(dm_get_md_type(md), dm_table_get_type(t))) { + } else if (dm_get_md_type(md) != dm_table_get_type(t)) { DMWARN("can't change device type after initial table load."); r = -EINVAL; goto err_unlock_md_type; } + if (dm_table_supports_dax(t)) + blk_queue_flag_set(QUEUE_FLAG_DAX, md->queue); + else + blk_queue_flag_clear(QUEUE_FLAG_DAX, md->queue); + dm_unlock_md_type(md); /* stage inactive table */ diff --git a/drivers/md/dm-table.c b/drivers/md/dm-table.c index 5bb994b012ca..ea5c4a1e6f5b 100644 --- a/drivers/md/dm-table.c +++ b/drivers/md/dm-table.c @@ -866,7 +866,6 @@ EXPORT_SYMBOL(dm_consume_args); static bool __table_type_bio_based(enum dm_queue_mode table_type) { return (table_type == DM_TYPE_BIO_BASED || - table_type == DM_TYPE_DAX_BIO_BASED || table_type == DM_TYPE_NVME_BIO_BASED); } @@ -888,7 +887,7 @@ static int device_supports_dax(struct dm_target *ti, struct dm_dev *dev, return bdev_dax_supported(dev->bdev, PAGE_SIZE); } -static bool dm_table_supports_dax(struct dm_table *t) +bool 
dm_table_supports_dax(struct dm_table *t) { struct dm_target *ti; unsigned i; @@ -907,6 +906,7 @@ static bool dm_table_supports_dax(struct dm_table *t) return true; } +EXPORT_SYMBOL_GPL(dm_table_supports_dax); static bool dm_table_does_not_support_partial_completion(struct dm_table *t); @@ -944,7 +944,6 @@ static int dm_table_determine_type(struct dm_table *t) /* possibly upgrade to a variant of bio-based */ goto verify_bio_based; } - BUG_ON(t->type == DM_TYPE_DAX_BIO_BASED); BUG_ON(t->type == DM_TYPE_NVME_BIO_BASED); goto verify_rq_based; } @@ -981,18 +980,14 @@ static int dm_table_determine_type(struct dm_table *t) verify_bio_based: /* We must use this table as bio-based */ t->type = DM_TYPE_BIO_BASED; - if (dm_table_supports_dax(t) || - (list_empty(devices) && live_md_type == DM_TYPE_DAX_BIO_BASED)) { - t->type = DM_TYPE_DAX_BIO_BASED; - } else { - /* Check if upgrading to NVMe bio-based is valid or required */ - tgt = dm_table_get_immutable_target(t); -
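The new behaviour described above can be modelled in a short userspace sketch (illustrative names only — `mapped_dev` and `table_load_model` are stand-ins for the kernel structures): with DM_TYPE_DAX_BIO_BASED gone, a table reload never fails for DAX reasons; `table_load()` simply keeps QUEUE_FLAG_DAX in step with what the currently loaded table supports:

```c
#include <assert.h>
#include <stdbool.h>

/* stand-in for the mapped device and its request queue */
struct mapped_dev {
	bool queue_flag_dax;	/* models QUEUE_FLAG_DAX on md->queue */
};

/* models the blk_queue_flag_set()/blk_queue_flag_clear() pair that the
 * patch adds to table_load(): the flag tracks the table, nothing more */
static int table_load_model(struct mapped_dev *md, bool new_table_supports_dax)
{
	md->queue_flag_dax = new_table_supports_dax;
	return 0;	/* no DAX-related type-transition failure remains */
}
```

So the [fsdax pmem][fsdax pmem] => [fsdax pmem][fsdax raw] transition that used to fail now succeeds and just clears the flag, and the filesystem discovers the change at its next mount.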
[PATCH resend 4/7] dm: prevent DAX mounts if not supported
Currently the code in dm_dax_direct_access() only checks whether the target type has a direct_access() operation defined, not whether the underlying block devices all support DAX. This latter property can be seen by looking at whether we set the QUEUE_FLAG_DAX request queue flag when creating the DM device. This is problematic if we have, for example, a dm-linear device made up of a PMEM namespace in fsdax mode followed by a ramdisk from BRD. QUEUE_FLAG_DAX won't be set on the dm-linear device's request queue, but we have a working direct_access() entry point and the first member of the dm-linear set *does* support DAX. This allows the user to create a filesystem on the dm-linear device, and then mount it with DAX. The filesystem's bdev_dax_supported() test will pass because it'll operate on the first member of the dm-linear device, which happens to be a fsdax PMEM namespace. All DAX I/O will then fail to that dm-linear device because the lack of QUEUE_FLAG_DAX prevents fs_dax_get_by_bdev() from working. This means that the struct dax_device isn't ever set in the filesystem, so dax_direct_access() will always return -EOPNOTSUPP. By failing out of dm_dax_direct_access() if QUEUE_FLAG_DAX isn't set we let the filesystem know we don't support DAX at mount time. The filesystem will then silently fall back and remove the dax mount option, causing it to work properly. 
Signed-off-by: Ross Zwisler Fixes: commit 545ed20e6df6 ("dm: add infrastructure for DAX support") --- drivers/md/dm.c | 5 ++--- 1 file changed, 2 insertions(+), 3 deletions(-) diff --git a/drivers/md/dm.c b/drivers/md/dm.c index 0a7b0107ca78..9728433362d1 100644 --- a/drivers/md/dm.c +++ b/drivers/md/dm.c @@ -1050,14 +1050,13 @@ static long dm_dax_direct_access(struct dax_device *dax_dev, pgoff_t pgoff, if (!ti) goto out; - if (!ti->type->direct_access) + if (!blk_queue_dax(md->queue)) goto out; len = max_io_len(sector, ti) / PAGE_SECTORS; if (len < 1) goto out; nr_pages = min(len, nr_pages); - if (ti->type->direct_access) - ret = ti->type->direct_access(ti, pgoff, nr_pages, kaddr, pfn); + ret = ti->type->direct_access(ti, pgoff, nr_pages, kaddr, pfn); out: dm_put_live_table(md, srcu_idx); -- 2.14.3
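The gate this patch installs can be sketched as a userspace model (hypothetical names and errno value — `dax_md` and `MODEL_EIO` are illustrative, not the kernel's): the call fails up front unless QUEUE_FLAG_DAX is set, and that flag is only set when every member device passed `bdev_dax_supported()`, which is also why the flag check subsumes the removed `ti->type->direct_access` test:

```c
#include <assert.h>
#include <stdbool.h>

#define MODEL_EIO 5	/* stand-in failure value for the sketch */

/* stand-in for the mapped device seen by dm_dax_direct_access() */
struct dax_md {
	bool queue_flag_dax;	/* set only if all members support DAX */
	long member_len;	/* pages the mapped range can supply */
};

/* models the fixed entry point: gate on the queue flag, then defer to
 * the target's direct_access path (modelled here as a length clamp) */
static long dm_dax_direct_access_model(const struct dax_md *md, long nr_pages)
{
	if (!md->queue_flag_dax)	/* was: if (!ti->type->direct_access) */
		return -MODEL_EIO;	/* filesystem drops dax at mount time */
	return nr_pages < md->member_len ? nr_pages : md->member_len;
}
```

A mixed fsdax/BRD device now fails this call during the filesystem's mount-time probe instead of after mount, which is the behaviour change the commit message argues for.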
Re: [PATCH 4/7] dm: prevent DAX mounts if not supported
On Fri, May 25, 2018 at 03:54:10PM -0400, Mike Snitzer wrote: > On Thu, May 24 2018 at 10:55pm -0400, > Ross Zwisler wrote: > > > Currently the code in dm_dax_direct_access() only checks whether the target > > type has a direct_access() operation defined, not whether the underlying > > block devices all support DAX. This latter property can be seen by looking > > at whether we set the QUEUE_FLAG_DAX request queue flag when creating the > > DM device. > > > > This is problematic if we have, for example, a dm-linear device made up of > > a PMEM namespace in fsdax mode followed by a ramdisk from BRD. > > QUEUE_FLAG_DAX won't be set on the dm-linear device's request queue, but > > we have a working direct_access() entry point and the first member of the > > dm-linear set *does* support DAX. > > > > This allows the user to create a filesystem on the dm-linear device, and > > then mount it with DAX. The filesystem's bdev_dax_supported() test will > > pass because it'll operate on the first member of the dm-linear device, > > which happens to be a fsdax PMEM namespace. > > > > All DAX I/O will then fail to that dm-linear device because the lack of > > QUEUE_FLAG_DAX prevents fs_dax_get_by_bdev() from working. This means that > > the struct dax_device isn't ever set in the filesystem, so > > dax_direct_access() will always return -EOPNOTSUPP. > > > > By failing out of dm_dax_direct_access() if QUEUE_FLAG_DAX isn't set we let > > the filesystem know we don't support DAX at mount time. The filesystem > > will then silently fall back and remove the dax mount option, causing it to > > work properly. 
> > > > Signed-off-by: Ross Zwisler > > Fixes: commit 545ed20e6df6 ("dm: add infrastructure for DAX support") > > --- > > drivers/md/dm.c | 5 ++--- > > 1 file changed, 2 insertions(+), 3 deletions(-) > > > > diff --git a/drivers/md/dm.c b/drivers/md/dm.c > > index 0a7b0107ca78..9728433362d1 100644 > > --- a/drivers/md/dm.c > > +++ b/drivers/md/dm.c > > @@ -1050,14 +1050,13 @@ static long dm_dax_direct_access(struct dax_device > > *dax_dev, pgoff_t pgoff, > > > > if (!ti) > > goto out; > > - if (!ti->type->direct_access) > > + if (!blk_queue_dax(md->queue)) > > goto out; > > len = max_io_len(sector, ti) / PAGE_SECTORS; > > if (len < 1) > > goto out; > > nr_pages = min(len, nr_pages); > > - if (ti->type->direct_access) > > - ret = ti->type->direct_access(ti, pgoff, nr_pages, kaddr, pfn); > > + ret = ti->type->direct_access(ti, pgoff, nr_pages, kaddr, pfn); > > So I followed all the rationale for this patch. But the last change > doesn't make any sense. We should still verify that the target has > ti->type->direct_access before calling it. So please reinstate that > check before calling it. You know that type has direct_access() via the blk_queue_dax() check. This tells you not only that the target has direct_access(), but also that you've successfully checked all members of that DM device and they all have working DAX I/O paths, etc. This is all done via the bdev_dax_supported() check and the rest of the code in dm_table_supports_dax() and device_supports_dax(). If this is too subtle I can add a comment or add the check back.
Re: [PATCH 1/7] fs: allow per-device dax status checking for filesystems
On Thu, May 24, 2018 at 10:02:18PM -0700, Darrick J. Wong wrote: > On Thu, May 24, 2018 at 08:55:12PM -0600, Ross Zwisler wrote: > > From: "Darrick J. Wong" > > > > Remove __bdev_dax_supported and change to bdev_dax_supported that takes a > > bdev parameter. This enables multi-device filesystems like xfs to check > > that a dax device can work for the particular filesystem. Once that's > > in place, actually fix all the parts of XFS where we need to be able to > > distinguish between datadev and rtdev. > > > > This patch fixes the problem where we screw up the dax support checking > > in xfs if the datadev and rtdev have different dax capabilities. > > > > Signed-off-by: Darrick J. Wong > > Signed-off-by: Ross Zwisler > > Reviewed-by: Darr...oh, I'm not allowed to do that, am I? > > Would you mind (re)sending this to the xfs list so that someone else can > review it? > > --D Thanks for the review, Darrick. I think at one point Dave said that if you touch more than 1 filesystem with a series you should just CC linux-fsdevel and omit the individual filesystems? I realize that this series only touches ext2 and ext4 a little, but that's what I opted for. Is that sufficient to get to the rest of the XFS developers, or would you like a resend adding linux-xfs?
[PATCH 6/7] dm-snap: remove unnecessary direct_access() stub
This stub was added so that we could use dm-snap with DM_TYPE_DAX_BIO_BASED mode devices. That mode and the transition issues associated with it no longer exist, so we can remove this dead code. Signed-off-by: Ross Zwisler --- drivers/md/dm-snap.c | 8 1 file changed, 8 deletions(-) diff --git a/drivers/md/dm-snap.c b/drivers/md/dm-snap.c index 216035be5661..0143b158d52d 100644 --- a/drivers/md/dm-snap.c +++ b/drivers/md/dm-snap.c @@ -2305,13 +2305,6 @@ static int origin_map(struct dm_target *ti, struct bio *bio) return do_origin(o->dev, bio); } -static long origin_dax_direct_access(struct dm_target *ti, pgoff_t pgoff, - long nr_pages, void **kaddr, pfn_t *pfn) -{ - DMWARN("device does not support dax."); - return -EIO; -} - /* * Set the target "max_io_len" field to the minimum of all the snapshots' * chunk sizes. @@ -2371,7 +2364,6 @@ static struct target_type origin_target = { .postsuspend = origin_postsuspend, .status = origin_status, .iterate_devices = origin_iterate_devices, - .direct_access = origin_dax_direct_access, }; static struct target_type snapshot_target = { -- 2.14.3
[PATCH 1/7] fs: allow per-device dax status checking for filesystems
From: "Darrick J. Wong" Remove __bdev_dax_supported and change to bdev_dax_supported that takes a bdev parameter. This enables multi-device filesystems like xfs to check that a dax device can work for the particular filesystem. Once that's in place, actually fix all the parts of XFS where we need to be able to distinguish between datadev and rtdev. This patch fixes the problem where we screw up the dax support checking in xfs if the datadev and rtdev have different dax capabilities. Signed-off-by: Darrick J. Wong Signed-off-by: Ross Zwisler --- drivers/dax/super.c | 30 +++--- fs/ext2/super.c | 2 +- fs/ext4/super.c | 2 +- fs/xfs/xfs_ioctl.c | 3 ++- fs/xfs/xfs_iops.c | 30 +- fs/xfs/xfs_super.c | 10 -- include/linux/dax.h | 10 +++--- 7 files changed, 55 insertions(+), 32 deletions(-) diff --git a/drivers/dax/super.c b/drivers/dax/super.c index 2b2332b605e4..9206539c8330 100644 --- a/drivers/dax/super.c +++ b/drivers/dax/super.c @@ -73,8 +73,8 @@ EXPORT_SYMBOL_GPL(fs_dax_get_by_bdev); #endif /** - * __bdev_dax_supported() - Check if the device supports dax for filesystem - * @sb: The superblock of the device + * bdev_dax_supported() - Check if the device supports dax for filesystem + * @bdev: block device to check * @blocksize: The block size of the device * * This is a library function for filesystems to check if the block device @@ -82,33 +82,33 @@ EXPORT_SYMBOL_GPL(fs_dax_get_by_bdev); * * Return: negative errno if unsupported, 0 if supported. 
*/ -int __bdev_dax_supported(struct super_block *sb, int blocksize) +int bdev_dax_supported(struct block_device *bdev, int blocksize) { - struct block_device *bdev = sb->s_bdev; struct dax_device *dax_dev; pgoff_t pgoff; int err, id; void *kaddr; pfn_t pfn; long len; + char buf[BDEVNAME_SIZE]; if (blocksize != PAGE_SIZE) { - pr_debug("VFS (%s): error: unsupported blocksize for dax\n", - sb->s_id); + pr_debug("%s: error: unsupported blocksize for dax\n", + bdevname(bdev, buf)); return -EINVAL; } err = bdev_dax_pgoff(bdev, 0, PAGE_SIZE, &pgoff); if (err) { - pr_debug("VFS (%s): error: unaligned partition for dax\n", - sb->s_id); + pr_debug("%s: error: unaligned partition for dax\n", + bdevname(bdev, buf)); return err; } dax_dev = dax_get_by_host(bdev->bd_disk->disk_name); if (!dax_dev) { - pr_debug("VFS (%s): error: device does not support dax\n", - sb->s_id); + pr_debug("%s: error: device does not support dax\n", + bdevname(bdev, buf)); return -EOPNOTSUPP; } @@ -119,8 +119,8 @@ int __bdev_dax_supported(struct super_block *sb, int blocksize) put_dax(dax_dev); if (len < 1) { - pr_debug("VFS (%s): error: dax access failed (%ld)\n", - sb->s_id, len); + pr_debug("%s: error: dax access failed (%ld)\n", + bdevname(bdev, buf), len); return len < 0 ? 
len : -EIO; } @@ -137,14 +137,14 @@ int __bdev_dax_supported(struct super_block *sb, int blocksize) } else if (pfn_t_devmap(pfn)) { /* pass */; } else { - pr_debug("VFS (%s): error: dax support not enabled\n", - sb->s_id); + pr_debug("%s: error: dax support not enabled\n", + bdevname(bdev, buf)); return -EOPNOTSUPP; } return 0; } -EXPORT_SYMBOL_GPL(__bdev_dax_supported); +EXPORT_SYMBOL_GPL(bdev_dax_supported); #endif enum dax_device_flags { diff --git a/fs/ext2/super.c b/fs/ext2/super.c index de1694512f1f..9627c3054b5c 100644 --- a/fs/ext2/super.c +++ b/fs/ext2/super.c @@ -961,7 +961,7 @@ static int ext2_fill_super(struct super_block *sb, void *data, int silent) blocksize = BLOCK_SIZE << le32_to_cpu(sbi->s_es->s_log_block_size); if (sbi->s_mount_opt & EXT2_MOUNT_DAX) { - err = bdev_dax_supported(sb, blocksize); + err = bdev_dax_supported(sb->s_bdev, blocksize); if (err) { ext2_msg(sb, KERN_ERR, "DAX unsupported by block device. Turning off DAX."); diff --git a/fs/ext4/super.c b/fs/ext4/super.c index eb104e8476f0..089170e99895 100644 --- a/fs/ext4/super.c +++ b/fs/ext4/super.c @@ -3732,7 +3732,7
[PATCH 4/7] dm: prevent DAX mounts if not supported
Currently the code in dm_dax_direct_access() only checks whether the target type has a direct_access() operation defined, not whether the underlying block devices all support DAX. This latter property can be seen by looking at whether we set the QUEUE_FLAG_DAX request queue flag when creating the DM device. This is problematic if we have, for example, a dm-linear device made up of a PMEM namespace in fsdax mode followed by a ramdisk from BRD. QUEUE_FLAG_DAX won't be set on the dm-linear device's request queue, but we have a working direct_access() entry point and the first member of the dm-linear set *does* support DAX. This allows the user to create a filesystem on the dm-linear device, and then mount it with DAX. The filesystem's bdev_dax_supported() test will pass because it'll operate on the first member of the dm-linear device, which happens to be a fsdax PMEM namespace. All DAX I/O will then fail to that dm-linear device because the lack of QUEUE_FLAG_DAX prevents fs_dax_get_by_bdev() from working. This means that the struct dax_device isn't ever set in the filesystem, so dax_direct_access() will always return -EOPNOTSUPP. By failing out of dm_dax_direct_access() if QUEUE_FLAG_DAX isn't set we let the filesystem know we don't support DAX at mount time. The filesystem will then silently fall back and remove the dax mount option, causing it to work properly. 
Signed-off-by: Ross Zwisler Fixes: commit 545ed20e6df6 ("dm: add infrastructure for DAX support") --- drivers/md/dm.c | 5 ++--- 1 file changed, 2 insertions(+), 3 deletions(-) diff --git a/drivers/md/dm.c b/drivers/md/dm.c index 0a7b0107ca78..9728433362d1 100644 --- a/drivers/md/dm.c +++ b/drivers/md/dm.c @@ -1050,14 +1050,13 @@ static long dm_dax_direct_access(struct dax_device *dax_dev, pgoff_t pgoff, if (!ti) goto out; - if (!ti->type->direct_access) + if (!blk_queue_dax(md->queue)) goto out; len = max_io_len(sector, ti) / PAGE_SECTORS; if (len < 1) goto out; nr_pages = min(len, nr_pages); - if (ti->type->direct_access) - ret = ti->type->direct_access(ti, pgoff, nr_pages, kaddr, pfn); + ret = ti->type->direct_access(ti, pgoff, nr_pages, kaddr, pfn); out: dm_put_live_table(md, srcu_idx); -- 2.14.3
[PATCH 2/7] dax: change bdev_dax_supported() to support boolean returns
From: Dave Jiang

The function's return values are confusing given the way it is named: we
expect a true or false return value, but it actually returns 0/-errno.
Change the return value to a bool: true if DAX is supported, false if it
is not.

Signed-off-by: Dave Jiang
Signed-off-by: Ross Zwisler
---
 drivers/dax/super.c | 16
 fs/ext2/super.c     |  3 +--
 fs/ext4/super.c     |  3 +--
 fs/xfs/xfs_ioctl.c  |  4 ++--
 fs/xfs/xfs_super.c  | 12 ++--
 include/linux/dax.h |  6 +++---
 6 files changed, 21 insertions(+), 23 deletions(-)

diff --git a/drivers/dax/super.c b/drivers/dax/super.c
index 9206539c8330..e5447eddecf8 100644
--- a/drivers/dax/super.c
+++ b/drivers/dax/super.c
@@ -80,9 +80,9 @@ EXPORT_SYMBOL_GPL(fs_dax_get_by_bdev);
  * This is a library function for filesystems to check if the block device
  * can be mounted with dax option.
  *
- * Return: negative errno if unsupported, 0 if supported.
+ * Return: true if supported, false if unsupported
  */
-int bdev_dax_supported(struct block_device *bdev, int blocksize)
+bool bdev_dax_supported(struct block_device *bdev, int blocksize)
 {
        struct dax_device *dax_dev;
        pgoff_t pgoff;
@@ -95,21 +95,21 @@ int bdev_dax_supported(struct block_device *bdev, int blocksize)
        if (blocksize != PAGE_SIZE) {
                pr_debug("%s: error: unsupported blocksize for dax\n",
                                bdevname(bdev, buf));
-               return -EINVAL;
+               return false;
        }

        err = bdev_dax_pgoff(bdev, 0, PAGE_SIZE, &pgoff);
        if (err) {
                pr_debug("%s: error: unaligned partition for dax\n",
                                bdevname(bdev, buf));
-               return err;
+               return false;
        }

        dax_dev = dax_get_by_host(bdev->bd_disk->disk_name);
        if (!dax_dev) {
                pr_debug("%s: error: device does not support dax\n",
                                bdevname(bdev, buf));
-               return -EOPNOTSUPP;
+               return false;
        }

        id = dax_read_lock();
@@ -121,7 +121,7 @@ int bdev_dax_supported(struct block_device *bdev, int blocksize)
        if (len < 1) {
                pr_debug("%s: error: dax access failed (%ld)\n",
                                bdevname(bdev, buf), len);
-               return len < 0 ? len : -EIO;
+               return false;
        }

        if (IS_ENABLED(CONFIG_FS_DAX_LIMITED) && pfn_t_special(pfn)) {
@@ -139,10 +139,10 @@ int bdev_dax_supported(struct block_device *bdev, int blocksize)
        } else {
                pr_debug("%s: error: dax support not enabled\n",
                                bdevname(bdev, buf));
-               return -EOPNOTSUPP;
+               return false;
        }

-       return 0;
+       return true;
 }
 EXPORT_SYMBOL_GPL(bdev_dax_supported);
 #endif
diff --git a/fs/ext2/super.c b/fs/ext2/super.c
index 9627c3054b5c..c09289a42dc5 100644
--- a/fs/ext2/super.c
+++ b/fs/ext2/super.c
@@ -961,8 +961,7 @@ static int ext2_fill_super(struct super_block *sb, void *data, int silent)
        blocksize = BLOCK_SIZE << le32_to_cpu(sbi->s_es->s_log_block_size);

        if (sbi->s_mount_opt & EXT2_MOUNT_DAX) {
-               err = bdev_dax_supported(sb->s_bdev, blocksize);
-               if (err) {
+               if (!bdev_dax_supported(sb->s_bdev, blocksize)) {
                        ext2_msg(sb, KERN_ERR,
                                "DAX unsupported by block device. Turning off DAX.");
                        sbi->s_mount_opt &= ~EXT2_MOUNT_DAX;
diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index 089170e99895..2e1622907f4a 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -3732,8 +3732,7 @@ static int ext4_fill_super(struct super_block *sb, void *data, int silent)
                                " that may contain inline data");
                        sbi->s_mount_opt &= ~EXT4_MOUNT_DAX;
                }
-               err = bdev_dax_supported(sb->s_bdev, blocksize);
-               if (err) {
+               if (!bdev_dax_supported(sb->s_bdev, blocksize)) {
                        ext4_msg(sb, KERN_ERR,
                                "DAX unsupported by block device. Turning off DAX.");
                        sbi->s_mount_opt &= ~EXT4_MOUNT_DAX;
diff --git a/fs/xfs/xfs_ioctl.c b/fs/xfs/xfs_ioctl.c
index 0effd46b965f..2c70a0a4f59f 100644
--- a/fs/xfs/xfs_ioctl.c
+++ b/fs/xfs/xfs_ioctl.c
@@ -1103,8 +1103,8 @@ xfs_ioctl_setattr_dax_invalidate(
        if (fa->fsx_xflags & FS_XFLAG_DAX) {
                if (!(S_ISREG(inode->i_mode) || S_ISDIR(inode->i_mode)))
                        return -EINVAL;
-               if (bdev_dax_sup
[PATCH 5/7] dm: remove DM_TYPE_DAX_BIO_BASED dm_queue_mode
The DM_TYPE_DAX_BIO_BASED dm_queue_mode was introduced to prevent DM devices that could possibly support DAX from transitioning into DM devices that cannot support DAX. For example, the following transition will currently fail:

 dm-linear: [fsdax pmem][fsdax pmem]  =>  [fsdax pmem][fsdax raw]
            DM_TYPE_DAX_BIO_BASED         DM_TYPE_BIO_BASED

but these will both succeed:

 dm-linear: [fsdax pmem][brd ramdisk]  =>  [fsdax pmem][fsdax raw]
            DM_TYPE_DAX_BIO_BASED          DM_TYPE_BIO_BASED

 dm-linear: [fsdax pmem][fsdax raw]  =>  [fsdax pmem][fsdax pmem]
            DM_TYPE_BIO_BASED            DM_TYPE_DAX_BIO_BASED

This seems arbitrary, as the choice of whether to use DAX really happens at filesystem mount time. There's no guarantee that in the first case (double fsdax pmem) we were using the dax mount option with our filesystem.

Instead, get rid of DM_TYPE_DAX_BIO_BASED and all the special casing around it, and make the request queue's QUEUE_FLAG_DAX our one source of truth: if it is set, we can use DAX; if not, not. We keep this flag up to date in table_load() as the table changes. As with regular block devices, the filesystem will then know at mount time whether DAX is a supported mount option.
Signed-off-by: Ross Zwisler
---
 drivers/md/dm-ioctl.c         | 16 ++--
 drivers/md/dm-table.c         | 25 ++---
 drivers/md/dm.c               |  2 --
 include/linux/device-mapper.h |  8 ++--
 4 files changed, 22 insertions(+), 29 deletions(-)

diff --git a/drivers/md/dm-ioctl.c b/drivers/md/dm-ioctl.c
index 5acf77de5945..d1f86d0bb2d0 100644
--- a/drivers/md/dm-ioctl.c
+++ b/drivers/md/dm-ioctl.c
@@ -1292,15 +1292,6 @@ static int populate_table(struct dm_table *table,
        return dm_table_complete(table);
 }

-static bool is_valid_type(enum dm_queue_mode cur, enum dm_queue_mode new)
-{
-       if (cur == new ||
-           (cur == DM_TYPE_BIO_BASED && new == DM_TYPE_DAX_BIO_BASED))
-               return true;
-
-       return false;
-}
-
 static int table_load(struct file *filp, struct dm_ioctl *param, size_t param_size)
 {
        int r;
@@ -1343,12 +1334,17 @@ static int table_load(struct file *filp, struct dm_ioctl *param, size_t param_si
                        DMWARN("unable to set up device queue for new table.");
                        goto err_unlock_md_type;
                }
-       } else if (!is_valid_type(dm_get_md_type(md), dm_table_get_type(t))) {
+       } else if (dm_get_md_type(md) != dm_table_get_type(t)) {
                DMWARN("can't change device type after initial table load.");
                r = -EINVAL;
                goto err_unlock_md_type;
        }

+       if (dm_table_supports_dax(t))
+               blk_queue_flag_set(QUEUE_FLAG_DAX, md->queue);
+       else
+               blk_queue_flag_clear(QUEUE_FLAG_DAX, md->queue);
+
        dm_unlock_md_type(md);

        /* stage inactive table */
diff --git a/drivers/md/dm-table.c b/drivers/md/dm-table.c
index 5bb994b012ca..ea5c4a1e6f5b 100644
--- a/drivers/md/dm-table.c
+++ b/drivers/md/dm-table.c
@@ -866,7 +866,6 @@ EXPORT_SYMBOL(dm_consume_args);
 static bool __table_type_bio_based(enum dm_queue_mode table_type)
 {
        return (table_type == DM_TYPE_BIO_BASED ||
-               table_type == DM_TYPE_DAX_BIO_BASED ||
                table_type == DM_TYPE_NVME_BIO_BASED);
 }

@@ -888,7 +887,7 @@ static int device_supports_dax(struct dm_target *ti, struct dm_dev *dev,
        return bdev_dax_supported(dev->bdev, PAGE_SIZE);
 }

-static bool dm_table_supports_dax(struct dm_table *t)
+bool dm_table_supports_dax(struct dm_table *t)
 {
        struct dm_target *ti;
        unsigned i;
@@ -907,6 +906,7 @@ static bool dm_table_supports_dax(struct dm_table *t)
        return true;
 }
+EXPORT_SYMBOL_GPL(dm_table_supports_dax);

 static bool dm_table_does_not_support_partial_completion(struct dm_table *t);

@@ -944,7 +944,6 @@ static int dm_table_determine_type(struct dm_table *t)
                        /* possibly upgrade to a variant of bio-based */
                        goto verify_bio_based;
                }
-               BUG_ON(t->type == DM_TYPE_DAX_BIO_BASED);
                BUG_ON(t->type == DM_TYPE_NVME_BIO_BASED);
                goto verify_rq_based;
        }
@@ -981,18 +980,14 @@ static int dm_table_determine_type(struct dm_table *t)
 verify_bio_based:
                /* We must use this table as bio-based */
                t->type = DM_TYPE_BIO_BASED;
-               if (dm_table_supports_dax(t) ||
-                   (list_empty(devices) && live_md_type == DM_TYPE_DAX_BIO_BASED)) {
-                       t->type = DM_TYPE_DAX_BIO_BASED;
-               } else {
-                       /* Check if upgrading to NVMe bio-based is valid or required */
-                       tgt = dm_table_get_immutable_target(t);
-
[PATCH 3/7] dm: fix test for DAX device support
Currently device_supports_dax() just checks whether the QUEUE_FLAG_DAX flag is set on the device's request queue to decide whether or not the device supports filesystem DAX. This is insufficient because there are devices, like PMEM namespaces in raw mode, which have QUEUE_FLAG_DAX set but which don't actually support DAX.

This means that you could create a dm-linear device, for example, where the first part of the dm-linear device was a PMEM namespace in fsdax mode and the second part was a PMEM namespace in raw mode. Both DM and the filesystem you put on that dm-linear device would think the whole device supports DAX, which would lead to bad behavior once DAX I/O to the raw-mode part needed struct page for something like fork() or examining memory with gdb.

Fix this by using bdev_dax_supported() like filesystems do at mount time. This checks for raw mode and also performs other tests, like checking that the dax_direct_access() path works.

Signed-off-by: Ross Zwisler
Fixes: commit 545ed20e6df6 ("dm: add infrastructure for DAX support")
---
 drivers/md/dm-table.c | 4 +---
 1 file changed, 1 insertion(+), 3 deletions(-)

diff --git a/drivers/md/dm-table.c b/drivers/md/dm-table.c
index 0589a4da12bb..5bb994b012ca 100644
--- a/drivers/md/dm-table.c
+++ b/drivers/md/dm-table.c
@@ -885,9 +885,7 @@ EXPORT_SYMBOL_GPL(dm_table_set_type);
 static int device_supports_dax(struct dm_target *ti, struct dm_dev *dev,
                               sector_t start, sector_t len, void *data)
 {
-       struct request_queue *q = bdev_get_queue(dev->bdev);
-
-       return q && blk_queue_dax(q);
+       return bdev_dax_supported(dev->bdev, PAGE_SIZE);
 }

 static bool dm_table_supports_dax(struct dm_table *t)
--
2.14.3
[PATCH 7/7] dm-error: remove unnecessary direct_access() stub
This stub was added so that we could use dm-error with DM_TYPE_DAX_BIO_BASED mode devices. That mode and the transition issues associated with it no longer exist, so we can remove this dead code.

Signed-off-by: Ross Zwisler
---
 drivers/md/dm-target.c | 7 ---
 1 file changed, 7 deletions(-)

diff --git a/drivers/md/dm-target.c b/drivers/md/dm-target.c
index 314d17ca6466..c4dbc15f7862 100644
--- a/drivers/md/dm-target.c
+++ b/drivers/md/dm-target.c
@@ -140,12 +140,6 @@ static void io_err_release_clone_rq(struct request *clone)
 {
 }

-static long io_err_dax_direct_access(struct dm_target *ti, pgoff_t pgoff,
-               long nr_pages, void **kaddr, pfn_t *pfn)
-{
-       return -EIO;
-}
-
 static struct target_type error_target = {
        .name = "error",
        .version = {1, 5, 0},
@@ -155,7 +149,6 @@ static struct target_type error_target = {
        .map = io_err_map,
        .clone_and_map_rq = io_err_clone_and_map_rq,
        .release_clone_rq = io_err_release_clone_rq,
-       .direct_access = io_err_dax_direct_access,
 };

 int __init dm_target_init(void)
--
2.14.3
[PATCH 0/7] Fix DM DAX handling
This series fixes a few issues that I found with DM's handling of DAX devices. Here are some of the issues I found:

* We can create a dm-stripe or dm-linear device which is made up of an
  fsdax PMEM namespace and a raw PMEM namespace, but which can hold a
  filesystem mounted with the -o dax mount option. DAX operations to the
  raw PMEM namespace part lack struct page and can fail in
  interesting/unexpected ways when doing things like fork(), examining
  memory with gdb, etc.

* We can create a dm-stripe or dm-linear device which is made up of an
  fsdax PMEM namespace and a BRD ramdisk, but which can hold a filesystem
  mounted with the -o dax mount option. All I/O to this filesystem will
  fail.

* In DM you can't transition a dm target which could possibly support
  DAX (mode DM_TYPE_DAX_BIO_BASED) to one which can't support DAX (mode
  DM_TYPE_BIO_BASED), even if you never use DAX.

The first 2 patches in this series are prep work from Darrick and Dave which improve bdev_dax_supported(). The last 5 patches fix the above-mentioned problems in DM.

I feel that this series simplifies the handling of DAX devices in DM, and the last 5 DM-related patches have a net code reduction of 50 lines.

Darrick J. Wong (1):
  fs: allow per-device dax status checking for filesystems

Dave Jiang (1):
  dax: change bdev_dax_supported() to support boolean returns

Ross Zwisler (5):
  dm: fix test for DAX device support
  dm: prevent DAX mounts if not supported
  dm: remove DM_TYPE_DAX_BIO_BASED dm_queue_mode
  dm-snap: remove unnecessary direct_access() stub
  dm-error: remove unnecessary direct_access() stub

 drivers/dax/super.c           | 44 +--
 drivers/md/dm-ioctl.c         | 16 ++--
 drivers/md/dm-snap.c          |  8
 drivers/md/dm-table.c         | 29 +++-
 drivers/md/dm-target.c        |  7 ---
 drivers/md/dm.c               |  7 ++-
 fs/ext2/super.c               |  3 +--
 fs/ext4/super.c               |  3 +--
 fs/xfs/xfs_ioctl.c            |  3 ++-
 fs/xfs/xfs_iops.c             | 30 -
 fs/xfs/xfs_super.c            | 10 --
 include/linux/dax.h           | 12
 include/linux/device-mapper.h |  8 ++--
 13 files changed, 88 insertions(+), 92 deletions(-)

--
2.14.3
Re: [PATCH v3 7/9] dax: report bytes remaining in dax_iomap_actor()
On Wed, May 23, 2018 at 09:53:38AM -0700, Dan Williams wrote: > On Wed, May 23, 2018 at 9:47 AM, Ross Zwisler > wrote: > > On Wed, May 23, 2018 at 09:39:04AM -0700, Dan Williams wrote: > >> On Wed, May 23, 2018 at 9:34 AM, Ross Zwisler > >> wrote: > >> > On Thu, May 03, 2018 at 05:06:42PM -0700, Dan Williams wrote: > >> >> In preparation for protecting the dax read(2) path from media errors > >> >> with copy_to_iter_mcsafe() (via dax_copy_to_iter()), convert the > >> >> implementation to report the bytes successfully transferred. > >> >> > >> >> Cc: > >> >> Cc: Ingo Molnar > >> >> Cc: Borislav Petkov > >> >> Cc: Tony Luck > >> >> Cc: Al Viro > >> >> Cc: Thomas Gleixner > >> >> Cc: Andy Lutomirski > >> >> Cc: Peter Zijlstra > >> >> Cc: Andrew Morton > >> >> Cc: Linus Torvalds > >> >> Signed-off-by: Dan Williams > >> >> --- > >> >> fs/dax.c | 20 +++- > >> >> 1 file changed, 11 insertions(+), 9 deletions(-) > >> >> > >> >> diff --git a/fs/dax.c b/fs/dax.c > >> >> index a64afdf7ec0d..34a2d435ae4b 100644 > >> >> --- a/fs/dax.c > >> >> +++ b/fs/dax.c > >> >> @@ -991,6 +991,7 @@ dax_iomap_actor(struct inode *inode, loff_t pos, > >> >> loff_t length, void *data, > >> >> struct iov_iter *iter = data; > >> >> loff_t end = pos + length, done = 0; > >> >> ssize_t ret = 0; > >> >> + size_t xfer; > >> >> int id; > >> >> > >> >> if (iov_iter_rw(iter) == READ) { > >> >> @@ -1054,19 +1055,20 @@ dax_iomap_actor(struct inode *inode, loff_t > >> >> pos, loff_t length, void *data, > >> >>* vfs_write(), depending on which operation we are doing. > >> >>*/ > >> >> if (iov_iter_rw(iter) == WRITE) > >> >> - map_len = dax_copy_from_iter(dax_dev, pgoff, > >> >> kaddr, > >> >> + xfer = dax_copy_from_iter(dax_dev, pgoff, kaddr, > >> >> map_len, iter); > >> >> else > >> >> - map_len = dax_copy_to_iter(dax_dev, pgoff, kaddr, > >> >> + xfer = dax_copy_to_iter(dax_dev, pgoff, kaddr, > >> >> map_len, iter); > >> >> - if (map_len <= 0) { > >> >> - ret = map_len ? 
map_len : -EFAULT; > >> >> - break; > >> >> - } > >> >> > >> >> - pos += map_len; > >> >> - length -= map_len; > >> >> - done += map_len; > >> >> + pos += xfer; > >> >> + length -= xfer; > >> >> + done += xfer; > >> >> + > >> >> + if (xfer == 0) > >> >> + ret = -EFAULT; > >> >> + if (xfer < map_len) > >> >> + break; > >> > > >> > I'm confused by this error handling. So if we hit an error on a given > >> > iov and > >> > we don't transfer the expected number of bytes, we have two cases: > >> > > >> > 1) We transferred *something* on this iov but not everything - return > >> > success. > >> > 2) We didn't transfer anything on this iov - return -EFAULT. > >> > > >> > Both of these are true regardless of data transferred on previous iovs. > >> > > >> > Why the distinction? If a given iov is interrupted, regardless of > >> > whether it > >> > transferred 0 bytes or 1, shouldn't the error path be the same? > >> > >> This is is the semantics of read(2) / write(2). Quoting the pwrite man > >> page: > >> > >>Note that is not an error for a successful call to > >> transfer fewer bytes than > >>requested (see read(2) and write(2)). > > > > Consider this case: > > > > I have 4 IOVs, each of a full page. The first three transfer their full > > page, > > but on the third we hit an error. > > > > If we transferred 0 bytes in the fourth page, we'll return -EFAULT. > > > > If we transferred 1 byte in the fourth page, we'll return the total length > > transferred, so 3 pages + 1 byte. > > > > Why? pwrite(2) says it returns the number of bytes written, which can be > > less > > than the total requested. Why not just return the length transferred in > > both > > cases, instead of returning -EFAULT for one of them? > > Ah, now I see. Yes, that's a bug. Once we have successfully completed > any iovec we should be returning bytes transferred not -EFAULT. Actually, your code is fine. This is handled by the: return done ? done : ret; at the end of the function. 
So if we've transferred any data at all, we'll return the number of bytes transferred, and if we didn't we'll return -EFAULT because 0 is the special case which means EOF according to pread(2)/pwrite(2). Looks good, then. Thanks for answering my questions. Reviewed-by: Ross Zwisler
Re: [PATCH v3 7/9] dax: report bytes remaining in dax_iomap_actor()
On Wed, May 23, 2018 at 09:39:04AM -0700, Dan Williams wrote: > On Wed, May 23, 2018 at 9:34 AM, Ross Zwisler > wrote: > > On Thu, May 03, 2018 at 05:06:42PM -0700, Dan Williams wrote: > >> In preparation for protecting the dax read(2) path from media errors > >> with copy_to_iter_mcsafe() (via dax_copy_to_iter()), convert the > >> implementation to report the bytes successfully transferred. > >> > >> Cc: > >> Cc: Ingo Molnar > >> Cc: Borislav Petkov > >> Cc: Tony Luck > >> Cc: Al Viro > >> Cc: Thomas Gleixner > >> Cc: Andy Lutomirski > >> Cc: Peter Zijlstra > >> Cc: Andrew Morton > >> Cc: Linus Torvalds > >> Signed-off-by: Dan Williams > >> --- > >> fs/dax.c | 20 +++- > >> 1 file changed, 11 insertions(+), 9 deletions(-) > >> > >> diff --git a/fs/dax.c b/fs/dax.c > >> index a64afdf7ec0d..34a2d435ae4b 100644 > >> --- a/fs/dax.c > >> +++ b/fs/dax.c > >> @@ -991,6 +991,7 @@ dax_iomap_actor(struct inode *inode, loff_t pos, > >> loff_t length, void *data, > >> struct iov_iter *iter = data; > >> loff_t end = pos + length, done = 0; > >> ssize_t ret = 0; > >> + size_t xfer; > >> int id; > >> > >> if (iov_iter_rw(iter) == READ) { > >> @@ -1054,19 +1055,20 @@ dax_iomap_actor(struct inode *inode, loff_t pos, > >> loff_t length, void *data, > >>* vfs_write(), depending on which operation we are doing. > >>*/ > >> if (iov_iter_rw(iter) == WRITE) > >> - map_len = dax_copy_from_iter(dax_dev, pgoff, kaddr, > >> + xfer = dax_copy_from_iter(dax_dev, pgoff, kaddr, > >> map_len, iter); > >> else > >> - map_len = dax_copy_to_iter(dax_dev, pgoff, kaddr, > >> + xfer = dax_copy_to_iter(dax_dev, pgoff, kaddr, > >> map_len, iter); > >> - if (map_len <= 0) { > >> - ret = map_len ? 
map_len : -EFAULT; > >> - break; > >> - } > >> > >> - pos += map_len; > >> - length -= map_len; > >> - done += map_len; > >> + pos += xfer; > >> + length -= xfer; > >> + done += xfer; > >> + > >> + if (xfer == 0) > >> + ret = -EFAULT; > >> + if (xfer < map_len) > >> + break; > > > > I'm confused by this error handling. So if we hit an error on a given iov > > and > > we don't transfer the expected number of bytes, we have two cases: > > > > 1) We transferred *something* on this iov but not everything - return > > success. > > 2) We didn't transfer anything on this iov - return -EFAULT. > > > > Both of these are true regardless of data transferred on previous iovs. > > > > Why the distinction? If a given iov is interrupted, regardless of whether > > it > > transferred 0 bytes or 1, shouldn't the error path be the same? > > This is is the semantics of read(2) / write(2). Quoting the pwrite man page: > >Note that is not an error for a successful call to > transfer fewer bytes than >requested (see read(2) and write(2)). Consider this case: I have 4 IOVs, each of a full page. The first three transfer their full page, but on the third we hit an error. If we transferred 0 bytes in the fourth page, we'll return -EFAULT. If we transferred 1 byte in the fourth page, we'll return the total length transferred, so 3 pages + 1 byte. Why? pwrite(2) says it returns the number of bytes written, which can be less than the total requested. Why not just return the length transferred in both cases, instead of returning -EFAULT for one of them?
Re: [PATCH v3 8/9] pmem: switch to copy_to_iter_mcsafe()
On Thu, May 03, 2018 at 05:06:47PM -0700, Dan Williams wrote: > Use the machine check safe version of copy_to_iter() for the > ->copy_to_iter() operation published by the pmem driver. > > Signed-off-by: Dan Williams Sure. Reviewed-by: Ross Zwisler
Re: [PATCH v3 7/9] dax: report bytes remaining in dax_iomap_actor()
On Thu, May 03, 2018 at 05:06:42PM -0700, Dan Williams wrote: > In preparation for protecting the dax read(2) path from media errors > with copy_to_iter_mcsafe() (via dax_copy_to_iter()), convert the > implementation to report the bytes successfully transferred. > > Cc: > Cc: Ingo Molnar > Cc: Borislav Petkov > Cc: Tony Luck > Cc: Al Viro > Cc: Thomas Gleixner > Cc: Andy Lutomirski > Cc: Peter Zijlstra > Cc: Andrew Morton > Cc: Linus Torvalds > Signed-off-by: Dan Williams > --- > fs/dax.c | 20 +++- > 1 file changed, 11 insertions(+), 9 deletions(-) > > diff --git a/fs/dax.c b/fs/dax.c > index a64afdf7ec0d..34a2d435ae4b 100644 > --- a/fs/dax.c > +++ b/fs/dax.c > @@ -991,6 +991,7 @@ dax_iomap_actor(struct inode *inode, loff_t pos, loff_t > length, void *data, > struct iov_iter *iter = data; > loff_t end = pos + length, done = 0; > ssize_t ret = 0; > + size_t xfer; > int id; > > if (iov_iter_rw(iter) == READ) { > @@ -1054,19 +1055,20 @@ dax_iomap_actor(struct inode *inode, loff_t pos, > loff_t length, void *data, >* vfs_write(), depending on which operation we are doing. >*/ > if (iov_iter_rw(iter) == WRITE) > - map_len = dax_copy_from_iter(dax_dev, pgoff, kaddr, > + xfer = dax_copy_from_iter(dax_dev, pgoff, kaddr, > map_len, iter); > else > - map_len = dax_copy_to_iter(dax_dev, pgoff, kaddr, > + xfer = dax_copy_to_iter(dax_dev, pgoff, kaddr, > map_len, iter); > - if (map_len <= 0) { > - ret = map_len ? map_len : -EFAULT; > - break; > - } > > - pos += map_len; > - length -= map_len; > - done += map_len; > + pos += xfer; > + length -= xfer; > + done += xfer; > + > + if (xfer == 0) > + ret = -EFAULT; > + if (xfer < map_len) > + break; I'm confused by this error handling. So if we hit an error on a given iov and we don't transfer the expected number of bytes, we have two cases: 1) We transferred *something* on this iov but not everything - return success. 2) We didn't transfer anything on this iov - return -EFAULT. 
Both of these are true regardless of data transferred on previous iovs. Why the distinction? If a given iov is interrupted, regardless of whether it transferred 0 bytes or 1, shouldn't the error path be the same? > } > dax_read_unlock(id);
Re: [PATCH v3 6/9] dax: introduce a ->copy_to_iter dax operation
On Thu, May 03, 2018 at 05:06:37PM -0700, Dan Williams wrote: > Similar to the ->copy_from_iter() operation, a platform may want to > deploy an architecture or device specific routine for handling reads > from a dax_device like /dev/pmemX. On x86 this routine will point to a > machine check safe version of copy_to_iter(). For now, add the plumbing > to device-mapper and the dax core. > > Cc: Ross Zwisler > Cc: Mike Snitzer > Cc: Christoph Hellwig > Signed-off-by: Dan Williams Looks good to me. Reviewed-by: Ross Zwisler
Re: [PATCH 0/5] fix radix tree multi-order iteration race
On Thu, May 10, 2018 at 03:48:44PM -0700, Andrew Morton wrote: > so I think I'll just ignore all that and send this series off to Linus next > week. Great. Thank you, Andrew.
Re: [PATCH 5/5] radix tree: fix multi-order iteration race
On Wed, May 09, 2018 at 02:46:11PM +0200, Jan Kara wrote: > On Thu 03-05-18 13:24:30, Ross Zwisler wrote: > > Fix a race in the multi-order iteration code which causes the kernel to hit > > a GP fault. This was first seen with a production v4.15 based kernel > > (4.15.6-300.fc27.x86_64) utilizing a DAX workload which used order 9 PMD > > DAX entries. > > > > The race has to do with how we tear down multi-order sibling entries when > > we are removing an item from the tree. Remember for example that an order > > 2 entry looks like this: > > > > struct radix_tree_node.slots[] = [entry][sibling][sibling][sibling] > > > > where 'entry' is in some slot in the struct radix_tree_node, and the three > > slots following 'entry' contain sibling pointers which point back to > > 'entry.' > > > > When we delete 'entry' from the tree, we call : > > radix_tree_delete() > > radix_tree_delete_item() > > __radix_tree_delete() > > replace_slot() > > > > replace_slot() first removes the siblings in order from the first to the > > last, then at then replaces 'entry' with NULL. This means that for a brief > > period of time we end up with one or more of the siblings removed, so: > > > > struct radix_tree_node.slots[] = [entry][NULL][sibling][sibling] > > > > This causes an issue if you have a reader iterating over the slots in the > > tree via radix_tree_for_each_slot() while only under > > rcu_read_lock()/rcu_read_unlock() protection. This is a common case in > > mm/filemap.c. > > > > The issue is that when __radix_tree_next_slot() => skip_siblings() tries to > > skip over the sibling entries in the slots, it currently does so with an > > exact match on the slot directly preceding our current slot. Normally this > > works: > > V preceding slot > > struct radix_tree_node.slots[] = [entry][sibling][sibling][sibling] > > ^ current slot > > > > This lets you find the first sibling, and you skip them all in order. 
> > > > But in the case where one of the siblings is NULL, that slot is skipped and > > then our sibling detection is interrupted: > > > >V preceding slot > > struct radix_tree_node.slots[] = [entry][NULL][sibling][sibling] > > ^ current slot > > > > This means that the sibling pointers aren't recognized since they point all > > the way back to 'entry', so we think that they are normal internal radix > > tree pointers. This causes us to think we need to walk down to a struct > > radix_tree_node starting at the address of 'entry'. > > > > In a real running kernel this will crash the thread with a GP fault when > > you try and dereference the slots in your broken node starting at 'entry'. > > > > We fix this race by fixing the way that skip_siblings() detects sibling > > nodes. Instead of testing against the preceding slot we instead look for > > siblings via is_sibling_entry() which compares against the position of the > > struct radix_tree_node.slots[] array. This ensures that sibling entries > > are properly identified, even if they are no longer contiguous with the > > 'entry' they point to. > > > > Signed-off-by: Ross Zwisler > > Reported-by: CR, Sapthagirish > > Fixes: commit 148deab223b2 ("radix-tree: improve multiorder iterators") > > Cc: > > Looks good to me. You can add: > > Reviewed-by: Jan Kara Thank you for the review, Jan.
Re: [PATCH 0/5] fix radix tree multi-order iteration race
On Thu, May 03, 2018 at 01:24:25PM -0600, Ross Zwisler wrote: > The following series gets the radix tree test suite compiling again in > the current linux/master, adds a unit test which exposes a race in the > radix tree multi-order iteration code, and then fixes that race. > > This race was initially hit on a v4.15 based kernel and results in a GP > fault. I've described the race in detail in patches 4 and 5. > > The fix is simple and necessary, and I think it should be merged for > v4.17. > > This tree has gotten positive build confirmation from the 0-day bot, > passes the updated radix tree test suite, xfstests, and the original > test that was hitting the race with the v4.15 based kernel. > > Ross Zwisler (5): > radix tree test suite: fix mapshift build target > radix tree test suite: fix compilation issue > radix tree test suite: add item_delete_rcu() > radix tree test suite: multi-order iteration race > radix tree: fix multi-order iteration race > > lib/radix-tree.c | 6 ++-- > tools/include/linux/spinlock.h| 3 +- > tools/testing/radix-tree/Makefile | 6 ++-- > tools/testing/radix-tree/multiorder.c | 63 > +++ > tools/testing/radix-tree/test.c | 19 +++ > tools/testing/radix-tree/test.h | 3 ++ > 6 files changed, 91 insertions(+), 9 deletions(-) > > -- > 2.14.3 > Ping on this series. Does anyone have any feedback? Matthew? I'd really like to get it into v4.17. We can take it through the libnvdimm tree, if that's easiest.
[PATCH 2/5] radix tree test suite: fix compilation issue
Pulled from a patch from Matthew Wilcox entitled "xarray: Add definition of struct xarray":

> From: Matthew Wilcox
> Signed-off-by: Matthew Wilcox

https://patchwork.kernel.org/patch/10341249/

These defines fix this compilation error:

In file included from ./linux/radix-tree.h:6:0,
                 from ./linux/../../../../include/linux/idr.h:15,
                 from ./linux/idr.h:1,
                 from idr.c:4:
./linux/../../../../include/linux/idr.h: In function ‘idr_init_base’:
./linux/../../../../include/linux/radix-tree.h:129:2: warning: implicit declaration of function ‘spin_lock_init’; did you mean ‘spinlock_t’? [-Wimplicit-function-declaration]
  spin_lock_init(&(root)->xa_lock);\
  ^
./linux/../../../../include/linux/idr.h:126:2: note: in expansion of macro ‘INIT_RADIX_TREE’
  INIT_RADIX_TREE(&idr->idr_rt, IDR_RT_MARKER);
  ^~~

by providing a spin_lock_init() wrapper for the v4.17-rc* version of the radix tree test suite.

Signed-off-by: Ross Zwisler
---
 tools/include/linux/spinlock.h | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/tools/include/linux/spinlock.h b/tools/include/linux/spinlock.h
index b21b586b9854..1738c0391da4 100644
--- a/tools/include/linux/spinlock.h
+++ b/tools/include/linux/spinlock.h
@@ -6,8 +6,9 @@
 #include <pthread.h>

 #define spinlock_t pthread_mutex_t
-#define DEFINE_SPINLOCK(x) pthread_mutex_t x = PTHREAD_MUTEX_INITIALIZER;
+#define DEFINE_SPINLOCK(x) pthread_mutex_t x = PTHREAD_MUTEX_INITIALIZER
 #define __SPIN_LOCK_UNLOCKED(x) (pthread_mutex_t)PTHREAD_MUTEX_INITIALIZER
+#define spin_lock_init(x) pthread_mutex_init(x, NULL)

 #define spin_lock_irqsave(x, f) (void)f, pthread_mutex_lock(x)
 #define spin_unlock_irqrestore(x, f) (void)f, pthread_mutex_unlock(x)
--
2.14.3
[PATCH 3/5] radix tree test suite: add item_delete_rcu()
Currently the lifetime of "struct item" entries in the radix tree is not controlled by RCU; instead they are deleted inline as they are removed from the tree.

In the following patches we add a test which has threads iterating over items pulled from the tree and verifying them in an rcu_read_lock()/rcu_read_unlock() section. This means that though an item has been removed from the tree, it could still be being worked on by other threads until the RCU grace period expires. So, we need to actually free the "struct item" structures at the end of the grace period, just as we do with "struct radix_tree_node" items.

Signed-off-by: Ross Zwisler
---
 tools/testing/radix-tree/test.c | 19 +++
 tools/testing/radix-tree/test.h |  2 ++
 2 files changed, 21 insertions(+)

diff --git a/tools/testing/radix-tree/test.c b/tools/testing/radix-tree/test.c
index 5978ab1f403d..def6015570b2 100644
--- a/tools/testing/radix-tree/test.c
+++ b/tools/testing/radix-tree/test.c
@@ -75,6 +75,25 @@ int item_delete(struct radix_tree_root *root, unsigned long index)
        return 0;
 }

+static void item_free_rcu(struct rcu_head *head)
+{
+       struct item *item = container_of(head, struct item, rcu_head);
+
+       free(item);
+}
+
+int item_delete_rcu(struct radix_tree_root *root, unsigned long index)
+{
+       struct item *item = radix_tree_delete(root, index);
+
+       if (item) {
+               item_sanity(item, index);
+               call_rcu(&item->rcu_head, item_free_rcu);
+               return 1;
+       }
+       return 0;
+}
+
 void item_check_present(struct radix_tree_root *root, unsigned long index)
 {
        struct item *item;
diff --git a/tools/testing/radix-tree/test.h b/tools/testing/radix-tree/test.h
index d9c031dbeb1a..8cf4a7a7f94c 100644
--- a/tools/testing/radix-tree/test.h
+++ b/tools/testing/radix-tree/test.h
@@ -5,6 +5,7 @@
 #include

 struct item {
+       struct rcu_head rcu_head;
        unsigned long index;
        unsigned int order;
 };
@@ -15,6 +16,7 @@ int item_insert(struct radix_tree_root *root, unsigned long index);
 int item_insert_order(struct radix_tree_root *root, unsigned long index,
                        unsigned order);
 int item_delete(struct radix_tree_root *root, unsigned long index);
+int item_delete_rcu(struct radix_tree_root *root, unsigned long index);

 struct item *item_lookup(struct radix_tree_root *root, unsigned long index);

 void item_check_present(struct radix_tree_root *root, unsigned long index);
--
2.14.3
[PATCH 5/5] radix tree: fix multi-order iteration race
Fix a race in the multi-order iteration code which causes the kernel to hit a GP fault. This was first seen with a production v4.15 based kernel (4.15.6-300.fc27.x86_64) utilizing a DAX workload which used order 9 PMD DAX entries. The race has to do with how we tear down multi-order sibling entries when we are removing an item from the tree. Remember for example that an order 2 entry looks like this: struct radix_tree_node.slots[] = [entry][sibling][sibling][sibling] where 'entry' is in some slot in the struct radix_tree_node, and the three slots following 'entry' contain sibling pointers which point back to 'entry.' When we delete 'entry' from the tree, we call : radix_tree_delete() radix_tree_delete_item() __radix_tree_delete() replace_slot() replace_slot() first removes the siblings in order from the first to the last, then at then replaces 'entry' with NULL. This means that for a brief period of time we end up with one or more of the siblings removed, so: struct radix_tree_node.slots[] = [entry][NULL][sibling][sibling] This causes an issue if you have a reader iterating over the slots in the tree via radix_tree_for_each_slot() while only under rcu_read_lock()/rcu_read_unlock() protection. This is a common case in mm/filemap.c. The issue is that when __radix_tree_next_slot() => skip_siblings() tries to skip over the sibling entries in the slots, it currently does so with an exact match on the slot directly preceding our current slot. Normally this works: V preceding slot struct radix_tree_node.slots[] = [entry][sibling][sibling][sibling] ^ current slot This lets you find the first sibling, and you skip them all in order. 
But in the case where one of the siblings is NULL, that slot is skipped and then our sibling detection is interrupted:

                                        V preceding slot
struct radix_tree_node.slots[] = [entry][NULL][sibling][sibling]
                                              ^ current slot

This means that the sibling pointers aren't recognized since they point all the way back to 'entry', so we think that they are normal internal radix tree pointers. This causes us to think we need to walk down to a struct radix_tree_node starting at the address of 'entry'. In a real running kernel this will crash the thread with a GP fault when you try to dereference the slots in your broken node starting at 'entry'.

We fix this race by fixing the way that skip_siblings() detects sibling nodes. Instead of testing against the preceding slot, we instead look for siblings via is_sibling_entry(), which compares against the position of the struct radix_tree_node.slots[] array. This ensures that sibling entries are properly identified, even if they are no longer contiguous with the 'entry' they point to.
Signed-off-by: Ross Zwisler
Reported-by: CR, Sapthagirish
Fixes: commit 148deab223b2 ("radix-tree: improve multiorder iterators")
Cc:
---
 lib/radix-tree.c | 6 ++
 1 file changed, 2 insertions(+), 4 deletions(-)

diff --git a/lib/radix-tree.c b/lib/radix-tree.c
index da9e10c827df..43e0cbedc3a0 100644
--- a/lib/radix-tree.c
+++ b/lib/radix-tree.c
@@ -1612,11 +1612,9 @@ static void set_iter_tags(struct radix_tree_iter *iter,
 static void __rcu **skip_siblings(struct radix_tree_node **nodep,
 			void __rcu **slot, struct radix_tree_iter *iter)
 {
-	void *sib = node_to_entry(slot - 1);
-
 	while (iter->index < iter->next_index) {
 		*nodep = rcu_dereference_raw(*slot);
-		if (*nodep && *nodep != sib)
+		if (*nodep && !is_sibling_entry(iter->node, *nodep))
 			return slot;
 		slot++;
 		iter->index = __radix_tree_iter_add(iter, 1);
@@ -1631,7 +1629,7 @@ void __rcu **__radix_tree_next_slot(void __rcu **slot,
 				struct radix_tree_iter *iter, unsigned flags)
 {
 	unsigned tag = flags & RADIX_TREE_ITER_TAG_MASK;
-	struct radix_tree_node *node = rcu_dereference_raw(*slot);
+	struct radix_tree_node *node;
 
 	slot = skip_siblings(&node, slot, iter);
-- 
2.14.3
[PATCH 1/5] radix tree test suite: fix mapshift build target
The following commit:

commit c6ce3e2fe3da ("radix tree test suite: Add config option for map shift")

introduced a phony makefile target called 'mapshift' that ends up generating the file generated/map-shift.h. This phony target was then added as a dependency of the top level 'targets' build target, which is what is run when you go to tools/testing/radix-tree and just type 'make'.

Unfortunately, this phony target doesn't actually work as a dependency, so you end up getting:

$ make
make: *** No rule to make target 'generated/map-shift.h', needed by 'main.o'. Stop.
make: *** Waiting for unfinished jobs

Fix this by making the file generated/map-shift.h our real makefile target, and add it as a dependency of the top level build target.

Signed-off-by: Ross Zwisler
---
 tools/testing/radix-tree/Makefile | 6 ++
 1 file changed, 2 insertions(+), 4 deletions(-)

diff --git a/tools/testing/radix-tree/Makefile b/tools/testing/radix-tree/Makefile
index fa7ee369b3c9..db66f8a0d4be 100644
--- a/tools/testing/radix-tree/Makefile
+++ b/tools/testing/radix-tree/Makefile
@@ -17,7 +17,7 @@ ifeq ($(BUILD), 32)
 LDFLAGS += -m32
 endif
 
-targets: mapshift $(TARGETS)
+targets: generated/map-shift.h $(TARGETS)
 
 main: $(OFILES)
 
@@ -42,9 +42,7 @@ radix-tree.c: ../../../lib/radix-tree.c
 idr.c: ../../../lib/idr.c
 	sed -e 's/^static //' -e 's/__always_inline //' -e 's/inline //' < $< > $@
 
-.PHONY: mapshift
-
-mapshift:
+generated/map-shift.h:
 	@if ! grep -qws $(SHIFT) generated/map-shift.h; then\
 		echo "#define RADIX_TREE_MAP_SHIFT $(SHIFT)" > \
 			generated/map-shift.h; \
-- 
2.14.3
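The failure mode is easy to reproduce outside the kernel tree. The sketch below (assuming GNU make and a POSIX shell; all file names here are made up for the demo) shows that a .PHONY target cannot satisfy a dependency on the file it happens to generate, while a real file target can:

```shell
# Scratch directory; nothing here touches the kernel tree.
set -e
dir=$(mktemp -d)
cd "$dir"

# Broken layout: main.o depends on generated/map-shift.h, but only the
# phony 'mapshift' target knows how to create that file.
printf 'targets: mapshift main.o\nmain.o: generated/map-shift.h\n\t@touch main.o\n.PHONY: mapshift\nmapshift:\n\t@mkdir -p generated; echo shift > generated/map-shift.h\n' > Makefile

# Asking for main.o directly fails: make has no rule for the header file
# and does not know that 'mapshift' would produce it.
if make main.o >/dev/null 2>&1; then
	echo "unexpected success"
else
	echo "broken: no rule to make generated/map-shift.h"
fi

# Fixed layout: the header is a real file target, so the dependency works.
printf 'targets: generated/map-shift.h main.o\nmain.o: generated/map-shift.h\n\t@touch main.o\ngenerated/map-shift.h:\n\t@mkdir -p generated; echo shift > generated/map-shift.h\n' > Makefile
make main.o >/dev/null
echo "fixed: header generated, main.o built"
```

This mirrors the patch above: the fix is not to teach make about the phony target, but to give the generated file its own rule so it can appear in dependency lists.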
[PATCH 4/5] radix tree test suite: multi-order iteration race
Add a test which shows a race in the multi-order iteration code. This test reliably hits the race in under a second on my machine, and is the result of a real bug report against a production v4.15 based kernel (4.15.6-300.fc27.x86_64). With a real kernel this issue is hit when using order 9 PMD DAX radix tree entries.

The race has to do with how we tear down multi-order sibling entries when we are removing an item from the tree. Remember that an order 2 entry looks like this:

struct radix_tree_node.slots[] = [entry][sibling][sibling][sibling]

where 'entry' is in some slot in the struct radix_tree_node, and the three slots following 'entry' contain sibling pointers which point back to 'entry'.

When we delete 'entry' from the tree, we call:

radix_tree_delete()
  radix_tree_delete_item()
    __radix_tree_delete()
      replace_slot()

replace_slot() first removes the siblings in order from the first to the last, and then at the end replaces 'entry' with NULL. This means that for a brief period of time we end up with one or more of the siblings removed, so:

struct radix_tree_node.slots[] = [entry][NULL][sibling][sibling]

This causes an issue if you have a reader iterating over the slots in the tree via radix_tree_for_each_slot() while only under rcu_read_lock()/rcu_read_unlock() protection. This is a common case in mm/filemap.c.

The issue is that when __radix_tree_next_slot() => skip_siblings() tries to skip over the sibling entries in the slots, it currently does so with an exact match on the slot directly preceding our current slot. Normally this works:

                                 V preceding slot
struct radix_tree_node.slots[] = [entry][sibling][sibling][sibling]
                                        ^ current slot

This lets you find the first sibling, and you skip them all in order.
But in the case where one of the siblings is NULL, that slot is skipped and then our sibling detection is interrupted:

                                        V preceding slot
struct radix_tree_node.slots[] = [entry][NULL][sibling][sibling]
                                              ^ current slot

This means that the sibling pointers aren't recognized since they point all the way back to 'entry', so we think that they are normal internal radix tree pointers. This causes us to think we need to walk down to a struct radix_tree_node starting at the address of 'entry'. In a real running kernel this will crash the thread with a GP fault when you try to dereference the slots in your broken node starting at 'entry'.

In the radix tree test suite this will be caught by the address sanitizer:

==27063==ERROR: AddressSanitizer: heap-buffer-overflow on address 0x60c0008ae400 at pc 0x0040ce4f bp 0x7fa89b8fcad0 sp 0x7fa89b8fcac0
READ of size 8 at 0x60c0008ae400 thread T3
    #0 0x40ce4e in __radix_tree_next_slot /home/rzwisler/project/linux/tools/testing/radix-tree/radix-tree.c:1660
    #1 0x4022cc in radix_tree_next_slot linux/../../../../include/linux/radix-tree.h:567
    #2 0x4022cc in iterator_func /home/rzwisler/project/linux/tools/testing/radix-tree/multiorder.c:655
    #3 0x7fa8a088d50a in start_thread (/lib64/libpthread.so.0+0x750a)
    #4 0x7fa8a03bd16e in clone (/lib64/libc.so.6+0xf516e)

Signed-off-by: Ross Zwisler
---
 tools/testing/radix-tree/multiorder.c | 63 +++
 tools/testing/radix-tree/test.h | 1 +
 2 files changed, 64 insertions(+)

diff --git a/tools/testing/radix-tree/multiorder.c b/tools/testing/radix-tree/multiorder.c
index 59245b3d587c..7bf405638b0b 100644
--- a/tools/testing/radix-tree/multiorder.c
+++ b/tools/testing/radix-tree/multiorder.c
@@ -16,6 +16,7 @@
 #include
 #include
 #include
+#include
 
 #include "test.h"
 
@@ -624,6 +625,67 @@ static void multiorder_account(void)
 	item_kill_tree(&tree);
 }
 
+bool stop_iteration = false;
+
+static void *creator_func(void *ptr)
+{
+	/* 'order' is set up to ensure we have sibling entries */
+	unsigned int order = RADIX_TREE_MAP_SHIFT - 1;
+	struct radix_tree_root *tree = ptr;
+	int i;
+
+	for (i = 0; i < 1; i++) {
+		item_insert_order(tree, 0, order);
+		item_delete_rcu(tree, 0);
+	}
+
+	stop_iteration = true;
+	return NULL;
+}
+
+static void *iterator_func(void *ptr)
+{
+	struct radix_tree_root *tree = ptr;
+	struct radix_tree_iter iter;
+	struct item *item;
+	void **slot;
+
+	while (!stop_iteration) {
+		rcu_read_lock();
+		radix_tree_for_each_slot(slot, tree, &iter, 0) {
+			item = radix_tree_deref_slot(slot);
+
+			if (!item)
+				continue;
+			if (radix_tree_deref_retry
[PATCH 0/5] fix radix tree multi-order iteration race
The following series gets the radix tree test suite compiling again in the current linux/master, adds a unit test which exposes a race in the radix tree multi-order iteration code, and then fixes that race. This race was initially hit on a v4.15 based kernel and results in a GP fault. I've described the race in detail in patches 4 and 5.

The fix is simple and necessary, and I think it should be merged for v4.17.

This tree has gotten positive build confirmation from the 0-day bot, passes the updated radix tree test suite, xfstests, and the original test that was hitting the race with the v4.15 based kernel.

Ross Zwisler (5):
  radix tree test suite: fix mapshift build target
  radix tree test suite: fix compilation issue
  radix tree test suite: add item_delete_rcu()
  radix tree test suite: multi-order iteration race
  radix tree: fix multi-order iteration race

 lib/radix-tree.c                      |  6 ++--
 tools/include/linux/spinlock.h        |  3 +-
 tools/testing/radix-tree/Makefile     |  6 ++--
 tools/testing/radix-tree/multiorder.c | 63 +++
 tools/testing/radix-tree/test.c       | 19 +++
 tools/testing/radix-tree/test.h       |  3 ++
 6 files changed, 91 insertions(+), 9 deletions(-)

-- 
2.14.3
Re: [PATCH v6] fs: dax: Adding new return type vm_fault_t
On Tue, Apr 24, 2018 at 10:17:51PM +0530, Souptick Joarder wrote:
> Use new return type vm_fault_t for fault handler. For
> now, this is just documenting that the function returns
> a VM_FAULT value rather than an errno. Once all instances
> are converted, vm_fault_t will become a distinct type.
>
> commit 1c8f422059ae ("mm: change return type to vm_fault_t")
>
> There was an existing bug inside dax_load_hole()
> if vm_insert_mixed had failed to allocate a page table,
> we'd return VM_FAULT_NOPAGE instead of VM_FAULT_OOM.
> With new vmf_insert_mixed() this issue is addressed.
>
> vm_insert_mixed_mkwrite has inefficiency when it returns
> an error value, driver has to convert it to vm_fault_t
> type. With new vmf_insert_mixed_mkwrite() this limitation
> will be addressed.
>
> Signed-off-by: Souptick Joarder
> Reviewed-by: Jan Kara
> Reviewed-by: Matthew Wilcox

Sure, this looks correct. You can add:

Reviewed-by: Ross Zwisler

I noticed that we have the following status translation now in 4 places in 2 files:

	if (err == -ENOMEM)
		return VM_FAULT_OOM;
	if (err < 0 && err != -EBUSY)
		return VM_FAULT_SIGBUS;

	return VM_FAULT_NOPAGE;

This happens in vmf_insert_mixed_mkwrite(), vmf_insert_page(), vmf_insert_mixed() and vmf_insert_pfn(). I think it'd be a good idea to consolidate this translation into an inline helper, in the spirit of dax_fault_return(). This will ensure that if/when we start changing this status translation, we won't accidentally miss some of the places, which would make them get out of sync. No need to fold this into this patch - it should be a separate change.
Re: [PATCH] MAINTAINERS: Adding backup maintainers for libnvdimm and DAX
On Fri, Apr 13, 2018 at 01:47:40PM -0700, Dave Jiang wrote: > Adding additional maintainers to libnvdimm related code and DAX. > > Signed-off-by: Dave Jiang This is fine with me: Acked-by: Ross Zwisler
Re: [PATCH v9 07/61] xarray: Add the xa_lock to the radix_tree_root
On Thu, Apr 12, 2018 at 3:22 PM, Matthew Wilcox wrote: > On Thu, Apr 12, 2018 at 03:16:23PM -0600, Ross Zwisler wrote: >> On Thu, Apr 12, 2018 at 3:10 PM, Matthew Wilcox wrote: >> > On Thu, Apr 12, 2018 at 02:59:32PM -0600, Ross Zwisler wrote: >> >> This is causing build breakage in the radix tree test suite in the >> >> current linux/master: >> >> >> >> ./linux/../../../../include/linux/idr.h: In function ‘idr_init_base’: >> >> ./linux/../../../../include/linux/radix-tree.h:129:2: warning: >> >> implicit declaration of function ‘spin_lock_init’; did you mean >> >> ‘spinlock_t’? [-Wimplicit-function-declaration] >> > >> > Argh. That was added two patches later in >> > "xarray: Add definition of struct xarray": >> > >> > diff --git a/tools/include/linux/spinlock.h >> > b/tools/include/linux/spinlock.h >> > index b21b586b9854..4ec4d2cbe27a 100644 >> > --- a/tools/include/linux/spinlock.h >> > +++ b/tools/include/linux/spinlock.h >> > @@ -6,8 +6,9 @@ >> > #include >> > >> > #define spinlock_t pthread_mutex_t >> > -#define DEFINE_SPINLOCK(x) pthread_mutex_t x = >> > PTHREAD_MUTEX_INITIALIZER; >> > +#define DEFINE_SPINLOCK(x) pthread_mutex_t x = >> > PTHREAD_MUTEX_INITIALIZER >> > #define __SPIN_LOCK_UNLOCKED(x) >> > (pthread_mutex_t)PTHREAD_MUTEX_INITIALIZER >> > +#define spin_lock_init(x) pthread_mutex_init(x, NULL) >> > >> > #define spin_lock_irqsave(x, f)(void)f, >> > pthread_mutex_lock(x) >> > #define spin_unlock_irqrestore(x, f) (void)f, pthread_mutex_unlock(x) >> > >> > I didn't pick up that it was needed this early on in the patch series. >> >> Hmmm..I don't know if it's a patch ordering issue, because this >> happens with the current linux/master where presumably all the patches >> are present? > > No, Andrew only merged the first 8 or so because of lack of review of > the remaining patches. Even though I cc'd people as hard as I could. > Including you. :-P > > You could, for example, review the DAX patches ... Fair enough. 
Let's get the radix tree working, and in the mean time I'll throw it into my xfstests testing setup & take a look at the DAX patches.
Re: [PATCH v9 07/61] xarray: Add the xa_lock to the radix_tree_root
On Thu, Apr 12, 2018 at 3:10 PM, Matthew Wilcox wrote:
> On Thu, Apr 12, 2018 at 02:59:32PM -0600, Ross Zwisler wrote:
>> This is causing build breakage in the radix tree test suite in the
>> current linux/master:
>>
>> ./linux/../../../../include/linux/idr.h: In function ‘idr_init_base’:
>> ./linux/../../../../include/linux/radix-tree.h:129:2: warning:
>> implicit declaration of function ‘spin_lock_init’; did you mean
>> ‘spinlock_t’? [-Wimplicit-function-declaration]
>
> Argh. That was added two patches later in
> "xarray: Add definition of struct xarray":
>
> diff --git a/tools/include/linux/spinlock.h b/tools/include/linux/spinlock.h
> index b21b586b9854..4ec4d2cbe27a 100644
> --- a/tools/include/linux/spinlock.h
> +++ b/tools/include/linux/spinlock.h
> @@ -6,8 +6,9 @@
>  #include
>
>  #define spinlock_t pthread_mutex_t
> -#define DEFINE_SPINLOCK(x) pthread_mutex_t x = PTHREAD_MUTEX_INITIALIZER;
> +#define DEFINE_SPINLOCK(x) pthread_mutex_t x = PTHREAD_MUTEX_INITIALIZER
>  #define __SPIN_LOCK_UNLOCKED(x) (pthread_mutex_t)PTHREAD_MUTEX_INITIALIZER
> +#define spin_lock_init(x) pthread_mutex_init(x, NULL)
>
>  #define spin_lock_irqsave(x, f)	(void)f, pthread_mutex_lock(x)
>  #define spin_unlock_irqrestore(x, f)	(void)f, pthread_mutex_unlock(x)
>
> I didn't pick up that it was needed this early on in the patch series.

Hmmm.. I don't know if it's a patch ordering issue, because this happens with the current linux/master where presumably all the patches are present?
[PATCH] radix tree test suite: fix mapshift build target
The following commit:

commit c6ce3e2fe3da ("radix tree test suite: Add config option for map shift")

introduced a phony makefile target called 'mapshift' that ends up generating the file generated/map-shift.h. This phony target was then added as a dependency of the top level 'targets' build target, which is what is run when you go to tools/testing/radix-tree and just type 'make'.

Unfortunately, this phony target doesn't actually work as a dependency, so you end up getting:

$ make
make: *** No rule to make target 'generated/map-shift.h', needed by 'main.o'. Stop.
make: *** Waiting for unfinished jobs

Fix this by making the file generated/map-shift.h our real makefile target, and add it as a dependency of the top level build target.

Signed-off-by: Ross Zwisler
---
This allows the radix tree test suite that shipped with v4.16 to build and run successfully. The radix tree test suite in the current linux/master during the v4.17 merge window is broken, as I've reported here:

https://marc.info/?l=linux-fsdevel&m=152356678901014&w=2

But this patch is necessary to even get to that breakage.

I've sent this to Matthew twice with no response, so Andrew, I'm trying you next.
---
 tools/testing/radix-tree/Makefile | 6 ++
 1 file changed, 2 insertions(+), 4 deletions(-)

diff --git a/tools/testing/radix-tree/Makefile b/tools/testing/radix-tree/Makefile
index fa7ee369b3c9..db66f8a0d4be 100644
--- a/tools/testing/radix-tree/Makefile
+++ b/tools/testing/radix-tree/Makefile
@@ -17,7 +17,7 @@ ifeq ($(BUILD), 32)
 LDFLAGS += -m32
 endif
 
-targets: mapshift $(TARGETS)
+targets: generated/map-shift.h $(TARGETS)
 
 main: $(OFILES)
 
@@ -42,9 +42,7 @@ radix-tree.c: ../../../lib/radix-tree.c
 idr.c: ../../../lib/idr.c
 	sed -e 's/^static //' -e 's/__always_inline //' -e 's/inline //' < $< > $@
 
-.PHONY: mapshift
-
-mapshift:
+generated/map-shift.h:
 	@if ! grep -qws $(SHIFT) generated/map-shift.h; then\
 		echo "#define RADIX_TREE_MAP_SHIFT $(SHIFT)" > \
 			generated/map-shift.h; \
-- 
2.14.3
Re: [PATCH v9 07/61] xarray: Add the xa_lock to the radix_tree_root
On Tue, Mar 13, 2018 at 7:25 AM, Matthew Wilcox wrote:
> From: Matthew Wilcox
>
> This results in no change in structure size on 64-bit machines as it
> fits in the padding between the gfp_t and the void *. 32-bit machines
> will grow the structure from 8 to 12 bytes. Almost all radix trees
> are protected with (at least) a spinlock, so as they are converted from
> radix trees to xarrays, the data structures will shrink again.
>
> Initialising the spinlock requires a name for the benefit of lockdep,
> so RADIX_TREE_INIT() now needs to know the name of the radix tree it's
> initialising, and so do IDR_INIT() and IDA_INIT().
>
> Also add the xa_lock() and xa_unlock() family of wrappers to make it
> easier to use the lock. If we could rely on -fplan9-extensions in
> the compiler, we could avoid all of this syntactic sugar, but that
> wasn't added until gcc 4.6.
>
> Signed-off-by: Matthew Wilcox
> Reviewed-by: Jeff Layton

This is causing build breakage in the radix tree test suite in the current linux/master:

5d1365940a68 (linux/master) Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net

Here's the first breakage:

cc -I. -I../../include -g -O2 -Wall -D_LGPL_SOURCE -fsanitize=address -c -o idr.o idr.c
In file included from ./linux/radix-tree.h:6:0,
                 from ./linux/../../../../include/linux/idr.h:15,
                 from ./linux/idr.h:1,
                 from idr.c:4:
./linux/../../../../include/linux/idr.h: In function ‘idr_init_base’:
./linux/../../../../include/linux/radix-tree.h:129:2: warning: implicit declaration of function ‘spin_lock_init’; did you mean ‘spinlock_t’? [-Wimplicit-function-declaration]
  spin_lock_init(&(root)->xa_lock);\
  ^
./linux/../../../../include/linux/idr.h:126:2: note: in expansion of macro ‘INIT_RADIX_TREE’
  INIT_RADIX_TREE(&idr->idr_rt, IDR_RT_MARKER);
  ^~~
Re: [PATCH] libnvdimm, region: hide persistence_domain when unknown
On Tue, Mar 20, 2018 at 09:23:21PM -0700, Dan Williams wrote: > Similar to other region attributes, do not emit the persistence_domain > attribute if its contents are empty. > > Fixes: 96c3a239054a ("libnvdimm: expose platform persistence attr...") > Cc: Dave Jiang > Signed-off-by: Dan Williams Looks good to me. Reviewed-by: Ross Zwisler
[PATCH] MAINTAINERS: add coverage for drivers/block
To help folks like me that use scripts/get_maintainer.pl.

Signed-off-by: Ross Zwisler
---
 MAINTAINERS | 1 +
 1 file changed, 1 insertion(+)

diff --git a/MAINTAINERS b/MAINTAINERS
index 4623caf8d72d..7ff83f4c9aeb 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -2685,6 +2685,7 @@ L:	linux-bl...@vger.kernel.org
 T:	git git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux-block.git
 S:	Maintained
 F:	block/
+F:	drivers/block/
 F:	kernel/trace/blktrace.c
 F:	lib/sbitmap.c
-- 
2.14.3
Re: [PATCH] loop: Fix lost writes caused by missing flag
On Fri, Mar 09, 2018 at 08:38:57AM -0700, Jens Axboe wrote:
> On 3/9/18 8:38 AM, Jens Axboe wrote:
> > On 3/8/18 5:20 PM, Ross Zwisler wrote:
> >> This has gotten Reviewed-by tags from Christoph and Ming Lei.
> >>
> >> Al, are you the right person to merge this? Or is the correct person Jens,
> >> whom I accidentally didn't include when I sent this out?
> >>
> >> Just wanted to make sure this got merged, and to see whether it was
> >> targeting v4.16 or v4.17.
> >
> > I'm the right guy, but I don't see patches if I'm not cc'ed on them...
> > I have queued this up for 4.16, thanks.
>
> I do see patches sent to linux-block as well, but you didn't CC that one
> either.

I figure out who to CC on my patches by using scripts/get_maintainer.pl, and it didn't know about you or linux-block because drivers/block wasn't covered in MAINTAINERS. I'll send a patch to fix this. As it was I just CC'd the folks involved in the original patch, and that one went up through Al.

Thanks for picking this up.