Re: [Cluster-devel] [GIT PULL] gfs2 fix
On Tue, Jun 6, 2023 at 5:48 AM Andreas Gruenbacher wrote:
> > - Don't get stuck writing page onto itself under direct I/O.

Btw, is there a test for this DIO case?

We've had the deadlock issue on the page lock (or for inode locks or whatever) for normal IO when faulting in the same page that is written to, and we have a pattern for solving that, and I think there are filesystem tests that trigger this.

But the DIO pattern is a bit different, with the whole "invalidate page cache" issue, and the fact that you send this patch now (rather than years ago) makes me wonder about test coverage for this all?

Linus
Re: [Cluster-devel] [GIT PULL] gfs2 fix for v6.3-rc4
On Thu, Mar 23, 2023 at 3:22 PM Andreas Grünbacher wrote:
> > I've pushed the tag out now; should I resend the pull request?

No, all good, I have the changes,

Linus
Re: [Cluster-devel] [GIT PULL] gfs2 fix for v6.3-rc4
On Thu, Mar 23, 2023 at 11:45 AM Andreas Gruenbacher wrote:
> > From: Linus Torvalds

Wat?

>   git://git.kernel.org/pub/scm/linux/kernel/git/gfs2/linux-gfs2.git gfs2-v6.3-rc3-fix

 -ENOSUCHTAG

> for you to fetch changes up to 260595b439776c473cc248f0de63fe78d964d849:

.. and no such commit available in any other tag or branch either.

Did you forget to push out?

Linus
Re: [Cluster-devel] [GIT PULL] gfs2 writepage fix
On Sun, Jan 22, 2023 at 1:01 AM Andreas Gruenbacher wrote:
> > gfs2 writepage fix
> >
> > - Fix a regression introduced by commit "gfs2: stop using
> >   generic_writepages in gfs2_ail1_start_one".

Hmm. I'm adding a few more people and linux-fsdevel to the reply, because we had a number of filesystems remove writepages use lately, including some that did it as a fix after the merge window.

Everybody involved seemed to claim it was just a no-brainer switch-over, and I just took that on faith. Now it looks like that wasn't true at least for gfs2 due to different semantics.

Maybe the gfs2 issue is purely because of how gfs2 did the conversion (generic_writepages -> filemap_fdatawrite_wbc), but let's make people look at their own cases.

Linus
Re: [Cluster-devel] [GIT PULL] dlm updates for 6.0
On Mon, Aug 1, 2022 at 8:56 AM David Teigland wrote:
> > At risk of compounding your trouble, I could have added that you can use
> > the original merge and have me send the fixup.

Well, I only had something like 4 other merges on top, none of them *that* complicated. And I hadn't pushed out, so I just redid them.

I did notice your "v2" pull request fairly quickly, because I try to do my pulls (when I have lots of them, like at the beginning of the merge window) in some kind of "order of areas", so I was doing filesystem and vfs stuff, and your v2 ended up showing up in that bunch too, and I hadn't done too much.

I doubt I would have needed help with the conflicts, but I decided to not even look at them, because even with them resolved it would just have been ugly to have that pointless duplication from the rebase when I could just start from scratch again.

But again: please don't rebase stuff you have already exposed to others. It causes real issues. This was just one example of it.

And if you *do* have to rebase for a real technical reason ("Oh, that was a disaster, it absolutely *must* be fixed"), please let involved people know asap. And a version number change is not a "huge disaster, must rebase".

Linus
Re: [Cluster-devel] [GIT PULL] dlm updates for 6.0
On Mon, Aug 1, 2022 at 7:43 AM David Teigland wrote:
> > (You can ignore the premature 5.20 pull request from some weeks ago.)

Gaah. That was the first thing I pulled this morning because it was the earliest.

And apparently you've rebased, so I can't even just pull on top. Gaah. Now I'll have to go back and re-do everything I've done this morning.

PLEASE don't do things like this to me. If you know you are going to re-do a pull request, let me know early, so that I don't pull the old one.

Linus
Re: [Cluster-devel] [RFC 0/2] refcount: attempt to avoid imbalance warnings
On Thu, Jun 30, 2022 at 6:59 AM Alexander Aring wrote:
> > I send this patch series as RFC because it was necessary to do a kref
> > change after adding __cond_lock() to refcount_dec_and_lock()
> > functionality.

Can you try something like this instead?

This is two separate patches - one for sparse, and one for the kernel.

This is only *very* lightly tested (ie I tested it on a single kernel file that used refcount_dec_and_lock())

Linus

 linearize.c          | 24 ++++++++++++++++++++--
 validation/context.c | 15 +++++++++++++++
 2 files changed, 37 insertions(+), 2 deletions(-)

diff --git a/linearize.c b/linearize.c
index d9aed61b..8dd005af 100644
--- a/linearize.c
+++ b/linearize.c
@@ -1537,6 +1537,8 @@ static pseudo_t linearize_call_expression(struct entrypoint *ep, struct expressi
 	add_one_insn(ep, insn);
 
 	if (ctype) {
+		struct basic_block *post_call = NULL;
+
 		FOR_EACH_PTR(ctype->contexts, context) {
 			int in = context->in;
 			int out = context->out;
@@ -1547,8 +1549,21 @@ static pseudo_t linearize_call_expression(struct entrypoint *ep, struct expressi
 				in = 0;
 			}
 			if (out < 0) {
-				check = 0;
-				out = 0;
+				struct basic_block *set_context;
+
+				if (retval == VOID) {
+					warning(expr->pos, "nonsensical conditional output context with no condition");
+					break;
+				}
+				if (check || in) {
+					warning(expr->pos, "nonsensical conditional input context");
+					break;
+				}
+				if (!post_call)
+					post_call = alloc_basic_block(ep, expr->pos);
+				set_context = alloc_basic_block(ep, expr->pos);
+				add_branch(ep, retval, set_context, post_call);
+				set_activeblock(ep, set_context);
+				out = -out;
 			}
 			context_diff = out - in;
 			if (check || context_diff) {
@@ -1560,6 +1575,11 @@ static pseudo_t linearize_call_expression(struct entrypoint *ep, struct expressi
 			}
 		} END_FOR_EACH_PTR(context);
 
+		if (post_call) {
+			add_goto(ep, post_call);
+			set_activeblock(ep, post_call);
+		}
+
 		if (ctype->modifiers & MOD_NORETURN)
 			add_unreachable(ep);
 	}
diff --git a/validation/context.c b/validation/context.c
index b9500dc7..f8962f19 100644
--- a/validation/context.c
+++ b/validation/context.c
@@ -10,6 +10,10 @@ static void r(void) __attribute__((context(1,0)))
 	__context__(-1);
 }
 
+// Negative output means "conditional positive output"
+extern int cond_get(void) __attribute((context(0,-1)));
+extern void nonsensical_cond_get(void) __attribute((context(0,-1)));
+
 extern int _ca(int fail);
 #define ca(fail) __cond_lock(_ca(fail))
 
@@ -19,6 +23,17 @@ static void good_paired1(void)
 	r();
 }
 
+static void good_conditional(void)
+{
+	if (cond_get())
+		r();
+}
+
+static void nonsensical_conditional(void)
+{
+	nonsensical_cond_get();
+}
+
 static void good_paired2(void)
 {
 	a();

 include/linux/compiler_types.h | 2 ++
 include/linux/refcount.h       | 6 +++---
 2 files changed, 5 insertions(+), 3 deletions(-)

diff --git a/include/linux/compiler_types.h b/include/linux/compiler_types.h
index d08dfcb0ac68..4f2a819fd60a 100644
--- a/include/linux/compiler_types.h
+++ b/include/linux/compiler_types.h
@@ -24,6 +24,7 @@ static inline void __chk_io_ptr(const volatile void __iomem *ptr) { }
 /* context/locking */
 # define __must_hold(x)		__attribute__((context(x,1,1)))
 # define __acquires(x)		__attribute__((context(x,0,1)))
+# define __cond_acquires(x)	__attribute__((context(x,0,-1)))
 # define __releases(x)		__attribute__((context(x,1,0)))
 # define __acquire(x)		__context__(x,1)
 # define __release(x)		__context__(x,-1)
@@ -50,6 +51,7 @@ static inline void __chk_io_ptr(const volatile void __iomem *ptr) { }
 /* context/locking */
 # define __must_hold(x)
 # define __acquires(x)
+# define __cond_acquires(x)
 # define __releases(x)
 # define __acquire(x)	(void)0
 # define __release(x)	(void)0
diff --git a/include/linux/refcount.h b/include/linux/refcount.h
index b8a6e387f8f9..a62fcca97486 100644
--- a/include/linux/refcount.h
+++ b/include/linux/refcount.h
@@ -361,9 +361,9 @@ static inline void refcount_dec(refcount_t *r)
 
 extern __must_check bool refcount_dec_if_one(refcount_t *r);
 extern __must_check bool refcount_dec_not_one(refcount_t *r);
-extern __must_check bool refcount_dec_and_mutex_lock(refcount_t *r, struct mutex *lock);
-extern __must_check bool refcount_dec_and_lock(refcount_t *r, spinlock_t *lock);
+extern __must_check bool refcount_dec_and_mutex_lock(refcount_t *r, struct mutex *lock) __cond_acquires(lock);
+extern __must_check bool refcount_dec_and_lock(refcount_t *r, spinlock_t *lock) __cond_acquires(lock);
 extern __must_check bool refcount_dec_and_lock_irqsave(refcount_t *r, spinlock_t *lock,
-						       unsigned long *flags);
+						       unsigned long *flags) __cond_acquires(lock);
 
 #endif /* _LINUX_REFCOUNT_H */
Re: [Cluster-devel] sparse warnings related to kref_put_lock() and refcount_dec_and_lock()
On Tue, Jun 28, 2022 at 1:58 AM Luc Van Oostenryck wrote:
> > I would certainly not recommend this but ...
> > if it's OK to cheat and lie then you can do:
> > + bool refcount_dec_and_lock(refcount_t *r, spinlock_t *lock)
> > 	__acquires(lock);

Actually, we have "__cond_lock()" in the kernel to actually document that something takes a lock only in certain conditions.

It needs to be declared as a macro in the header file, with this as an example:

	#define raw_spin_trylock(lock)	__cond_lock(lock, _raw_spin_trylock(lock))

ie that says that "raw_spin_trylock() takes 'lock' when _raw_spin_trylock() returned true".

That then makes it possible to write code like

	if (raw_spin_trylock(lock)) {
		..
		raw_spin_unlock(lock);
	}

and sparse will get the nesting right.

But that does *not* solve the issue of people then writing this as

	locked = raw_spin_trylock(lock);
	.. do_something ..
	if (locked)
		raw_spin_unlock(lock);

and you end up with the same thing again.

Anyway, for things like refcount_dec_and_lock(), I suspect that first pattern should work, because you really shouldn't have that second pattern. If you just decremented without any locking, the actions are completely different from the "oh, got the last decrement and now it's locked and I need to free things" or whatever.

Linus
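The shape of that pattern can be mimicked in a stand-alone user-space sketch. Everything below except the `__cond_lock()` definition is invented for illustration (`demo_trylock()`, `_demo_trylock()`, `do_guarded_work()` are not kernel names); the macro itself matches the kernel's: under sparse (`__CHECKER__`) it tells the checker the context is acquired exactly when the inner call returned true, and for a normal compiler it is just the call.

```c
#include <assert.h>

/* The kernel's __cond_lock(): a sparse-only annotation that says
 * "the context 'x' is acquired iff expression 'c' was true". */
#ifdef __CHECKER__
# define __cond_lock(x, c)	((c) ? ({ __acquire(x); 1; }) : 0)
#else
# define __cond_lock(x, c)	(c)
#endif

/* A toy "lock" standing in for a raw spinlock. */
static int demo_lock_taken;

static int _demo_trylock(void)
{
	if (demo_lock_taken)
		return 0;	/* already held: acquisition fails */
	demo_lock_taken = 1;
	return 1;		/* acquired */
}

static void demo_unlock(void)
{
	demo_lock_taken = 0;
}

/* Header-level wrapper, in the style of raw_spin_trylock():
 * "demo_trylock() takes the lock when _demo_trylock() returned true". */
#define demo_trylock()	__cond_lock(&demo_lock_taken, _demo_trylock())

/* The pattern sparse can follow: the unlock sits syntactically inside
 * the branch where the conditional acquisition succeeded. */
static int do_guarded_work(void)
{
	if (demo_trylock()) {
		/* ... critical section ... */
		demo_unlock();
		return 1;
	}
	return 0;
}
```

The point is the nesting: the `locked = trylock(); ... if (locked) unlock();` form computes the same thing at runtime, but loses the syntactic association that lets sparse track the lock's acquire/release balance.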
Re: [Cluster-devel] [GIT PULL] gfs2 fixes
On Fri, May 13, 2022 at 2:07 PM Andreas Gruenbacher wrote:
> > Would you like to still pull these fixes for v5.18, or should we send
> > them in the next merge window?

I've pulled them. Any filesystem corruption issue sounds scary enough that it doesn't sound very sane to delay fixes.

Linus
Re: [Cluster-devel] [GIT PULL] gfs2 fix
On Tue, May 3, 2022 at 2:35 PM Andreas Gruenbacher wrote:
> > More testing still ongoing, but the following patch seems to fix the
> > data corruption. Fingers crossed.

> +	truncate_pagecache_range(inode, hstart, hend - 1);
> +	if (hstart < hend)
> +		punch_hole(ip, hstart, hend - hstart);

Why doesn't that "hstart < hend" condition cover both the truncate and the hole punch?

Linus
Re: [Cluster-devel] [GIT PULL] gfs2 fix
On Tue, May 3, 2022 at 9:41 AM Andreas Gruenbacher wrote:
> > That's happening in iomap_file_buffered_write() and iomap_iter():
> >
> >	while ((ret = iomap_iter(&iter, ops)) > 0)
> >		iter.processed = iomap_write_iter(&iter, i);
> >
> > Here, iomap_write_iter() returns how much progress it has made, which
> > is stored in iter.processed, and iomap_iter() -> iomap_iter_advance()
> > then updates iter.pos and iter.len based on iter.processed.

Ahh. I even had that file open in my editor in another window, but missed it. So much for that theory.

Linus
Re: [Cluster-devel] [GIT PULL] gfs2 fix
On Mon, May 2, 2022 at 11:31 AM Linus Torvalds wrote:
> > NOTE! This patch is entirely untested. I also didn't actually yet go
> > look at what gfs2 does when 'bytes' and 'copied' are different.

Oh, it's a lot of generic iomap_write_end() code, so I guess it's just as well that I brought in the iomap people.

And the iomap code has many different cases. Some of them do

	if (unlikely(copied < len && !folio_test_uptodate(folio)))
		return 0;

to force the whole IO to be re-done (looks sane - that's the "the underlying folio wasn't uptodate, because we expected the write to make it so").

And that might not have happened before, but it looks like gfs2 does actually try to deal with that case. But since Andreas said originally that the IO wasn't aligned, I don't think that "not uptodate" case is what is going on, and it's more about some "partial write in the middle of a buffer succeeded"

And the code also has things like

	if (ret < len)
		iomap_write_failed(iter->inode, pos, len);

which looks very wrong - it's not that the write failed, it's just incomplete because it was done with page faults disabled. It seems to try to do some page cache truncation based on the original 'len', but not taking the successful part into account. Which all sounds horrifically wrong.

But I don't know the code well enough to really judge. It just makes me uncomfortable, and I do suspect this code may be quite buggy if the copy of the full 'len' doesn't succeed.

Again, the patch I sent only _hides_ any issues and makes them practically impossible to see. It doesn't really _fix_ anything, since - as mentioned - regardless of fault_in_iov_iter_readable() succeeding, racing with page-out could then cause the later copy_page_from_iter_atomic() to have a partial copy anyway.

And hey, maybe there's something entirely different going on, and my "Heureka! It might be explained by that partial write_end that generally didn't happen before" is only my shouting at windmills.

Linus
Re: [Cluster-devel] [GIT PULL] gfs2 fix
On Thu, Apr 28, 2022 at 10:09 AM Linus Torvalds wrote:
> > I'll look at that copy_page_to_iter_iovec() some more regardless, but
> > doing that "let's double-check it's not something else" would be good.

Oh, and as I do that, it strikes me: What platform do you see the problem on?

Because the code for that function is very different if HIGHMEM is enabled, so if you see this on x86-32 but not x86-64, for example, that is relevant.

I *assume* nobody uses x86-32 any more, but just checking...

Linus
Re: [Cluster-devel] [GIT PULL] gfs2 fix
On Thu, Apr 28, 2022 at 6:27 AM Andreas Gruenbacher wrote:
> > The data corruption we've been getting unfortunately didn't have to do
> > with lock contention (we already knew that); it still occurs. I'm
> > running out of ideas on what to try there.

Hmm. I don't see the bug, but I do have a suggestion on something to try.

In particular, you said the problem started with commit 00bfe02f4796 ("gfs2: Fix mmap + page fault deadlocks for buffered I/O").

And to me, I see two main things that are going on

 (a) the obvious "calling generic IO functions with pagefault disabled" thing

 (b) the "allow demotion" thing

And I wonder if you could at least pinpoint which of the cases it is that triggers it. So I'd love to see you try three things:

 (1) just remove the "allow demotion" cases.

     This will re-introduce the deadlock the commit is trying to fix, but that's such a special case that I assume you can run your test-suite that shows the problem even without that fix in place?

     This would just pinpoint whether it's due to some odd locking issue or not.

Honestly, from how you describe the symptoms, I don't think (1) is the cause, but I think making sure is good.

It sounds much more likely that it's one of those generic vfs functions that screws up when a page fault happens and it gets a partial result instead of handling the fault. Which gets us to

 (2) remove the pagefault_disable/enable() around just the generic_file_read_iter() case in gfs2_file_read_iter().

and

 (3) finally, remove the pagefault_disable/enable() around the iomap_file_buffered_write() case in gfs2_file_buffered_write()

Yeah, yeah, you say it's just the read that fails, but humor me on (3), just in case it's an earlier write in your test-suite and the read just then uncovered it. But I put it as (3) so that you'd do the obvious (2) case first, and narrow it down (ie if (1) still shows the bug, then do (2), and if that fixes the bug it will be fairly well pinpointed to generic_file_read_iter()).

Looking around, gfs2 is the only thing that obviously calls generic_file_read_iter() with pagefaults disabled, so it does smell like filemap_read() might have some issue, but the only thing that does is basically that

	copied = copy_folio_to_iter(folio, offset, bytes, iter);

which should just become copy_page_to_iter_iovec(), which you'd hope would get things right.

But it would be good to just narrow things down a bit. I'll look at that copy_page_to_iter_iovec() some more regardless, but doing that "let's double-check it's not something else" would be good.

Linus
Re: [Cluster-devel] [GIT PULL] gfs2 fix
On Wed, Apr 27, 2022 at 3:20 PM Linus Torvalds wrote:
> > So I really think
> >
> > (a) you are mis-reading the standard by attributing too strong logic
> > to paperwork that is English prose and not so exact
> >
> > (b) documenting Linux as not doing what you are mis-reading it for is
> > only encouraging others to mis-read it too
> >
> > The whole "arbitrary writes have to be all-or-nothing wrt all other
> > system calls" is simply not realistic, and has never been. Not just
> > not in Linux, but in *ANY* operating system that POSIX was meant to
> > describe.

Side note: a lot of those "atomic" things in that documentation have come from a history of signal handling atomicity issues, and from all the issues people had with (a) user-space threading implementations and (b) emulation layers from non-Unixy environments.

So when they say that things like "rename()" has to be all-or-nothing, it's to clarify that you can't emulate it as a "link and delete original" kind of operation (which old UNIX *did* do) and claim to be POSIX.

Because while the end result of rename() and link()+unlink() might be similar, people did rely on that whole "use rename as a way to create an atomic marker in the filesystem" (which is a very traditional UNIX pattern).

So "rename()" has to be atomic, and the legacy behavior of link+unlink is not valid in POSIX.

Similarly, you can't implement "pread()" as a "lseek+read+lseek back", because that doesn't work if somebody else is doing another "pread()" on the same file descriptor concurrently.

Again, people *did* implement exactly those kinds of implementations of "pread()", and yes, they were broken for both signals and for threading.

So there's "atomicity" and then there is "atomicity". That "all or nothing" can be a very practical thing to describe *roughly* how it must work on a higher level, or it can be a theoretical "transactional" thing that works literally like a database where the operation happens in full and you must not see any intermediate state.

And no, "write()" and friends have never ever been about some transactional operation where you can't see how the file grows as it is being written to. That kind of atomicity has simply never existed, not even in theory.

So when you see POSIX saying that a "read()" system call is "atomic", you should *not* see it as a transaction thing, but see it in the historical context of "people used to do threading libraries in user space, and since they didn't want a big read() to block all other threads, they'd split it up into many smaller reads and now another thread *also* doing 'read()' system calls would see the data it read being not one contiguous region, but multiple regions where the file position changed in the middle".

Similarly, a "read()" system call will not be interrupted by a signal in the middle, where the signal handler would do a "lseek()" or another "read()", and now the original "read()" data suddenly is affected. That's why things like that whole "f_pos is atomic" is a big deal. Because there literally were threading libraries (and badly emulated environments) where that *WASN'T* the case, and _that_ is why POSIX then talks about it.

So think of POSIX not as some hard set of "this is exactly how things work and we describe every detail". Instead, treat it a bit like historians treat Herodotus - interpreting his histories by taking the issues of the time into account.

POSIX is trying to clarify and document the problems of the time it was written, and taking other things for granted.

Linus
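The pread() point above can be demonstrated from user space. A small sketch (the file path and helper name are invented, and a writable /tmp is assumed): a real pread() reads at an explicit offset and leaves the shared file offset untouched, which is exactly the property the lseek+read+lseek emulation cannot provide atomically against signals or other threads.

```c
#include <assert.h>
#include <fcntl.h>
#include <unistd.h>

/* Write a small file, position the shared offset at 0, then pread()
 * 5 bytes at offset 7 ("world").  Returns the shared file offset as
 * observed afterwards: with a real pread() it must still be 0, whereas
 * an lseek-based emulation would have moved it mid-flight, visible to
 * any other user of the same file description. */
static long offset_after_pread(const char *path)
{
	char buf[8];
	long pos;
	int fd = open(path, O_RDWR | O_CREAT | O_TRUNC, 0600);

	if (fd < 0)
		return -1;
	if (write(fd, "hello, world", 12) != 12) {
		close(fd);
		return -1;
	}
	lseek(fd, 0, SEEK_SET);
	if (pread(fd, buf, 5, 7) != 5) {	/* reads "world" */
		close(fd);
		return -1;
	}
	pos = lseek(fd, 0, SEEK_CUR);	/* shared offset: unchanged */
	close(fd);
	unlink(path);
	return pos;
}
```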
Re: [Cluster-devel] [GIT PULL] gfs2 fix
On Wed, Apr 27, 2022 at 2:26 PM Andreas Gruenbacher wrote:
> > Well, POSIX explicitly mentions those atomicity expectations, e.g.,
> > for read [1]:

Yes. I'm aware. And my point is that we've never done _that_ kind of atomicity.

It's also somewhat ambiguous what it actually means, since what it then talks about is "all bytes that started out together ends together" and "interleaving". That all implies that it's about the *position* of the reads and writes being atomic, not the *data* of the reads and writes.

That, btw, was something we honored even before we had the locking around f_pos accesses - a read or write system call would get its own local *copy* of the file position, the read or write would then do that IO based on that copied position - so that things that "started out together ends together" - and then after the operation is done it would *update* the file position atomically.

Note that that is exactly so that data would end up "together". But it would mean that two concurrent reads using the same file position might read the *same* area of the file. Which still honors that "the read is atomic wrt the range", but obviously the actual values of "f_pos" is basically random after the read (ie is it the end of the first read, or the end of the second read?).

The same paragraph also explicitly mentions pipes and FIFOs, despite an earlier paragraph dismissing them, which is all just a sign of things being very confused.

Anyway, I'm not objecting very strenuously to making it very clear in some documentation that this "data atomicity" is not what Linux has ever done. If you do overlapping IO, you get what you deserve.

But I do have objections.

On one hand, it's not all that different from some of the other notes we have in the man-pages (ie documenting that whole "just under 2GB" limit on the read size, although that's actually using the wrong constant: it's not 0x7ffff000 bytes, it's MAX_RW_COUNT, which is "INT_MAX & PAGE_MASK", and that constant in the man-page is as such only true on a system with 4kB page sizes)

BUT!

I'm 100% convinced that NOBODY HAS EVER given the kind of atomicity guarantees that you would see from reading that document as a language-lawyer.

For example, that section "2.9.7 Thread Interactions with Regular File Operations" says that "fstat()" is atomic wrt "write()", and that you should see "all or nothing". I *GUARANTEE* that no operating system ever has done that, and I further claim that reading it the way you read it is not only against reality, it's against sanity.

Example: if I do a big write to a file that I just created, do you really want "fstat()" in another thread or process to not even be able to see how the file grows as the write happens? It's not what anybody has *EVER* done, I'm pretty sure.

So I really think

 (a) you are mis-reading the standard by attributing too strong logic to paperwork that is English prose and not so exact

 (b) documenting Linux as not doing what you are mis-reading it for is only encouraging others to mis-read it too

The whole "arbitrary writes have to be all-or-nothing wrt all other system calls" is simply not realistic, and has never been. Not just not in Linux, but in *ANY* operating system that POSIX was meant to describe.

And equally importantly: if some crazy person were to actually try to implement such "true atomicity" things, the end result would be objectively worse. Because you literally *want* to see a big write() updating the file length as the write happens.

The fact that the standard then doesn't take those kinds of details into account is simply because the standard isn't meant to be read as a language lawyer, but as a "realistically .."

Linus
Re: [Cluster-devel] [GIT PULL] gfs2 fix
On Wed, Apr 27, 2022 at 12:41 PM Andreas Gruenbacher wrote:
> > I wonder if this could be documented in the read and write manual
> > pages. Or would that be asking too much?

I don't think it would be asking too much, since it's basically just describing what Linux has always done in all the major filesystems.

Eg look at filemap_read(), which is basically the canonical read function, and note how it doesn't take a single lock at that level.

We *do* have synchronization at a page level, though, ie we've always had that page-level "uptodate" bit, of course (ok, so "always" isn't true - back in the distant past it was the 'struct buffer_head' that was the synchronization point).

That said, even that is not synchronizing against "new writes", but only against "new creations" (which may, of course, be writers, but is equally likely to be just reading the contents from disk).

That said:

 (a) different filesystems can and will do different things. Not all filesystems use filemap_read() at all, and even the ones that do often have their own wrappers. Such wrappers *can* do extra serialization, and have their own rules. But ext4 does not, for example (see ext4_file_read_iter()). And as mentioned, I *think* XFS honors that old POSIX rule for historical reasons.

 (b) we do have *different* locking. For example, these days we do actually serialize properly on the file->f_pos, which means that a certain *class* of read/write things are atomic wrt each other, because we actually hold that f_pos lock over the whole operation, and so if you do file reads and writes using the same file descriptor, they'll be disjoint.

That, btw, hasn't always been true. If you had multiple threads using the same file pointer, I think we used to get basically random results. So we have actually strengthened our locking in this area, and made it much better.

But note how even if you have the same file descriptor open, and then do pread/pwrite, those can and will happen concurrently.

And mmap accesses and modifications are obviously *always* concurrent, even if the fault itself - but not the accesses - might end up being serialized due to some filesystem locking implementation detail.

End result: the exact serialization is complex, depends on the filesystem, and is just not really something that should be described or even relied on (eg that f_pos serialization is something we do properly now, but didn't necessarily do in the past, so ..)

Is it then worth pointing out one odd POSIX rule that basically nobody but some very low-level filesystem people have ever heard about, and that no version of Linux has ever conformed to in the main default filesystems, and that no user has ever cared about?

Linus
Re: [Cluster-devel] [GIT PULL] gfs2 fix
On Wed, Apr 27, 2022 at 5:29 AM Andreas Gruenbacher wrote:
> > Regular (buffered) reads and writes are expected to be atomic with
> > respect to each other.

Linux has actually never honored that completely broken POSIX requirement, although I think some filesystems (notably XFS) have tried.

It's a completely broken concept. It's not possible to honor atomicity with mmap(), and nobody has *ever* cared. And it causes huge amounts of problems and basically makes any sane locking entirely impossible.

The fact that you literally broke regular file writes in ways that are incompatible with (much MUCH more important) POSIX file behavior to try to get that broken read/write atomicity is only one example among many for why that alleged rule just has to be ignored.

We do honor the PIPE_BUF atomicity on pipes, which is a completely different kind of atomicity wrt read/write, and doesn't have the fundamental issues that arbitrary regular file reads/writes have.

There is absolutely no sane way to do that file atomicity wrt arbitrary read/write calls (*), and you shouldn't even try. That rule needs to be forgotten about, and buried 6ft deep.

So please scrub any mention of that idiotic rule from documentation, and from your brain.

And please don't break "partial write means disk full or IO error" due to trying to follow this broken rule, which was apparently what you did. Because that "regular file read/write is done in full" is a *MUCH* more important rule, and there is a shitton of applications that most definitely depend on *that* rule.

Just go to debian code search, and look for "if (write(" and you'll get thousands of hits, and on the first page of hits 9 out of 10 of the hits are literally about that "partial write is an error", eg code like this:

	if (write(fd, &triple, sizeof(triple)) != sizeof(triple))
		reporterr(1, NULL);

from libreoffice.

Linus

(*) Yeah, if you never care about performance(**) of mixed read/write, and you don't care about mmap, and you have no other locking issues, it's certainly possible. The old rule came about from original UNIX literally taking an inode lock around the whole IO access, because that was simple, and back in the days you'd never have multiple concurrent readers/writers anyway.

(**) It's also instructive how O_DIRECT literally throws that rule away, and then some direct-IO people said for years that direct-IO is superior and used this as one of their arguments. Probably the same people who thought that "oh, don't report partial success", because we can't deal with it.
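Applications that do want to tolerate the legitimate short-write cases (signals, pipes, sockets) pair that "partial write is an error" check with a retry loop. A sketch of that classic full-write idiom (`write_all()` and the demo path are invented names; a writable /tmp is assumed for the self-check helper):

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stddef.h>
#include <unistd.h>

/* Keep calling write() until all bytes are out or a real error occurs.
 * Returns 0 on success, -1 with errno set on failure (ENOSPC, EIO, ...). */
static int write_all(int fd, const void *buf, size_t len)
{
	const char *p = buf;

	while (len > 0) {
		ssize_t n = write(fd, p, len);

		if (n < 0) {
			if (errno == EINTR)
				continue;	/* interrupted: retry, not an error */
			return -1;
		}
		p += n;
		len -= n;
	}
	return 0;
}

/* Self-check: round-trip a 10-byte buffer through a temp file and
 * verify the file ends up exactly 10 bytes long. */
static int write_all_demo(const char *path)
{
	long size;
	int rc;
	int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0600);

	if (fd < 0)
		return -1;
	rc = write_all(fd, "0123456789", 10);
	size = lseek(fd, 0, SEEK_END);
	close(fd);
	unlink(path);
	return (rc == 0 && size == 10) ? 0 : -1;
}
```

For regular files on a healthy filesystem the loop body runs once, which is exactly why the `!= sizeof(triple)` one-shot check has worked for decades.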
Re: [Cluster-devel] [GIT PULL] gfs2 fix
On Tue, Apr 26, 2022 at 2:28 PM Andreas Gruenbacher wrote: > > Btrfs has a comment in that place that reads: > > /* No increment (+=) because iomap returns a cumulative value. */ What a truly horrid interface. But you are triggering repressed childhood memories: > That's so that we can complete the tail of an asynchronous write > asynchronously after a failed page fault and subsequent fault-in. yeah, that makes me go "I remember something like that". > That would be great, but applications don't seem to be able to cope > with short direct writes, so we must turn partial failure into total > failure here. There's at least one xfstest that checks for that as > well. What a complete crock. You're telling me that you did some writes, but then you can't tell user space that writes happened because that would confuse things. Direct-IO is some truly hot disgusting garbage. Happily it's only used for things like databases that nobody sane would care about anyway. Anyway, none of that makes any sense, since you do this: ret = gfs2_file_direct_write(iocb, from, ); if (ret < 0 || !iov_iter_count(from)) goto out_unlock; iocb->ki_flags |= IOCB_DSYNC; buffered = gfs2_file_buffered_write(iocb, from, ); so you're saying that the direct write will never partially succeed, but then in gfs2_file_write_iter() it very much looks like it's falling back to buffered writes for that case. Hmm. Very odd. I assume this is all due to that /* Silently fall back to buffered I/O when writing beyond EOF */ if (iocb->ki_pos + iov_iter_count(from) > i_size_read(>i_inode)) thing, so this all makes some perverse kind of sense, but it still looks like this code just needs some serious serious commentary. So you *can* have a partial write if you hit the end of the file, and then you'll continue that partial write with the buffered code. But an actual _failure_ will not do that, and instead return an error even if the write was partially done. 
But then *some* failures aren't failures at all, and without any comments do this if (ret == -ENOTBLK) ret = 0; And remind me again - this all is meant for applications that supposedly care about consistency on disk? And the xfs tests actually have a *test* for that case, to make sure that nobody can sanely *really* know how much of the write succeeded if it was a DIO write? Gotcha. > > The reason I think I'm full of sh*t is that you say that the problem > > occurs in gfs2_file_buffered_write(), not in that > > gfs2_file_direct_write() case. > > Right, we're having that issue with buffered writes. I have to say, compared to all the crazy things I see in the DIO path, the buffered write path actually looks almost entirely sane. Of course, gfs2_file_read_iter() counts how many bytes it has read in a variable called 'written', and gfs2_file_buffered_write() counts the bytes it has written in a variable called 'read', so "entirely sane" is all very very relative. I'm sure there's some good reason (job security?) for all this insanity. But I now have to go dig my eyes out with a rusty spoon. But before I do that, I have one more question (I'm going to regret this, aren't I?): In gfs2_file_buffered_write(), when it has done that fault_in_iov_iter_readable(), and it decides that that succeeded, it does if (gfs2_holder_queued(gh)) goto retry_under_glock; if (read && !(iocb->ki_flags & IOCB_DIRECT)) goto out_uninit; goto retry; so if it still has that lock (if I understand correctly), it will always retry. But if it *lost* the lock, it will retry only if was a direct write, and return a partial success for a regular write() rather than continue the write. Whaa? I'm assuming that this is more of that "direct writes have to succeed fully or not at all", but according to POSIX *regular* writes should also succeed fully, unless some error happens. Losing the lock doesn't sound like an error to me. 
And there are a lot of applications that do assume "write to a regular file didn't complete fully means that the disk is full" etc. Because that's the traditional meaning. This doesn't seem to be related to any data corruption issue, but it does smell. Linus
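The "partial success beats a late error" convention argued over in the thread above can be modeled in plain userspace C. This is a hedged sketch with made-up names, not actual gfs2 or kernel code:

```c
#include <assert.h>
#include <errno.h>

/*
 * Toy model of the return convention under discussion: once a write
 * path has transferred some bytes, it must report that partial count
 * rather than a subsequent error, so userspace can retry the rest;
 * only when nothing was written at all does the error get reported.
 * Both the function and its callers are illustrative only.
 */
static long write_result(long done, long err)
{
	if (done > 0)
		return done;	/* partial success wins over a late error */
	return err;		/* nothing written: report the error */
}
```

Under this convention a caller that sees a short count (rather than -EFAULT or -ENOSPC) knows some bytes landed and can decide whether to continue - which is exactly the behaviour traditional applications expect from regular write().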
Re: [Cluster-devel] [GIT PULL] gfs2 fix
On Tue, Apr 26, 2022 at 7:54 AM Andreas Gruenbacher wrote:
>
> Also, note that we're fighting with a rare case of data corruption that
> seems to have been introduced by commit 00bfe02f4796 ("gfs2: Fix mmap +
> page fault deadlocks for buffered I/O"; merged in v5.16). The
> corruption seems to occur in gfs2_file_buffered_write() when
> fault_in_iov_iter_readable() is involved. We do end up with data that's
> written at an offset, as if after a fault-in, the position in the iocb
> got out of sync with the position in the iov_iter. The user buffer the
> iov_iter points at isn't page aligned, so the corruption also isn't page
> aligned. The code all seems correct and we couldn't figure out what's
> going on so far.

So this may be stupid, but looking at gfs2_file_direct_write(), I see what looks like two different obvious bugs.

This looks entirely wrong:

        if (ret > 0)
                read = ret;

because this code is *repeated*. I think it should use "read += ret;" so that if we have a few successful reads, they add up.

And then at the end:

        if (ret < 0)
                return ret;
        return read;

looks wrong too, since maybe the *last* iteration failed, but an earlier one succeeded, so I think it should be

        if (read)
                return read;
        return ret;

But hey, what do I know. I say "looks like obvious bugs", but I don't really know the code, and I may just be completely full of sh*t. The reason I think I'm full of sh*t is that you say that the problem occurs in gfs2_file_buffered_write(), not in that gfs2_file_direct_write() case. And the *buffered* case seems to get both of those "obvious bugs" right.

So I must be wrong, but I have to say, that looks odd to me. Now hopefully somebody who knows the code will explain to me why I'm wrong, and in the process go "Duh, but.." and see what's up.
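The two suspected bugs Linus points at reduce to one classic pattern: a loop that makes several partial transfers must accumulate its running total, not overwrite it. A minimal userspace model (all names hypothetical, not the gfs2 code):

```c
#include <assert.h>

/*
 * "read = ret" overwrites the progress of earlier iterations, so a
 * sequence of successful partial transfers reports only the last one.
 */
static long total_buggy(const long *chunks, int n)
{
	long read = 0;
	for (int i = 0; i < n; i++)
		if (chunks[i] > 0)
			read = chunks[i];	/* loses earlier progress */
	return read;
}

/*
 * "read += ret" accumulates, so the caller sees the true byte count.
 */
static long total_fixed(const long *chunks, int n)
{
	long read = 0;
	for (int i = 0; i < n; i++)
		if (chunks[i] > 0)
			read += chunks[i];	/* adds up as intended */
	return read;
}
```

With three partial transfers of 4096, 4096 and 512 bytes, the buggy version reports 512 where the fixed one reports 8704 - the same shape of off-by-everything error that would show up as data written at a wrong offset.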
Re: [Cluster-devel] [GIT PULL] gfs2 fixes
On Fri, Feb 11, 2022 at 9:05 AM Andreas Gruenbacher wrote: > > * Revert debug commit that causes unexpected data corruption. Well, apparently not just unexpected, but unexplained too. That's a bit worrisome. It sounds like the corruption cause is still there, just hidden by the lack of __cond_resched()? Linus
Re: [Cluster-devel] [PATCH v8 00/17] gfs2: Fix mmap + page fault deadlocks
On Fri, Oct 29, 2021 at 10:50 AM Catalin Marinas wrote:
>
> First of all, a uaccess in interrupt should not force such signal as it
> had nothing to do with the interrupted context. I guess we can do an
> in_task() check in the fault handler.

Yeah. It ends up being similar to the thread flag in that you still end up having to protect against NMI and other users of asynchronous page faults. So the suggestion was more of a "mindset" difference and modified version of the task flag rather than anything fundamentally different.

> Second, is there a chance that we enter the fault-in loop with a SIGSEGV
> already pending? Maybe it's not a problem, we just bail out of the loop
> early and deliver the signal, though unrelated to the actual uaccess in
> the loop.

If we ever run in user space with a pending per-thread SIGSEGV, that would already be a fairly bad bug. The intent of "force_sig()" is not only to make sure you can't block the signal, but also that it targets the particular thread that caused the problem: unlike other random "send signal to process", a SIGSEGV caused by a bad memory access is really local to that _thread_, not the signal thread group.

So somebody else sending a SIGSEGV asynchronously is actually very different - it goes to the thread group (although you can specify individual threads too - but once you do that you're already outside of POSIX).

That said, the more I look at it, the more I think I was wrong. I think the "we have a SIGSEGV pending" could act as the per-thread flag, but the complexity of the signal handling is probably an argument against it.

Not because a SIGSEGV could already be pending, but because so many other situations could be pending. In particular, the signal code won't send new signals to a thread if that thread group is already exiting.
So another thread may have already started the exit and core dump sequence, and is in the process of killing the shared signal threads, and if one of those threads is now in the kernel and goes through the copy_from_user() dance, that whole "thread group is exiting" will mean that the signal code won't add a new SIGSEGV to the queue.

So the signal could conceptually be used as the flag to stop looping, but it ends up being such a complicated flag that I think it's probably not worth it after all. Even if it semantically would be fairly nice to use pre-existing machinery.

Could it be worked around? Sure. That kernel loop probably has to check for fatal_signal_pending() anyway, so it would all work even in the presence of the above kinds of issues. But just the fact that I went and looked at just how exciting the signal code is made me think "ok, conceptually nice, but we take a lot of locks and we do a lot of special things even in the 'simple' force_sig() case".

> Third is the sigcontext.pc presented to the signal handler. Normally for
> SIGSEGV it points to the address of a load/store instruction and a
> handler could disable MTE and restart from that point. With a syscall we
> don't want it to point to the syscall place as it shouldn't be restarted
> in case it copied something.

I think this is actually independent of the whole "how to return errors". We'll still need to return an error from the system call, even if we also have a signal pending.
Re: [Cluster-devel] [PATCH v8 00/17] gfs2: Fix mmap + page fault deadlocks
On Thu, Oct 28, 2021 at 2:21 PM Catalin Marinas wrote:
>
> They do look fairly similar but we should have the information in the
> fault handler to distinguish: not a page fault (pte permission or p*d
> translation), in_task(), user address, fixup handler. But I agree the
> logic looks fragile.

So thinking about this a bit more, I think I have a possible suggestion for how to handle this..

The pointer color fault (or whatever some other architecture may do to generate sub-page faults) is not only not recoverable in the sense that we can't fix it up, it also ends up being a forced SIGSEGV (ie it can't be blocked - it has to either be caught or cause the process to be killed).

And the thing is, I think we could just make the rule be that kernel code that has this kind of retry loop with fault_in_pages() would force an EFAULT on a pending SIGSEGV. IOW, the pending SIGSEGV could effectively be exactly that "thread flag".

And that means that fault_in_xyz() wouldn't need to worry about this situation at all: the regular copy_from_user() (or whatever flavor it is - to/from/iter/whatever) would take the fault. And if it's a regular page fault, it would act exactly like it does now, so no changes. If it's a sub-page fault, we'd just make the rule be that we send a SIGSEGV even if the instruction in question has a user exception fixup.

Then we just need to add the logic somewhere that does "if active pending SIGSEGV, return -EFAULT". Of course, that logic might be in fault_in_xyz(), but it might also be a separate function entirely.

So this does effectively end up being a thread flag, but it's also slightly more than that - it's that a sub-page fault from kernel mode has semantics that a regular page fault does not. The whole "kernel access doesn't cause SIGSEGV, but returns -EFAULT instead" has always been an odd and somewhat wrong-headed thing. Of course it should cause a SIGSEGV, but that's not how Unix traditionally worked.
We would just say "color faults always raise a signal, even if the color fault was triggered in a system call". (And I didn't check: I say "SIGSEGV" above, but maybe the pointer color faults are actually SIGBUS? Doesn't change the end result). Linus
Re: [Cluster-devel] [PATCH v8 00/17] gfs2: Fix mmap + page fault deadlocks
On Wed, Oct 27, 2021 at 12:13 PM Catalin Marinas wrote: > > As an alternative, you mentioned earlier that a per-thread fault status > was not feasible on x86 due to races. Was this only for the hw poison > case? I think the uaccess is slightly different. It's not x86-specific, it's very generic. If we set some flag in the per-thread status, we'll need to be careful about not overwriting it if we then have a subsequent NMI that _also_ takes a (completely unrelated) page fault - before we then read the per-thread flag. Think 'perf' and fetching backtraces etc. Note that the NMI page fault can easily also be a pointer coloring fault on arm64, for exactly the same reason that whatever original copy_from_user() code was. So this is not a "oh, pointer coloring faults are different". They have the same re-entrancy issue. And both the "pagefault_disable" and "fault happens in interrupt context" cases are also the exact same 'faulthandler_disabled()' thing. So even at fault time they look very similar. So we'd have to have some way to separate out only the one we care about. Linus
Re: [Cluster-devel] [PATCH v8 00/17] gfs2: Fix mmap + page fault deadlocks
On Tue, Oct 26, 2021 at 11:50 AM Linus Torvalds wrote:
>
> Because for _most_ cases of "copy_to/from_user()" and friends by far,
> the only thing we look for is "zero for success".

Gaah. Looking around, I almost immediately found some really odd exceptions to this.

Like parse_write_buffer_into_params() in amdgpu_dm_debugfs.c, which does

        r = copy_from_user(wr_buf_ptr, buf, wr_buf_size);

        /* r is bytes not be copied */
        if (r >= wr_buf_size) {
                DRM_DEBUG_DRIVER("user data not be read\n");
                return -EINVAL;
        }

and allows a partial copy to just silently succeed, because all the callers have pre-cleared the wr_buf_ptr buffer. I have no idea why the code does that - it seems to imply that user space could give an invalid 'size' parameter and mean to write only the part that didn't succeed.

Adding AMD GPU driver people just to point out that this code not only has odd whitespace, but that the pattern for "couldn't copy from user space" should basically always be

        if (copy_from_user(wr_buf_ptr, buf, wr_buf_size))
                return -EFAULT;

because if user-space passes in a partially invalid buffer, you generally really shouldn't say "ok, I got part of it and will use that part".

There _are_ exceptions. We've had situations where user-space only passes in the pointer to the buffer, but not the size. Bad interface, but it historically happens for the 'mount()' system call 'data' pointer. So then we'll copy "up to a page size".

Anyway, there are clearly some crazy users, and converting them all to also check for negative error returns might be more painful than I thought it would be.
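The convention being discussed - copy_from_user() returns the number of bytes it could *not* copy, so zero means complete success - can be modeled in userspace. A hedged sketch with a mock (the `fail_at` parameter simulating an unmapped tail of the user buffer is purely an illustration device):

```c
#include <assert.h>
#include <string.h>

/*
 * Userspace stand-in for copy_from_user()'s return convention: copy as
 * much as is accessible and return the count of bytes left uncopied.
 * Zero therefore means "everything copied".
 */
static unsigned long mock_copy_from_user(char *dst, const char *src,
					 unsigned long n,
					 unsigned long fail_at)
{
	unsigned long ok = n < fail_at ? n : fail_at;

	memcpy(dst, src, ok);
	return n - ok;		/* bytes NOT copied */
}
```

Which is why the canonical caller pattern is simply `if (copy_from_user(dst, buf, size)) return -EFAULT;` - any nonzero return means the user buffer was at least partially bad.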
Re: [Cluster-devel] [PATCH v8 00/17] gfs2: Fix mmap + page fault deadlocks
On Tue, Oct 26, 2021 at 11:24 AM Catalin Marinas wrote:
>
> While more intrusive, I'd rather change copy_page_from_iter_atomic()
> etc. to take a pointer where to write back an error code.

I absolutely hate this model. The thing is, going down that rat-hole, you'll find that you'll need to add it to *all* the "copy_to/from_user()" cases, which isn't acceptable. So then you start doing some duplicate versions with different calling conventions, just because of things like this.

So no, I really don't want a "pass down a reference to an extra error code" kind of horror.

That said, the fact that these sub-page faults are always non-recoverable might be a hint to a solution to the problem: maybe we could extend the existing return code with actual negative error numbers.

Because for _most_ cases of "copy_to/from_user()" and friends by far, the only thing we look for is "zero for success". We could extend the "number of bytes _not_ copied" semantics to say "negative means fatal", and because there are fairly few places that actually look at non-zero values, we could have a coccinelle script that actually marks those places.

End result: no change in calling conventions, no change to most users, and the (relatively few) cases where we look at the "what about partial results", we just add a

        .. existing code ..
        ret = copy_from_user(..);
+       if (ret < 0)
+               break; // or whatever "fatal error" situation
        .. existing code ..

kind of thing that just stops the re-try.

(The coccinelle script couldn't actually do that, but it could add some comment marker or something so that it's easy to find and then manually fix up the places it finds.)
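The proposal above is a three-way ("ternary") result squeezed into one return value: zero for complete success, positive for bytes not copied (retryable), negative for a fatal sub-page fault. A sketch of what the caller-side decision would look like - this is a hypothetical convention being proposed in the thread, not an existing kernel API:

```c
#include <assert.h>

/*
 * Hypothetical decision helper for the proposed extended convention:
 *   ret == 0  -> everything copied, done
 *   ret  < 0  -> fatal (e.g. sub-page / pointer-color fault): give up
 *   ret  > 0  -> partial copy: fault the rest in and retry
 */
static int should_retry(long ret)
{
	if (ret == 0)
		return 0;	/* complete success */
	if (ret < 0)
		return 0;	/* fatal: propagate the error instead */
	return 1;		/* partial: fault in and retry */
}
```

Note how "zero means success" stays intact for the overwhelming majority of callers, which is the whole point of the proposal.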
Re: [Cluster-devel] [PATCH v8 00/17] gfs2: Fix mmap + page fault deadlocks
On Fri, Oct 22, 2021 at 8:06 AM Catalin Marinas wrote: > > Probing only the first byte(s) in fault_in() would be ideal, no need to > go through all filesystems and try to change the uaccess/probing order. Let's try that. Or rather: probing just the first page - since there are users like that btrfs ioctl, and the direct-io path. Linus
Re: [Cluster-devel] [RFC][arm64] possible infinite loop in btrfs search_ioctl()
On Thu, Oct 21, 2021 at 4:42 AM Andreas Gruenbacher wrote:
>
> But probing the entire memory range in fault domain granularity in the
> page fault-in functions still doesn't actually make sense. Those
> functions really only need to guarantee that we'll be able to make
> progress eventually. From that point of view, it should be enough to
> probe the first byte of the requested memory range

That's probably fine. Although it should be more than one byte - "copy_from_user()" might do word-at-a-time optimizations, so you could have an infinite loop of

 (a) copy_from_user() fails because the chunk it tried to get failed partly

 (b) fault_in() probing succeeds, because the beginning part is fine

so I agree that the fault-in code doesn't need to do the whole area, but it needs to at least do *some* thing, to handle the situation where the copy_to/from_user requires more than a single byte.
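The (a)/(b) livelock above can be made concrete with a toy model (all names and the 8-byte word size are illustrative): a word-at-a-time copy fails if *any* byte of the word faults, while a probe of only the first byte keeps reporting success, so the copy/fault-in pair never makes progress:

```c
#include <assert.h>

/*
 * Toy model: copy in 8-byte words; a word copy fails entirely if it
 * touches the faulting byte at offset 'fault_at'. Returns the number
 * of bytes actually copied.
 */
static unsigned long word_copy(unsigned long len, unsigned long fault_at)
{
	unsigned long done = 0;

	while (done < len) {
		if (done + 8 > fault_at)	/* word overlaps bad byte */
			break;
		done += 8;
	}
	return done;
}

/* A one-byte probe only checks offset 0, which is fine. */
static int probe_first_byte(unsigned long fault_at)
{
	return fault_at > 0;	/* nonzero: probe "succeeds" */
}
```

With a fault at byte 3, the copy makes zero progress while the single-byte probe succeeds - retry forever. Probing at least one full copy granule (or the first page) breaks the loop.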
Re: [Cluster-devel] [PATCH v8 00/17] gfs2: Fix mmap + page fault deadlocks
On Wed, Oct 20, 2021 at 12:44 PM Catalin Marinas wrote:
>
> However, with MTE doing both get_user() every 16 bytes and
> gup can get pretty expensive.

So I really think that anything that is performance-critical had better only do the "fault_in_write()" code path in the cold error path where you took a page fault.

IOW, the loop structure should be

        repeat:
                take_locks();
                pagefault_disable();
                ret = do_things_with_locks();
                pagefault_enable();
                release_locks();

                // Failure path?
                if (unlikely(ret == -EFAULT)) {
                        if (fault_in_writable(..))
                                goto repeat;
                        return -EFAULT;
                }

rather than have it be some kind of "first do fault_in_writable(), then do the work" model.

So I wouldn't worry too much about the performance concerns. It simply shouldn't be a common or hot path.

And yes, I've seen code that does that "fault_in_xyz()" before the critical operation that cannot take page faults, and does it unconditionally. But then it isn't the "fault_in_xyz()" that should be blamed if it is slow, but the caller that does things the wrong way around.
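The loop structure Linus sketches can be exercised as a runnable userspace model. The lock/pagefault calls are stubbed out as comments, and a counter fakes a single -EFAULT on the first pass; every name here is illustrative:

```c
#include <assert.h>
#include <errno.h>

static int faults_left = 1;	/* fake: first pass hits one fault */

static int do_things_with_locks(void)
{
	if (faults_left) {
		faults_left--;
		return -EFAULT;	/* disabled page fault -> -EFAULT */
	}
	return 0;		/* second pass: work completes */
}

static int fault_in_writable(void)
{
	return 1;		/* nonzero: fault-in succeeded (toy) */
}

static int op(void)
{
	int ret;
repeat:
	/* take_locks(); pagefault_disable(); */
	ret = do_things_with_locks();
	/* pagefault_enable(); release_locks(); */

	/* Cold path: only fault in after the work actually faulted */
	if (ret == -EFAULT) {
		if (fault_in_writable())
			goto repeat;
		return -EFAULT;
	}
	return ret;
}
```

The point of the design: the expensive fault-in never runs on the hot path - it is reached only after an actual -EFAULT, so its cost is irrelevant in the common case.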
Re: [Cluster-devel] [PATCH v8 00/17] gfs2: Fix mmap + page fault deadlocks
On Wed, Oct 20, 2021 at 6:37 AM Catalin Marinas wrote: > > The atomic "add zero" trick isn't that simple for MTE since the arm64 > atomic or exclusive instructions run with kernel privileges and > therefore with the kernel tag checking mode. Are there any instructions that are useful for "probe_user_write()" kind of thing? We could always just add that as an arch function, with a fallback to using the futex "add zero" if the architecture doesn't need anything special. Although at least for MTE, I think the solution was to do a regular read, and that checks the tag, and then we could use the gup machinery for the writability checks. Linus
Re: [Cluster-devel] [PATCH v8 00/17] gfs2: Fix mmap + page fault deadlocks
On Tue, Oct 19, 2021 at 3:42 AM Andreas Gruenbacher wrote: > > From my point of view, the following questions remain: > > * I hope these patches will be merged for v5.16, but what process >should I follow for that? The patch queue contains mm and iomap >changes, so a pull request from the gfs2 tree would be unusual. Oh, I'd much rather get these as one pull request from the author and from the person that actually ended up testing this. It might be "unusual", but it's certainly not unheard of, and trying to push different parts of the series through different maintainers would just cause lots of extra churn. Yes, normally I'd expect filesystem changes to have a diffstat that clearly shows that "yes, it's all local to this filesystem", and when I see anything else it raises red flags. But it raises red flags not because it would be wrong to have changes to other parts, but simply because when cross-subsystem development happens, it needs to be discussed and cleared with people. And you've done that. So I'd take this as one pull request from you. You've been doing the work, you get the questionable glory of being in charge of it all. You'll get the blame too ;) > * Will Catalin Marinas's work for supporting arm64 sub-page faults >be queued behind these patches? We have an overlap in >fault_in_[pages_]readable fault_in_[pages_]writeable, so one of >the two patch queues will need some adjustments. I think that on the whole they should be developed separately, I don't think it's going to be a particularly difficult conflict. That whole discussion does mean that I suspect that we'll have to change fault_in_iov_iter_writeable() to do the "every 16 bytes" or whatever thing, and make it use an actual atomic "add zero" or whatever rather than walk the page tables. But that's a conceptually separate discussion from this one, I wouldn't actually want to mix up the two issues too much. 
Sure, they touch the same code, so there is _that_ overlap, but one is about "the hardware rules are a-changing" and the other is about filesystem use of - and expansion of - the things we do. Let's keep them separate until ready, and then fix up the fallout at that point (either as a merge resolution, or even partly after-the-fact). Linus
Re: [Cluster-devel] [RFC][arm64] possible infinite loop in btrfs search_ioctl()
On Tue, Oct 12, 2021 at 10:27 AM Catalin Marinas wrote:
>
> Apart from fault_in_pages_*(), there's also fault_in_user_writeable()
> called from the futex code which uses the GUP mechanism as the write
> would be destructive. It looks like it could potentially trigger the
> same infinite loop on -EFAULT.

Hmm. I think the reason we do fault_in_user_writeable() using GUP is that

 (a) we can avoid the page fault overhead

 (b) we don't have any good "atomic_inc_user()" interface or similar that could do a write with a zero increment or something like that.

We do have that "arch_futex_atomic_op_inuser()" thing, of course. It's all kinds of crazy, but we *could* do

        arch_futex_atomic_op_inuser(FUTEX_OP_ADD, 0, &oldval, uaddr);

instead of doing the fault_in_user_writeable(). That might be a good idea anyway. I dunno. But I agree other options exist:

> I wonder whether we should actually just disable tag checking around the
> problematic accesses. What these callers seem to have in common is using
> pagefault_disable/enable(). We could abuse this to disable tag checking
> or maybe in_atomic() when handling the exception to lazily disable such
> faults temporarily.

Hmm. That would work for MTE, but possibly be very inconvenient for other situations.

> A more invasive change would be to return a different error for such
> faults like -EACCESS and treat them differently in the caller.

That's _really_ hard for things like "copy_to_user()", that isn't a single operation, and is supposed to return the bytes left. Adding another error return would be nasty. We've had hacks like "squirrel away the actual error code in the task structure", but that tends to be unmaintainable because we have interrupts (and NMI's) doing their own possibly nested atomics, so even disabling preemption won't actually fix some of the nesting issues.

All of these things make me think that the proper fix ends up being to make sure that our "fault_in_xyz()" functions simply should always handle all faults.
Another option may be to teach the GUP code to actually check architecture-specific sub-page ranges. Linus
Re: [Cluster-devel] [RFC][arm64] possible infinite loop in btrfs search_ioctl()
On Mon, Oct 11, 2021 at 2:08 PM Catalin Marinas wrote:
>
> +#ifdef CONFIG_ARM64_MTE
> +#define FAULT_GRANULE_SIZE (16)
> +#define FAULT_GRANULE_MASK (~(FAULT_GRANULE_SIZE-1))
[...]
> If this looks in the right direction, I'll do some proper patches
> tomorrow.

Looks fine to me. It's going to be quite expensive and bad for caches, though.

That said, fault_in_writable() is _supposed_ to all be for the slow path when things go south and the normal path didn't work out, so I think it's fine.

I do wonder how the sub-page granularity works. Is it sufficient to just read from it? Because then a _slightly_ better option might be to do one write per page (to catch page table writability) and then one read per "granule" (to catch pointer coloring or cache poisoning issues)?

That said, since this is all preparatory to us wanting to write to it eventually anyway, maybe marking it all dirty in the caches is only good.
Re: [Cluster-devel] [RFC][arm64] possible infinite loop in btrfs search_ioctl()
On Mon, Oct 11, 2021 at 10:38 AM Catalin Marinas wrote:
>
> I cleaned up this patch [1] but I realised it still doesn't solve it.
> The arm64 __copy_to_user_inatomic(), while ensuring progress if called
> in a loop, it does not guarantee precise copy to the fault position.

That should be ok. We've always allowed the user copy to return early if it does word copies and hits a page crosser that causes a fault.

Any user then has the choice of:

 - partial copies are bad

 - partial copies are handled and then you retry from the place copy_to_user() failed at

and in that second case, the next time around, you'll get the fault immediately (or you'll make some more progress - maybe the user copy loop did something different just because the length and/or alignment was different). If you get the fault immediately, that's -EFAULT. And if you make some more progress, it's again up to the caller to rinse and repeat.

End result: user copy functions do not have to report errors exactly. It is the caller that has to handle the situation. Most importantly: "exact error or not" doesn't actually _matter_ to the caller. If the caller does the right thing for an exact error, it will do the right thing for an inexact one too. See above.

> The copy_to_sk(), after returning an error, starts again from the previous
> sizeof(sh) boundary rather than from where the __copy_to_user_inatomic()
> stopped. So it can get stuck attempting to copy the same search header.

That seems to be purely a copy_to_sk() bug. Or rather, it looks like a bug in the caller. copy_to_sk() itself does

        if (copy_to_user_nofault(ubuf + *sk_offset, &sh, sizeof(sh))) {
                ret = 0;
                goto out;
        }

and the comment says

 * 0: all items from this leaf copied, continue with next

but that comment is then obviously not actually true in that it's not "continue with next" at all.

But this is all very much a bug in the btrfs search_ioctl()/copy_to_sk() code: it simply doesn't do the proper thing for a partial result.
Because no, "just retry the whole thing" is by definition not the proper thing.

That said, I think that if we can have faults at non-page-aligned boundaries, then we just need to make fault_in_pages_writeable() check non-page boundaries.

> An ugly fix is to fall back to byte by byte copying so that we can
> attempt the actual fault address in fault_in_pages_writeable().

No, changing the user copy machinery is wrong. The caller really has to do the right thing with partial results.

And I think we need to make fault_in_pages_writeable() match the actual faulting cases - maybe remove the "pages" part of the name? That would fix the btrfs code - it's not doing the right thing as-is, but it's "close enough" to right that I think fixing fault_in_pages_writeable() should fix it.

No?
Re: [Cluster-devel] [PATCH v7 16/19] iomap: Add done_before argument to iomap_dio_rw
On Thu, Sep 9, 2021 at 4:31 AM Christoph Hellwig wrote: > > What about just passing done_before as an argument to > iomap_dio_complete? gfs2 would have to switch to __iomap_dio_rw + > iomap_dio_complete instead of iomap_dio_rw for that, and it obviously > won't work for async completions, but you force sync in this case > anyway, right? I think you misunderstand. Or maybe I do. It very much doesn't force sync in this case. It did the *first* part of it synchronously, but then it wants to continue with that async part for the rest, and very much do that async completion. And that's why it wants to add that "I already did X much of the work", exactly so that the async completion can report the full end result. But maybe now it's me who is misunderstanding. Linus
Re: [Cluster-devel] [PATCH v7 17/19] gup: Introduce FOLL_NOFAULT flag to disable page faults
On Thu, Sep 9, 2021 at 4:36 AM Christoph Hellwig wrote:
>
> On Fri, Aug 27, 2021 at 06:49:24PM +0200, Andreas Gruenbacher wrote:
> > Introduce a new FOLL_NOFAULT flag that causes get_user_pages to return
> > -EFAULT when it would otherwise trigger a page fault. This is roughly
> > similar to FOLL_FAST_ONLY but available on all architectures, and less
> > fragile.
>
> So, FOLL_FAST_ONLY only has one single user through
> get_user_pages_fast_only (pin_user_pages_fast_only is entirely unused,
> which makes totally sense given that give up on fault and pin are not
> exactly useful semantics).

So I think we should treat FOLL_FAST_ONLY as a special "internal to gup.c" flag, and perhaps not really compare it to the new FOLL_NOFAULT. In fact, maybe we could even just make FOLL_FAST_ONLY be the high bit, and not expose it in <linux/mm.h>, and make it entirely private as a name in gup.c.

Because FOLL_FAST_ONLY really is meant more as a "this way we can share code easily inside gup.c, by having the internal helpers that *can* do everything, but not do it all when the user is one of the limited interfaces". Because we don't really expect people to use FOLL_FAST_ONLY externally - they'll use the explicit interfaces we have instead (ie "get_user_pages_fast()"). Those use-cases that want that fast-only thing really are so special that they need to be very explicitly so.

FOLL_NOFAULT is different, in that that is something an external user _would_ use. Admittedly we'd only have one single case for now, but I think we may end up with other filesystems - or other cases entirely - having that same kind of "I am holding locks, so I can't fault into the MM, but I'm otherwise ok with the immediate mmap_sem lock usage and sleeping".

End result: FOLL_FAST_ONLY and FOLL_NOFAULT have some similarities, but at the same time I think they are fundamentally different.
The FAST_ONLY is the very very special "I can't sleep, I can't even take the fundamental MM lock, and we export special interfaces because it's _so_ special and can be used in interrupts etc". In contrast, NOFAULT is not _that_ special. It's just another flag, and has generic use. Linus
Re: [Cluster-devel] [PATCH v7 00/19] gfs2: Fix mmap + page fault deadlocks
On Fri, Sep 3, 2021 at 11:28 AM Al Viro wrote: > > FWIW, my objections regarding the calling conventions are still there. So I'm happy to further change the calling conventions, but by now _that_ part is most definitely a "not this merge window". The need for that ternary state is still there. It might go away in the future, but I think that's literally that: a future cleanup. Not really related to the problem at hand. Linus
Re: [Cluster-devel] [PATCH v7 00/19] gfs2: Fix mmap + page fault deadlocks
On Wed, Sep 1, 2021 at 12:53 PM Andreas Gruenbacher wrote: > > So there's a minor merge conflict between Christoph's iomap_iter > conversion and this patch queue now, and I should probably clarify the > description of "iomap: Add done_before argument to iomap_dio_rw" that > Darrick ran into. Then there are the user copy issues that Al has > pointed out. Fixing those will create superficial conflicts with this > patch queue, but probably nothing serious. > > So how should I proceed: do you expect a v8 of this patch queue on top > of the current mainline? So if you rebase for fixes, it's going to be a "next merge window" thing again. Personally, I'm ok with the series as is, and the conflict isn't an issue. So I'd take it as is, and then people can fix up niggling issues later. But if somebody screams loudly.. Linus
Re: [Cluster-devel] [PATCH v7 04/19] iov_iter: Turn iov_iter_fault_in_readable into fault_in_iov_iter_readable
On Fri, Aug 27, 2021 at 1:56 PM Kari Argillander wrote: > > At least this patch will break ntfs3 which is in next. It has been there > just couple weeks so I understand. I added Konstantin and ntfs3 list so > that we know what is going on. Can you please info if and when do we > need rebase. No need to rebase. It just makes it harder for me to pick one pull over another, since it would mix the two things together. I'll notice the semantic conflict as I do my merge build test, and it's easy for me to fix as part of the merge - whichever one I merge later. It's good if both sides remind me about the issue, but these kinds of conflicts are not a problem. And yes, it does happen that I miss conflicts like this if I merge while on the road and don't do my full build tests, or if it's some architecture-specific thing or a problem that doesn't happen on my usual allmodconfig testing. But neither of those cases should be present in this situation. Linus
Re: [Cluster-devel] [PATCH v7 16/19] iomap: Add done_before argument to iomap_dio_rw
On Fri, Aug 27, 2021 at 2:32 PM Darrick J. Wong wrote: > > No, because you totally ignored the second question: > > If the directio operation succeeds even partially and the PARTIAL flag > is set, won't that push the iov iter ahead by however many bytes > completed? > > We already finished the IO for the first page, so the second attempt > should pick up where it left off, i.e. the second page. Darrick, I think you're missing the point. It's the *return*value* that is the issue, not the iovec. The iovec is updated as you say. But the return value from the async part is - without Andreas' patch - only the async part of it. With Andreas' patch, the async part will now return the full return value, including the part that was done synchronously. And the return value is returned from that async part, which somehow thus needs to know what predated it. Could that pre-existing part perhaps be saved somewhere else? Very possibly. That 'struct iomap_dio' addition is kind of ugly. So maybe what Andreas did could be done differently. But I think you guys are arguing past each other. Linus
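The return-value point being argued can be modeled compactly: the async completion has to fold in the bytes that were already written synchronously before the async tail was queued, so the final return value covers the whole request. This is a hedged toy sketch of the idea, not the actual iomap `struct iomap_dio` code; the names are hypothetical:

```c
#include <assert.h>
#include <errno.h>

/*
 * Toy model of the "done_before" idea: the completion path knows how
 * many bytes were done synchronously before the async part started.
 */
struct dio_state {
	long done_before;	/* bytes completed synchronously */
};

static long dio_complete(const struct dio_state *dio, long async_ret)
{
	if (async_ret < 0)
		/* report partial progress if any, else the error */
		return dio->done_before ? dio->done_before : async_ret;
	return dio->done_before + async_ret;
}
```

Without the stored `done_before`, the async completion could only report its own slice, and the caller would see a short (or failed) write even though the front of the request succeeded - which is exactly the discrepancy the patch addresses.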
Re: [Cluster-devel] [PATCH v7 05/19] iov_iter: Introduce fault_in_iov_iter_writeable
On Fri, Aug 27, 2021 at 12:23 PM Al Viro wrote:
>
> Could you show the cases where "partial copy, so it's OK" behaviour would
> break anything?

Absolutely. For example, it would cause an infinite loop in restore_fpregs_from_user() if the "buf" argument is a situation where the first page is fine, but the next page is not.

Why? Because __restore_fpregs_from_user() would take a fault, but then fault_in_pages_readable() (renamed) would succeed, so you'd just do that "retry" forever and ever.

Probably there are a number of other places too. That was literally the *first* place I looked at. Seriously.

The current semantics are "check the whole area". THOSE MUST NOT CHANGE.
Re: [Cluster-devel] [PATCH v7 05/19] iov_iter: Introduce fault_in_iov_iter_writeable
On Fri, Aug 27, 2021 at 11:52 AM Al Viro wrote: > > Again, the calling conventions are wrong. Make it success/failure or > 0/-EFAULT. And it's inconsistent for iovec and non-iovec cases as it is. Al, the 0/-EFAULT thing DOES NOT WORK. The whole "success vs failure" model is broken. Because "success" for some people is "everything worked". And for other people it is "at least _part_ of it worked". So no, 0/-EFAULT fundamentally cannot work, because the return needs to be able to handle that ternary situation (ie "nothing" vs "something" vs "everything"). This is *literally* the exact same thing that we have for copy_to/from_user(). And Andreas' solution (based on my suggestion) is the exact same one that we have had for that code since basically day #1. The whole "0/-EFAULT" is simpler, yes. And it's what "{get|put}_user()" uses, yes. And it's more common to a lot of other functions that return zero or an error. But see above. People *need* that ternary result, and "bytes/pages uncopied" is not only the traditional one we use elsewhere in similar situations, it's the one that has the easiest error tests for existing users (because zero remains "everything worked"). Andreas originally had that "how many bytes/pages succeeded" return value instead, and yes, that's also ternary. But it means that now the common "complete success" test ends up being a lot uglier, and the semantics of the function changes completely where "0" no longer means success, and that messes up much more. So I really think you are barking entirely up the wrong tree. If there is any inconsistency, maybe we should make _more_ cases use that "how many bytes/pages not copied" logic, but in a lot of cases you don't actually need the ternary decision value. So the inconsistency is EXACTLY the same as the one we have always had for get|put_user() vs copy_to|from_user(), and it exists for the EXACT same reason. IOW, please explain how you'd solve the ternary problem without making the code a lot uglier. 
Linus
Re: [Cluster-devel] [PATCH v7 04/19] iov_iter: Turn iov_iter_fault_in_readable into fault_in_iov_iter_readable
On Fri, Aug 27, 2021 at 11:53 AM Al Viro wrote:
>
> I really disagree with these calling conventions. "Number not faulted in"
> is bloody useless

It's what we already have for copy_to/from_user(), so it's actually consistent with that. And it avoids changing all the existing tests where people really cared only about the "everything ok" case.

Andreas' first patch did that changed version, and it was ugly as hell. But if you have a version that avoids the ugliness...

Linus
Re: [Cluster-devel] [PATCH v7 00/19] gfs2: Fix mmap + page fault deadlocks
On Fri, Aug 27, 2021 at 9:49 AM Andreas Gruenbacher wrote:
>
> here's another update on top of v5.14-rc7. Changes:
>
> * Some of the patch descriptions have been improved.
>
> * Patch "gfs2: Eliminate ip->i_gh" has been moved further to the front.
>
> At this point, I'm not aware of anything that still needs fixing,

From a quick scan, I didn't see anything that raised my hackles. But I skipped all the gfs2-specific changes in the series, since that's all above my paygrade.

Linus
Re: [Cluster-devel] [PATCH v5 00/12] gfs2: Fix mmap + page fault deadlocks
On Thu, Aug 19, 2021 at 12:41 PM Andreas Gruenbacher wrote: > > Hmm, what if GUP is made to skip VM_IO vmas without adding anything to > the pages array? That would match fault_in_iov_iter_writeable, which > is modeled after __mm_populate and which skips VM_IO and VM_PFNMAP > vmas. I don't understand what you mean.. GUP already skips VM_IO (and VM_PFNMAP) pages. It just returns EFAULT. We could make it return another error. We already have DAX and FOLL_LONGTERM returning -EOPNOTSUPP. Of course, I think some code ends up always just returning "number of pages looked up" and might return 0 for "no pages" rather than the error for the first page. So we may end up having interfaces that then lose that explanation error code, but I didn't check. But we couldn't make it just say "skip them and try later addresses", if that is what you meant. THAT makes no sense - that would just make GUP look up some other address than what was asked for. > > I also do still think that even regardless of that, we want to just > > add a FOLL_NOFAULT flag that just disables calling handle_mm_fault(), > > and then you can use the regular get_user_pages(). > > > > That at least gives us the full _normal_ page handling stuff. > > And it does fix the generic/208 failure. Good. So I think the approach is usable, even if we might have corner cases left. So I think the remaining issue is exactly things like VM_IO and VM_PFNMAP. Do the fstests have test-cases for things like this? It _is_ quite specialized, it might be a good idea to have that. Of course, doing direct-IO from special memory regions with zerocopy might be something special people actually want to do. But I think we've had that VM_IO flag testing there basically forever, so I don't think it has ever worked (for some definition of "ever"). Linus
Re: [Cluster-devel] [PATCH v5 00/12] gfs2: Fix mmap + page fault deadlocks
[ Sorry for the delay, I was on the road and this fell through the cracks ]

On Mon, Aug 16, 2021 at 12:14 PM Andreas Gruenbacher wrote:
> On Tue, Aug 3, 2021 at 9:45 PM Linus Torvalds wrote:
> >
> > Hmm. Have you tried to figure out why that "still returns 0" happens?
>
> The call stack is:
>
> gup_pte_range
> gup_pmd_range
> gup_pud_range
> gup_p4d_range
> gup_pgd_range
> lockless_pages_from_mm
> internal_get_user_pages_fast
> get_user_pages_fast
> iov_iter_get_pages
> __bio_iov_iter_get_pages
> bio_iov_iter_get_pages
> iomap_dio_bio_actor
> iomap_dio_actor
> iomap_apply
> iomap_dio_rw
> gfs2_file_direct_write
>
> In gup_pte_range, pte_special(pte) is true and so we return 0.

Ok, so that is indeed something that the fast-case can't handle, because some of the special code wants to have the mm_lock so that it can look at the vma flags (eg "vm_normal_page()" and friends).

That said, some of these cases even the full GUP won't ever handle, simply because a mapping doesn't necessarily even _have_ a 'struct page' associated with it if it's a VM_IO mapping.

So it turns out that you can't just always do fault_in_iov_iter_readable() and then assume that you can do iov_iter_get_pages() and repeat until successful.

We could certainly make get_user_pages_fast() handle a few more cases, but I get the feeling that we need to have separate error cases for EFAULT - no page exists - and the "page exists, but cannot be mapped as a 'struct page'" case.

I also do still think that even regardless of that, we want to just add a FOLL_NOFAULT flag that just disables calling handle_mm_fault(), and then you can use the regular get_user_pages(). That at least gives us the full _normal_ page handling stuff.

Linus
Re: [Cluster-devel] [PATCH v5 00/12] gfs2: Fix mmap + page fault deadlocks
On Tue, Aug 3, 2021 at 12:18 PM Andreas Gruenbacher wrote:
>
> With this patch queue, fstest generic/208 (aio-dio-invalidate-failure.c)
> endlessly spins in gfs2_file_direct_write. It looks as if there's a bug
> in get_user_pages_fast when called with FOLL_FAST_ONLY:
>
> (1) The test case performs an aio write into a 32 MB buffer.
>
> (2) The buffer is initially not in memory, so when iomap_dio_rw() ->
>     ... -> bio_iov_iter_get_pages() is called with the iter->noio flag
>     set, we get to get_user_pages_fast() with FOLL_FAST_ONLY set.
>     get_user_pages_fast() returns 0, which causes
>     bio_iov_iter_get_pages to return -EFAULT.
>
> (3) Then gfs2_file_direct_write faults in the entire buffer with
>     fault_in_iov_iter_readable(), which succeeds.
>
> (4) With the buffer in memory, we retry the iomap_dio_rw() ->
>     ... -> bio_iov_iter_get_pages() -> ... -> get_user_pages_fast().
>     This should succeed now, but get_user_pages_fast() still returns 0.

Hmm. Have you tried to figure out why that "still returns 0" happens?

One option - for debugging only - would be to introduce a new flag to get_user_pages_fast() that says "print out reason if failed" and make the retry (but not the original one) have that flag set.

There are a couple of things of note when it comes to "get_user_pages_fast()":

 (a) some architectures don't even enable it

 (b) it can be very picky about the page table contents, and wants the
     accessed bit to already be set (or the dirty bit, in the case of a
     write).

but (a) shouldn't be an issue on any common platform and (b) shouldn't be an issue with fault_in_iov_iter_readable() that actually does a __get_user() so it will access through the page tables.

(It might be more of an issue with fault_in_iov_iter_writable() due to walking the page tables by hand - if we don't do the proper access/dirty setting, I could see get_user_pages_fast() failing.)
Anyway, for reason (a) I do think that eventually we should probably introduce FOLL_NOFAULT, and allow the full "slow" page table walk - just not calling down to handle_mm_fault() if it fails.

But (a) should be a non-issue in your test environment, and so it would be interesting to hear what it is that fails. Because scanning through the patches, they all _look_ fine to me (apart from the one comment about return values, which is more about being consistent with copy_to/from_user() and making the code simpler - not about correctness).

Linus
Re: [Cluster-devel] [PATCH v5 03/12] Turn fault_in_pages_{readable, writeable} into fault_in_{readable, writeable}
On Tue, Aug 3, 2021 at 12:18 PM Andreas Gruenbacher wrote:
>
> Turn fault_in_pages_{readable,writeable} into versions that return the number
> of bytes faulted in instead of returning a non-zero value when any of the
> requested pages couldn't be faulted in.

Ugh. This ends up making most users have horribly nasty conditionals.

I think I suggested (or maybe that was just my internal voice and I never actually vocalized it) using the same semantics as "copy_to/from_user()" for fault_in_pages_{readable|writable}(). Namely to return the number of bytes *not* faulted in.

That makes it trivial to test "did I get everything" - because zero means "nothing failed" and remains the "complete success" marker. And it still allows for the (much less common) case of "ok, I need to know about partial failures".

So instead of this horror:

-       if (!fault_in_pages_writeable(buf_fx, fpu_user_xstate_size))
+       if (fault_in_writeable(buf_fx, fpu_user_xstate_size) ==
+                       fpu_user_xstate_size)

you'd just have

-       if (!fault_in_pages_writeable(buf_fx, fpu_user_xstate_size))
+       if (!fault_in_writeable(buf_fx, fpu_user_xstate_size))

because zero would continue to be a marker of success.

Linus
Re: [Cluster-devel] [PATCH v4 1/8] iov_iter: Introduce iov_iter_fault_in_writeable helper
On Tue, Jul 27, 2021 at 4:14 AM Andreas Gruenbacher wrote: > > On Tue, Jul 27, 2021 at 11:30 AM David Laight wrote: > > > > Is it actually worth doing any more than ensuring the first byte > > of the buffer is paged in before entering the block that has > > to disable page faults? > > We definitely do want to process as many pages as we can, especially > if allocations are involved during a write. Yeah, from an efficiency standpoint, once you start walking page tables, it's probably best to just handle as much as you can. But once you get an error, I don't think it should be "everything is bad". This is a bit annoying, because while *most* users really just want that "everything is good", *some* users might just want to handle the partial success case. It's why "copy_to/from_user()" returns the number of bytes *not* written, rather than -EFAULT like get/put_user(). 99% of all users just want to know "did I write all bytes" (and then checking for a zero return is a simple and cheap verification of "everything was ok"). But then very occasionally, you hit a case where you actually want to know how much of a copy worked. It's rare, but it happens, and the read/write system calls tend to be the main user of it. And yes, the fact that "copy_to/from_user()" doesn't return an error (like get/put_user() does) has confused people many times over the years. It's annoying, but it's required by those (few) users that really do want to handle that partial case. I think this iov_iter_fault_in_readable/writeable() case should do the same. And no, it's not new to Andreas' patch. iov_iter_fault_in_readable() is doing the "everything has to be good" thing already. Which maybe implies that nobody cares about partial reads/writes. Or it's very very rare - I've seen code that handles page faults in user space, but it's admittedly been some very special CPU simulator/emulator checkpointing stuff. Linus
Re: [Cluster-devel] [PATCH v3 1/7] iov_iter: Introduce fault_in_iov_iter helper
On Mon, Jul 26, 2021 at 9:33 AM Jan Kara wrote:
>
> On Fri 23-07-21 22:58:34, Andreas Gruenbacher wrote:
> > +	gup_flags = FOLL_TOUCH | FOLL_POPULATE;
>
> I don't think FOLL_POPULATE makes sense here. It makes sense only with
> FOLL_MLOCK and determines whether mlock(2) should fault in missing pages or
> not.

Yeah, it won't hurt, but FOLL_POPULATE doesn't actually do anything unless FOLL_MLOCK is set. It is, as you say, a magic flag just for mlock.

The only ones that should matter are FOLL_WRITE (for obvious reasons) and FOLL_TOUCH (to set the accessed and dirty bits, rather than just the protection bits).

Linus
Re: [Cluster-devel] [PATCH v4 1/8] iov_iter: Introduce iov_iter_fault_in_writeable helper
On Sat, Jul 24, 2021 at 1:26 PM Al Viro wrote: > > On Sat, Jul 24, 2021 at 12:52:34PM -0700, Linus Torvalds wrote: > > > So I think the code needs to return 0 if _any_ fault was successful. > > s/any/the first/... Yes, but as long as we do them in order, and stop when it fails, "any" and "first" end up being the same thing ;) > The same goes for fault-in for read Yeah. That said, a partial write() (ie "read from user space") might be something that a filesystem is not willing to touch for other reasons, so I think returning -EFAULT in that case is, I think, slightly more reasonable. Things like "I have to prepare buffers to be filled" etc. The partial read() case is at least something that a filesystem should be able to handle fairly easily. And I don't think returning -EFAULT early is necessarily wrong - we obviously do it anyway if you give really invalid addresses. But we have generally strived to allow partial IO for missing pages, because people sometimes play games with unmapping things dynamically or using mprotect() to catch modifications (ie doing things like catch SIGSEGV/SIGBUS, and remap it). But those kinds of uses tend to have to catch -EFAULT anyway, so I guess it's not a big deal either way. Linus
Re: [Cluster-devel] [PATCH v4 1/8] iov_iter: Introduce iov_iter_fault_in_writeable helper
On Sat, Jul 24, 2021 at 12:35 PM Andreas Gruenbacher wrote: > > +int iov_iter_fault_in_writeable(const struct iov_iter *i, size_t bytes) > +{ ... > + if (fault_in_user_pages(start, len, true) != len) > + return -EFAULT; Looking at this once more, I think this is likely wrong. Why? Because any user can/should only care about at least *part* of the area being writable. Imagine that you're doing a large read. If the *first* page is writable, you should still return the partial read, not -EFAULT. So I think the code needs to return 0 if _any_ fault was successful. Or perhaps return how much it was able to fault in. Because returning -EFAULT if any of it failed seems wrong, and doesn't allow for partial success being reported. The other reaction I have is that you now only do the iov_iter_fault_in_writeable, but then you make fault_in_user_pages() still have that "bool write" argument. We already have 'fault_in_pages_readable()', and that one is more efficient (well, at least if the fault isn't needed it is). So it would make more sense to just implement fault_in_pages_writable() instead of that "fault_in_user_pages(, bool write)". Linus
Re: [Cluster-devel] [PATCH v3 1/7] iov_iter: Introduce fault_in_iov_iter helper
On Fri, Jul 23, 2021 at 1:58 PM Andreas Gruenbacher wrote: > > Introduce a new fault_in_iov_iter helper for manually faulting in an iterator. > Other than fault_in_pages_writeable(), this function is non-destructive. Again, as I pointed out in the previous version, "Other than" is not sensible language. You mean "Unlike". Same issue in the comment: > + * Other than fault_in_pages_writeable(), this function is non-destructive > even > + * when faulting in pages for writing. It really should be "Unlike fault_in_pages_writeable(), this function .." to parse correctly. I understand what you mean, but only because I know what fault_in_pages_writeable() does and what the issue was. And in a year or two, I might have forgotten, and wonder what you meant. Linus
Re: [Cluster-devel] [PATCH v2 0/6] gfs2: Fix mmap + page fault deadlocks
On Sun, Jul 18, 2021 at 3:39 PM Andreas Gruenbacher wrote: > > here's an update to the gfs2 mmap + page faults fix that implements > Linus's suggestion of disabling page faults while we're holding the > inode glock. Apart from some wording/naming issues, I think this looks a _lot_ better, and should fix the fundamental and underlying deadlock properly. So Ack from me with the trivial suggestions I sent to the individual patches. Linus
Re: [Cluster-devel] [PATCH v2 5/6] iov_iter: Introduce ITER_FLAG_FAST_ONLY flag
On Sun, Jul 18, 2021 at 3:40 PM Andreas Gruenbacher wrote: > > Introduce a new ITER_FLAG_FAST_ONLY flag I think the code is fine, but I think it might be best to call this "ITER_FLAG_NOIO" or something like that. The "FAST_ONLY" name makes sense in the context of "get_user_pages_fast()" where we have that "fast" naming (and the long history too). But I don't think it makes much sense as a name in the context of iov_iter. Linus
Re: [Cluster-devel] [PATCH v2 1/6] iov_iter: Introduce fault_in_iov_iter helper
On Sun, Jul 18, 2021 at 3:39 PM Andreas Gruenbacher wrote: > > Introduce a new fault_in_iov_iter helper for manually faulting in an iterator. > Other than fault_in_pages_writeable(), this function is non-destructive. You mean "Unlike" rather than "Other than" (also in the comment of the patch). This is fairly inefficient, but as long as it's the exceptional case, that's fine. It might be worth making that very explicit, so that people don't try to use it normally. Linus
Re: [Cluster-devel] [PATCH] gfs2: Fix mmap + page fault deadlocks
On Thu, Jul 1, 2021 at 5:20 PM Andreas Gruenbacher wrote: > > On Thu, Jul 1, 2021 at 11:41 PM Linus Torvalds > wrote: > > Also, I have to say that I think the direct-IO code is fundamentally > > mis-designed. Why it is doing the page lookup _during_ the IO is a > > complete mystery to me. Why wasn't that done ahead of time before the > > filesystem took the locks it needed? > > That would be inconvenient for reads, when the number of bytes read is > much smaller than the buffer size and we won't need to page in the > entire buffer. What? A file read will READ THE WHOLE BUFFER. We're not talking pipes or ttys here. If you ask for X bytes, you'll get X bytes. Of course, if you ask for more data than the file has, that's another thing, but who really does that with direct-IO? And if they do, why should we care about their silly behavior? Face it, right now direct-IO is *BUGGY* because of this, and you can deadlock filesystems with it. So tell me again how it's "inconvenient" to fix this bug, and fix the bad direct-IO design? Linus
Re: [Cluster-devel] [PATCH] gfs2: Fix mmap + page fault deadlocks
On Thu, Jul 1, 2021 at 5:30 PM Linus Torvalds wrote: > > > Of course, if you ask for more data than the file has, that's another > thing, but who really does that with direct-IO? And if they do, why > should we care about their silly behavior? Now, if the issue is that people do IO for bigger areas than you have memory for, then I think that's a chunking issue. I don't think the ITER_IOVEC necessarily needs to be converted to an ITER_BVEC in one single go. That would indeed be painful if somebody tries to do some huge direct-IO when they just don't have the memory for it. But the fact is, direct-IO has been an incredible pain-point for decades, because it's (a) unusual, (b) buggy and (c) has very little overall design and common code. The three issues are likely intimately tied together. The iomap code at least has tried to make for much more common code, but I really think some direct-IO people should seriously reconsider how they are doing things when there are fundamental deadlocks in the design. And I do think that a ITER_IOVEC -> ITER_BVEC conversion function might be a really really good idea to solve this problem. There's even a very natural chunking algorithm: try to do as much as possible with get_user_pages_fast() - so that if you already *have* the memory, you can do the full IO (or at least a big part of it). And if get_user_pages_fast() only gives you a small area - or nothing at all - you chunk it down aggressively, and realize that "oh, doing direct-IO when user space is paged out might not be the most optimal case". Linus
Re: [Cluster-devel] [PATCH] gfs2: Fix mmap + page fault deadlocks
On Thu, Jul 1, 2021 at 2:41 PM Linus Torvalds wrote: > > So what the direct-IO code _should_ do is to turn an ITER_IOVEC into a > ITER_KVEC by doing the page lookup ahead of time Actually, an ITER_BVEC, not ITER_KVEC. It wants a page array, not a kernel pointer array. But I hope people understood what I meant.. Linus
Re: [Cluster-devel] [PATCH] gfs2: Fix mmap + page fault deadlocks
On Thu, Jul 1, 2021 at 1:43 PM Andreas Gruenbacher wrote: > > here's another attempt at fixing the mmap + page fault deadlocks we're > seeing on gfs2. Still not ideal because get_user_pages_fast ignores the > current->pagefault_disabled flag Of course get_user_pages_fast() ignores the pagefault_disabled flag, because it doesn't do any page faults. If you don't want to fall back to the "maybe do IO" case, you should use the FOLL_FAST_ONLY flag - or get_user_pages_fast_only(), which does that itself. > For getting get_user_pages_fast changed to fix this properly, I'd need > help from the memory management folks. I really don't think you need anything at all from the mm people, because we already support that whole "fast only" case. Also, I have to say that I think the direct-IO code is fundamentally mis-designed. Why it is doing the page lookup _during_ the IO is a complete mystery to me. Why wasn't that done ahead of time before the filesystem took the locks it needed? So what the direct-IO code _should_ do is to turn an ITER_IOVEC into a ITER_KVEC by doing the page lookup ahead of time, and none of these issues should even exist, and then the whole pagefault_disabled and/or FOLL_FAST_ONLY would be a complete non-issue. Is there any reason why that isn't what it does (other than historical baggage)? Linus
Re: [Cluster-devel] [RFC 5/9] iov_iter: Add iov_iter_fault_in_writeable()
On Sat, Jun 12, 2021 at 2:47 PM Al Viro wrote: > > O_DIRECT case is a PITA - there we use GUP and there's no way > to tell GUP that in the current situation we do *NOT* want to hit > ->fault()/->page_mkwrite()/etc. pagefault_disable() won't be even > noticed there... Well, we could change that. And we do have get_user_pages_fast_only() these days. Linus
Re: [Cluster-devel] [RFC 5/9] iov_iter: Add iov_iter_fault_in_writeable()
On Sat, Jun 12, 2021 at 2:12 PM Al Viro wrote: > > Actually, is there any good way to make sure that write fault is triggered > _without_ modification of the data? On x86 lock xadd (of 0, that is) would > probably do it and some of the other architectures could probably get away > with using cmpxchg and its relatives, but how reliable it is wrt always > triggering a write fault if the page is currently read-only? I wouldn't worry about the CPU optimizing a zero 'add' away (extra work for no gain in any normal situation). But any architecture using 'ldl/stc' to do atomics would do it in software for at least cmpxchg (ie just abort after the "doesn't match"). Even on x86, it's certainly _possible_ that a non-matching cmpxchg might not have done the writability check, although I would find that unlikely (ie I would expect it to do just one TLB lookup, and just one permission check, whether it then writes or not). And if the architecture does atomic operations using that ldl/stc model, I could (again) see the software loop breaking out early (before the stc) on the grounds of "value didn't change". Although it's a lot less likely outside of cmpxchg. I suspect an "add zero" would work just fine even on a ldl/stc model. That said, reads are obviously much easier, and I'd probably prefer the model for writes to be to not necessarily pre-fault anything at all, but just write to user space with page faults disabled. And then only if that fails do you do anything special. And at that point, even walking the page tables by hand might be perfectly acceptable - since we know it's going to fault anyway, and it might actually be cheaper to just do it by hand with GUP or whatever. Linus
Re: [Cluster-devel] [RFC 4/9] gfs2: Fix mmap + page fault deadlocks (part 1)
On Mon, May 31, 2021 at 7:01 AM Andreas Gruenbacher wrote:
>
> Fix that by recognizing the self-recursion case.

Hmm. I get the feeling that the self-recursion case should never have been allowed to happen in the first place.

IOW, is there some reason why you can't make the user accesses always be done with page faults disabled (ie using the "atomic" user space access model), and then if you get a partial read (or write) to user space, at that point you drop the locks in read/write, do the "try to make readable/writable" and try again.

IOW, none of this "detect recursion" thing. Just "no recursion in the first place".

That way you'd not have these odd rules at fault time at all, because a fault while holding a lock would never get to the filesystem at all, it would be aborted early. And you'd not have any odd "inner/outer" locks, or lock compatibility rules or anything like that. You'd literally have just "oh, I didn't get everything at RW time while I held locks, so let's drop the locks, try to access user space, and retry".

Wouldn't that be a lot simpler and more robust?

Because what if the mmap is something a bit more complex, like overlayfs or userfaultfd, and completing the fault isn't about gfs2 handling it as a "fault", but about some *other* entity calling back to gfs2 and doing a read/write instead? Now all your "inner/outer" lock logic ends up being entirely pointless, as far as I can tell, and you end up deadlocking on the lock you are holding over the user space access _anyway_.

So I literally think that your approach is

 (a) too complicated

 (b) doesn't actually fix the issue in the more general case

But maybe I'm missing something.

Linus
Re: [Cluster-devel] [RFC 9/9] gfs2: Fix mmap + page fault deadlocks (part 2)
On Mon, May 31, 2021 at 7:02 AM Andreas Gruenbacher wrote:
>
> @@ -807,13 +824,20 @@ static ssize_t gfs2_file_direct_read(struct kiocb *iocb, struct iov_iter *to,
> [...]
>  	ret = iomap_dio_rw(iocb, to, _iomap_ops, NULL, 0);
>  	gfs2_glock_dq(gh);
> +	if (unlikely(current_needs_retry())) {
> +		set_current_needs_retry(false);
> +		if (ret == -EFAULT &&
> +		    !iov_iter_fault_in_writeable(to, PAGE_SIZE))
> +			goto retry;
> +	}

Hmm. I haven't walked through this all, but is that "ret == -EFAULT" test the right thing to do?

Can iomap_dio_rw() not instead just return a partial success if it hit a missing page half-way? Shouldn't you retry for that case too?

Linus
Re: [Cluster-devel] [RFC 0/9] gfs2: handle page faults during read and write
Sorry, I'm on a boat right now, with only a cellphone. Which is why this html mess of an email, and quick reply. Due to the html, this may get a bounce from the mailing list, and only make it to the personal email recipients. Feel free to quote more just in case others didn't get my original email through the lists.

I'll be out most of the day, but I'll try to take a deeper look this evening. In the meantime, a couple of questions and comments..

On Mon, May 31, 2021, 07:01 Andreas Gruenbacher wrote:
>
> here's a set of fixes for how gfs2 handles page faults during read and
> write syscalls.

So how much of this is due to the confusion you just introduced where you pointlessly and incorrectly take an exclusive lock for write faults? See my reply to that pull request for why it's wrong and pointless.

> The patch queue is ready for merging except for two open issues.

There is no way this series is acceptable for 5.13. This kind of change is very much a merge window thing. Much much too late to make fundamental locking changes. Maybe it can then be backported to stable (including at that point 5.13 of course) if it's been shown to be ok.

This deadlock is not new, we've very much had the same kind of thing when writing to a file in the generic filemap_write() function, where we take the page lock and then copy from user space. If that copy faults, and needs the same page for the source due to an odd mmap issue (usually malicious), you get a deadlock on the page lock if you aren't careful.

I'm surprised that gfs2 hasn't seen this, I thought we had fstests for it. And I'd have expected that case to also trigger any internal gfs2 issues, although it's possible that the generic code just does such a good job at avoiding the issue that we'd need another test for your case.

Linus
Re: [Cluster-devel] [GIT PULL] gfs2 fixes for v5.13-rc5
[ Adding fsdevel, because this is not limited to gfs2 ] On Mon, May 31, 2021 at 12:56 AM Andreas Gruenbacher wrote: > > Andreas Gruenbacher (2): > gfs2: Fix mmap locking for write faults This is bogus. I've pulled it, but this is just wrong. A write fault on a mmap IS NOT A WRITE to the filesystem. It's a read. Yes, it will later then allow writes to the page, but that's entirely immaterial - because the write is going to happen not at fault time, but long *after* the fault, and long *after* the filesystem has installed the page. The actual write will happen when the kernel returns from the user space. And no, the explanation in that commit makes no sense either: "When a write fault occurs, we need to take the inode glock of the underlying inode in exclusive mode. Otherwise, there's no guarantee that the dirty page will be written back to disk" the thing is, FAULT_FLAG_WRITE only says that the *currently* pending access that triggered the fault was a write. That's entirely irrelevant to a filesystem, because (a) it might be a private mapping, and a write to a page will cause a COW by the VM layer, and it's not actually a filesystem write at all AND (b) if it's a shared mapping, the first access that paged things in is likely a READ, and the page will be marked writable (because it's a shared mapping!) and subsequent writes will not cause any faults at all. In other words, a filesystem that checks for FAULT_FLAG_WRITE is _doubly_ wrong. It's absolutely never the right thing to do. It *cannot* be the right thing to do. And yes, some other filesystems do this crazy thing too. If your friends jumped off a bridge, would you jump too? The xfs and ext3/ext4 cases are wrong too - but at least they spent five seconds (but no more) thinking about it, and they added the check for VM_SHARED. So they are only wrong for reason (b) But wrong is wrong. The new code is not right in gfs2, and the old code in xfs/ext4 is not right either. 
Yeah, yeah, you can - and people do - do things like "always mark the page readable on initial page fault, use mkwrite to catch when it becomes writable, and do timestamps carefully", and at least have full knowledge of "something might become dirty".

But even then it is ENTIRELY BOGUS to do things like write locking. Really. Because the actual write HASN'T HAPPENED YET, AND YOU WILL RELEASE THE LOCK BEFORE IT EVER DOES!

So the lock? It does nothing. If you think it protects anything at all, you're wrong. So don't do write locking.

At an absolute most, you can do things like

 - update file times (and honestly, that's quite questionable - because
   again - THE WRITE HASN'T HAPPENED YET - so any test that depends on
   exact file access times to figure out when the last write was done is
   not being helped by your extra code, because you're setting the WRONG
   time).

 - set some "this inode will have dirty pages" flag just for some
   possible future use. But honestly, if this is about consistency etc,
   you need to do it not for a fault, but across the whole mmap/munmap.

So some things may be less bogus - but still very very questionable. But locking? Bogus.

Reads and writes aren't really any different from a timestamp standpoint (if you think you need mtime for write accesses, you should do atime for reads, so from a filesystem timestamp standpoint read and write faults are exactly the same - and both are bogus, because by definition for a mmap both the reads and the writes can then happen long long long afterwards, and repeatedly). And if that "set some flag" thing needs a write lock, but a read doesn't, you're doing something wrong and odd.

Look at the VM code. The VM code does this right. The mmap_sem is taken for writing for mmap and for munmap. But a page fault is always a read lock, even if the access that caused the page fault is a write.
The actual real honest-to-goodness *write* happens much later, and the only time the filesystem really knows when it is done is at writeback time. Not at fault time. So if you take locks, logically you should take them when the fault happens, and release them when the writeback is done. Are you doing that? No. So don't do the write lock over the read portion of the page fault. It is not a sensible operation. Linus
Re: [Cluster-devel] GFS2 changes for the 5.7 merge window
On Tue, Mar 31, 2020 at 9:41 AM Bob Peterson wrote: > > Could you please pull the following changes for gfs2? So I've pulled it, but I note that you didn't get the automated notification of that like almost everybody else does. The reason seems to be pretty simple: the pr-tracker-bot looks for emails to lkml (and other mailing lists) with variations of the subject line "[GIT PULL]" in it. And while lkml was cc'd for your pull request, you don't use that subject line prefix, and so pr-tracker-bot ignores your email. I don't personally care - you do have the markers that _I_ look for, with both "git" and "pull" in the message body, so your pull request doesn't get lost in the chaos that is my inbox during the merge window. So this is purely about that notification bot. IOW, if you care, then you should change your scripting (or manual habits, I don't know) to have that "[GIT PULL]" in the subject line, and then you will get that nice timely notification when I've pulled (and pushed out). And if you don't want that notification, you can just continue doing what you're doing. Linus
Re: [Cluster-devel] [GIT PULL] iomap: small cleanups for 5.5
On Tue, Dec 3, 2019 at 8:09 AM Darrick J. Wong wrote: > Please pull this series containing some more new iomap code for 5.5. > There's not much this time -- just removing some local variables that > don't need to exist in the iomap directio code. Hmm. The tag message (which was also in the email thanks to git request-pull) is very misleading. Pulled, but please check these things. Linus
Re: [Cluster-devel] [PATCH] mm/filemap: do not allocate cache pages beyond end of file at read
On Wed, Nov 27, 2019 at 7:42 AM Steven Whitehouse wrote: > > I'll leave the finer details to Andreas here, since it is his patch, and > hopefully we can figure out a good path forward. As mentioned, I don't _hate_ that patch (ok, I seem to have typoed it and said that I don't "gate" it ;), so if that's what you guys really want to do, I'm ok with it. But.. I do think you already get the data with the current case, from the "short read" thing. So just changing the current generic read function to check against the size first: --- a/mm/filemap.c +++ b/mm/filemap.c @@ -2021,9 +2021,9 @@ static ssize_t generic_file_buffered_read(struct kiocb *iocb, unsigned int prev_offset; int error = 0; - if (unlikely(*ppos >= inode->i_sb->s_maxbytes)) + if (unlikely(*ppos >= inode->i_size)) return 0; - iov_iter_truncate(iter, inode->i_sb->s_maxbytes); + iov_iter_truncate(iter, inode->i_size); index = *ppos >> PAGE_SHIFT; prev_index = ra->prev_pos >> PAGE_SHIFT; and you're done. Nice and clean. Then in gfs2 you just notice the short read, and check at that point. Sure, you'll also cut read-ahead to the old size boundary, but does anybody _seriously_ believe that read-ahead matters when you hit the "some other node wrote more data, we're reading past the old end" case? I don't think that's the case. But I _can_ live with the patch that adds the extra "cached only" bit. It just honestly feels pointless. Linus
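The "notice the short read, revalidate, retry" flow suggested above can be modeled in a few lines of userspace C. Everything here is a toy stand-in (the struct, the names, the revalidation step), not the real gfs2 code; the revalidation would happen under the glock in practice.

```c
#include <assert.h>
#include <string.h>
#include <stddef.h>

/* Toy model of the idea sketched above (hypothetical names, not the
 * real gfs2 code): the generic read stops at the locally cached
 * i_size; on a short read the filesystem revalidates the size (in
 * gfs2 terms: under the glock) and retries the remainder. */
struct toy_file {
    const char *data;      /* backing "disk" contents */
    size_t real_size;      /* authoritative size, e.g. grown by another node */
    size_t cached_size;    /* possibly stale local i_size */
};

/* analogue of generic_file_buffered_read() with the i_size check first */
static size_t cached_read(struct toy_file *f, size_t pos, char *buf, size_t len)
{
    if (pos >= f->cached_size)
        return 0;
    if (len > f->cached_size - pos)
        len = f->cached_size - pos;      /* iov_iter_truncate() analogue */
    memcpy(buf, f->data + pos, len);
    return len;
}

/* filesystem wrapper: notice the short read, revalidate, retry once */
static size_t cluster_read(struct toy_file *f, size_t pos, char *buf, size_t len)
{
    size_t n = cached_read(f, pos, buf, len);
    if (n < len) {
        f->cached_size = f->real_size;   /* "take the lock, re-read i_size" */
        n += cached_read(f, pos + n, buf + n, len - n);
    }
    return n;
}
```

The fast path pays nothing extra; only a short read - the case where another node may have grown the file - triggers the revalidation.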
Re: [Cluster-devel] [PATCH] mm/filemap: do not allocate cache pages beyond end of file at read
On Mon, Nov 25, 2019 at 2:53 AM Steven Whitehouse wrote: > > Linus, is that roughly what you were thinking of? So the concept looks ok, but I don't really like the new flags as they seem to be gfs2-specific rather than generic. That said, I don't _gate_ them either, since they aren't in any critical code sequence, and it's not like they do anything really odd. I still think the whole gfs2 approach is broken. You're magically ok with using stale data from the cache just because it's cached, even if another client might have truncated the file or something. So you're ok with saying "the file used to be X bytes in size, so we'll just give you this data because we trust that the X is correct". But you're not ok to say "oh, the file used to be X bytes in size, but we don't want to give you a short read because it might not be correct any more". See the disconnect? You trust the size in one situation, but not in another one. I also don't really see that you *need* the new flag at all. Since you're going to do a speculative read and then a real read anyway, and since the only thing that you seem to care about is the file size (because the *data* you will trust if it is cached), then why don't you just use the *existing* generic read, and *IFF* you get a truncated return value, then you go and double-check that the file hasn't changed in size? See what I'm saying? I think gfs2 is being very inconsistent in when it trusts the file size, and I don't see that you even need the new behavior that patch gives, because you might as well just use the existing code (just move the i_size check earlier, and then teach gfs2 to double-check the "I didn't get as much as I expected" case). Linus
Re: [Cluster-devel] [PATCH] mm/filemap: do not allocate cache pages beyond end of file at read
On Wed, Oct 30, 2019 at 11:35 AM Steven Whitehouse wrote: > > NFS may be ok here, but it will break GFS2. There may be others too... > OCFS2 is likely one. Not sure about CIFS either. Does it really matter > that we might occasionally allocate a page and then free it again? Why are gfs2 and cifs doing things wrong? "readpage()" is not for synchronizing metadata. Never has been. You shouldn't treat it that way, and you shouldn't then make excuses for filesystems that treat it that way. Look at mmap, for example. It will do the SIGBUS handling before calling readpage(). Same goes for the copyfile code. A filesystem that thinks "I will update size at readpage" is already fundamentally buggy. We do _recheck_ the inode size under the page lock, but that's to handle the races with truncate etc. Linus
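The ordering argument above - the fault path decides SIGBUS from i_size before `->readpage()` is ever invoked - can be captured in a tiny model. The names below are illustrative stand-ins, not the real mm code:

```c
#include <assert.h>

/* Toy model of the ordering described above (illustrative names, not
 * the real mm code): the fault path rejects accesses beyond the known
 * size with SIGBUS before ->readpage() is ever called, so readpage()
 * is never the place where a new file size could be discovered. */
enum { PAGE_OK = 0, FAULT_SIGBUS = -1 };

static int readpage_calls;

static int readpage(unsigned long index)
{
    (void)index;
    readpage_calls++;           /* would fill the page from disk */
    return PAGE_OK;
}

static int filemap_fault_model(unsigned long index, unsigned long size_in_pages)
{
    if (index >= size_in_pages)
        return FAULT_SIGBUS;    /* size checked first; readpage not called */
    return readpage(index);
}
```

A fault past the end never reaches `readpage()`, which is exactly why a filesystem that hoped to refresh its size there is, as the email puts it, fundamentally buggy.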
Re: [Cluster-devel] [GIT PULL] gfs2 changes
On Sat, Sep 21, 2019 at 9:51 AM Andreas Gruenbacher wrote: > > please consider pulling the following changes for gfs2. Merged. But I notice that you're not getting any pr-tracker replies. I'd suggest you cc lkml to get the automatic notification in case you care, Linus
Re: [GIT PULL] iomap: new code for 5.4
On Thu, Sep 19, 2019 at 10:07 AM Christoph Hellwig wrote: > > I personally survived with it, but really hated writing the open coded > list_for_each_entry + list_del or while list_first_entry_or_null + > list_del when I could have a nice primitive for it. I finally decided > to just add that helper.. You realize that the list helper is even mis-named? Generally a "pop()" should pop the last entry, not the first one. Of course, which one is last and which one is first is entirely up to you since you can add at the tail or the beginning. So even the name is ambiguous. So I really think the whole thing was badly implemented (and yes, part of that was the whole implementation thing). Doing list_del() by hand also means that you can actually decide if you want list_del_init() or not. But again - keep that helper if you want it. Just don't put a badly implemented and badly named "helper" in a global header file that _everybody_ has to look at. Linus
Re: [GIT PULL] iomap: new code for 5.4
On Thu, Sep 19, 2019 at 10:03 AM Linus Torvalds wrote: > > So inside of xfs, that "pop ioend from the list" model really may make > perfect sense, and you could just do > > static inline struct xfs_ioend *xfs_pop_ioend(struct list_head *head) Note that as usual, my code in emails is written in the MUA, entirely untested, and may be completely broken. So take it just as a "maybe something like this works for you". Linus
Re: [Cluster-devel] [GIT PULL] iomap: new code for 5.4
On Wed, Sep 18, 2019 at 8:45 PM Darrick J. Wong wrote: > > I propose the following (assuming Linus isn't cranky enough to refuse > the entire iomap patchpile forever): Since I don't think the code was _wrong_, it was just that I didn't want to introduce a new pattern for something we already have, I'd be ok with just a fix patch on top too. And if the xfs people really want to use the "pop" model, that's fine too - just make it specific to just the iomap_ioend type, which is all you seem to really want to pop, and make the typing (and naming) be about that particular thing. As mentioned, I don't think that whole "while (pop)" model is _wrong_ per se. But I also don't like proliferating new random list users in general, just because some subsystem has some internal preference. See? And I notice that one of the users actually does keep the list around and modifies it, ie this one: while ((ioend = list_pop_entry(&completion_list, struct xfs_ioend, io_list))) { xfs_ioend_try_merge(ioend, &completion_list); xfs_end_ioend(ioend); } actually seems to _work_ on the remaining part of the list in that xfs_ioend_try_merge() function. So inside of xfs, that "pop ioend from the list" model really may make perfect sense, and you could just do static inline struct xfs_ioend *xfs_pop_ioend(struct list_head *head) { struct xfs_ioend *ioend = list_first_entry_or_null(head, struct xfs_ioend, io_list); if (ioend) list_del(&ioend->io_list); return ioend; } and I don't think it's wrong. It's just that we've survived three decades without that list_pop(), and I don't want to make our header files even bigger and more complex unless we actually have multiple real users. Linus
Re: [Cluster-devel] [GIT PULL] iomap: new code for 5.4
On Wed, Sep 18, 2019 at 6:31 PM Linus Torvalds wrote: > > Why would anybody use that odd "list_pop()" thing in a loop, when what > it really seems to just want is that bog-standard > "list_for_each_entry_safe()" Side note: I do agree that the list_for_each_entry_safe() thing isn't exactly beautiful, particularly since you need that extra variable for the temporary "next" pointer. It's one of the C++ features I'd really like to use in the kernel - the whole "declare new variable in a for (;;) statement" thing. In fact, it made it into C - it's there in C99 - but we still use "-std=gnu89" because of other problems with the c99 updates. Anyway, I *would* be interested in cleaning up list_for_each_entry_safe() if somebody has the energy and figures out what we could do to get the c99 behavior without the breakage from other sources. For some background: the reason we use "gnu89" is because we use the GNU extension with type cast initializers quite a bit, ie things like #define __RAW_SPIN_LOCK_UNLOCKED(lockname) \ (raw_spinlock_t) __RAW_SPIN_LOCK_INITIALIZER(lockname) and that broke in c99 and gnu99, which considers those compound literals and you can no longer use them as initializers. See https://lore.kernel.org/lkml/20141019231031.gb9...@node.dhcp.inet.fi/ for some of the historical discussion about this. It really _is_ sad, because variable declarations inside for-loops are very useful, and would have the potential to make some of our "for_each_xyz()" macros a lot prettier (and easier to use too). So our list_for_each_entry_safe() thing isn't perfect, but that's no reason to try to then make up completely new things. Linus
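What the C99 declaration-in-for feature would buy the kernel's iterator macros can be shown with a hypothetical macro (this is a sketch, not the real list.h; the singly-linked list and names are made up for the demo):

```c
#include <assert.h>
#include <stddef.h>

/* Sketch of what C99 declarations-in-for would buy an iterator macro
 * (hypothetical macro, not the real list.h): the "safe" next pointer
 * is declared inside the for-statement itself, so the caller no
 * longer has to declare a temporary of the right type. */
struct node { int val; struct node *next; };

#define for_each_node_safe(pos, head)                                 \
    for (struct node *pos = (head),                                   \
         *pos##_next = (head) ? (head)->next : NULL;                  \
         pos;                                                         \
         pos = pos##_next,                                            \
         pos##_next = pos ? pos->next : NULL)

/* The blocker mentioned above, for reference: the kernel relies on
 * GNU-style casts of initializer lists, e.g.
 *     #define __RAW_SPIN_LOCK_UNLOCKED(n) \
 *         (raw_spinlock_t) __RAW_SPIN_LOCK_INITIALIZER(n)
 * which c99 parses as a compound literal and no longer accepts as a
 * static initializer -- hence -std=gnu89, and hence no C99 for-loop
 * declarations in these macros. */
```

Callers then write `for_each_node_safe(p, &list_head) { ... }` with no extra `next` variable in sight, which is exactly the cleanup the email wishes for.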
Re: [GIT PULL] iomap: new code for 5.4
On Tue, Sep 17, 2019 at 8:21 AM Darrick J. Wong wrote: > > Please pull this series containing all the new iomap code for 5.4. So looking at the non-iomap parts of it, I react to the new "list_pop()" code. In particular, this: struct list_head *pos = READ_ONCE(list->next); is crazy to begin with.. It seems to have come from "list_empty()", but the difference is that it actually makes sense to check for emptiness of a list outside whatever lock that protects the list. It can be one of those very useful optimizations where you don't even bother taking the lock if you can optimistically check that the list is empty. But the same is _not_ true of an operation like "list_pop()". By definition, the list you pop something off has to be stable, so the READ_ONCE() makes no sense here. Anyway, if that was the only issue, I wouldn't care. But looking closer, the whole thing is just completely wrong. All the users seem to do some version of this: struct list_head tmp; list_replace_init(&ioend->io_list, &tmp); iomap_finish_ioend(ioend, error); while ((ioend = list_pop_entry(&tmp, struct iomap_ioend, io_list))) iomap_finish_ioend(ioend, error); which is completely wrong and pointless. Why would anybody use that odd "list_pop()" thing in a loop, when what it really seems to just want is that bog-standard "list_for_each_entry_safe()" struct list_head tmp; struct iomap_ioend *next; list_replace_init(&ioend->io_list, &tmp); iomap_finish_ioend(ioend, error); list_for_each_entry_safe(ioend, next, &tmp, io_list) iomap_finish_ioend(ioend, error); which is not only the common pattern, it's more efficient and doesn't pointlessly re-write the list for each entry, it just walks it (and the "_safe()" part is because it looks up the next entry early, so that the entry that it's walking can be deleted). So I pulled it. But then after looking at it, I unpulled it again because I don't want to see this kind of insanity in one of THE MOST CORE header files we have in the whole kernel. 
If xfs and iomap want to think they are "popping" a list, they can do so. In the privacy of your own home, you can do stupid and pointless things. But no, we don't pollute core kernel code with those stupid and pointless things. Linus
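The `list_for_each_entry_safe()` pattern argued for above runs fine in userspace with a trimmed-down copy of the kernel's list machinery. The struct names below (`ioend`, `finish_ioend`) are stand-ins for the iomap types, and the list macros are a minimal rendition of include/linux/list.h, just enough for the demo:

```c
#include <assert.h>
#include <stddef.h>

/* Minimal userspace copy of the kernel list machinery, trimmed down
 * from include/linux/list.h for this demo. */
struct list_head { struct list_head *next, *prev; };

#define LIST_HEAD_INIT(name) { &(name), &(name) }
#define list_entry(ptr, type, member) \
    ((type *)((char *)(ptr) - offsetof(type, member)))
#define list_for_each_entry_safe(pos, n, head, member)                    \
    for (pos = list_entry((head)->next, __typeof__(*pos), member),        \
         n = list_entry(pos->member.next, __typeof__(*pos), member);      \
         &pos->member != (head);                                          \
         pos = n, n = list_entry(n->member.next, __typeof__(*n), member))

static void list_add_tail(struct list_head *entry, struct list_head *head)
{
    entry->prev = head->prev;
    entry->next = head;
    head->prev->next = entry;
    head->prev = entry;
}

static void list_del(struct list_head *entry)
{
    entry->prev->next = entry->next;
    entry->next->prev = entry->prev;
}

/* stand-in for struct iomap_ioend */
struct ioend { int error; struct list_head io_list; };

static int finished;

static void finish_ioend(struct ioend *io)
{
    finished += io->error;      /* "complete" the I/O; here just count */
}

/* the loop shape argued for above: walk and finish, deleting as we go;
 * the "_safe" variant caches the next entry before pos is unlinked */
static void finish_all(struct list_head *head)
{
    struct ioend *io, *next;

    list_for_each_entry_safe(io, next, head, io_list) {
        list_del(&io->io_list);
        finish_ioend(io);
    }
}
```

No entry is ever re-linked or re-read: the walk just advances through cached `next` pointers, which is the "doesn't pointlessly re-write the list" property the email contrasts against the pop-in-a-loop version.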
Re: [Cluster-devel] [PATCH] gfs2: Fix error path kobject memory leak
On Mon, May 13, 2019 at 3:37 PM Andreas Gruenbacher wrote: > > Sorry, I should have been more explicit. Would you mind taking this > patch, please? If it's more convenient or more appropriate, I'll send > a pull request instead. Done. However, I'd like to point out that when I see patches from people who I normally get a pull request from, I usually ignore them. Particularly when they are in some thread with discussion, I'll often just assume that the patch is part of the thread, not really meant for me in particular. In this case I happened to notice that suddenly my participation status changed, which is why I asked, but in general I might have just archived the thread with the assumption that I'll be getting the patch later as a git pull. Just so you'll be aware of this in the future, in case I don't react... Linus
Re: [PATCH] gfs2: Fix error path kobject memory leak
Andreas, did you intend for me to take this as a patch from the email, or will I get a pull request with this later? Linus
Re: [Cluster-devel] GFS2: Pull Request
On Wed, May 8, 2019 at 1:58 PM Jonathan Corbet wrote: > > I think this certainly belongs in the maintainer manual, but probably not > in pull-requests.rst. There are a lot of things about repository > management that seem to trip up even experienced maintainers; pre-pull > merges is just one of those. I would love to see a proper guide on when > and how to do merges in general. We had another pull request issue today, about a situation that I definitely know you've written about in the past, because I linked to lwn in my email: https://lore.kernel.org/lkml/CAHk-=wiKoePP_9CM0fn_Vv1bYom7iB5N=ulallz7yost3k+...@mail.gmail.com/ and while I suspect people don't actually read documentation (_particularly_ maintainers that have already been around for a long time but still do this), maybe that part could be in the same documentation? Linus
Re: [Cluster-devel] GFS2: Pull Request
On Wed, May 8, 2019 at 1:17 PM Andreas Gruenbacher wrote: > > Would it make sense to describe how to deal with merge conflicts in > Documentation/maintainer/pull-requests.rst to stop people from getting > this wrong over and over again? Probably. I do think it got written up at some point (lwn or something like that), but it's possible it never got made into an actual documentation file.. Anybody want to try to massage that email into the appropriate doc file? Linus
Re: [Cluster-devel] GFS2: Pull Request
On Wed, May 8, 2019 at 10:55 AM Linus Torvalds wrote: > > See what I'm saying? You would ask me to pull the un-merged state, but > then say "I noticed a few merge conflicts when I did my test merge, > and this is what I did" kind of aside. Side note: this is more important if you know of a merge issue that doesn't cause a conflict, and that I won't see in my simple build tests. For example, in this case, the merge issue doesn't cause a conflict (because it's a totally new user of bio_for_each_segment_all() and the syntax changed in another branch), but I see it trivially when I do a test build, since the compiler spews out lots of warnings, and so I can trivially fix it up (and you _mentioning_ the issue gives me the heads up that you knew about it and what it's all about). But if it's other architectures, or only happens under special config options etc, I might not have seen the merge issue at all. And then it's really good if the maintainer talks about it and shows that yes, the maintainer knows what he's doing. Now I'm in the situation where I have actually done the merge the way I *like* doing them, and without your superfluous merge commit. But if I use my merge, I'll lose the signature from your tag, because you signed *your* merge that I didn't actually want to use at all. See? Your "helpful" merge actually caused me extra work, and made me have to pick one of two *worse* situations than if you had just tagged your own development tree. Either my tree has an extra pointless merge commit, or my tree lacks your signature on your work. I'll just undo my pull, and delay it all in the hope that you'll just send me a new signed tag of the non-merged state. Linus
Re: [Cluster-devel] GFS2: Pull Request
On Wed, May 8, 2019 at 4:49 AM Andreas Gruenbacher wrote: > > There was a conflict with commit 2b070cfe582b ("block: remove the i > argument to bio_for_each_segment_all") on Jens's block layer changes > which you've already merged. I've resolved that by merging those block > layer changes; please let me know if you want this done differently. PLEASE. I say this to somebody pretty much every single merge window: don't do merges for me. You are actually just hurting, not helping. I want to know what the conflicts are, not by being told after-the-fact, but by just seeing them and resolving them. Yes, I like being _warned_ ahead of time - partly just as a heads up to me, but partly also to show that the maintainers are aware of the notifications from linux-next, and that linux-next is working as intended, and people aren't just ignoring what it reports. But I do *NOT* want to see maintainers cross-merging each others trees. It can cause nasty problems, ranging from simply mis-merges to causing me to not pull a tree at all because one side of the development effort had done something wrong. And yes, mis-merges happen - and they happen to me too. It's fairly rare, but it can be subtle and painful when it does happen. But (a) I do a _lot_ of merges, so I'm pretty good at them, and (b) if _I_ do the merge, at least I know about the conflict and am not as taken by surprise by possible problems due to a mis-merge. And that kind of thing is really really important to me as an upstream maintainer. I *need* to know when different subtrees step on each others toes. As a result, even when there's a conflict and a merge is perfectly fine, I want to know about it and see it, and I want to have the option to pull the maintainer trees in different orders (or not pull one tree at all), which means that maintainers *MUST NOT* do cross-tree merges. See? 
And I don't want to see back-merges (ie merges from my upstream tree, as opposed to merges between different maintainer trees) either, even as a "let me help Linus, he's already merged the other tree, I'll do the merge for him". That's not helping, that's just hiding the issue. Now, very very occasionally I will hit a merge that is so hard that I will go "Hmm, I really need the involved parties to sort this out". Honestly, I can't remember the last time that happened, but it _has_ happened. Much more commonly, I'll do the merge, but ask for verification, saying "ok, this looked more subtle than I like, and I can't test any of it, so can you check out my merge". Even that isn't all that common, but it definitely happens. There is _one_ very special kind of merge that I like maintainers doing: the "test merge". That test merge wouldn't be sent to me in the pull request as the primary signed pull, but it's useful for the maintainer to do a final *check* before doing the pull request, so that you as a maintainer know what's going on, and perhaps to warn me about conflicts. If you do a test merge, and you think the test merge was complex, you might then point to your resolution in the pull request as a "this is how I would do it". But you should not make that merge be *the* pull request. One additional advantage of a test merge is that it actually gives a "more correct" diffstat for the pull request. Particularly if the pull request is for something with complex history (ie you as a maintainer have sub-maintainers, and have pulled other people's work), a test-merge can get a much better diffstat. I don't _require_ that better diffstat, though - I can see how you got the "odd" diffstat if you don't do a test merge - but it can be a "quality of pull request" thing. See what I'm saying? You would ask me to pull the un-merged state, but then say "I noticed a few merge conflicts when I did my test merge, and this is what I did" kind of aside. Linus
Re: [Cluster-devel] [PATCH] gfs2: Revert "Fix loop in gfs2_rbm_find"
On Thu, Jan 31, 2019 at 11:12 AM Andreas Gruenbacher wrote: > > That patch just hasn't seen enough testing to make me comfortable > sending it yet. Ok, I'll just apply the revert. Thanks, Linus
Re: [Cluster-devel] [PATCH] gfs2: Revert "Fix loop in gfs2_rbm_find"
On Wed, Jan 30, 2019 at 12:30 PM Andreas Gruenbacher wrote: > > This reverts commit 2d29f6b96d8f80322ed2dd895bca590491c38d34. > > It turns out that the fix can lead to a ~20 percent performance regression > in initial writes to the page cache according to iozone. Let's revert this > for now to have more time for a proper fix. Ugh. I think part of the problem is that the n += (rbm->bii - initial_bii); doesn't make any sense when we just set rbm->bii to 0 a couple of lines earlier. So it's basically a really odd way to write n -= initial_bii; which in turn really doesn't make any sense _either_. So I'm all for reverting the commit because it caused a performance regression, but the end result really is all kinds of odd. Is the bug as simple as "we incremented the iterator counter twice"? Because in the -E2BIG case, we first increment it by the difference in bii, but then we *also* increment it in res_covered_end_of_rgrp() (which we do *not* do for the "ret > 0" case, which goes to next_iter). So if somebody can run the performance test again, how does it look if *instead* of the revert, we do what looks at least _slightly_ more sensible, and move the "increment iteration count" up to the next-bitmap case instead? At that point, it will actually match the bii updates (because next_bitmap also updates rbm->bii). Hmm? Something like the whitespace-damaged thing below. Completely untested. But *if* this works, it would make more sense than the revert.. Hmm? Linus --- snip snip --- diff --git a/fs/gfs2/rgrp.c b/fs/gfs2/rgrp.c index 831d7cb5a49c..5b1006d5344f 100644 --- a/fs/gfs2/rgrp.c +++ b/fs/gfs2/rgrp.c @@ -1796,10 +1796,10 @@ static int gfs2_rbm_find(struct gfs2_rbm *rbm, u8 state, u32 *minext, rbm->bii++; if (rbm->bii == rbm->rgd->rd_length) rbm->bii = 0; + n++; res_covered_end_of_rgrp: if ((rbm->bii == 0) && nowrap) break; - n++; next_iter: if (n >= iters) break;
Re: [Cluster-devel] [GIT PULL] gfs2: 4.20 fixes
On Thu, Nov 15, 2018 at 2:44 PM Andreas Gruenbacher wrote: > > Ok, I've changed the merge commit as follows now: > > Merge tag 'v4.20-rc1' > > Pull in the gfs2 fixes that went into v4.19-rc8: > > gfs2: Fix iomap buffered write support for journaled files > gfs2: Fix iomap buffered write support for journaled files (2) > > Without these two commits, the following fix would cause conflicts. > > So merging v4.19-rc8 would have been sufficient. v4.20-rc1 is what I > ended up testing, though. > > Are you okay with that now? If the only reason for this was the trivial one-liner conflict, the real fix is to just not do the merge at all, and not worry about conflicts. I handle simple conflicts like this in my sleep. It's actually much better and clearer if the development trees just do their own development, and not worry about "Oh, Linus will get a small conflict due to other changes he has pulled". So what I did was to actually try to generate the tree I think it *should* have been, and just merge that with the simple conflict resolution. You might want to double-check it - not only because I rewrote your pull to not have the merge at all, and in the process modified things, but just to see what I think the right approach would have been. Now, there *are* reasons to do back-merges, but no, "Oh, there's a trivial conflict" is not one of them. Linus
Re: [Cluster-devel] [GIT PULL] gfs2: 4.20 fixes
On Thu, Nov 15, 2018 at 12:20 PM Andreas Gruenbacher wrote: > > I guess rebasing the for-next branch onto something more recent to > avoid the back-merge in the first place will be best, resulting in a > cleaner history. Rebases aren't really any better at all. If you have a real *reason* for a merge, do the merge. But then the reason should be clearly stated in the merge commit. Not just some random undocumented merge message. Tell why some other branch was relevant to your fix and needed to be pulled in. Better yet, don't do either merges or rebases. Linus
Re: [Cluster-devel] [GIT PULL] gfs2: 4.20 fixes
On Thu, Nov 15, 2018 at 6:00 AM Andreas Gruenbacher wrote: > > could you please pull the following gfs2 fixes for 4.20? No. I'm not pulling this useless commit message: "Merge tag 'v4.20-rc1'" with absolutely _zero_ explanation for why that merge was done. Guys, stop doing this. Because I will stop pulling them. If you can't be bothered to explain exactly why you're doing a merge, I can't be bothered to pull the result. Commit messages are important. They explain why something was done. Merge commits are to some degree *more* important, because they do odd things and the reason is not obvious from the code ("Oh, it's a oneliner obvious fix"). So merge commits without a reason for them are simply not acceptable. Linus
Re: [Cluster-devel] GFS2: Pull request (merge window)
On Tue, Oct 23, 2018 at 8:26 PM Bob Peterson wrote: > > Please consider pulling the following changes for the GFS2 file system. Pulled. Just FYI - I end up editing your notes to look a bit more like most other people send them. No need to change anything, I just thought I'd mention it. Linus
Re: [Cluster-devel] GFS2: Pull request (merge window)
On Fri, May 5, 2017 at 1:28 PM, Bob Peterson wrote: > > I asked around, but nobody could tell me what went wrong. Strangely, > this command: > > git log --oneline --right-only origin/master...FETCH_HEAD --stat > > doesn't show this, but this one does: > > git diff --stat --right-only origin/master...FETCH_HEAD So the fundamental difference between "git log" and "git diff" is that one is a "set operation" on the commits in question, and the other is fundamentally an "operation between two endpoints". And that's why "git log" will always show the "right" thing - because even in the presence of complex history, there's no ambiguity about which commits are part of the new set, and which are in the old set. So "git log" just does a set difference, and shows the commits in one set but not the other. But "git diff", because it is fundamentally about two particular points in history, can have a hard time once you have complex history: what are the two points? In particular, what "git diff origin/master...FETCH_HEAD" means is really: - find the common point (git calls it "merge base" because the common point is also used for merging) between the two commits (origin/master and FETCH_HEAD) - do the diff from that common point to the end result (FETCH_HEAD) and for linear history that is all very obvious and unambiguous. But once you have non-linear history, and particularly once you have back-merges (ie you're not just merging work that is uniquely your own from multiple of your *own* branches, but you're also doing merges of upstream code), the notion of that "common base" is no longer unambiguous. There is not necessarily any *one* common base, there can be multiple points in history that are common between the two branches, but are distinct points of history (ie one is not an ancestor of another). 
And since a diff is fundamentally about just two end-points ("what are the differences between these two points in the history"), "git diff" fundamentally cannot handle that case without help. So "git diff" will pick the first of the merge bases it finds, and just use that. Which even in the presence of more complex history will often work by luck, but more often just means that you'll see differences that aren't all from your tree, but some of them came from the *other* common point(s). For example, after doing the pull, I can then do: git merge-base --all HEAD^ HEAD^2 to see the merge bases of the merge in HEAD. In this case, because of your back-merge, there's two of them (with more complex history, there can be more still): f9fe1c12d126 rhashtable: Add rhashtable_lookup_get_insert_fast 69eea5a4ab9c Merge branch 'for-linus' of git://git.kernel.dk/linux-block and because "git diff" will just pick the first one, you will basically have done git diff f9fe1c12d126..FETCH_HEAD and if you then look at the *set* of changes (with "git log" of that range), you'll see why that diff also ends up containing those block changes (because they came in from that other merge base: commit 69eea5a4ab9c that had that linux-block merge). Now, doing a *merge* in git will take _all_ of those merge bases into account, and do something *much* more complicated than just a two-way diff. It will internally first create a single merge base (by recursively merging up all the other merge bases into a new internal commit), and then using that single merge base it will then do a normal three-way merge of the two branches. "git diff" doesn't do that kind of complicated operation, and although it *could* do that merge base recursive merging dance, the problem would be what to do about conflicts (which "git merge" obviously can also have, but with git merge you have that whole manual conflict resolution case). 
So once you have complex history that isn't just about merging your own local changes from other local branches, you'll start hitting this situation. Visualizing the history with "gitk" for those cases is often a great way to see why there's no single point that can be diffed against. But once you *do* have that kind of complex history, you're also expected to have the expertise to handle it: > So I created a temporary local branch and used git merge to > generate a correct diffstat. That's the correct thing to do. Hopefully the above explains *why* it's the correct thing to do. (Although to be honest, I also am pretty used to parsing the wrong diffs, and people sometimes just send me the garbage diffstat and say "I don't know what happened", and I'll figure it out and can still validate that the garbage diffstat they sent me is what I too get if I do just a silly "git diff" without taking merge bases into account). Linus
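The multiple-merge-base situation described above is easy to reproduce in a throwaway repository. The branch names and commit messages below are made up for the demo; the criss-cross merge is the key ingredient:

```shell
# Build a throwaway repo with a criss-cross merge history: after each
# branch has merged the other's commit, "git merge-base --all" reports
# two merge bases, which is exactly when a plain "git diff A...B" has
# to pick one and can mislead.
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.email demo@example.com
git config user.name demo
git commit -q --allow-empty -m base
git checkout -qb A
git commit -q --allow-empty -m a1
git checkout -qb B HEAD^              # B also starts from "base"
git commit -q --allow-empty -m b1
a1=$(git rev-parse A)
b1=$(git rev-parse B)
git merge -q -m "B merges a1" "$a1"   # B pulls in a1 ...
git checkout -q A
git merge -q -m "A merges b1" "$b1"   # ... and A pulls in b1: criss-cross
git merge-base --all A B              # prints two commits: a1 and b1
```

Running `gitk A B` (as the email suggests) on such a repository shows visually why no single diff base exists: a1 and b1 are both common ancestors, and neither is an ancestor of the other.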
Re: [Cluster-devel] [PATCH 04/21] fs: Replace CURRENT_TIME with current_fs_time() for inode timestamps
On Thu, Jun 9, 2016 at 1:38 PM, Deepa Dinamani wrote: > > 1. There are a few link, rename functions which assign times like this: > > - inode->i_ctime = dir->i_ctime = dir->i_mtime = CURRENT_TIME; > + inode->i_ctime = dir->i_ctime = dir->i_mtime = > current_fs_time(dir->i_sb); So I think you should just pass in one of the two inodes and just add a comment. Then, if we hit a filesystem that actually wants to have different granularity for different inodes, we'll split it up, but even then we'd be better off than with the superblock, since then we *could* easily split this one case up into "get directory time" and "get inode time". > 2. Also, this means that we will make it an absolute policy that any > filesystem > timestamp that is not directly connected to an inode would have to use > ktime_get_* apis. The thing is, those kinds of things are all going to be inside the filesystem itself. At that point, the *filesystem* already knows what the timekeeping rules for that filesystem is. I think we should strive to design the "current_fs_time()" not for internal filesystem use, but for actual generic use where we *don't* know a priori what the rules are, and we have to go to this helper function to figure it out. Inside a filesystem, why *shouldn't* the low-level filesystem already use the normal "get time" functions? See what I'm saying? The primary value-add to "current_fs_time()" is for layers like the VFS and security layer that don't know what the filesystem itself does. At the low-level filesystem layer, you may just know that "ok, I only have 32-bit timestamps anyway, so I should just use a 32-bit time function". > 3. Even if the filesystem inode has extra timestamps and these are not > part of vfs inode, we still use > vfs inode to get the timestamps from current_fs_time(): Eg: ext4 create time But those already have an inode. In fact, ext4 is a particularly bad example, since it uses the ext4_current_time() function to get the time. 
And that one gets an inode pointer. So at least one filesystem that already does this uses an inode-based model.

Everything I see just says "times are about inodes". Anything else almost has to be filesystem-internal anyway, since the only thing that is ever visible outside the filesystem (time-wise) is the inode.

And as mentioned, once it's internal to the low-level filesystem, it's not obvious at all that you'd have to use "current_fs_time()" anyway. The internal filesystem code might very well decide to use other timekeeping functions.

Linus
Re: [Cluster-devel] [PATCH 04/21] fs: Replace CURRENT_TIME with current_fs_time() for inode timestamps
On Wed, Jun 8, 2016 at 10:04 PM, Deepa Dinamani wrote:
>
> CURRENT_TIME macro is not appropriate for filesystems as it
> doesn't use the right granularity for filesystem timestamps.
> Use current_fs_time() instead.

Again - using the inode instead of the superblock in this patch would have made the patch much more obvious (it could have been 99% generated with the sed script I sent out a week or two ago), and it would have made it unnecessary to add these kinds of things:

> diff --git a/drivers/usb/core/devio.c b/drivers/usb/core/devio.c
> index e9f5043..85c12f0 100644
> --- a/drivers/usb/core/devio.c
> +++ b/drivers/usb/core/devio.c
> @@ -2359,6 +2359,7 @@ static long usbdev_do_ioctl(struct file *file, unsigned int cmd,
>  {
>         struct usb_dev_state *ps = file->private_data;
>         struct inode *inode = file_inode(file);
> +       struct super_block *sb = inode->i_sb;
>         struct usb_device *dev = ps->dev;
>         int ret = -ENOTTY;

where we add a new variable just because the calling convention was wrong.

It's not even 100% obvious that a filesystem has to have one single time representation, so making the time function be about the entity whose time is set is also conceptually a much better model, never mind that it is just what every single user seems to want anyway.

So I'd *much* rather see

    + inode->i_atime = inode->i_mtime = inode->i_ctime = current_fs_time(inode);

over seeing either of these two variants:

    + inode->i_atime = inode->i_mtime = inode->i_ctime = current_fs_time(inode->i_sb);

    + ret->i_atime = ret->i_mtime = ret->i_ctime = current_fs_time(sb);

because with the first of those variants (grep for current_fs_time() in the current git tree, and notice that it's the common one) we get the pointless "let's chase a pointer in every caller" pattern.

And while it's true that the second variant is natural for *some* situations, I've yet to find one where it wasn't equally sane to just pass in the inode instead.

Linus
Re: [Cluster-devel] GFS2: Pull request (merge window)
On Tue, Apr 14, 2015 at 10:47 AM, Bob Peterson <rpete...@redhat.com> wrote:
>
> There's another that adds me as a GFS2 co-maintainer [...]

So generally, when I start getting pull requests from different people, I'd like to see a separate heads-up or confirmation from the previous person beforehand, just so that I'm not taken by surprise. Rather than having this happen as a side note in the pull request itself, make the thing official first.

Yes, yes, I can see that you use the same script that Steven did, and I can see that you've been the main developer lately, but still..

Anyway. Pulled.

Linus
Re: [Cluster-devel] GFS2: Pull request (merge window)
On Tue, Apr 14, 2015 at 10:47 AM, Bob Peterson <rpete...@redhat.com> wrote:
>
>  12 files changed, 184 insertions(+), 95 deletions(-)

Oh, and this was incorrect. You had apparently limited the statistics to the fs/gfs2 directory, and thus missed the changes to the MAINTAINERS file.

Don't do that - include all the changes. It's what I see and compare against when I actually do the pull anyway.

Linus
Re: [Cluster-devel] GFS2: Pull request (fixes)
On Mon, Aug 19, 2013 at 2:26 AM, Steven Whitehouse <swhit...@redhat.com> wrote:
>
> Oops, forgot to sign this when I sent it first, so here is a signed
> copy of it.

Please sign the git tag instead of the email. There are no sane tools to check email signatures etc, but git signed tags not only check your pgp key signature, they also allow you to just write the message into the tag.

Also, please don't do crap like this:

>   are available in the git repository at:
>     https://git.kernel.org/cgit/linux/kernel/git/steve/gfs2-3.0-fixes.git master

Yeah, no. The proper address is

    git://git.kernel.org/pub/scm/linux/kernel/git/steve/gfs2-3.0-fixes

not that cgit web page.

Linus
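The tag-based flow being asked for looks roughly like the sketch below. It uses `git tag -a` so it runs without key setup; in real use you would pass `-s` instead to additionally PGP-sign the tag. The repo, file, and tag names are all made up for the demo.

```shell
set -e
d=$(mktemp -d); cd "$d"
git init -q repo && cd repo
git config user.email you@example.com
git config user.name "You"
echo fix > fs.c && git add fs.c && git commit -qm "gfs2: fix"

# Annotated (in real use: signed, with -s) tag; the tag message
# doubles as the pull-request text that ends up verifiable in git.
git tag -a gfs2-fixes -m "GFS2 fixes: one writepage regression fix"

# The message travels with the tag itself:
git tag -n9 gfs2-fixes
```

Because the message lives in the tag object, whoever pulls the tag gets both the signature check and the request text in one verifiable place, instead of an email signature nothing can conveniently validate.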
Re: [Cluster-devel] [patch for-3.8] fs, dlm: fix build error when EXPERIMENTAL is disabled
On Tue, Feb 12, 2013 at 1:50 AM, Steven Whitehouse <swhit...@redhat.com> wrote:
>
> That doesn't seem right to me... DLM has not been experimental for a
> long time now. Why not just select CRC32 in addition to IP_SCTP?

Hmm. IP_SCTP already does a "select libcrc32c". So why doesn't that end up working?

Linus
Re: [Cluster-devel] GFS2: Pull request (merge window)
On Fri, Aug 3, 2012 at 3:18 AM, Steven Whitehouse <swhit...@redhat.com> wrote:
>
> I've been writing summaries when I send out the pre-pull patch posting
> so I could just copy and paste from that, if that's the kind of thing
> that you are after?
>
>   https://lkml.org/lkml/2012/7/23/59

Yup, exactly something like that. And then depending on whether there are any new and interesting things, please elaborate..

Thanks,

Linus
Re: [Cluster-devel] GFS2: Pull request (merge window)
[ For some reason this didn't go out, and was in my drafts folder. Better late than never, I guess ]

On Mon, Jul 23, 2012 at 7:59 AM, Steven Whitehouse <swhit...@redhat.com> wrote:
>
> Please consider pulling the following patches. There have been no
> changes since they were posted for review,

It would be lovely to get a short overview of what the changes you ask me to pull actually do, so that I can populate the merge message with relevant information.

(Not just a message for you, this is true in general. Although I'm very pleased with the merge window so far - I think most pull requests have had at least *some* general overview in tags and/or emails)

Linus
Re: [Cluster-devel] GFS2: Use kmalloc when possible for -readdir()
On Wed, Jul 28, 2010 at 4:15 AM, Steven Whitehouse <swhit...@redhat.com> wrote:
>
> We may be able to eliminate vmalloc entirely at some stage, but this
> is easy to do right away.

Quite frankly, I'd much rather see this abstracted out a bit. Why not just do a

    void *memalloc(unsigned int size)
    {
            if (size < KMALLOC_MAX_SIZE) {
                    void *ptr = kmalloc(size, GFP_KERNEL | __GFP_NOWARN);
                    if (ptr)
                            return ptr;
            }
            return vmalloc(size);
    }

    void memfree(void *ptr)
    {
            if (is_vmalloc_addr(ptr)) {
                    vfree(ptr);
                    return;
            }
            kfree(ptr);
    }

wouldn't that be much nicer? No need for that explicit flag, and you don't mess up an already way-too-ugly function even more.

Also, I do notice that you used GFP_NOFS, but you didn't use that for the vmalloc() thing. If there really are lock reentrancy reasons, you _could_ use __vmalloc(size, GFP_NOFS, PAGE_KERNEL). But since you've been using vmalloc() for a long time, I suspect GFP_KERNEL works fine. Yes/no?

Linus
[Cluster-devel] Re: [PATCH] SLOW_WORK: Move slow_work's proc file to debugfs
On Tue, 1 Dec 2009, Ingo Molnar wrote:
>
> Nice - thanks for doing this so quickly! It might sound like nitpicking
> but /proc ABIs tend to be a lot harder to get rid of than debugfs
> interfaces.

Ok, I applied it, so we'll not switch interfaces (even if they are just for debugging) across releases.

Btw David, for things like this, it's _really_ nice to use git rename detection. The diffstat (w/ summary) with rename detection looks like this:

 Documentation/slow-work.txt                      |    4 ++--
 include/linux/slow-work.h                        |    8 ++++----
 init/Kconfig                                     |    8 ++++----
 kernel/Makefile                                  |    2 +-
 kernel/{slow-work-proc.c => slow-work-debugfs.c} |    4 ++--
 kernel/slow-work.c                               |   18 ++++++++++--------
 kernel/slow-work.h                               |    6 +++---
 7 files changed, 28 insertions(+), 22 deletions(-)
 rename kernel/{slow-work-proc.c => slow-work-debugfs.c} (97%)

which makes it obvious that the changes were really just about renaming. Compare to the non-rename-aware one:

 Documentation/slow-work.txt |    4 +-
 include/linux/slow-work.h   |    8 +-
 init/Kconfig                |    8 +-
 kernel/Makefile             |    2 +-
 kernel/slow-work-debugfs.c  |  227 ++++++++++++++++++++++++++++++++
 kernel/slow-work-proc.c     |  227 --------------------------------
 kernel/slow-work.c          |   18 +++-
 kernel/slow-work.h          |    6 +-
 8 files changed, 253 insertions(+), 247 deletions(-)
 create mode 100644 kernel/slow-work-debugfs.c
 delete mode 100644 kernel/slow-work-proc.c

where you can kind of guess that slow-work-[proc|debugfs].c are largely the same, but you don't actually _see_ that it only has four lines of changes (and the patch then shows that the changes are just to comments).

Linus