Re: [Cluster-devel] [GIT PULL] gfs2 fix

2023-06-06 Thread Linus Torvalds
On Tue, Jun 6, 2023 at 5:48 AM Andreas Gruenbacher  wrote:
>
> - Don't get stuck writing page onto itself under direct I/O.

Btw, is there a test for this DIO case?

We've had the deadlock issue on the page lock (or for inode locks or
whatever) for normal IO when faulting in the same page that is written
to, and we have a pattern for solving that, and I think there are
filesystem tests that trigger this.

But the DIO pattern is a bit different, with the whole "invalidate
page cache" issue, and the fact that you send this patch now (rather
than years ago) makes me wonder about test coverage for this all?

Linus



Re: [Cluster-devel] [GIT PULL] gfs2 fix for v6.3-rc4

2023-03-23 Thread Linus Torvalds
On Thu, Mar 23, 2023 at 3:22 PM Andreas Grünbacher
 wrote:
>
> I've pushed the tag out now; should I resend the pull request?

No, all good, I have the changes,

Linus



Re: [Cluster-devel] [GIT PULL] gfs2 fix for v6.3-rc4

2023-03-23 Thread Linus Torvalds
On Thu, Mar 23, 2023 at 11:45 AM Andreas Gruenbacher
 wrote:
>
> From: Linus Torvalds 

Wat?

>   git://git.kernel.org/pub/scm/linux/kernel/git/gfs2/linux-gfs2.git 
> gfs2-v6.3-rc3-fix

-ENOSUCHTAG

> for you to fetch changes up to 260595b439776c473cc248f0de63fe78d964d849:

.. and no such commit available in any other tag or branch either.

Did you forget to push out?

Linus



Re: [Cluster-devel] [GIT PULL] gfs2 writepage fix

2023-01-22 Thread Linus Torvalds
On Sun, Jan 22, 2023 at 1:01 AM Andreas Gruenbacher  wrote:
>
> gfs2 writepage fix
>
> - Fix a regression introduced by commit "gfs2: stop using
>   generic_writepages in gfs2_ail1_start_one".

Hmm. I'm adding a few more people and linux-fsdevel to the reply,
because we had a number of filesystems remove writepages use lately,
including some that did it as a fix after the merge window.

Everybody involved seemed to claim it was just a no-brainer
switch-over, and I just took that on faith. Now it looks like that
wasn't true at least for gfs2 due to different semantics.

Maybe the gfs2 issue is purely because of how gfs2 did the conversion
(generic_writepages -> filemap_fdatawrite_wbc), but let's make people
look at their own cases.

 Linus



Re: [Cluster-devel] [GIT PULL] dlm updates for 6.0

2022-08-01 Thread Linus Torvalds
On Mon, Aug 1, 2022 at 8:56 AM David Teigland  wrote:
>
> At risk of compounding your trouble, I could have added that you can use
> the original merge and have me send the fixup.

Well, I only had something like 4 other merges on top, none of them
*that* complicated.

And I hadn't pushed out, so I just redid them.

I did notice your "v2" pull request fairly quickly, because I try to
do my pulls (when I have lots of them, like at the beginning of the
merge window) in some kind of "order of areas", so I was doing
filesystem and vfs stuff, and your v2 ended up showing up in that
bunch too, and I hadn't done too much.

I doubt I would have needed help with the conflicts, but I decided to
not even look at them, because even with them resolved it would just
have been ugly to have that pointless duplication from the rebase when
I could just start from scratch again.

But again: please don't rebase stuff you have already exposed to
others. It causes real issues. This was just one example of it.

And if you *do* have to rebase for a real technical reason ("Oh, that
was a disaster, it absolutely *must* be fixed"), please let involved
people know asap.

And a version number change is not a "huge disaster, must rebase".

   Linus



Re: [Cluster-devel] [GIT PULL] dlm updates for 6.0

2022-08-01 Thread Linus Torvalds
On Mon, Aug 1, 2022 at 7:43 AM David Teigland  wrote:
>
> (You can ignore the premature 5.20 pull request from some weeks ago.)

Gaah. That was the first thing I pulled this morning because it was
the earliest.

And apparently you've rebased, so I can't even just pull on top.

Gaah. Now I'll have to go back and re-do everything I've done this morning.

PLEASE don't do things like this to me. If you know you are going to
re-do a pull request, let me know early, so that I don't pull the old
one.

  Linus



Re: [Cluster-devel] [RFC 0/2] refcount: attempt to avoid imbalance warnings

2022-06-30 Thread Linus Torvalds
On Thu, Jun 30, 2022 at 6:59 AM Alexander Aring  wrote:
>
> I send this patch series as RFC because it was necessary to do a kref
> change after adding __cond_lock() to refcount_dec_and_lock()
> functionality.

Can you try something like this instead?

This is two separate patches - one for sparse, and one for the kernel.

This is only *very* lightly tested (ie I tested it on a single kernel
file that used refcount_dec_and_lock())

Linus
 linearize.c  | 24 ++--
 validation/context.c | 15 +++
 2 files changed, 37 insertions(+), 2 deletions(-)

diff --git a/linearize.c b/linearize.c
index d9aed61b..8dd005af 100644
--- a/linearize.c
+++ b/linearize.c
@@ -1537,6 +1537,8 @@ static pseudo_t linearize_call_expression(struct entrypoint *ep, struct expressi
 	add_one_insn(ep, insn);
 
 	if (ctype) {
+		struct basic_block *post_call = NULL;
+
 		FOR_EACH_PTR(ctype->contexts, context) {
 			int in = context->in;
 			int out = context->out;
@@ -1547,8 +1549,21 @@ static pseudo_t linearize_call_expression(struct entrypoint *ep, struct expressi
 				in = 0;
 			}
 			if (out < 0) {
-				check = 0;
-				out = 0;
+				struct basic_block *set_context;
+				if (retval == VOID) {
+					warning(expr->pos, "nonsensical conditional output context with no condition");
+					break;
+				}
+				if (check || in) {
+					warning(expr->pos, "nonsensical conditional input context");
+					break;
+				}
+				if (!post_call)
+					post_call = alloc_basic_block(ep, expr->pos);
+				set_context = alloc_basic_block(ep, expr->pos);
+				add_branch(ep, retval, set_context, post_call);
+				set_activeblock(ep, set_context);
+				out = -out;
 			}
 			context_diff = out - in;
 			if (check || context_diff) {
@@ -1560,6 +1575,11 @@ static pseudo_t linearize_call_expression(struct entrypoint *ep, struct expressi
 			}
 		} END_FOR_EACH_PTR(context);
 
+		if (post_call) {
+			add_goto(ep, post_call);
+			set_activeblock(ep, post_call);
+		}
+
 		if (ctype->modifiers & MOD_NORETURN)
 			add_unreachable(ep);
 	}
diff --git a/validation/context.c b/validation/context.c
index b9500dc7..f8962f19 100644
--- a/validation/context.c
+++ b/validation/context.c
@@ -10,6 +10,10 @@ static void r(void) __attribute__((context(1,0)))
 	__context__(-1);
 }
 
+// Negative output means "conditional positive output"
+extern int cond_get(void) __attribute((context(0,-1)));
+extern void nonsensical_cond_get(void) __attribute((context(0,-1)));
+
 extern int _ca(int fail);
 #define ca(fail) __cond_lock(_ca(fail))
 
@@ -19,6 +23,17 @@ static void good_paired1(void)
 	r();
 }
 
+static void good_conditional(void)
+{
+	if (cond_get())
+		r();
+}
+
+static void nonsensical_conditional(void)
+{
+	nonsensical_cond_get();
+}
+
 static void good_paired2(void)
 {
 	a();
 include/linux/compiler_types.h | 2 ++
 include/linux/refcount.h   | 6 +++---
 2 files changed, 5 insertions(+), 3 deletions(-)

diff --git a/include/linux/compiler_types.h b/include/linux/compiler_types.h
index d08dfcb0ac68..4f2a819fd60a 100644
--- a/include/linux/compiler_types.h
+++ b/include/linux/compiler_types.h
@@ -24,6 +24,7 @@ static inline void __chk_io_ptr(const volatile void __iomem *ptr) { }
 /* context/locking */
 # define __must_hold(x)	__attribute__((context(x,1,1)))
 # define __acquires(x)	__attribute__((context(x,0,1)))
+# define __cond_acquires(x) __attribute__((context(x,0,-1)))
 # define __releases(x)	__attribute__((context(x,1,0)))
 # define __acquire(x)	__context__(x,1)
 # define __release(x)	__context__(x,-1)
@@ -50,6 +51,7 @@ static inline void __chk_io_ptr(const volatile void __iomem *ptr) { }
 /* context/locking */
 # define __must_hold(x)
 # define __acquires(x)
+# define __cond_acquires(x)
 # define __releases(x)
 # define __acquire(x)	(void)0
 # define __release(x)	(void)0
diff --git a/include/linux/refcount.h b/include/linux/refcount.h
index b8a6e387f8f9..a62fcca97486 100644
--- a/include/linux/refcount.h
+++ b/include/linux/refcount.h
@@ -361,9 +361,9 @@ static inline void refcount_dec(refcount_t *r)
 
 extern __must_check bool refcount_dec_if_one(refcount_t *r);
 extern __must_check bool refcount_dec_not_one(refcount_t *r);
-extern __must_check bool refcount_dec_and_mutex_lock(refcount_t *r, struct mutex *lock);
-extern __must_check bool refcount_dec_and_lock(refcount_t *r, spinlock_t *lock);
+extern __must_check bool refcount_dec_and_mutex_lock(refcount_t *r, struct mutex *lock) __cond_acquires(lock);
+extern __must_check bool refcount_dec_and_lock(refcount_t *r, spinlock_t *lock) __cond_acquires(lock);
 extern __must_check bool refcount_dec_and_lock_irqsave(refcount_t *r,
 		   spinlock_t *lock,
-		   unsigned long *flags);
+		   unsigned long *flags) __cond_acquires(lock);
 #endif /* _LINUX_REFCOUNT_H */


Re: [Cluster-devel] sparse warnings related to kref_put_lock() and refcount_dec_and_lock()

2022-06-28 Thread Linus Torvalds
On Tue, Jun 28, 2022 at 1:58 AM Luc Van Oostenryck
 wrote:
>
> I would certainly not recommend this but ...
> if it's OK to cheat and lie then you can do:
> +   bool refcount_dec_and_lock(refcount_t *r, spinlock_t *lock) 
> __acquires(lock);

Actually, we have "__cond_lock()" in the kernel to actually document
that something takes a lock only in certain conditions.

It needs to be declared as a macro in the header file, with this as an example:

   #define raw_spin_trylock(lock)  __cond_lock(lock, _raw_spin_trylock(lock))

ie that says that "raw_spin_trylock() takes 'lock' when
_raw_spin_trylock() returned true".

That then makes it possible to write code like

if (raw_spin_trylock(lock)) {
	..
	raw_spin_unlock(lock);
}

and sparse will get the nesting right.

But that does *not* solve the issue of people then writing this as

locked = raw_spin_trylock(lock);
.. do_something ..
if (locked)
	raw_spin_unlock(lock);

and you end up with the same thing again.

Anyway, for things like refcount_dec_and_lock(), I suspect that first
pattern should work, because you really shouldn't have that second
pattern. If you just decremented without any locking, the actions are
completely different from the "oh, got the last decrement and now it's
locked and I need to free things" or whatever.

Linus



Re: [Cluster-devel] [GIT PULL] gfs2 fixes

2022-05-13 Thread Linus Torvalds
On Fri, May 13, 2022 at 2:07 PM Andreas Gruenbacher  wrote:
>
> Would you like to still pull these fixes for v5.18, or should we send them in
> the next merge window?

I've pulled them. Any filesystem corruption issue sounds scary enough
that it doesn't sound very sane to delay fixes.

Linus



Re: [Cluster-devel] [GIT PULL] gfs2 fix

2022-05-03 Thread Linus Torvalds
On Tue, May 3, 2022 at 2:35 PM Andreas Gruenbacher  wrote:
>
> More testing still ongoing, but the following patch seems to fix the
> data corruption.

Fingers crossed.

> +   truncate_pagecache_range(inode, hstart, hend - 1);
> +   if (hstart < hend)
> +   punch_hole(ip, hstart, hend - hstart);

Why doesn't that "hstart < hend" condition cover both the truncate and
the hole punch?

 Linus



Re: [Cluster-devel] [GIT PULL] gfs2 fix

2022-05-03 Thread Linus Torvalds
On Tue, May 3, 2022 at 9:41 AM Andreas Gruenbacher  wrote:
>
> That's happening in iomap_file_buffered_write() and iomap_iter():
>
> while ((ret = iomap_iter(&iter, ops)) > 0)
> iter.processed = iomap_write_iter(&iter, i);
>
> Here, iomap_write_iter() returns how much progress it has made, which
> is stored in iter.processed, and iomap_iter() -> iomap_iter_advance()
> then updates iter.pos and iter.len based on iter.processed.

Ahh. I even had that file open in my editor in another window, but missed it.

So much for that theory.

  Linus



Re: [Cluster-devel] [GIT PULL] gfs2 fix

2022-05-02 Thread Linus Torvalds
On Mon, May 2, 2022 at 11:31 AM Linus Torvalds
 wrote:
>
> NOTE! This patch is entirely untested. I also didn't actually yet go
> look at what gfs2 does when 'bytes' and 'copied' are different.

Oh, it's a lot of generic iomap_write_end() code, so I guess it's just
as well that I brought in the iomap people.

And the iomap code has many different cases. Some of them do

if (unlikely(copied < len && !folio_test_uptodate(folio)))
return 0;

to force the whole IO to be re-done (looks sane - that's the "the
underlying folio wasn't uptodate, because we expected the write to
make it so").

And that might not have happened before, but it looks like gfs2 does
actually try to deal with that case.

But since Andreas said originally that the IO wasn't aligned, I don't
think that "not uptodate" case is what is going on, and it's more
about some "partial write in the middle of a buffer succeeded"

And the code also has things like

if (ret < len)
iomap_write_failed(iter->inode, pos, len);

which looks very wrong - it's not that the write failed, it's just
incomplete because it was done with page faults disabled. It seems to
try to do some page cache truncation based on the original 'len', but
not taking the successful part into account. Which all sounds
horrifically wrong.

But I don't know the code well enough to really judge. It just makes
me uncomfortable, and I do suspect this code may be quite buggy if the
copy of the full 'len' doesn't succeed.

Again, the patch I sent only _hides_ any issues and makes them
practically impossible to see. It doesn't really _fix_ anything, since
- as mentioned - regardless of fault_in_iov_iter_readable()
succeeding, racing with page-out could then cause the later
copy_page_from_iter_atomic() to have a partial copy anyway.

And hey, maybe there's something entirely different going on, and my
"Heureka! It might be explained by that partial write_end that
generally didn't happen before" is only my shouting at windmills.

 Linus



Re: [Cluster-devel] [GIT PULL] gfs2 fix

2022-04-28 Thread Linus Torvalds
On Thu, Apr 28, 2022 at 10:09 AM Linus Torvalds
 wrote:
>
> I'll look at that copy_page_to_iter_iovec() some more regardless, but
> doing that "let's double-check it's not something else" would be good.

Oh, and as I do that, it strikes me: What platform do you see the problem on?

Because the code for that function is very different if HIGHMEM is
enabled, so if you see this on x86-32 but not x86-64, for example,
that is relevant.

I *assume* nobody uses x86-32 any more, but just checking...

   Linus



Re: [Cluster-devel] [GIT PULL] gfs2 fix

2022-04-28 Thread Linus Torvalds
On Thu, Apr 28, 2022 at 6:27 AM Andreas Gruenbacher  wrote:
>
> The data corruption we've been getting unfortunately didn't have to do
> with lock contention (we already knew that); it still occurs. I'm
> running out of ideas on what to try there.

Hmm.

I don't see the bug, but I do have a suggestion on something to try.

In particular, you said the problem started with commit 00bfe02f4796
("gfs2: Fix mmap + page fault deadlocks for buffered I/O").

And to me, I see two main things that are going on

 (a) the obvious "calling generic IO functions with pagefault disabled" thing

 (b) the "allow demotion" thing

And I wonder if you could at least pinpoint which of the two cases it is
that triggers it.

So I'd love to see you try three things:

 (1) just remove the "allow demotion" cases.

 This will re-introduce the deadlock the commit is trying to fix,
but that's such a special case that I assume you can run your
test-suite that shows the problem even without that fix in place?

 This would just pinpoint whether it's due to some odd locking issue or not.

Honestly, from how you describe the symptoms, I don't think (1) is the
cause, but I think making sure is good.

It sounds much more likely that it's one of those generic vfs
functions that screws up when a page fault happens and it gets a
partial result instead of handling the fault.

Which gets us to

 (2) remove the pagefault_disable/enable() around just the
generic_file_read_iter() case in gfs2_file_read_iter().

and

 (3) finally, remove the pagefault_disable/enable() around the
iomap_file_buffered_write() case in gfs2_file_buffered_write()

Yeah, yeah, you say it's just the read that fails, but humor me on
(3), just in case it's an earlier write in your test-suite and the
read just then uncovered it.

But I put it as (3) so that you'd do the obvious (2) case first, and
narrow it down (ie if (1) still shows the bug, then do (2), and if
that fixes the bug it will be fairly well pinpointed to
generic_file_read_iter().

Looking around, gfs2 is the only thing that obviously calls
generic_file_read_iter() with page faults disabled, so it does smell
like filemap_read() might have some issue, but the only thing that
does is basically that

copied = copy_folio_to_iter(folio, offset, bytes, iter);

which should just become copy_page_to_iter_iovec(), which you'd hope
would get things right.

But it would be good to just narrow things down a bit.

I'll look at that copy_page_to_iter_iovec() some more regardless, but
doing that "let's double-check it's not something else" would be good.

 Linus



Re: [Cluster-devel] [GIT PULL] gfs2 fix

2022-04-27 Thread Linus Torvalds
On Wed, Apr 27, 2022 at 3:20 PM Linus Torvalds
 wrote:
>
> So I really think
>
>  (a) you are mis-reading the standard by attributing too strong logic
> to paperwork that is English prose and not so exact
>
>  (b) documenting Linux as not doing what you are mis-reading it for is
> only encouraging others to mis-read it too
>
> The whole "arbitrary writes have to be all-or-nothing wrt all other
> system calls" is simply not realistic, and has never been. Not just
> not in Linux, but in *ANY* operating system that POSIX was meant to
> describe.

Side note: a lot of those "atomic" things in that documentation have
come from a history of signal handling atomicity issues, and from all
the issues people had with (a) user-space threading implementations
and (b) emulation layers from non-Unixy environments.

So when they say that things like "rename()" has to be all-or-nothing,
it's to clarify that you can't emulate it as a "link and delete
original" kind of operation (which old UNIX *did* do) and claim to be
POSIX.

Because while the end result of rename() and link()+unlink() might be
similar, people did rely on that whole "use rename as a way to create
an atomic marker in the filesystem" (which is a very traditional UNIX
pattern).

So "rename()" has to be atomic, and the legacy behavior of link+unlink
is not valid in POSIX.

Similarly, you can't implement "pread()" as a "lseek+read+lseek back",
because that doesn't work if somebody else is doing another "pread()"
on the same file descriptor concurrently.

Again, people *did* implement exactly those kinds of implementations
of "pread()", and yes, they were broken for both signals and for
threading.

So there's "atomicity" and then there is "atomicity".

That "all or nothing" can be a very practical thing to describe
*roughly* how it must work on a higher level, or it can be a
theoretical "transactional" thing that works literally like a database
where the operation happens in full and you must not see any
intermediate state.

And no, "write()" and friends have never ever been about some
transactional operation where you can't see how the file grows as it
is being written to. That kind of atomicity has simply never existed,
not even in theory.

So when you see POSIX saying that a "read()" system call is "atomic",
you should *not* see it as a transaction thing, but see it in the
historical context of "people used to do threading libraries in user
space, and since they didn't want a big read() to block all other
threads, they'd split it up into many smaller reads and now another
thread *also* doing 'read()' system calls would see the data it read
being not one contiguous region, but multiple regions where the file
position changed in the middle".

Similarly, a "read()" system call will not be interrupted by a signal
in the middle, where the signal handler would do a "lseek()" or
another "read()", and now the original "read()" data suddenly is
affected.

That's why things like that whole "f_pos is atomic" is a big deal.

Because there literally were threading libraries (and badly emulated
environments) where that *WASN'T* the case, and _that_ is why POSIX
then talks about it.

So think of POSIX not as some hard set of "this is exactly how things
work and we describe every detail".

Instead, treat it a bit like historians treat Herodotus - interpreting
his histories by taking the issues of the time into account.  POSIX is
trying to clarify and document the problems of the time it was
written, and taking other things for granted.

 Linus



Re: [Cluster-devel] [GIT PULL] gfs2 fix

2022-04-27 Thread Linus Torvalds
On Wed, Apr 27, 2022 at 2:26 PM Andreas Gruenbacher  wrote:
>
> Well, POSIX explicitly mentions those atomicity expectations, e.g.,
> for read [1]:

Yes. I'm aware. And my point is that we've never done _that_ kind of atomicity.

It's also somewhat ambiguous what it actually means, since what it
then talks about is "all bytes that started out together ends
together" and "interleaving".

That all implies that it's about the *position* of the reads and
writes being atomic, not the *data* of the reads and writes.

That, btw, was something we honored even before we had the locking
around f_pos accesses - a read or write system call would get its own
local *copy* the file position, the read or write would then do that
IO based on that copied position - so that things that "started out
together ends together" - and then after the operation is done it
would *update* the file position atomically.

Note that that is exactly so that data would end up "together". But it
would mean that two concurrent reads using the same file position
might read the *same* area of the file.

Which still honors that "the read is atomic wrt the range", but
obviously the actual values of "f_pos" is basically random after the
read (ie is it the end of the first read, or the end of the second
read?).

The same paragraph also explicitly mentions pipes and FIFOs, despite
an earlier paragraph dismissing them, which is all just a sign of
things being very confused.

Anyway, I'm not objecting very strenuously to making it very clear in
some documentation that this "data atomicity" is not what Linux has
ever done. If you do overlapping IO, you get what you deserve.

But I do have objections.

On one hand, it's not all that different from some of the other notes
we have in the man-pages (ie documenting that whole "just under 2GB"
limit on the read size, although that's actually using the wrong
constant: it's not 0x7ffff000 bytes, it's MAX_RW_COUNT, which is
"INT_MAX & PAGE_MASK", and that constant in the man-page is as such
only true on a system with 4kB page sizes)

BUT! I'm 100% convinced that NOBODY HAS EVER given the kind of
atomicity guarantees that you would see from reading that document as
a language-lawyer.

For example, that section "2.9.7 Thread Interactions with Regular File
Operations" says that "fstat()" is atomic wrt "write()", and that you
should see "all or nothing".

I *GUARANTEE* that no operating system ever has done that, and I
further claim that reading it the way you read it is not only against
reality, it's against sanity.

Example: if I do a big write to a file that I just created, do you
really want "fstat()" in another thread or process to not even be able
to see how the file grows as the write happens?

It's not what anybody has *EVER* done, I'm pretty sure.

So I really think

 (a) you are mis-reading the standard by attributing too strong logic
to paperwork that is English prose and not so exact

 (b) documenting Linux as not doing what you are mis-reading it for is
only encouraging others to mis-read it too

The whole "arbitrary writes have to be all-or-nothing wrt all other
system calls" is simply not realistic, and has never been. Not just
not in Linux, but in *ANY* operating system that POSIX was meant to
describe.

And equally importantly: if some crazy person were to actually try to
implement such "true atomicity" things, the end result would be
objectively worse. Because you literally *want* to see a big write()
updating the file length as the write happens.

The fact that the standard then doesn't take those kinds of details
into account is simply because the standard isn't meant to be read as
a language lawyer, but as a "realistically .."

 Linus



Re: [Cluster-devel] [GIT PULL] gfs2 fix

2022-04-27 Thread Linus Torvalds
On Wed, Apr 27, 2022 at 12:41 PM Andreas Gruenbacher
 wrote:
>
> I wonder if this could be documented in the read and write manual
> pages. Or would that be asking too much?

I don't think it would be asking too much, since it's basically just
describing what Linux has always done in all the major filesystems.

Eg look at filemap_read(), which is basically the canonical read
function, and note how it doesn't take a single lock at that level.

We *do* have synchronization at a page level, though, ie we've always
had that page-level "uptodate" bit, of course (ok, so "always" isn't
true - back in the distant past it was the 'struct buffer_head' that
was the synchronization point).

That said, even that is not synchronizing against "new writes", but
only against "new creations" (which may, of course, be writers, but is
equally likely to be just reading the contents from disk).

That said:

 (a) different filesystems can and will do different things.

Not all filesystems use filemap_read() at all, and even the ones that
do often have their own wrappers. Such wrappers *can* do extra
serialization, and have their own rules. But ext4 does not, for
example (see ext4_file_read_iter()).

And as mentioned, I *think* XFS honors that old POSIX rule for
historical reasons.

 (b) we do have *different* locking

for example, we these days do actually serialize properly on the
file->f_pos, which means that a certain *class* of read/write things
are atomic wrt each other, because we actually hold that f_pos lock
over the whole operation and so if you do file reads and writes using
the same file descriptor, they'll be disjoint.

That, btw, hasn't always been true. If you had multiple threads using
the same file pointer, I think we used to get basically random
results. So we have actually strengthened our locking in this area,
and made it much better.

But note how even if you have the same file descriptor open, and then
do pread/pwrite, those can and will happen concurrently.

And mmap accesses and modifications are obviously *always* concurrent,
even if the fault itself - but not the accesses - might end up being
serialized due to some filesystem locking implementation detail.

End result: the exact serialization is complex, depends on the
filesystem, and is just not really something that should be described
or even relied on (eg that f_pos serialization is something we do
properly now, but didn't necessarily do in the past, so ..)

Is it then worth pointing out one odd POSIX rule that basically nobody
but some very low-level filesystem people have ever heard about, and
that no version of Linux has ever conformed to in the main default
filesystems, and that no user has ever cared about?

 Linus



Re: [Cluster-devel] [GIT PULL] gfs2 fix

2022-04-27 Thread Linus Torvalds
On Wed, Apr 27, 2022 at 5:29 AM Andreas Gruenbacher  wrote:
>
> Regular (buffered) reads and writes are expected to be atomic with
> respect to each other.

Linux has actually never honored that completely broken POSIX
requirement, although I think some filesystems (notably XFS) have
tried.

It's a completely broken concept. It's not possible to honor atomicity
with mmap(), and nobody has *ever* cared.

And it causes huge amounts of problems and basically makes any sane
locking entirely impossible.

The fact that you literally broke regular file writes in ways that are
incompatible with (much MUCH more important) POSIX file behavior to
try to get that broken read/write atomicity is only one example among
many for why that alleged rule just has to be ignored.

We do honor the PIPE_BUF atomicity on pipes, which is a completely
different kind of atomicity wrt read/write, and doesn't have the
fundamental issues that arbitrary regular file reads/writes have.

There is absolutely no sane way to do that file atomicity wrt
arbitrary read/write calls (*), and you shouldn't even try.

That rule needs to be forgotten about, and buried 6ft deep.

So please scrub any mention of that idiotic rule from documentation,
and from your brain.

And please don't break "partial write means disk full or IO error" due
to trying to follow this broken rule, which was apparently what you
did.

Because that "regular file read/write is done in full" is a *MUCH*
more important rule, and there is a shitton of applications that most
definitely depend on *that* rule.

Just go to debian code search, and look for

   "if (write("

and you'll get thousands of hits, and on the first page of hits 9 out
of 10 of the hits are literally about that "partial write is an
error", eg code like this:

if (write(fd,&triple,sizeof(triple)) != sizeof(triple))
reporterr(1,NULL);

from libreoffice.

Linus

(*) Yeah, if you never care about performance(**) of mixed read/write,
and you don't care about mmap, and you have no other locking issues,
it's certainly possible. The old rule came about from original UNIX
literally taking an inode lock around the whole IO access, because
that was simple, and back in the days you'd never have multiple
concurrent readers/writers anyway.

(**) It's also instructive how O_DIRECT literally throws that rule
away, and then some direct-IO people said for years that direct-IO is
superior and used this as one of their arguments. Probably the same
people who thought that "oh, don't report partial success", because we
can't deal with it.



Re: [Cluster-devel] [GIT PULL] gfs2 fix

2022-04-26 Thread Linus Torvalds
On Tue, Apr 26, 2022 at 2:28 PM Andreas Gruenbacher  wrote:
>
> Btrfs has a comment in that place that reads:
>
> /* No increment (+=) because iomap returns a cumulative value. */

What a truly horrid interface. But you are triggering repressed
childhood memories:

> That's so that we can complete the tail of an asynchronous write
> asynchronously after a failed page fault and subsequent fault-in.

yeah, that makes me go "I remember something like that".

> That would be great, but applications don't seem to be able to cope
> with short direct writes, so we must turn partial failure into total
> failure here. There's at least one xfstest that checks for that as
> well.

What a complete crock. You're telling me that you did some writes, but
then you can't tell user space that writes happened because that would
confuse things.

Direct-IO is some truly hot disgusting garbage.

Happily it's only used for things like databases that nobody sane
would care about anyway.

Anyway, none of that makes any sense, since you do this:

ret = gfs2_file_direct_write(iocb, from, &gh);
if (ret < 0 || !iov_iter_count(from))
goto out_unlock;

iocb->ki_flags |= IOCB_DSYNC;
buffered = gfs2_file_buffered_write(iocb, from, &gh);

so you're saying that the direct write will never partially succeed,
but then in gfs2_file_write_iter() it very much looks like it's
falling back to buffered writes for that case.

Hmm. Very odd.

I assume this is all due to that

/* Silently fall back to buffered I/O when writing beyond EOF */
if (iocb->ki_pos + iov_iter_count(from) > i_size_read(&ip->i_inode))

thing, so this all makes some perverse kind of sense, but it still
looks like this code just needs some serious serious commentary.

So you *can* have a partial write if you hit the end of the file, and
then you'll continue that partial write with the buffered code.

But an actual _failure_ will not do that, and instead return an error
even if the write was partially done.

But then *some* failures aren't failures at all, and without any
comments do this

if (ret == -ENOTBLK)
ret = 0;


And remind me again - this all is meant for applications that
supposedly care about consistency on disk?

And the xfs tests actually have a *test* for that case, to make sure
that nobody can sanely *really* know how much of the write succeeded
if it was a DIO write?

Gotcha.

> > The reason I think I'm full of sh*t is that you say that the problem
> > occurs in gfs2_file_buffered_write(), not in that
> > gfs2_file_direct_write() case.
>
> Right, we're having that issue with buffered writes.

I have to say, compared to all the crazy things I see in the DIO path,
the buffered write path actually looks almost entirely sane.

Of course, gfs2_file_read_iter() counts how many bytes it has read in
a variable called 'written', and gfs2_file_buffered_write() counts the
bytes it has written in a variable called 'read', so "entirely sane"
is all very very relative.

I'm sure there's some good reason (job security?) for all this insanity.

But I now have to go dig my eyes out with a rusty spoon.

But before I do that, I have one more question (I'm going to regret
this, aren't I?):

In gfs2_file_buffered_write(), when it has done that
fault_in_iov_iter_readable(), and it decides that that succeeded, it
does

if (gfs2_holder_queued(gh))
goto retry_under_glock;
if (read && !(iocb->ki_flags & IOCB_DIRECT))
goto out_uninit;
goto retry;

so if it still has that lock (if I understand correctly), it will always retry.

But if it *lost* the lock, it will retry only if was a direct write,
and return a partial success for a regular write() rather than
continue the write.

Whaa? I'm assuming that this is more of that "direct writes have to
succeed fully or not at all", but according to POSIX *regular* writes
should also succeed fully, unless some error happens.

Losing the lock doesn't sound like an error to me.

And there are a lot of applications that do assume "write to a regular
file didn't complete fully means that the disk is full" etc. Because
that's the traditional meaning.

This doesn't seem to be related to any data corruption issue, but it does smell.

 Linus



Re: [Cluster-devel] [GIT PULL] gfs2 fix

2022-04-26 Thread Linus Torvalds
On Tue, Apr 26, 2022 at 7:54 AM Andreas Gruenbacher  wrote:
>
> Also, note that we're fighting with a rare case of data corruption that
> seems to have been introduced by commit 00bfe02f4796 ("gfs2: Fix mmap +
> page fault deadlocks for buffered I/O"; merged in v5.16).  The
> corruption seems to occur in gfs2_file_buffered_write() when
> fault_in_iov_iter_readable() is involved.  We do end up with data that's
> written at an offset, as if after a fault-in, the position in the iocb
> got out of sync with the position in the iov_iter.  The user buffer the
> iov_iter points at isn't page aligned, so the corruption also isn't page
> aligned.  The code all seems correct and we couldn't figure out what's
> going on so far.

So this may be stupid, but looking at gfs2_file_direct_write(), I see
what looks like two different obvious bugs.

This looks entirely wrong:

if (ret > 0)
read = ret;

because this code is *repeated*.

I think it should use "read += ret;" so that if we have a few
successful reads, they add up.

And then at the end:

if (ret < 0)
return ret;
return read;

looks wrong too, since maybe the *last* iteration failed, but an
earlier succeeded, so I think it should be

if (read)
return read;
return ret;

But hey, what do I know. I say "looks like obvious bugs", but I don't
really know the code, and I may just be completely full of sh*t.

The reason I think I'm full of sh*t is that you say that the problem
occurs in gfs2_file_buffered_write(), not in that
gfs2_file_direct_write() case.

And the *buffered* case seems to get both of those "obvious bugs" right.

So I must be wrong, but I have to say, that looks odd to me.

Now hopefully somebody who knows the code will explain to me why I'm
wrong, and in the process go "Duh, but.." and see what's up.

  Linus



Re: [Cluster-devel] [GIT PULL] gfs2 fixes

2022-02-11 Thread Linus Torvalds
On Fri, Feb 11, 2022 at 9:05 AM Andreas Gruenbacher  wrote:
>
> * Revert debug commit that causes unexpected data corruption.

Well, apparently not just unexpected, but unexplained too.

That's a bit worrisome. It sounds like the corruption cause is still
there, just hidden by the lack of __cond_resched()?

Linus



Re: [Cluster-devel] [PATCH v8 00/17] gfs2: Fix mmap + page fault deadlocks

2021-10-29 Thread Linus Torvalds
On Fri, Oct 29, 2021 at 10:50 AM Catalin Marinas
 wrote:
>
> First of all, a uaccess in interrupt should not force such signal as it
> had nothing to do with the interrupted context. I guess we can do an
> in_task() check in the fault handler.

Yeah. It ends up being similar to the thread flag in that you still
end up having to protect against NMI and other users of asynchronous
page faults.

So the suggestion was more of a "mindset" difference and modified
version of the task flag rather than anything fundamentally different.

> Second, is there a chance that we enter the fault-in loop with a SIGSEGV
> already pending? Maybe it's not a problem, we just bail out of the loop
> early and deliver the signal, though unrelated to the actual uaccess in
> the loop.

If we ever run in user space with a pending per-thread SIGSEGV, that
would already be a fairly bad bug. The intent of "force_sig()" is not
only to make sure you can't block the signal, but also that it targets
the particular thread that caused the problem: unlike other random
"send signal to process", a SIGSEGV caused by a bad memory access is
really local to that _thread_, not the signal thread group.

So somebody else sending a SIGSEGV asynchronously is actually very
different - it goes to the thread group (although you can specify
individual threads too - but once you do that you're already outside
of POSIX).

That said, the more I look at it, the more I think I was wrong. I
think the "we have a SIGSEGV pending" could act as the per-thread
flag, but the complexity of the signal handling is probably an
argument against it.

Not because a SIGSEGV could already be pending, but because so many
other situations could be pending.

In particular, the signal code won't send new signals to a thread if
that thread group is already exiting. So another thread may have
already started the exit and core dump sequence, and is in the process
of killing the shared signal threads, and if one of those threads is
now in the kernel and goes through the copy_from_user() dance, that
whole "thread group is exiting" will mean that the signal code won't
add a new SIGSEGV to the queue.

So the signal could conceptually be used as the flag to stop looping,
but it ends up being such a complicated flag that I think it's
probably not worth it after all. Even if it semantically would be
fairly nice to use pre-existing machinery.

Could it be worked around? Sure. That kernel loop probably has to
check for fatal_signal_pending() anyway, so it would all work even in
the presence of the above kinds of issues. But just the fact that I
went and looked at just how exciting the signal code is made me think
"ok, conceptually nice, but we take a lot of locks and we do a lot of
special things even in the 'simple' force_sig() case".

> Third is the sigcontext.pc presented to the signal handler. Normally for
> SIGSEGV it points to the address of a load/store instruction and a
> handler could disable MTE and restart from that point. With a syscall we
> don't want it to point to the syscall place as it shouldn't be restarted
> in case it copied something.

I think this is actually independent of the whole "how to return
errors". We'll still need to return an error from the system call,
even if we also have a signal pending.

  Linus



Re: [Cluster-devel] [PATCH v8 00/17] gfs2: Fix mmap + page fault deadlocks

2021-10-28 Thread Linus Torvalds
On Thu, Oct 28, 2021 at 2:21 PM Catalin Marinas  wrote:
>
> They do look fairly similar but we should have the information in the
> fault handler to distinguish: not a page fault (pte permission or p*d
> translation), in_task(), user address, fixup handler. But I agree the
> logic looks fragile.

So thinking about this a bit more, I think I have a possible
suggestion for how to handle this..

The pointer color fault (or whatever some other architecture may do to
generate sub-page faults) is not only not recoverable in the sense
that we can't fix it up, it also ends up being a forced SIGSEGV (ie it
can't be blocked - it has to either be caught or cause the process to
be killed).

And the thing is, I think we could just make the rule be that kernel
code that has this kind of retry loop with fault_in_pages() would
force an EFAULT on a pending SIGSEGV.

IOW, the pending SIGSEGV could effectively be exactly that "thread flag".

And that means that fault_in_xyz() wouldn't need to worry about this
situation at all: the regular copy_from_user() (or whatever flavor it
is - to/from/iter/whatever) would take the fault. And if it's a
regular page fault,. it would act exactly like it does now, so no
changes.

If it's a sub-page fault, we'd just make the rule be that we send a
SIGSEGV even if the instruction in question has a user exception
fixup.

Then we just need to add the logic somewhere that does "if active
pending SIGSEGV, return -EFAULT".

Of course, that logic might be in fault_in_xyz(), but it migth also be
a separate function entirely.

So this does effectively end up being a thread flag, but it's also
slightly more than that - it's that a sub-page fault from kernel mode
has semantics that a regular page fault does not.

The whole "kernel access doesn't cause SIGSEGV, but returns -EFAULT
instead" has always been an odd and somewhat wrong-headed thing. Of
course it should cause a SIGSEGV, but that's not how Unix traditionall
worked. We would just say "color faults always raise a signal, even if
the color fault was triggered in a system call".

(And I didn't check: I say "SIGSEGV" above, but maybe the pointer
color faults are actually SIGBUS? Doesn't change the end result).

 Linus



Re: [Cluster-devel] [PATCH v8 00/17] gfs2: Fix mmap + page fault deadlocks

2021-10-27 Thread Linus Torvalds
On Wed, Oct 27, 2021 at 12:13 PM Catalin Marinas
 wrote:
>
> As an alternative, you mentioned earlier that a per-thread fault status
> was not feasible on x86 due to races. Was this only for the hw poison
> case? I think the uaccess is slightly different.

It's not x86-specific, it's very generic.

If we set some flag in the per-thread status, we'll need to be careful
about not overwriting it if we then have a subsequent NMI that _also_
takes a (completely unrelated) page fault - before we then read the
per-thread flag.

Think 'perf' and fetching backtraces etc.

Note that the NMI page fault can easily also be a pointer coloring
fault on arm64, for exactly the same reason that whatever original
copy_from_user() code was. So this is not a "oh, pointer coloring
faults are different". They have the same re-entrancy issue.

And both the "pagefault_disable" and "fault happens in interrupt
context" cases are also the exact same 'faulthandler_disabled()'
thing. So even at fault time they look very similar.

So we'd have to have some way to separate out only the one we care about.

   Linus



Re: [Cluster-devel] [PATCH v8 00/17] gfs2: Fix mmap + page fault deadlocks

2021-10-26 Thread Linus Torvalds
On Tue, Oct 26, 2021 at 11:50 AM Linus Torvalds
 wrote:
>
> Because for _most_ cases of "copy_to/from_user()" and friends by far,
> the only thing we look for is "zero for success".

Gaah. Looking around, I almost immediately found some really odd
exceptions to this.

Like parse_write_buffer_into_params() in amdgpu_dm_debugfs.c, which does

r = copy_from_user(wr_buf_ptr, buf, wr_buf_size);

/* r is bytes not be copied */
if (r >= wr_buf_size) {
DRM_DEBUG_DRIVER("user data not be read\n");
return -EINVAL;
}

and allows a partial copy to justy silently succeed, because all the
callers have pre-cleared the wr_buf_ptr buffer.

I have no idea why the code does that - it seems to imply that user
space could give an invalid 'size' parameter and mean to write only
the part that didn't succeed.

Adding AMD GPU driver people just to point out that this code not only
has odd whitespace, but that the pattern for "couldn't copy from user
space" should basically always be

if (copy_from_user(wr_buf_ptr, buf, wr_buf_size))
return -EFAULT;

because if user-space passes in a partially invalid buffer, you
generally really shouldn't say "ok, I got part of it and will use that
part"

There _are_ exceptions. We've had situations where user-space only
passes in the pointer to the buffer, but not the size. Bad interface,
but it historically happens for the 'mount()' system call 'data'
pointer. So then we'll copy "up to a page size".

Anyway, there are clearly some crazy users, and converting them all to
also check for negative error returns might be more painful than I
thought it would be.

 Linus



Re: [Cluster-devel] [PATCH v8 00/17] gfs2: Fix mmap + page fault deadlocks

2021-10-26 Thread Linus Torvalds
On Tue, Oct 26, 2021 at 11:24 AM Catalin Marinas
 wrote:
>
> While more intrusive, I'd rather change copy_page_from_iter_atomic()
> etc. to take a pointer where to write back an error code.

I absolutely hate this model.

The thing is, going down that rat-hole, you'll find that you'll need
to add it to *all* the "copy_to/from_user()" cases, which isn't
acceptable. So then you start doing some duplicate versions with
different calling conventions, just because of things like this.

So no, I really don't want a "pass down a reference to an extra error
code" kind of horror.

That said, the fact that these sub-page faults are always
non-recoverable might be a hint to a solution to the problem: maybe we
could extend the existing return code with actual negative error
numbers.

Because for _most_ cases of "copy_to/from_user()" and friends by far,
the only thing we look for is "zero for success".

We could extend the "number of bytes _not_ copied" semantics to say
"negative means fatal", and because there are fairly few places that
actually look at non-zero values, we could have a coccinelle script
that actually marks those places.

End result: no change in calling conventions, no change to most users,
and the (relatively few) cases where we look at the "what about
partial results", we just add a

 .. existing code ..
 ret = copy_from_user(..);
+if (ret < 0)
+break;  // or whatever "fatal error" situation
 .. existing  code ..

kind of thing that just stops the re-try.

(The coccinelle script couldn't actually do that, but it could add
some comment marker or something so that it's easy to find and then
manually fix up the places it finds).

 Linus



Re: [Cluster-devel] [PATCH v8 00/17] gfs2: Fix mmap + page fault deadlocks

2021-10-22 Thread Linus Torvalds
On Fri, Oct 22, 2021 at 8:06 AM Catalin Marinas  wrote:
>
> Probing only the first byte(s) in fault_in() would be ideal, no need to
> go through all filesystems and try to change the uaccess/probing order.

Let's try that. Or rather: probing just the first page - since there
are users like that btrfs ioctl, and the direct-io path.

  Linus



Re: [Cluster-devel] [RFC][arm64] possible infinite loop in btrfs search_ioctl()

2021-10-21 Thread Linus Torvalds
On Thu, Oct 21, 2021 at 4:42 AM Andreas Gruenbacher  wrote:
>
> But probing the entire memory range in fault domain granularity in the
> page fault-in functions still doesn't actually make sense. Those
> functions really only need to guarantee that we'll be able to make
> progress eventually. From that point of view, it should be enough to
> probe the first byte of the requested memory range

That's probably fine.

Although it should be more than one byte - "copy_from_user()" might do
word-at-a-time optimizations, so you could have an infinite loop of

 (a) copy_from_user() fails because the chunk it tried to get failed partly

 (b) fault_in() probing succeeds, because the beginning part is fine

so I agree that the fault-in code doesn't need to do the whole area,
but it needs to at least cover some minimal chunk, to
handle the situation where the copy_to/from_user requires more than a
single byte.

 Linus



Re: [Cluster-devel] [PATCH v8 00/17] gfs2: Fix mmap + page fault deadlocks

2021-10-21 Thread Linus Torvalds
On Wed, Oct 20, 2021 at 12:44 PM Catalin Marinas
 wrote:
>
> However, with MTE doing both get_user() every 16 bytes and
> gup can get pretty expensive.

So I really think that anything that is performance-critical had
better only do the "fault_in_write()" code path in the cold error path
where you took a page fault.

IOW, the loop structure should be


 repeat:
take_locks();
pagefault_disable();
ret = do_things_with_locks();
pagefault_enable();
release_locks();

// Failure path?
if (unlikely(ret == -EFAULT)) {
if (fault_in_writable(..))
goto repeat;
return -EFAULT;
}

rather than have it be some kind of "first do fault_in_writable(),
then do the work" model.

So I wouldn't worry too much about the performance concerns. It simply
shouldn't be a common or hot path.

And yes, I've seen code that does that "fault_in_xyz()" before the
critical operation that cannot take page faults, and does it
unconditionally.

But then it isn't the "fault_in_xyz()" that should be blamed if it is
slow, but the caller that does things the wrong way around.

Linus



Re: [Cluster-devel] [PATCH v8 00/17] gfs2: Fix mmap + page fault deadlocks

2021-10-20 Thread Linus Torvalds
On Wed, Oct 20, 2021 at 6:37 AM Catalin Marinas  wrote:
>
> The atomic "add zero" trick isn't that simple for MTE since the arm64
> atomic or exclusive instructions run with kernel privileges and
> therefore with the kernel tag checking mode.

Are there any instructions that are useful for "probe_user_write()"
kind of thing? We could always just add that as an arch function, with
a fallback to using the futex "add zero" if the architecture doesn't
need anything special.

Although at least for MTE, I think the solution was to do a regular
read, and that checks the tag, and then we could use the gup machinery
for the writability checks.

Linus



Re: [Cluster-devel] [PATCH v8 00/17] gfs2: Fix mmap + page fault deadlocks

2021-10-19 Thread Linus Torvalds
On Tue, Oct 19, 2021 at 3:42 AM Andreas Gruenbacher  wrote:
>
> From my point of view, the following questions remain:
>
>  * I hope these patches will be merged for v5.16, but what process
>should I follow for that?  The patch queue contains mm and iomap
>changes, so a pull request from the gfs2 tree would be unusual.

Oh, I'd much rather get these as one pull request from the author and
from the person that actually ended up testing this.

It might be "unusual", but it's certainly not unheard of, and trying
to push different parts of the series through different maintainers
would just cause lots of extra churn.

Yes, normally I'd expect filesystem changes to have a diffstat that
clearly shows that "yes, it's all local to this filesystem", and when
I see anything else it raises red flags.

But it raises red flags not because it would be wrong to have changes
to other parts, but simply because when cross-subsystem development
happens, it needs to be discussed and cleared with people. And you've
done that.

So I'd take this as one pull request from you. You've been doing the
work, you get the questionable glory of being in charge of it all.
You'll get the blame too ;)

>  * Will Catalin Marinas's work for supporting arm64 sub-page faults
>be queued behind these patches?  We have an overlap in
>fault_in_[pages_]readable fault_in_[pages_]writeable, so one of
>the two patch queues will need some adjustments.

I think that on the whole they should be developed separately, I don't
think it's going to be a particularly difficult conflict.

That whole discussion does mean that I suspect that we'll have to
change fault_in_iov_iter_writeable() to do the "every 16 bytes" or
whatever thing, and make it use an actual atomic "add zero" or
whatever rather than walk the page tables. But that's a conceptually
separate discussion from this one, I wouldn't actually want to mix up
the two issues too much.

Sure, they touch the same code, so there is _that_ overlap, but one is
about "the hardware rules are a-changing" and the other is about
filesystem use of - and expansion of - the things we do. Let's keep
them separate until ready, and then fix up the fallout at that point
(either as a merge resolution, or even partly after-the-fact).

 Linus



Re: [Cluster-devel] [RFC][arm64] possible infinite loop in btrfs search_ioctl()

2021-10-12 Thread Linus Torvalds
On Tue, Oct 12, 2021 at 10:27 AM Catalin Marinas
 wrote:
>
> Apart from fault_in_pages_*(), there's also fault_in_user_writeable()
> called from the futex code which uses the GUP mechanism as the write
> would be destructive. It looks like it could potentially trigger the
> same infinite loop on -EFAULT.

Hmm.

I think the reason we do fault_in_user_writeable() using GUP is that

 (a) we can avoid the page fault overhead

 (b) we don't have any good "atomic_inc_user()" interface or similar
that could do a write with a zero increment or something like that.

We do have that "arch_futex_atomic_op_inuser()" thing, of course. It's
all kinds of crazy, but we *could* do

   arch_futex_atomic_op_inuser(FUTEX_OP_ADD, 0, &oldval, uaddr);

instead of doing the fault_in_user_writeable().

That might be a good idea anyway. I dunno.

But I agree other options exist:

> I wonder whether we should actually just disable tag checking around the
> problematic accesses. What these callers seem to have in common is using
> pagefault_disable/enable(). We could abuse this to disable tag checking
> or maybe in_atomic() when handling the exception to lazily disable such
> faults temporarily.

Hmm. That would work for MTE, but possibly be very inconvenient for
other situations.

> A more invasive change would be to return a different error for such
> faults like -EACCESS and treat them differently in the caller.

That's _really_ hard for things like "copy_to_user()", that isn't a
single operation, and is supposed to return the bytes left.

Adding another error return would be nasty.

We've had hacks like "squirrel away the actual error code in the task
structure", but that tends to be unmaintainable because we have
interrupts (and NMI's) doing their own possibly nested atomics, so
even disabling preemption won't actually fix some of the nesting
issues.

All of these things make me think that the proper fix ends up being to
make sure that our "fault_in_xyz()" functions simply should always
handle all faults.

Another option may be to teach the GUP code to actually check
architecture-specific sub-page ranges.

Linus



Re: [Cluster-devel] [RFC][arm64] possible infinite loop in btrfs search_ioctl()

2021-10-11 Thread Linus Torvalds
On Mon, Oct 11, 2021 at 2:08 PM Catalin Marinas  wrote:
>
> +#ifdef CONFIG_ARM64_MTE
> +#define FAULT_GRANULE_SIZE (16)
> +#define FAULT_GRANULE_MASK (~(FAULT_GRANULE_SIZE-1))

[...]

> If this looks in the right direction, I'll do some proper patches
> tomorrow.

Looks fine to me. It's going to be quite expensive and bad for caches, though.

That said, fault_in_writable() is _supposed_ to all be for the slow
path when things go south and the normal path didn't work out, so I
think it's fine.

I do wonder how the sub-page granularity works. Is it sufficient to
just read from it? Because then a _slightly_ better option might be to
do one write per page (to catch page table writability) and then one
read per "granule" (to catch pointer coloring or cache poisoning
issues)?

That said, since this is all preparatory to us wanting to write to it
eventually anyway, maybe marking it all dirty in the caches is only
good.

Linus



Re: [Cluster-devel] [RFC][arm64] possible infinite loop in btrfs search_ioctl()

2021-10-11 Thread Linus Torvalds
On Mon, Oct 11, 2021 at 10:38 AM Catalin Marinas
 wrote:
>
> I cleaned up this patch [1] but I realised it still doesn't solve it.
> The arm64 __copy_to_user_inatomic(), while ensuring progress if called
> in a loop, it does not guarantee precise copy to the fault position.

That should be ok., We've always allowed the user copy to return early
if it does word copies and hits a page crosser that causes a fault.

Any user then has the choice of:

 - partial copies are bad

 - partial copies are handled and then you retry from the place
copy_to_user() failed at

and in that second case, the next time around, you'll get the fault
immediately (or you'll make some more progress - maybe the user copy
loop did something different just because the length and/or alignment
was different).

If you get the fault immediately, that's -EFAULT.

And if you make some more progress, it's again up to the caller to
rinse and repeat.

End result: user copy functions do not have to report errors exactly.
It is the caller that has to handle the situation.

Most importantly: "exact error or not" doesn't actually _matter_ to
the caller. If the caller does the right thing for an exact error, it
will do the right thing for an inexact one too. See above.

> The copy_to_sk(), after returning an error, starts again from the previous
> sizeof(sh) boundary rather than from where the __copy_to_user_inatomic()
> stopped. So it can get stuck attempting to copy the same search header.

That seems to be purely a copy_to_sk() bug.

Or rather, it looks like a bug in the caller. copy_to_sk() itself does

if (copy_to_user_nofault(ubuf + *sk_offset, &sh, sizeof(sh))) {
ret = 0;
goto out;
}

and the comment says

 *  0: all items from this leaf copied, continue with next

but that comment is then obviously not actually true in that it's not
"continue with next" at all.

But this is all very much a bug in the btrfs
search_ioctl()/copy_to_sk() code: it simply doesn't do the proper
thing for a partial result.

Because no, "just retry the whole thing" is by definition not the proper thing.

That said, I think that if we can have faults at non-page-aligned
boundaries, then we just need to make fault_in_pages_writeable() check
non-page boundaries.

> An ugly fix is to fall back to byte by byte copying so that we can
> attempt the actual fault address in fault_in_pages_writeable().

No, changing the user copy machinery is wrong. The caller really has
to do the right thing with partial results.

And I think we need to make fault_in_pages_writeable() match the
actual faulting cases - maybe remove the "pages" part of the name?

That would fix the btrfs code - it's not doing the right thing as-is,
but it's "close enough' to right that I think fixing
fault_in_pages_writeable() should fix it.

No?

 Linus



Re: [Cluster-devel] [PATCH v7 16/19] iomap: Add done_before argument to iomap_dio_rw

2021-09-09 Thread Linus Torvalds
On Thu, Sep 9, 2021 at 4:31 AM Christoph Hellwig  wrote:
>
> What about just passing done_before as an argument to
> iomap_dio_complete? gfs2 would have to switch to __iomap_dio_rw +
> iomap_dio_complete instead of iomap_dio_rw for that, and it obviously
> won't work for async completions, but you force sync in this case
> anyway, right?

I think you misunderstand.

Or maybe I do.

It very much doesn't force sync in this case. It did the *first* part
of it synchronously, but then it wants to continue with that async
part for the rest, and very much do that async completion.

And that's why it wants to add that "I already did X much of the
work", exactly so that the async completion can report the full end
result.

But maybe now it's me who is misunderstanding.

  Linus



Re: [Cluster-devel] [PATCH v7 17/19] gup: Introduce FOLL_NOFAULT flag to disable page faults

2021-09-09 Thread Linus Torvalds
On Thu, Sep 9, 2021 at 4:36 AM Christoph Hellwig  wrote:
>
> On Fri, Aug 27, 2021 at 06:49:24PM +0200, Andreas Gruenbacher wrote:
> > Introduce a new FOLL_NOFAULT flag that causes get_user_pages to return
> > -EFAULT when it would otherwise trigger a page fault.  This is roughly
> > similar to FOLL_FAST_ONLY but available on all architectures, and less
> > fragile.
>
> So, FOLL_FAST_ONLY only has one single user through
> get_user_pages_fast_only (pin_user_pages_fast_only is entirely unused,
> which makes totally sense given that give up on fault and pin are not
> exactly useful semantics).

So I think we should treat FOLL_FAST_ONLY as a special "internal to
gup.c" flag, and perhaps not really compare it to the new
FOLL_NOFAULT.

In fact, maybe we could even just make FOLL_FAST_ONLY be the high bit,
and not expose it in <linux/mm.h> and make it entirely private as a
name in gup.c.

Because FOLL_FAST_ONLY really is meant more as a "this way we can
share code easily inside gup.c, by having the internal helpers that
*can* do everything, but not do it all when the user is one of the
limited interfaces".

Because we don't really expect people to use FOLL_FAST_ONLY externally
- they'll use the explicit interfaces we have instead (ie
"get_user_pages_fast()"). Those use-cases that want that fast-only
thing really are so special that they need to be very explicitly so.

FOLL_NOFAULT is different, in that that is something an external user
_would_ use.

Admittedly we'd only have one single case for now, but I think we may
end up with other filesystems - or other cases entirely - having that
same kind of "I am holding locks, so I can't fault into the MM, but
I'm otherwise ok with the immediate mmap_sem lock usage and sleeping".

End result: FOLL_FAST_ONLY and FOLL_NOFAULT have some similarities,
but at the same time I think they are fundamentally different.

The FAST_ONLY is the very very special "I can't sleep, I can't even
take the fundamental MM lock, and we export special interfaces because
it's _so_ special and can be used in interrupts etc".

In contrast, NOFAULT is not _that_ special. It's just another flag,
and has generic use.

   Linus



Re: [Cluster-devel] [PATCH v7 00/19] gfs2: Fix mmap + page fault deadlocks

2021-09-03 Thread Linus Torvalds
On Fri, Sep 3, 2021 at 11:28 AM Al Viro  wrote:
>
> FWIW, my objections regarding the calling conventions are still there.

So I'm happy to further change the calling conventions, but by now
_that_ part is most definitely a "not this merge window". The need for
that ternary state is still there.

It might go away in the future, but I think that's literally that: a
future cleanup. Not really related to the problem at hand.

  Linus



Re: [Cluster-devel] [PATCH v7 00/19] gfs2: Fix mmap + page fault deadlocks

2021-09-03 Thread Linus Torvalds
On Wed, Sep 1, 2021 at 12:53 PM Andreas Gruenbacher  wrote:
>
> So there's a minor merge conflict between Christoph's iomap_iter
> conversion and this patch queue now, and I should probably clarify the
> description of "iomap: Add done_before argument to iomap_dio_rw" that
> Darrick ran into. Then there are the user copy issues that Al has
> pointed out. Fixing those will create superficial conflicts with this
> patch queue, but probably nothing serious.
>
> So how should I proceed: do you expect a v8 of this patch queue on top
> of the current mainline?

So if you rebase for fixes, it's going to be a "next merge window" thing again.

Personally, I'm ok with the series as is, and the conflict isn't an
issue. So I'd take it as is, and then people can fix up niggling
issues later.

But if somebody screams loudly..

 Linus



Re: [Cluster-devel] [PATCH v7 04/19] iov_iter: Turn iov_iter_fault_in_readable into fault_in_iov_iter_readable

2021-08-28 Thread Linus Torvalds
On Fri, Aug 27, 2021 at 1:56 PM Kari Argillander
 wrote:
>
> At least this patch will break ntfs3 which is in next. It has been there
> just couple weeks so I understand. I added Konstantin and ntfs3 list so
> that we know what is going on. Can you please info if and when do we
> need rebase.

No need to rebase. It just makes it harder for me to pick one pull
over another, since it would mix the two things together.

I'll notice the semantic conflict as I do my merge build test, and
it's easy for me to fix as part of the merge - whichever one I merge
later.

It's good if both sides remind me about the issue, but these kinds of
conflicts are not a problem.

And yes, it does happen that I miss conflicts like this if I merge
while on the road and don't do my full build tests, or if it's some
architecture-specific thing or a problem that doesn't happen on my
usual allmodconfig testing.  But neither of those cases should be
present in this situation.

Linus



Re: [Cluster-devel] [PATCH v7 16/19] iomap: Add done_before argument to iomap_dio_rw

2021-08-27 Thread Linus Torvalds
On Fri, Aug 27, 2021 at 2:32 PM Darrick J. Wong  wrote:
>
> No, because you totally ignored the second question:
>
> If the directio operation succeeds even partially and the PARTIAL flag
> is set, won't that push the iov iter ahead by however many bytes
> completed?
>
> We already finished the IO for the first page, so the second attempt
> should pick up where it left off, i.e. the second page.

Darrick, I think you're missing the point.

It's the *return*value* that is the issue, not the iovec.

The iovec is updated as you say. But the return value from the async
part is - without Andreas' patch - only the async part of it.

With Andreas' patch, the async part will now return the full return
value, including the part that was done synchronously.

And the return value is returned from that async part, which somehow
thus needs to know what predated it.

Could that pre-existing part perhaps be saved somewhere else? Very
possibly. That 'struct iomap_dio' addition is kind of ugly. So maybe
what Andreas did could be done differently. But I think you guys are
arguing past each other.

   Linus



Re: [Cluster-devel] [PATCH v7 05/19] iov_iter: Introduce fault_in_iov_iter_writeable

2021-08-27 Thread Linus Torvalds
On Fri, Aug 27, 2021 at 12:23 PM Al Viro  wrote:
>
> Could you show the cases where "partial copy, so it's OK" behaviour would
> break anything?

Absolutely.

For example, it would cause an infinite loop in
restore_fpregs_from_user() if the "buf" argument is a situation where
the first page is fine, but the next page is not.

Why? Because __restore_fpregs_from_user() would take a fault, but then
fault_in_pages_readable() (renamed) would succeed, so you'd just do
that "retry" forever and ever.

Probably there are a number of other places too. That was literally
the *first* place I looked at.

Seriously. The current semantics are "check the whole area".

THOSE MUST NOT CHANGE.

  Linus



Re: [Cluster-devel] [PATCH v7 05/19] iov_iter: Introduce fault_in_iov_iter_writeable

2021-08-27 Thread Linus Torvalds
On Fri, Aug 27, 2021 at 11:52 AM Al Viro  wrote:
>
> Again, the calling conventions are wrong.  Make it success/failure or
> 0/-EFAULT.  And it's inconsistent for iovec and non-iovec cases as it is.

Al, the 0/-EFAULT thing DOES NOT WORK.

The whole "success vs failure" model is broken.

Because "success" for some people is "everything worked".

And for other people it is "at least _part_ of it worked".

So no, 0/-EFAULT fundamentally cannot work, because the return needs
to be able to handle that ternary situation (ie "nothing" vs
"something" vs "everything").

This is *literally* the exact same thing that we have for
copy_to/from_user(). And Andreas' solution (based on my suggestion) is
the exact same one that we have had for that code since basically day
#1.

The whole "0/-EFAULT" is simpler, yes. And it's what
"{get|put}_user()" uses, yes. And it's more common to a lot of other
functions that return zero or an error.

But see above. People *need* that ternary result, and "bytes/pages
uncopied" is not only the traditional one we use elsewhere in similar
situations, it's the one that has the easiest error tests for existing
users (because zero remains "everything worked").

Andreas originally had that "how many bytes/pages succeeded" return
value instead, and yes, that's also ternary. But it means that now the
common "complete success" test ends up being a lot uglier, and the
semantics of the function changes completely where "0" no longer means
success, and that messes up much more.

So I really think you are barking entirely up the wrong tree.

If there is any inconsistency, maybe we should make _more_ cases use
that "how many bytes/pages not copied" logic, but in a lot of cases
you don't actually need the ternary decision value.

So the inconsistency is EXACTLY the same as the one we have always had
for get|put_user() vs copy_to|from_user(), and it exists for the EXACT
same reason.

IOW, please explain how you'd solve the ternary problem without making
the code a lot uglier.

  Linus



Re: [Cluster-devel] [PATCH v7 04/19] iov_iter: Turn iov_iter_fault_in_readable into fault_in_iov_iter_readable

2021-08-27 Thread Linus Torvalds
On Fri, Aug 27, 2021 at 11:53 AM Al Viro  wrote:
>
> I really disagree with these calling conventions.  "Number not faulted in"
> is bloody useless

It's what we already have for copy_to/from_user(), so it's actually
consistent with that.

And it avoids changing all the existing tests where people really
cared only about the "everything ok" case.

Andreas' first patch did that changed version, and was ugly as hell.

But if you have a version that avoids the ugliness...

 Linus



Re: [Cluster-devel] [PATCH v7 00/19] gfs2: Fix mmap + page fault deadlocks

2021-08-27 Thread Linus Torvalds
On Fri, Aug 27, 2021 at 9:49 AM Andreas Gruenbacher  wrote:
>
> here's another update on top of v5.14-rc7.  Changes:
>
>  * Some of the patch descriptions have been improved.
>
>  * Patch "gfs2: Eliminate ip->i_gh" has been moved further to the front.
>
> At this point, I'm not aware of anything that still needs fixing,

From a quick scan, I didn't see anything that raised my hackles.

But I skipped all the gfs2-specific changes in the series, since
that's all above my paygrade.

 Linus



Re: [Cluster-devel] [PATCH v5 00/12] gfs2: Fix mmap + page fault deadlocks

2021-08-19 Thread Linus Torvalds
On Thu, Aug 19, 2021 at 12:41 PM Andreas Gruenbacher
 wrote:
>
> Hmm, what if GUP is made to skip VM_IO vmas without adding anything to
> the pages array? That would match fault_in_iov_iter_writeable, which
> is modeled after __mm_populate and which skips VM_IO and VM_PFNMAP
> vmas.

I don't understand what you mean.. GUP already skips VM_IO (and
VM_PFNMAP) pages. It just returns EFAULT.

We could make it return another error. We already have DAX and
FOLL_LONGTERM returning -EOPNOTSUPP.

Of course, I think some code ends up always just returning "number of
pages looked up" and might return 0 for "no pages" rather than the
error for the first page.

So we may end up having interfaces that then lose that explanation
error code, but I didn't check.

But we couldn't make it just say "skip them and try later addresses",
if that is what you meant. THAT makes no sense - that would just make
GUP look up some other address than what was asked for.

> > I also do still think that even regardless of that, we want to just
> > add a FOLL_NOFAULT flag that just disables calling handle_mm_fault(),
> > and then you can use the regular get_user_pages().
> >
> > That at least gives us the full _normal_ page handling stuff.
>
> And it does fix the generic/208 failure.

Good. So I think the approach is usable, even if we might have corner
cases left.

So I think the remaining issue is exactly things like VM_IO and
VM_PFNMAP. Do the fstests have test-cases for things like this? It
_is_ quite specialized, it might be a good idea to have that.

Of course, doing direct-IO from special memory regions with zerocopy
might be something special people actually want to do. But I think
we've had that VM_IO flag testing there basically forever, so I don't
think it has ever worked (for some definition of "ever").

Linus



Re: [Cluster-devel] [PATCH v5 00/12] gfs2: Fix mmap + page fault deadlocks

2021-08-18 Thread Linus Torvalds
[ Sorry for the delay, I was on the road and this fell through the cracks ]

On Mon, Aug 16, 2021 at 12:14 PM Andreas Gruenbacher
 wrote:
>
> On Tue, Aug 3, 2021 at 9:45 PM Linus Torvalds
>  wrote:
> >
> > Hmm. Have you tried to figure out why that "still returns 0" happens?
>
> The call stack is:
>
> gup_pte_range
> gup_pmd_range
> gup_pud_range
> gup_p4d_range
> gup_pgd_range
> lockless_pages_from_mm
> internal_get_user_pages_fast
> get_user_pages_fast
> iov_iter_get_pages
> __bio_iov_iter_get_pages
> bio_iov_iter_get_pages
> iomap_dio_bio_actor
> iomap_dio_actor
> iomap_apply
> iomap_dio_rw
> gfs2_file_direct_write
>
> In gup_pte_range, pte_special(pte) is true and so we return 0.

Ok, so that is indeed something that the fast-case can't handle,
because some of the special code wants to have the mm_lock so that it
can look at the vma flags (eg "vm_normal_page()" and friends.

That said, some of these cases even the full GUP won't ever handle,
simply because a mapping doesn't necessarily even _have_ a 'struct
page' associated with it if it's a VM_IO mapping.

So it turns out that you can't just always do
fault_in_iov_iter_readable() and then assume that you can do
iov_iter_get_pages() and repeat until successful.

We could certainly make get_user_pages_fast() handle a few more cases,
but I get the feeling that we need to have separate error cases for
EFAULT - no page exists - and the "page exists, but cannot be mapped
as a 'struct page'" case.

I also do still think that even regardless of that, we want to just
add a FOLL_NOFAULT flag that just disables calling handle_mm_fault(),
and then you can use the regular get_user_pages().

That at least gives us the full _normal_ page handling stuff.

   Linus



Re: [Cluster-devel] [PATCH v5 00/12] gfs2: Fix mmap + page fault deadlocks

2021-08-03 Thread Linus Torvalds
On Tue, Aug 3, 2021 at 12:18 PM Andreas Gruenbacher  wrote:
>
>
> With this patch queue, fstest generic/208 (aio-dio-invalidate-failure.c)
> endlessly spins in gfs2_file_direct_write.  It looks as if there's a bug
> in get_user_pages_fast when called with FOLL_FAST_ONLY:
>
>  (1) The test case performs an aio write into a 32 MB buffer.
>
>  (2) The buffer is initially not in memory, so when iomap_dio_rw() ->
>  ... -> bio_iov_iter_get_pages() is called with the iter->noio flag
>  set, we get to get_user_pages_fast() with FOLL_FAST_ONLY set.
>  get_user_pages_fast() returns 0, which causes
>  bio_iov_iter_get_pages to return -EFAULT.
>
>  (3) Then gfs2_file_direct_write faults in the entire buffer with
>  fault_in_iov_iter_readable(), which succeeds.
>
>  (4) With the buffer in memory, we retry the iomap_dio_rw() ->
>  ... -> bio_iov_iter_get_pages() -> ... -> get_user_pages_fast().
>  This should succeed now, but get_user_pages_fast() still returns 0.

Hmm. Have you tried to figure out why that "still returns 0" happens?

One option - for debugging only - would be to introduce a new flag to
get_user_pages_fast() that says "print out reason if failed" and make
the retry (but not the original one) have that flag set.

There are a couple of things of note when it comes to "get_user_pages_fast()":

 (a) some architectures don't even enable it

 (b) it can be very picky about the page table contents, and wants the
accessed bit to already be set (or the dirty bit, in the case of a
write).

but (a) shouldn't be an issue on any common platform and (b) shouldn't
be an issue with fault_in_iov_iter_readable() that actually does a
__get_user() so it will access through the page tables.

(It might be more of an issue with fault_in_iov_iter_writable() due to
walking the page tables by hand - if we don't do the proper
access/dirty setting, I could see get_user_pages_fast() failing).

Anyway, for reason (a) I do think that eventually we should probably
introduce FOLL_NOFAULT, and allow the full "slow" page table walk -
just not calling down to handle_mm_fault() if it fails.

But (a) should be a non-issue in your test environment, and so it
would be interesting to hear what it is that fails. Because scanning
through the patches, they all _look_ fine to me (apart from the one
comment about return values, which is more about being consistent with
copy_to/from_user() and making the code simpler - not about
correctness)

   Linus



Re: [Cluster-devel] [PATCH v5 03/12] Turn fault_in_pages_{readable, writeable} into fault_in_{readable, writeable}

2021-08-03 Thread Linus Torvalds
On Tue, Aug 3, 2021 at 12:18 PM Andreas Gruenbacher  wrote:
>
> Turn fault_in_pages_{readable,writeable} into versions that return the number
> of bytes faulted in instead of returning a non-zero value when any of the
> requested pages couldn't be faulted in.

Ugh. This ends up making most users have horribly nasty conditionals.

I think I suggested (or maybe that was just my internal voice and I
never actually vocalized it) using the same semantics as
"copy_to/from_user()" for fault_in_pages_{readable|writable}().

Namely to return the number of bytes *not* faulted in.

That makes it trivial to test "did I get everything" - becasue zero
means "nothing failed" and remains the "complete success" marker.

And it still allows for the (much less common) case of "ok, I need to
know about partial failures".

So instead of this horror:

-   if (!fault_in_pages_writeable(buf_fx, fpu_user_xstate_size))
+   if (fault_in_writeable(buf_fx, fpu_user_xstate_size) ==
+   fpu_user_xstate_size)

you'd just have

-   if (!fault_in_pages_writeable(buf_fx, fpu_user_xstate_size))
+   if (!fault_in_writeable(buf_fx, fpu_user_xstate_size))

because zero would continue to be a marker of success.

 Linus



Re: [Cluster-devel] [PATCH v4 1/8] iov_iter: Introduce iov_iter_fault_in_writeable helper

2021-07-27 Thread Linus Torvalds
On Tue, Jul 27, 2021 at 4:14 AM Andreas Gruenbacher  wrote:
>
> On Tue, Jul 27, 2021 at 11:30 AM David Laight  wrote:
> >
> > Is it actually worth doing any more than ensuring the first byte
> > of the buffer is paged in before entering the block that has
> > to disable page faults?
>
> We definitely do want to process as many pages as we can, especially
> if allocations are involved during a write.

Yeah, from an efficiency standpoint, once you start walking page
tables, it's probably best to just handle as much as you can.

But once you get an error, I don't think it should be "everything is bad".

This is a bit annoying, because while *most* users really just want
that "everything is good", *some* users might just want to handle the
partial success case.

It's why "copy_to/from_user()" returns the number of bytes *not*
written, rather than -EFAULT like get/put_user(). 99% of all users
just want to know "did I write all bytes" (and then checking for a
zero return is a simple and cheap verification of "everything was
ok").

But then very occasionally, you hit a case where you actually want to
know how much of a copy worked. It's rare, but it happens, and the
read/write system calls tend to be the main user of it.

And yes, the fact that "copy_to/from_user()" doesn't return an error
(like get/put_user() does) has confused people many times over the
years. It's annoying, but it's required by those (few) users that
really do want to handle that partial case.

I think this iov_iter_fault_in_readable/writeable() case should do the same.

And no, it's not new to Andreas' patch. iov_iter_fault_in_readable()
is doing the "everything has to be good" thing already.

Which maybe implies that nobody cares about partial reads/writes. Or
it's very very rare - I've seen code that handles page faults in user
space, but it's admittedly been some very special CPU
simulator/emulator checkpointing stuff.

   Linus



Re: [Cluster-devel] [PATCH v3 1/7] iov_iter: Introduce fault_in_iov_iter helper

2021-07-26 Thread Linus Torvalds
On Mon, Jul 26, 2021 at 9:33 AM Jan Kara  wrote:
>
> On Fri 23-07-21 22:58:34, Andreas Gruenbacher wrote:
> > + gup_flags = FOLL_TOUCH | FOLL_POPULATE;
>
> I don't think FOLL_POPULATE makes sense here. It makes sense only with
> FOLL_MLOCK and determines whether mlock(2) should fault in missing pages or
> not.

Yeah, it won't hurt, but FOLL_POPULATE doesn't actually do anything
unless FOLL_MLOCK is set. It is, as you say, a magic flag just for
mlock.

The only ones that should matter are FOLL_WRITE (for obvious reasons)
and FOLL_TOUCH (to set the accessed and dirty bits, rather than just
the protection bits)

   Linus



Re: [Cluster-devel] [PATCH v4 1/8] iov_iter: Introduce iov_iter_fault_in_writeable helper

2021-07-24 Thread Linus Torvalds
On Sat, Jul 24, 2021 at 1:26 PM Al Viro  wrote:
>
> On Sat, Jul 24, 2021 at 12:52:34PM -0700, Linus Torvalds wrote:
>
> > So I think the code needs to return 0 if _any_ fault was successful.
>
> s/any/the first/...

Yes, but as long as we do them in order, and stop when it fails, "any"
and "first" end up being the same thing ;)

> The same goes for fault-in for read

Yeah. That said, a partial write() (ie "read from user space") might
be something that a filesystem is not willing to touch for other
reasons, so I think returning -EFAULT in that case is, I think,
slightly more reasonable.

Things like "I have to prepare buffers to be filled" etc.

The partial read() case is at least something that a filesystem should
be able to handle fairly easily.

And I don't think returning -EFAULT early is necessarily wrong - we
obviously do it anyway if you give really invalid addresses.

But we have generally strived to allow partial IO for missing pages,
because people sometimes play games with unmapping things dynamically
or using mprotect() to catch modifications (ie doing things like catch
SIGSEGV/SIGBUS, and remap it).

But those kinds of uses tend to have to catch -EFAULT anyway, so I
guess it's not a big deal either way.

   Linus



Re: [Cluster-devel] [PATCH v4 1/8] iov_iter: Introduce iov_iter_fault_in_writeable helper

2021-07-24 Thread Linus Torvalds
On Sat, Jul 24, 2021 at 12:35 PM Andreas Gruenbacher
 wrote:
>
> +int iov_iter_fault_in_writeable(const struct iov_iter *i, size_t bytes)
> +{
...
> +   if (fault_in_user_pages(start, len, true) != len)
> +   return -EFAULT;

Looking at this once more, I think this is likely wrong.

Why?

Because any user can/should only care about at least *part* of the
area being writable.

Imagine that you're doing a large read. If the *first* page is
writable, you should still return the partial read, not -EFAULT.

So I think the code needs to return 0 if _any_ fault was successful.
Or perhaps return how much it was able to fault in. Because returning
-EFAULT if any of it failed seems wrong, and doesn't allow for partial
success being reported.

The other reaction I have is that you now only do the
iov_iter_fault_in_writeable, but then you make fault_in_user_pages()
still have that "bool write" argument.

We already have 'fault_in_pages_readable()', and that one is more
efficient (well, at least if the fault isn't needed it is). So it
would make more sense to just implement fault_in_pages_writable()
instead of that "fault_in_user_pages(, bool write)".

 Linus



Re: [Cluster-devel] [PATCH v3 1/7] iov_iter: Introduce fault_in_iov_iter helper

2021-07-23 Thread Linus Torvalds
On Fri, Jul 23, 2021 at 1:58 PM Andreas Gruenbacher  wrote:
>
> Introduce a new fault_in_iov_iter helper for manually faulting in an iterator.
> Other than fault_in_pages_writeable(), this function is non-destructive.

Again, as I pointed out in the previous version, "Other than" is not
sensible language.

You mean "Unlike".

Same issue in the comment:

> + * Other than fault_in_pages_writeable(), this function is non-destructive 
> even
> + * when faulting in pages for writing.

It really should be

  "Unlike fault_in_pages_writeable(), this function .."

to parse correctly.

I understand what you mean, but only because I know what
fault_in_pages_writeable() does and what the issue was.

And in a year or two, I might have forgotten, and wonder what you meant.

 Linus



Re: [Cluster-devel] [PATCH v2 0/6] gfs2: Fix mmap + page fault deadlocks

2021-07-19 Thread Linus Torvalds
On Sun, Jul 18, 2021 at 3:39 PM Andreas Gruenbacher  wrote:
>
> here's an update to the gfs2 mmap + page faults fix that implements
> Linus's suggestion of disabling page faults while we're holding the
> inode glock.

Apart from some wording/naming issues, I think this looks a _lot_
better, and should fix the fundamental and underlying deadlock
properly.

So Ack from me with the trivial suggestions I sent to the individual patches.

Linus



Re: [Cluster-devel] [PATCH v2 5/6] iov_iter: Introduce ITER_FLAG_FAST_ONLY flag

2021-07-19 Thread Linus Torvalds
On Sun, Jul 18, 2021 at 3:40 PM Andreas Gruenbacher  wrote:
>
> Introduce a new ITER_FLAG_FAST_ONLY flag

I think the code is fine, but I think it might be best to call this
"ITER_FLAG_NOIO" or something like that.

The "FAST_ONLY" name makes sense in the context of
"get_user_pages_fast()" where we have that "fast" naming (and the long
history too). But I don't think it makes much sense as a name in the
context of iov_iter.

   Linus



Re: [Cluster-devel] [PATCH v2 1/6] iov_iter: Introduce fault_in_iov_iter helper

2021-07-19 Thread Linus Torvalds
On Sun, Jul 18, 2021 at 3:39 PM Andreas Gruenbacher  wrote:
>
> Introduce a new fault_in_iov_iter helper for manually faulting in an iterator.
> Other than fault_in_pages_writeable(), this function is non-destructive.

You mean "Unlike" rather than "Other than" (also in the comment of the patch).

This is fairly inefficient, but as long as it's the exceptional case,
that's fine. It might be worth making that very explicit, so that
people don't try to use it normally.

Linus



Re: [Cluster-devel] [PATCH] gfs2: Fix mmap + page fault deadlocks

2021-07-01 Thread Linus Torvalds
On Thu, Jul 1, 2021 at 5:20 PM Andreas Gruenbacher  wrote:
>
> On Thu, Jul 1, 2021 at 11:41 PM Linus Torvalds
>  wrote:
> > Also, I have to say that I think the direct-IO code is fundamentally
> > mis-designed. Why it is doing the page lookup _during_ the IO is a
> > complete mystery to me. Why wasn't that done ahead of time before the
> > filesystem took the locks it needed?
>
> That would be inconvenient for reads, when the number of bytes read is
> much smaller than the buffer size and we won't need to page in the
> entire buffer.

What?

A file read will READ THE WHOLE BUFFER.

We're not talking pipes or ttys here. If you ask for X bytes, you'll
get X bytes.

Of course, if you ask for more data than the file has, that's another
thing, but who really does that with direct-IO? And if they do, why
should we care about their silly behavior?

Face it, right now direct-IO is *BUGGY* because of this, and you can
deadlock filesystems with it.

So tell me again how it's "inconvenient" to fix this bug, and fix the
bad direct-IO design?

  Linus



Re: [Cluster-devel] [PATCH] gfs2: Fix mmap + page fault deadlocks

2021-07-01 Thread Linus Torvalds
On Thu, Jul 1, 2021 at 5:30 PM Linus Torvalds
 wrote:
>
>
> Of course, if you ask for more data than the file has, that's another
> thing, but who really does that with direct-IO? And if they do, why
> should we care about their silly behavior?

Now, if the issue is that people do IO for bigger areas than you have
memory for, then I think that's a chunking issue. I don't think the
ITER_IOVEC necessarily needs to be converted to an ITER_BVEC in one
single go. That would indeed be painful if somebody tries to do some
huge direct-IO when they just don't have the memory for it.

But the fact is, direct-IO has been an incredible pain-point for
decades, because it's (a) unusual, (b) buggy and (c) has very little
overall design and common code.

The three issues are likely intimately tied together.

The iomap code at least has tried to make for much more common code,
but I really think some direct-IO people should seriously reconsider
how they are doing things when there are fundamental deadlocks in the
design.

And I do think that a ITER_IOVEC -> ITER_BVEC conversion function
might be a really really good idea to solve this problem. There's even
a very natural chunking algorithm: try to do as much as possible with
get_user_pages_fast() - so that if you already *have* the memory, you
can do the full IO (or at least a big part of it).

And if get_user_pages_fast() only gives you a small area - or nothing
at all - you chunk it down aggressively, and realize that "oh, doing
direct-IO when user space is paged out might not be the most optimal
case".

  Linus



Re: [Cluster-devel] [PATCH] gfs2: Fix mmap + page fault deadlocks

2021-07-01 Thread Linus Torvalds
On Thu, Jul 1, 2021 at 2:41 PM Linus Torvalds
 wrote:
>
> So what the direct-IO code _should_ do is to turn an ITER_IOVEC into a
> ITER_KVEC by doing the page lookup ahead of time

Actually, an ITER_BVEC, not ITER_KVEC. It wants a page array, not a
kernel pointer array.

But I hope people understood what I meant..

 Linus



Re: [Cluster-devel] [PATCH] gfs2: Fix mmap + page fault deadlocks

2021-07-01 Thread Linus Torvalds
On Thu, Jul 1, 2021 at 1:43 PM Andreas Gruenbacher  wrote:
>
> here's another attempt at fixing the mmap + page fault deadlocks we're
> seeing on gfs2.  Still not ideal because get_user_pages_fast ignores the
> current->pagefault_disabled flag

Of course get_user_pages_fast() ignores the pagefault_disabled flag,
because it doesn't do any page faults.

If you don't want to fall back to the "maybe do IO" case, you should
use the FOLL_FAST_ONLY flag - or get_user_pages_fast_only(), which
does that itself.

> For getting get_user_pages_fast changed to fix this properly, I'd need
> help from the memory management folks.

I really don't think you need anything at all from the mm people,
because we already support that whole "fast only" case.

Also, I have to say that I think the direct-IO code is fundamentally
mis-designed. Why it is doing the page lookup _during_ the IO is a
complete mystery to me. Why wasn't that done ahead of time before the
filesystem took the locks it needed?

So what the direct-IO code _should_ do is to turn an ITER_IOVEC into a
ITER_KVEC by doing the page lookup ahead of time, and none of these
issues should even exist, and then the whole pagefault_disabled and/or
FOLL_FAST_ONLY would be a complete non-issue.

Is there any reason why that isn't what it does (other than historical baggage)?

   Linus



Re: [Cluster-devel] [RFC 5/9] iov_iter: Add iov_iter_fault_in_writeable()

2021-06-12 Thread Linus Torvalds
On Sat, Jun 12, 2021 at 2:47 PM Al Viro  wrote:
>
> O_DIRECT case is a PITA - there we use GUP and there's no way
> to tell GUP that in the current situation we do *NOT* want to hit
> ->fault()/->page_mkwrite()/etc.  pagefault_disable() won't be even
> noticed there...

Well, we could change that.

And we do have get_user_pages_fast_only() these days.

  Linus



Re: [Cluster-devel] [RFC 5/9] iov_iter: Add iov_iter_fault_in_writeable()

2021-06-12 Thread Linus Torvalds
On Sat, Jun 12, 2021 at 2:12 PM Al Viro  wrote:
>
> Actually, is there any good way to make sure that write fault is triggered
> _without_ modification of the data?  On x86 lock xadd (of 0, that is) would
> probably do it and some of the other architectures could probably get away
> with using cmpxchg and its relatives, but how reliable it is wrt always
> triggering a write fault if the page is currently read-only?

I wouldn't worry about the CPU optimizing a zero 'add' away (extra
work for no gain in any normal situation).

But any architecture using 'ldl/stc' to do atomics would do it in
software for at least cmpxchg (ie just abort after the "doesn't
match").

Even on x86, it's certainly _possible_ that a non-matching cmpxchg
might not have done the writability check, although I would find that
unlikely (ie I would expect it to do just one TLB lookup, and just one
permission check, whether it then writes or not).

And if the architecture does atomic operations using that ldl/stc
model, I could (again) see the software loop breaking out early
(before the stc) on the grounds of "value didn't change".

Although it's a lot less likely outside of cmpxchg. I suspect an "add
zero" would work just fine even on a ldl/stc model.

That said, reads are obviously much easier, and I'd probably prefer
the model for writes to be to not necessarily pre-fault anything at
all, but just write to user space with page faults disabled.

And then only if that fails do you do anything special. And at that
point, even walking the page tables by hand might be perfectly
acceptable - since we know it's going to fault anyway, and it might
actually be cheaper to just do it by hand with GUP or whatever.

  Linus



Re: [Cluster-devel] [RFC 4/9] gfs2: Fix mmap + page fault deadlocks (part 1)

2021-06-01 Thread Linus Torvalds
On Mon, May 31, 2021 at 7:01 AM Andreas Gruenbacher  wrote:
>
> Fix that by recognizing the self-recursion case.

Hmm. I get the feeling that the self-recursion case should never have
been allowed to happen in the first place.

IOW, is there some reason why you can't make the user accesses always
be doen with page faults disabled (ie using the "atomic" user space
access model), and then if you get a partial read (or write) to user
space, at that point you drop the locks in read/write, do the "try to
make readable/writable" and try again.

IOW, none of this "detect recursion" thing. Just "no recursion in the
first place".

That way you'd not have these odd rules at fault time at all, because
a fault while holding a lock would never get to the filesystem at all,
it would be aborted early. And you'd not have any odd "inner/outer"
locks, or lock compatibility rules or anything like that. You'd
literally have just "oh, I didn't get everything at RW time while I
held locks, so let's drop the locks, try to access user space, and
retry".

Wouldn't that be a lot simpler and more robust?

Because what if the mmap is something a bit more complex, like
overlayfs or usefaultfd, and completing the fault isn't about gfs2
handling it as a "fault", but about some *other* entity calling back
to gfs2 and doing a read/write instead? Now all your "inner/outer"
lock logic ends up being entirely pointless, as far as I can tell, and
you end up deadlocking on the lock you are holding over the user space
access _anyway_.

So I literally think that your approach is

 (a) too complicated

 (b) doesn't actually fix the issue in the more general case

But maybe I'm missing something.

  Linus




Re: [Cluster-devel] [RFC 9/9] gfs2: Fix mmap + page fault deadlocks (part 2)

2021-05-31 Thread Linus Torvalds
On Mon, May 31, 2021 at 7:02 AM Andreas Gruenbacher  wrote:
>
> @@ -807,13 +824,20 @@ static ssize_t gfs2_file_direct_read(struct kiocb 
> *iocb, struct iov_iter *to,
> [...]
> ret = iomap_dio_rw(iocb, to, _iomap_ops, NULL, 0);
> gfs2_glock_dq(gh);
> +   if (unlikely(current_needs_retry())) {
> +   set_current_needs_retry(false);
> +   if (ret == -EFAULT &&
> +   !iov_iter_fault_in_writeable(to, PAGE_SIZE))
> +   goto retry;
> +   }

Hmm. I haven't walked through this all, but is that "ret == -EFAULT"
test the right thing to do?

Can iomap_dio_rw() not instead just return a partial success if it hit
a missing page half-way?

Shouldn't you retry for that case too?

Linus



Re: [Cluster-devel] [RFC 0/9] gfs2: handle page faults during read and write

2021-05-31 Thread Linus Torvalds
Sorry, I'm on a boat right now, with only a cellphone. Which is why this
html mess email, and quick reply.

Due to the html, this may get a bounce from the mailing list, and only make
it to the personal email recipients. Feel free to quote more just in case
others didn't get my original email through the lists.

I'll be out most of the day, but I'll try to take a deeper look this
evening.

I'm the meantime, a couple of questions and comments..

On Mon, May 31, 2021, 07:01 Andreas Gruenbacher  wrote:

>
> here's a set of fixes for how gfs2 handles page faults during read and
> write syscalls.


So how much of this is due to the confusion you just introduced where you
pointlessly and incorrectly take an exclusive lock for write faults?

See my reply to that pull request for why it's wrong and pointless.

> The patch queue is ready for merging except for two
> open issues.
>

There is no way this series is acceptable for 5.13. This kind of change is
very much a merge window thing. Much much too late to make fundamental
locking changes. Maybe it can then be backported to stable (including at
that point 5.13 of course) if it's been shown to be ok.

This deadlock is not new, we've very much had the same kind of thing when
writing to a file in the generic filemap_write() function, where we take
the page lock and then copy from user space. If that copy faults, and needs
the same page for the source due to an odd mmap issue (usually malicious),
you get a deadlock on the page lock if you aren't careful.

I'm surprised that gfs2 hasn't seen this, I thought we had fstests for it.
And I'd have expected that case to also trigger any internal gfs2 issues,
although it's possible that the generic code just does such a good job at
avoiding the issue that we'd need another test for your case.

  Linus



Re: [Cluster-devel] [GIT PULL] gfs2 fixes for v5.13-rc5

2021-05-31 Thread Linus Torvalds
[ Adding fsdevel, because this is not limited to gfs2 ]

On Mon, May 31, 2021 at 12:56 AM Andreas Gruenbacher
 wrote:
>
> Andreas Gruenbacher (2):
>   gfs2: Fix mmap locking for write faults

This is bogus.

I've pulled it, but this is just wrong.

A write fault on a mmap IS NOT A WRITE to the filesystem.

It's a read.

Yes, it will later then allow writes to the page, but that's entirely
immaterial - because the write is going to happen not at fault time,
but long *after* the fault, and long *after* the filesystem has
installed the page.

The actual write will happen when the kernel returns to user space.

And no, the explanation in that commit makes no sense either:

   "When a write fault occurs, we need to take the inode glock of the underlying
inode in exclusive mode.  Otherwise, there's no guarantee that the
dirty page
will be written back to disk"

the thing is, FAULT_FLAG_WRITE only says that the *currently* pending
access that triggered the fault was a write. That's entirely
irrelevant to a filesystem, because

 (a) it might be a private mapping, and a write to a page will cause a
COW by the VM layer, and it's not actually a filesystem write at all

AND

 (b) if it's a shared mapping, the first access that paged things in
is likely a READ, and the page will be marked writable (because it's a
shared mapping!) and subsequent writes will not cause any faults at
all.

In other words, a filesystem that checks for FAULT_FLAG_WRITE is
_doubly_ wrong. It's absolutely never the right thing to do. It
*cannot* be the right thing to do.

And yes, some other filesystems do this crazy thing too. If your
friends jumped off a bridge, would you jump too?

The xfs and ext3/ext4 cases are wrong too - but at least they spent
five seconds (but no more) thinking about it, and they added the check
for VM_SHARED. So they are only wrong for reason (b)

But wrong is wrong. The new code is not right in gfs2, and the old
code in xfs/ext4 is not right either.

Yeah, yeah, you can - and people do - do things like "always mark the
page readable on initial page fault, use mkwrite to catch when it
becomes writable, and do timestamps carefully, or at least have full
knowledge of "something might become dirty"

But even then it is ENTIRELY BOGUS to do things like write locking.

Really.

Because the actual write HASN'T HAPPENED YET, AND YOU WILL RELEASE THE
LOCK BEFORE IT EVER DOES! So the lock? It does nothing. If you think
it protects anything at all, you're wrong.

So don't do write locking. At an absolute most, you can do things like

 - update file times (and honestly, that's quite questionable -
because again - THE WRITE HASN'T HAPPENED YET - so any tests that
depend on exact file access times to figure out when the last write
was done is not being helped by your extra code, because you're
setting the WRONG time.

 - set some "this inode will have dirty pages" flag just for some
possible future use. But honestly, if this is about consistency etc,
you need to do it not for a fault, but across the whole mmap/munmap.

So some things may be less bogus - but still very very questionable.

But locking? Bogus. Reads and writes aren't really any different from
a timestamp standpoint (if you think you need to update mtime for write
accesses, you should do atime for reads, so from a filesystem
timestamp standpoint read and write faults are exactly the same - and
both are bogus, because by definition for a mmap both the reads and
the writes can then happen long long long afterwards, and repeatedly).

And if that "set some flag" thing needs a write lock, but a read
doesn't, you're doing something wrong and odd.

Look at the VM code. The VM code does this right. The mmap_sem is
taken for writing for mmap and for munmap. But a page fault is always
a read lock, even if the access that caused the page fault is a write.

The actual real honest-to-goodness *write* happens much later, and the
only time the filesystem really knows when it is done is at writeback
time. Not at fault time. So if you take locks, logically you should
take them when the fault happens, and release them when the writeback
is done.

Are you doing that? No. So don't do the write lock over the read
portion of the page fault. It is not a sensible operation.

 Linus



Re: [Cluster-devel] GFS2 changes for the 5.7 merge window

2020-03-31 Thread Linus Torvalds
On Tue, Mar 31, 2020 at 9:41 AM Bob Peterson  wrote:
>
> Could you please pull the following changes for gfs2?

So I've pulled it, but I note that you didn't get the automated
notification of that like almost everybody else does.

The reason seems to be pretty simple: the pr-tracker-bot looks for
emails to lkml (and other mailing lists) with variations of the
subject line "[GIT PULL]" in it.

And while lkml was cc'd for your pull request, you don't use that
subject line prefix, and so pr-tracker-bot ignores your email.

I don't personally care - you do have the markers that _I_ look for,
with both "git" and "pull" in the message body, so your pull request
doesn't get lost in the chaos that is my inbox during the merge
window. So this is purely about that notification bot.

IOW, if you care, then you should change your scripting (or manual
habits, I don't know) to have that "[GIT PULL]" in the subject line,
and then you will get that nice timely notification when I've pulled
(and pushed out).

And if you don't want that notification, you can just continue doing
what you're doing.

  Linus




Re: [Cluster-devel] [GIT PULL] iomap: small cleanups for 5.5

2019-12-03 Thread Linus Torvalds
On Tue, Dec 3, 2019 at 8:09 AM Darrick J. Wong  wrote:
> Please pull this series containing some more new iomap code for 5.5.
> There's not much this time -- just removing some local variables that
> don't need to exist in the iomap directio code.

Hmm. The tag message (which was also in the email thanks to git
request-pull) is very misleading.

Pulled, but please check these things.

   Linus




Re: [Cluster-devel] [PATCH] mm/filemap: do not allocate cache pages beyond end of file at read

2019-11-27 Thread Linus Torvalds
On Wed, Nov 27, 2019 at 7:42 AM Steven Whitehouse  wrote:
>
> I'll leave the finer details to Andreas here, since it is his patch, and
> hopefully we can figure out a good path forward.

As mentioned, I don't _hate_ that patch (ok, I seem to have typoed it
and said that I don't "gate" it ;), so if that's what you guys really
want to do, I'm ok with it. But..

I do think you already get the data with the current case, from the
"short read" thing. So just changing the current generic read function
to check against the size first:

  --- a/mm/filemap.c
  +++ b/mm/filemap.c
  @@ -2021,9 +2021,9 @@ static ssize_t
generic_file_buffered_read(struct kiocb *iocb,
unsigned int prev_offset;
int error = 0;

  - if (unlikely(*ppos >= inode->i_sb->s_maxbytes))
  + if (unlikely(*ppos >= inode->i_size))
return 0;
  - iov_iter_truncate(iter, inode->i_sb->s_maxbytes);
  + iov_iter_truncate(iter, inode->i_size);

index = *ppos >> PAGE_SHIFT;
prev_index = ra->prev_pos >> PAGE_SHIFT;

and you're done. Nice and clean.

Then in gfs2 you just notice the short read, and check at that point.
Sure, you'll also cut read-ahead to the old size boundary, but does
anybody _seriously_ believe that read-ahead matters when you hit the
"some other node wrote more data, we're reading past the old end"
case? I don't think that's the case.

But I _can_ live with the patch that adds the extra "cached only" bit.
It just honestly feels pointless.

   Linus




Re: [Cluster-devel] [PATCH] mm/filemap: do not allocate cache pages beyond end of file at read

2019-11-25 Thread Linus Torvalds
On Mon, Nov 25, 2019 at 2:53 AM Steven Whitehouse  wrote:
>
> Linus, is that roughly what you were thinking of?

So the concept looks ok, but I don't really like the new flags as they
seem to be gfs2-specific rather than generic.

That said, I don't _gate_ them either, since they aren't in any
critical code sequence, and it's not like they do anything really odd.

I still think the whole gfs2 approach is broken. You're magically ok
with using stale data from the cache just because it's cached, even if
another client might have truncated the file or something.

So you're ok with saying "the file used to be X bytes in size, so
we'll just give you this data because we trust that the X is correct".

But you're not ok to say "oh, the file used to be X bytes in size, but
we don't want to give you a short read because it might not be correct
any more".

See the disconnect? You trust the size in one situation, but not in another one.

I also don't really see that you *need* the new flag at all. Since
you're going to do a speculative read and then a real read anyway, and
since the only thing that you seem to care about is the file size
(because the *data* you will trust if it is cached), then why don't
you just use the *existing* generic read, and *IFF* you get a
truncated return value, then you go and double-check that the file
hasn't changed in size?

See what I'm saying? I think gfs2 is being very inconsistent in when
it trusts the file size, and I don't see that you even need the new
behavior that patch gives, because you might as well just use the
existing code (just move the i_size check earlier, and then teach gfs2
to double-check the "I didn't get as much as I expected" case).

 Linus




Re: [Cluster-devel] [PATCH] mm/filemap: do not allocate cache pages beyond end of file at read

2019-10-30 Thread Linus Torvalds
On Wed, Oct 30, 2019 at 11:35 AM Steven Whitehouse  wrote:
>
> NFS may be ok here, but it will break GFS2. There may be others too...
> OCFS2 is likely one. Not sure about CIFS either. Does it really matter
> that we might occasionally allocate a page and then free it again?

Why are gfs2 and cifs doing things wrong?

"readpage()" is not for synchronizing metadata. Never has been. You
shouldn't treat it that way, and you shouldn't then make excuses for
filesystems that treat it that way.

Look at mmap, for example. It will do the SIGBUS handling before
calling readpage(). Same goes for the copyfile code. A filesystem that
thinks "I will update size at readpage" is already fundamentally
buggy.

We do _recheck_ the inode size under the page lock, but that's to
handle the races with truncate etc.

Linus



Re: [Cluster-devel] [GIT PULL] gfs2 changes

2019-09-21 Thread Linus Torvalds
On Sat, Sep 21, 2019 at 9:51 AM Andreas Gruenbacher  wrote:
>
> please consider pulling the following changes for gfs2.

Merged.

But I notice that you're not getting any pr-tracker replies.

I'd suggest you cc lkml to get the automatic notification in case you care,

 Linus



Re: [GIT PULL] iomap: new code for 5.4

2019-09-19 Thread Linus Torvalds
On Thu, Sep 19, 2019 at 10:07 AM Christoph Hellwig  wrote:
>
> I personally survived with it, but really hated writing the open coded
> list_for_each_entry + list_del or while list_first_entry_or_null +
> list_del when I could have a nice primitive for it.  I finally decided
> to just add that helper..

You realize that the list helper is even mis-named?

Generally a "pop()" should pop the last entry, not the first one. Or
course, which one is last and which one is first is entirely up to you
since you can add at the tail or the beginning. So even the name is
ambiguous.

So I really think the whole thing was badly implemented (and yes, part
of that was the whole implementation thing).

Doing list_del() by hand also means that you can actually decide if
you want list_del_init() or not. So it's

But again - keep that helper if you want it. Just don't put a badly
implemented and badly named "helper" it in a global header file that
_everybody_ has to look at.

   Linus


Re: [GIT PULL] iomap: new code for 5.4

2019-09-19 Thread Linus Torvalds
On Thu, Sep 19, 2019 at 10:03 AM Linus Torvalds
 wrote:
>
> So inside of xfs, that "pop ioend from the list" model really may make
> perfect sense, and you could just do
>
> static inline struct xfs_ioend *xfs_pop_ioend(struct list_head *head)

Note that as usual, my code in emails is written in the MUA, entirely
untested, and may be completely broken.

So take it just as a "maybe something like this works for you".

 Linus


Re: [Cluster-devel] [GIT PULL] iomap: new code for 5.4

2019-09-19 Thread Linus Torvalds
On Wed, Sep 18, 2019 at 8:45 PM Darrick J. Wong  wrote:
>
> I propose the following (assuming Linus isn't cranky enough to refuse
> the entire iomap patchpile forever):

Since I don't think the code was _wrong_, it was just that I didn't
want to introduce a new pattern for something we already have, I'd be
ok with just a fix patch on top too.

And if the xfs people really want to use the "pop" model, that's fine
too - just make it specific to just the iomap_ioend type, which is all
you seem to really want to pop, and make the typing (and naming) be
about that particular thing.

As mentioned, I don't think that whole "while (pop)" model is _wrong_
per se.  But I also don't like proliferating new random list users in
general, just because some subsystem has some internal preference.
See?

And I notice that one of the users actually does keep the list around
and modifies it, ie this one:

while ((ioend = list_pop_entry(&completion_list, struct xfs_ioend, io_list))) {
xfs_ioend_try_merge(ioend, &completion_list);
xfs_end_ioend(ioend);
}

actually seems to _work_ on the remaining part of the list in that
xfs_ioend_try_merge() function.

So inside of xfs, that "pop ioend from the list" model really may make
perfect sense, and you could just do

static inline struct xfs_ioend *xfs_pop_ioend(struct list_head *head)
{
struct xfs_ioend *ioend = list_first_entry_or_null(head,
struct xfs_ioend, io_list);
if (ioend)
list_del(&ioend->io_list);
return ioend;
}

and I don't think it's wrong.

It's just that we've survived three decades without that list_pop(),
and I don't want to make our header files even bigger and more complex
unless we actually have multiple real users.

 Linus



Re: [Cluster-devel] [GIT PULL] iomap: new code for 5.4

2019-09-18 Thread Linus Torvalds
On Wed, Sep 18, 2019 at 6:31 PM Linus Torvalds
 wrote:
>
> Why would anybody use that odd "list_pop()" thing in a loop, when what
> it really seems to just want is that bog-standard
> "list_for_each_entry_safe()"

Side note: I do agree that the list_for_each_entry_safe() thing isn't
exactly beautiful, particularly since you need that extra variable for
the temporary "next" pointer.

It's one of the C++ features I'd really like to use in the kernel -
the whole "declare new variable in a for (;;) statement" thing.

In fact, it made it into C - it's there in C99 -  but we still use
"-std=gnu89" because of other problems with the c99 updates.

Anyway, I *would* be interested in cleaning up
list_for_each_entry_safe() if somebody has the energy and figures out
what we could do to get the c99 behavior without the breakage from
other sources.

For some background: the reason we use "gnu89" is because we use the
GNU extension with type cast initializers quite a bit, ie things like

#define __RAW_SPIN_LOCK_UNLOCKED(lockname)  \
(raw_spinlock_t) __RAW_SPIN_LOCK_INITIALIZER(lockname)

and that broke in c99 and gnu99, which considers those compound
literals and you can no longer use them as initializers.

See

https://lore.kernel.org/lkml/20141019231031.gb9...@node.dhcp.inet.fi/

for some of the historical discussion about this. It really _is_ sad,
because variable declarations inside for-loops are very useful, and
would have the potential to make some of our "for_each_xyz()" macros a
lot prettier (and easier to use too).

So our list_for_each_entry_safe() thing isn't perfect, but that's no
reason to try to then make up completely new things.

 Linus



Re: [GIT PULL] iomap: new code for 5.4

2019-09-18 Thread Linus Torvalds
On Tue, Sep 17, 2019 at 8:21 AM Darrick J. Wong  wrote:
>
> Please pull this series containing all the new iomap code for 5.4.

So looking at the non-iomap parts of it, I react to the new "list_pop() code.

In particular, this:

struct list_head *pos = READ_ONCE(list->next);

is crazy to begin with..

It seems to have come from "list_empty()", but the difference is that
it actually makes sense to check for emptiness of a list outside
whatever lock that protects the list. It can be one of those very
useful optimizations where you don't even bother taking the lock if
you can optimistically check that the list is empty.

But the same is _not_ true of an operation like "list_pop()". By
definition, the list you pop something off has to be stable, so the
READ_ONCE() makes no sense here.

Anyway, if that was the only issue, I wouldn't care. But looking
closer, the whole thing is just completely wrong.

All the users seem to do some version of this:

struct list_head tmp;

list_replace_init(&ioend->io_list, &tmp);
iomap_finish_ioend(ioend, error);
while ((ioend = list_pop_entry(&tmp, struct iomap_ioend, io_list)))
iomap_finish_ioend(ioend, error);

which is completely wrong and pointless.

Why would anybody use that odd "list_pop()" thing in a loop, when what
it really seems to just want is that bog-standard
"list_for_each_entry_safe()"

struct list_head tmp;
struct iomap_ioend *next;

list_replace_init(&ioend->io_list, &tmp);
iomap_finish_ioend(ioend, error);
list_for_each_entry_safe(ioend, next, &tmp, io_list)
iomap_finish_ioend(ioend, error);

which is not only the common pattern, it's more efficient and doesn't
pointlessly re-write the list for each entry, it just walks it (and
the "_safe()" part is because it looks up the next entry early, so
that the entry that it's walking can be deleted).

So I pulled it. But then after looking at it, I unpulled it again
because I don't want to see this kind of insanity in one of THE MOST
CORE header files we have in the whole kernel.

If xfs and iomap want to think they are "popping" a list, they can do
so. In the privacy of your own home, you can do stupid and pointless
things.

But no, we don't pollute core kernel code with those stupid and
pointless things.

  Linus


Re: [Cluster-devel] [PATCH] gfs2: Fix error path kobject memory leak

2019-05-13 Thread Linus Torvalds
On Mon, May 13, 2019 at 3:37 PM Andreas Gruenbacher  wrote:
>
> Sorry, I should have been more explicit. Would you mind taking this
> patch, please? If it's more convenient or more appropriate, I'll send
> a pull request instead.

Done.

However, I'd like to point out that when I see patches from people who
I normally get a pull request from, I usually ignore them.

Particularly when they are in some thread with discussion, I'll often
just assume that  th epatch is part of the thread, not really meant
for me in particular.

In this case I happened to notice that suddenly my participation
status changed, which is why I asked, but in general I might have just
archived the thread with the assumption that I'll be getting the patch
later as a git pull.

Just so you'll be aware of this in the future, in case I don't react...

   Linus



Re: [PATCH] gfs2: Fix error path kobject memory leak

2019-05-13 Thread Linus Torvalds
Andreas,
 did you intend for me to take this as a patch from the email, or will
I get a pull request with this later?

 Linus


Re: [Cluster-devel] GFS2: Pull Request

2019-05-08 Thread Linus Torvalds
On Wed, May 8, 2019 at 1:58 PM Jonathan Corbet  wrote:
>
> I think this certainly belongs in the maintainer manual, but probably not
> in pull-requests.rst.  There are a lot of things about repository
> management that seem to trip up even experienced maintainers; pre-pull
> merges is just one of those.  I would love to see a proper guide on when
> and how to do merges in general.

We had another pull request issue today, about a situation that I
definitely know you've written about in the past, because I linked to
lwn in my email:

   
https://lore.kernel.org/lkml/CAHk-=wiKoePP_9CM0fn_Vv1bYom7iB5N=ulallz7yost3k+...@mail.gmail.com/

and while I suspect people don't actually read documentation
(_particularly_ maintainers that have already been around for a long
time but still do this), maybe that part could be in the same
documentation?

 Linus



Re: [Cluster-devel] GFS2: Pull Request

2019-05-08 Thread Linus Torvalds
On Wed, May 8, 2019 at 1:17 PM Andreas Gruenbacher  wrote:
>
> Would it make sense to describe how to deal with merge conflicts in
> Documentation/maintainer/pull-requests.rst to stop people from getting
> this wrong over and over again?

Probably. I do think it got written up at some point (lwn or something
like that), but it's possible it never got made into an actual
documentation file..

Anybody want to try to massage that email into the appropriate doc file?

  Linus



Re: [Cluster-devel] GFS2: Pull Request

2019-05-08 Thread Linus Torvalds
On Wed, May 8, 2019 at 10:55 AM Linus Torvalds
 wrote:
>
> See what I'm saying? You would ask me to pull the un-merged state, but
> then say "I noticed a few merge conflicts when I did my test merge,
> and this is what I did" kind of aside.

Side note: this is more important if you know of a merge issue that
doesn't cause a conflict, and that I won't see in my simple build
tests.

For example, in this case, the merge issue doesn't cause a conflict
(because it's a totally new user of bio_for_each_segment_all() and the
syntax changed in another branch), but I see it trivially when I do a
test build, since the compiler spews out lots of warnings, and so I
can trivially fix it up (and you _mentioning_ the issue gives me the
heads up that you knew about it and what it's all about).

But if it's other architectures, or only happens under special config
options etc, I might not have seen the merge issue at all. And then
it's really good if the maintainer talks about it and shows that yes,
the maintainer knows what he's doing.

Now I'm in the situation where I have actually done the merge the way
I *like* doing them, and without your superfluous merge commit. But if
I use my merge, I'll lose the signature from your tag, because you
signed *your* merge that I didn't actually want to use at all.

See? Your "helpful" merge actually caused me extra work, and made me
have to pick one of two *worse* situations than if you had just tagged
your own development tree. Either my tree has a extra pointless merge
commit, or my tree lacks your signature on your work.

I'll just undo my pull, and delay it all in the hope that you'll just
send me a new signed tag of the non-merged state.

   Linus



Re: [Cluster-devel] GFS2: Pull Request

2019-05-08 Thread Linus Torvalds
On Wed, May 8, 2019 at 4:49 AM Andreas Gruenbacher  wrote:
>
> There was a conflict with commit 2b070cfe582b ("block: remove the i
> argument to bio_for_each_segment_all") on Jens's block layer changes
> which you've already merged. I've resolved that by merging those block
> layer changes; please let me know if you want this done differently.

PLEASE.

I say this to somebody pretty much every single merge window: don't do
merges for me.

You are actually just hurting, not helping. I want to know what the
conflicts are, not by being told after-the-fact, but by just seeing
them and resolving them.

Yes, I like being _warned_ ahead of time - partly just as a heads up
to me, but partly also to show that the maintainers are aware of the
notifications from linux-next, and that linux-next is working as
intended, and people aren't just ignoring what it reports.

But I do *NOT* want to see maintainers cross-merging each others trees.

It can cause nasty problems, ranging from simply mis-merges to causing
me to not pull a tree at all because one side of the development
effort had done something wrong.

And yes, mis-merges happen - and they happen to me too. It's fairly
rare, but it can be subtle and painful when it does happen.

But (a) I do a _lot_ of merges, so I'm pretty good at them, and (b) if
_I_ do the merge, at least I know about the conflict and am not as
taken by surprise by possible problems due to a mis-merge.

And that kind of thing is really really important to me as an upstream
maintainer. I *need* to know when different subtrees step on each
others toes.

As a result, even when there's a conflict and a merge is perfectly
fine, I want to know about it and see it, and I want to have the
option to pull the maintainer trees in different orders (or not pull
one tree at all), which means that maintainers *MUST NOT* do
cross-tree merges. See?

And I don't want to see back-merges (ie merges from my upstream tree,
as opposed to merges between different maintainer trees) either, even
as a "let me help Linus, he's already merged the other tree, I'll do
the merge for him". That's not helping, that's just hiding the issue.

Now, very very occasionally I will hit a merge that is so hard that I
will go "Hmm, I really need the involved parties to sort this out".
Honestly, I can't remember the last time that happened, but it _has_
happened.

Much more commonly, I'll do the merge, but ask for verification,
saying "ok, this looked more subtle than I like, and I can't test any
of it, so can you check out my merge". Even that isn't all that
common, but it definitely happens.

There is _one_ very special kind of merge that I like maintainers
doing: the "test merge".

That test merge wouldn't be sent to me in the pull request as the
primary signed pull, but it's useful for the maintainer to do to do a
final *check* before doing the pull request, so that you as a
maintainer know what's going on, and perhaps to warn me about
conflicts.

If you do a test merge, and you think the test merge was complex, you
might then point to your resolution in the pull request as a "this is
how I would do it". But you should not make that merge be *the* pull
request.

One additional advantage of a test merge is that it actually gives a
"more correct" diffstat for the pull request. Particularly if the pull
request is for something with complex history (ie you as a maintainer
have sub-maintainers, and have pulled other people's work), a
test-merge can get a much better diffstat. I don't _require_ that
better diffstat, though - I can see how you got the "odd" diffstat if
you don't do a test merge - but it's can be a "quality of pull
request" thing.

See what I'm saying? You would ask me to pull the un-merged state, but
then say "I noticed a few merge conflicts when I did my test merge,
and this is what I did" kind of aside.

 Linus
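
The test-merge flow described above can be sketched in a throwaway repository. All branch names and the scratch path below are made up; the point is only that the merge happens on a scratch branch, so the branch actually offered for pulling stays un-merged.

```shell
#!/bin/sh
# Throwaway-repository sketch of the "test merge" flow.
set -e
dir=$(mktemp -d)
cd "$dir"
git init -q
git config user.email maintainer@example.com
git config user.name maintainer
echo base > file
git add file
git commit -qm base
base=$(git rev-parse HEAD)

git checkout -qb feature              # the branch to be sent upstream
echo feature-work > feature-file
git add feature-file
git commit -qm feature-work

git checkout -qb mainline "$base"     # stands in for the upstream tree
echo mainline-work > mainline-file
git add mainline-file
git commit -qm mainline-work

# The test merge happens on a scratch branch, so the signed tag /
# pull request still points at the un-merged development branch.
git checkout -qb test-merge feature
git merge -q --no-edit mainline       # preview conflicts, build-test here
git checkout -q feature
git branch -qD test-merge

merge_count=$(git rev-list --merges --count feature)
echo "merge commits on feature: $merge_count"
```

Running this leaves "feature" with zero merge commits even though the merge was previewed.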



Re: [Cluster-devel] [PATCH] gfs2: Revert "Fix loop in gfs2_rbm_find"

2019-01-31 Thread Linus Torvalds
On Thu, Jan 31, 2019 at 11:12 AM Andreas Gruenbacher
 wrote:
>
> That patch just hasn't seen enough testing to make me comfortable
> sending it yet.

Ok, I'll just apply the revert.

Thanks,

  Linus



Re: [Cluster-devel] [PATCH] gfs2: Revert "Fix loop in gfs2_rbm_find"

2019-01-31 Thread Linus Torvalds
On Wed, Jan 30, 2019 at 12:30 PM Andreas Gruenbacher
 wrote:
>
> This reverts commit 2d29f6b96d8f80322ed2dd895bca590491c38d34.
>
> It turns out that the fix can lead to a ~20 percent performance regression
> in initial writes to the page cache according to iozone.  Let's revert this
> for now to have more time for a proper fix.

Ugh. I think part of the problem is that the

n += (rbm->bii - initial_bii);

doesn't make any sense when we just set rbm->bii to 0 a couple of
lines earlier. So it's basically a really odd way to write

n -= initial_bii;

which in turn really doesn't make any sense _either_.

So I'm all for reverting the commit because it caused a performance
regression, but the end result really is all kinds of odd.

Is the bug as simple as "we incremented the iterator counter twice"?
Because in the -E2BIG case, we first increment it by the difference in
bii, but then we *also* increment it in res_covered_end_of_rgrp()
(which we do *not* do for the "ret > 0" case, which goes to
next_iter).

So if somebody can run the performance test again, how does it look if
*instead* of the revert, we do what looks at least _slightly_ more
sensible, and move the "increment iteration count" up to the
next-bitmap case instead?

At that point, it will actually match the bii updates (because
next_bitmap also updates rbm->bii). Hmm?

Something like the whitespace-damaged thinig below. Completely
untested. But *if* this works, it would make more sense than the
revert..

Hmm?

   Linus

--- snip snip ---
  diff --git a/fs/gfs2/rgrp.c b/fs/gfs2/rgrp.c
  index 831d7cb5a49c..5b1006d5344f 100644
  --- a/fs/gfs2/rgrp.c
  +++ b/fs/gfs2/rgrp.c
  @@ -1796,10 +1796,10 @@ static int gfs2_rbm_find(struct gfs2_rbm
*rbm, u8 state, u32 *minext,
rbm->bii++;
if (rbm->bii == rbm->rgd->rd_length)
rbm->bii = 0;
  + n++;
   res_covered_end_of_rgrp:
if ((rbm->bii == 0) && nowrap)
break;
  - n++;
   next_iter:
if (n >= iters)
break;



Re: [Cluster-devel] [GIT PULL] gfs2: 4.20 fixes

2018-11-16 Thread Linus Torvalds
On Thu, Nov 15, 2018 at 2:44 PM Andreas Gruenbacher  wrote:
>
> Ok, I've changed the merge commit as follows now:
>
> Merge tag 'v4.20-rc1'
>
> Pull in the gfs2 fixes that went into v4.19-rc8:
>
>   gfs2: Fix iomap buffered write support for journaled files
>   gfs2: Fix iomap buffered write support for journaled files (2)
>
> Without these two commits, the following fix would cause conflicts.
>
> So merging v4.19-rc8 would have been sufficient. v4.20-rc1 is what I
> ended up testing, though.
>
> Are you okay with that now?

If the only reason for this was the trivial one-liner conflict, the
real fix is to just not do the merge at all, and not worry about
conflicts.

I handle simple conflicts like this in my sleep. It's actually much
better and clearer if the development trees just do their own
development, and not worry about "Oh, Linus will get a small conflict
due to other changes he has pulled".

So what I did was to actually try to generate the tree I think it
*should* have been, and just merge that with the simple conflict
resolution.

You might want to double-check it - not only because I rewrote your
pull to not have the merge at all, and in the process modified things,
but just to see what I think the right approach would have been.

Now, there *are* reasons to do back-merges, but no, "Oh, there's a
trivial conflict" is not one of them.

  Linus



Re: [Cluster-devel] [GIT PULL] gfs2: 4.20 fixes

2018-11-15 Thread Linus Torvalds
On Thu, Nov 15, 2018 at 12:20 PM Andreas Gruenbacher
 wrote:
>
> I guess rebasing the for-next branch onto something more recent to
> avoid the back-merge in the first place will be best, resulting in a
> cleaner history.

Rebases aren't really any better at all.

If you have a real *reason* for a merge, do the merge. But then the
reason should be clearly stated in the merge commit. Not just some
random undocumented merge message. Tell why some other branch was
relevant to your fix and needed to be pulled in.

Better yet, don't do either merges or rebases.

 Linus



Re: [Cluster-devel] [GIT PULL] gfs2: 4.20 fixes

2018-11-15 Thread Linus Torvalds
On Thu, Nov 15, 2018 at 6:00 AM Andreas Gruenbacher  wrote:
>
> could you please pull the following gfs2 fixes for 4.20?

No.

I'm not pulling this useless commit message:

  "Merge tag 'v4.20-rc1'"

with absolutely _zero_ explanation for why that merge was done.

Guys, stop doing this. Because I will stop pulling them.

If you can't be bothered to explain exactly why you're doing a merge,
I can't be bothered to pull the result.

Commit messages are important. They explain why something was done.
Merge commits are to some degree *more* important, because they do odd
things and the reason is not obvious from the code ("Oh, it's a
oneliner obvious fix").

So merge commits without a reason for them are simply not acceptable.

Linus



Re: [Cluster-devel] GFS2: Pull request (merge window)

2018-10-24 Thread Linus Torvalds
On Tue, Oct 23, 2018 at 8:26 PM Bob Peterson  wrote:
>
> Please consider pulling the following changes for the GFS2 file system.

Pulled.

Just FYI - I end up editing your notes to look a bit more like most
other people send them. No need to change anything, I just thought I'd
mention it.

Linus



Re: [Cluster-devel] GFS2: Pull request (merge window)

2017-05-05 Thread Linus Torvalds
On Fri, May 5, 2017 at 1:28 PM, Bob Peterson  wrote:
>
> I asked around, but nobody could tell me what went wrong. Strangely,
> this command:
>
> git log --oneline --right-only origin/master...FETCH_HEAD --stat
>
> doesn't show this, but this one does:
>
> git diff --stat --right-only origin/master...FETCH_HEAD

So the fundamental difference between "git log" and "git diff" is that
one is a "set operation" on the commits in question, and the other is
fundamentally an "operation between two endpoints".

And that's why "git log" will always show the "right" thing - because
even in the presence of complex history, there's no ambiguity about
which commits are part of the new set, and which are in the old set.
So "git log" just does a set difference, and shows the commits in one
set but not the other.

But "git diff", because it is fundamentally about two particular
points in history, can have a hard time once you have complex history:
what are the two points?

In particular, what "git diff origin/master...FETCH_HEAD" means is really:

 - find the common point (git calls it "merge base" because the common
point is also used for merging) between the two commits (origin/master
and FETCH_HEAD)

 - do the diff from that common point to the end result (FETCH_HEAD)

and for linear history that is all very obvious and unambiguous.

But once you have non-linear history, and particularly once you have
back-merges (ie you're not just merging work that is uniquely your own
from multiple of your *own* branches, but you're also doing merges of
upstream code), the notion of that "common point" is no longer
unambiguous. There is not necessarily any *one* common base, there can
be multiple points in history that are common between the two
branches, but are distinct points of history (ie one is not an
ancestor of another).

And since a diff is fundamentally about just two end-points ("what are
the differences between these two points in the history"), "git diff"
fundamentally cannot handle that case without help.

So "git diff" will pick the first of the merge bases it finds, and
just use that. Which even in the presence of more complex history will
often work by luck, but more often just means that you'll see
differences that aren't all from your tree, but some of them came from
the *other* common point(s).

For example, after doing the pull, I can then do:

git merge-base --all HEAD^ HEAD^2

to see the merge bases of the merge in HEAD. In this case, because of
your back-merge, there's two of them (with more complex history, there
can be more still):

  f9fe1c12d126 rhashtable: Add rhashtable_lookup_get_insert_fast
  69eea5a4ab9c Merge branch 'for-linus' of git://git.kernel.dk/linux-block

and because "git diff" will just pick the first one, you will
basically have done

git diff f9fe1c12d126..FETCH_HEAD

and if you then look at the *set* of changes (with "git log" of that
range), you'll see why that diff also ends up containing those block
changes (because they came in from that other merge base: commit
69eea5a4ab9c that had that linux-block merge).

Now, doing a *merge* in git will take _all_ of those merge bases into
account, and do something *much* more complicated than just a two-way
diff. It will internally first create a single merge base (by
recursively merging up all the other merge bases into a new internal
commit), and then using that single merge base it will then do a
normal three-way merge of the two branches.

"git diff" doesn't do that kind of complicated operation, and although
it *could* do that merge base recursive merging dance, the problem
would be what to do about conflicts (which "git merge" obviously can
also have, but with git merge you have that whole manual conflict
resolution case).

So once you have complex history that isn't just about merging your
own local changes from other local branches, you'll start hitting this
situation.

Visualizing the history with "gitk" for those cases is often a great
way to see why there's no single point that can be diffed against.

But once you *do* have that kind of complex history, you're also
expected to have the expertise to handle it:

> So I created a temporary local branch and used git merge to
> generate a correct diffstat.

That's the correct thing to do. Hopefully the above explains *why*
it's the correct thing to do.

(Although to be honest, I also am pretty used to parsing the wrong
diffs, and people sometimes just send me the garbage diffstat and say
"I don't know what happened", and I'll figure it out and can still
validate that the garbage diffstat they sent me is what I too get if I
do just a silly "git diff" without taking merge bases into account).
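The ambiguity described above can be reproduced in a throwaway repository. The script below is an illustrative sketch only (the branch names `trunk`, `b1`, `b2` and the file names are invented for the demo): a criss-cross back-merge gives `git merge-base --all` two answers, and the temporary-merge trick then yields a correct diffstat.

```shell
# Illustrative sketch: all branch and file names are invented.
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.email demo@example.com
git config user.name demo

echo base > f; git add f; git commit -qm base
git branch -m trunk

git checkout -qb b1 trunk; echo 1 > a; git add a; git commit -qm "b1 work"
git checkout -qb b2 trunk; echo 2 > b; git add b; git commit -qm "b2 work"

# Criss-cross: each branch back-merges the other's pre-merge tip.
t1=$(git rev-parse b1); t2=$(git rev-parse b2)
git checkout -q b1; git merge -q --no-edit "$t2"
git checkout -q b2; git merge -q --no-edit "$t1"
echo 3 > c; git add c; git commit -qm "more b2 work"

# Two distinct merge bases, neither an ancestor of the other:
git merge-base --all b1 b2

# Temporary-merge trick: merge into a scratch branch, then diff the
# merge commit against its first parent for a correct diffstat.
git checkout -qb tmp b1
git merge -q --no-edit b2
git diff --stat tmp^ tmp
```

Running `git diff b1...b2` instead would silently pick one of the two bases, which is exactly how the extra, unrelated changes sneak into the diffstat.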

   Linus



Re: [Cluster-devel] [PATCH 04/21] fs: Replace CURRENT_TIME with current_fs_time() for inode timestamps

2016-06-09 Thread Linus Torvalds
On Thu, Jun 9, 2016 at 1:38 PM, Deepa Dinamani  wrote:
>
> 1. There are a few link, rename functions which assign times like this:
>
> -   inode->i_ctime = dir->i_ctime = dir->i_mtime = CURRENT_TIME;
> +   inode->i_ctime = dir->i_ctime = dir->i_mtime =
> current_fs_time(dir->i_sb);

So I think you should just pass in one of the two inodes and just add
a comment.

Then, if we hit a filesystem that actually wants to have different
granularity for different inodes, we'll split it up, but even then
we'd be better off than with the superblock, since then we *could*
easily split this one case up into "get directory time" and "get inode
time".


> 2. Also, this means that we will make it an absolute policy that any 
> filesystem
> timestamp that is not directly connected to an inode would have to use
> ktime_get_* apis.

The thing is, those kinds of things are all going to be inside the
filesystem itself.

At that point, the *filesystem* already knows what the timekeeping
rules for that filesystem is.

I think we should strive to design the "current_fs_time()" not for
internal filesystem use, but for actual generic use where we *don't*
know a priori what the rules are, and we have to go to this helper
function to figure it out.

Inside a filesystem, why *shouldn't* the low-level filesystem already
use the normal "get time" functions?

See what I'm saying? The primary value-add to "current_fs_time()" is
for layers like the VFS and security layer that don't know what the
filesystem itself does.

At the low-level filesystem layer, you may just know that "ok, I only
have 32-bit timestamps anyway, so I should just use a 32-bit time
function".

> 3. Even if the filesystem inode has extra timestamps and these are not
> part of vfs inode, we still use
> vfs inode to get the timestamps from current_fs_time(): Eg: ext4 create time

But those already have an inode.

In fact, ext4 is a particularly bad example, since it uses the
ext4_current_time() function to get the time. And that one gets an
inode pointer.

So at least one filesystem that already does this, already uses a
inode-based model.

Everything I see just says "times are about inodes". Anything else
almost has to be filesystem-internal anyway, since the only thing that
is ever visible outside the filesystem (time-wise) is the inode.

And as mentioned, once it's internal to the low-level filesystem, it's
not obvious at all that you'd have to use "current_fs_time()" anyway.
The internal filesystem code might very well decide to use other
timekeeping functions.

 Linus



Re: [Cluster-devel] [PATCH 04/21] fs: Replace CURRENT_TIME with current_fs_time() for inode timestamps

2016-06-09 Thread Linus Torvalds
On Wed, Jun 8, 2016 at 10:04 PM, Deepa Dinamani  wrote:
> CURRENT_TIME macro is not appropriate for filesystems as it
> doesn't use the right granularity for filesystem timestamps.
> Use current_fs_time() instead.

Again - using the inode instead of the superblock in this patch
would have made the patch much more obvious (it could have been 99%
generated with the sed-script I sent out a week or two ago), and it
would have made it unnecessary to add these kinds of things:

> diff --git a/drivers/usb/core/devio.c b/drivers/usb/core/devio.c
> index e9f5043..85c12f0 100644
> --- a/drivers/usb/core/devio.c
> +++ b/drivers/usb/core/devio.c
> @@ -2359,6 +2359,7 @@ static long usbdev_do_ioctl(struct file *file, unsigned 
> int cmd,
>  {
> struct usb_dev_state *ps = file->private_data;
> struct inode *inode = file_inode(file);
> +   struct super_block *sb = inode->i_sb;
> struct usb_device *dev = ps->dev;
> int ret = -ENOTTY;

where we add a new variable just because the calling convention was wrong.

It's not even 100% obvious that a filesystem has to have one single
time representation, so making the time function about the entity
whose time is set is also conceptually a much better model, never mind
that it is just what every single user seems to want anyway.

So I'd *much* rather see

+   inode->i_atime = inode->i_mtime = inode->i_ctime =
current_fs_time(inode);

over seeing either of these two variants:

+   inode->i_atime = inode->i_mtime = inode->i_ctime =
current_fs_time(inode->i_sb);
+   ret->i_atime = ret->i_mtime = ret->i_ctime = current_fs_time(sb);

because in the first of those variants (grep for current_fs_time() in the
current git tree, and notice that it's the common one) we have the
pointless "let's chase a pointer in every caller" indirection.

And while it's true that the second variant is natural for *some*
situations, I've yet to find one where it wasn't equally sane to just
pass in the inode instead.

 Linus



Re: [Cluster-devel] GFS2: Pull request (merge window)

2015-04-14 Thread Linus Torvalds
On Tue, Apr 14, 2015 at 10:47 AM, Bob Peterson rpete...@redhat.com wrote:

>  There's another that adds me as a GFS2 co-maintainer [...]

So generally, when I start getting pull requests from different
people, I'd like to see a previous separate heads-up or confirmation
from the previous person just so that I'm not taken by surprise.
Rather than this kind of note as part of the pull request, make the
thing official.

Yes, yes, I can see that you use the same script that Steven did, and
I can see that you've been the main developer lately, but still..

Anyway. Pulled.

 Linus



Re: [Cluster-devel] GFS2: Pull request (merge window)

2015-04-14 Thread Linus Torvalds
On Tue, Apr 14, 2015 at 10:47 AM, Bob Peterson rpete...@redhat.com wrote:

>  12 files changed, 184 insertions(+), 95 deletions(-)

Oh, and this was incorrect. You had apparently limited the statistics
to the fs/gfs2 directory, and thus missed the changes to the
MAINTAINERS file. Don't do that - include all the changes. It's what I
see and compare against when I actually do the pull anyway.

  Linus



Re: [Cluster-devel] GFS2: Pull request (fixes)

2013-08-19 Thread Linus Torvalds
On Mon, Aug 19, 2013 at 2:26 AM, Steven Whitehouse swhit...@redhat.com wrote:
>  Oops, forgot to sign this when I sent it first, so here is a signed copy of
>  it.

Please sign the git tag instead of the email. There are no sane tools
to check email signatures etc, but git signed tags not only check your
pgp key signature, they also allow you to just write the message into
the tag.
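As a sketch of that last point: even an unsigned annotated tag carries its message inside the tag object itself. The repo, tag name, and message below are invented for the demo, and since `git tag -s` additionally needs a GPG key, the sketch uses `-a` (annotated but unsigned) to show the mechanism.

```shell
# Sketch with invented names; -a makes an annotated (unsigned) tag,
# -s would additionally GPG-sign the same tag object.
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.email demo@example.com
git config user.name demo
echo fix > f; git add f; git commit -qm "gfs2: some fix"

git tag -a gfs2-3.0-fixes -m "gfs2 fixes: summary that travels with the tag"

# The message is stored in the tag object itself:
git cat-file tag gfs2-3.0-fixes
```

When the maintainer fetches a signed tag, git can both verify the signature and reuse that embedded message in the merge commit.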

Also, please don't do crap like this:

>  are available in the git repository at:
>
>    https://git.kernel.org/cgit/linux/kernel/git/steve/gfs2-3.0-fixes.git master

Yeah, no. The proper address is

  git://git.kernel.org/pub/scm/linux/kernel/git/steve/gfs2-3.0-fixes

not that cgit web page.

 Linus



Re: [Cluster-devel] [patch for-3.8] fs, dlm: fix build error when EXPERIMENTAL is disabled

2013-02-12 Thread Linus Torvalds
On Tue, Feb 12, 2013 at 1:50 AM, Steven Whitehouse swhit...@redhat.com wrote:

>  That doesn't seem right to me... DLM has not been experimental for a
>  long time now. Why not just select CRC32 in addition to IP_SCTP ?

Hmm. IP_SCTP already does a "select libcrc32c". So why doesn't that
end up working?

Linus



Re: [Cluster-devel] GFS2: Pull request (merge window)

2012-08-03 Thread Linus Torvalds
On Fri, Aug 3, 2012 at 3:18 AM, Steven Whitehouse swhit...@redhat.com wrote:

>  I've been writing summaries when I send out the pre-pull patch posting
>  so I could just copy & paste from that, if that's the kind of thing
>  that you are after?
>
>  https://lkml.org/lkml/2012/7/23/59

Yup, exactly something like that. And then depending on whether there
are any new and interesting things, please elaborate..

  Thanks,
 Linus



Re: [Cluster-devel] GFS2: Pull request (merge window)

2012-08-02 Thread Linus Torvalds
[ For some reason this didn't go out, and was in my drafts folder.
Better late than never, I guess ]

On Mon, Jul 23, 2012 at 7:59 AM, Steven Whitehouse swhit...@redhat.com wrote:

>  Please consider pulling the following patches. There have been no changes
>  since they were posted for review,

It would be lovely to get a short overview of what the changes you ask
me to pull actually do, so that I can populate the merge message with
relevant information.

(Not just a message for you, this is true in general. Although I'm
very pleased with the merge window so far - I think most pull requests
have had at least *some* general overview in tags and/or emails)

Linus



Re: [Cluster-devel] GFS2: Use kmalloc when possible for ->readdir()

2010-07-28 Thread Linus Torvalds
On Wed, Jul 28, 2010 at 4:15 AM, Steven Whitehouse swhit...@redhat.com wrote:

>  We may be able to eliminate vmalloc entirely at some stage,
>  but this is easy to do right away.

Quite frankly, I'd much rather see this abstracted out a bit. Why not just do a

  void *memalloc(unsigned int size)
  {
          if (size <= KMALLOC_MAX_SIZE) {
                  void *ptr = kmalloc(size, GFP_KERNEL | __GFP_NOWARN);
                  if (ptr)
                          return ptr;
          }
          return vmalloc(size);
  }

  void memfree(void *ptr)
  {
          if (is_vmalloc_addr(ptr)) {
                  vfree(ptr);
                  return;
          }
          kfree(ptr);
  }

wouldn't that be much nicer? No need for that explicit flag, and you
don't mess up an already way-too-ugly function even more.

Also, I do notice that you used GFP_NOFS, but you didn't use that for
the vmalloc() thing. If there really are lock reentrancy reasons, you
_could_ use __vmalloc(size, GFP_NOFS, PAGE_KERNEL).  But since you've
been using vmalloc() for a long time, I suspect GFP_KERNEL works fine.
Yes/no?

  Linus



[Cluster-devel] Re: [PATCH] SLOW_WORK: Move slow_work's proc file to debugfs

2009-12-01 Thread Linus Torvalds


On Tue, 1 Dec 2009, Ingo Molnar wrote:
 
>  Nice - thanks for doing this so quickly! It might sound like nitpicking
>  but /proc ABIs tend to be a lot harder to get rid of than debugfs
>  interfaces.

Ok, I applied it, so we'll not switch interfaces (even if they are just 
for debugging) across releases.

Btw David, for things like this, it's _really_ nice to use git rename 
detection. The diffstat (w/ summary) with rename detection looks like 
this:

 Documentation/slow-work.txt  |4 ++--
 include/linux/slow-work.h|8 
 init/Kconfig |8 
 kernel/Makefile  |2 +-
 kernel/{slow-work-proc.c => slow-work-debugfs.c} |4 ++--
 kernel/slow-work.c   |   18 --
 kernel/slow-work.h   |6 +++---
 7 files changed, 28 insertions(+), 22 deletions(-)
 rename kernel/{slow-work-proc.c => slow-work-debugfs.c} (97%)

which makes it obvious that the changes were really just about renaming. 
Compare to the non-rename-aware one:

 Documentation/slow-work.txt |4 +-
 include/linux/slow-work.h   |8 +-
 init/Kconfig|8 +-
 kernel/Makefile |2 +-
 kernel/slow-work-debugfs.c  |  227 +++
 kernel/slow-work-proc.c |  227 ---
 kernel/slow-work.c  |   18 +++-
 kernel/slow-work.h  |6 +-
 8 files changed, 253 insertions(+), 247 deletions(-)
 create mode 100644 kernel/slow-work-debugfs.c
 delete mode 100644 kernel/slow-work-proc.c

where you can kind of guess that slow-work-[proc|debugfs].c are largely 
the same, but you don't actually _see_ that it only has four lines of 
changes (and the patch then shows that the changes are just to comments).
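The difference is easy to reproduce in a throwaway repository (file names below are invented for the demo). Note that modern git enables rename detection in `git diff` by default, so `--no-renames` is needed to see the old delete-plus-create behaviour:

```shell
# Sketch with invented file names: compare diffstats for a rename
# with and without rename detection.
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.email demo@example.com
git config user.name demo

seq 1 50 > slow-work-proc.c
git add .; git commit -qm "add file"

git mv slow-work-proc.c slow-work-debugfs.c
echo "/* small tweak */" >> slow-work-debugfs.c
git commit -qam "rename plus a small tweak"

# With rename detection (-M): one rename entry plus the small change.
git diff -M --stat --summary HEAD^ HEAD

# Without it: a wholesale delete plus create of ~50 lines each.
git diff --no-renames --stat --summary HEAD^ HEAD
```

The `(97%)` in a rename line is the similarity score: how much of the file survived the rename unchanged.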

Linus


