Re: [GIT PULL] Btrfs fixes for 4.17-rc6

2018-05-20 Thread Linus Torvalds
On Sun, May 20, 2018 at 8:21 AM David Sterba  wrote:

> They IMHO qualify for a late rc, though I did not expect that many.

Especially with the tree-log.c changes being fairly big, I took a look, and
I have to say that I appreciate (a) the warning in the pull request and (b)
the extensive log messages explaining the problems these patches fix.

I obviously still prefer to see only small and simple one-liners just
before I'm making ready to release rc6, but in the absence of oneliners I
do appreciate good explanations.

Thanks,

   Linus
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Please add 21035965f60b ("bitmap: fix memset optimization on big-endian systems") to the stable tree

2018-04-03 Thread Linus Torvalds
On Tue, Apr 3, 2018 at 11:14 AM, Omar Sandoval  wrote:
> Just wanted to make sure this doesn't get missed because I misspelled
> the stable email address in the patch. It applies to v4.13+.

The "sta...@kernel.org" address for a stable cc is actually better
than stable@vger" imho, because afaik it still matches Greg's
automation (that also picks up on "Fixes:" tags), and it does *not*
cause extra email flurries when people use git-send-email scripts.

iirc, we had some vendor leak a security bug early because they
actually just cc'd everybody - including the stable list - on the
not-yet-publicly-committed change.

Greg - if your automation has changed, and you actually really want
the "vger", let me know. Because I tend to just use
"sta...@kernel.org"

   Linus
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] bitmap: fix memset optimization on big-endian systems

2018-04-02 Thread Linus Torvalds
On Mon, Apr 2, 2018 at 4:37 PM, Linus Torvalds
<torva...@linux-foundation.org> wrote:
>
> We should probably have at least switched it to "unsigned long int"

I meant just "unsigned int", of course.

Right now we occasionally have a silly 64-bit field just for a couple of flags.

Of course, we do have cases where 64-bit architectures happily end up
using more than 32 bits too, so the "unsigned long" is occasionally
useful. But it's rare enough that it probably wasn't the right thing
to do.

Water under the bridge. Part of it is due to another historical
accident: the alpha port was one of the early ports, and it didn't do
atomic byte accesses at all.

So we had multiple issues that all conspired to "unsigned long arrays
are the natural for atomic bit operations" even though they have this
really annoying byte order issue.

  Linus
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] bitmap: fix memset optimization on big-endian systems

2018-04-02 Thread Linus Torvalds
On Mon, Apr 2, 2018 at 3:58 PM, Omar Sandoval  wrote:
>
> Commit 2a98dc028f91 introduced an optimization to bitmap_{set,clear}()
> which uses memset() when the start and length are constants aligned to a
> byte. This is wrong on big-endian systems;

Ugh, yes.

In retrospect, I do wish I had just made the bitmap types be
byte-based, but we had strong reasons for those "unsigned long" array
semantics.

The traditional problem - and the reason why it is byte-order
dependent - was that we often mix bitmap operations with "unsigned
long flags" style operations.

We should probably have at least switched it to "unsigned long int"
with the whole 64-bit transition, but never did even that, so the
bitmap format is not just byte order dependent, but dependent on the
size of "long".

I guess the "unsigned long flag" issue still exists in several places,
and we're stuck with it, probably forever. Think page flags, but also
various networking flags etc.

You'd *think* they use bitmap operations consistently, but they
definitely mix it with direct accesses to the flags field, eg the page
flags are usually done using the PG_xyz bit numbers, but occasionally
you have code that checks multiple independent bits at once, doing
things like

  #define PAGE_FLAGS_PRIVATE  \
  (1UL << PG_private | 1UL << PG_private_2)

  static inline int page_has_private(struct page *page)
  {
  return !!(page->flags & PAGE_FLAGS_PRIVATE);
  }

so the bits really are _exposed_ as being encoded as bits in an unsigned long.

Your patch is obviously correct, and we just need to make sure people
*really* understand that bitmaps are arrays of unsigned long, and byte
(and bit) order is a real thing.

 Linus
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v5 0/2] Remove false-positive VLAs when using max()

2018-03-22 Thread Linus Torvalds
On Thu, Mar 22, 2018 at 8:01 AM, Kees Cook  wrote:
>
> Seems like it doesn't like void * arguments:

Yeah, that was discussed separately, I just didn't realize we had any
such users.

As David said, just adding a (long) cast to it should be fine, ie

  #define __is_constant(a) \
 (sizeof(int) == sizeof(*(1 ? ((void*)((long)(a) * 0l)) : (int*)1)))

and is probably a good idea even outside of pointers (because "long
long" constants could cause warnings too due to the cast to (void *)).

Linus
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v5 0/2] Remove false-positive VLAs when using max()

2018-03-20 Thread Linus Torvalds
On Tue, Mar 20, 2018 at 4:23 PM, Linus Torvalds
<torva...@linux-foundation.org> wrote:
>
> Hmm. So thanks to the diseased mind of Martin Uecker, there's a better
> test for "__is_constant()":
>
>   /* Glory to Martin Uecker <martin.uec...@med.uni-goettingen.de> */
>   #define __is_constant(a) \
> (sizeof(int) == sizeof(*(1 ? ((void*)((a) * 0l)) : (int*)1)))
>
> that is actually *specified* by the C standard to work, and doesn't
> even depend on any gcc extensions.

Well, it does depend on 'sizeof(*(void *)X)' being 1 and the compiler
not complaining about it, and that sizeof(int) is not 1.

But since we depend on those things in the kernel anyway, that's fine.

Linus
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v5 0/2] Remove false-positive VLAs when using max()

2018-03-20 Thread Linus Torvalds
On Sat, Mar 17, 2018 at 1:07 PM, Kees Cook  wrote:
>
> No luck! :( gcc 4.4 refuses to play along. And, hilariously, not only
> does it not change the complaint about __builtin_choose_expr(), it
> also thinks that's a VLA now.

Hmm. So thanks to the diseased mind of Martin Uecker, there's a better
test for "__is_constant()":

  /* Glory to Martin Uecker  */
  #define __is_constant(a) \
(sizeof(int) == sizeof(*(1 ? ((void*)((a) * 0l)) : (int*)1)))

that is actually *specified* by the C standard to work, and doesn't
even depend on any gcc extensions.

The reason is some really subtle pointer conversion rules, where the
type of the ternary operator will depend on whether one of the
pointers is NULL or not.

And the definition of NULL, in turn, very much depends on "integer
constant expression that has the value 0".

Are you willing to do one final try on a generic min/max? Same as my
last patch, but using the above __is_constant() test instead of
__builtin_constant_p?

   Linus
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v5 0/2] Remove false-positive VLAs when using max()

2018-03-19 Thread Linus Torvalds
On Mon, Mar 19, 2018 at 2:43 AM, David Laight  wrote:
>
> Is it necessary to have the full checks for old versions of gcc?
>
> Even -Wvla could be predicated on very recent gcc - since we aren't
> worried about whether gcc decides to generate a vla, but whether
> the source requests one.

You are correct. We could just ignore the issue with old gcc versions,
and disable -Wvla rather than worry about it.

  Linus
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v5 0/2] Remove false-positive VLAs when using max()

2018-03-18 Thread Linus Torvalds
On Sun, Mar 18, 2018 at 3:59 PM, Rasmus Villemoes
 wrote:
>
> OK, I missed where this was made about side effects of x and y

We never made it explicit, since all we really cared about in the end
is the constantness.

But yes:

> but I suppose the idea was to use
>
>   no_side_effects(x) && no_side_effects(y) ? trivial_max(x, y) :
> old_max(x, y)

Exactly. Except with __builtin_choose_expr(), because we need the end
result to be seen as a integer constant expression, so that we can
then use it as an array size. So it needs that early parse-time
resolution.

>> We only really use __builtin_constant_p() as a (bad) approximation of
>> that in this case, since it's the best we can do.
>
> I don't think you should parenthesize bad, rather capitalize it. ({x++;
> 0;}) is constant in the eyes of __builtin_constant_p, but not
> side-effect free.

Hmm. Yeah, but probably don't much care for the kernel.

For example, we do things like this:

   #define __swab64(x) \
(__builtin_constant_p((__u64)(x)) ? \
___constant_swab64(x) : \
__fswab64(x))

where that "___constant_swab64()" very much uses the same argument
over and over.

And we do that for related reasons - we really want to do the constant
folding at build time for some cases, and this was the only sane way
to do it. Eg code like

return (addr & htonl(0xff00)) == htonl(0x7f00);

wants to do the right thing, and long ago gcc didn't have builtins for
byte swapping, so we had to just do nasty horribly macros that DTRT
for constants.

But since the kernel is standalone, we don't need to really worry
about the *generic* case, we just need to worry about our own macros,
and if somebody does that example you show I guess we'll just have to
shun them ;)

Of course, our own macros are often macros from hell, exactly because
they often contain a lot of type-checking and/or type-(size)-based
polymorphism. But then we actually *want* gcc to simplify things, and
avoid side effects, just have potentially very complex expressions.

But we basically never have that kind of intentionally evil macros, so
we are willing to live with a bad substitute.

But yes, it would be better to have some more control over things, and
actually have a way to say "if X is a constant integer expression, do
transformation Y, otherwise call function y()".

Actually sparse started out with the idea in the background that it
could become not just a checker, but a "transformation tool". Partly
because of long gone historical issues (ie gcc people used to be very
anti-plugin due to licensing issues).

Of course, a more integrated and powerful preprocessor language is
almost always what we *really* wanted, but traditionally "powerful
preprocessor" has always meant "completely incomprehensible and badly
integrated preprocessor".

"cpp" may be a horrid language, but it's there and it's fast (when
integrated with the front-end, like everybody does now)

But sadly, cpp is really bad for the above kind of "if argument is
constant" kind of tricks. I suspect we'd use it a lot otherwise.

>> So the real use would be to say "can we use the simple direct macro
>> that just happens to also fold into a nice integer constant
>> expression" for __builtin_choose_expr().
>>
>> I tried to find something like that, but it really doesn't exist, even
>> though I would actually have expected it to be a somewhat common
>> concern for macro writers: write a macro that works in any arbitrary
>> environment.
>
> Yeah, I think the closest is a five year old suggestion
> (https://gcc.gnu.org/bugzilla/show_bug.cgi?id=57612) to add a
> __builtin_assert_no_side_effects, that could be used in macros that for
> some reason cannot be implemented without evaluating some argument
> multiple times. It would be more useful to have the predicate version,
> which one could always turn into a build bug version. But we have
> neither, unfortunately.

Yeah, and since we're in the situation that *new* gcc versions work
for us anyway, and we only have issues with older gcc's (that sadly
people still use), even if there was a new cool feature we couldn't
use it.

Oh well. Thanks for the background.

Linus
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v5 0/2] Remove false-positive VLAs when using max()

2018-03-18 Thread Linus Torvalds
On Sun, Mar 18, 2018 at 2:13 PM, Rasmus Villemoes
<li...@rasmusvillemoes.dk> wrote:
> On 2018-03-17 19:52, Linus Torvalds wrote:
>>
>> Ok, so it really looks like that same "__builtin_constant_p() doesn't
>> return a constant".
>>
>> Which is really odd, but there you have it.
>
> Not really. We do rely on builtin_constant_p not being folded too
> quickly to a 0/1 answer, so that gcc still generates good code even if
> the argument is only known to be constant at a late(r) optimization
> stage (through inlining and all).

Hmm. That makes sense. It just doesn't work for our case when we
really want to choose the expression based on side effects or not.

> So unlike types_compatible_p, which
> can obviously be answered early during parsing, builtin_constant_p is
> most of the time a yes/no/maybe/it's complicated thing.

The silly thing is, the thing we *really* want to know _is_ actually
easily accessible during the early parsing, exactly like
__builtin_compatible_p(): it's not really that we care about the
expressions being constant, as much as the "can this have side
effects" question.

We only really use __builtin_constant_p() as a (bad) approximation of
that in this case, since it's the best we can do.

So the real use would be to say "can we use the simple direct macro
that just happens to also fold into a nice integer constant
expression" for __builtin_choose_expr().

I tried to find something like that, but it really doesn't exist, even
though I would actually have expected it to be a somewhat common
concern for macro writers: write a macro that works in any arbitrary
environment.

I guess array sizes are really the only true limiting environment
(static initializers is another one).

How annoying. Odd that newer gcc's seem to do so much better (ie gcc-5
seems to be fine). So _something_ must have changed.

 Linus
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v5 0/2] Remove false-positive VLAs when using max()

2018-03-17 Thread Linus Torvalds
On Sat, Mar 17, 2018 at 12:27 AM, Kees Cook  wrote:
>
> Unfortunately my 4.4 test fails quickly:
>
> ./include/linux/jiffies.h: In function ‘jiffies_delta_to_clock_t’:
> ./include/linux/jiffies.h:444: error: first argument to
> ‘__builtin_choose_expr’ not a constant

Ok, so it really looks like that same "__builtin_constant_p() doesn't
return a constant".

Which is really odd, but there you have it.

I wonder if you can use that "sizeof()" to force evaluation of it,
because sizeof() really does end up being magical when it comes to
"integer constant expression".

So instead of this:

   #define __no_side_effects(a,b) \
  (__builtin_constant_p(a)&&__builtin_constant_p(b))

that just assumes that __builtin_constant_p() itself always counts as
a constant expression, what happens if you do

  #define __is_constant(a) \
(sizeof(char[__builtin_constant_p(a)]))

  #define __no_side_effects(a,b) \
(__is_constant(a) && __is_constant(b))

I realize that the above looks completely insane: the whole point is
to *not* have VLA's, and we know that __builtin_constant_p() isn't
always evaliated as a constant.

But hear me out: if the issue is that there's some evaluation ordering
between the two builtins, and the problem is that the
__builtin_choose_expr() part of the expression is expanded *before*
the __builtin_constant_p() has been expanded, then just hiding it
inside that bat-shit-crazy sizeof() will force that to be evaluated
first (because a sizeof() is defined to be a integer constant
expression.

So the above is completely insane, bit there is actually a chance that
using that completely crazy "x -> sizeof(char[x])" conversion actually
helps, because it really does have a (very odd) evaluation-time
change.  sizeof() has to be evaluated as part of the constant
expression evaluation, in ways that "__builtin_constant_p()" isn't
specified to be done.

But it is also definitely me grasping at straws. If that doesn't work
for 4.4, there's nothing else I can possibly see.

Linus
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v5 0/2] Remove false-positive VLAs when using max()

2018-03-16 Thread Linus Torvalds
On Fri, Mar 16, 2018 at 10:44 AM, David Laight  wrote:
>
> I looked at the generated code for one of the constant sized VLA that
> the compiler barfed at.
> It seemed to subtract constants from %sp separately for the VLA.
> So it looks like the compiler treats them as VLA even though it
> knows the size.
> That is probably missing optimisation.

Looking at the code is definitely an option.

In fact, instead of depending on -Wvla, we could just make 'objtool'
warn about real variable-sized stack frames.

That said, if that "sizeof()" trick of Al's actually works with older
gcc versions too (it *should*, but it's not like
__builtin_choose_expr() and __builtin_constant_p() have well-defined
rules in the standard), that may just be the solution.

And if gcc ends up generating bad code for those "constant sized vlas"
anyway, then -Wvla would effectively warn about that code generation
problem.

  Linus
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v5 0/2] Remove false-positive VLAs when using max()

2018-03-16 Thread Linus Torvalds
On Fri, Mar 16, 2018 at 1:14 PM, Linus Torvalds
<torva...@linux-foundation.org> wrote:
>
> It does not work with gcc-4.1.x, but works with gcc-4.4.x.
>
> I can't seem to see the errors any way, I wonder if
> __builtin_choose_expr() simply didn't exist back then.

No, that goes further back.

It seems to be -Wvla itself that doesn't exist in 4.1, so the test
build failed simply because I used that flag ;)

   Linus
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v5 0/2] Remove false-positive VLAs when using max()

2018-03-16 Thread Linus Torvalds
On Fri, Mar 16, 2018 at 1:12 PM, Al Viro  wrote:
>
> That's C99, straight from N1256.pdf (C99-TC3)...

I checked C90, since the error is

   ISO C90 forbids variable length array

and I didn't see anything there.

Admittedly I only found a draft copy.

Linus
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v5 0/2] Remove false-positive VLAs when using max()

2018-03-16 Thread Linus Torvalds
On Fri, Mar 16, 2018 at 1:03 PM, Miguel Ojeda
 wrote:
>>
>> Kees - is there some online "gcc-4.4 checker" somewhere? This does
>> seem to work with my gcc. I actually tested some of those files you
>> pointed at now.
>
> I use this one:
>
>   https://godbolt.org/

Well, my *test* code works on that one and -Wvla -Werror.

It does not work with gcc-4.1.x, but works with gcc-4.4.x.

I can't seem to see the errors any way, I wonder if
__builtin_choose_expr() simply didn't exist back then.

Odd that you can't view warnings/errors with it.

But it's possible that it fails on more complex stuff in the kernel.

I've done a "allmodconfig" build with that patch, and the only issue
it found was that (real) type issue in tpm_tis_core.h.

Linus
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v5 0/2] Remove false-positive VLAs when using max()

2018-03-16 Thread Linus Torvalds
On Fri, Mar 16, 2018 at 10:55 AM, Al Viro  wrote:
>
> That's not them, that's C standard regarding ICE.

Yes. The C standard talks about "integer constant expression". I know.
It's come up in this very thread before.

The C standard at no point talks about - or forbids - "variable length
arrays". That never comes up in the whole standard, I checked.

So we are right now hindered by a _syntactic_ check, without any way
to have a _semantic_ check.

That's my problem. The warnings are misleading and imply semantics.

And apparently there is no way to actually check semantics.

> 1,100 is *not* a constant expression as far as the standard is concerned,

I very much know.

But it sure isn't "variable" either as far as the standard is
concerned, because the standard doesn't even have that concept (it
uses "variable" for argument numbers and for variables).

So being pedantic doesn't really change anything.

> Would you argue that in
> void foo(char c)
> {
> int a[(c<<1) + 10 - c + 2 - c];

Yeah, I don't think that even counts as a constant value, even if it
can be optimized to one. I would not at all be unhppy to see
__builtin_constant_p() to return zero.

But that is very much different from the syntax issue.

So I would like to get some way to get both type-checking and constant
checking without the annoying syntax issue.

> expr, constant_expression is not a constant_expression.  And in
> this particular case the standard is not insane - the only reason
> for using that is typechecking and _that_ can be achieved without
> violating 6.6p6:
> sizeof(expr,0) * 0 + ICE
> *is* an integer constant expression, and it gives you exact same
> typechecking.  So if somebody wants to play odd games, they can
> do that just fine, without complicating the logics for compilers...

Now that actually looks like a good trick. Maybe we can use that
instead of the comma expression that causes problems.

And using sizeof() to make sure that __builtin_choose_expression()
really gets an integer constant expression and that there should be no
ambiguity looks good.

Hmm.

This works for me, and I'm being *very* careful (those casts to
pointer types are inside that sizeof, because it's not an integral
type, and non-integral casts are not valid in an ICE either) but
somebody needs to check gcc-4.4:

  #define __typecheck(a,b) \
(!!(sizeof((typeof(a)*)1==(typeof(b)*)1)))

  #define __no_side_effects(a,b) \
(__builtin_constant_p(a)&&__builtin_constant_p(b))

  #define __safe_cmp(a,b) \
(__typecheck(a,b) && __no_side_effects(a,b))

  #define __cmp(a,b,op) ((a)op(b)?(a):(b))

  #define __cmp_once(a,b,op) ({ \
typeof(a) __a = (a);\
typeof(b) __b = (b);\
__cmp(__a,__b,op); })

  #define __careful_cmp(a,b,op) \
__builtin_choose_expr(__safe_cmp(a,b), __cmp(a,b,op),
__cmp_once(a,b,op))

  #define min(a,b)  __careful_cmp(a,b,<)
  #define max(a,b)  __careful_cmp(a,b,>)
  #define min_t(t,a,b)  __careful_cmp((t)(a),(t)(b),<)
  #define max_t(t,a,b)  __careful_cmp((t)(a),(t)(b),>)

and yes, it does cause new warnings for that

comparison between ‘enum tis_defaults’ and ‘enum tpm2_const’

in drivers/char/tpm/tpm_tis_core.h due to

   #define TIS_TIMEOUT_A_MAX   max(TIS_SHORT_TIMEOUT, TPM2_TIMEOUT_A)

but technically that warning is actually correct, I'm just confused
why gcc cares about the cast placement or something.

That warning is easy to fix by turning it into a "max_t(int, enum1,
enum2)' and that is technically the right thing to do, it's just not
warned about for some odd reason with the current code.

Kees - is there some online "gcc-4.4 checker" somewhere? This does
seem to work with my gcc. I actually tested some of those files you
pointed at now.

  Linus
 include/linux/kernel.h | 77 +-
 1 file changed, 20 insertions(+), 57 deletions(-)

diff --git a/include/linux/kernel.h b/include/linux/kernel.h
index 3fd291503576..23c31bf1d7fb 100644
--- a/include/linux/kernel.h
+++ b/include/linux/kernel.h
@@ -787,37 +787,29 @@ static inline void ftrace_dump(enum ftrace_dump_mode oops_dump_mode) { }
  * strict type-checking.. See the
  * "unnecessary" pointer comparison.
  */
-#define __min(t1, t2, min1, min2, x, y) ({		\
-	t1 min1 = (x);	\
-	t2 min2 = (y);	\
-	(void) ( == );			\
-	min1 < min2 ? min1 : min2; })
+#define __typecheck(a,b) \
+	(!!(sizeof((typeof(a)*)1==(typeof(b)*)1)))
 
-/**
- * min - return minimum of two values of the same or compatible types
- * @x: first value
- * @y: second value
- */
-#define min(x, y)	\
-	__min(typeof(x), typeof(y),			\
-	  __UNIQUE_ID(min1_), __UNIQUE_ID(min2_),	\
-	  x, y)
+#define __no_side_effects(a,b) \
+	(__builtin_constant_p(a)&&__builtin_constant_p(b))
 
-#define __max(t1, t2, max1, max2, x, y) ({		\
-	t1 max1 = (x);	\
-	t2 max2 = (y);	\
-	(void) ( == );			\
-	max1 

Re: [PATCH v5 0/2] Remove false-positive VLAs when using max()

2018-03-16 Thread Linus Torvalds
On Fri, Mar 16, 2018 at 4:47 AM, Florian Weimer  wrote:
>
> If you want to catch stack frames which have unbounded size,
> -Werror=stack-usage=1000 or -Werror=vla-larger-than=1000 (with the constant
> adjusted as needed) might be the better approach.

No, we want to catch *variable* stack sizes.

Does "-Werror=vla-larger-than=0" perhaps work for that? No, because
the stupid compiler says that is "meaningless".

And no, using "-Werror=vla-larger-than=1" doesn't work either, because
the moronic compiler continues to think that "vla" is about the
_type_, not the code:

   t.c: In function ‘test’:
   t.c:6:6: error: argument to variable-length array is too large
[-Werror=vla-larger-than=]
 int array[(1,100)];

Gcc people are crazy.

Is there really no way to just say "shut up about the stupid _syntax_
issue that is entirely irrelevant, and give us the _code_ issue".

 Linus
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v4 1/2] kernel.h: Introduce const_max() for VLA removal

2018-03-15 Thread Linus Torvalds
On Thu, Mar 15, 2018 at 4:46 PM, Linus Torvalds
<torva...@linux-foundation.org> wrote:
>
> Well, the explicit typing allows that mixing, in that you can just
> have "const_max_t(5,sizeof(x))"

I obviously meant "const_max_t(size_t,5,sizeof(x))". Heh.

Linus
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v4 1/2] kernel.h: Introduce const_max() for VLA removal

2018-03-15 Thread Linus Torvalds
On Thu, Mar 15, 2018 at 4:41 PM, Kees Cook  wrote:
>
> I much prefer explicit typing, but both you and Rasmus mentioned
> wanting the int/sizeof_t mixing.

Well, the explicit typing allows that mixing, in that you can just
have "const_max_t(5,sizeof(x))"

So I'm ok with that.

What I'm *not* so much ok with is "const_max(5,sizeof(x))" erroring
out, or silently causing insane behavior due to hidden subtle type
casts..

Linus
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v4 1/2] kernel.h: Introduce const_max() for VLA removal

2018-03-15 Thread Linus Torvalds
On Thu, Mar 15, 2018 at 3:46 PM, Kees Cook  wrote:
>
> So, AIUI, I can either get strict type checking, in which case, this
> is rejected (which I assume there is still a desire to have):
>
> int foo[const_max(6, sizeof(whatever))];

Ehh, yes, that looks fairly sane, and erroring out would be annoying.

But maybe we should just make the type explicit, and make it "const_max_t()"?

I think all the existing users are of type "max_t()" anyway due to the
very same issue, no?

At least if there's an explicit type like 'size_t', then passing in
"-1" becoming a large unsigned integer is understandable and clear,
not just some odd silent behavior.

Put another way: I think it's unacceptable that

 const_max(-1,6)

magically becomes a huge positive number like in that patch of yours, but

 const_max_t(size_t, -1, 6)

*obviously* is a huge positive number.

The two things would *do* the same thing, but in the second case the
type is explicit and visible.

> due to __builtin_types_compatible_p() rejecting it, or I can construct
> a "positive arguments only" test, in which the above is accepted, but
> this is rejected:

That sounds acceptable too, although the "const_max_t()" thing is
presumably going to be simpler?

 Linus
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v4 1/2] kernel.h: Introduce const_max() for VLA removal

2018-03-15 Thread Linus Torvalds
On Thu, Mar 15, 2018 at 3:16 PM, Kees Cook  wrote:
>
> size_t __error_not_const_arg(void) \
> __compiletime_error("const_max() used with non-compile-time constant arg");
> #define const_max(x, y) \
> __builtin_choose_expr(__builtin_constant_p(x) &&\
>   __builtin_constant_p(y),  \
>   (typeof(x))(x) > (typeof(y))(y) ? \
> (x) : (y),  \
>   __error_not_const_arg())
>
> Is typeof() forcing enums to int? Regardless, I'll put this through
> larger testing. How does that look?

Ok, that alleviates my worry about one class of insane behavior, but
it does raise a few other questions:

 - what drugs is gcc on where (typeof(x)(x)) makes a difference? Funky.

 - this does have the usual "what happen if you do

 const_max(-1,sizeof(x))

where the comparison will now be done in 'size_t', and -1 ends up
being a very very big unsigned integer.

Is there no way to get that type checking inserted? Maybe now is a
good point for that __builtin_types_compatible(), and add it to the
constness checking (and change the name of that error case function)?

  Linus
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v4 1/2] kernel.h: Introduce const_max() for VLA removal

2018-03-15 Thread Linus Torvalds
On Thu, Mar 15, 2018 at 12:47 PM, Kees Cook  wrote:
>
> To gain the ability to compare differing types, the arguments are
> explicitly cast to size_t.

Ugh, I really hate this.

It silently does insane things if you do

   const_max(-1,6)

and there is nothing in the name that implies that you can't use
negative constants.

   Linus
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v3] kernel.h: Skip single-eval logic on literals in min()/max()

2018-03-12 Thread Linus Torvalds
On Mon, Mar 12, 2018 at 3:55 PM, Andrew Morton
 wrote:
>
> Replacing the __builtin_choose_expr() with ?: works of course.

Hmm. That sounds like the right thing to do. We were so myopically
staring at the __builtin_choose_expr() problem that we overlooked the
obvious solution.

Using __builtin_constant_p() together with a ?: is in fact our common
pattern, so that should be fine. The only real reason to use
__builtin_choose_expr() is if you want to get the *type* to vary
depending on which side you choose, but that's not an issue for
min/max.

> What will be the runtime effects?

There should be none. Gcc will turn the conditional for the ?: into a
constant, and DTRT.

  Linus
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v3] kernel.h: Skip single-eval logic on literals in min()/max()

2018-03-11 Thread Linus Torvalds
On Sun, Mar 11, 2018 at 4:05 AM, Ingo Molnar  wrote:
>
> BTW., while I fully agree with everything you said, it's not entirely correct 
> to
> claim that if a C compiler can generate VLA code it is necessarily able to 
> parse
> and evaluate constant array sizes "just fine".
>
> Constant expressions are typically parsed very early on, at the preprocessing
> stage. They can be used with some preprocessor directives as well, such as 
> '#if'
> (with some further limitations on their syntax).

Yes. But constant simplification and CSE etc is just a very
fundamental part of a compiler, and anybody who actually implements
VLA's would have to do it anyway.

So no, a message like

  warning: Array declaration is not a C90 constant expression,
resulting in VLA code generation

would be moronic. Only some completely mindless broken shit would do
"oh, it's not a parse-time constant, so it will be variable". The two
just do not follow AT ALL.

So the message might be about _possibly_ resulting in VLA code
generation, but honestly, at that point you should just add the
warning when you actually generate the code to do the stack
allocation.

Because at that point, you know whether it's variable or not.

And trust me, it won't be variable for things like (2,3), or even for
our "max()" thing with other odd builtins. Not unless the compiler
doesn't really support VLA at all (maybe some bolted-on crazy thing
that just turns a VLA at the front-end time into just an alloca), or
the user has explicitly asked to disable some fundamental optimization
phase.

   Linus
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v3] kernel.h: Skip single-eval logic on literals in min()/max()

2018-03-10 Thread Linus Torvalds
On Sat, Mar 10, 2018 at 9:34 AM, Miguel Ojeda
 wrote:
>
> So the warning is probably implemented to just trigger whenever VLAs
> are used but the given standard does not allow them, for all
> languages. The problem is why the ISO C90 frontend is not giving an
> error for using invalid syntax for array sizes to begin with?

So in *historical* context - when a compiler didn't do variable length
arrays at all - the original semantics of C "constant expressions"
actually make a ton of sense.

You can basically think of a constant expression as something that can
be (historically) handled by the front-end without any real
complexity, and no optimization phases - just evaluating a simple
parse tree with purely compile-time constants.

So there's a good and perfectly valid reason for why C limits certain
expressions to just a very particular subset. It's not just array
sizes, it's  case statements etc too. And those are very much part of
the C standard.

So an error message like

   warning: ISO C90 requires array sizes to be constant-expressions

would be technically correct and useful from a portability angle. It
tells you when you're doing something non-portable, and should be
automatically enabled with "-ansi -pedantic", for example.

So what's misleading is actually the name of the warning and the
message, not that it happens. The warning isn't about "variable
length", it's literally about the rules for what a
"constant-expression" is.

And in C, the expression (2,3) has a constant _value_ (namely 3), but
it isn't a constant-expression as specified by the language.

Now, the thing is that once you actually do variable length arrays,
those old front-end rules make no sense any more (outside of the "give
portability warnings" thing).

Because once you do variable length arrays, you obviously _parse_
everything just fine, and you're doing to evaluate much more complex
expressions than some limited constant-expression rule.

And at that point it would make a whole lot more sense to add a *new*
warning that basically says "I have to generate code for a
variable-sized array", if you actually talk about VLA's.

But that's clearly not what gcc actually did.

So the problem really is that -Wvla doesn't actually warn about VLA's,
but about something technically completely different.

And that's why those stupid syntactic issues with min/max matter. It's
not whether the end result is a compile-time constant or not, it's
about completely different issues, like whether there is a
comma-expression in it.

  Linus
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v3] kernel.h: Skip single-eval logic on literals in min()/max()

2018-03-10 Thread Linus Torvalds
On Sat, Mar 10, 2018 at 7:33 AM, Kees Cook  wrote:
>
> Alright, I'm giving up on fixing max(). I'll go back to STACK_MAX() or
> some other name for the simple macro. Bleh.

Oh, and I'm starting to see the real problem.

It's not that our current "min/max()" are broiken. It's that "-Wvla" is garbage.

Lookie here:

int array[(1,2)];

results in gcc saying

 warning: ISO C90 forbids variable length array ‘array’ [-Wvla]
   int array[(1,2)];
   ^~~

and that error message - and the name of the flag - is obviously pure garbage.

What is *actually* going on is that ISO C90 requires an array size to
be not a constant value, but a constant *expression*. Those are two
different things.

A constant expression has little to do with "compile-time constant".
It's a more restricted form of it, and has actual syntax requirements.
A comma expression is not a constant expression, for example, which
was why I tested this.

So "-Wvla" is garbage, with a misleading name, and a misleading
warning string. It has nothing to do with "variable length" and
whether the compiler can figure it out at build time, and everything
to do with a _syntax_ rule.

  Linus
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v3] kernel.h: Skip single-eval logic on literals in min()/max()

2018-03-10 Thread Linus Torvalds
On Sat, Mar 10, 2018 at 7:33 AM, Kees Cook  wrote:
>
> And sparse freaks out too:
>
>drivers/net/ethernet/via/via-velocity.c:97:26: sparse: incorrect
> type in initializer (different address spaces) @@expected void
> *addr @@got struct mac_regs [noderef] http://vger.kernel.org/majordomo-info.html


Re: [PATCH v3] kernel.h: Skip single-eval logic on literals in min()/max()

2018-03-10 Thread Linus Torvalds
On Fri, Mar 9, 2018 at 11:03 PM, Miguel Ojeda
 wrote:
>
> Just compiled 4.9.0 and it seems to work -- so that would be the
> minimum required.
>
> Sigh...
>
> Some enterprise distros are either already shipping gcc >= 5 or will
> probably be shipping it soon (e.g. RHEL 8), so how much does it hurt
> to ask for a newer gcc? Are there many users/companies out there using
> enterprise distributions' gcc to compile and run the very latest
> kernels?

I wouldn't mind upping the compiler requirements, and we have other
reasons to go to 4.6.

But _this_ particular issue doesn't seem worth it to then go even
further. Annoying.

   Linus
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v3] kernel.h: Skip single-eval logic on literals in min()/max()

2018-03-09 Thread Linus Torvalds
On Fri, Mar 9, 2018 at 5:31 PM, Kees Cook  wrote:
>
> WTF, gmail just blasted HTML into my explicitly plain-text email?! 
> Apologies...

There's more now in your email, I think maybe it's triggered by your
signature file and some gmail web interface bug. Or it just tries to
force quote using html.

There's some html email disease inside google, where some parts of the
company seems to think that html email is _good_.

The official gmail Android app is a big pain, int hat it doesn't
*have* a plain-text mode, even though it knows how to format the
plain-text part of the email.

You might want to be on the lookout for some bad drugs at the office.
Because the world would thank you if you took them away from the gmail
app people.

 Linus
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v3] kernel.h: Skip single-eval logic on literals in min()/max()

2018-03-09 Thread Linus Torvalds
On Fri, Mar 9, 2018 at 4:32 PM, Andrew Morton  wrote:
>
> I wonder which gcc versions actually accept Kees's addition.

Note that we already do have this pattern, as seen by:

  git grep -2  __builtin_choose_expr | grep -2 __builtin_constant_p

which show drivers/pinctrl/intel/pinctrl-intel.h, so the pattern
already exists current kernels and _works_ - it apparently just
doesn't work in slightly more complicated cases.

It's one reason why I wondered if simplifying the expression to have
just that single __builtin_constant_p() might not end up working..

  Linus
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v3] kernel.h: Skip single-eval logic on literals in min()/max()

2018-03-09 Thread Linus Torvalds
On Fri, Mar 9, 2018 at 4:07 PM, Andrew Morton  wrote:
>
> A brief poke failed to reveal a workaround - gcc-4.4.4 doesn't appear
> to know that __builtin_constant_p(x) is a constant.  Or something.

LOL.

I suspect it might be that it wants to evaluate
__builtin_choose_expr() at an earlier stage than it evaluates
__builtin_constant_p(), so it's not that it doesn't know that
__builtin_constant_p() is a constant, it just might not know it *yet*.

Maybe.

Side note, if it's not that, but just the "complex" expression that
has the logical 'and' etc, maybe the code could just use

  __builtin_constant_p((x)+(y))

or something.

But yeah:

> Sigh.  Wasn't there some talk about modernizing our toolchain
> requirements?

Maybe it's just time to give up on 4.4.  We wanted 4.5 for "asm goto",
and once we upgrade to 4.5 I think Arnd said that no distro actually
ships it, so we might as well go to 4.6.

So maybe this is just the excuse to finally make that official, if
there is no clever workaround any more.

Linus
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v3] kernel.h: Skip single-eval logic on literals in min()/max()

2018-03-09 Thread Linus Torvalds
On Fri, Mar 9, 2018 at 12:05 PM, Kees Cook  wrote:
> When max() is used in stack array size calculations from literal values
> (e.g. "char foo[max(sizeof(struct1), sizeof(struct2))]", the compiler
> thinks this is a dynamic calculation due to the single-eval logic, which
> is not needed in the literal case. This change removes several accidental
> stack VLAs from an x86 allmodconfig build:

Ok, looks good.

I just have a couple of questions about applying it.

In particular, if this will help people working on getting rid of
VLA's in the short term, I can apply it directly. But if people who
are looking at it (anybody else than Kees?) don't much care, then this
might be a 4.17 thing or at least "random -mm queue"?

The other unrelated reaction I had to this was that "we're passing
those types down very deep, even though nobody _cares_ about them all
that much at that deep level".

Honestly, the only case that really cares is the very top level, and
the rest could just take the properly cast versions.

For example, "max_t/min_t" really don't care at all, since they - by
definition - just take the single specified type.

So I'm wondering if we should just drop the types from __max/__min
(and everything they call) entirely, and instead do

#define __check_type(x,y) ((void)((typeof(x)*)1==(typeof(y)*)1))
#define min(x,y)   (__check_type(x,y),__min(x,y))
#define max(x,y)   (__check_type(x,y),__max(x,y))

#define min_t(t,x,y) __min((t)(x),(t)(y))
#define max_t(t,x,y) __max((t)(x),(t)(y))

and then __min/__max and friends are much simpler (and can just assume
that the type is already fine, and the casting has been done).

This is technically entirely independent of this VLA cleanup thing,
but the "passing the types around unnecessarily" just becomes more
obvious when there's now another level of macros, _and_ it's a more
complex expression too.

Yes, yes, the __single_eval_xyz() functions still end up wanting the
types for the declaration of the new single-evaluation variables, but
the 'typeof' pattern is the standard pattern, so

#define __single_eval_max(max1, max2, x, y) ({  \
typeof (x) max1 = (x);  \
typeof (y) max2 = (y);  \
max1 > max2 ? max1 : max2; })

actually looks more natural to me than passing the two types in as
arguments to the macro.

(That form also is amenable to things like "__auto_type" etc simplifications).

Side note: do we *really* need the unique variable names? That's what
makes those things _really_ illegible. I thgink it's done just for a
sparse warning that we should probably ignore. It's off by default
anyway exactly because it doesn't work that well due to nested macro
expansions like this.

There is very real value to keeping our odd macros legible, I feel.
Even when they are complicated by issues like this, it would be good
to keep them as simple as possible.

Comments?

Linus
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] kernel.h: Skip single-eval logic on literals in min()/max()

2018-03-08 Thread Linus Torvalds
On Thu, Mar 8, 2018 at 4:45 PM, Kees Cook  wrote:
>
> Rasmus mentioned this too. What I said there was that I was shy to
> make that change, since we already can't mix that kind of thing with
> the existing min()/max() implementation. The existing min()/max() is
> already extremely strict, so there are no instances of this in the
> tree.

Yes, but I also didn't want to add any new cases in case people add
new min/max() users.

But:

> If I explicitly add one, I see this with or without the patch:
>
> In file included from drivers/misc/lkdtm.h:7:0,
>  from drivers/misc/lkdtm_core.c:33:
> drivers/misc/lkdtm_core.c: In function ‘lkdtm_module_exit’:
> ./include/linux/kernel.h:809:16: warning: comparison of distinct
> pointer types lacks a cast

Oh, ok, in that case, just drop the __builtin_types_compatible_p()
entirely. It's not adding anything.

I was expecting the non-chosen expression in __builtin_choose_expr()
to not cause type warnings. I'm actually surprised it does. Type games
is why __builtin_choose_expr() tends to exist in the first place.

> So are you saying you _want_ the type enforcement weakened here, or
> that I should just not use __builtin_types_compatible_p()?

I don't want to weaken the type enforcement, and I _thought_ you had
done that __builtin_types_compatible_p() to keep it in place.

But if that's not why you did it, then why was it there at all? If the
type warning shows through even if it's in the other expression, then
just a


#define __min(t1, t2, x, y) \
__builtin_choose_expr(  \
__builtin_constant_p(x) &   \
__builtin_constant_p(y),\
(t1)(x) < (t2)(y) ? (t1)(x) : (t2)(y),  \
__single_eval_min(t1, t2,   \
   ...

would seem to be sufficient?

Because logically, the only thing that matters is that x and y don't
have any side effects and can be evaluated twice, and
"__builtin_constant_p()" is already a much stronger version of that.

Hmm? The __builtin_types_compatible_p() just doesn't seem to matter
for the only thing I thought it was there for.

  Linus
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] kernel.h: Skip single-eval logic on literals in min()/max()

2018-03-08 Thread Linus Torvalds
On Thu, Mar 8, 2018 at 1:40 PM, Kees Cook  wrote:
> +#define __min(t1, t2, x, y)\
> +   __builtin_choose_expr(__builtin_constant_p(x) &&\
> + __builtin_constant_p(y) &&\
> + __builtin_types_compatible_p(t1, t2), \
> + (t1)(x) < (t2)(y) ? (t1)(x) : (t2)(y),\

I understand why you use __builtin_types_compatible_p(), but please don't.

It will mean that trivial constants like "5" and "sizeof(x)" won't
simplify, because they have different types.

The ?: will give the right combined type anyway, and if you want the
type comparison warning, just add a comma-expression with something
like like

   (t1 *)1 == (t2 *)1

to get the type compatibility warning.

Yeah, yeah, maybe none of the VLA cases triggered that, but it seems
silly to not just get that obvious constant case right.

Hmm?

  Linus
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v2] iversion: make inode_cmp_iversion{+raw} return bool instead of s64

2018-01-31 Thread Linus Torvalds
On Wed, Jan 31, 2018 at 4:29 AM, Jeff Layton  wrote:
>
> Do you mind just taking it directly? I don't have anything else queued
> up for this cycle.

Done.

I wonder if "false for same, true for different" calling convention
makes much sense, but it matches the old "0 for same" so obviously
makes for a smaller diff.

If it ever ends up confusing people, maybe the sense of that function
should be reversed, and the name changed to something like
"same_inode_version()" or something.

But at least for now the situation seems ok to me,

Linus
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v2] iversion: make inode_cmp_iversion{+raw} return bool instead of s64

2018-01-30 Thread Linus Torvalds
Ack. Should I expect this in a future pull request, or take it directly?

There's no hurry about this, since none of the existing users of that
function actually do anything but test the return value against zero,
and nobody saves it into anything but a "bool" (which has magical
casting properties and does not lose upper bits).

  Linus
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [GIT PULL] inode->i_version rework for v4.16

2018-01-30 Thread Linus Torvalds
On Tue, Jan 30, 2018 at 4:05 AM, Jeff Layton  wrote:
>
> My intent here was to have this handle wraparound using the same sort of
> method that the time_before/time_after macros do. Obviously, I didn't
> document that well.

Oh, the intent was clear. The implementation was wrong.

Note that "time_before()" returns a _boolean_.

So yes, the time comparison functions do a 64-bit subtraction, exactly
like yours do. BUT THEY DO NOT RETURN THAT DIFFERENCE. They test the
sign in 64 bits and return that boolean single-bit value.

> I want to make sure I understand what's actually broken here thoug. Is
> it only broken when the two values are more than 2**63 apart, or is
> there something else more fundamentally wrong here?

There's something fundamentally wrong. The _intent_ is to allow
numbers up to 2**63 apart, but if somebody does that

 int cmp = inode_cmp_iversion(inode, old);

 if (cmp < 0 ...

then it actually only ever tests numbers 2**31 apart, because the high
32 bits will have been thrown away when the 64-bit difference is cast
to "int".

And what used to be a sign bit (in 64 bits) no longer exists, and the
above tests the *new* sign bit that is bit #31, not #63.

So you are a factor of four billion off.

And while 2**63 is a big number and perfectly ok for a filesystem
event difference, a difference of 2**31 is a "this can actually
happen".

Yes, even 2**31 is still a big difference, and it will take a very
unusual situation, and quite a long time to trigger: something that
does a thousand filesystem ops per second will not see the problem for
24 days. So you'll never see it in a test. But 24 days happens in real
life..

That's why you need to do the comparison against zero inside the
actual helper functions like the time comparisons do. Because if you
return the 64-bit difference, it will be trivially lost, and the code
will _look_ right, but not work right.

The fact that you didn't even seem to see the problem in my example
calling sequence just proves that point.

   Linus
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [GIT PULL] inode->i_version rework for v4.16

2018-01-29 Thread Linus Torvalds
On Mon, Jan 29, 2018 at 1:50 PM, Linus Torvalds
<torva...@linux-foundation.org> wrote:
>
> Hmm. I have pulled this, but it is really really broken in one place,
> to the degree that I always went "no, I won't pull this garbage".

always=almost.

I'd blame auto-correct, but I'm not on the phone.

Linus
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [GIT PULL] inode->i_version rework for v4.16

2018-01-29 Thread Linus Torvalds
On Mon, Jan 29, 2018 at 4:26 AM, Jeff Layton  wrote:
>
> This pile of patches is a rework of the inode->i_version field. We have
> traditionally incremented that field on every inode data or metadata
> change. Typically this increment needs to be logged on disk even when
> nothing else has changed, which is rather expensive.

Hmm. I have pulled this, but it is really really broken in one place,
to the degree that I always went "no, I won't pull this garbage".

But the breakage is potential, not actual, and can be fixed trivially,
so I'll let it slide - but I do require it to be fixed. And I require
people to *think* about it.

So what's to horribly horribly wrong?

The inode_cmp_iversion{+raw}() functions are pure and utter crap.

Why?

You say that they return 0/negative/positive, but they do so in a
completely broken manner. They return that ternary value as the
sequence number difference in a 's64', which means that if you
actually care about that ternary value, and do the *sane* thing that
the kernel-doc of the function implies is the right thing, you would
do

int cmp = inode_cmp_iversion(inode, old);

if (cmp < 0 ...

and as a result you get code that looks sane, but that doesn't
actually *WORK* right.

To make it even worse, it will actually work in practice by accident
in 99.9% of all cases, so now you have

 (a) subtly buggy code
 (b) that looks fine
 (c) and that works in testing

which is just about the worst possible case for any code. The
interface is simply garbage that encourages bugs.

And the bug wouldn't be in the user, the bug would be in this code you
just sent me. The interface is simply wrong.

So this absolutely needs to be fixed. I see two fixes:

 - just return a boolean. That's all that any current user actually
wants, so the ternary value seems pointless.

 - make it return an 'int', and not just any int, but -1/0/1. That way
there is no worry about uses, and if somebody *really* cares about the
ternary value, they can now use a "switch" statement to get it
(alternatively, make it return an enum, but whatever).

That "ternary" function that has 18446744069414584320 incorrect return
values really is unacceptable.

Linus
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [btrfs_mount] general protection fault: 0000 [#1] SMP

2017-11-29 Thread Linus Torvalds
On Wed, Nov 29, 2017 at 6:38 PM, Nick Terrell  wrote:
>
> The stack trace looks like the bug fixed by
>
> Qu Wenruo:
> btrfs: Fix wild memory access in compression level parser [1]
>
> That fix looks to be included in the pull request for 4.15-rc2 [2].

Which got merged earlier today. So maybe instead of a bisection, just
test the current git tree?

   Linus
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [GIT PULL] Btrfs fixes for 4.15-rc2

2017-11-29 Thread Linus Torvalds
On Wed, Nov 29, 2017 at 11:28 AM, David Sterba  wrote:
>
> With signed tag: for-4.15-rc2-tag
>
>   git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux.git for-4.15-rc2

Oh, please actually ask me to pull the signed tag (exact same
pull-request, just point git request-pull at the tag), because now
what happened was that first I just pulled that branch you mentioned,
and only noticed that "With signed tag:" notice after I had already
pulled and was filling in the merge message.

Anyway, I redid the pull with the proper signed tag, but it was just
annoying extra work.

And I wonder how many times I _hadn't_ noticed that, because I didn't
have your key in my keyring either. Or maybe I caught it the first
time.

Linus
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v4 4/5] squashfs: Add zstd support

2017-08-04 Thread Linus Torvalds
On Fri, Aug 4, 2017 at 1:19 PM, Nick Terrell  wrote:
>
> This patch was written by Sean Purcell , but I will be
> taking over the submission process.

Please, if so, get Sean's sign-off, and also make sure that the patch
gets submitted with

   From: Sean Purcell 

at the top of the body of the email so that authorship gets properly
attributed by all the usual tools.

 Linus
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [GIT PULL] Btrfs for 4.13

2017-07-05 Thread Linus Torvalds
On Tue, Jul 4, 2017 at 8:19 AM, David Sterba  wrote:
>
> There are conflicts with the recently merged block layer branch, the
> resolutions are a bit tedious but still straightforward. Stephen sent a mail
> about that [1]. You can find my merge at for-4.13-part1-merged, there might be
> some whitespace formatting differences but the result is the same.

Ouch, yeah, that was annoying. And yes, I ended up with several
whitespace differences but other than that it looks the same. Oh well.

Jens, Christoph, let's not do that stupid thing with status-vs-error
ever again, ok?

   Linus
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: bio linked list corruption.

2016-12-06 Thread Linus Torvalds
On Tue, Dec 6, 2016 at 12:16 AM, Peter Zijlstra  wrote:
>>
>> Of course, I'm really hoping that this shmem.c use is the _only_ such
>> case.  But I doubt it.
>
> $ git grep DECLARE_WAIT_QUEUE_HEAD_ONSTACK | wc -l
> 28

Hmm. Most of them seem to be ok, because they use "wait_event()",
which will always remove itself from the wait-queue. And they do it
from the place that allocates the wait-queue.

IOW, the mm/shmem.c case really was fairly special, because it just
did "prepare_to_wait()", and then did a finish_wait() - and not in the
thread that allocated it on the stack.

So it's really that "some _other_ thread allocated the waitqueue on
the stack, and now we're doing a wait on it" that is bad.

So the normal pattern seems to be:

 - allocate wq on the stack
 - pass it on to a waker
 - wait for it

and that's ok, because as part of "wait for it" we will also be
cleaning things up.

The reason mm/shmem.c was buggy was that it did

 - allocate wq on the stack
 - pass it on to somebody else to wait for
 - wake it up

and *that* is buggy, because it's the waiter, not the waker, that
normally cleans things up.

Is there some good way to find this kind of pattern automatically, I wonder

 Linus
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: bio linked list corruption.

2016-12-05 Thread Linus Torvalds
Adding the scheduler people to the participants list, and re-attaching
the patch, because while this patch is internal to the VM code, the
issue itself is not.

There might well be other cases where somebody goes "wake_up_all()"
will wake everybody up, so I can put the wait queue head on the stack,
and then after I've woken people up I can return".

Ingo/PeterZ: the reason that does *not* work is that "wake_up_all()"
does make sure that everybody is woken up, but the usual autoremove
wake function only removes the wakeup entry if the process was woken
up by that particular wakeup. If something else had woken it up, the
entry remains on the list, and the waker in this case returned with
the wait head still containing entries.

Which is deadly when the wait queue head is on the stack.

So I'm wondering if we should make that "synchronous_wake_function()"
available generally, and maybe introduce a DEFINE_WAIT_SYNC() helper
that uses it.

Of course, I'm really hoping that this shmem.c use is the _only_ such
case.  But I doubt it.

Comments?

Note for Ingo and Peter: this patch has not been tested at all. But
Vegard did test an earlier patch of mine that just verified that yes,
the issue really was that wait queue entries remained on the wait
queue head just as we were about to return and free it.

   Linus


On Mon, Dec 5, 2016 at 12:10 PM, Linus Torvalds
<torva...@linux-foundation.org> wrote:
>
> Anyway, can you try this patch instead? It should actually cause the
> wake_up_all() to always remove all entries, and thus the WARN_ON()
> should no longer happen (and I removed the "list_del()" hackery).
>
>Linus
diff --git a/mm/shmem.c b/mm/shmem.c
index 166ebf5d2bce..17beb44e9f4f 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -1848,6 +1848,19 @@ alloc_nohuge:page = 
shmem_alloc_and_acct_page(gfp, info, sbinfo,
return error;
 }
 
+/*
+ * This is like autoremove_wake_function, but it removes the wait queue
+ * entry unconditionally - even if something else had already woken the
+ * target.
+ */
+static int synchronous_wake_function(wait_queue_t *wait, unsigned mode, int 
sync, void *key)
+{
+   int ret = default_wake_function(wait, mode, sync, key);
+   list_del_init(>task_list);
+   return ret;
+}
+
+
 static int shmem_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
 {
struct inode *inode = file_inode(vma->vm_file);
@@ -1883,7 +1896,7 @@ static int shmem_fault(struct vm_area_struct *vma, struct 
vm_fault *vmf)
vmf->pgoff >= shmem_falloc->start &&
vmf->pgoff < shmem_falloc->next) {
wait_queue_head_t *shmem_falloc_waitq;
-   DEFINE_WAIT(shmem_fault_wait);
+   DEFINE_WAIT_FUNC(shmem_fault_wait, 
synchronous_wake_function);
 
ret = VM_FAULT_NOPAGE;
if ((vmf->flags & FAULT_FLAG_ALLOW_RETRY) &&
@@ -2665,6 +2678,7 @@ static long shmem_fallocate(struct file *file, int mode, 
loff_t offset,
spin_lock(>i_lock);
inode->i_private = NULL;
wake_up_all(_falloc_waitq);
+   WARN_ON_ONCE(!list_empty(_falloc_waitq.task_list));
spin_unlock(>i_lock);
error = 0;
goto out;


Re: bio linked list corruption.

2016-12-05 Thread Linus Torvalds
On Mon, Dec 5, 2016 at 11:11 AM, Vegard Nossum  wrote:
>
> [ cut here ]
> WARNING: CPU: 22 PID: 14012 at mm/shmem.c:2668 shmem_fallocate+0x9a7/0xac0

Ok, good. So that's confirmed as the cause of this problem.

And the call chain that I wanted is obviously completely
uninteresting, because it's call cghain on the other side (the page
fault side) that would show the nested wake queue behavior. I was just
being stupid about it.

I wonder if we have any other places where we just blithely assume
that "wake_up_all()" will actually empty the whole wait queue. It's
_usually_ true, but as noted, nested waiting does happen.

Anyway, can you try this patch instead? It should actually cause the
wake_up_all() to always remove all entries, and thus the WARN_ON()
should no longer happen (and I removed the "list_del()" hackery).

   Linus
diff --git a/mm/shmem.c b/mm/shmem.c
index 166ebf5d2bce..17beb44e9f4f 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -1848,6 +1848,19 @@ alloc_nohuge:page = 
shmem_alloc_and_acct_page(gfp, info, sbinfo,
return error;
 }
 
+/*
+ * This is like autoremove_wake_function, but it removes the wait queue
+ * entry unconditionally - even if something else had already woken the
+ * target.
+ */
+static int synchronous_wake_function(wait_queue_t *wait, unsigned mode, int 
sync, void *key)
+{
+   int ret = default_wake_function(wait, mode, sync, key);
+   list_del_init(>task_list);
+   return ret;
+}
+
+
 static int shmem_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
 {
struct inode *inode = file_inode(vma->vm_file);
@@ -1883,7 +1896,7 @@ static int shmem_fault(struct vm_area_struct *vma, struct 
vm_fault *vmf)
vmf->pgoff >= shmem_falloc->start &&
vmf->pgoff < shmem_falloc->next) {
wait_queue_head_t *shmem_falloc_waitq;
-   DEFINE_WAIT(shmem_fault_wait);
+   DEFINE_WAIT_FUNC(shmem_fault_wait, 
synchronous_wake_function);
 
ret = VM_FAULT_NOPAGE;
if ((vmf->flags & FAULT_FLAG_ALLOW_RETRY) &&
@@ -2665,6 +2678,7 @@ static long shmem_fallocate(struct file *file, int mode, 
loff_t offset,
spin_lock(>i_lock);
inode->i_private = NULL;
wake_up_all(_falloc_waitq);
+   WARN_ON_ONCE(!list_empty(_falloc_waitq.task_list));
spin_unlock(>i_lock);
error = 0;
goto out;


Re: bio linked list corruption.

2016-12-05 Thread Linus Torvalds
On Mon, Dec 5, 2016 at 10:11 AM, Andy Lutomirski  wrote:
>
> So your kernel has been smp-alternatived.  That 3e comes from
> alternatives_smp_unlock.  If you're running on SMP with UP
> alternatives, things will break.

I'm assuming he's just running in a VM with a single CPU.

The problem that I pointed out with assuming wake_up_all() actually
removes all wait queue entries does not depend on SMP. The race is
much more fundamental and long-lived.

   Linus
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: bio linked list corruption.

2016-12-05 Thread Linus Torvalds
On Mon, Dec 5, 2016 at 9:09 AM, Vegard Nossum  wrote:
>
> The warning shows that it made it past the list_empty_careful() check
> in finish_wait() but then bugs out on the >task_list
> dereference.
>
> Anything stick out?

I hate that shmem waitqueue garbage. It's really subtle.

I think the problem is that "wake_up_all()" in shmem_fallocate()
doesn't necessarily wake up everything. It wakes up TASK_NORMAL -
which does include TASK_UNINTERRUPTIBLE, but doesn't actually mean
"everything on the list".

I think that what happens is that the waiters somehow move from
TASK_UNINTERRUPTIBLE to TASK_RUNNING early, and this means that
wake_up_all() will ignore them, leave them on the list, and now that
list on stack is no longer empty at the end.

And the way *THAT* can happen is that the task is on some *other*
waitqueue as well, and that other waiqueue wakes it up. That's not
impossible, you can certainly have people on wait-queues that still
take faults.

Or somebody just uses a directed wake_up_process() or something.

Since you apparently can recreate this fairly easily, how about trying
this stupid patch?

NOTE! This is entirely untested. I may have screwed this up entirely.
You get the idea, though - just remove the wait queue head from the
list - the list entries stay around, but nothing points to the stack
entry (that we're going to free) any more.

And add the warning to see if this actually ever triggers (and because
I'd like to see the callchain when it does, to see if it's another
waitqueue somewhere or what..)

  Linus
diff --git a/mm/shmem.c b/mm/shmem.c
index 166ebf5d2bce..a80148b43476 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -2665,6 +2665,8 @@ static long shmem_fallocate(struct file *file, int mode, 
loff_t offset,
spin_lock(>i_lock);
inode->i_private = NULL;
wake_up_all(_falloc_waitq);
+   if (WARN_ON_ONCE(!list_empty(_falloc_waitq.task_list)))
+   list_del(_falloc_waitq.task_list);
spin_unlock(>i_lock);
error = 0;
goto out;


Re: bio linked list corruption.

2016-10-31 Thread Linus Torvalds
On Mon, Oct 31, 2016 at 11:55 AM, Dave Jones  wrote:
>
> BUG: Bad page state in process kworker/u8:12  pfn:4e0e39
> page:ea0013838e40 count:0 mapcount:0 mapping:8804a20310e0 index:0x100c
> flags: 0x400c(referenced|uptodate)
> page dumped because: non-NULL mapping

Hmm. So this seems to be btrfs-specific, right?

I searched for all your "non-NULL mapping" cases, and they all seem to
have basically the same call trace, with some work thread doing
writeback and going through btrfs_writepages().

Sounds like it's a race with either fallocate hole-punching or
truncate. I'm not seeing it, but I suspect it's btrfs, since DaveJ
clearly ran other filesystems too but I am not seeing this backtrace
for anything else.

 Linus
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: bio linked list corruption.

2016-10-27 Thread Linus Torvalds
On Wed, Oct 26, 2016 at 11:33 PM, Christoph Hellwig  wrote:
>> Dave, can you hit the warnings with this? Totally untested...
>
> Can we just kill off the unhelpful blk_map_ctx structure, e.g.:

Yeah, I found that hard to read too. The difference between
blk_map_ctx and blk_mq_alloc_data is minimal, might as well just use
the latter everywhere.

But please separate that patch from the patch that fixes the ctx list
corruption.

And Jens - can I have at least the corruption fix asap? Due to KS
travel, I'm going to do rc3 on Saturday, and I'd like to have the fix
in at least a day before that..

Linus
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: bio linked list corruption.

2016-10-26 Thread Linus Torvalds
On Wed, Oct 26, 2016 at 4:03 PM, Jens Axboe  wrote:
>
> Actually, I think I see what might trigger it. You are on nvme, iirc,
> and that has a deep queue.

Yes. I have long since moved on from slow disks, so all my systems are
not just flash, but m.2 nvme ssd's.

So at least that could explain why Dave sees it at bootup but I don't.

Linus
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: bio linked list corruption.

2016-10-26 Thread Linus Torvalds
On Wed, Oct 26, 2016 at 3:51 PM, Linus Torvalds
<torva...@linux-foundation.org> wrote:
>
> Dave: it might be a good idea to split that "WARN_ON_ONCE()" in
> blk_mq_merge_queue_io() into two

I did that myself too, since Dave sees this during boot.

But I'm not getting the warning ;(

Dave gets it with ext4, and thats' what I have too, so I'm not sure
what the required trigger would be.

  Linus
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: bio linked list corruption.

2016-10-26 Thread Linus Torvalds
On Wed, Oct 26, 2016 at 3:40 PM, Dave Jones  wrote:
>
> I gave it a shot too for shits & giggles.
> This falls out during boot.
>
> [9.278420] WARNING: CPU: 0 PID: 1 at block/blk-mq.c:1181 
> blk_sq_make_request+0x465/0x4a0

Hmm. That's the

WARN_ON_ONCE(rq->mq_ctx != ctx);

that I added to blk_mq_merge_queue_io(), and I really think that
warning is valid, and the fact that it triggers shows that something
is wrong with locking.

We just did a

spin_lock(>lock);

and that lock is *supposed* to protect the __blk_mq_insert_request(),
but that uses rq->mq_ctx.

So if rq->mq_ctx != ctx, then we're locking the wrong context.

Jens - please explain to me why I'm wrong.

Or maybe I actually might have found the problem? In which case please
send me a patch that fixes it ;)

Dave: it might be a good idea to split that "WARN_ON_ONCE()" in
blk_mq_merge_queue_io() into two, since right now it can trigger both
for the

blk_mq_bio_to_request(rq, bio);

path _and_ for the

if (!blk_mq_attempt_merge(q, ctx, bio)) {
blk_mq_bio_to_request(rq, bio);
goto insert_rq;

path. If you split it into two: one before that "insert_rq:" label,
and one before the "goto insert_rq" thing, then we could see if it is
just one of the blk_mq_merge_queue_io() cases (or both) that is
broken..

  Linus
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: bio linked list corruption.

2016-10-26 Thread Linus Torvalds
On Wed, Oct 26, 2016 at 2:52 PM, Chris Mason  wrote:
>
> This one is special because CONFIG_VMAP_STACK is not set.  Btrfs triggers in 
> < 10 minutes.
> I've done 30 minutes each with XFS and Ext4 without luck.

Ok, see the email I wrote that crossed yours - if it's really some
list corruption on ctx->rq_list due to some locking problem, I really
would expect CONFIG_VMAP_STACK to be entirely irrelevant, except
perhaps from a timing standpoint.

> WARNING: CPU: 6 PID: 4481 at lib/list_debug.c:33 __list_add+0xbe/0xd0
> list_add corruption. prev->next should be next (e8d80b08), but was 
> 88012b65fb88. (prev=880128c8d500).
> Modules linked in: crc32c_intel aesni_intel aes_x86_64 glue_helper lrw 
> gf128mul ablk_helper i2c_piix4 cryptd i2c_core virtio_net serio_raw floppy 
> button pcspkr sch_fq_codel autofs4 virtio_blk
> CPU: 6 PID: 4481 Comm: dbench Not tainted 4.9.0-rc2-15419-g811d54d #319
> Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.9.0-1.fc24 
> 04/01/2014
>  880104eff868 814fde0f 8151c46e 880104eff8c8
>  880104eff8c8  880104eff8b8 810648cf
>  880128cab2c0 00213fc57c68 8801384e8928 880128cab180
> Call Trace:
>  [] dump_stack+0x53/0x74
>  [] ? __list_add+0xbe/0xd0
>  [] __warn+0xff/0x120
>  [] warn_slowpath_fmt+0x49/0x50
>  [] __list_add+0xbe/0xd0
>  [] blk_sq_make_request+0x388/0x580
>  [] generic_make_request+0x104/0x200

Well, it's very consistent, I have to say. So I really don't think
this is random corruption.

Could you try the attached patch? It adds a couple of sanity tests:

 - a number of tests to verify that 'rq->queuelist' isn't already on
some queue when it is added to a queue

 - one test to verify that rq->mq_ctx is the same ctx that we have locked.

I may be completely full of shit, and this patch may be pure garbage
or "obviously will never trigger", but humor me.

  Linus
 block/blk-mq.c | 9 +
 1 file changed, 9 insertions(+)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index ddc2eed64771..4f575de7fdd0 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -521,6 +521,8 @@ void blk_mq_add_to_requeue_list(struct request *rq, bool 
at_head)
 */
BUG_ON(rq->cmd_flags & REQ_SOFTBARRIER);
 
+WARN_ON_ONCE(!list_empty(>queuelist));
+
spin_lock_irqsave(>requeue_lock, flags);
if (at_head) {
rq->cmd_flags |= REQ_SOFTBARRIER;
@@ -838,6 +840,7 @@ static void __blk_mq_run_hw_queue(struct blk_mq_hw_ctx 
*hctx)
queued++;
break;
case BLK_MQ_RQ_QUEUE_BUSY:
+WARN_ON_ONCE(!list_empty(>queuelist));
list_add(>queuelist, _list);
__blk_mq_requeue_request(rq);
break;
@@ -1034,6 +1037,8 @@ static inline void __blk_mq_insert_req_list(struct 
blk_mq_hw_ctx *hctx,
 
trace_block_rq_insert(hctx->queue, rq);
 
+WARN_ON_ONCE(!list_empty(>queuelist));
+
if (at_head)
list_add(>queuelist, >rq_list);
else
@@ -1137,6 +1142,7 @@ void blk_mq_flush_plug_list(struct blk_plug *plug, bool 
from_schedule)
depth = 0;
}
 
+WARN_ON_ONCE(!list_empty(>queuelist));
depth++;
list_add_tail(>queuelist, _list);
}
@@ -1172,6 +1178,7 @@ static inline bool blk_mq_merge_queue_io(struct 
blk_mq_hw_ctx *hctx,
blk_mq_bio_to_request(rq, bio);
spin_lock(>lock);
 insert_rq:
+WARN_ON_ONCE(rq->mq_ctx != ctx);
__blk_mq_insert_request(hctx, rq, false);
spin_unlock(>lock);
return false;
@@ -1326,6 +1333,7 @@ static blk_qc_t blk_mq_make_request(struct request_queue 
*q, struct bio *bio)
old_rq = same_queue_rq;
list_del_init(_rq->queuelist);
}
+WARN_ON_ONCE(!list_empty(>queuelist));
list_add_tail(>queuelist, >mq_list);
} else /* is_sync */
old_rq = rq;
@@ -1412,6 +1420,7 @@ static blk_qc_t blk_sq_make_request(struct request_queue 
*q, struct bio *bio)
trace_block_plug(q);
}
 
+WARN_ON_ONCE(!list_empty(>queuelist));
list_add_tail(>queuelist, >mq_list);
return cookie;
}


Re: bio linked list corruption.

2016-10-26 Thread Linus Torvalds
On Wed, Oct 26, 2016 at 1:00 PM, Chris Mason  wrote:
>
> Today I turned off every CONFIG_DEBUG_* except for list debugging, and
> ran dbench 2048:
>
> [ 2759.118711] WARNING: CPU: 2 PID: 31039 at lib/list_debug.c:33 
> __list_add+0xbe/0xd0
> [ 2759.119652] list_add corruption. prev->next should be next 
> (e8c80308), but was c9ccfb88. (prev=880128522380).
> [ 2759.121039] Modules linked in: crc32c_intel i2c_piix4 aesni_intel 
> aes_x86_64 virtio_net glue_helper i2c_core lrw floppy gf128mul serio_raw 
> pcspkr button ablk_helper cryptd sch_fq_codel autofs4 virtio_blk
> [ 2759.124369] CPU: 2 PID: 31039 Comm: dbench Not tainted 
> 4.9.0-rc1-15246-g4ce9206-dirty #317
> [ 2759.125077] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 
> 1.9.0-1.fc24 04/01/2014
> [ 2759.125077]  c9000f6fb868 814fe4ff 8151cb5e 
> c9000f6fb8c8
> [ 2759.125077]  c9000f6fb8c8  c9000f6fb8b8 
> 81064bbf
> [ 2759.127444]  880128523680 002139968000 880138b7a4a0 
> 880128523540
> [ 2759.127444] Call Trace:
> [ 2759.127444]  [] dump_stack+0x53/0x74
> [ 2759.127444]  [] ? __list_add+0xbe/0xd0
> [ 2759.127444]  [] __warn+0xff/0x120
> [ 2759.127444]  [] warn_slowpath_fmt+0x49/0x50
> [ 2759.127444]  [] __list_add+0xbe/0xd0
> [ 2759.127444]  [] blk_sq_make_request+0x388/0x580

Ok, that's definitely the same one that Dave started out seeing.

The fact that it is that reliable - two different machines, two very
different loads (dbench looks nothing like trinity) really makes me
think that maybe the problem really is in the block plugging after
all.

It very much does not smell like random stack corruption. It's simply
not random enough.

And I just noticed something: I originally thought that this is the
"list_add_tail()" to the plug - which is the only "list_add()" variant
in that function.

But that never made sense, because the whole "but was" isn't a stack
address, and "next" in "list_add_tail()" is basically fixed, and would
have to be the stack.

But I now notice that there's actually another "list_add()" variant
there, and it's the one from __blk_mq_insert_request() that gets
inlined into blk_mq_insert_request(), which then gets inlined into
blk_mq_make_request().

And that actually makes some sense just looking at the offsets too:

 blk_sq_make_request+0x388/0x580

so it's somewhat at the end of blk_sq_make_request(). So it's not unlikely.

And there it makes perfect sense that the "next should be" value is
*not* on the stack.

Chris, if you built with debug info, you can try

./scripts/faddr2line /boot/vmlinux blk_sq_make_request+0x388

to get what line that blk_sq_make_request+0x388 address actually is. I
think it's the

list_add_tail(>queuelist, >rq_list);

in __blk_mq_insert_req_list() (when it's inlined from
blk_sq_make_request(), "at_head" will be false.

So it smells like ">rq_list" might be corrupt.

And that actually seems much more likely than the "plug" list, because
while the plug is entirely thread-local (and thus shouldn't have any
races), the ctx->rq_list very much is not.

Jens?

For example, should we have a

BUG_ON(ctx != rq->mq_ctx);

in blk_mq_merge_queue_io()? Because it locks ctx->lock, but then
__blk_mq_insert_request() will insert things onto the queues of
rq->mq_ctx.

blk_mq_insert_requests() has similar issues, but there has that BUG_ON().

The locking there really is *very* messy. All the lockers do

spin_lock(>lock);
...
spin_unlock(>lock);

but then __blk_mq_insert_request() and __blk_mq_insert_req_list don't
act on "ctx", but on "ctx = rq->mq_ctx", so if you ever get those
wrong, you're completely dead.

Now, I'm not seeing why they'd be wrong, and why they'd be associated
with the VMAP_STACK thing, but it could just be an unlucky timing
thing.

   Linus
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: bio linked list corruption.

2016-10-26 Thread Linus Torvalds
On Wed, Oct 26, 2016 at 11:42 AM, Dave Jones  wrote:
>
> The stacks show nearly all of them are stuck in sync_inodes_sb

That's just wb_wait_for_completion(), and it means that some IO isn't
completing.

There's also a lot of processes waiting for inode_lock(), and a few
waiting for mnt_want_write()

Ignoring those, we have

> [] btrfs_wait_ordered_roots+0x3f/0x200 [btrfs]
> [] btrfs_sync_fs+0x31/0xc0 [btrfs]
> [] sync_filesystem+0x6e/0xa0
> [] SyS_syncfs+0x3c/0x70
> [] do_syscall_64+0x5c/0x170
> [] entry_SYSCALL64_slow_path+0x25/0x25
> [] 0x

Don't know this one. There's a couple of them. Could there be some
ABBA deadlock on the ordered roots waiting?

> [] call_rwsem_down_write_failed+0x17/0x30
> [] btrfs_fallocate+0xb2/0xfd0 [btrfs]
> [] vfs_fallocate+0x13e/0x220
> [] SyS_fallocate+0x43/0x80
> [] do_syscall_64+0x5c/0x170
> [] entry_SYSCALL64_slow_path+0x25/0x25
> [] 0x

This one is also inode_lock(), and is interesting only because it's
fallocate(), which has shown up so many times before..

But there are other threads blocked on do_truncate, or
btrfs_file_write_iter instead, or on lseek, so this is not different
for any other reason.

> [] wait_on_page_bit+0xaf/0xc0
> [] __filemap_fdatawait_range+0x151/0x170
> [] filemap_fdatawait_keep_errors+0x1c/0x20
> [] sync_inodes_sb+0x273/0x300
> [] sync_filesystem+0x57/0xa0
> [] SyS_syncfs+0x3c/0x70
> [] do_syscall_64+0x5c/0x170
> [] entry_SYSCALL64_slow_path+0x25/0x25
> [] 0x

This is actually waiting on the page. Possibly this is the IO that is
never completing, and keeps the inode lock.

> [] btrfs_start_ordered_extent+0x5b/0xb0 [btrfs]
> [] lock_and_cleanup_extent_if_need+0x22d/0x290 [btrfs]
> [] __btrfs_buffered_write+0x1b8/0x6e0 [btrfs]
> [] btrfs_file_write_iter+0x170/0x550 [btrfs]
> [] do_iter_readv_writev+0xa8/0x100
> [] do_readv_writev+0x172/0x210
> [] vfs_writev+0x3a/0x50
> [] do_pwritev+0xb0/0xd0
> [] SyS_pwritev+0xc/0x10
> [] do_syscall_64+0x5c/0x170
> [] entry_SYSCALL64_slow_path+0x25/0x25

Hmm. This is the one that *started* the ordered extents (as opposed to
the ones waiting for it)

I dunno. There might be a lost IO. More likely it's the same
corruption that causes it, it just didn't result in an oops this time.

 Linus
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: bio linked list corruption.

2016-10-26 Thread Linus Torvalds
On Wed, Oct 26, 2016 at 9:30 AM, Dave Jones  wrote:
>
> I gave this a go last thing last night. It crashed within 5 minutes,
> but it was one we've already seen (the bad page map trace) with nothing
> additional that looked interesting.

Did the bad page map trace have any registers that looked like they
had 0xd0d0d0d0d0d0 in them?

I assume not, but worth checking.

> Heisenbugs man, literally the worst.

I know you already had this in some email, but I lost it. I think you
narrowed it down to a specific set of system calls that seems to
trigger this best. fallocate and xattrs or something?

  Linus
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: bio linked list corruption.

2016-10-25 Thread Linus Torvalds
On Tue, Oct 25, 2016 at 6:33 PM, Linus Torvalds
<torva...@linux-foundation.org> wrote:
>
> Completely untested. Maybe there's some reason we can't write to the
> whole thing like that?

That hack boots and seems to work for me, but doesn't show anything.

Dave, mind just trying that oneliner?

   Linus
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: bio linked list corruption.

2016-10-25 Thread Linus Torvalds
On Tue, Oct 25, 2016 at 5:27 PM, Dave Jones  wrote:
>
> DaveC: Do these look like real problems, or is this more "looks like
> random memory corruption" ?  It's been a while since I did some stress
> testing on XFS, so these might not be new..

Andy, do you think we could just do some poisoning of the stack as we
free it, to see if that catches anything?

Something truly stupid like just

--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -218,6 +218,7 @@ static inline void free_thread_stack(struct
task_struct *tsk)
unsigned long flags;
int i;

+   memset(tsk->stack_vm_area->addr, 0xd0, THREAD_SIZE);
local_irq_save(flags);
for (i = 0; i < NR_CACHED_STACKS; i++) {
if (this_cpu_read(cached_stacks[i]))

or similar?

It seems like DaveJ had an easier time triggering these problems with
the stack cache, but they clearly didn't go away when the stack cache
was disabled. So maybe the stack cache just made the reuse more likely
and faster, making the problem show up faster too. But if we actively
poison things, we'll corrupt the free'd stack *immediately* if there
is some stale use..

Completely untested. Maybe there's some reason we can't write to the
whole thing like that?

 Linus

  Linus
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: bio linked list corruption.

2016-10-24 Thread Linus Torvalds
On Mon, Oct 24, 2016 at 3:42 PM, Andy Lutomirski  wrote:
>
> Here's my theory: I think you're looking at the right code but the
> wrong stack.  shmem_fault_wait is fine, but shmem_fault_waitq looks
> really dicey.

Hmm.

> Consider:
>
> fallocate calls wake_up_all(), which calls autoremove_wait_function().
> That wakes up the shmem_fault thread.  Suppose that the shmem_fault
> thread gets moving really quickly (presumably because it never went to
> sleep in the first place -- suppose it hadn't made it to schedule(),
> so schedule() turns into a no-op).  It calls finish_wait() *before*
> autoremove_wake_function() does the list_del_init().  finish_wait()
> gets to the list_empty_careful() call and it returns true.

All of this happens under inode->i_lock, so the different parts are
serialized. So if this happens before the wakeup, then finish_wait()
will se that the wait-queue entry is not empty (it points to the wait
queue head in shmem_falloc_waitq.

But then it will just remove itself carefully with list_del_init()
under the waitqueue lock, and things are fine.

Yes, it uses the waitiqueue lock on the other stack, but that stack is
still ok since
 (a) we're serialized by inode->i_lock
 (b) this code ran before the fallocate thread catches up and exits.

In other words, your scenario seems to be dependent on those two
threads being able to race. But they both hold inode->i_lock in the
critical region you are talking about.

> Now the fallocate thread catches up and *exits*.  Dave's test makes a
> new thread that reuses the stack (the vmap area or the backing store).
>
> Now the shmem_fault thread continues on its merry way and takes
> q->lock.  But oh crap, q->lock is pointing at some random spot on some
> other thread's stack.  Kaboom!

Note that q->lock should be entirely immaterial, since inode->i_lock
nests outside of it in all uses.

Now, if there is some code that runs *without* the inode->i_lock, then
that would be a big bug.

But I'm not seeing it.

I do agree that some race on some stack data structure could easily be
the cause of these issues. And yes, the vmap code obviously starts
reusing the stack much earlier, and would trigger problems that would
essentially be hidden by the fact that the kernel stack used to stay
around not just until exit(), but until the process was reaped.

I just think that in this case i_lock really looks like it should
serialize things correctly.

Or are you seeing something I'm not?

   Linus
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: bio linked list corruption.

2016-10-24 Thread Linus Torvalds
On Mon, Oct 24, 2016 at 2:17 PM, Linus Torvalds
<torva...@linux-foundation.org> wrote:
>
> The vmalloc/vfree code itself is a bit scary. In particular, we have a
> rather insane model of TLB flushing. We leave the virtual area on a
> lazy purge-list, and we delay flushing the TLB and actually freeing
> the virtual memory for it so that we can batch things up.

Never mind. If DaveJ is running with DEBUG_PAGEALLOC, then the code in
vmap_debug_free_range() should have forced a synchronous TLB flush fro
vmalloc ranges too, so that doesn't explain it either.

 Linus
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: bio linked list corruption.

2016-10-24 Thread Linus Torvalds
On Mon, Oct 24, 2016 at 1:46 PM, Linus Torvalds
<torva...@linux-foundation.org> wrote:
>
> So this is all some really subtle code, but I'm not seeing that it
> would be wrong.

Ahh... Except maybe..

The vmalloc/vfree code itself is a bit scary. In particular, we have a
rather insane model of TLB flushing. We leave the virtual area on a
lazy purge-list, and we delay flushing the TLB and actually freeing
the virtual memory for it so that we can batch things up.

But we've free'd the physical pages that are *mapped* by that area
when we do the vfree(). So there can be stale TLB entries that point
to pages that have gotten re-used. They shouldn't matter, because
nothing should be writing to those pages, but it strikes me that this
may also be hurting the DEBUG_PAGEALLOC thing. Maybe we're not getting
the page fautls that we *should* be getting, and there are hidden
reads and writes to those paghes that already got free'd.\

There was some nasty reason why we did that insane thing. I think it
was just that there are a few high-frequency vmalloc/vfree users and
the TLB flushing was killing some performance.

But it does strike me that we are playing very fast and loose with the
TLB on the vmalloc area.

So maybe all the new VMAP code is fine, and it's really vmalloc/vfree
that has been subtly broken but nobody has ever cared before?

  Linus
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: bio linked list corruption.

2016-10-24 Thread Linus Torvalds
On Mon, Oct 24, 2016 at 1:06 PM, Andy Lutomirski  wrote:
>>
>> [69943.450108] Oops: 0003 [#1] PREEMPT SMP DEBUG_PAGEALLOC
>
> This is an unhandled kernel page fault.  The string "Oops" is so helpful :-/

I think there was a line above it that DaveJ just didn't include.

>
>> [69943.454452] CPU: 1 PID: 21558 Comm: trinity-c60 Not tainted 
>> 4.9.0-rc1-think+ #11
>> [69943.463510] task: 8804f8dd3740 task.stack: c9000b108000
>> [69943.468077] RIP: 0010:[]
>> [69943.472704]  [] __lock_acquire.isra.32+0x6b/0x8c0
>> [69943.477489] RSP: 0018:c9000b10b9e8  EFLAGS: 00010086
>> [69943.482368] RAX: 81789b90 RBX: 8804f8dd3740 RCX: 
>> 
>> [69943.487410] RDX:  RSI:  RDI: 
>> 
>> [69943.492515] RBP: c9000b10ba18 R08: 0001 R09: 
>> 
>> [69943.497666] R10: 0001 R11: 3f9cfa7f4e73 R12: 
>> 
>> [69943.502880] R13:  R14: c9000af7bd48 R15: 
>> 8804f8dd3740
>> [69943.508163] FS:  7f64904a2b40() GS:880507a0() 
>> knlGS:
>> [69943.513591] CS:  0010 DS:  ES:  CR0: 80050033
>> [69943.518917] CR2: 81789d28 CR3: 0004a8f16000 CR4: 
>> 001406e0
>> [69943.524253] DR0: 7f5b97fd4000 DR1:  DR2: 
>> 
>> [69943.529488] DR3:  DR6: 0ff0 DR7: 
>> 0600
>> [69943.534771] Stack:
>> [69943.540023]  880507bd74c0
>> [69943.545317]  8804f8dd3740 0046 
>> 0286[69943.545456]  c9000af7bd08
>> [69943.550930]  0100 c9000b10ba50 
>> 810c4b68[69943.551069]  810ba40c
>> [69943.556657]  8804  
>> c9000af7bd48[69943.556796] Call Trace:
>> [69943.562465]  [] lock_acquire+0x58/0x70
>> [69943.568354]  [] ? finish_wait+0x3c/0x70
>> [69943.574306]  [] _raw_spin_lock_irqsave+0x42/0x80
>> [69943.580335]  [] ? finish_wait+0x3c/0x70
>> [69943.586237]  [] finish_wait+0x3c/0x70
>> [69943.591992]  [] shmem_fault+0x167/0x1b0
>> [69943.597807]  [] ? prepare_to_wait_event+0x100/0x100
>> [69943.603741]  [] __do_fault+0x6d/0x1b0
>> [69943.609743]  [] handle_mm_fault+0xc58/0x1170
>> [69943.615822]  [] ? handle_mm_fault+0x43/0x1170
>> [69943.621971]  [] __do_page_fault+0x172/0x4e0
>> [69943.628184]  [] do_page_fault+0x20/0x70
>> [69943.634449]  [] ? debug_smp_processor_id+0x17/0x20
>> [69943.640784]  [] page_fault+0x1f/0x30
>> [69943.647170]  [] ? strncpy_from_user+0x5c/0x170
>> [69943.653480]  [] ? strncpy_from_user+0x46/0x170
>> [69943.659632]  [] setxattr+0x57/0x170
>> [69943.665846]  [] ? debug_smp_processor_id+0x17/0x20
>> [69943.672172]  [] ? get_lock_stats+0x19/0x50
>> [69943.678558]  [] ? sched_clock_cpu+0xb6/0xd0
>> [69943.685007]  [] ? __lock_acquire.isra.32+0x1cf/0x8c0
>> [69943.691542]  [] ? __this_cpu_preempt_check+0x13/0x20
>> [69943.698130]  [] ? preempt_count_add+0x7c/0xc0
>> [69943.704791]  [] ? __mnt_want_write+0x61/0x90
>> [69943.711519]  [] SyS_fsetxattr+0x78/0xa0
>> [69943.718300]  [] do_syscall_64+0x5c/0x170
>> [69943.724949]  [] entry_SYSCALL64_slow_path+0x25/0x25
>> [69943.731521] Code:
>> [69943.738124] 00 83 fe 01 0f 86 0e 03 00 00 31 d2 4c 89 f7 44 89 45 d0 89 
>> 4d d4 e8 75 e7 ff ff 8b 4d d4 48 85 c0 44 8b 45 d0 0f 84 d8 02 00 00  ff 
>> 80 98 01 00 00 8b 15 e0 21 8f 01 45 8b 8f 50 08 00 00 85
>
> That's lock incl 0x198(%rax).  I think this is:
>
> atomic_inc((atomic_t *)>ops);
>
> I suppose this could be stack corruption at work, but after a fair
> amount of staring, I still haven't found anything in the vmap_stack
> code that would cause stack corruption.

Well, it is intriguing that what faults is this:

finish_wait(shmem_falloc_waitq, _fault_wait);

where 'shmem_fault_wait' is a on-stack wait queue. So it really looks
very much like stack corruption.

What strikes me is that "finish_wait()" does this optimistic "has my
entry been removed" without holding the waitqueue lock (and uses
list_empty_careful() to make sure it does that "safely").

It has that big comment too:

/*
 * shmem_falloc_waitq points into the shmem_fallocate()
 * stack of the hole-punching task: shmem_falloc_waitq
 * is usually invalid by the time we reach here, but
 * finish_wait() does not dereference it in that case;
 * though i_lock needed lest racing with wake_up_all().
 */

the stack it comes from is the wait queue head from shmem_fallocate(),
which will do "wake_up_all()" under the inode lock.

On the face of it, the inode lock should make that safe and serialize
everything. And yes, finish_wait() does not touch the unsafe stuff if
the wait-queue (in the local stack) is empty, which wake_up_all()
*should* have guaranteed. It's just a regular 

Re: bio linked list corruption.

2016-10-19 Thread Linus Torvalds
On Wed, Oct 19, 2016 at 10:09 AM, Philipp Hahn  wrote:
>
> Nearly a month ago I reported also a "list_add corruption", but with 4.1.6:
> 
>
> That server rungs Samba4, which also is a heavy user of xattr.

That one looks very different. In fact, the list that got corrupted
for you has since been changed to a hlist (which is *similar* to our
doubly-linked list, but has a smaller head and does not allow adding
to the end of the list).

Also, the "should be" and "was" values are very close, and switched:

should be 81ab3ca8, but was 81ab3cc8
should be 81ab3cc8, but was 81ab3ca8

so it actually looks like it was the same data structure. In
particular, it looks like enqueue_timer() ended up racing on adding an
entry to one index in the "base->vectors[]" array, while hitting an
entry that was pointing to another index near-by.

So I don't think it's related. Yours looks like some subtle timer base
race. It smells like a locking problem with timers. I'm not seeing
what it might be, but it *might* have been fixed by doing the
TIMER_MIGRATING bit right in add_timer_on() (commit 22b886dd1018).

Adding some timer people just in case, but I don't think your 4.1
report is related.

Linus
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: bio linked list corruption.

2016-10-18 Thread Linus Torvalds
On Tue, Oct 18, 2016 at 5:10 PM, Linus Torvalds
<torva...@linux-foundation.org> wrote:
>
> Adding Andy to the cc, because this *might* be triggered by the
> vmalloc stack code itself. Maybe the re-use of stacks showing some
> problem? Maybe Chris (who can't see the problem) doesn't have
> CONFIG_VMAP_STACK enabled?

I bet it's the plug itself that is the stack address. In fact, it's
probably that mq_list head pointer

I think every single users of block plugging uses the pattern

struct blk_plug plug;

blk_start_plug();

and then we'll have

INIT_LIST_HEAD(>mq_list);

which initializes that mq_list head with the stack addresses pointing to itself.

So when we see something like this:

  list_add corruption. prev->next should be next (e8806648),
but was c967fcd8. (prev=880503878b80)

and it comes from

list_add_tail(>queuelist, >mq_list);

which will expand to

__list_add(new, head->prev, head)

which in this case *should* be:

__list_add(>queuelist, plug->mq_list.prev, >mq_list);

so in fact we *should* have "next" be a stack address.

So that debug message is really really odd. I would expect that "next"
is the stack address (because we're adding to the tail of the list, so
"next" is the list head itself), but the debug message corruption
printout says that "was" is the stack address, but next isn't.

Weird.The "but was" value actually looks like the right address should
look, but the actual address (which *should* be just ">mq_list"
and really should be on the stack) looks bogus.

I'm now very confused.

  Linus
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: bio linked list corruption.

2016-10-18 Thread Linus Torvalds
On Tue, Oct 18, 2016 at 4:42 PM, Chris Mason  wrote:
>
> Seems to be the whole thing:

Ahh. On lkml, so I do have it in my mailbox, but Dave changed the
subject line when he tested on ext4 rather than btrfs..

Anyway, the corrupted address is somewhat interesting. As Dave Jones
said, he saw

  list_add corruption. prev->next should be next (e8806648),
but was c967fcd8. (prev=880503878b80).
  list_add corruption. prev->next should be next (e8c05648),
but was c928bcd8. (prev=880503a145c0).

and Dave Chinner reports

  list_add corruption. prev->next should be next (e8c02808),
but was c90005f6bda8. (prev=88013363bb80).

and it's worth noting that the "but was" is a remarkably consistent
vmalloc address (the c9000.. pattern gives it away). In fact, it's
identical across two boots for DaveJ in the low 14 bits, and fairly
high up in those low 14 bots (0x3cd8).

DaveC has a different address, but it's also in the vmalloc space, and
also looks like it is fairly high up in 14 bits (0x3da8). So in both
cases it's almost certainly a stack address with a fairly empty stack.
The differences are presumably due to different kernel configurations
and/or just different filesystems calling the same function that does
the same bad thing but now at different depths in the stack.

Adding Andy to the cc, because this *might* be triggered by the
vmalloc stack code itself. Maybe the re-use of stacks showing some
problem? Maybe Chris (who can't see the problem) doesn't have
CONFIG_VMAP_STACK enabled?

Andy - this is on lkml, under

Dave Chinner:
  [regression, 4.9-rc1] blk-mq: list corruption in request queue

Dave Jones:
  btrfs bio linked list corruption.
  Re: bio linked list corruption.

and they are definitely the same thing across three different
filesystems (xfs, btrfs and ext4), and they are consistent enough that
there is almost certainly a single very specific memory corrupting
issue that overwrites something with a stack pointer.

Linus
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: bio linked list corruption.

2016-10-18 Thread Linus Torvalds
On Tue, Oct 18, 2016 at 4:31 PM, Chris Mason  wrote:
>
> Jens, not sure if you saw the whole thread.  This has triggered bad page
> state errors, and also corrupted a btrfs list.  It hurts me to say, but it
> might not actually be your fault.

Where is that thread, and what is the "this" that triggers problems?

Looking at the "->mq_list" users, I'm not seeing any changes there in
the last year or so. So I don't think it's the list itself.

  Linus
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [lkp] [nfsd] b44061d0b9: BUG: Dentry ffff880027d7c540{i=1846f,n=0a} still in use (1) [unmount of btrfs vda]

2016-08-10 Thread Linus Torvalds
On Tue, Aug 9, 2016 at 10:39 PM, kernel test robot
 wrote:
>
> [ 1537.558739] nfsd: last server has exited, flushing export cache
> [ 1540.627795] BUG: Dentry 880027d7c540{i=1846f,n=0a}  still in use (1) 
> [unmount of btrfs vda]
> [ 1540.633915] [ cut here ]
> [ 1540.636551] WARNING: CPU: 0 PID: 20552 at fs/dcache.c:1474 
> umount_check+0x72/0x80

Hmm. Adding Al and the btrfs people to the cc, and expanding the nfs
list. Unlike the flakey-IO warning, this one sounds like a real issue.
Whether it's some dentry leak by nfsd or a VFS or btrfs issue, I can't
begin to guess. Al, ideas?

More information in the original email on lkml.

 Linus
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [lkp] [nfsd] b44061d0b9: BUG: Dentry ffff880027d7c540{i=1846f,n=0a} still in use (1) [unmount of btrfs vda]

2016-08-10 Thread Linus Torvalds
On Wed, Aug 10, 2016 at 11:22 AM, Josef Bacik <jba...@fb.com> wrote:
> On 08/10/2016 02:06 PM, Linus Torvalds wrote:
>>
>> More information in the original email on lkml.
>
> I'm not subscribed to lkml and for some reason I can't find the original
> email in any of the lkml/linux-nfs archives.  Could you forward more of the
> details?

Done.

 Linus
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [lkp] [nfsd] b44061d0b9: BUG: Dentry ffff880027d7c540{i=1846f,n=0a} still in use (1) [unmount of btrfs vda]

2016-08-10 Thread Linus Torvalds
On Wed, Aug 10, 2016 at 11:46 AM, Josef Bacik  wrote:
>
> So my naive fix would be something like this

Bruce? Josef's patch looks ObviouslyCorrect(tm) to me now that I look
at it - all the other callers of fh_compose() also seem to just drop
the dentry unconditionally, knowing that fh_compose() took a ref to it
if needed.

In fact, the only thing I'd do differently would be to not even put
the comment there at all, since this call site isn't any different
from any of the others. If anything, it could go on fh_compose() if we
want to add comments.

 Linus
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [lkp] [nfsd] b44061d0b9: BUG: Dentry ffff880027d7c540{i=1846f,n=0a} still in use (1) [unmount of btrfs vda]

2016-08-10 Thread Linus Torvalds
On Wed, Aug 10, 2016 at 12:09 PM, J. Bruce Fields  wrote:
>
> OK with me if you want to take it, otherwise I'll run it through my
> usual tests and send you a pull request probably today or tomorrow.

I'll let it go through your tree and your usual tests - it's not like
this seems to be a "needs to be in my tree *IMMEDIATELY* or the world
will end!!!11!" kind of issue.

Thanks everybody,

 Linus
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [lkp] [dm flakey] 99f3c90d0d: WARNING: CPU: 20 PID: 1027 at fs/btrfs/extent-tree.c:2973 btrfs_run_delayed_refs+0x2c5/0x2f0 [btrfs]

2016-08-08 Thread Linus Torvalds
This report should probably have gone to the btrfs people, although it
doesn't really look like a bug.

It looks like it's just dm-flakey now returning more errors, and
causing btrfs to complain more.

Which seems entirely expected.

Maybe the robot just was testing a error case that didn't actually
happen before.

 Linus

On Mon, Aug 8, 2016 at 1:00 AM, kernel test robot  wrote:
>
> FYI, we noticed the following commit:
>
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git master
> commit 99f3c90d0d85708e7401a81ce3314e50bf7f2819 ("dm flakey: error READ bios 
> during the down_interval")
>
> in testcase: xfstests
> with following parameters:
>
> disk: 4HDD
> fs: btrfs
> test: generic-slow3
>
>
> on test machine: 24 threads Westmere-EP with 16G memory
>
> caused below changes:
>
>
> kern  :err   : [  326.754793] BTRFS error (device dm-0): bdev 
> /dev/mapper/flakey-test errs: wr 0, rd 1, flush 0, corrupt 0, gen 0
> kern  :err   : [  326.765309] BTRFS error (device dm-0): bdev 
> /dev/mapper/flakey-test errs: wr 0, rd 2, flush 0, corrupt 0, gen 0
> kern  :warn  : [  326.775848] [ cut here ]
> kern  :warn  : [  326.780679] WARNING: CPU: 20 PID: 1027 at 
> fs/btrfs/extent-tree.c:2973 btrfs_run_delayed_refs+0x2c5/0x2f0 [btrfs]
> kern  :debug : [  326.792555] BTRFS: Transaction aborted (error -5)
> kern  :warn  : [  326.797436] Modules linked in: dm_flakey ipmi_devintf 
> dm_mod btrfs xor raid6_pq ses enclosure rpcsec_gss_krb5 auth_rpcgss nfsv4 
> dns_resolver sg sr_mod cdrom ata_generic sd_mod pata_acpi intel_powerclamp 
> coretemp kvm_intel mptsas mptscsih kvm snd_pcm irqbypass crct10dif_pclmul 
> crc32_pclmul snd_timer crc32c_intel pata_jmicron mptbase ahci libahci snd 
> ghash_clmulni_intel soundcore ppdev aesni_intel lrw gf128mul glue_helper 
> ablk_helper cryptd scsi_transport_sas i7core_edac ipmi_si serio_raw pcspkr 
> libata parport_pc edac_core shpchp parport ipmi_msghandler acpi_cpufreq
> kern  :warn  : [  326.851992] CPU: 20 PID: 1027 Comm: umount Not tainted 
> 4.7.0-10769-g99f3c90 #1
> kern  :warn  : [  326.859514] Hardware name: Supermicro X8DTN/X8DTN, BIOS 
> 4.6.3 01/06/2010
> kern  :warn  : [  326.866388]   8803f905bc30 
> 8143e1d9 8803f905bc80
> kern  :warn  : [  326.874322]   8803f905bc70 
> 8107e50b 0b9d3bc41000
> kern  :warn  : [  326.882251]  a04a25d0 88043b7e0850 
> 88043dd56730 
> kern  :warn  : [  326.890185] Call Trace:
> kern  :warn  : [  326.892810]  [] dump_stack+0x63/0x8a
> kern  :warn  : [  326.898123]  [] __warn+0xcb/0xf0
> kern  :warn  : [  326.903090]  [] 
> warn_slowpath_fmt+0x4f/0x60
> kern  :warn  : [  326.909019]  [] 
> btrfs_run_delayed_refs+0x2c5/0x2f0 [btrfs]
> kern  :warn  : [  326.916240]  [] ? 
> kmem_cache_alloc+0x1a6/0x1c0
> kern  :warn  : [  326.922429]  [] 
> btrfs_commit_transaction+0x43/0xa90 [btrfs]
> kern  :warn  : [  326.929744]  [] btrfs_sync_fs+0x66/0x130 
> [btrfs]
> kern  :warn  : [  326.936096]  [] sync_filesystem+0x73/0xa0
> kern  :warn  : [  326.941842]  [] 
> generic_shutdown_super+0x27/0x100
> kern  :warn  : [  326.948280]  [] kill_anon_super+0x12/0x20
> kern  :warn  : [  326.954030]  [] 
> btrfs_kill_super+0x18/0x110 [btrfs]
> kern  :warn  : [  326.960649]  [] 
> deactivate_locked_super+0x43/0x70
> kern  :warn  : [  326.967087]  [] deactivate_super+0x5a/0x60
> kern  :warn  : [  326.972918]  [] cleanup_mnt+0x3f/0x90
> kern  :warn  : [  326.978317]  [] __cleanup_mnt+0x12/0x20
> kern  :warn  : [  326.983890]  [] task_work_run+0x85/0xc0
> kern  :warn  : [  326.989464]  [] 
> exit_to_usermode_loop+0xc2/0xd0
> kern  :warn  : [  326.995729]  [] 
> syscall_return_slowpath+0xa1/0xb0
> kern  :warn  : [  327.002175]  [] 
> entry_SYSCALL_64_fastpath+0xa2/0xa4
> kern  :warn  : [  327.002223] ---[ end trace 0e82b94d9eb7c6ea ]---
> kern  :crit  : [  327.002225] BTRFS: error (device dm-0) in 
> btrfs_run_delayed_refs:2973: errno=-5 IO failure
>
>
>
> To reproduce:
>
> git clone 
> git://git.kernel.org/pub/scm/linux/kernel/git/wfg/lkp-tests.git
> cd lkp-tests
> bin/lkp install job.yaml  # job file is attached in this email
> bin/lkp run job.yaml
>
>
>
> Thanks,
> Xiaolong
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v2 00/24] Delete CURRENT_TIME and CURRENT_TIME_SEC macros

2016-06-20 Thread Linus Torvalds
On Sun, Jun 19, 2016 at 5:26 PM, Deepa Dinamani  wrote:
> The series is aimed at getting rid of CURRENT_TIME and CURRENT_TIME_SEC 
> macros.

This version now looks ok to me.

I do have a comment (or maybe just a RFD) for future work.

It does strike me that once we actually change over the inode times to
use timespec64, the calling conventions are going to be fairly
horrendous on most 32-bit architectures.

Gcc handles 8-byte structure returns (on most architectures) by
returning them as two 32-bit registers (%edx:%eax on x86). But once it
is timespec64, that will no longer be the case, and the calling
convention will end up using a pointer to the local stack instead.

So for 32-bit code generation, we *may* want to introduce a new model of doing

set_inode_time(inode, ATTR_ATIME | ATTR_MTIME);

which basically just does

inode->i_atime = inode->i_mtime = current_time(inode);

but with a much easier calling convention on 32-bit architectures.

But that is entirely orthogonal to this patch-set, and should be seen
as a separate issue.

And maybe it doesn't end up helping anyway, but I do think those
"simple" direct assignments will really generate pretty disgusting
code on 32-bit architectures.

That whole

inode->i_atime = inode->i_mtime = CURRENT_TIME;

model really made a lot more sense back in the ancient days when inode
times were just simply 32-bit seconds (not even timeval structures).

  Linus
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [GIT PULL] Btrfs

2016-03-21 Thread Linus Torvalds
On Mon, Mar 21, 2016 at 7:24 PM, Chris Mason  wrote:
>
> Hmmm, rereading my answer I realized I didn't actually answer.  I really
> think this is fixed.  I left the warning only because I originally
> expected something much more exotic.

Ok. It's more that you said the top commit fixes a problem, and the
only case where the top commit makes a difference it will also do that
WARN_ON_ONCE.

But it's pulled, test-built, and pushed out now.

  Linus
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [GIT PULL] Btrfs

2016-03-21 Thread Linus Torvalds
On Mon, Mar 21, 2016 at 5:24 PM, Chris Mason  wrote:
>
> I waited an extra day to send this one out because I hit a crash late
> last week with CONFIG_DEBUG_PAGEALLOC enabled (fixed in the top commit).

Hmm. If that commit helps, it will spit out a warning.

So is it actually fixed, or just hacked around to the point where you
don't get a page fault?

That WARN_ON_ONCE kind of implies it's a "this happens, but we don't know why".

  Linus
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [GIT PULL] Btrfs for 3.19-rc

2014-12-12 Thread Linus Torvalds
On Fri, Dec 12, 2014 at 11:07 AM, Chris Mason c...@fb.com wrote:

 From a feature point of view, most of the code here comes from Miao Xie
 and others at Fujitsu to implement scrubbing and replacing devices on
 raid56.  This has been in development for a while, and it's a big
 improvement.

So this has probably happened before, and I just haven't been looking,
but I thought I'd mention it.

There are merges from github for this feature, and those merges aren't
signed, and don't have merge messages. Maybe you actually verified all
of it other ways, but there's no sign of it. I generally push back on
merging unsigned stuff from random hosting places (to the point where
I just refuse to do it, although it's possible that some pass though
just due to inattention), and I think that's just good practice in
general. And merges that don't explain what the merge does are just
bad merges (they are extra annoying when they are back-merges, but
it's a problem even otherwise).

Now, sometimes the why did you merge is obvious in just the merge
itself (maybe the branch name is already sufficient to explain some
trivial pull). But I thought I'd mention this as an area where the
kernel development process can still improve. I strive to make sure
that my merge commits have good messages (generally by asking
submaintainers to explain things to me in email or in the signed tag),
and I'm starting to try to encourage others to the same.

Again, this is probably something you've done before without me ever
mentioning/noticing it, and I really don't think the btrfs tree is at
all alone in this, but I thought I'd mention it since I happened to
react to it this time.

Regardless - pulled,

  Linus
--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v3 11/12] security, crypto: LLVMLinux: Remove VLAIS from ima_crypto.c

2014-09-15 Thread Linus Torvalds
On Mon, Sep 15, 2014 at 12:30 AM,  beh...@converseincode.com wrote:
 From: Behan Webster beh...@converseincode.com

 Replaced the use of a Variable Length Array In Struct (VLAIS) with a C99
 compliant equivalent. This patch allocates the appropriate amount of memory
 using a char array using the SHASH_DESC_ON_STACK macro.

You only made the first case use SHASH_DESC_ON_STACK, the two other
cases you left in the ugly format. Was that just an oversight, or was
there some reason for it?

 Linus
--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [GIT PULL] Smaller btrfs pull

2014-08-16 Thread Linus Torvalds
On Fri, Aug 15, 2014 at 11:26 AM, Chris Mason c...@fb.com wrote:

 I've been running xfstests with these against your current git
 overnight, but I'm queueing up longer tests as well.  I understand
 you may want to wait until rc2, but either way I'll get a sane queue
 into my linux-next branch for the rest of the rcs.

You completely mis-understand.

If something isn't appropriate for the merge window, then it sure as
HELL isn't appropriate for later rc's either.

So I'm not going to pull some stragglers later and just pull some
initial stuff now. That's not how this works.

The changes for the merge window should have been ready *before* the
merge window even opened. That has always been the rule. We don't do
random development and re-organization during the merge window, and we
certainly do not do it *later*.

I'm going to pull this reduced set, but that in no way means that I
will then pull later sets after the merge window.  The stuff after the
merge window should be pure fixes, and I'm in fact going to be
stricter with btrfs than usual, because I think this was such a
disaster. Oopses, regressions, data corruption, security fixes, and
things like that *only*. No cleanups, no random crap, nothing that
doesn't count as critical.

   Linus
--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [GIT PULL] Btrfs

2014-08-14 Thread Linus Torvalds
On Thu, Aug 14, 2014 at 11:59 AM, Chris Mason c...@fb.com wrote:
 Hi Linus,

 Please pull my for-linus branch:

Yeah, I think this will be for the next merge window.

This clearly got rebased today. At the end of the merge window. After
I told people that I was traveling, and asked people to send big
things early.

So please re-send this for 3.18, and send just stable fixes for this release.

 Linus
--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [GIT PULL] Btrfs Maintainer update

2013-12-05 Thread Linus Torvalds
On Thu, Dec 5, 2013 at 10:19 AM, Chris Mason c...@fb.com wrote:

 I had used git commit -S instead of signing the tag.

Ahh. And that was actually what you had done the first time around for
your test too - but back then I didn't actually *merge* it. Then I
just looked at the top commit and said yup, it's signed.

And now that I tried to merged it, I noticed my merge commit didn't
get the signage I expected.

And that only happens if I pull a tag that's signed (which also gets
me the message I expect from signed pulls)..

Anyway, now I have both a signature on the merge and on your actual
commit. Thanks,

 Linus
--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [GIT PULL] Btrfs

2013-03-02 Thread Linus Torvalds
On Sat, Mar 2, 2013 at 7:15 AM, Chris Mason chris.ma...@fusionio.com wrote:

 Our set of btrfs features, fixes and cleanups are in my for-linus
 branch:

I *really* wish that big pull requests like this would come in earlier
in the merge window. I hate seeing them the day before I close the
window - really.  A number of the latter commits are done in the last
few days, which also smells bad.

   Linus
--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [GIT PULL] Btrfs fixes

2013-01-24 Thread Linus Torvalds
On Thu, Jan 24, 2013 at 1:52 PM, Chris Mason chris.ma...@fusionio.com wrote:

 Update on this, we've tracked down the crc errors and are doing final
 checks on the patches.  Linus are you planning on taking this pull?  If
 not I can just fold the new stuff into a bigger request.

If you have them basically ready, add them to this, I haven't pulled
yet. So I'll just ignore this and wait for another pull request.

 Linus
--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [GIT PULL 1/2] Btrfs fixes

2012-08-20 Thread Linus Torvalds
On Mon, Aug 20, 2012 at 6:53 PM, Chris Samuel ch...@csamuel.org wrote:

 This pull request with a whole heap of btrfs fixes (46 commits) appears
 not to have been merged yet, does anyone know if it was rejected or just
 missed ?

Read my -rc2 release notes.

TL;DR: I rejected big pull requests that didn't convince me. Make a
damn good case for it, or send minimal fixes instead.

I'm tried of these oops, what we sent you for -rc1 wasn't ready, so
here's a thousand lines of changes crap.

 Linus
--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [GIT PULL] a large btrfs update

2012-07-26 Thread Linus Torvalds
On Thu, Jul 26, 2012 at 2:09 PM, Chris Mason chris.ma...@fusionio.com wrote:

 for-linus-merged has an extra commit on top as well that changes the
 btrfs send/receive code to Al's new dentry_open.  It's a small commit,
 and my guess is that you'll cherry pick it and do your own merge.

Indeed. And I dd the semantic merge differently from what you had
done, more like what Al did for XFS in commit 765927b2d508 (see the
fairly equivalent case in fs/xfs/xfs_ioctl.c).

IOW, instead of doing the unnecessary mntget() (and then undoing it
with the path_put()), I just let dentry_open() get the references it
needs, and then just dput() the dentry in the caller (to match the
dget() done by d_obtain_alias()). No need to mess with the mnt
refcounts, since dentry_open() will get whatever refcount it needs.

But while it compiles, I haven't actually *tested* the btrfs
BTRFS_IOC_SEND send/receive path, so please do double-check my merge.

Linus
--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [GIT PULL] Btrfs updates

2012-06-15 Thread Linus Torvalds
On Fri, Jun 15, 2012 at 11:09 AM, Chris Mason chris.ma...@fusionio.com wrote:

 Please pull my for-linus branch:

 git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs.git for-linus

This seems to introduce a new warning:

  In file included from fs/btrfs/ctree.c:22:0:
  fs/btrfs/ctree.c: In function ‘btrfs_search_old_slot’:
  fs/btrfs/ctree.h:2117:240: warning: ‘old_generation’ may be used
uninitialized in this function [-Wmaybe-uninitialized]
  fs/btrfs/ctree.c:1178:6: note: ‘old_generation’ was declared here

which looks like just gcc being unable to see that it is only used
when set, but it's still annoying. I'd suggest initializing it to 0
just to shut up the compiler. Ok?

 Linus
--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC, PATCH] fs: push rcu_barrier() from deactivate_locked_super() to filesystems

2012-06-08 Thread Linus Torvalds
On Fri, Jun 8, 2012 at 3:00 PM, Kirill A. Shutemov
kirill.shute...@linux.intel.com wrote:

 IIUC, moving rcu_barrier() up should help, but I can't say that I fully
 understand SLAB_DESTROY_BY_RCU semantics.

.. hmm. I think you may be right. Even if we do move it up, we
probably shouldn't use it.

We don't even want SLAB_DESTROY_BY_RCU, since we do the delayed RCU
free for other reasons anyway, so it would duplicate the RCU delaying
and cause problems. I forgot about that little complication.

We could have a separate RCU_BARRIER_ON_DESTROY thing, but that's
just silly too.

Maybe your patch is the right thing.

Linus
--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC, PATCH, RESEND] fs: push rcu_barrier() from deactivate_locked_super() to filesystems

2012-06-08 Thread Linus Torvalds
On Fri, Jun 8, 2012 at 3:23 PM, Al Viro v...@zeniv.linux.org.uk wrote:

 Note that module unload is *not* a hot path - not on any even remotely sane
 use.

Actually, I think we've had distributions that basically did a load
pretty much everything, and let God sort it out approach to modules.
I know some people *have* actually worried about module load/unload
performance.

Whether it is remotely sane I'm not going to argue for, but ..

Linus
--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC, PATCH, RESEND] fs: push rcu_barrier() from deactivate_locked_super() to filesystems

2012-06-08 Thread Linus Torvalds
On Fri, Jun 8, 2012 at 4:37 PM, Andrew Morton a...@linux-foundation.org wrote:

 So how about open-coding the rcu_barrier() in btrfs and gfs2 for the
 non-inode caches (which is the appropriate place), and hand the inode
 cache over to the vfs for treatment (which is the appropriate place).

The thing is, none of the inode caches are really up to the VFS. They
are per-filesystem caches, that just *embed* an inode as part of the
bigger ext4_inode or whatever. But apart from the fact that the
embedded inode requires them to then use the proper call_rcu() stuff
to do the delayed free, they really are pretty much filesystem data
structures. The VFS layer can neither free them or allocate them,
since the VFS layer doesn't even know how big the structures are, or
where the inodes are embedded, or how to initialize them (or even when
to allocate them).

Of course, if you just mean having a VFS wrapper that does

static void vfs_inode_kmem_cache_destroy(struct kmem_cache *cachep)
{
rcu_barrier();
kmem_cache_destroy(cachep);
}

then we could do that. Not much better than what Kirill's patch did,
but at least we could have that comment in just one single place.

Linus
--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [GIT PULL] Btrfs fixes and features

2012-03-30 Thread Linus Torvalds
On Fri, Mar 30, 2012 at 10:51 AM, Chris Mason chris.ma...@oracle.com wrote:

 git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs.git for-linus

This causes a new warning for me:

  fs/btrfs/extent_io.c: In function ‘repair_eb_io_failure’:
  fs/btrfs/extent_io.c:1940:6: warning: ‘ret’ may be used
uninitialized in this function

Hmm?

Linus
--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [GIT PULL] Btrfs fixes and features

2012-03-30 Thread Linus Torvalds
On Fri, Mar 30, 2012 at 12:50 PM, Linus Torvalds
torva...@linux-foundation.org wrote:

 This causes a new warning for me:

  fs/btrfs/extent_io.c: In function ‘repair_eb_io_failure’:
  fs/btrfs/extent_io.c:1940:6: warning: ‘ret’ may be used
 uninitialized in this function

 Hmm?

Ok, so presumably num_pages (which is num_extent_pages(eb-start,
eb-len)) cannot be zero, so I guess the code is ok. But gcc can't
know that, and it's an annoying warning.

So please fix, but it's not urgent. In the meantime I've pulled and pushed out.

Linus
--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v2] Btrfs: use do_div to avoid compile errors on 32bit box

2011-08-20 Thread Linus Torvalds
On Sat, Aug 20, 2011 at 5:21 AM, Josef Bacik jo...@redhat.com wrote:

 I think Linus was less complaining about how you're dividing here and
 more about the fact that you are.  A divide by 2 is the same as a  1.
  I'll send a patch to fix this.  Thanks,

Indeed. A single-bit 64-bit double shift may be a few cycles, but
it's still pretty damn cheap. Especially when compared to a 64x32 bit
divide. We absolutely don't want to do do_div() in order to divide
by the constant 2.

 Linus
--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [GIT PULL] Btrfs updates

2011-06-12 Thread Linus Torvalds
On Sun, Jun 12, 2011 at 4:57 AM, Chris Mason chris.ma...@oracle.com wrote:

 The for-linus branch of the btrfs unstable tree:

Chris, this is getting ridiculous.

You guys need to start honoring the merge window. None of these big
pulls afterwards. If the code wasn't ready, it damn well shouldn't
have been pushed to me in the first place.

I pulled this round, but next pull request had better make it very
clear that it's only minimal *regression* fixes. Nothing more.

Linus
--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [GIT PULL] Btrfs updates for 2.6.39

2011-03-28 Thread Linus Torvalds
On Mon, Mar 28, 2011 at 5:08 AM, Chris Mason chris.ma...@oracle.com wrote:

 Linus, I've pushed out two branches for you, for-linus and
 for-linus-unmerged

Thanks. I took the unmerged one, because I do like to see what's going
on. But I did end up verifying the end result against your merge, and
it ended up identical, so we're all good.

   Linus
--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 2/2] mutex: Apply adaptive spinning on mutex_trylock()

2011-03-24 Thread Linus Torvalds
On Thu, Mar 24, 2011 at 8:39 PM, Steven Rostedt rost...@goodmis.org wrote:

 But now, mutex_trylock(B) becomes a spinner too, and since the B's owner
 is running (spinning on A) it will spin as well waiting for A's owner to
 release it. Unfortunately, A's owner is also spinning waiting for B to
 release it.

 If both A and B's owners are real time tasks, then boom! deadlock.

Hmm. I think you're right. And it looks pretty fundamental - I don't
see any reasonable approach to avoid it.

I think the RT issue is a red herring too - afaik, you can get a
deadlock with two perfectly normal processes too. Of course, for
non-RT tasks, any other process will eventually disturb the situation
and you'd get kicked out due to need_resched(), but even that might be
avoided for a long time if there are other CPU's - leading to tons of
wasted CPU time.

   Linus
--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC PATCH] mutex: Apply adaptive spinning on mutex_trylock()

2011-03-23 Thread Linus Torvalds
On Wed, Mar 23, 2011 at 8:37 AM, Tejun Heo t...@kernel.org wrote:

 Currently, mutex_trylock() doesn't use adaptive spinning.  It tries
 just once.  I got curious whether using adaptive spinning on
 mutex_trylock() would be beneficial and it seems so, at least for
 btrfs anyway.

Hmm. Seems reasonable to me. The patch looks clean, although part of
that is just the mutex_spin() cleanup that is independent of actually
using it in trylock.

So no objections from me.

Linus
--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [GIT PULL] Btrfs updates for 2.6.37

2010-10-30 Thread Linus Torvalds
On Sat, Oct 30, 2010 at 6:51 AM, Chris Mason chris.ma...@oracle.com wrote:

 There were some minor conflicts with Linus' current tree, so my branch
 is merged with Linus' tree as of this morning.

Gaah. Please don't do this. Unless it's a _really_ messy merge, I
really do want to do the merge. It's fine to have an alternate
pre-merged branch for me to compare against, but please do that
separately.

So what I did was to just instead merge the state before your merge,
and in the process I:

 (a) noticed that your merge was incorrect (you had left around a
unused error: label in btrfs_mount()), since I did use your merge as
something to compare against (see above). That label had been removed
in your branch by  commit 0e78340f3c1f, but your merge resurrected it.

 (b) saw just how horribly nasty your writeback_inodes_sb() end result
was, and decided to clean up the estimation of dirty pages in order to
not end up with the function call argument from hell.

Now, it's obviously totally possible that I screwed things up entirely
in the process, but as mentioned elsewhere, I do feel that actually
seeing the merge conflicts really does help me get a feel for what I'm
merging, and what the points of conflict are.

And yes, maybe it's just me showing my insecurities again. I have
various mental hangups, and liking to feel like I know roughly what is
going on is one of them. Doing the merges and looking at the code that
clashes makes me feel like I have some kind of awareness of how things
are interacting in the development process.

   Linus
--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [GIT PULL] Btrfs updates for 2.6.35

2010-06-11 Thread Linus Torvalds
On Fri, Jun 11, 2010 at 12:37 PM, Chris Mason chris.ma...@oracle.com wrote:

 The master branch of the btrfs-unstable tree is a collection of fixes
 and cleanups, including two btrfs regressions from rc1:

Ok, no pulling then. See all the millions of threads how I wanted only
critical fixes for -rc3 since I'll be offline.

You have a couple of hours for a minimal fix pull request with just
the regression fixes if you want to hit -rc3. Then I'll cut a release
and be gone for a while.

  Linus
--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [GIT PULL] Btrfs updates for 2.6.35

2010-06-11 Thread Linus Torvalds
On Fri, Jun 11, 2010 at 12:48 PM, Chris Mason chris.ma...@oracle.com wrote:

 The others all fix oopsen or big problems, and I think fixing warnings
 helps avoid false negatives as others look for real problems?

 I'm happy to rebase out the 3 non-criticals.

There seems to be more than three non-criticals. There's the warning
fixes, the unused variables thing, the memdup_user() thing, a
couple of unnecessary NULL checks removed etc. On the whole, I do not
get the feeling that the pull request was actively trying to be
minimal, and that's what I really want to see.

  Linus
--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [GIT PULL] Btrfs updates

2010-05-27 Thread Linus Torvalds


On Thu, 27 May 2010, Chris Mason wrote:
 
  fs/btrfs/extent-tree.c  | 2317 
 
  fs/btrfs/relocation.c   | 1991 +
  fs/btrfs/inode.c| 1797 +
  fs/btrfs/file.c |  304 +++---
  fs/btrfs/tree-log.c |  241 +++-
 ...

What kind of bogus diffstat is this? 

Don't do that. I cannot compare your bogus diffstat with what I get, 
because it's just random noise. You've apparently sorted it by size of the 
damage, but the numbers are total crap too. That is _not_ the actual size 
of the changes at all.

I suspect you have some script that adds up all the patches, but that's 
wrong. If a subsequent patch changes the things that an earlier patch has 
done, then the numbers don't just add up.

And if the diffstat doesn't match what I get when I pull, then my reaction 
inevitably is ok, that's not what they asked me to pull, so I'll just 
reject it out-of-hand. So don't play games with diffstats - that just 
means that things won't get pulled.

Linus
--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [GIT PULL] Btrfs updates

2010-05-27 Thread Linus Torvalds


On Thu, 27 May 2010, Chris Mason wrote:
 
 # git diff v2.6.34 HEAD | diffstat

That still has the potential to be wrong (but got the numbers I expected 
this time). It will be wrong in several cases:

 - diffstat has some random common prefix removal logic that I've never 
   figured out the exact rules for.

   You don't happen to see it, because you actually had changes outside of 
   fs/btrfs (so there was no common prefix to worry about), but if 
   everything had been inside fs/btrfs, then at least some versions of 
   diffstat will just remove that whole thing as common, and talk about 
   changes to 'ioctl.c' rather than 'fs/btrfs/ioctl.c'

   (And no, I'm not entirely sure what triggers it. It might only happen 
   for certain patterns and/or certain versions of diffstat, but it's a 
   reason to avoid diffstat in general, or at least use -p1 to make it 
   reliable)

 - it will do the wrong thing for renames and copies.

 - it doesn't give the summary that pull-request does, which talks about 
   new, deleted and file-mode-changed files.

So just do

git diff --stat --summary -M v2.6.34..HEAD

instead, which gets all the above cases right. Also, you don't even need 
to remember where you started - you might as well use git to do that too, 
and write it as (assuming you have an 'origin' branch that points to the 
upstream tree):

git diff --stat --summary -M origin...HEAD

(note the *three* dots: that says that you should diff against the common 
ancestor, so git will just figure out where you started on its own).

Linus
--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


  1   2   >