Re: [GIT PULL] Follow up block fix

2018-12-06 Thread Linus Torvalds
On Thu, Dec 6, 2018 at 6:12 PM Jens Axboe  wrote:
>
>
> Linus, I just now noticed you are not on the CC for the discussion about
> the patch. Don't pull this one yet. It'll solve the issue, but it'll also
> mess up the BUSY feedback loop that DM relies on for good merging of
> sequential IO. Testing a new patch now, will push it to you when folks
> are happy and when my testing has completed.

Ok, this pull request dropped from my queue.

   Linus


smime.p7s
Description: S/MIME Cryptographic Signature


Re: [GIT PULL] Block fixes for 4.20-rc2

2018-11-09 Thread Linus Torvalds
On Fri, Nov 9, 2018 at 1:29 PM Jens Axboe  wrote:
>
> A select set of fixes that should go into this release. This pull
> request contains:

This is part of a final few "ack" emails, pointing out that there is
now automation in place if you cc lkml in your pull request.

That automation will parse the pull request and automatically send an
ack when the end result gets merged into mainline, so I'll stop the
manual process.

But it currently only works on lkml (I think - maybe Konstantin
enabled it for all lists that get tracked by lore.kernel.org), so this
pull request probably won't get an automated response due to just
going to linux-block.

Just a heads-up.

   Linus


Re: [GIT PULL] Final block merge window changes/fixes

2018-11-02 Thread Linus Torvalds
On Fri, Nov 2, 2018 at 10:08 AM Jens Axboe  wrote:
>
> The biggest part of this pull request is the revert of the blkcg cleanup
> series. It had one fix earlier for a stacked device issue, but another
> one was reported. Rather than play whack-a-mole with this, revert the
> entire series and try again for the next kernel release.
>
> Apart from that, only small fixes/changes.

Pulled,

  Linus


Re: [GIT PULL] Followup block changes for 4.20-rc1

2018-10-26 Thread Linus Torvalds
On Fri, Oct 26, 2018 at 9:00 AM Jens Axboe  wrote:
>
> A followup pull request for this merge window.

Pulled,

   Linus


Re: [GIT PULL] Follow up block changes for this merge window

2018-08-22 Thread Linus Torvalds
On Wed, Aug 22, 2018 at 10:54 AM Jens Axboe  wrote:
>
> - Set of bcache fixes and changes (Coly)

Some of those bcache style fixes look questionable.

Maybe we should push back on some of the checkpatch rules instead?

Like having argument names in declarations - sometimes descriptive
names can be good documentation. And sometimes they are just noise,
because the type is what describes it.

Oh well. Taken.

   Linus


Re: [GIT PULL] Block changes for 4.18-rc

2018-06-04 Thread Linus Torvalds
On Mon, Jun 4, 2018 at 5:56 PM Kent Overstreet
 wrote:
>
> I like your patch for a less invasive version, but I did finish and test my
> version, which deletes more code :)

I certainly won't object to that.

Your patch looks fine, and looks like the right thing in the long run anyway.

Plus, it's apparently even tested. What a concept.

Regardless, I'll sleep better tonight that this will all be fixed one
way or the other.

Thanks,

Linus


Re: [GIT PULL] Block changes for 4.18-rc

2018-06-04 Thread Linus Torvalds
On Mon, Jun 4, 2018 at 5:42 PM Linus Torvalds
 wrote:
>
> How about just the attached?

Note: it probably goes without saying that the patch was entirely
untested, but it does build, and it does get rid of the insane stack
use.

Linus


Re: [GIT PULL] Block changes for 4.18-rc

2018-06-04 Thread Linus Torvalds
On Mon, Jun 4, 2018 at 12:04 PM Kent Overstreet
 wrote:
>
> However, that's not correct as is because mddev_delayed_put() calls
> kobject_put(), and the kobject isn't initialized when the mddev is first
> allocated, it's initialized when the gendisk is allocated... that isn't hard
> to fix but that's getting into real refactoring that I'll need to put actual
> work into testing.

Well, it also removes the bioset_exit() calls entirely.

How about just the attached?

It simply does it as two different cases, and adds the bioset_exit()
calls to mddev_delayed_delete().

Hmm?

 Linus
 drivers/md/md.c | 23 ++++++++++-------------
 1 file changed, 10 insertions(+), 13 deletions(-)

diff --git a/drivers/md/md.c b/drivers/md/md.c
index fc692b7128bb..6a2494065ab2 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -510,11 +510,6 @@ static void mddev_delayed_delete(struct work_struct *ws);
 
 static void mddev_put(struct mddev *mddev)
 {
-	struct bio_set bs, sync_bs;
-
-	memset(&bs, 0, sizeof(bs));
-	memset(&sync_bs, 0, sizeof(sync_bs));
-
 	if (!atomic_dec_and_lock(&mddev->active, &all_mddevs_lock))
 		return;
 	if (!mddev->raid_disks && list_empty(&mddev->disks) &&
@@ -522,10 +517,6 @@ static void mddev_put(struct mddev *mddev)
 		/* Array is not configured at all, and not held active,
 		 * so destroy it */
 		list_del_init(&mddev->all_mddevs);
-		bs = mddev->bio_set;
-		sync_bs = mddev->sync_set;
-		memset(&mddev->bio_set, 0, sizeof(mddev->bio_set));
-		memset(&mddev->sync_set, 0, sizeof(mddev->sync_set));
 		if (mddev->gendisk) {
 			/* We did a probe so need to clean up.  Call
 			 * queue_work inside the spinlock so that
@@ -534,12 +525,15 @@ static void mddev_put(struct mddev *mddev)
 			 */
 			INIT_WORK(&mddev->del_work, mddev_delayed_delete);
 			queue_work(md_misc_wq, &mddev->del_work);
-		} else
-			kfree(mddev);
+			spin_unlock(&all_mddevs_lock);
+			return;
+		}
 	}
 	spin_unlock(&all_mddevs_lock);
-	bioset_exit(&bs);
-	bioset_exit(&sync_bs);
+
+	bioset_exit(&mddev->bio_set);
+	bioset_exit(&mddev->sync_set);
+	kfree(mddev);
 }
 
 static void md_safemode_timeout(struct timer_list *t);
@@ -5234,6 +5228,9 @@ static void mddev_delayed_delete(struct work_struct *ws)
 {
 	struct mddev *mddev = container_of(ws, struct mddev, del_work);
 
+	bioset_exit(&mddev->bio_set);
+	bioset_exit(&mddev->sync_set);
+
 	sysfs_remove_group(&mddev->kobj, &md_bitmap_group);
 	kobject_del(&mddev->kobj);
 	kobject_put(&mddev->kobj);


Re: [GIT PULL] Block changes for 4.18-rc

2018-06-04 Thread Linus Torvalds
On Mon, Jun 4, 2018 at 11:20 AM Tejun Heo  wrote:
>
>
> Looking at the code, the fundamental problem seems to be that it's
> weaving different parts of sync and async paths.  I don't understand
> why it'd punt the destruction of mddev but destroy biosets
> synchronously.  Can't it do sth like the following?

Yeah, that looks like the right thing to do.

Jens/Kent?

 Linus


Re: [GIT PULL] Block changes for 4.18-rc

2018-06-04 Thread Linus Torvalds
On Mon, Jun 4, 2018 at 8:54 AM Jens Axboe  wrote:
>
> On 6/4/18 9:51 AM, Linus Torvalds wrote:
> >
> > Why the hell doesn't it just do bioset_exit() on the originals instead,
> > before freeing the mddev?
>
> CC'ing Neil to get his input on how best to clean that up, I'd
> be much more comfortable with that logic improved rather than
> just catering to the stack usage issue. Also include Arnd, as
> he had a test patch for it.

Also adding Tejun, because the only reason for that odd sequencing
seems to be that

 - we want to queue the deletion work *inside* the spinlock, because
it actually has synchronization between the workqueue and the
'all_mddevs_lock' spinlock.

 - we want to bioset_exit() the fields *outside* the spinlock.

So what it does now is to copy those big 'struct bio_set' fields
because they might go away from under it.

Tejun, the code in question is mddev_put() in drivers/md/md.c, and it
basically does

INIT_WORK(&mddev->del_work, mddev_delayed_delete);
queue_work(md_misc_wq, &mddev->del_work);

inside a spinlock, but then wants to do some stuff *outside* the
spinlock before that mddev_delayed_delete() thing actually gets
called.

Is there some way to "half-queue" the work - enough that a
flush_workqueue() will wait for it, but still guaranteeing that it
won't actually run until released?

IOW, what I think that code really wants to do is perhaps something like

/* under the all_mddevs_lock spinlock */
.. remove from all lists so that it can't be found ..

.. but add it to the md_misc_wq so that people will wait for it ..

INIT_WORK_LOCKED(&mddev->del_work, mddev_delayed_delete);
queue_work(md_misc_wq, &mddev->del_work);
...

spin_unlock(&all_mddevs_lock);

/* mddev is still guaranteed live */
bioset_exit(&mddev->bio_set);
bioset_exit(&mddev->sync_set);

/* NOW we can release it */
if (queued)
	unlock_work(&mddev->del_work);
else
	kfree(mddev);

or something like that?

The above is *ENTIRELY* hand-wavy garbage - don't take it seriously
per se, consider it just pseudo-code for documentation reasons, not
serious in any other way.

 Linus


Re: [GIT PULL] Block changes for 4.18-rc

2018-06-04 Thread Linus Torvalds
On Sun, Jun 3, 2018 at 5:42 PM Jens Axboe  wrote:
>
>  drivers/md/md.c |  61 +--
>  drivers/md/md.h |   4 +-

So I've pulled this, but I get a new warning:

  drivers/md/md.c: In function ‘mddev_put’:
  drivers/md/md.c:543:1: warning: the frame size of 2112 bytes is
larger than 2048 bytes [-Wframe-larger-than=]

which seems to be due to commit afeee514ce7f ("md: convert to
bioset_init()/mempool_init()").

As of that commit, mddev_put() now allocates *two* "struct bio_set"
structures on the stack, and those really aren't small. Yes, part of
it is that I do my test-builds with all that debug stuff, but those
structures have several embedded mempool_t members, and they really
are pretty sizable.

And I _really_ do not think it's acceptable to have that kind of stack
usage in there. I'm not sure you can trigger mddev_put() in the IO
paths, but even without that I don't think it's acceptable.

Also, the code simply looks like complete and utter garbage. It does

bs = mddev->bio_set;
sync_bs = mddev->sync_set;

to _copy_ those structures, and then does bioset_exit() on them. WTF?

Why the hell doesn't it just do bioset_exit() on the originals instead,
before freeing the mddev?

I've pulled this, but this needs to be fixed. That is just broken
garbage right now. No way in hell is it acceptable to waste 2kB of
stack space on something idiotic like this.

 Linus


Re: [PATCH v2 01/26] rculist: introduce list_next_or_null_rr_rcu()

2018-05-22 Thread Linus Torvalds
On Tue, May 22, 2018 at 2:09 AM Roman Penyaev <
roman.peny...@profitbricks.com> wrote:

> Should I resend current patch with more clear comments about how careful
> caller should be with a leaking pointer?

No. Even if we go your way, there is *one* single user, and that one is
special and needs to take a lot more care.

Just roll your own version, and make it an inline function like I've asked
now now many times, and add a shit-ton of explanations of why it's safe to
use in that *one* situation.

I don't want any crazy and unsafe stuff in the generic header file that
absolutely *nobody* should ever use.

  Linus


Re: [PATCH v2 01/26] rculist: introduce list_next_or_null_rr_rcu()

2018-05-21 Thread Linus Torvalds
On Mon, May 21, 2018 at 6:51 AM Roman Penyaev <
roman.peny...@profitbricks.com> wrote:

> No, I continue from the pointer, which I assigned on the previous IO
> in order to send IO fairly and keep load balanced.

Right. And that's exactly what has both me and Paul nervous. You're no
longer in the RCU domain. You're using a pointer where the lifetime has
nothing to do with RCU any more.

Can it be done? Sure. But you need *other* locking for it (that you haven't
explained), and it's fragile as hell.

It's probably best to not use RCU for it at all, but depend on that "other
locking" that you have to have anyway, to keep the pointer valid over the
non-RCU region.
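
(Purely as an illustration - the names below are invented, this is not
from the patch set - the cursor can just live under whatever lock
already protects list membership, with no RCU involved at all:)

	/* Sketch only: 'clt', 'conn_list_mutex', 'last_conn', 'entry' and
	 * struct rdma_conn are made-up names.  The insert/remove paths take
	 * the same mutex, so the saved cursor cannot go away between
	 * requests. */
	struct rdma_conn *conn;

	mutex_lock(&clt->conn_list_mutex);
	if (list_empty(&clt->conn_list))
		conn = NULL;
	else if (!clt->last_conn ||
		 list_is_last(&clt->last_conn->entry, &clt->conn_list))
		conn = list_first_entry(&clt->conn_list, struct rdma_conn, entry);
	else
		conn = list_next_entry(clt->last_conn, entry);
	clt->last_conn = conn;
	mutex_unlock(&clt->conn_list_mutex);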

Linus


Re: [PATCH v2 01/26] rculist: introduce list_next_or_null_rr_rcu()

2018-05-19 Thread Linus Torvalds
On Sat, May 19, 2018 at 1:25 PM Roman Penyaev <
roman.peny...@profitbricks.com> wrote:

> Another one list_for_each_entry_rcu()-like macro I am aware of is used in
> block/blk-mq-sched.c, is called list_for_each_entry_rcu_rr():


https://elixir.bootlin.com/linux/v4.17-rc5/source/block/blk-mq-sched.c#L370

> Can we do something generic with -rr semantics to cover both cases?

That loop actually looks more like what Paul was asking for, and makes it
(perhaps) a bit more obvious that the whole loop has to be done under the
same RCU read sequence that looked up that first 'skip' entry.

(Again, stronger locking than RCU is obviously also acceptable for the
"look up skip entry").

But another reason I really dislike that list_next_or_null_rr_rcu() macro
in the patch under discussion is that it's *really* not the right way to
skip one entry. It may work, but it's really ugly. Again, the
list_for_each_entry_rcu_rr() in blk-mq-sched.c looks better in that regard,
in that the skipping seems at least a _bit_ more explicit about what it's
doing.

And again, if you make this specific to one particular list (and it really
likely is just one particular list that wants this), you can use a nice
legible helper inline function instead of the macro with the member name.

Don't get me wrong - I absolutely adore our generic list handling macros,
but I think they work because they are simple. Once we get to "take the
next entry, but skip it if it's the head entry, and then return NULL if you
get back to the entry you started with" kind of semantics, an inline
function that takes a particular list and has a big comment about *why* you
want those semantics for that particular case sounds _much_ better to me
than adding some complex "generic" macro for a very very unusual special
case.
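
(To be concrete - and this is just a sketch with invented names, not
something from the series - such a list-specific helper might look
like:)

	/*
	 * Sketch, made-up names.  Round-robin step over the client's RDMA
	 * connection list: take the entry after 'cur', skip the list head,
	 * and return NULL once we are back where we started.
	 *
	 * The caller must guarantee - by its own locking, not merely by
	 * being inside an RCU read section - that 'cur' is still on
	 * clt->conn_list when this is called.
	 */
	static inline struct rdma_conn *rdma_conn_next_rr(struct rdma_clt *clt,
							  struct rdma_conn *cur)
	{
		struct list_head *next = READ_ONCE(cur->entry.next);

		if (next == &clt->conn_list)	/* skip the list head itself */
			next = READ_ONCE(next->next);
		if (next == &cur->entry)	/* wrapped all the way around */
			return NULL;
		return list_entry(next, struct rdma_conn, entry);
	}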

  Linus


Re: [PATCH v2 01/26] rculist: introduce list_next_or_null_rr_rcu()

2018-05-19 Thread Linus Torvalds
On Sat, May 19, 2018 at 1:21 PM Roman Penyaev <
roman.peny...@profitbricks.com> wrote:

> I need -rr behaviour for doing IO load-balancing when I choose next RDMA
> connection from the list in order to send a request, i.e. my code is
> something like the following:
[ incomplete pseudocode ]
> i.e. usage of the @next pointer is under an RCU critical section.

That's not enough. The whole chain to look up the pointer you are taking
'next' of needs to be under RCU, and that's not clear from your example.

It's *probably* the case, but basically you have to prove that the starting
point is still on the same RCU list. That wasn't clear from your example.

The above is (as Paul said) true of list_next_rcu() too, so it's not like
this is anything specific to the 'rr' version.

  Linus


Re: [PATCH v2 01/26] rculist: introduce list_next_or_null_rr_rcu()

2018-05-18 Thread Linus Torvalds
On Fri, May 18, 2018 at 6:07 AM Roman Pen 
wrote:

> Function is going to be used in transport over RDMA module
> in subsequent patches.

Does this really merit its own helper macro in a generic header?

It honestly smells more like "just have an inline helper function that is
specific to rdma" to me. Particularly since it's probably just one specific
list where you want this oddly specific behavior.

Also, if we really want a round-robin list traversal macro, this isn't the
way it should be implemented, I suspect, and it probably shouldn't be
RCU-specific to begin with.

Side note: I notice that I should already  have been more critical of even
the much simpler "list_next_or_null_rcu()" macro. The "documentation"
comment above the macro is pure and utter cut-and-paste garbage.

Paul, mind giving this a look?

 Linus


Re: INFO: task hung in wb_shutdown (2)

2018-05-01 Thread Linus Torvalds
On Tue, May 1, 2018 at 3:27 AM Tetsuo Handa <
penguin-ker...@i-love.sakura.ne.jp> wrote:

> Can you review this patch? syzbot has hit this bug for nearly 4000 times
but
> is still unable to find a reproducer. Therefore, the only way to test
would be
> to apply this patch upstream and test whether the problem is solved.

Looks ok to me, except:

> >   smp_wmb();
> >   clear_bit(WB_shutting_down, &wb->state);
> > + smp_mb(); /* advised by wake_up_bit() */
> > + wake_up_bit(&wb->state, WB_shutting_down);

This whole sequence really should just be a pattern with a helper function.

And honestly, the pattern probably *should* be

 clear_bit_unlock(bit, &word);
 smp_mb__after_atomic();
 wake_up_bit(&word, bit);

which looks like it is a bit cleaner wrt memory ordering rules.
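
Something like this, say (the helper name here is just made up, to show
the shape of it):

	static inline void clear_and_wake_up_bit(int bit, unsigned long *word)
	{
		clear_bit_unlock(bit, word);
		/* order the bit clear before the waiter check inside wake_up_bit() */
		smp_mb__after_atomic();
		wake_up_bit(word, bit);
	}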

 Linus


Re: limits->max_sectors is getting set to 0, why/where? [was: Re: dm: kernel oops by divide error on v4.16+]

2018-04-09 Thread Linus Torvalds
On Mon, Apr 9, 2018 at 3:32 PM, Jens Axboe  wrote:
>
> The resulting min/max and friends would have been trivial to test, but
> clearly they weren't.

Well, the min/max macros themselves actually were tested in user space by me.

It was the interaction with the unrelated "min_not_zero()" that wasn't ;)

It's easy in hind-sight to say "that's not at all unrelated", but
within the context of doing min/max, it was.

  Linus


Re: [PATCH v2 00/10] Copy Offload in NVMe Fabrics with P2P PCI Memory

2018-03-02 Thread Linus Torvalds
On Fri, Mar 2, 2018 at 8:57 AM, Linus Torvalds
<torva...@linux-foundation.org> wrote:
>
> Like the page table caching entries, the memory type range registers
> are really just "secondary information". They don't actually select
> between PCIe and RAM, they just affect the behavior on top of that.

Side note: historically the two may have been almost the same, since
the CPU only had one single unified bus for "memory" (whether that was
memory-mapped PCI or actual RAM). The steering was external.

But even back then you had extended bits to specify things like how
the 640k-1M region got remapped - which could depend on not just the
address, but on whether you read or wrote to it.  The "lost" 384kB of
RAM could either be remapped at a different address, or could be used
for shadowing the (slow) ROM contents, or whatever.

  Linus


Re: [PATCH v2 00/10] Copy Offload in NVMe Fabrics with P2P PCI Memory

2018-03-02 Thread Linus Torvalds
On Fri, Mar 2, 2018 at 8:22 AM, Kani, Toshi  wrote:
>
> FWIW, this thing is called MTRRs on x86, which are initialized by BIOS.

No.

Or rather, that's simply just another (small) part of it all - and an
architected and documented one at that.

Like the page table caching entries, the memory type range registers
are really just "secondary information". They don't actually select
between PCIe and RAM, they just affect the behavior on top of that.

The really nitty-gritty stuff is not architected, and generally not
documented outside (possibly) the BIOS writer's guide that is not made
public.

Those magical registers contain details like how the DRAM is
interleaved (if it is), what the timings are, where which memory
controller handles which memory range, and what goes to PCIe etc.

Basically all the actual *steering* information is very much hidden
away from the kernel (and often from the BIOS too). The parts we see
at a higher level are just tuning and tweaks.

Note: the details differ _enormously_ between different chips. The
setup can be very different, with things like Knights Landing having
the external cache that can also act as local memory that isn't a
cache but maps at a different physical address instead etc. That's the
kind of steering I'm talking about - at a low level how physical
addresses get mapped to different cache partitions, memory
controllers, or to the IO system etc.

  Linus


Re: [PATCH v2 00/10] Copy Offload in NVMe Fabrics with P2P PCI Memory

2018-03-01 Thread Linus Torvalds
On Thu, Mar 1, 2018 at 2:06 PM, Benjamin Herrenschmidt  wrote:
>
> Could be that x86 has the smarts to do the right thing, still trying to
> untangle the code :-)

Afaik, x86 will not cache PCI unless the system is misconfigured, and
even then it's more likely to just raise a machine check exception
than cache things.

The last-level cache is going to do fills and spills directly to the
memory controller, not to the PCIe side of things.

(I guess you *can* do things differently, and I wouldn't be surprised
if some people inside Intel did try to do things differently with
trying nvram over PCIe, but in general I think the above is true)

You won't find it in the kernel code either. It's in hardware with
firmware configuration of what addresses are mapped to the memory
controllers (and _how_ they are mapped) and which are not.

You _might_ find it in the BIOS, assuming you understood the tables
and had the BIOS writer's guide to unravel the magic registers.

But you might not even find it there. Some of the memory unit timing
programming is done very early, and by code that Intel doesn't even
release to the BIOS writers except as a magic encrypted blob, afaik.
Some of the magic might even be in microcode.

The page table settings for cacheability are more like a hint, and
only _part_ of the whole picture. The memory type range registers are
another part. And magic low-level uarch, northbridge and memory unit
specific magic is yet another part.

So you can disable caching for memory, but I'm pretty sure you can't
enable caching for PCIe at least in the common case. At best you can
affect how the store buffer works for PCIe.

  Linus


Re: [GIT PULL] Block changes for 4.16-rc

2018-01-29 Thread Linus Torvalds
On Mon, Jan 29, 2018 at 12:08 PM, Jens Axboe  wrote:
>
> Yes of course, I can switch to using signed tags for pull requests
> for you.

Thanks. I don't actually know the maintainership status of kernel.dk.

You show up as the 'whois' contact, but I'm also assuming it doesn't
have the same kind of fairly draconian access control as the
kernel.org suite of sites, and I'm actively trying to make sure we
either use signed tags or the kernel.org repos (and honestly, even
with the kernel.org ones I tend to prefer signed tags).

Right now I only _require_ it for public hosting, but I'm trying to
move to just signed tags being the default everywhere.

  Linus


Re: [GIT PULL] Block changes for 4.16-rc

2018-01-29 Thread Linus Torvalds
On Mon, Jan 29, 2018 at 7:40 AM, Jens Axboe  wrote:
>
> This is the main pull request for block IO related changes for the 4.16
> kernel. Nothing major in this pull request, but a good amount of
> improvements and fixes all over the map. This pull request contains:

Pulled.

However, can I request that you please start using signed tags?

I know you have a gpg key and know how to do signed tags, because in
the almost four hundred pulls that I've done from you over the many
years, I've actually got three such pull requests from you.

Can we make that percentage go up a bit? Pretty please?

  Linus


Re: [PATCH] locking/lockdep: Add CONFIG_LOCKDEP_AGGRESSIVE

2017-12-12 Thread Linus Torvalds
On Mon, Dec 11, 2017 at 9:20 PM, Byungchul Park  wrote:
>
> The *problem* is false positives, since locks and waiters in
> kernel are not classified properly

So the problem is that those false positives apparently end up being a
big deal for the filesystem people.

I personally don't think the code itself has to be removed, but I do
think that it should never have been added on as part of the generic
lock proving, and should always have been a separate config option.

I also feel that you dismiss "false positives" much too easily. A
false positive is a big problem - because it makes people ignore the
real cases (or just disable the functionality entirely).

It's why I am very quick to disable compiler warnings that have false
positives, for example. Just a couple of "harmless" false positive
warnings will poison the real warnings for people because they'll get
used to seeing warnings while building, and no longer actually look at
them.

 Linus


Re: [PATCH] locking/lockdep: Add CONFIG_LOCKDEP_AGGRESSIVE

2017-12-11 Thread Linus Torvalds
On Sun, Dec 10, 2017 at 7:50 PM, Theodore Ts'o  wrote:
> CONFIG_LOCKDEP_CROSSRELEASE and CONFIG_LOCKDEP_COMPLETIONS can result
> in a large number of false positives because lockdep doesn't
> understand how to deal with multiple stacked loop or MD devices.

Guys, can we just remove this nasty crud already?

It's not working. Give it up. It was complex, it was buggy, it was slow.

Now it's causing people to disable lockdep entirely, or play these
kinds of games in unrelated trees.

It's time to give up on bad debugging, and definitely _not_ enable it
by default for PROVE_LOCKING.

Ted: in the meantime, don't use PROVE_LOCKING. Just use
DEBUG_LOCK_ALLOC instead. Please drop this patch from your tree.

  Linus


Re: [GIT PULL] Followup merge window block fixes/changes

2017-11-17 Thread Linus Torvalds
On Fri, Nov 17, 2017 at 12:18 PM, Paolo Valente
 wrote:
>
> Sorry for causing this problem.  Yours was our first version, but then
> we feared that leaving useless instructions was worse than adding a
> burst of ifdefs. I'll try not to repeat this mistake.

So I generally do not want people to depend on the compiler to do
non-obvious conversions (so please do not write bad code that you just
hope the compiler can sort out and make reasonable), but we do heavily
rely on the compiler doing the obvious dead code removal particularly
for when things end up not being used.

A lot of the time that's very much something you can and should depend
on, because we have abstraction layers that involves a lot of code
just becoming no-ops on many architectures, but other architectures
need some particular sequence.

Another example is a lot of helper functions to do verification of
various kernel rules that then are enabled by some debug option (eg
CONFIG_DEBUG_ATOMIC_SLEEP etc), where we know that (and depend on)
the compiler will do a reasonable job.

That's not so different from having "statistics gathering function
that compiles to nothing".

And yes, sometimes the code might need to be written so that the
compiler has an easier time _seeing_ that the code really goes away,
so it's not necessarily a blind trust in the compiler.

So for example,  the functions that can go away should obviously be
inline functions so that you don't end up having the compiler generate
the arguments and the call to an empty function body, and so that the
compiler can see that "oh, those arguments aren't even used at all, so
now I can remove all the code that generates them".
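
(The usual shape of that, as a sketch - the CONFIG_DEBUG_FOO option and
the names here are obviously made up:)

   struct foo;

   #ifdef CONFIG_DEBUG_FOO
   void debug_foo_check(struct foo *f, unsigned long flags);
   #else
   /* inline and empty: the call and its argument setup just evaporate */
   static inline void debug_foo_check(struct foo *f, unsigned long flags) { }
   #endif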

If you are comfy with assembler, it can actually be a great thing to
occasionally just verify the actual generated code if it's some piece
of code you care deeply about. Obviously for performance work,
profiling happens, and then you see it that way.

Because sometimes you find some silly "Duh!" moments, where you didn't
realize that the compiler wasn't able to really get rid of something
for some reason that is usually very obvious in hindsight, but wasn't
actually obvious at all until you looked at what the compiler
generated.

   Linus


Re: [GIT PULL] Followup merge window block fixes/changes

2017-11-17 Thread Linus Torvalds
On Fri, Nov 17, 2017 at 11:29 AM, Linus Torvalds
<torva...@linux-foundation.org> wrote:
>
> "F*ck no, that code is too ugly, you need to write it so that it can
> be maintained".

Dammit, the rest of the pull looks ok, so I'll take it anyway.

But I really do expect you to

 (a) clean up that mess - maybe with my patch as an example, but maybe
some entirely different way. The patch I sent is already deleted from
my tree, it was purely meant as an example.

 (b) really push back on people when they send you ugly code

It shouldn't have to always be me being upset and pushing back on the
block pulls.

  Linus


Re: [GIT PULL] Followup merge window block fixes/changes

2017-11-17 Thread Linus Torvalds
On Fri, Nov 17, 2017 at 8:51 AM, Jens Axboe  wrote:
>
> - Small BFQ updates series from Luca and Paolo.

Honestly, this code is too ugly to live.

Seriously. That update should have been rejected on the grounds of
being completely unmaintainable crap.

Why didn't you?

Why are you allowing code like that patch to bfq_dispatch_request()
into the kernel source tree?

Yes, it improves performance. But it does so by adding random
#ifdef's into the middle of some pretty critical code, with
absolutely no attempt at having a sane maintainable code-base.

That function is literally *three* lines of code without the #ifdef case:

spin_lock_irq(&bfqd->lock);
rq = __bfq_dispatch_request(hctx);
spin_unlock_irq(&bfqd->lock);

and you could actually see what it did.

But instead of trying to abstract it in some legible manner, that
three-line obvious function got *three* copies of the same #if mess
all enclosing rancom crap, and the end result is really hard to read.

For example, I just spent a couple of minutes trying to make sense of
the code, and untangle that unreadable mess.  I came up with the appended
patch.

It may not work for some reason I can't see, but that's not the point.
The attached patch is not meant as a "apply this as-is".

It's meant as a "look, you can write legible code where the statistics
just go away when they are compiled out".

Now that "obvious three-liner function" is not three lines any more,
but it's at least *clear* straight-line code:

spin_lock_irq(&bfqd->lock);

in_serv_queue = bfqd->in_service_queue;
waiting_rq = in_serv_queue && bfq_bfqq_wait_request(in_serv_queue);

rq = __bfq_dispatch_request(hctx);

idle_timer_disabled =
	waiting_rq && !bfq_bfqq_wait_request(in_serv_queue);

spin_unlock_irq(&bfqd->lock);

bfq_dispatch_statistics(hctx, rq, in_serv_queue, waiting_rq,
			idle_timer_disabled);

and I am  hopeful that when bfq_dispatch_statistics() is disabled, the
compiler will be smart enough to not generate extra code, certainly
not noticeably so.

See? One is a mess of horrible ugly #ifdef's in the middle of the code
that makes the end result completely illegible.

The other is actually somewhat legible, and has a clear separation of
the statistics gathering. And that *statistics* gathering is clearly
optionally enabled or not.

Not the mess of random code put together in random ways that are
completely illegible.

Your job as maintainer is _literally_ to tell people "f*ck no, that
code is too ugly, you need to write it so that it can be maintained".

And I'm doing my job.

"F*ck no, that code is too ugly, you need to write it so that it can
be maintained".

Show some taste.

Linus
 block/bfq-iosched.c | 55 +++++++++++++++++++++++++++++--------------------------
 1 file changed, 29 insertions(+), 26 deletions(-)

diff --git a/block/bfq-iosched.c b/block/bfq-iosched.c
index bcb6d21baf12..47d88d03dadd 100644
--- a/block/bfq-iosched.c
+++ b/block/bfq-iosched.c
@@ -3689,35 +3689,14 @@ static struct request *__bfq_dispatch_request(struct blk_mq_hw_ctx *hctx)
 	return rq;
 }
 
-static struct request *bfq_dispatch_request(struct blk_mq_hw_ctx *hctx)
-{
-	struct bfq_data *bfqd = hctx->queue->elevator->elevator_data;
-	struct request *rq;
-#if defined(CONFIG_BFQ_GROUP_IOSCHED) && defined(CONFIG_DEBUG_BLK_CGROUP)
-	struct bfq_queue *in_serv_queue, *bfqq;
-	bool waiting_rq, idle_timer_disabled;
-#endif
-
-	spin_lock_irq(&bfqd->lock);
-
 #if defined(CONFIG_BFQ_GROUP_IOSCHED) && defined(CONFIG_DEBUG_BLK_CGROUP)
-	in_serv_queue = bfqd->in_service_queue;
-	waiting_rq = in_serv_queue && bfq_bfqq_wait_request(in_serv_queue);
-
-	rq = __bfq_dispatch_request(hctx);
-
-	idle_timer_disabled =
-		waiting_rq && !bfq_bfqq_wait_request(in_serv_queue);
-
-#else
-	rq = __bfq_dispatch_request(hctx);
-#endif
-	spin_unlock_irq(&bfqd->lock);
+static void bfq_dispatch_statistics(struct blk_mq_hw_ctx *hctx, struct request *rq,
+		struct bfq_queue *in_serv_queue, bool waiting_rq, bool idle_timer_disabled)
+{
+	struct bfq_queue *bfqq = rq ? RQ_BFQQ(rq) : NULL;
 
-#if defined(CONFIG_BFQ_GROUP_IOSCHED) && defined(CONFIG_DEBUG_BLK_CGROUP)
-	bfqq = rq ? RQ_BFQQ(rq) : NULL;
 	if (!idle_timer_disabled && !bfqq)
-		return rq;
+		return;
 
 	/*
 	 * rq and bfqq are guaranteed to exist until this function
@@ -3752,8 +3731,32 @@ static struct request *bfq_dispatch_request(struct blk_mq_hw_ctx *hctx)
 		bfqg_stats_update_io_remove(bfqg, rq->cmd_flags);
 	}
 	spin_unlock_irq(hctx->queue->queue_lock);
+}
+#else
+static inline void bfq_dispatch_statistics(struct blk_mq_hw_ctx *hctx, struct request *rq, struct bfq_queue *in_serv_queue, bool waiting_rq, bool idle_timer_disabled) {}
 #endif
 
+static struct request *bfq_dispatch_request(struct blk_mq_hw_ctx *hctx)
+{
+	

Re: [GIT PULL] Block fixes for 4.14-rc2

2017-09-27 Thread Linus Torvalds
On Wed, Sep 27, 2017 at 5:41 AM, Jens Axboe  wrote:
>
> So I reworked the series, to include three prep patches that end up
> killing off free_more_memory(). This means that we don't have to do the
> 1024 -> 0 change in there. On top of that, I added a separate bit to
> manage range cyclic vs non range cyclic flush all work. This means that
> we don't have to worry about the laptop case either.
>
> I think that should quell any of the concerns in the patchset, you can
> find the new series here:
>
> http://git.kernel.dk/cgit/linux-block/log/?h=wb-start-all

Yeah, this I feel more confident about.

It still has some subtle changes (nr_pages is slightly different for
laptop mode, and the whole "we rely on the VM to not return NULL") but
I cannot for the life of me imagine that those changes are noticeable
or meaningful.

So now that last commit that limits the pending flushes is the only
one that seems to matter, and the one you wanted to build up to. It's
not a subtle change, but that's the one that fixes the actual bug you
see, so now the rest of the series really looks like "prep-work for
the bug fix".

Of course, with the change to support range_cycling for the laptop
mode case, there's actually possibly _two_ pending full flushes (the
range-cyclic and the "all" one), so that last changelog may not be
entirely accurate.

Long-term, I would prefer to maybe make the "range_whole" case to
always be range-cyclic, because I suspect that's the better behavior,
but I think that series is the one that changes existing semantics the
least for the existing cases, so I think your approach is better for
now.

As far as I can tell, the cases that do not set range_cyclic right now are:

 - wakeup_flusher_threads()

 - sync_inodes_sb()

and in neither case does that lack of range_cyclic really seem to make
any sense.

It might be a purely historical artifact (an _old_ one: that
range_cyclic logic goes back to 2006, as far as I can tell).

But there might also be something I'm missing.

Anyway, your series looks fine to me.

  Linus


Re: [GIT PULL] Block fixes for 4.14-rc2

2017-09-25 Thread Linus Torvalds
On Mon, Sep 25, 2017 at 2:17 PM, Chris Mason  wrote:
>
> My understanding is that for order-0 page allocations and
> kmem_cache_alloc(buffer_heads), GFP_NOFS is going to either loop forever or
> at the very least OOM kill something before returning NULL?

That should generally be true. We've occasionally screwed up in the
VM, so an explicit GFP_NOFAIL would definitely be best if we then
remove the looping in fs/buffer.c.

>> What is it that triggers that many buffer heads in the first place?
>> Because I thought we'd gotten to the point where all normal file IO
>> can avoid the buffer heads entirely, and just directly work with
>> making bio's from the pages.
>
> We're not triggering free_more_memory().  I ran a probe on a few production
> machines and it didn't fire once over a 90 minute period of heavy load.  The
> main target of Jens' patchset was preventing shrink_inactive_list() ->
> wakeup_flusher_threads() from creating millions of work items without any
> rate limiting at all.

So the two things I reacted to in that patch series were apparently
things that you guys don't even care about.

I reacted to the fs/buffer.c code, and to the change in laptop mode to
not do circular writeback.

The latter is another "it's probably ok, but it can be a subtle
change". In particular, things that re-write the same thing over and
over again can get very different behavior, even when you write out
"all" pages.

And I'm assuming you're not using laptop mode either on your servers
(that sounds insane, but I remember somebody actually ended up using
laptop mode even on servers, simply because they did *not* want the
regular timed writeback model, so it's not quite as insane as it
sounds).

 Linus


Re: [GIT PULL] Block fixes for 4.14-rc2

2017-09-25 Thread Linus Torvalds
On Mon, Sep 25, 2017 at 7:46 AM, Jens Axboe  wrote:
>
> Honestly, what I really wanted to do was kill that call completely.

I can understand why you'd want to, but at the same time it is not
entirely clear that is possible.

So the problem is that we need more memory for buffer head
allocations, and we *have* to get it (since we very likely need the
buffer heads for writeout under low-mem circumstances) - but at the
same time we obviously must not recurse into the filesystem.

The end result being that any synchronous freeing is going to be very
limited (GFP_NOFS really ends up limiting a lot), and so there's a
fairly strong argument that we should then wake up other threads that
*can* do filesystem writeback.

Of course, the problem with _that_ is that it's quite likely that the
flusher threads doing filesystem writeback may in turn get back to
this very same "we need buffer heads".

And that is, I assume, the case that facebook then hits.

So it's kind of a "damned if you do, damned if you don't" situation.

But this is very much also why that free_more_memory() code then

 (a) limits the writeout

 (b) has that "yield()" in it

with the logic being to not cause unnecessarily much writeout, and
hopefully avoid huge latency spikes because of the (potential) looping
over the unsuccessful GFP_NOFS allocations.

But yes, a lot has changed over time. For example, originally
filesystems that used buffer heads used them both for reading and
writing, so the act of reading in a page tended to also populate the
buffer heads for that page.

I think most filesystems these days avoid buffer heads entirely, and
the dynamics of bh allocations have likely changed radically.

So I wouldn't remove the "wakeup_flusher_threads()" call, but maybe it
could be moved to be an unusual fallback case if the direct freeing
doesn't work.

But none of this happens unless that alloc_buffer_head() failed, so at
least one GFP_NOFS allocation has already failed miserably, which is
why that "make other threads that _can_ do filesystem IO try to free
things" exists in the first place.

> If you read the free_more_memory() function, it looks like crap.

I really don't see that.

Think of what the preconditions are:

 - a GFP_NOFS allocation has failed

 - we obviously cannot call back into filesystem code

and that function makes a fair amount of sense. It's _supposed_ to be
an unusual nasty case, after all.

Now,

> What I really wanted to do was kill the
> wakeup_flusher_threads() completely, we want to rely on
> try_to_free_pages() starting targeted writeback instead of some magic
> number of pages across devices.

Yes, hopefully the regular memory allocation should already do the
right thing, but the problem here is that we do need to loop over it,
which it isn't really designed for.

But one option might be to just say "_we_ shouldn't loop, the page
allocator should just loop until it gets a successful end result".

So remove _all_ of the fallback code entirely, and just use __GFP_NOFAIL.

That is in fact what some filesystems use for their own metadata:
you'll find that the

GFP_NOFS | __GFP_NOFAIL

combination is not entirely unheard of, and in fact that buffer code
itself actually uses it for the backing page itself (not the buffer
heads) in grow_dev_page().

So rather than play more games in free_more_memory(), I'd personally
just rather remove that function entirely, and say "the buffer head
allocations simply cannot fail" and leave it to the VM code to do the
right thing.
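
(Concretely - and this is only a sketch of the direction, not a tested
patch - the retry loop in alloc_page_buffers() would then collapse to
something like

	/* GFP_NOFS so we never recurse into the filesystem, __GFP_NOFAIL
	 * so the VM does the retrying for us instead of fs/buffer.c
	 * looping and calling free_more_memory() by hand */
	bh = alloc_buffer_head(GFP_NOFS | __GFP_NOFAIL);

with the NULL-check fallback path and free_more_memory() itself simply
going away.)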

> But I decided that we should probably
> just keep that random flusher wakeup, and tackle that part in a v2 for
> 4.15.

Oh, absolutely, I'd be more than happy to play with this code, but as
mentioned, I actually would do even more radical surgery on it.

So I'm not at all objecting to changing that function in major ways.

I just don't think that changing the flush count is in any way
obvious. In many ways it's a *much* scarier change than adding
__GFP_NOFAIL, in my opinion.

At least with __GFP_NOFAIL, we have some fairly good reason to believe
that now the memory management code won't be doing enormous flushing
work "just because".

So my only real objection was really the timing, and the "it's
obviously a fix".

> The workload is nothing special. Getting to the point where it's bad
> enough to trigger a softlockup probably requires some serious beating up
> of the box, but getting to a suboptimal state wrt work units pending and
> the flusher threads is pretty trivial.

I'm actually surprised that you have what appears to be a buffer-head
heavy workload at all.

What is it that triggers that many buffer heads in the first place?
Because I thought we'd gotten to the point where all normal file IO
can avoid the buffer heads entirely, and just directly work with
making bio's from the pages.

So I assume this is some metadata writeout (ie either directory
writeback or just the btrfs metadata tree)?

 Linus


Re: [GIT PULL] Block fixes for 4.14-rc2

2017-09-24 Thread Linus Torvalds
On Sun, Sep 24, 2017 at 5:03 PM, Jens Axboe  wrote:
>
> NVMe fixes have sometimes been accepted for the current series, while
> they should have been pushed. I'm not going to argue with that, but I
> think we've managed to make that better the last few series. It's still
> not squeaky clean, but it is improving substantially.  I don't believe
> that's even been the case for core changes, which have always been
> scrutinized much more heavily.

The NVMe fixes actually looked like real fixes. It's not what I
reacted to, I would have happily pulled those.

The comment about "NVMe and generic block layer" was about these areas
in general having been problem spots, and not the particular NVMe
patches in this pull request.

> I knew you might object to the writeback changes. As mentioned in the
> pull request, if you diff the whole thing, it's really not big at all.

Oh, it wasn't horribly big, and I agree that it was easy to read
through, but it really was entirely new code and really should have
been a merge window issue, not in a "fixes" pull.

And I did check the history of those patches, and it was all very very
recent and had recent comments on it too - it wasn't some old patches
that had been around for a while.

And while the patches were small, their effect was decidedly *not* obvious.

For example, while the patches were easy to read, one of them changed
the "try to free some memory" behavior fairly radically, going from
write out "at most 1024 pages" to "everything".

That may be a simple one-liner patch to read, but it sure as hell
wasn't obvious what the behavioral change really is under different
loads.

And when the patch is two days old, I can pretty much guarantee that
it hasn't been tested under a huge variety of loads. Maybe it does
exactly the right thing for *some* load, but...

It honestly looked like that patch existed purely in order to make the
next few patches be trivial ("no functional changes in this patch")
simply because the functional changes were done earlier.

That's all actually very nice for bisecting etc, and honestly, I liked
how the patch series looked and was split out, but I liked how it
looked as a *development* series.

It didn't make me go "this is a bug fix".

> It's mostly shuffling some arguments and things around, to make the
> final patch trivial. And it does mean that it's super easy to review.

See above. I think it was "super easy to review" only on a very
superficial level, not on a deeper one.

For example, I doubt you actually went back and looked at where that
"1024" came from that you just changed to zero?

It comes from the bitkeeper days, commit 407ee6c87e ("[PATCH]
low-latency page reclaim") in the historical tree:

[PATCH] low-latency page reclaim

Convert the VM to not wait on other people's dirty data.

 - If we find a dirty page and its queue is not congested, do some
writeback.

 - If we find a dirty page and its queue _is_ congested then just
   refile the page.

 - If we find a PageWriteback page then just refile the page.

 - There is additional throttling for write(2) callers.  Within
   generic_file_write(), record their backing queue in ->current.
   Within page reclaim, if this tasks encounters a page which is dirty
   or under writeback onthis queue, block on it.  This gives some more
   writer throttling and reduces the page refiling frequency.

It's somewhat CPU expensive - under really heavy load we only get a 50%
reclaim rate in pages coming off the tail of the LRU.  This can be
fixed by splitting the inactive list into reclaimable and
non-reclaimable lists.  But the CPU load isn't too bad, and latency is
much, much more important in these situations.

Example: with `mem=512m', running 4 instances of `dbench 100', 2.5.34
took 35 minutes to compile a kernel.  With this patch, it took three
minutes, 45 seconds.

I haven't done swapcache or MAP_SHARED pages yet.  If there's tons of
dirty swapcache or mmap data around we still stall heavily in page
reclaim.  That's less important.

This patch also has a tweak for swapless machines: don't even bother
bringing anon pages onto the inactive list if there is no swap online.

which introduced that "long nr_pages" argument to wakeup_bdflush().

Here's that patch for people who don't keep that historical tree
around like I do:


https://git.kernel.org/pub/scm/linux/kernel/git/tglx/history.git/commit/?id=407ee6c87e477434e3cb8be96885ed27b5539b6f

Now, it's entirely possible that the issue that made us do that in the
first place is long long gone, and we don't need that page limiter in
the VM any more.

But is it "super easy to review"?

No, it sure as hell is not.

> That's why I sent it in for this series. Yes, it's not months old code,
> but it has been tested. And it's not like it's a lot of NEW code, it's
> mostly removing code, or moving things around.
>
> I can resend without 

Re: [GIT PULL] Block fixes for 4.14-rc2

2017-09-24 Thread Linus Torvalds
On Sun, Sep 24, 2017 at 6:03 AM, Christoph Hellwig <h...@infradead.org> wrote:
> On Fri, Sep 22, 2017 at 05:18:49PM -1000, Linus Torvalds wrote:
>> WTF? Why is this so hard? It used to be that IDE drove people crazy.
>> Now it's NVMe and generic block layer stuff.
>
> Can you please explain what problems you have with the nvme patches?

This time it wasn't the NVMe parts (that was the last few releases).
This time it was the random writeback patches that had been committed
only days before, and that completely change some fundamental code.
Some of the commits say "this changes no semantics", but that's after
the previous commit just changed the argument values exactly so that
the next commit wouldn't change semantics.

Don't get me wrong - all the commits look perfectly _sane_.  That's
not my issue. But these commits literally showed up on lkml just a
couple of days before getting sent to me, and even when they were sent
the cover letter for the series of six patches was literally

   More graceful flusher thread memory reclaim wakeup

which *really* does not say "this is an important bug fix or regression" to me.

What made it ok to send that outside the merge window? It looks like a
nice cleanup, no question about it, and it probably fixes some
behavioral problem at FB. But that behavioral problem is not *new*, as
far as I can tell.

So why was it sent as a bug-fix pull request?

Now, if this was some other subsystem that _hadn't_ had problems in
EVERY SINGLE recent merge window, I'd likely have been more than happy
to take it early in the rc series because that subsystem wasn't on my
list of "increased scrutiny due to historical issues".

But as it is, the block layer _is_ on my list of increased scrutiny
(and part of that reason is NVMe patches - again, "fixes" being sent
that were really just normal on-going development)

Linus


Re: [GIT PULL] Block fixes for 4.14-rc2

2017-09-22 Thread Linus Torvalds
On Fri, Sep 22, 2017 at 9:32 AM, Jens Axboe  wrote:
> Hi Linus,
>
> A round of fixes for this series. This pull request contains:

No, Jens.

This isn't fixes. This is new code that does new things, or cleanups,
or other random things mixed up with some fixes.

Stop this fuckery already.

The block layer has now becomes the biggest maintenance problem I
have, because I get shit like this.

Keep the fixes separate. Stop the random crazy changes. Stop the churn.

WTF? Why is this so hard? It used to be that IDE drove people crazy.
Now it's NVMe and generic block layer stuff.

  Linus


Re: [GIT PULL] Followup block pull request for 4.14-rc1

2017-09-09 Thread Linus Torvalds
On Sat, Sep 9, 2017 at 12:54 PM, Linus Torvalds
<torva...@linux-foundation.org> wrote:
>
> Other than that, we match.

Oh, and I almost cried when I saw this nasty thing:

   ilog2(SZ_4K) - 9

in that nvme code, but I left it alone.

Why the hell people think "SZ_4K" is somehow more legible than 4096, I
have no idea. And depending on our ilog2() just doing the right thing
at build time, instead of just saying "12" with a comment, is just
odd.

It's not like "ilog2(SZ_4K)" is a particularly good comment.

That code should probably just make it clear that 4096 is the NVME
page size, and document it as such, rather than picking a random name
for a random constant.

So something like

   #define NVME_PAGE_SIZE 4096
   #define NVME_PAGE_SHIFT 12

and then using *those* would actually have documented something,
rather than these crazy odd random expressions.
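
(and the magic expression earlier would then read as something
self-documenting - the variable name here is invented, just to show it:

	/* shift between 512-byte sectors and the 4k NVMe page size */
	shift = NVME_PAGE_SHIFT - 9;

instead of "ilog2(SZ_4K) - 9")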

Linus


Re: [GIT PULL] Followup block pull request for 4.14-rc1

2017-09-09 Thread Linus Torvalds
On Fri, Sep 8, 2017 at 10:34 AM, Jens Axboe  wrote:
>
> Single merge conflict here, for nvme/rdma. Trivial to fixup, see my
> for-next branch here:
>
> http://git.kernel.dk/cgit/linux-block/log/?h=for-next

Your earlier merge for mm/page_io.c is ugly, and mixes up the bi_disk
thing oddly in the middle of the task struct code.

Other than that, we match.

  Linus


Re: [GIT PULL] First set of block changes for 4.14-rc1

2017-09-07 Thread Linus Torvalds
On Thu, Sep 7, 2017 at 12:27 PM, Jens Axboe  wrote:
>
> Which was committed yesterday? It was not from my tree. I try to keep
> an eye out for potential conflicts or issues.

It was from Andrew, so I'm assuming it was in linux-next. Not as a git
tree, but as the usual akpm branch.

I'm not sure why I didn't see any reports from linux-next about it,
though - and I looked.

There was no actual visible merge problem, because the conflict was
not a data conflict but a semantic one.

But the issue showed up for a trivial allmodconfig build, so it
*should* have been caught automatically.

But it's not even the conflict that annoys me.

It's the *reason* for the conflict - the block layer churn that you
said was done. We've had too much of it.

>> We need *stability* by now, after these crazy changes. Make sure that 
>> happens.
>
> I am pushing back, but I can't embargo any sort of API change. This one has
> good reasoning behind it, which is actually nicely explained in the commit
> itself. It's not pointless churn, which I would define as change just for
> the sake of change itself. Or pointless cleanups.

You can definitely put your foot down on any more random changes to
the bio infrastructure. Including for good reasons.

We have had some serious issues in the block layer - and I'm not
talking about the merge conflicts. I'm talking about just the
collateral damage, with things like SCSI having to revert using blk-mq
by default etc.

Christ, things like that are pretty *fundamental*, wouldn't you say?
Get those right before doing more churn. And aim to have one or two
releases literally just fixing things, with a "no churn" rule.

Because the block layer really has been a black spot lately.

  Linus


Re: [GIT PULL] First set of block changes for 4.14-rc1

2017-09-07 Thread Linus Torvalds
On Tue, Sep 5, 2017 at 7:31 AM, Jens Axboe  wrote:
>
> Note that you'll hit a conflict in block/bio-integrity.c and
> mm/page_io.c, since we had fixes later in the 4.13 series for both of
> those that ended up conflicting with new changes. Both are trivial to
> fix up, I've included my resolution at the end of this email.

Actually, the _real_ conflict is with commit 8e654f8fbff5 ("zram: read
page from backing device") and the preceding code, which added a
"zram->bdev" that is the backing dev for the zram device.

And then it does

 bio->bi_bdev = zram->bdev;

but now the block layer has AGAIN changed some stupid internal detail
for no obvious reason, so now it doesn't work.

My initial thought was to just do

bio->bi_disk = zram->disk;

instead, which *syntactically* seems to make sense, and match things
up in the obvious way.

But when I actually thought about it more, I decided that's bullshit.
"zram->disk" is not the backing device at all, it's the zram interface
itself.

The correct resolution seems to be

bio_set_dev(bio, zram->bdev);

but *dammit*, Jens, this insane crazy constant "let's randomly change
block layer details around" needs to STOP.

This has been going on for too many releases by now. What the f*ck is
*wrong* with you and Christoph that you can't just leave this thing
alone?

Stop the pointless churn already.

When I saw your comment:

  It's a quiet series this round, which I think we needed after
  the churn of the last few series.

I was like "finally". And then this shows that you were just lying,
and the pointless churn is still happening.

Stop it. Really.

Here's a hint: if some stupid change requires you to mindlessly do 150
changes using a wrapper helper, it's churn.

We need *stability* by now, after these crazy changes. Make sure that happens.

  Linus


Re: Lots of new warnings with gcc-7.1.1

2017-07-11 Thread Linus Torvalds
On Tue, Jul 11, 2017 at 8:10 PM, Guenter Roeck  wrote:
>
> The hwmon warnings are all about supporting no more than 9,999 sensors
> (applesmc) to 999,999,999 sensors (scpi) of a given type.

Yeah, I think that's enough.

> Easy "fix" would be to replace snprintf() with scnprintf(), presumably
> because gcc doesn't know about scnprintf().

If that's the case, I'd prefer just turning off the format-truncation
(but not overflow) warning with '-Wno-format-truncation'.

But maybe we can at least start it on a subsystem-by-subsystem basis
after people have verified their own subsystem?

  Linus


Lots of new warnings with gcc-7.1.1

2017-07-11 Thread Linus Torvalds
[ Very random list of maintainers and mailing lists, at least
partially by number of warnings generated by gcc-7.1.1 that is then
correlated with the get_maintainers script ]

So I upgraded one of my boxes to F26, which upgraded the compiler to gcc-7.1.1

Which in turn means that my nice clean allmodconfig compile is now an
unholy mess of annoying new warnings.

Normally I hate the stupid new warnings, but this time around they are
actually exactly the kinds of warnings you'd want to see and that are
hard for humans to pick out: lots of format errors wrt limited
buffer sizes.

At the same time, many of them *are* annoying. We have various limited
buffers that are limited for a good reason, and some of the format
truncation warnings are about numbers in the range [0-MAX_INT], where
we definitely know that we don't need to worry about the really big
ones.

After all, we're using "snprintf()" for a reason - we *want* to
truncate if the buffer is too small.

But a lot of the warnings look reasonable, and at least the warnings
are nice in how they actually explain why the warning is happening.
Example:

  arch/x86/platform/intel-mid/device_libs/platform_max7315.c: In
function ‘max7315_platform_data’:
  arch/x86/platform/intel-mid/device_libs/platform_max7315.c:41:35:
warning: ‘%d’ directive writing between 1 and 11 bytes into a region
of size 9 [-Wformat-overflow=]
 sprintf(base_pin_name, "max7315_%d_base", nr);
 ^~
  arch/x86/platform/intel-mid/device_libs/platform_max7315.c:41:26:
note: directive argument in the range [-2147483647, 2147483647]

Yeah, the compiler is technically correct, but we already made sure we
have at most MAX7315_NUM of those adapters, so no, "nr" is really not
going to be a 10-digit number.

So the warning is kind of bogus.

At the same time, others aren't quite as insane, and in many cases the
warnings might be easy to just fix.

And some actually look valid, although they might still require odd input:

  net/bluetooth/smp.c: In function ‘le_max_key_size_read’:
  net/bluetooth/smp.c:3372:29: warning: ‘snprintf’ output may be
truncated before the last format character [-Wformat-truncation=]
snprintf(buf, sizeof(buf), "%2u\n", SMP_DEV(hdev)->max_key_size);
   ^~~
  net/bluetooth/smp.c:3372:2: note: ‘snprintf’ output between 4 and 5
bytes into a destination of size 4

yeah, "max_key_size" is unsigned char, but if it's larger than 99 it
really does need 5 bytes for "%2u\n" with the terminating NUL
character.

Of course, the "%2d" implies that people expect it to be < 100, but at
the same time it doesn't sound like a bad idea to just make the buffer
be one byte bigger. So..
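
(A sketch of that one-byte fix - just size the buffer for the worst
case the format string can produce:

	char buf[5];	/* "%2u\n" can be four characters plus the NUL */

	snprintf(buf, sizeof(buf), "%2u\n", SMP_DEV(hdev)->max_key_size);

and the warning goes away without truncating anything.)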

Anyway, it would be lovely if some of the more affected developers
would take a look at gcc-7.1.1 warnings. Right now I get about three
*thousand* lines of warnings from a "make allmodconfig" build, which
makes them a bit overwhelming.

I do suspect I'll make "-Wformat-truncation" (as opposed to
"-Wformat-overflow") be a "V=1" kind of warning.  But let's see how
many of these we can fix, ok?

  Linus


Re: [GIT PULL] Core block/IO changes for 4.13-rc

2017-07-03 Thread Linus Torvalds
On Sun, Jul 2, 2017 at 4:44 PM, Jens Axboe  wrote:
>
> This is the main pull request for the block layer for 4.13. Not a huge
> round in terms of features, but there's a lot of churn related to some
> core cleanups. Note that this merge request will throw 3 merge failures for
> you. I've included how I resolved them in the for-next (or
> for-4.13/merge) branch after the pull stats, at the end of this email.

So Jens, I got a semi-complaint about this constant churn in an
entirely unrelated scheduler tracing thread.

Basically, the block IO layer is apparently really painful to do
various instrumentation on, because it keeps having this random
idiotic interface churn with the status and flags and various random
command bytes changing semantics.

Can we stop with the masturbatory "let's just randomly change things
around" already?

It doesn't just result in annoying merge conflicts, it's just
*confusing* to people when the interfaces keep changing. And confusion
is bad.

And the fact that these things keep changing is also, I think, a sign
of something being fundamentally wrong.

WTF is going on that the block layer *constantly* ends up having these
stupid "let's move random command flag bits from one field to another"
or randomly changing the error returns?

Anyway, all merged up in my tree, but your "for-4.13/merge" branch did
*not* match what I got from you. Your main for-4.13 branch had some
additional random lightnvm/pblk changes too.

   Linus


Re: [GIT PULL] Block pull request for- 4.11-rc1

2017-02-24 Thread Linus Torvalds
On Fri, Feb 24, 2017 at 9:39 AM, Bart Van Assche
 wrote:
>
> So the crash is caused by an attempt to dereference address 0x6b6b6b6b6b6b6b6b
> at offset 0x270. I think this means the crash is caused by a use-after-free.

Yeah, that's POISON_FREE, and that might explain why you see crashes
that others don't - you obviously have SLAB poisoning enabled. Jens
may not have.

%rdi is "struct mapped_device *md", which came from dm_softirq_done() doing

struct dm_rq_target_io *tio = tio_from_request(rq);
struct request *clone = tio->clone;
int rw;

if (!clone) {
rq_end_stats(tio->md, rq);
rw = rq_data_dir(rq);
if (!rq->q->mq_ops)
blk_end_request_all(rq, tio->error);
else
blk_mq_end_request(rq, tio->error);
rq_completed(tio->md, rw, false);
return;
}

so it's the 'tio' pointer that has been free'd. But it's worth noting
that we did apparently successfully dereference "tio" earlier in that
dm_softirq_done() *without* getting the poison value, so what I think
might be going on is that the 'tio' thing gets free'd when the code
does the blk_end_request_all()/blk_mq_end_request() call.

Which makes sense - that ends the lifetime of the request, which in
turn also ends the lifetime of the tio from tio_from_request(), no?

So the fix may be as simple as just doing

if (!clone) {
struct mapped_device *md = tio->md;

rq_end_stats(md, rq);
...
rq_completed(md, rw, false);
return;
}

because the 'mapped_device' pointer hopefully is still valid, it's
just 'tio' that has been freed.
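
Spelled out with the elided lines taken from the quoted original above
(untested, just to make the ordering explicit - everything read through
'tio' happens before the request is ended and can go away):

    if (!clone) {
        struct mapped_device *md = tio->md;   /* grab before rq is ended */
        int rw = rq_data_dir(rq);

        rq_end_stats(md, rq);
        if (!rq->q->mq_ops)
            blk_end_request_all(rq, tio->error);   /* may free rq, and tio with it */
        else
            blk_mq_end_request(rq, tio->error);
        rq_completed(md, rw, false);   /* uses the saved md, not the freed tio */
        return;
    }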

Jens? Bart? Christoph? Somebody who knows this code should
double-check my thinking above. I don't actually know the tio
lifetimes, I'm just going by looking at how earlier accesses seemed to
be fine (eg that "tio->clone" got us NULL, not a POISON_FREE pointer,
for example).

   Linus


Re: [GIT PULL] Block pull request for- 4.11-rc1

2017-02-22 Thread Linus Torvalds
On Wed, Feb 22, 2017 at 1:50 PM, Markus Trippelsdorf
 wrote:
>
> But what about e.g. SATA SSDs? Wouldn't they be better off without any
> scheduler?
> So perhaps setting "none" for queue/rotational==0 and mq-deadline for
> spinning drives automatically in the sq blk-mq case?

Jens already said that the merging advantage can outweigh the costs,
but he didn't actually talk much about it.

The scheduler advantage can outweigh the costs of running a scheduler
by an absolutely _huge_ amount.

An SSD isn't zero-cost, and each command tends to have some fixed
overhead on the controller, and pretty much all SSD's heavily prefer
fewer large request over lots of tiny ones.

There are also fairness/latency issues that tend to very heavily favor
having an actual scheduler, ie reads want to be scheduled before
writes on an SSD (within reason) in order to make latency better.

Ten years ago, there were lots of people who argued that you don't
want to do do scheduling for SSD's, because SSD's were so fast that
you only added overhead. Nobody really believes that fairytale any
more.

So you might have particular loads that look better with noop, but
they will be rare and far between. Try it, by all means, and if it
works for you, set it in your udev rules.
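
For anyone who does want to experiment, a udev rule along these lines
is the usual way to pin it - the device match and scheduler name here
are only an example to adapt, not a recommendation:

    # e.g. /etc/udev/rules.d/60-ioscheduler.rules
    ACTION=="add|change", KERNEL=="sd[a-z]", ATTR{queue/rotational}=="0", \
        ATTR{queue/scheduler}="none"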

The main place where a noop scheduler currently might make sense is
likely for a ramdisk, but quite frankly, since the main real usecase
for a ram-disk tends to be to make it easy to profile and find the
bottlenecks for performance analysis (for emulating future "infinitely
fast" media), even that isn't true - using noop there defeats the
whole purpose.

  Linus


Re: [GIT PULL] Block pull request for- 4.11-rc1

2017-02-22 Thread Linus Torvalds
On Wed, Feb 22, 2017 at 10:58 AM, Jens Axboe <ax...@kernel.dk> wrote:
> On 02/22/2017 11:56 AM, Linus Torvalds wrote:
>
> OK, so here's what I'll do:
>
> 1) We'll kill the default scheduler choices. sq blk-mq will default to
>mq-deadline, mq blk-mq will default to "none" (at least for now, until
>the new scheduler is done).
> 2) The individual schedulers will be y/m/n selectable, just like any
>other driver.

Yes. That makes sense as options. I can (or, perhaps even more
importantly, a distro can) answer those kinds of questions.

   Linus


Re: [GIT PULL] Block pull request for- 4.11-rc1

2017-02-22 Thread Linus Torvalds
On Wed, Feb 22, 2017 at 10:52 AM, Jens Axboe  wrote:
>>
>> It's that simple.
>
> No, it's not that simple at all. Fact is, some optimizations make sense
> for some workloads, and some do not.

Are you even listening?

I'm saying no user can ever give a sane answer to your question. The
question is insane and wrong.

I already said you can have a dynamic configuration (and maybe even an
automatic heuristic - like saying that a ramdisk gets NOOP by default,
real hardware does not).

But asking a user at kernel config time for a default is insane. If
*you* cannot answer it, then the user sure as hell cannot.

Other configuration questions have problems too, but at least the
question about "should I support ext4" is something a user (or distro)
can sanely answer. So your comparisons are pure bullshit.

 Linus


Re: [GIT PULL] Block pull request for- 4.11-rc1

2017-02-22 Thread Linus Torvalds
On Wed, Feb 22, 2017 at 10:41 AM, Jens Axboe  wrote:
>
> The fact is that we have two different sets, until we can yank
> the old ones. So I can't just ask one question, since the sets
> aren't identical.

Bullshit.

I'm, saying: rip out the question ENTIRELY. For *both* cases.

If you cannot yourself give a good answer, then there's no f*cking way
any user can give a good answer. So asking the question is totally and
utterly pointless.

All it means is that different people will try different (in random
ways) configurations, and the end result is random crap.

So get rid of those questions. Pick a default, and live with it. And
if people complain about performance, fix the performance issue.

It's that simple.

Linus


Re: [GIT PULL] Block pull request for- 4.11-rc1

2017-02-22 Thread Linus Torvalds
On Wed, Feb 22, 2017 at 10:26 AM, Linus Torvalds
<torva...@linux-foundation.org> wrote:
>
> And dammit, IF YOU DON'T EVEN KNOW, WHY THE HELL ARE YOU ASKING THE POOR USER?

Basically, I'm pushing back on config options that I can't personally
even sanely answer.

If it's a config option about "do I have a particular piece of
hardware", it makes sense. But these new ones were just complete
garbage.

The whole "default IO scheduler" thing is a disease. We should stop
making up these shit schedulers and then say "we don't know which one
works best for you".

All it does is encourage developers to make shortcuts and create crap
that isn't generically useful, and then blame the user and say "well,
you should have picked a different scheduler" when they say "this does
not work well for me".

We have had too many of those kinds of broken choices.  And when the
new Kconfig options get so confusing and so esoteric that I go "Hmm, I
have no idea if my hardware does a single queue or not", I put my foot
down.

When the IO scheduler questions were about a generic IO scheduler for
everything, I can kind of understand them. I think it was still a
mistake (for the reasons outline above), but at least it was a
comprehensible question to ask.

But when it gets to "what should I do about a single-queue version of
a MQ scheduler", the question is no longer even remotely sensible. The
question should simply NOT EXIST. There is no possible valid reason to
ask that kind of crap.

   Linus


Re: [GIT PULL] Block pull request for- 4.11-rc1

2017-02-22 Thread Linus Torvalds
On Wed, Feb 22, 2017 at 10:14 AM, Jens Axboe  wrote:
>
> What do you mean by "the regular IO scheduler"? These are different
> schedulers.

Not to the user they aren't.

If the user already answered once about the IO schedulers, we damn
well shouldn't ask again about another small implementation detail.

How hard is this to understand? You're asking users stupid things.

It's not just about the wording. It's a fundamental issue.  These
questions are about internal implementation details. They make no
sense to a user. They don't even make sense to a kernel developer, for
chrissake!

Don't make the kconfig mess worse. This "we can't make good defaults
in the kernel, so ask users about random things that they cannot
possibly answer" model is not an acceptable model.

If the new schedulers aren't better than NOOP, they shouldn't exist.
And if you want people to be able to test, they should be dynamic.

And dammit, IF YOU DON'T EVEN KNOW, WHY THE HELL ARE YOU ASKING THE POOR USER?

It's really that simple.

 Linus


Re: [GIT PULL] Block pull request for- 4.11-rc1

2017-02-21 Thread Linus Torvalds
On Tue, Feb 21, 2017 at 3:15 PM, Jens Axboe  wrote:
>
> But under a device managed by blk-mq, that device exposes a number of
> hardware queues. For older style devices, that number is typically 1
> (single queue).

... but why would this ever be different from the normal IO scheduler?

IOW, what makes single-queue mq scheduling so special that

 (a) it needs its own config option

 (b) it is different from just the regular IO scheduler in the first place?

So the whole thing stinks. The fact that it then has an
incomprehensible config option seems to be just gravy on top of the
crap.

> "none" just means that we don't have a scheduler attached.

.. which makes no sense to me in the first place.

People used to try to convince us that doing IO schedulers was a
mistake, because modern disk hardware did a better job than we could
do in software.

Those people were full of crap. The regular IO scheduler used to have
a "NONE" option too. Maybe it even still has one, but only insane
people actually use it.

Why is the MQ stuff magically so different that NONE would make sense at all?

And equally importantly: why do we _ask_ people these issues? Is this
some kind of sick "cover your ass" thing, where you can say "well, I
asked about it", when inevitably the choice ends up being the wrong
one?

We have too damn many Kconfig options as-is, I'm trying to push back
on them. These two options seem fundamentally broken and stupid.

The "we have no good idea, so let's add a Kconfig option" seems like a
broken excuse for these things existing.

So why ask this question in the first place?

Is there any possible reason why "NONE" is a good option at all? And
if it is the _only_ option (because no other better choice exists), it
damn well shouldn't be a kconfig option!

 Linus


Re: [GIT PULL] Block pull request for- 4.11-rc1

2017-02-21 Thread Linus Torvalds
Hmm. The new config options are incomprehensible and their help
messages don't actually help.

So can you fill in the blanks on what

  Default single-queue blk-mq I/O scheduler
  Default multi-queue blk-mq I/O scheduler

config options mean, and why they default to none?

The config phase of the kernel is one of the worst parts of the whole
project, and adding these kinds of random and incomprehensible config
options does *not* help.

  Linus


Re: [4.10, panic, regression] iscsi: null pointer deref at iscsi_tcp_segment_done+0x20d/0x2e0

2017-01-07 Thread Linus Torvalds
On Sat, Jan 7, 2017 at 6:02 PM, Johannes Weiner  wrote:
>
> Linus? Andrew?

Looks fine to me. Will apply.

   Linus


Re: [4.10, panic, regression] iscsi: null pointer deref at iscsi_tcp_segment_done+0x20d/0x2e0

2016-12-21 Thread Linus Torvalds
On Wed, Dec 21, 2016 at 9:13 PM, Dave Chinner  wrote:
>
> There may be deeper issues. I just started running scalability tests
> (e.g. 16-way fsmark create tests) and about a minute in I got a
> directory corruption reported - something I hadn't seen in the dev
> cycle at all.

By "in the dev cycle", do you mean your XFS changes, or have you been
tracking the merge cycle at least for some testing?

> I unmounted the fs, mkfs'd it again, ran the
> workload again and about a minute in this fired:
>
> [628867.607417] [ cut here ]
> [628867.608603] WARNING: CPU: 2 PID: 16925 at mm/workingset.c:461 
> shadow_lru_isolate+0x171/0x220

Well, part of the changes during the merge window were the shadow
entry tracking changes that came in through Andrew's tree. Adding
Johannes Weiner to the participants.

> Now, this workload does not touch the page cache at all - it's
> entirely an XFS metadata workload, so it should not really be
> affecting the working set code.

Well, I suspect that anything that creates memory pressure will end up
triggering the working set code, so ..

That said, obviously memory corruption could be involved and result in
random issues too, but I wouldn't really expect that in this code.

It would probably be really useful to get more data points - is the
problem reliably in this area, or is it going to be random and all
over the place.

That said:

> And worse, on that last error, the /host/ is now going into meltdown
> (running 4.7.5) with 32 CPUs all burning down in ACPI code:

The obvious question here is how much you trust the environment if the
host ends up also showing problems. Maybe you do end up having hw
issues pop up too.

The primary suspect would presumably be the development kernel you're
testing triggering something, but it has to be asked..

 Linus


Re: [GIT PULL] Block fixes for 4.9-rc

2016-11-11 Thread Linus Torvalds
On Fri, Nov 11, 2016 at 4:11 PM, Jens Axboe  wrote:
> Hi Linus,
>
> Three small (really, one liners all of them!) fixes that should go into
> this series:

What about the aoeblk one? That seems to have come in with a tester
lately. From your original email:

 "I'm wondering if this is bio iteration breakage. aoeblk does this weird
  inc/dec of page counts, which should not be needed. At least others
  would be hit by this as well. In any case, should suffice for a test,
  before we look further. Can anyone test with this debug patch?"

Anyway, that bug seems to have been around forever and I'm not seeing
a lot of complaints, but I thought I'd ask.

Your oneliners pulled. Except when I pull, I don't actually get this one:

Matias Bjørling (1):
  lightnvm: invalid offset calculation for lba_shift

but since you know how very deeply I care about lightnvm, I'm not
finding it in myself to worry about why that one was missing.

   Linus