Re: [GIT PULL] Follow up block fix
On Thu, Dec 6, 2018 at 6:12 PM Jens Axboe wrote:
>
> Linus, I just now noticed you are not on the CC for the discussion about
> the patch. Don't pull this one yet. It'll solve the issue, but it'll also
> mess up the BUSY feedback loop that DM relies on for good merging of
> sequential IO. Testing a new patch now, will push it to you when folks
> are happy and when my testing has completed.

Ok, this pull request dropped from my queue.

          Linus
Re: [GIT PULL] Block fixes for 4.20-rc2
On Fri, Nov 9, 2018 at 1:29 PM Jens Axboe wrote: > > A select set of fixes that should go into this release. This pull > request contains: This is part of a final few "ack" emails, pointing out that there is now automation in place if you cc lkml in your pull request. That automation will parse the pull request and automatically send an ack when the end result gets merged into mainline, so I'll stop the manual process. But it currently only works on lkml (I think - maybe Konstantin enabled it for all lists that get tracked by lore.kernel.org), so this pull request probably won't get an automated response due to just going to linux-block. Just a heads-up. Linus
Re: [GIT PULL] Final block merge window changes/fixes
On Fri, Nov 2, 2018 at 10:08 AM Jens Axboe wrote: > > The biggest part of this pull request is the revert of the blkcg cleanup > series. It had one fix earlier for a stacked device issue, but another > one was reported. Rather than play whack-a-mole with this, revert the > entire series and try again for the next kernel release. > > Apart from that, only small fixes/changes. Pulled, Linus
Re: [GIT PULL] Followup block changes for 4.20-rc1
On Fri, Oct 26, 2018 at 9:00 AM Jens Axboe wrote: > > A followup pull request for this merge window. Pulled, Linus
Re: [GIT PULL] Follow up block changes for this merge window
On Wed, Aug 22, 2018 at 10:54 AM Jens Axboe wrote: > > - Set of bcache fixes and changes (Coly) Some of those bcache style fixes look questionable. Maybe we should push back on some of the checkpatch rules instead? Like having argument names in declarations - sometimes descriptive names can be good documentation. And sometimes they are just noise, because the type is what describes it. Oh well. Taken. Linus
Re: [GIT PULL] Block changes for 4.18-rc
On Mon, Jun 4, 2018 at 5:56 PM Kent Overstreet wrote: > > I like your patch for a less invasive version, but I did finish and test my > version, which deletes more code :) I certainly won't object to that. Your patch looks fine, and looks like the right thing in the long run anyway. Plus, it's apparently even tested. What a concept. Regardless, I'll sleep better tonight that this will all be fixed one way or the other. Thanks, Linus
Re: [GIT PULL] Block changes for 4.18-rc
On Mon, Jun 4, 2018 at 5:42 PM Linus Torvalds wrote: > > How about just the attached? Note: it probably goes without saying that the patch was entirely untested, but it does build, and it does get rid of the insane stack use. Linus
Re: [GIT PULL] Block changes for 4.18-rc
On Mon, Jun 4, 2018 at 12:04 PM Kent Overstreet wrote:
>
> However, that's not correct as is because mddev_delayed_put() calls
> kobject_put(), and the kobject isn't initialized when the mddev is first
> allocated, it's initialized when the gendisk is allocated... that isn't
> hard to fix but that's getting into real refactoring that I'll need to
> put actual work into testing.

Well, it also removes the bioset_exit() calls entirely.

How about just the attached? It simply does it as two different cases, and adds the bioset_exit() calls to mddev_delayed_delete().

Hmm?

          Linus

 drivers/md/md.c | 23 ++++++++++-------------
 1 file changed, 10 insertions(+), 13 deletions(-)

diff --git a/drivers/md/md.c b/drivers/md/md.c
index fc692b7128bb..6a2494065ab2 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -510,11 +510,6 @@ static void mddev_delayed_delete(struct work_struct *ws);
 
 static void mddev_put(struct mddev *mddev)
 {
-	struct bio_set bs, sync_bs;
-
-	memset(&bs, 0, sizeof(bs));
-	memset(&sync_bs, 0, sizeof(sync_bs));
-
 	if (!atomic_dec_and_lock(&mddev->active, &all_mddevs_lock))
 		return;
 	if (!mddev->raid_disks && list_empty(&mddev->disks) &&
@@ -522,10 +517,6 @@ static void mddev_put(struct mddev *mddev)
 		/* Array is not configured at all, and not held active,
 		 * so destroy it */
 		list_del_init(&mddev->all_mddevs);
-		bs = mddev->bio_set;
-		sync_bs = mddev->sync_set;
-		memset(&mddev->bio_set, 0, sizeof(mddev->bio_set));
-		memset(&mddev->sync_set, 0, sizeof(mddev->sync_set));
 		if (mddev->gendisk) {
 			/* We did a probe so need to clean up.  Call
 			 * queue_work inside the spinlock so that
@@ -534,12 +525,15 @@ static void mddev_put(struct mddev *mddev)
 			 */
 			INIT_WORK(&mddev->del_work, mddev_delayed_delete);
 			queue_work(md_misc_wq, &mddev->del_work);
-		} else
-			kfree(mddev);
+			spin_unlock(&all_mddevs_lock);
+			return;
+		}
 	}
 	spin_unlock(&all_mddevs_lock);
-	bioset_exit(&bs);
-	bioset_exit(&sync_bs);
+
+	bioset_exit(&mddev->bio_set);
+	bioset_exit(&mddev->sync_set);
+	kfree(mddev);
 }
 
 static void md_safemode_timeout(struct timer_list *t);
@@ -5234,6 +5228,9 @@ static void mddev_delayed_delete(struct work_struct *ws)
 {
 	struct mddev *mddev = container_of(ws, struct mddev, del_work);
 
+	bioset_exit(&mddev->bio_set);
+	bioset_exit(&mddev->sync_set);
+
 	sysfs_remove_group(&mddev->kobj, &md_bitmap_group);
 	kobject_del(&mddev->kobj);
 	kobject_put(&mddev->kobj);
Re: [GIT PULL] Block changes for 4.18-rc
On Mon, Jun 4, 2018 at 11:20 AM Tejun Heo wrote:
>
> Looking at the code, the fundamental problem seems to be that it's
> weaving different parts of sync and async paths. I don't understand
> why it'd punt the destruction of mddev but destroy biosets
> synchronously. Can't it do sth like the following?

Yeah, that looks like the right thing to do. Jens/Kent?

          Linus
Re: [GIT PULL] Block changes for 4.18-rc
On Mon, Jun 4, 2018 at 8:54 AM Jens Axboe wrote:
>
> On 6/4/18 9:51 AM, Linus Torvalds wrote:
> >
> > Why the hell doesn't it just do bioset_exit() on the originals instead,
> > before freeing the mddev?
>
> CC'ing Neil to get his input on how best to clean that up, I'd
> be much more comfortable with that logic improved rather than
> just catering to the stack usage issue. Also include Arnd, as
> he had a test patch for it.

Also adding Tejun, because the only reason for that odd sequencing seems to be that

 - we want to queue the deletion work *inside* the spinlock, because it actually has synchronization between the workqueue and the 'all_mddevs_lock' spinlock.

 - we want to bioset_exit() the fields *outside* the spinlock.

So what it does now is to copy those big 'struct bio_set' fields because they might go away from under it.

Tejun, the code in question is mddev_put() in drivers/md/md.c, and it basically does

	INIT_WORK(&mddev->del_work, mddev_delayed_delete);
	queue_work(md_misc_wq, &mddev->del_work);

inside a spinlock, but then wants to do some stuff *outside* the spinlock before that mddev_delayed_delete() thing actually gets called.

Is there some way to "half-queue" the work - enough that a flush_workqueue() will wait for it, but still guaranteeing that it won't actually run until released?

IOW, what I think that code really wants to do is perhaps something like

	/* under the all_mddevs_lock spinlock */
	.. remove from all lists so that it can't be found ..
	.. but add it to the md_misc_wq so that people will wait for it ..
	INIT_WORK_LOCKED(&mddev->del_work, mddev_delayed_delete);
	queue_work(md_misc_wq, &mddev->del_work);
	...
	spin_unlock(&all_mddevs_lock);

	/* mddev is still guaranteed live */
	bioset_exit(&mddev->bio_set);
	bioset_exit(&mddev->sync_set);

	/* NOW we can release it */
	if (queued)
		unlock_work(&mddev->del_work);
	else
		kfree(mddev);

or something like that? The above is *ENTIRELY* hand-wavy garbage - don't take it seriously per se, consider it just pseudo-code for documentation reasons, not serious in any other way.
Linus
Re: [GIT PULL] Block changes for 4.18-rc
On Sun, Jun 3, 2018 at 5:42 PM Jens Axboe wrote:
>
>  drivers/md/md.c | 61 +--
>  drivers/md/md.h |  4 +-

So I've pulled this, but I get a new warning:

  drivers/md/md.c: In function ‘mddev_put’:
  drivers/md/md.c:543:1: warning: the frame size of 2112 bytes is larger than 2048 bytes [-Wframe-larger-than=]

which seems to be due to commit afeee514ce7f ("md: convert to bioset_init()/mempool_init()").

As of that commit, mddev_put() now allocates *two* "struct bio_set" structures on the stack, and those really aren't small.

Yes, part of it is that I do my test-builds with all that debug stuff, but those structures have several embedded mempool_t members, and they really are pretty sizable. And I _really_ do not think it's acceptable to have that kind of stack usage in there. I'm not sure you can trigger mddev_put() in the IO paths, but even without that I don't think it's acceptable.

Also, the code simply looks like complete and utter garbage. It does

	bs = mddev->bio_set;
	sync_bs = mddev->sync_set;

to _copy_ those structures, and then does bioset_exit() on them. WTF? Why the hell doesn't it just do bioset_exit() on the originals instead, before freeing the mddev?

I've pulled this, but this needs to be fixed. That is just broken garbage right now. No way in hell is it acceptable to waste 2kB of stack space on something idiotic like this.

          Linus
Re: [PATCH v2 01/26] rculist: introduce list_next_or_null_rr_rcu()
On Tue, May 22, 2018 at 2:09 AM Roman Penyaev <roman.peny...@profitbricks.com> wrote:
>
> Should I resend current patch with more clear comments about how careful
> caller should be with a leaking pointer?

No. Even if we go your way, there is *one* single user, and that one is special and needs to take a lot more care.

Just roll your own version, and make it an inline function like I've asked now many times, and add a shit-ton of explanations of why it's safe to use in that *one* situation.

I don't want any crazy and unsafe stuff in the generic header file that absolutely *nobody* should ever use.

          Linus
Re: [PATCH v2 01/26] rculist: introduce list_next_or_null_rr_rcu()
On Mon, May 21, 2018 at 6:51 AM Roman Penyaev <roman.peny...@profitbricks.com> wrote:
>
> No, I continue from the pointer, which I assigned on the previous IO
> in order to send IO fairly and keep load balanced.

Right. And that's exactly what has both me and Paul nervous. You're no longer in the RCU domain. You're using a pointer where the lifetime has nothing to do with RCU any more.

Can it be done? Sure. But you need *other* locking for it (that you haven't explained), and it's fragile as hell.

It's probably best to not use RCU for it at all, but depend on that "other locking" that you have to have anyway, to keep the pointer valid over the non-RCU region.

          Linus
Re: [PATCH v2 01/26] rculist: introduce list_next_or_null_rr_rcu()
On Sat, May 19, 2018 at 1:25 PM Roman Penyaev <roman.peny...@profitbricks.com> wrote:
>
> Another one list_for_each_entry_rcu()-like macro I am aware of is used in
> block/blk-mq-sched.c, is called list_for_each_entry_rcu_rr():
>
> https://elixir.bootlin.com/linux/v4.17-rc5/source/block/blk-mq-sched.c#L370
>
> Can we do something generic with -rr semantics to cover both cases?

That loop actually looks more like what Paul was asking for, and makes it (perhaps) a bit more obvious that the whole loop has to be done under the same RCU read sequence that looked up that first 'skip' entry. (Again, stronger locking than RCU is obviously also acceptable for the "look up skip entry").

But another reason I really dislike that list_next_or_null_rr_rcu() macro in the patch under discussion is that it's *really* not the right way to skip one entry. It may work, but it's really ugly.

Again, the list_for_each_entry_rcu_rr() in blk-mq-sched.c looks better in that regard, in that the skipping seems at least a _bit_ more explicit about what it's doing.

And again, if you make this specific to one particular list (and it really likely is just one particular list that wants this), you can use a nice legible helper inline function instead of the macro with the member name.

Don't get me wrong - I absolutely adore our generic list handling macros, but I think they work because they are simple. Once we get to "take the next entry, but skip it if it's the head entry, and then return NULL if you get back to the entry you started with" kind of semantics, an inline function that takes a particular list and has a big comment about *why* you want those semantics for that particular case sounds _much_ better to me than adding some complex "generic" macro for a very very unusual special case.

          Linus
Re: [PATCH v2 01/26] rculist: introduce list_next_or_null_rr_rcu()
On Sat, May 19, 2018 at 1:21 PM Roman Penyaev <roman.peny...@profitbricks.com> wrote:
>
> I need -rr behaviour for doing IO load-balancing when I choose next RDMA
> connection from the list in order to send a request, i.e. my code is
> something like the following:

  [ incomplete pseudocode ]

> i.e. usage of the @next pointer is under an RCU critical section.

That's not enough. The whole chain to look up the pointer you are taking 'next' of needs to be under RCU, and that's not clear from your example.

It's *probably* the case, but basically you have to prove that the starting point is still on the same RCU list. That wasn't clear from your example.

The above is (as Paul said) true of list_next_rcu() too, so it's not like this is anything specific to the 'rr' version.

          Linus
Re: [PATCH v2 01/26] rculist: introduce list_next_or_null_rr_rcu()
On Fri, May 18, 2018 at 6:07 AM Roman Pen wrote:
>
> Function is going to be used in transport over RDMA module
> in subsequent patches.

Does this really merit its own helper macro in a generic header?

It honestly smells more like "just have an inline helper function that is specific to rdma" to me. Particularly since it's probably just one specific list where you want this oddly specific behavior.

Also, if we really want a round-robin list traversal macro, this isn't the way it should be implemented, I suspect, and it probably shouldn't be RCU-specific to begin with.

Side note: I notice that I should already have been more critical of even the much simpler "list_next_or_null_rcu()" macro. The "documentation" comment above the macro is pure and utter cut-and-paste garbage.

Paul, mind giving this a look?

          Linus
Re: INFO: task hung in wb_shutdown (2)
On Tue, May 1, 2018 at 3:27 AM Tetsuo Handa <penguin-ker...@i-love.sakura.ne.jp> wrote:
>
> Can you review this patch? syzbot has hit this bug for nearly 4000 times but
> is still unable to find a reproducer. Therefore, the only way to test would be
> to apply this patch upstream and test whether the problem is solved.

Looks ok to me, except:

> >	smp_wmb();
> >	clear_bit(WB_shutting_down, &wb->state);
> > +	smp_mb(); /* advised by wake_up_bit() */
> > +	wake_up_bit(&wb->state, WB_shutting_down);

This whole sequence really should just be a pattern with a helper function.

And honestly, the pattern probably *should* be

	clear_bit_unlock(bit, &word);
	smp_mb__after_atomic();
	wake_up_bit(&word, bit);

which looks like it is a bit cleaner wrt memory ordering rules.

          Linus
Re: limits->max_sectors is getting set to 0, why/where? [was: Re: dm: kernel oops by divide error on v4.16+]
On Mon, Apr 9, 2018 at 3:32 PM, Jens Axboe wrote:
>
> The resulting min/max and friends would have been trivial to test, but
> clearly they weren't.

Well, the min/max macros themselves actually were tested in user space by me. It was the interaction with the unrelated "min_not_zero()" that wasn't ;)

It's easy in hindsight to say "that's not at all unrelated", but within the context of doing min/max, it was.

          Linus
Re: [PATCH v2 00/10] Copy Offload in NVMe Fabrics with P2P PCI Memory
On Fri, Mar 2, 2018 at 8:57 AM, Linus Torvalds <torva...@linux-foundation.org> wrote: > > Like the page table caching entries, the memory type range registers > are really just "secondary information". They don't actually select > between PCIe and RAM, they just affect the behavior on top of that. Side note: historically the two may have been almost the same, since the CPU only had one single unified bus for "memory" (whether that was memory-mapped PCI or actual RAM). The steering was external. But even back then you had extended bits to specify things like how the 640k-1M region got remapped - which could depend on not just the address, but on whether you read or wrote to it. The "lost" 384kB of RAM could either be remapped at a different address, or could be used for shadowing the (slow) ROM contents, or whatever. Linus
Re: [PATCH v2 00/10] Copy Offload in NVMe Fabrics with P2P PCI Memory
On Fri, Mar 2, 2018 at 8:22 AM, Kani, Toshi wrote:
>
> FWIW, this thing is called MTRRs on x86, which are initialized by BIOS.

No. Or rather, that's simply just another (small) part of it all - and an architected and documented one at that.

Like the page table caching entries, the memory type range registers are really just "secondary information". They don't actually select between PCIe and RAM, they just affect the behavior on top of that.

The really nitty-gritty stuff is not architected, and generally not documented outside (possibly) the BIOS writer's guide that is not made public.

Those magical registers contain details like how the DRAM is interleaved (if it is), what the timings are, where which memory controller handles which memory range, and what goes to PCIe etc.

Basically all the actual *steering* information is very much hidden away from the kernel (and often from the BIOS too). The parts we see at a higher level are just tuning and tweaks.

Note: the details differ _enormously_ between different chips. The setup can be very different, with things like Knights Landing having the external cache that can also act as local memory that isn't a cache but maps at a different physical address instead etc.

That's the kind of steering I'm talking about - at a low level how physical addresses get mapped to different cache partitions, memory controllers, or to the IO system etc.

          Linus
Re: [PATCH v2 00/10] Copy Offload in NVMe Fabrics with P2P PCI Memory
On Thu, Mar 1, 2018 at 2:06 PM, Benjamin Herrenschmidt wrote:
>
> Could be that x86 has the smarts to do the right thing, still trying to
> untangle the code :-)

Afaik, x86 will not cache PCI unless the system is misconfigured, and even then it's more likely to just raise a machine check exception than cache things.

The last-level cache is going to do fills and spills directly to the memory controller, not to the PCIe side of things.

(I guess you *can* do things differently, and I wouldn't be surprised if some people inside Intel did try to do things differently with trying nvram over PCIe, but in general I think the above is true)

You won't find it in the kernel code either. It's in hardware with firmware configuration of what addresses are mapped to the memory controllers (and _how_ they are mapped) and which are not.

You _might_ find it in the BIOS, assuming you understood the tables and had the BIOS writer's guide to unravel the magic registers.

But you might not even find it there. Some of the memory unit timing programming is done very early, and by code that Intel doesn't even release to the BIOS writers except as a magic encrypted blob, afaik. Some of the magic might even be in microcode.

The page table settings for cacheability are more like a hint, and only _part_ of the whole picture. The memory type range registers are another part. And magic low-level uarch, northbridge and memory unit specific magic is yet another part.

So you can disable caching for memory, but I'm pretty sure you can't enable caching for PCIe at least in the common case. At best you can affect how the store buffer works for PCIe.

          Linus
Re: [GIT PULL] Block changes for 4.16-rc
On Mon, Jan 29, 2018 at 12:08 PM, Jens Axboe wrote:
>
> Yes of course, I can switch to using signed tags for pull requests
> for you.

Thanks. I don't actually know the maintainership status of kernel.dk. You show up as the 'whois' contact, but I'm also assuming it doesn't have the same kind of fairly draconian access control as the kernel.org suite of sites, and I'm actively trying to make sure we either use signed tags or the kernel.org repos (and honestly, even with the kernel.org ones I tend to prefer signed tags).

Right now I only _require_ it for public hosting, but I'm trying to move to just signed tags being the default everywhere.

          Linus
Re: [GIT PULL] Block changes for 4.16-rc
On Mon, Jan 29, 2018 at 7:40 AM, Jens Axboe wrote:
>
> This is the main pull request for block IO related changes for the 4.16
> kernel. Nothing major in this pull request, but a good amount of
> improvements and fixes all over the map. This pull request contains:

Pulled.

However, can I request that you please start using signed tags? I know you have a gpg key and know how to do signed tags, because in the almost four hundred pulls that I've done from you over the many years, I've actually got three such pull requests from you.

Can we make that percentage go up a bit? Pretty please?

          Linus
Re: [PATCH] locking/lockdep: Add CONFIG_LOCKDEP_AGGRESSIVE
On Mon, Dec 11, 2017 at 9:20 PM, Byungchul Park wrote:
>
> The *problem* is false positives, since locks and waiters in
> kernel are not classified properly

So the problem is that those false positives apparently end up being a big deal for the filesystem people.

I personally don't think the code itself has to be removed, but I do think that it should never have been added on as part of the generic lock proving, and should always have been a separate config option.

I also feel that you dismiss "false positives" much too easily. A false positive is a big problem - because it makes people ignore the real cases (or just disable the functionality entirely).

It's why I am very quick to disable compiler warnings that have false positives, for example. Just a couple of "harmless" false positive warnings will poison the real warnings for people because they'll get used to seeing warnings while building, and no longer actually look at them.

          Linus
Re: [PATCH] locking/lockdep: Add CONFIG_LOCKDEP_AGGRESSIVE
On Sun, Dec 10, 2017 at 7:50 PM, Theodore Ts'o wrote:
>
> CONFIG_LOCKDEP_CROSSRELEASE and CONFIG_LOCKDEP_COMPLETIONS can result
> in a large number of false positives because lockdep doesn't
> understand how to deal with multiple stacked loop or MD devices.

Guys, can we just remove this nasty crud already? It's not working. Give it up.

It was complex, it was buggy, it was slow. Now it's causing people to disable lockdep entirely, or play these kinds of games in unrelated trees.

It's time to give up on bad debugging, and definitely _not_ enable it by default for PROVE_LOCKING.

Ted: in the meantime, don't use PROVE_LOCKING. Just use DEBUG_LOCK_ALLOC instead. Please drop this patch from your tree.

          Linus
Re: [GIT PULL] Followup merge window block fixes/changes
On Fri, Nov 17, 2017 at 12:18 PM, Paolo Valente wrote:
>
> Sorry for causing this problem. Yours was our first version, but then
> we feared that leaving useless instructions was worse than adding a
> burst of ifdefs. I'll try not to repeat this mistake.

So I generally do not want people to depend on the compiler to do non-obvious conversions (so please do not write bad code that you just hope the compiler can sort out and make reasonable), but we do heavily rely on the compiler doing the obvious dead code removal particularly for when things end up not being used.

A lot of the time that's very much something you can and should depend on, because we have abstraction layers that involves a lot of code just becoming no-ops on many architectures, but other architectures need some particular sequence.

Another example is a lot of helper functions to do verification of various kernel rules that then are enabled by some debug option (eg CONFIG_DEBUG_ATOMIC_SLEEP etc), where we know that (and depend on) the compiler will do a reasonable job. That's not so different from having "statistics gathering function that compiles to nothing".

And yes, sometimes the code might need to be written so that the compiler has an easier time _seeing_ that the code really goes away, so it's not necessarily a blind trust in the compiler.

So for example, the functions that can go away should obviously be inline functions so that you don't end up having the compiler generate the arguments and the call to an empty function body, and so that the compiler can see that "oh, those arguments aren't even used at all, so now I can remove all the code that generates them".

If you are comfy with assembler, it can actually be a great thing to occasionally just verify the actual generated code if it's some piece of code you care deeply about. Obviously for performance work, profiling happens, and then you see it that way. Because sometimes you find some silly "Duh!" moments, where you didn't realize that the compiler wasn't able to really get rid of something for some reason that is usually very obvious in hindsight, but wasn't actually obvious at all until you looked at what the compiler generated.

          Linus
Re: [GIT PULL] Followup merge window block fixes/changes
On Fri, Nov 17, 2017 at 11:29 AM, Linus Torvalds <torva...@linux-foundation.org> wrote: > > "F*ck no, that code is too ugly, you need to write it so that it can > be maintained". Dammit, the rest of the pull looks ok, so I'll take it anyway. But I really do expect you to (a) clean up that mess - maybe with my patch as an example, but maybe some entirely different way. The patch I sent is already deleted from my tree, it was purely meant as an example. (b) really push back on people when they send you ugly code It shouldn't have to always be me being upset and pushing back on the block pulls. Linus
Re: [GIT PULL] Followup merge window block fixes/changes
On Fri, Nov 17, 2017 at 8:51 AM, Jens Axboe wrote:
>
> - Small BFQ updates series from Luca and Paolo.

Honestly, this code is too ugly to live. Seriously.

That update should have been rejected on the grounds of being completely unmaintainable crap. Why didn't you? Why are you allowing code like that patch to bfq_dispatch_request() into the kernel source tree?

Yes, it improves performance. But it does so by adding random #ifdef's into the middle of some pretty critical code, with absolutely no attempt at having a sane maintainable code-base.

That function is literally *three* lines of code without the #ifdef case:

	spin_lock_irq(&bfqd->lock);
	rq = __bfq_dispatch_request(hctx);
	spin_unlock_irq(&bfqd->lock);

and you could actually see what it did. But instead of trying to abstract it in some legible manner, that three-line obvious function got *three* copies of the same #if mess all enclosing random crap, and the end result is really hard to read.

For example, I just spent a couple of minutes trying to make sense of the code and sort out that unreadable mess. I came up with the appended patch. It may not work for some reason I can't see, but that's not the point. The attached patch is not meant as a "apply this as-is". It's meant as a "look, you can write legible code where the statistics just go away when they are compiled out".

Now that "obvious three-liner function" is not three lines any more, but it's at least *clear* straight-line code:

	spin_lock_irq(&bfqd->lock);

	in_serv_queue = bfqd->in_service_queue;
	waiting_rq = in_serv_queue && bfq_bfqq_wait_request(in_serv_queue);

	rq = __bfq_dispatch_request(hctx);

	idle_timer_disabled =
		waiting_rq && !bfq_bfqq_wait_request(in_serv_queue);

	spin_unlock_irq(&bfqd->lock);

	bfq_dispatch_statistics(hctx, rq, in_serv_queue, waiting_rq,
				idle_timer_disabled);

and I am hopeful that when bfq_dispatch_statistics() is disabled, the compiler will be smart enough to not generate extra code, certainly not noticeably so.

See?
One is a mess of horrible ugly #ifdef's in the middle of the code that makes the end result completely illegible. The other is actually somewhat legible, and has a clear separation of the statistics gathering.

And that *statistics* gathering is clearly optionally enabled or not. Not the mess of random code put together in random ways that are completely illegible.

Your job as maintainer is _literally_ to tell people "f*ck no, that code is too ugly, you need to write it so that it can be maintained".

And I'm doing my job. "F*ck no, that code is too ugly, you need to write it so that it can be maintained".

Show some taste.

          Linus

 block/bfq-iosched.c | 55 ++++++++++++++++++++++++-----------------------
 1 file changed, 29 insertions(+), 26 deletions(-)

diff --git a/block/bfq-iosched.c b/block/bfq-iosched.c
index bcb6d21baf12..47d88d03dadd 100644
--- a/block/bfq-iosched.c
+++ b/block/bfq-iosched.c
@@ -3689,35 +3689,14 @@ static struct request *__bfq_dispatch_request(struct blk_mq_hw_ctx *hctx)
 	return rq;
 }
 
-static struct request *bfq_dispatch_request(struct blk_mq_hw_ctx *hctx)
-{
-	struct bfq_data *bfqd = hctx->queue->elevator->elevator_data;
-	struct request *rq;
-#if defined(CONFIG_BFQ_GROUP_IOSCHED) && defined(CONFIG_DEBUG_BLK_CGROUP)
-	struct bfq_queue *in_serv_queue, *bfqq;
-	bool waiting_rq, idle_timer_disabled;
-#endif
-
-	spin_lock_irq(&bfqd->lock);
-
 #if defined(CONFIG_BFQ_GROUP_IOSCHED) && defined(CONFIG_DEBUG_BLK_CGROUP)
-	in_serv_queue = bfqd->in_service_queue;
-	waiting_rq = in_serv_queue && bfq_bfqq_wait_request(in_serv_queue);
-
-	rq = __bfq_dispatch_request(hctx);
-
-	idle_timer_disabled =
-		waiting_rq && !bfq_bfqq_wait_request(in_serv_queue);
-
-#else
-	rq = __bfq_dispatch_request(hctx);
-#endif
-	spin_unlock_irq(&bfqd->lock);
+static void bfq_dispatch_statistics(struct blk_mq_hw_ctx *hctx, struct request *rq,
+		struct bfq_queue *in_serv_queue, bool waiting_rq, bool idle_timer_disabled)
+{
+	struct bfq_queue *bfqq = rq ? RQ_BFQQ(rq) : NULL;
 
-#if defined(CONFIG_BFQ_GROUP_IOSCHED) && defined(CONFIG_DEBUG_BLK_CGROUP)
-	bfqq = rq ? RQ_BFQQ(rq) : NULL;
 	if (!idle_timer_disabled && !bfqq)
-		return rq;
+		return;
 
 	/*
 	 * rq and bfqq are guaranteed to exist until this function
@@ -3752,8 +3731,32 @@ static struct request *bfq_dispatch_request(struct blk_mq_hw_ctx *hctx)
 		bfqg_stats_update_io_remove(bfqg, rq->cmd_flags);
 	}
 	spin_unlock_irq(hctx->queue->queue_lock);
+}
+#else
+static inline void bfq_dispatch_statistics(struct blk_mq_hw_ctx *hctx,
+		struct request *rq, struct bfq_queue *in_serv_queue,
+		bool waiting_rq, bool idle_timer_disabled) {}
 #endif
 
+static struct request *bfq_dispatch_request(struct blk_mq_hw_ctx *hctx)
+{
+
Re: [GIT PULL] Block fixes for 4.14-rc2
On Wed, Sep 27, 2017 at 5:41 AM, Jens Axboe wrote:
>
> So I reworked the series, to include three prep patches that end up
> killing off free_more_memory(). This means that we don't have to do the
> 1024 -> 0 change in there. On top of that, I added a separate bit to
> manage range cyclic vs non range cyclic flush all work. This means that
> we don't have to worry about the laptop case either.
>
> I think that should quell any of the concerns in the patchset, you can
> find the new series here:
>
> http://git.kernel.dk/cgit/linux-block/log/?h=wb-start-all

Yeah, this I feel more confident about. It still has some subtle changes (nr_pages is slightly different for laptop mode, and the whole "we rely on the VM to not return NULL") but I cannot for the life of me imagine that those changes are noticeable or meaningful.

So now that last commit that limits the pending flushes is the only one that seems to matter, and the one you wanted to build up to. It's not a subtle change, but that's the one that fixes the actual bug you see, so now the rest of the series really looks like "prep-work for the bug fix".

Of course, with the change to support range_cyclic for the laptop mode case, there's actually possibly _two_ pending full flushes (the range-cyclic and the "all" one), so that last changelog may not be entirely accurate.

Long-term, I would prefer to maybe make the "range_whole" case to always be range-cyclic, because I suspect that's the better behavior, but I think that series is the one that changes existing semantics the least for the existing cases, so I think your approach is better for now.

As far as I can tell, the cases that do not set range_cyclic right now are:

 - wakeup_flusher_threads()

 - sync_inodes_sb()

and in neither case does that lack of range_cyclic really seem to make any sense. It might be a purely historical artifact (an _old_ one: that range_cyclic logic goes back to 2006, as far as I can tell). But there might also be something I'm missing.
Anyway, your series looks fine to me. Linus
Re: [GIT PULL] Block fixes for 4.14-rc2
On Mon, Sep 25, 2017 at 2:17 PM, Chris Mason wrote: > > My understanding is that for order-0 page allocations and > kmem_cache_alloc(buffer_heads), GFP_NOFS is going to either loop forever or > at the very least OOM kill something before returning NULL? That should generally be true. We've occasionally screwed up in the VM, so an explicit GFP_NOFAIL would definitely be best if we then remove the looping in fs/buffer.c. >> What is it that triggers that many buffer heads in the first place? >> Because I thought we'd gotten to the point where all normal file IO >> can avoid the buffer heads entirely, and just directly work with >> making bio's from the pages. > > We're not triggering free_more_memory(). I ran a probe on a few production > machines and it didn't fire once over a 90 minute period of heavy load. The > main target of Jens' patchset was preventing shrink_inactive_list() -> > wakeup_flusher_threads() from creating millions of work items without any > rate limiting at all. So the two things I reacted to in that patch series were apparently things that you guys don't even care about. I reacted to the fs/buffer.c code, and to the change in laptop mode to not do circular writeback. The latter is another "it's probably ok, but it can be a subtle change". In particular, things that re-write the same thing over and over again can get very different behavior, even when you write out "all" pages. And I'm assuming you're not using laptop mode either on your servers (that sounds insane, but I remember somebody actually ended up using laptop mode even on servers, simply because they did *not* want the regular timed writeback model, so it's not quite as insane as it sounds). Linus
Re: [GIT PULL] Block fixes for 4.14-rc2
On Mon, Sep 25, 2017 at 7:46 AM, Jens Axboe wrote: > > Honestly, what I really wanted to do was kill that call completely. I can understand why you'd want to, but at the same time it is not entirely clear that is possible. So the problem is that we need more memory for buffer head allocations, and we *have* to get it (since we very likely need the buffer heads for writeout under low-mem circumstances) - but at the same time we obviously must not recurse into the filesystem. The end result being that any synchronous freeing is going to be very limited (GFP_NOFS really ends up limiting a lot), and so there's a fairly strong argument that we should then wake up other threads that *can* do filesystem writeback. Of course, the problem with _that_ is that it's quite likely that the flusher threads doing filesystem writeback may in turn get back to this very same "we need buffer heads". And that is, I assume, the case that facebook then hits. So it's kind of a "damned if you do, damned if you don't" situation. But this is very much also why that free_more_memory() code then (a) limits the writeout (b) has that "yield()" in it with the logic being to not cause unnecessarily much writeout, and hopefully avoid huge latency spikes because of the (potential) looping over the unsuccessful GFP_NOFS allocations. But yes, a lot has changed over time. For example, originally filesystems that used buffer heads used them both for reading and writing, so the act of reading in a page tended to also populate the buffer heads for that page. I think most filesystems these days avoid buffer heads entirely, and the dynamics of bh allocations have likely changed radically. So I wouldn't remove the "wakeup_flusher_threads()" call, but maybe it could be moved to be an unusual fallback case if the direct freeing doesn't work.
But none of this happens unless that alloc_buffer_head() failed, so at least one GFP_NOFS allocation has already failed miserably, which is why that "make other threads that _can_ do filesystem IO try to free things" exists in the first place. > If you read the free_more_memory() function, it looks like crap. I really don't see that. Think of what the preconditions are: - a GFP_NOFS allocation has failed - we obviously cannot call back into filesystem code and that function makes a fair amount of sense. It's _supposed_ to be an unusual nasty case, after all. Now, > What I really wanted to do was kill the > wakeup_flusher_threads() completely, we want to rely on > try_to_free_pages() starting targeted writeback instead of some magic > number of pages across devices. Yes, hopefully the regular memory allocation should already do the right thing, but the problem here is that we do need to loop over it, which it isn't really designed for. But one option might be to just say "_we_ shouldn't loop, the page allocator should just loop until it gets a successful end result". So remove _all_ of the fallback code entirely, and just use __GFP_NOFAIL. That is in fact what some filesystems use for their own metadata: you'll find that the GFP_NOFS | __GFP_NOFAIL combination is not entirely unheard of, and in fact that buffer code itself actually uses it for the backing page itself (not the buffer heads) in grow_dev_page(). So rather than play more games in free_more_memory(), I'd personally just rather remove that function entirely, and say "the buffer head allocations simply cannot fail" and leave it to the VM code to do the right thing. > But I decided that we should probably > just keep that random flusher wakeup, and tackle that part in a v2 for > 4.15. Oh, absolutely, I'd be more than happy to play with this code, but as mentioned, I actually would do even more radical surgery on it. So I'm not at all objecting to changing that function in major ways. 
I just don't think that changing the flush count is in any way obvious. In many ways it's a *much* scarier change than adding __GFP_NOFAIL, in my opinion. At least with __GFP_NOFAIL, we have some fairly good reason to believe that now the memory management code won't be doing enormous flushing work "just because". So my only real objection was really the timing, and the "it's obviously a fix". > The workload is nothing special. Getting to the point where it's bad > enough to trigger a softlockup probably requires some serious beating up > of the box, but getting to a suboptimal state wrt work units pending and > the flusher threads is pretty trivial. I'm actually surprised that you have what appears to be a buffer-head heavy workload at all. What is it that triggers that many buffer heads in the first place? Because I thought we'd gotten to the point where all normal file IO can avoid the buffer heads entirely, and just directly work with making bio's from the pages. So I assume this is some metadata writeout (ie either directory writeback or just the btrfs metadata tree?) Linus
Re: [GIT PULL] Block fixes for 4.14-rc2
On Sun, Sep 24, 2017 at 5:03 PM, Jens Axboe wrote: > > NVMe fixes have sometimes been accepted for the current series, while > they should have been pushed. I'm not going to argue with that, but I > think we've managed to make that better the last few series. It's still > not squeaky clean, but it is improving substantially. I don't believe > that's even been the case for core changes, which have always been > scrutinized much more heavily. The NVMe fixes actually looked like real fixes. It's not what I reacted to, I would have happily pulled those. The comment about "NVMe and generic block layer" was about these areas in general having been problem spots, and not the particular NVMe patches in this pull request. > I knew you might object to the writeback changes. As mentioned in the > pull request, if you diff the whole thing, it's really not big at all. Oh, it wasn't horribly big, and I agree that it was easy to read through, but it really was entirely new code and really should have been a merge window issue, not in a "fixes" pull. And I did check the history of those patches, and it was all very very recent and had recent comments on it too - it wasn't some old patches that had been around for a while. And while the patches were small, their effect was decidedly *not* obvious. For example, while the patches were easy to read, one of them changed the "try to free some memory" behavior fairly radically, going from write out "at most 1024 pages" to "everything". That may be a simple one-liner patch to read, but it sure as hell wasn't obvious what the behavioral change really is under different loads. And when the patch is two days old, I can pretty much guarantee that it hasn't been tested under a huge variety of loads. Maybe it does exactly the right thing for *some* load, but...
It honestly looked like that patch existed purely in order to make the next few patches be trivial ("no functional changes in this patch") simply because the functional changes were done earlier. That's all actually very nice for bisecting etc, and honestly, I liked how the patch series looked and was split out, but I liked how it looked as a *development* series. It didn't make me go "this is a bug fix". > It's mostly shuffling some arguments and things around, to make the > final patch trivial. And it does mean that it's super easy to review. See above. I think it was "super easy to review" only on a very superficial level, not on a deeper one. For example, I doubt you actually went back and looked at where that "1024" came from that you just changed to zero? It comes from the bitkeeper days, commit 407ee6c87e ("[PATCH] low-latency page reclaim") in the historical tree:

    [PATCH] low-latency page reclaim

    Convert the VM to not wait on other people's dirty data.

    - If we find a dirty page and its queue is not congested, do some writeback.
    - If we find a dirty page and its queue _is_ congested then just refile the page.
    - If we find a PageWriteback page then just refile the page.
    - There is additional throttling for write(2) callers. Within generic_file_write(), record their backing queue in ->current. Within page reclaim, if this task encounters a page which is dirty or under writeback on this queue, block on it. This gives some more writer throttling and reduces the page refiling frequency.

    It's somewhat CPU expensive - under really heavy load we only get a 50% reclaim rate in pages coming off the tail of the LRU. This can be fixed by splitting the inactive list into reclaimable and non-reclaimable lists. But the CPU load isn't too bad, and latency is much, much more important in these situations.

    Example: with `mem=512m', running 4 instances of `dbench 100', 2.5.34 took 35 minutes to compile a kernel. With this patch, it took three minutes, 45 seconds.
I haven't done swapcache or MAP_SHARED pages yet. If there's tons of dirty swapcache or mmap data around we still stall heavily in page reclaim. That's less important. This patch also has a tweak for swapless machines: don't even bother bringing anon pages onto the inactive list if there is no swap online. which introduced that "long nr_pages" argument to wakeup_bdflush(). Here's that patch for people who don't keep that historical tree around like I do: https://git.kernel.org/pub/scm/linux/kernel/git/tglx/history.git/commit/?id=407ee6c87e477434e3cb8be96885ed27b5539b6f Now, it's entirely possible that the issue that made us do that in the first place is long long gone, and we don't need that page limiter in the VM any more. But is it "super easy to review"? No, it sure as hell is not. > That's why I sent it in for this series. Yes, it's not months old code, > but it has been tested. And it's not like it's a lot of NEW code, it's > mostly removing code, or moving things around. > > I can resend without
Re: [GIT PULL] Block fixes for 4.14-rc2
On Sun, Sep 24, 2017 at 6:03 AM, Christoph Hellwig <h...@infradead.org> wrote: > On Fri, Sep 22, 2017 at 05:18:49PM -1000, Linus Torvalds wrote: >> WTF? Why is this so hard? It used to be that IDE drove people crazy. >> Now it's NVMe and generic block layer stuff. > > Can you please explain what problems you have with the nvme patches? This time it wasn't the NVMe parts (that was the last few releases). This time it was the random writeback patches that had been committed only days before, and that completely change some fundamental code. Some of the commits say "this changes no semantics", but that's after the previous commit just changed the argument values exactly so that the next commit wouldn't change semantics. Don't get me wrong - all the commits look perfectly _sane_. That's not my issue. But these commits literally showed up on lkml just a couple of days before getting sent to me, and even when they were sent the cover letter for the series of six patches was literally

    More graceful flusher thread memory reclaim wakeup

which *really* does not say "this is an important bug fix or regression" to me. What made it ok to send that outside the merge window? It looks like a nice cleanup, no question about it, and it probably fixes some behavioral problem at FB. But that behavioral problem is not *new*, as far as I can tell. So why was it sent as a bug-fix pull request? Now, if this was some other subsystem that _hadn't_ had problems in EVERY SINGLE recent merge window, I'd likely have been more than happy to take it early in the rc series because that subsystem wasn't on my list of "increased scrutiny due to historical issues". But as it is, the block layer _is_ on my list of increased scrutiny (and part of that reason is NVMe patches - again, "fixes" being sent that were really just normal on-going development). Linus
Re: [GIT PULL] Block fixes for 4.14-rc2
On Fri, Sep 22, 2017 at 9:32 AM, Jens Axboe wrote: > Hi Linus, > > A round of fixes for this series. This pull request contains: No, Jens. This isn't fixes. This is new code that does new things, or cleanups, or other random things mixed up with some fixes. Stop this fuckery already. The block layer has now become the biggest maintenance problem I have, because I get shit like this. Keep the fixes separate. Stop the random crazy changes. Stop the churn. WTF? Why is this so hard? It used to be that IDE drove people crazy. Now it's NVMe and generic block layer stuff. Linus
Re: [GIT PULL] Followup block pull request for 4.14-rc1
On Sat, Sep 9, 2017 at 12:54 PM, Linus Torvalds <torva...@linux-foundation.org> wrote: > > Other than that, we match. Oh, and I almost cried when I saw this nasty thing:

    ilog2(SZ_4K) - 9

in that nvme code, but I left it alone. Why the hell people think "SZ_4K" is somehow more legible than 4096, I have no idea. And depending on our ilog2() just doing the right thing at build time, instead of just saying "12" with a comment, is just odd. It's not like "ilog2(SZ_4K)" is a particularly good comment. That code should probably just make it clear that 4096 is the NVME page size, and document it as such, rather than picking a random name for a random constant. So something like

    #define NVME_PAGE_SIZE 4096
    #define NVME_PAGE_SHIFT 12

and then using *those* would actually have documented something, rather than these crazy odd random expressions. Linus
Re: [GIT PULL] Followup block pull request for 4.14-rc1
On Fri, Sep 8, 2017 at 10:34 AM, Jens Axboe wrote: > > Single merge conflict here, for nvme/rdma. Trivial to fixup, see my > for-next branch here: > > http://git.kernel.dk/cgit/linux-block/log/?h=for-next Your earlier merge for mm/page_io.c is ugly, and mixes up the bi_disk thing oddly in the middle of the task struct code. Other than that, we match. Linus
Re: [GIT PULL] First set of block changes for 4.14-rc1
On Thu, Sep 7, 2017 at 12:27 PM, Jens Axboe wrote: > > Which was committed yesterday? It was not from my tree. I try to keep > an eye out for potential conflicts or issues. It was from Andrew, so I'm assuming it was in linux-next. Not as a git tree, but as the usual akpm branch. I'm not sure why I didn't see any reports from linux-next about it, though - and I looked. There was no actual visible merge problem, because the conflict was not a data conflict but a semantic one. But the issue showed up for a trivial allmodconfig build, so it *should* have been caught automatically. But it's not even the conflict that annoys me. It's the *reason* for the conflict - the block layer churn that you said was done. We've had too much of it. >> We need *stability* by now, after these crazy changes. Make sure that >> happens. > > I am pushing back, but I can't embargo any sort of API change. This one has > good reasoning behind it, which is actually nicely explained in the commit > itself. It's not pointless churn, which I would define as change just for > the sake of change itself. Or pointless cleanups. You can definitely put your foot down on any more random changes to the bio infrastructure. Including for good reasons. We have had some serious issues in the block layer - and I'm not talking about the merge conflicts. I'm talking about just the collateral damage, with things like SCSI having to revert using blk-mq by default etc. Christ, things like that are pretty *fundamental*, wouldn't you say? Get those right before doing more churn. And aim to have one or two releases literally just fixing things, with a "no churn" rule. Because the block layer really has been a black spot lately. Linus
Re: [GIT PULL] First set of block changes for 4.14-rc1
On Tue, Sep 5, 2017 at 7:31 AM, Jens Axboe wrote: > > Note that you'll hit a conflict in block/bio-integrity.c and > mm/page_io.c, since we had fixes later in the 4.13 series for both of > those that ended up conflicting with new changes. Both are trivial to > fix up, I've included my resolution at the end of this email. Actually, the _real_ conflict is with commit 8e654f8fbff5 ("zram: read page from backing device") and the preceding code, which added a "zram->bdev" that is the backing dev for the zram device. And then it does

    bio->bi_bdev = zram->bdev;

but now the block layer has AGAIN changed some stupid internal detail for no obvious reason, so now it doesn't work. My initial thought was to just do

    bio->bi_disk = zram->disk;

instead, which *syntactically* seems to make sense, and match things up in the obvious way. But when I actually thought about it more, I decided that's bullshit. "zram->disk" is not the backing device at all, it's the zram interface itself. The correct resolution seems to be

    bio_set_dev(bio, zram->bdev);

but *dammit*, Jens, this insane crazy constant "let's randomly change block layer details around" needs to STOP. This has been going on for too many releases by now. What the f*ck is *wrong* with you and Christoph that you can't just leave this thing alone? Stop the pointless churn already. When I saw your comment:

    It's a quiet series this round, which I think we needed after the
    churn of the last few series.

I was like "finally". And then this shows that you were just lying, and the pointless churn is still happening. Stop it. Really. Here's a hint: if some stupid change requires you to mindlessly do 150 changes using a wrapper helper, it's churn. We need *stability* by now, after these crazy changes. Make sure that happens. Linus
Re: Lots of new warnings with gcc-7.1.1
On Tue, Jul 11, 2017 at 8:10 PM, Guenter Roeck wrote: > > The hwmon warnings are all about supporting no more than 9,999 sensors > (applesmc) to 999,999,999 sensors (scpi) of a given type. Yeah, I think that's enough. > Easy "fix" would be to replace snprintf() with scnprintf(), presumably > because gcc doesn't know about scnprintf(). If that's the case, I'd prefer just turning off the format-truncation (but not overflow) warning with "-Wno-format-truncation". But maybe we can at least start it on a subsystem-by-subsystem basis after people have verified their own subsystem? Linus
Lots of new warnings with gcc-7.1.1
[ Very random list of maintainers and mailing lists, at least partially by number of warnings generated by gcc-7.1.1 that is then correlated with the get_maintainers script ] So I upgraded one of my boxes to F26, which upgraded the compiler to gcc-7.1.1. Which in turn means that my nice clean allmodconfig compile is now an unholy mess of annoying new warnings. Normally I hate the stupid new warnings, but this time around they are actually exactly the kinds of warnings you'd want to see, and that are hard for humans to pick out: lots of format errors wrt limited buffer sizes. At the same time, many of them *are* annoying. We have various limited buffers that are limited for a good reason, and some of the format truncation warnings are about numbers in the range [0, MAX_INT], where we definitely know that we don't need to worry about the really big ones. After all, we're using "snprintf()" for a reason - we *want* to truncate if the buffer is too small. But a lot of the warnings look reasonable, and at least the warnings are nice in how they actually explain why the warning is happening. Example:

    arch/x86/platform/intel-mid/device_libs/platform_max7315.c: In function ‘max7315_platform_data’:
    arch/x86/platform/intel-mid/device_libs/platform_max7315.c:41:35: warning: ‘%d’ directive writing between 1 and 11 bytes into a region of size 9 [-Wformat-overflow=]
      sprintf(base_pin_name, "max7315_%d_base", nr);
                                      ^~
    arch/x86/platform/intel-mid/device_libs/platform_max7315.c:41:26: note: directive argument in the range [-2147483647, 2147483647]

Yeah, the compiler is technically correct, but we already made sure we have at most MAX7315_NUM of those adapters, so no, "nr" is really not going to be a 10-digit number. So the warning is kind of bogus. At the same time, others aren't quite as insane, and in many cases the warnings might be easy to just fix.
And some actually look valid, although they might still require odd input:

    net/bluetooth/smp.c: In function ‘le_max_key_size_read’:
    net/bluetooth/smp.c:3372:29: warning: ‘snprintf’ output may be truncated before the last format character [-Wformat-truncation=]
      snprintf(buf, sizeof(buf), "%2u\n", SMP_DEV(hdev)->max_key_size);
                                 ^~~
    net/bluetooth/smp.c:3372:2: note: ‘snprintf’ output between 4 and 5 bytes into a destination of size 4

yeah, "max_key_size" is unsigned char, but if it's larger than 99 it really does need 5 bytes for "%2u\n" with the terminating NUL character. Of course, the "%2u" implies that people expect it to be < 100, but at the same time it doesn't sound like a bad idea to just make the buffer be one byte bigger. So.. Anyway, it would be lovely if some of the more affected developers would take a look at gcc-7.1.1 warnings. Right now I get about three *thousand* lines of warnings from a "make allmodconfig" build, which makes them a bit overwhelming. I do suspect I'll make "-Wformat-truncation" (as opposed to "-Wformat-overflow") be a "V=1" kind of warning. But let's see how many of these we can fix, ok? Linus
Re: [GIT PULL] Core block/IO changes for 4.13-rc
On Sun, Jul 2, 2017 at 4:44 PM, Jens Axboe wrote: > > This is the main pull request for the block layer for 4.13. Not a huge > round in terms of features, but there's a lot of churn related to some > core cleanups. Note that merge request will throw 3 merge failures for > you. I've included how I resolved them in the for-next (or > for-4.13/merge) branch after the pull stats, at the end of this email. So Jens, I got a semi-complaint about this constant churn in an entirely unrelated scheduler tracing thread. Basically, the block IO layer is apparently really painful to do various instrumentation on, because it keeps having these random idiotic interface churn with the status and flags and various random command bytes changing semantics. Can we stop with the masturbatory "let's just randomly change things around" already? It doesn't just result in annoying merge conflicts, it's just *confusing* to people when the interfaces keep changing. And confusion is bad. And the fact that these things keep changing is also, I think, a sign of something being fundamentally wrong. WTF is going on that the block layer *constantly* ends up having these stupid "let's move random command flag bits from one field to another" or randomly changing the error returns? Anyway, all merged up in my tree, but your "for-4.13/merge" branch did *not* match what I got from you. Your main for-4.13 branch had some additional random lightnvm/pblk changes too. Linus
Re: [GIT PULL] Block pull request for- 4.11-rc1
On Fri, Feb 24, 2017 at 9:39 AM, Bart Van Assche wrote: > > So the crash is caused by an attempt to dereference address 0x6b6b6b6b6b6b6b6b > at offset 0x270. I think this means the crash is caused by a use-after-free. Yeah, that's POISON_FREE, and that might explain why you see crashes that others don't - you obviously have SLAB poisoning enabled. Jens may not have. %rdi is "struct mapped_device *md", which came from dm_softirq_done() doing

    struct dm_rq_target_io *tio = tio_from_request(rq);
    struct request *clone = tio->clone;
    int rw;

    if (!clone) {
        rq_end_stats(tio->md, rq);
        rw = rq_data_dir(rq);
        if (!rq->q->mq_ops)
            blk_end_request_all(rq, tio->error);
        else
            blk_mq_end_request(rq, tio->error);
        rq_completed(tio->md, rw, false);
        return;
    }

so it's the 'tio' pointer that has been free'd. But it's worth noting that we did apparently successfully dereference "tio" earlier in that dm_softirq_done() *without* getting the poison value, so what I think might be going on is that the 'tio' thing gets free'd when the code does the blk_end_request_all()/blk_mq_end_request() call. Which makes sense - that ends the lifetime of the request, which in turn also ends the lifetime of the "tio_from_request()", no? So the fix may be as simple as just doing

    if (!clone) {
        struct mapped_device *md = tio->md;

        rq_end_stats(md, rq);
        ...
        rq_completed(md, rw, false);
        return;
    }

because the 'mapped_device' pointer hopefully is still valid, it's just 'tio' that has been freed. Jens? Bart? Christoph? Somebody who knows this code should double-check my thinking above. I don't actually know the tio lifetimes, I'm just going by looking at how earlier accesses seemed to be fine (eg that "tio->clone" got us NULL, not a POISON_FREE pointer, for example). Linus
Re: [GIT PULL] Block pull request for- 4.11-rc1
On Wed, Feb 22, 2017 at 1:50 PM, Markus Trippelsdorf wrote: > > But what about e.g. SATA SSDs? Wouldn't they be better off without any > scheduler? > So perhaps setting "none" for queue/rotational==0 and mq-deadline for > spinning drives automatically in the sq blk-mq case? Jens already said that the merging advantage can outweigh the costs, but he didn't actually talk much about it. The scheduler advantage can outweigh the costs of running a scheduler by an absolutely _huge_ amount. An SSD isn't zero-cost, and each command tends to have some fixed overhead on the controller, and pretty much all SSD's heavily prefer fewer large requests over lots of tiny ones. There are also fairness/latency issues that tend to very heavily favor having an actual scheduler, ie reads want to be scheduled before writes on an SSD (within reason) in order to make latency better. Ten years ago, there were lots of people who argued that you don't want to do scheduling for SSD's, because SSD's were so fast that you only added overhead. Nobody really believes that fairytale any more. So you might have particular loads that look better with noop, but they will be rare and far between. Try it, by all means, and if it works for you, set it in your udev rules. The main place where a noop scheduler currently might make sense is likely for a ramdisk, but quite frankly, since the main real usecase for a ram-disk tends to be to make it easy to profile and find the bottlenecks for performance analysis (for emulating future "infinitely fast" media), even that isn't true - using noop there defeats the whole purpose. Linus
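For the "set it in your udev rules" suggestion, a rule along the lines sketched below is the usual shape. The file path is an example, and the available scheduler names depend on the kernel build, so treat this as a sketch rather than a drop-in config:

```
# /etc/udev/rules.d/60-iosched.rules (example path)
# Non-rotational devices (SSDs): try "none"; spinning disks: mq-deadline.
ACTION=="add|change", KERNEL=="sd[a-z]", ATTR{queue/rotational}=="0", \
    ATTR{queue/scheduler}="none"
ACTION=="add|change", KERNEL=="sd[a-z]", ATTR{queue/rotational}=="1", \
    ATTR{queue/scheduler}="mq-deadline"
```

The same switch can be tested at runtime before committing to a rule, e.g. by reading `/sys/block/sda/queue/scheduler` (the active choice appears in brackets) and writing a new name into it.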
Re: [GIT PULL] Block pull request for- 4.11-rc1
On Wed, Feb 22, 2017 at 10:58 AM, Jens Axboe <ax...@kernel.dk> wrote: > On 02/22/2017 11:56 AM, Linus Torvalds wrote: > > OK, so here's what I'll do: > > 1) We'll kill the default scheduler choices. sq blk-mq will default to >mq-deadline, mq blk-mq will default to "none" (at least for now, until >the new scheduler is done). > 2) The individual schedulers will be y/m/n selectable, just like any >other driver. Yes. That makes sense as options. I can (or, perhaps even more importantly, a distro can) answer those kinds of questions. Linus
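The resolution agreed above — kill the default-choice questions, make the schedulers plain y/m/n options — could look something like the sketch below in Kconfig terms. The option name and help text are illustrative, not necessarily what was merged into block/Kconfig.iosched:

```
# Sketch: each scheduler becomes an ordinary tristate option, and the
# "default I/O scheduler" choice question disappears entirely.
config MQ_IOSCHED_DEADLINE
	tristate "MQ deadline I/O scheduler"
	default y
	help
	  MQ version of the deadline I/O scheduler. Used as the default
	  for single-queue blk-mq devices; no Kconfig question asks the
	  user to pick a default.
```

This keeps the user-facing question at the level a user (or distro) can actually answer — "do I want this scheduler built?" — while the default policy stays in the kernel.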
Re: [GIT PULL] Block pull request for- 4.11-rc1
On Wed, Feb 22, 2017 at 10:52 AM, Jens Axboe wrote: >> >> It's that simple. > > No, it's not that simple at all. Fact is, some optimizations make sense > for some workloads, and some do not. Are you even listening? I'm saying no user can ever give a sane answer to your question. The question is insane and wrong. I already said you can have a dynamic configuration (and maybe even an automatic heuristic - like saying that a ramdisk gets NOOP by default, real hardware does not). But asking a user at kernel config time for a default is insane. If *you* cannot answer it, then the user sure as hell cannot. Other configuration questions have problems too, but at least the question about "should I support ext4" is something a user (or distro) can sanely answer. So your comparisons are pure bullshit. Linus
Re: [GIT PULL] Block pull request for- 4.11-rc1
On Wed, Feb 22, 2017 at 10:41 AM, Jens Axboe wrote: > > The fact is that we have two different sets, until we can yank > the old ones. So I can't just ask one question, since the sets > aren't identical. Bullshit. I'm saying: rip out the question ENTIRELY. For *both* cases. If you cannot yourself give a good answer, then there's no f*cking way any user can give a good answer. So asking the question is totally and utterly pointless. All it means is that different people will try different (in random ways) configurations, and the end result is random crap. So get rid of those questions. Pick a default, and live with it. And if people complain about performance, fix the performance issue. It's that simple. Linus
Re: [GIT PULL] Block pull request for- 4.11-rc1
On Wed, Feb 22, 2017 at 10:26 AM, Linus Torvalds <torva...@linux-foundation.org> wrote: > > And dammit, IF YOU DON'T EVEN KNOW, WHY THE HELL ARE YOU ASKING THE POOR USER? Basically, I'm pushing back on config options that I can't personally even sanely answer. If it's a config option about "do I have a particular piece of hardware", it makes sense. But these new ones were just complete garbage. The whole "default IO scheduler" thing is a disease. We should stop making up these shit schedulers and then say "we don't know which one works best for you". All it does is encourage developers to make shortcuts and create crap that isn't generically useful, and then blame the user and say "well, you should have picked a different scheduler" when they say "this does not work well for me". We have had too many of those kinds of broken choices. And when the new Kconfig options get so confusing and so esoteric that I go "Hmm, I have no idea if my hardware does a single queue or not", I put my foot down. When the IO scheduler questions were about a generic IO scheduler for everything, I can kind of understand them. I think it was still a mistake (for the reasons outlined above), but at least it was a comprehensible question to ask. But when it gets to "what should I do about a single-queue version of a MQ scheduler", the question is no longer even remotely sensible. The question should simply NOT EXIST. There is no possible valid reason to ask that kind of crap. Linus
Re: [GIT PULL] Block pull request for- 4.11-rc1
On Wed, Feb 22, 2017 at 10:14 AM, Jens Axboe wrote: > > What do you mean by "the regular IO scheduler"? These are different > schedulers. Not to the user they aren't. If the user already answered once about the IO schedulers, we damn well shouldn't ask again about another small implementation detail. How hard is this to understand? You're asking users stupid things. It's not just about the wording. It's a fundamental issue. These questions are about internal implementation details. They make no sense to a user. They don't even make sense to a kernel developer, for chrissake! Don't make the kconfig mess worse. This "we can't make good defaults in the kernel, so ask users about random things that they cannot possibly answer" model is not an acceptable model. If the new schedulers aren't better than NOOP, they shouldn't exist. And if you want people to be able to test, they should be dynamic. And dammit, IF YOU DON'T EVEN KNOW, WHY THE HELL ARE YOU ASKING THE POOR USER? It's really that simple. Linus
Re: [GIT PULL] Block pull request for 4.11-rc1
On Tue, Feb 21, 2017 at 3:15 PM, Jens Axboe wrote: > > But under a device managed by blk-mq, that device exposes a number of > hardware queues. For older style devices, that number is typically 1 > (single queue). ... but why would this ever be different from the normal IO scheduler? IOW, what makes single-queue mq scheduling so special that (a) it needs its own config option (b) it is different from just the regular IO scheduler in the first place? So the whole thing stinks. The fact that it then has an incomprehensible config option seems to be just gravy on top of the crap. > "none" just means that we don't have a scheduler attached. .. which makes no sense to me in the first place. People used to try to convince us that doing IO schedulers was a mistake, because modern disk hardware did a better job than we could do in software. Those people were full of crap. The regular IO scheduler used to have a "NONE" option too. Maybe it even still has one, but only insane people actually use it. Why is the MQ stuff magically so different that NONE would make sense at all? And equally importantly: why do we _ask_ people these issues? Is this some kind of sick "cover your ass" thing, where you can say "well, I asked about it", when inevitably the choice ends up being the wrong one? We have too damn many Kconfig options as-is, I'm trying to push back on them. These two options seem fundamentally broken and stupid. The "we have no good idea, so let's add a Kconfig option" seems like a broken excuse for these things existing. So why ask this question in the first place? Is there any possible reason why "NONE" is a good option at all? And if it is the _only_ option (because no other better choice exists), it damn well shouldn't be a kconfig option! Linus
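[Editorial aside: the "dynamic" alternative Linus alludes to already exists at runtime. blk-mq schedulers can be inspected and switched per-device through sysfs, where the entry in square brackets marks the active scheduler. The sketch below parses a hard-coded sample of that file format rather than reading a real device; the device name sda and the sample scheduler list are assumptions for illustration.]

```shell
# On a real system the scheduler list for a device lives in sysfs:
#   cat /sys/block/sda/queue/scheduler
# and prints something like: [mq-deadline] kyber none
# Here we parse a hard-coded sample of that format instead of real hardware.
sched="[mq-deadline] kyber none"

# The active scheduler is the bracketed entry; strip the brackets.
active=$(printf '%s\n' "$sched" | grep -o '\[[^]]*\]' | tr -d '[]')
echo "active scheduler: $active"

# Switching at runtime (needs root) would be a plain write, no Kconfig involved:
#   echo none > /sys/block/sda/queue/scheduler
```

This is exactly why a compile-time default is a weak answer: the choice can be made (and changed) per device after boot.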
Re: [GIT PULL] Block pull request for 4.11-rc1
Hmm. The new config options are incomprehensible and their help messages don't actually help. So can you fill in the blanks on what the "Default single-queue blk-mq I/O scheduler" and "Default multi-queue blk-mq I/O scheduler" config options mean, and why they default to none? The config phase of the kernel is one of the worst parts of the whole project, and adding these kinds of random and incomprehensible config options does *not* help. Linus
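[Editorial aside: for readers without the tree at hand, the options being complained about were Kconfig `choice` blocks in block/Kconfig.iosched. The fragment below is a paraphrased sketch of their shape from memory, not the exact 4.11 text; the symbol names shown are assumptions.]

```
choice
	prompt "Default single-queue blk-mq I/O scheduler"
	default DEFAULT_SQ_NONE
	help
	  Select the default scheduler for blk-mq devices with one hw queue.

	config DEFAULT_SQ_DEADLINE
		bool "MQ deadline" if MQ_IOSCHED_DEADLINE=y

	config DEFAULT_SQ_NONE
		bool "None"
endchoice
```

A parallel "Default multi-queue blk-mq I/O scheduler" choice existed alongside it; both were later removed in favor of in-kernel defaults plus the runtime sysfs knob.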
Re: [4.10, panic, regression] iscsi: null pointer deref at iscsi_tcp_segment_done+0x20d/0x2e0
On Sat, Jan 7, 2017 at 6:02 PM, Johannes Weiner wrote: > > Linus? Andrew? Looks fine to me. Will apply. Linus -- To unsubscribe from this list: send the line "unsubscribe linux-block" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [4.10, panic, regression] iscsi: null pointer deref at iscsi_tcp_segment_done+0x20d/0x2e0
On Wed, Dec 21, 2016 at 9:13 PM, Dave Chinner wrote: > > There may be deeper issues. I just started running scalability tests > (e.g. 16-way fsmark create tests) and about a minute in I got a > directory corruption reported - something I hadn't seen in the dev > cycle at all. By "in the dev cycle", do you mean your XFS changes, or have you been tracking the merge cycle at least for some testing? > I unmounted the fs, mkfs'd it again, ran the > workload again and about a minute in this fired: > > [628867.607417] [ cut here ] > [628867.608603] WARNING: CPU: 2 PID: 16925 at mm/workingset.c:461 > shadow_lru_isolate+0x171/0x220 Well, part of the changes during the merge window were the shadow entry tracking changes that came in through Andrew's tree. Adding Johannes Weiner to the participants. > Now, this workload does not touch the page cache at all - it's > entirely an XFS metadata workload, so it should not really be > affecting the working set code. Well, I suspect that anything that creates memory pressure will end up triggering the working set code, so .. That said, obviously memory corruption could be involved and result in random issues too, but I wouldn't really expect that in this code. It would probably be really useful to get more data points - is the problem reliably in this area, or is it going to be random and all over the place. That said: > And worse, on that last error, the /host/ is now going into meltdown > (running 4.7.5) with 32 CPUs all burning down in ACPI code: The obvious question here is how much you trust the environment if the host ends up also showing problems. Maybe you do end up having hw issues pop up too. The primary suspect would presumably be the development kernel you're testing triggering something, but it has to be asked.. Linus
Re: [GIT PULL] Block fixes for 4.9-rc
On Fri, Nov 11, 2016 at 4:11 PM, Jens Axboe wrote: > Hi Linus, > > Three small (really, one liners all of them!) fixes that should go into > this series: What about the aoeblk one? That seems to have come in with a tester lately. From your original email: "I'm wondering if this is bio iteration breakage. aoeblk does this weird inc/dec of page counts, which should not be needed. At least others would be hit by this as well. In any case, should suffice for a test, before we look further. Can anyone test with this debug patch?" Anyway, that bug seems to have been around forever and I'm not seeing a lot of complaints, but I thought I'd ask. Your oneliners pulled. Except when I pull, I don't actually get this one: Matias Bjørling (1): lightnvm: invalid offset calculation for lba_shift but since you know how very deeply I care about lightnvm, I'm not finding it in myself to worry about why that one was missing. Linus