Re: [GIT] Networking
On Sun, Oct 28, 2018 at 7:46 PM David Miller wrote: > > Please pull, thanks a lot! Pulled, Linus
Re: [PATCH 1/2] net: handle NULL ->poll gracefully
On Fri, Jun 29, 2018 at 6:46 AM Linus Torvalds wrote: > > Hmm. I'll have to think about it. I took yours over the revert. It's just smaller and nicer, and the conditional should predict fine for any case where it might matter. Linus
Re: [PATCH 1/2] net: handle NULL ->poll gracefully
On Fri, Jun 29, 2018 at 6:37 AM Christoph Hellwig wrote: > > The big aio poll revert broke various network protocols that don't > implement ->poll as a patch in the aio poll series removed sock_no_poll > and made the common code handle this case. Hmm. I just did the revert of commit 984652dd8b1f ("net: remove sock_no_poll") in my tree. But I haven't pushed it out, and your patch is certainly smaller/nicer than the revert. That said, the revert does have the advantage of avoiding a conditional in the network poll() function (at the cost of a more expensive indirect call for when sock_no_poll actually triggers, but only "odd" network protocols have that). Hmm. I'll have to think about it. Linus
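[Editor's sketch: the "smaller/nicer" fix being discussed presumably amounts to handling a NULL ->poll in the common poll path instead of forcing every protocol to carry a sock_no_poll stub. Something of this shape - the helper name here is illustrative, not necessarily the actual patch:

static inline __poll_t do_file_poll(struct file *file,
				    struct poll_table_struct *pt)
{
	/* A missing ->poll means "always ready", mirroring what a
	 * sock_no_poll stub would have reported.
	 */
	if (unlikely(!file->f_op->poll))
		return DEFAULT_POLLMASK;	/* EPOLLIN|EPOLLOUT|EPOLLRDNORM|EPOLLWRNORM */
	return file->f_op->poll(file, pt);
}

The conditional Linus mentions is the unlikely() branch above, which should predict essentially for free on protocols that do implement ->poll.]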
Re: [PATCH 6/6] fs: replace f_ops->get_poll_head with a static ->f_poll_head pointer
On Fri, Jun 29, 2018 at 6:29 AM Christoph Hellwig wrote: > No need for poll_table_entry, we just need a wait_queue_head. > poll_table_entry is a select.c internal (except for two nasty drivers) - > neither epoll nor most in-kernel callers use it. Well, you need the poll_table for the "poll_wait()", and you would need to be able to have multiple wait-queues even though you only have one file you're waiting for. But yes, it doesn't really need to be the full complex poll_table_entry model. Linus
Re: [PATCH 6/6] fs: replace f_ops->get_poll_head with a static ->f_poll_head pointer
On Thu, Jun 28, 2018 at 4:37 PM Al Viro wrote: > > You underestimate the nastiness of that thing (and for the record, I'm > a lot *less* fond of AIO than you are, what with having had to read that > nest of horrors lately). It does not "copy the data to userland"; what it > does instead is copying into an array of pages it keeps, right from > IO completion callback. Ugh. Oh well. I'd be perfectly happy to have somebody re-write and re-architect the aio code entirely. Much rather that than the ->poll() code. Because I know which one I think is well-designed with a nice usable interface, and which one is a pile of shit. In the meantime, if AIO wants to do poll() in some irq callback, may I suggest just using workqueues. Linus
Re: [PATCH 6/6] fs: replace f_ops->get_poll_head with a static ->f_poll_head pointer
On Thu, Jun 28, 2018 at 3:49 PM Al Viro wrote: > > aio_poll() is not a problem. It's wakeup callback that is one. No, it's not a problem either. The only thing the wakeup callback wants to do is to remove the wait queue entry. And *that* doesn't need to sleep, and it has absolutely nothing to do with whether "->poll()" itself sleeps or not. All that needs to do is "poll_freewait()". Done. In practice, it's probably only going to be a single free_poll_entry(), but who cares? The AIO code should handle the "maybe ->poll() added multiple wait-queues" just fine. > You are misreading that mess. What he's trying to do (other than surviving > the awful clusterfuck around cancels) is to handle the decision what to > report to userland right in the wakeup callback. *That* is what really > drives the "make the second-pass ->poll() or something similar to it > non-blocking" (in addition to the fact that it is such in considerable > majority of instances). That's just crazy BS. Just call poll() again when you copy the data to userland (which by definition can block, again). Stop the idiotic "let's break poll for stupid AIO reasons, because the AIO people are morons". Just make the AIO code do the sane thing. Linus
Re: [PATCH 6/6] fs: replace f_ops->get_poll_head with a static ->f_poll_head pointer
On Thu, Jun 28, 2018 at 3:20 PM Al Viro wrote: > > The rules for drivers change only in one respect - if your ->poll() is going > to > need to block, check poll_requested_events(pt) & EPOLL_ATOMIC and return > EPOLLNVAL > in such case. I still don't even understand why you care. Yes, the AIO poll implementation did it under the spinlock. But there's no good *reason* for that. The "aio_poll()" function itself is called in perfectly fine blocking context. The only reason it does it under the spinlock is that apparently Christoph didn't understand how poll() worked. As far as I can tell, Christoph could have just done the first pass '->poll()' *without* taking a spinlock, and that adds the table entry to the table. Then, *under the spinlock*, you associate the table with the kioctx. And then *after* the spinlock, you can call "->poll()" again (now with a NULL table pointer), to verify that the state is still not triggered. That's the whole point of the two-phase poll thing - the first phase adds the entry to the wait queues, and the second phase checks for the race of "did the event happen in the meantime". There is absolutely no excuse for calling '->poll()' itself under the spinlock. I don't see any reason for it. The whole "AIO needs this to avoid races" was always complete and utter bullshit, as far as I can tell. So stop it with this crazy and pointless "poll() might block". IT DAMN WELL SHOULD BE ABLE TO BLOCK, AND NOBODY SANE WILL EVER CARE! If somebody cares, they are doing things wrong. So fix the AIO code, don't look at the poll() code, for chrissake! Linus
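[Editor's sketch: the sequence Linus outlines would look something like this as kernel-flavored pseudocode - the req/ctx field names are invented for illustration, this is not the actual fs/aio.c code:

	/* Pass 1: blocking context, no spinlock held. __pollwait (via
	 * the poll table) adds us to the file's wait queue(s) and is
	 * allowed to sleep.
	 */
	mask = file->f_op->poll(file, &req->poll_table);

	/* Only the bookkeeping goes under the lock: associate the
	 * table with the kioctx. No ->poll() call in here.
	 */
	spin_lock_irq(&ctx->ctx_lock);
	list_add_tail(&req->ki_list, &ctx->active_reqs);
	spin_unlock_irq(&ctx->ctx_lock);

	/* Pass 2: NULL table, so nothing gets queued; this only
	 * re-reads the state, closing the "event arrived while we
	 * were queueing ourselves" window.
	 */
	if (!mask)
		mask = file->f_op->poll(file, NULL);
]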
Re: [PATCH 6/6] fs: replace f_ops->get_poll_head with a static ->f_poll_head pointer
On Thu, Jun 28, 2018 at 2:30 PM Al Viro wrote: > > > Again, locking is permitted. It's not great, but it's not against the rules. > > Me: a *LOT* of ->poll() instances only block in __pollwait() called > (indirectly) > on the first pass. > > You: They are *all* supposed to do it. > > Me: Oh, I thought you were talking about the whole "first pass" adding to wait queues, as opposed to doing it on the second pass. The *blocking* is entirely immaterial. I didn't even react to it, because it's simply not an issue. I don't understand why you're even hung up about it. The only reason "blocking" seems to be an issue is because AIO has shit-for-brains and wanted to do poll() under the spinlock. But that's literally just AIO being confused garbage. It has zero relevance for anything else. Linus
Re: [PATCH 6/6] fs: replace f_ops->get_poll_head with a static ->f_poll_head pointer
On Thu, Jun 28, 2018 at 1:37 PM Al Viro wrote: > > > Speaking of obvious bogosities (a lot more so than a blocking allocation > several calls down into helper): > > static __poll_t ca8210_test_int_poll( > struct file *filp, > struct poll_table_struct *ptable > ) Ok, that's just garbage. Again, this is not an excuse for changing "->poll()". This is just a bogus driver that does stupid things. And again, don't use this as an example of "poll must not block" The fact is, poll() *can* block for locking and other such things, and it's perfectly normal and ok. It just shouldn't block for IO. I will not take any misguided patches to try to "fix" poll functions that might block. But I will take patches to fix bugs in drivers. In this case, I think the fix is to just remove that crazy "wait_event_interruptible()" entirely. Not that anybody cares, obviously. Apparently nobody has ever noticed how broken that poll function is. Linus
Re: [PATCH 6/6] fs: replace f_ops->get_poll_head with a static ->f_poll_head pointer
On Thu, Jun 28, 2018 at 1:28 PM Al Viro wrote: > > > Sure, but...
>
> static __poll_t binder_poll(struct file *filp,
>                             struct poll_table_struct *wait)
> {
>         struct binder_proc *proc = filp->private_data;
>         struct binder_thread *thread = NULL;
>         bool wait_for_proc_work;
>
>         thread = binder_get_thread(proc);
>         if (!thread)
>                 return POLLERR;

That's actually fine. In particular, it's ok to *not* add yourself to the wait-queues if you already return the value that you will always return. For example, the poll function for a regular file could just always return EPOLLIN | EPOLLOUT | EPOLLRDNORM | EPOLLWRNORM because it never changes. Now, it doesn't actually *do* that, because we have the rule that "no ->poll function" means exactly that, but it's not wrong. Similarly, if a poll handler has an error that will not go away, the above is actually the *right* thing to do. There simply isn't anything to wait for. Arguably, it probably should return not just POLLERR, but also POLLIN/POLLOUT, since an error *also* means that it's going to return immediately from read/write, but that's a separate thing entirely. > And that's hardly unique - we have instances playing with timers, > allocations, whatnot. Even straight mutex_lock(), as in So? Again, locking is permitted. It's not great, but it's not against the rules. So none of the things you point to are excuses for changing interfaces or adding any flags. And no, we don't care at all about "blocking" in general. Somebody who cares about _performance_ may care, but it's not fundamental. Even select() and poll() itself will block for allocating select tables etc, but also for reading (and updating) user space. Anybody who thinks "select cannot block" or "->poll() mustn't block" is just confused. It has *never* been about that. It waits asynchronously for IO, but it may well wait synchronously for locks or memory or just "lazy implementation". And none of this has anything to do with the ->poll() interface itself.
If you want to improve performance on some _particular_ file and actually worry about locking etc, you fix that particular implementation. You don't go and change the interface for everybody. The fact is, those interface changes were just broken shit. They were confused. I don't actually believe that AIO even needed them. Christoph, do you have a test program for IOCB_CMD_POLL and what it's actually supposed to do? Because I think that what it can do is simply to do the ->poll() calls outside the iocb locks, and then just attach the poll table to the kioctx afterwards. This whole "poll must not block" is a complete red herring. It doesn't come from any other requirements than BAD AIO GARBAGE CODE. Seriously. Stop thinking this has to happen inside some spinlocked region. That's AIO braindamage, it's irrelevant. Linus
Re: [PATCH 6/6] fs: replace f_ops->get_poll_head with a static ->f_poll_head pointer
On Thu, Jun 28, 2018 at 11:17 AM Al Viro wrote: > > As for what can be salvaged out of the whole mess, > * probably the most serious lesson is that INDIRECT CALLS ARE > COSTLY NOW and shouldn't be used lightly. Note that this has always been true, it's just _way_ more obvious now. Indirect calls are costly not just because of the nasty 20+ cycle cost of the stupid spectre overhead (which hopefully will be a thing of the past in a few years as people upgrade), but because they are a pretty basic barrier to optimization, both for compilers but also just for _people_. Look at a lot of vfs optimization we've done, like all the name hashing optimization etc. We basically fall flat on our face if a filesystem implements its own name hash function, not _just_ because of the cost of the indirect function call, but because it suddenly means that the filesystem is doing its own thing and all the clever work we did to integrate name hashing with copying the name no longer works. So I really want to avoid indirect calls. And when they *do* happen, I want to avoid the model where people think of them as low-level object accessor functions - the C++ disease. I want indirect function calls to make sense at a higher level, and do some real operation. End result: I really despised the new poll() model. Yes, the performance report was what made me *notice*, but then when I looked at the code I went "No". Using an indirect call as some low-level accessor function really is fundamentally wrong. Don't do it. It's badly designed. Our VFS operations are _high-level_ operations, where we do one single call for a whole "read()" operation. "->poll()" used to be the same. The new "->get_poll_head()" and "->poll_mask()" was just bad, bad, bad. > * having an optional pointer to wait_queue_head in struct file > is probably a good idea; a lot of ->poll() instances do have the same > form.
Even if sockets do not (and I'm not all that happy about the > solution in the latest series), the instances that do are common and > important enough. Right. I don't hate the poll wait-queue pointer. That said, I do hope that we can simply write things so as to not even need it. > * a *LOT* of ->poll() instances only block in __pollwait() > called (indirectly) on the first pass. They are *all* supposed to do it. The whole idea with "->poll()" is that the model of operation is: - add_wait_queue() and return state on the first pass - on subsequent passes (or if somebody else already returned a state that means we already know we're not going to wait), the poll table is NULL, so you *CANNOT* add_wait_queue again, so you just return state. Additionally, once _anybody_ has returned a "I already have the event", we also clear the poll table queue, so subsequent accesses will purely be for returning the poll state. So I don't understand why you're asking for annotation. The whole "you add yourself to the poll table" is *fundamentally* only done on the first pass. You should never do it for later passes. > How much do you intend to revert? Untangling just the ->poll()-related > parts from the rest of changes in fs/aio.c will be unpleasant; we might > end up reverting the whole tail of the series and redoing the things > that are not poll-related on top of the revert... ;-/ I pushed out my revert. It was fairly straightforward, it just reverted all the poll_mask/get_poll_head changes, and the aio code that depended on them. Btw, I really don't understand the "aio has a race". The old poll() interface was fundamentally race-free. There simply *is* no way to race on it, exactly because of the whole "add yourself to the wait queue first, then ask for state afterwards" model. The model may be _odd_, but it has literally worked well for a quarter century exactly because it's really simple and fundamentally cannot have races. 
So I think it's the aio code that needs fixing, not the polling code. I do want that explanation for why AIO is _so_ special that it can introduce a race in poll(). Because I suspect it's not so special, and it's just buggy. Maybe Christoph didn't understand the two-phase model (how you call ->poll() _twice_ - first to add yourself to the queue, later to check status). Or maybe AIO interfaces are just shit (again!) and don't work right. Linus
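[Editor's sketch: a driver-side ->poll() following that quarter-century-old two-pass model looks roughly like this - my_dev, its waitq, and the readiness flags are invented for illustration:

static __poll_t my_dev_poll(struct file *filp, struct poll_table_struct *wait)
{
	struct my_dev *dev = filp->private_data;
	__poll_t mask = 0;

	/* First pass: adds us to dev->waitq via the poll table. On
	 * later passes the table (or its queueing hook) is NULL, so
	 * poll_wait() is a no-op and this becomes a pure status read.
	 */
	poll_wait(filp, &dev->waitq, wait);

	/* Then report the *current* state - never sleep waiting for it. */
	if (dev->have_data)
		mask |= EPOLLIN | EPOLLRDNORM;
	if (dev->room_to_write)
		mask |= EPOLLOUT | EPOLLWRNORM;
	return mask;
}

The race-freedom Linus describes falls out of the ordering: you are on the wait queue *before* you sample the state, so an event that fires in between simply wakes you for another pass.]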
Re: [PATCH 6/6] fs: replace f_ops->get_poll_head with a static ->f_poll_head pointer
On Thu, Jun 28, 2018 at 7:21 AM Christoph Hellwig wrote: > > Note that for this removes the possibility of actually returning an > error before waiting in poll. We could still do this with an ERR_PTR > in f_poll_head with a little bit of WRITE_ONCE/READ_ONCE magic, but > I'd like to defer that until actually required. I'm still going to just revert the whole poll mess for now. It's still completely broken. This helps things, but it doesn't fix the fundamental issue: the new interface is strictly worse than the old interface ever was. So Christoph, if you don't like the traditional ->poll() method, and you want something else for aio polling, I seriously will suggest that you introduce a new f_op for *that*. Don't mess with the existing ->poll() function, don't make select() and poll() use a slower and less capable function just because aio wants something else. In other words, you need to see AIO as the less important case, not as the driver for this. I also want to understand what the AIO race was, and what the problem with the poll() thing was. You claimed it was racy. I don't see it, and it was never ever explained in the whole series. I should never have pulled it in the first place if only for that reason, but I tend to trust Al when it comes to the VFS layer, so I did. My bad. So before we try this again, I most definitely want _explanations_. And I want the whole approach to be very clear that AIO is the ugly step-sister, not the driving force. Linus
Re: [GIT] Networking
On Fri, May 11, 2018 at 5:10 PM David Miller wrote: > I guess this is my reward for trying to break the monotony of > pull requests :-) I actually went back and checked a few older pull requests to see if this had been going on for a while and I just hadn't noticed. It just took me by surprise :^p Linus
Re: [GIT] Networking
David, is there something you want to tell us? Drugs are bad, m'kay.. Linus On Fri, May 11, 2018 at 2:00 PM David Miller wrote: > "from Kevin Easton", "Thanks to Bhadram Varka", "courtesy of Gustavo A. > R. Silva", "To Eric Dumazet we are most grateful for this fix", "This > fix from YU Bo, we do appreciate", "Once again we are blessed by the > honorable Eric Dumazet with this fix", "This fix is bestowed upon us by > Andrew Tomt", "another great gift from Eric Dumazet", "to Hangbin Liu we > give thanks for this", "Paolo Abeni, he gave us this", "thank you Moshe > Shemesh", "from our good brother David Howells", "Daniel Juergens, > you're the best!", "Debabrata Benerjee saved us!", "The ship is now > water tight, thanks to Andrey Ignatov", "from Colin Ian King, man we've > got holes everywhere!", "Jiri Pirko what would we do without you!
Re: [llc_ui_release] BUG: unable to handle kernel NULL pointer dereference at 0000000000000004
On Sat, Apr 28, 2018 at 7:12 PM Fengguang Wu wrote: > FYI this happens in mainline kernel 4.17.0-rc2. > It looks like a new regression. > It occurs in 5 out of 5 boots. > [main] 375 sockets created based on info from socket cachefile. > [main] Generating file descriptors > [main] Added 83 filenames from /dev > udevd[507]: failed to execute '/sbin/modprobe' '/sbin/modprobe -bv platform:regulatory': No such file or directory > [ 372.057947] caif:caif_disconnect_client(): nothing to disconnect > [ 372.082415] BUG: unable to handle kernel NULL pointer dereference at 0004 I think this is fixed by commit 3a04ce7130a7 ("llc: fix NULL pointer deref for SOCK_ZAPPED") Linus
Re: Boot failures with net-next after rebase to v4.17.0-rc1
On Tue, Apr 24, 2018 at 12:54 PM, Jesper Dangaard Brouer wrote: > Hi all, > > I'm experiencing boot failures with net-next git-tree after it got > rebased/merged with Linus'es tree at v4.17.0-rc1. I suspect it's the global bit stuff that came in very late in the merge window, and had been developed and tested for a while before, but showed some problems under some configs. The fix is currently in the x86/pti tree in -tip, see: x86/pti: Fix boot problems from Global-bit setting and I expect it will percolate upstream soon. In the meantime, it would be good to verify that merging that x86/pti branch fixes it for you? There is another candidate for boot problems - do you happen to have CONFIG_DEFERRED_STRUCT_PAGE_INIT enabled? That can under certain circumstances get a percpu setup page fault because memory hadn't been initialized sufficiently. The fix there is to move the mm_init() call one step earlier in init/main.c: start_kernel() (to before trap_init()). And if it's neither of the above, I think you'll need to help bisect it. Linus
Re: [Patch net] net_sched: fix a missing idr_remove() in u32_delete_key()
On Fri, Apr 6, 2018 at 5:19 PM, Cong Wang wrote: > When we delete a u32 key via u32_delete_key(), we forget to > call idr_remove() to remove its handle from IDR. > > Fixes: e7614370d6f0 ("net_sched: use idr to allocate u32 filter handles") > Reported-by: Marcin Kabiesz Marcin sent me a tested-by too, so this seems all good. Thanks, Linus
Fwd: Problem with the kernel 4.15 - cutting the band (tc)
Forwarding a report about what looks like a regression between 4.14 and 4.15. New ENOSPC issue? I don't even know where to start guessing where to look. Help me, Davem-Wan Kenobi, you are my only hope. (But adding netdev just in case somebody else goes "That's obviously Xyz") Linus

-- Forwarded message --
From: Marcin Kabiesz
Date: Thu, Apr 5, 2018 at 10:38 AM
Subject: Problem with the kernel 4.15 - cutting the band (tc)

Hello, I have a problem with bandwidth cutting on kernel 4.15. On versions before 4.15, i.e. 4.14, this problem does not occur.

uname -a: Linux router 4.14.15 #1 SMP x86_64 Intel Xeon E3-1230 v6

commands to reproduce:
tc qdisc add dev ifb0 root handle 1: htb r2q 2
tc class add dev ifb0 parent 1: classid 1:1 htb rate 10gbit ceil 10gbit quantum 16000
tc filter add dev ifb0 parent 1: prio 5 handle 1: protocol all u32 divisor 256
tc filter add dev ifb0 protocol all parent 1: prio 5 u32 ht 800:: match ip dst 0.0.0.0/0 hashkey mask 0x00ff at 16 link 1:
tc filter add dev ifb0 parent 1:0 handle ::1 protocol all prio 5 u32 ht 1:2c: match ip dst 192.168.3.44/32 flowid 1:2
tc filter del dev ifb0 parent 1:0 handle 1:2c:1 prio 5 u32
tc filter add dev ifb0 parent 1:0 handle ::1 protocol all prio 5 u32 ht 1:2c: match ip dst 192.168.3.44/32 flowid 1:2
tc filter del dev ifb0 parent 1:0 handle 1:2c:1 prio 5 u32

This is ok: no errors/warnings and nothing in the dmesg log.
uname -a: Linux router 4.15.8 #1 SMP x86_64 Intel Xeon E3-1230 v6 (or 4.15.14, same effect)

commands to reproduce:
tc qdisc add dev ifb0 root handle 1: htb r2q 2
tc class add dev ifb0 parent 1: classid 1:1 htb rate 10gbit ceil 10gbit quantum 16000
tc filter add dev ifb0 parent 1: prio 5 handle 1: protocol all u32 divisor 256
tc filter add dev ifb0 protocol all parent 1: prio 5 u32 ht 800:: match ip dst 0.0.0.0/0 hashkey mask 0x00ff at 16 link 1:
tc filter add dev ifb0 parent 1:0 handle ::1 protocol all prio 5 u32 ht 1:2c: match ip dst 192.168.3.44/32 flowid 1:2
tc filter del dev ifb0 parent 1:0 handle 1:2c:1 prio 5 u32
tc filter add dev ifb0 parent 1:0 handle ::1 protocol all prio 5 u32 ht 1:2c: match ip dst 192.168.3.44/32 flowid 1:2

RTNETLINK answers: No space left on device
We have an error talking to the kernel

This is not ok: an error, though no warnings and nothing in the dmesg log.

Best Regards Please forgive my English Marcin Kabiesz
Re: [GIT PULL] remove in-kernel calls to syscalls
On Mon, Apr 2, 2018 at 12:04 PM, Dominik Brodowski wrote: > > This patchset removes all in-kernel calls to syscall functions in the > kernel with the exception of arch/. Ok, this finished off my arch updates for today, I'll probably move on to driver pulls tomorrow. Anyway, it's in my tree, will push out once my test build finishes. Linus
Re: RFC on writel and writel_relaxed
On Tue, Mar 27, 2018 at 6:33 PM, Benjamin Herrenschmidt wrote: > > This is why, I want (with your agreement) to define clearly and once > and for all, that the Linux semantics of writel are that it is ordered > with previous writes to coherent memory (*) Honestly, I think those are the sane semantics. In fact, make it "ordered with previous writes" full stop, since it's not only ordered wrt previous writes to memory, but also previous writel's. > Also, can I assume the above ordering with writel() equally applies to > readl() or not ? > > IE: > dma_buf->foo = 1; > readl(STUPID_DEVICE_DMA_KICK_ON_READ); If that KICK_ON_READ is UC, then that's definitely the case. And honestly, status registers like that really should always be UC. But if somebody sets the area WC (which is crazy), then I think it might be at least debatable. x86 semantics does allow reads to be done before previous writes (or, put another way, writes to be buffered - the buffers are ordered so writes don't get re-ordered, but reads can happen during the buffering). But UC accesses are always done entirely ordered, and honestly, any status register that starts a DMA would not make sense any other way. Of course, you'd have to be pretty odd to want to start a DMA with a read anyway (partly exactly because it's bad for performance, since reads will be synchronous and not buffered like a write). Linus
Re: RFC on writel and writel_relaxed
On Tue, Mar 27, 2018 at 5:24 PM, Sinan Kaya wrote: > > Basically changing it to
>
> dma_buffer->foo = 1;                     /* WB */
> wmb()
> writel_relaxed(KICK, DMA_KICK_REGISTER); /* UC */
> mmiowb()

Why? Why not just remove the wmb(), and keep the barrier in the writel()? The above code makes no sense, and just looks stupid to me. It also generates pointlessly bad code on x86, so it's bad there too. Linus
Re: RFC on writel and writel_relaxed
On Tue, Mar 27, 2018 at 3:03 PM, Benjamin Herrenschmidt wrote: > > The discussion at hand is about
>
> dma_buffer->foo = 1;                /* WB */
> writel(KICK, DMA_KICK_REGISTER);    /* UC */

Yes. That certainly is ordered on x86. In fact, afaik it's ordered even if that writel() might be of type WC, because that only delays writes, it doesn't move them earlier. Whether people *do* that or not, I don't know. But I wouldn't be surprised if they do. So if it's a DMA buffer, it's "cached". And even cached accesses are ordered wrt MMIO. Basically, to get unordered writes on x86, you need to either use explicitly nontemporal stores, or have a writecombining region with back-to-back writes that actually combine. And nobody really does that nontemporal store thing any more because the hardware that cared pretty much doesn't exist any more. It was too much pain. People use DMA and maybe an UC store for starting the DMA (or possibly a WC buffer that gets multiple stores in ascending order as a stream of commands). Things like UC will force everything to be entirely ordered, but even without UC, loads won't pass loads, and stores won't pass stores. > Now it appears that this wasn't fully understood back then, and some > people are now saying that x86 might not even provide that semantic > always. Oh, the above UC case is absolutely guaranteed. And I think even if it's WC, the write to kick off the DMA is ordered wrt the cached write. On x86, I think you need barriers only if you do things like - do two non-temporal stores and require them to be ordered: put a sfence or mfence in between them. - do two WC stores, and make sure they do not combine: put a sfence or mfence between them. - do a store, and a subsequent load from a different address, and neither of them is UC: put a mfence between them. But note that this is literally just "load after store". A "store after load" doesn't need one. I think that's pretty much it.
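[Editor's sketch: the three barrier cases above look roughly like this in userspace spelling, using the SSE intrinsics for the same fence and non-temporal-store instructions - illustrative, not from the thread:

```c
#include <immintrin.h>

void nt_store_pair(int *a, int *b)
{
	/* Case 1: two non-temporal stores that must be observed in
	 * order by other agents - fence between them.
	 */
	_mm_stream_si32(a, 1);
	_mm_sfence();		/* keep the NT stores ordered */
	_mm_stream_si32(b, 2);
}

void store_then_load(volatile int *a, volatile int *b, int *out)
{
	/* Case 3: a store followed by a load from a different
	 * address - the one reordering x86 performs for ordinary
	 * cacheable accesses - needs mfence to stay in order.
	 */
	*a = 1;
	_mm_mfence();		/* forbid the load passing the store */
	*out = *b;
}
```

Single-threaded these behave like plain stores and loads; the fences only matter for what other CPUs or devices can observe.]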
For example, the "lfence" instruction is almost entirely pointless on x86 - it was designed back in the time when people *thought* they might re-order loads. But loads don't get re-ordered. At least Intel seems to document that only non-temporal *stores* can get re-ordered wrt each other. End result: lfence is a historical oddity that can now be used to guarantee that a previous load has finished, and that in turn meant that it is now used in the Spectre mitigations. But it basically has no real memory ordering meaning since nothing passes an earlier load anyway, it's more of a pipeline thing. But in the end, one question is just "how much do drivers actually _rely_ on the x86 strong ordering?" We do support "smp_wmb()" even though x86 has strong enough ordering that just a barrier suffices. Somebody might just say "screw the x86 memory ordering, we're relaxed, and we'll fix up the drivers we care about". The only issue really is that 99.9% of all testing gets done on x86 unless you look at specific SoC drivers. On ARM, for example, there is likely little reason to care about x86 memory ordering, because there is almost zero driver overlap between x86 and ARM. *Historically*, the reason for following the x86 IO ordering was simply that a lot of architectures used the drivers that were developed on x86. The alpha and powerpc workstations were *designed* with the x86 IO bus (PCI, then PCIe) and to work with the devices that came with it. ARM? PCIe is almost irrelevant. For ARM servers, if they ever take off, sure. But 99.99% of ARM is about their own SoC's, and so "x86 test coverage" is simply not an issue. How much of an issue is it for Power? Maybe you decide it's not a big deal. Then all the above is almost irrelevant. Linus
Re: RFC on writel and writel_relaxed
On Tue, Mar 27, 2018 at 11:33 AM, Benjamin Herrenschmidt wrote: > > Well, we need to clarify that once and for all, because as I wrote > earlier, it was decreed by Linus more than a decade ago that writel > would be fully ordered by itself vs. previous memory stores (at least > on UC memory). Yes. So "writel()" needs to be ordered with respect to other writel() uses on the same thread. Anything else *will* break drivers. Obviously, the drivers may then do magic to say "do write combining etc", but that magic will be architecture-specific. The other issue is that "writel()" needs to be ordered wrt other CPU's doing "writel()" if those writel's are in a spinlocked region. So it's not that "writel()" needs to be ordered wrt the spinlock itself, but you *do* need to honor ordering if you have something like this:

	spin_lock();
	writel(a);
	writel(b);
	spin_unlock();

and if two CPU's run the above code "at the same time", then the *ONLY* acceptable sequence is abab. You cannot, and must not, ever see "aabb" at the device, for example, because of how the writel would basically leak out of the spinlock. That sounds "obvious", but dammit, a lot of architectures got that wrong, afaik. Linus
Re: [PATCH v2 bpf-next 5/8] bpf: introduce BPF_RAW_TRACEPOINT
On Fri, Mar 23, 2018 at 6:43 PM, Alexei Starovoitov wrote: > > it's not the limit I'm hitting, but self referential issue. > Exactly half gets expanded. > I don't think there is an easy workaround other > than duplicating the whole chain of REPEAT macro twice > with slightly different name. Take a look at the __MAP() macro in include/linux/syscalls.h. It basically takes a "transformation" as its argument, and does it 'n' times, where 'n' is the first argument (but could be self-counting). Maybe it will give you some ideas. ... and maybe it will just drive you mad and make you gouge out your eyes with a spoon. Don't blame the messenger. Linus
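[Editor's sketch: __MAP() works by pasting the count onto a macro name and recursing down one pair at a time. A trimmed userspace rendition - only up to two argument pairs here, where the kernel version goes further:

```c
#include <assert.h>

/* Apply macro m(type, name) to each (type, name) pair; __MAP##n
 * selects the right arity by token pasting on the count.
 */
#define __MAP0(m, ...)
#define __MAP1(m, t, a, ...) m(t, a)
#define __MAP2(m, t, a, ...) m(t, a), __MAP1(m, __VA_ARGS__)
#define __MAP(n, ...) __MAP##n(__VA_ARGS__)

#define __SC_DECL(t, a) t a	/* one transformation: "type name" */

/* Expands to: int add2(int x, int y) */
int add2(__MAP(2, __SC_DECL, int, x, int, y))
{
	return x + y;
}
```

Passing a different transformation macro (e.g. one that emits only the names) reuses the same pair-walking machinery, which is how the syscall wrappers generate both declarations and call sites from one list.]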
Re: [PATCH v3 bpf-next 01/10] treewide: remove struct-pass-by-value from tracepoints arguments
On Thu, Mar 22, 2018 at 1:48 PM, Steven Rostedt wrote: > > OK, but instead of changing it to pass by reference, why not just pass > the value. That is:
>
> static void xen_set_pte_atomic(pte_t *ptep, pte_t pte)
> {
> -	trace_xen_mmu_set_pte_atomic(ptep, pte);
> +	trace_xen_mmu_set_pte_atomic(ptep, native_pte_val(pte));
> 	set_64bit((u64 *)ptep, native_pte_val(pte));
> }

That looks simple and clean, and makes sense since the function itself then uses that value anyway. Certainly simpler than my monster define. Linus
Re: [PATCH v3 bpf-next 01/10] treewide: remove struct-pass-by-value from tracepoints arguments
On Thu, Mar 22, 2018 at 12:31 PM, Alexei Starovoitov wrote: > > yeah. C doesn't allow casting of 'struct s { u64 var };' into u64 > without massive hacks and aliasing warnings by compiler. Without massive hacks? Yeah, no. But without warnings? You can do it.

#define UINTTYPE(size) \
	__typeof__(__builtin_choose_expr(size == 1,  (u8)1, \
		   __builtin_choose_expr(size == 2, (u16)2, \
		   __builtin_choose_expr(size == 4, (u32)3, \
		   __builtin_choose_expr(size == 8, (u64)4, \
					 (void)5)))))

#define CAST_TO_U64(x) ({ \
	typeof(x) __src = (x); \
	UINTTYPE(sizeof(x)) __dst; \
	memcpy(&__dst, &__src, sizeof(__dst)); \
	(u64)__dst; })

Yeah, I'm not proud of the above, but gcc actually seems to do the right thing for it. Doing

	struct d { unsigned char a,b; };

it generates

	movzwl %di, %eax

for the CAST_TO_U64() (the above *looks* like it only casts to 32-bit, but it is actually a full cast to 64 bits because movzwl will also clear the top bits of the register). No warnings. But is it "massively hacky"? You be the judge. Probably. Going the other way is trivial, you just use that same UINTTYPE() again and just do the memcpy in reverse. NOTE! The above obviously only works for things that are actually proper nice easy sizes. If you want to encode a 6-byte thing in a u64, the sanest thing is likely to use a union, and just accept crap in some bytes of the u64 (and just expect it to be undone by the reversal operation). Honestly, I'd suggest against it, because of just horrible code generation and/or data leakage. Linus
Re: [RFC PATCH 2/3] x86/io: implement 256-bit IO read and write
On Thu, Mar 22, 2018 at 3:48 AM, David Laight <david.lai...@aculab.com> wrote:
> From: Linus Torvalds
>>
>> Note that we definitely have seen hardware that *depends* on the
>> regular "memcpy_fromio()" not doing big reads. I don't know how
>> hardware people screw it up, but it's clearly possible.

[ That should have been "big writes" ]

> I wonder if that hardware works with the current kernel on recent cpus?
> I bet it doesn't like the byte accesses that get generated either.

The one case we knew about we just fixed to use the special "16-bit word at a time" scr_memcpyw(). If I recall correctly, it was some "enterprise graphics console".

If there's one thing I've found over the years, it's that "enterprise" hardware is absolutely the dregs. It's the low-volume stuff that almost nobody uses and where the customer is used to the whole notion of boutique hardware, so they're used to having to do special things for special hardware. And I very much use the word "special" in the "my mom told me I was special" sense. It's not the *good* kind of "special". It's the "short bus" kind of special.

> For x86 being able to request a copy done as 'rep movsx' (for each x)
> would be useful.

The portability issues are horrendous. Particularly if the memory area (source or dest) might be unaligned.

The rule of thumb should be: don't use PIO, and if you *do* use PIO, don't be picky about what you get. And most *definitely* don't complain about performance to software people. Blame the hardware people. Get a competent piece of hardware, or a competent hardware designer.

Let's face it, PIO is stupid. Use it for command and status words. Not for data transfer.

Linus
Re: [RFC PATCH 0/3] kernel: add support for 256-bit IO access
On Thu, Mar 22, 2018 at 5:48 AM, David Laight wrote:
>
> So if we needed to do PIO reads using the AVX2 (or better AVX-512)
> registers would make a significant difference.
> Fortunately we can 'dma' most of the data we need to transfer.

I think this is the really fundamental issue. A device that expects PIO to do some kind of high-performance transaction is a *broken* device. It really is that simple. We don't bend over for misdesigned hardware crap unless it is really common.

> I've traced writes before, they are a lot faster and are limited
> by things in the fpga fabric (they appear back to back).

The write combine buffer really should be much more effective than any AVX or similar can ever be.

Linus
Re: [PATCH v5 0/2] Remove false-positive VLAs when using max()
On Thu, Mar 22, 2018 at 8:01 AM, Kees Cook wrote:
>
> Seems like it doesn't like void * arguments:

Yeah, that was discussed separately, I just didn't realize we had any such users. As David said, just adding a (long) cast to it should be fine, ie

  #define __is_constant(a) \
	(sizeof(int) == sizeof(*(1 ? ((void *)((long)(a) * 0l)) : (int *)1)))

and is probably a good idea even outside of pointers (because "long long" constants could cause warnings too due to the cast to (void *)).

Linus
Re: [RFC PATCH 2/3] x86/io: implement 256-bit IO read and write
On Tue, Mar 20, 2018 at 7:42 AM, Alexander Duyck wrote:
>
> Instead of framing this as an enhanced version of the read/write ops
> why not look at replacing or extending something like the
> memcpy_fromio or memcpy_toio operations?

Yes, doing something like "memcpy_fromio_avx()" is much more palatable, in that it works like the crypto functions do - if you do big chunks, the "kernel_fpu_begin/end()" isn't nearly the issue it can be otherwise.

Note that we definitely have seen hardware that *depends* on the regular "memcpy_fromio()" not doing big reads. I don't know how hardware people screw it up, but it's clearly possible.

So it really needs to be an explicitly named function that basically a driver can use to say "my hardware really likes big aligned accesses" and explicitly ask for some AVX version if possible.

Linus
Re: [PATCH v2 bpf-next 4/8] tracepoint: compute num_args at build time
On Wed, Mar 21, 2018 at 11:54 AM, Alexei Starovoitov wrote:
>
> add fancy macro to compute number of arguments passed into tracepoint
> at compile time and store it as part of 'struct tracepoint'.

We should probably do this __COUNT() thing in some generic header, we just talked last week about another use case entirely.

And wouldn't it be nice to just have some generic infrastructure like this:

  /*
   * This counts to ten.
   *
   * Any more than that, and we'd need to take off our shoes
   */
  #define __GET_COUNT(_0,_1,_2,_3,_4,_5,_6,_7,_8,_9,_10,_n,...) _n
  #define __COUNT(...) \
	__GET_COUNT(__VA_ARGS__,10,9,8,7,6,5,4,3,2,1,0)
  #define COUNT(...) __COUNT(dummy,##__VA_ARGS__)

  #define __CONCAT(a,b) a##b
  #define __CONCATENATE(a,b) __CONCAT(a,b)

and then you can do things like:

  #define fn(...) __CONCATENATE(fn,COUNT(__VA_ARGS__))(__VA_ARGS__)

which turns "fn(x,y,z)" into "fn3(x,y,z)". That can be useful for things like "max(a,b,c,d)" expanding to "max4()", and then you can just have the trivial

  #define max3(a,b,c) max2(a,max2(b,c))

etc (with proper parentheses, of course).

And I'd rather not have that function name concatenation be part of the counting logic, because we actually may have different ways of counting, so the concatenation is separate. In particular, the other situation this came up for, the counting was in argument _pairs_, so you'd use a "COUNT_PAIR()" instead of "COUNT()".

NOTE NOTE NOTE! The above was slightly tested and then cut-and-pasted. I might have screwed up at any point. Think of it as pseudo-code.

Linus
Re: [RFC PATCH 0/3] kernel: add support for 256-bit IO access
On Wed, Mar 21, 2018 at 12:46 AM, Ingo Molnar wrote:
>
> So I added a bit of instrumentation and the current state of things is that on
> 64-bit x86 every single task has an initialized FPU, every task has the exact
> same, fully filled in xfeatures (XINUSE) value:

Bah. Your CPU is apparently some old crud that barely has AVX. Of course all those features are enabled.

> Note that this is with an AVX (128-bit) supporting CPU:

That's weak even by modern standards. I have MPX bounds and MPX CSR on my old Skylake.

And the real worry is things like AVX-512 etc, which is exactly when things like "save and restore one ymm register" will quite likely clear the upper bits of the zmm register.

And yes, we can have some statically patched code that takes that into account, and saves the whole zmm register when AVX512 is on, but the whole *point* of the dynamic XSAVES thing is actually that Intel wants to be able to enable new user-space features without having to wait for OS support. Literally. That's why and how it was designed.

And saving a couple of zmm registers is actually pretty hard. They're big. Do you want to allocate 128 bytes of stack space, preferably 64-byte aligned, for a save area? No. So now it needs to be some kind of per-thread (or maybe per-CPU, if we're willing to continue to not preempt) special save area too.

And even then, it doesn't solve the real worry of "maybe there will be odd interactions with future extensions that we don't even know of".

All this to do a 32-byte PIO access, with absolutely zero data right now on what the win is?

Yes, yes, I can find an Intel white-paper that talks about setting WC and then using xmm and ymm instructions to write a single 64-byte burst over PCIe, and I assume that is where the code and idea came from. But I don't actually see any reason why a burst of 8 regular quad-word writes wouldn't cause a 64-byte burst write too.
So right now this is for _one_ odd rdma controller, with absolutely _zero_ performance numbers, and a very high likelihood that it does not matter in the least. And if there are some atomicity concerns ("we need to guarantee a single atomic access for race conditions with the hardware"), they are probably bogus and misplaced anyway, since (a) we can't guarantee that AVX2 exists in the first place (b) we can't guarantee that %ymm register write will show up on any bus as a single write transaction anyway So as far as I can tell, there are basically *zero* upsides, and a lot of potential downsides. Linus
Re: [PATCH] kbuild: disable clang's default use of -fmerge-all-constants
On Tue, Mar 20, 2018 at 5:18 PM, Daniel Borkmann wrote:
> Prasad reported that he has seen crashes in BPF subsystem with netd
> on Android with arm64 in the form of (note, the taint is unrelated):

Ack. This looks good to me. And thanks for noticing the behavior wrt the correct gcc merging.

> [ Hi Linus, feel free to take this fix directly if you want.
>   Alternatively, we could route it via bpf tree. Thanks a
>   lot for your feedback! ]

So since it's your patch and the only known issue comes from the bpf side, I think it should just go through the bpf tree, and I expect it to get to me through all the usual channels.

Thanks,
Linus
Re: [PATCH v5 0/2] Remove false-positive VLAs when using max()
On Tue, Mar 20, 2018 at 4:23 PM, Linus Torvalds <torva...@linux-foundation.org> wrote: > > Hmm. So thanks to the diseased mind of Martin Uecker, there's a better > test for "__is_constant()": > > /* Glory to Martin Uecker <martin.uec...@med.uni-goettingen.de> */ > #define __is_constant(a) \ > (sizeof(int) == sizeof(*(1 ? ((void*)((a) * 0l)) : (int*)1))) > > that is actually *specified* by the C standard to work, and doesn't > even depend on any gcc extensions. Well, it does depend on 'sizeof(*(void *)X)' being 1 and the compiler not complaining about it, and that sizeof(int) is not 1. But since we depend on those things in the kernel anyway, that's fine. Linus
Re: [PATCH v5 0/2] Remove false-positive VLAs when using max()
On Sat, Mar 17, 2018 at 1:07 PM, Kees Cook wrote:
>
> No luck! :( gcc 4.4 refuses to play along. And, hilariously, not only
> does it not change the complaint about __builtin_choose_expr(), it
> also thinks that's a VLA now.

Hmm. So thanks to the diseased mind of Martin Uecker, there's a better test for "__is_constant()":

  /* Glory to Martin Uecker */
  #define __is_constant(a) \
	(sizeof(int) == sizeof(*(1 ? ((void *)((a) * 0l)) : (int *)1)))

that is actually *specified* by the C standard to work, and doesn't even depend on any gcc extensions.

The reason is some really subtle pointer conversion rules, where the type of the ternary operator will depend on whether one of the pointers is NULL or not. And the definition of NULL, in turn, very much depends on "integer constant expression that has the value 0".

Are you willing to do one final try on a generic min/max? Same as my last patch, but using the above __is_constant() test instead of __builtin_constant_p?

Linus
Re: [RFC PATCH 0/3] kernel: add support for 256-bit IO access
On Tue, Mar 20, 2018 at 1:26 AM, Ingo Molnar wrote:
>
> So assuming the target driver will only load on modern FPUs I *think* it should
> actually be possible to do something like (pseudocode):
>
>	vmovdqa %ymm0, 40(%rsp)
>	vmovdqa %ymm1, 80(%rsp)
>
>	...
>	# use ymm0 and ymm1
>	...
>
>	vmovdqa 80(%rsp), %ymm1
>	vmovdqa 40(%rsp), %ymm0
>
> ... without using the heavy XSAVE/XRSTOR instructions.

No. The above is buggy.

It may *work*, but it won't work in the long run. Pretty much every single vector extension has traditionally meant that touching "old" registers breaks any new register use. Even if you save the registers carefully like in your example code, it will do magic and random things to the "future extended" version.

So I absolutely *refuse* to have anything to do with the vector unit. You can only touch it in the kernel if you own it entirely (ie that "kernel_fpu_begin()/_end()" thing). Anything else is absolutely guaranteed to cause problems down the line.

And even if you ignore that "maintenance problems down the line" issue ("we can fix them when they happen") I don't want to see games like this, because I'm pretty sure it breaks the optimized xsave by tagging the state as being dirty.

So no. Don't use vector stuff in the kernel. It's not worth the pain. The *only* valid use is pretty much crypto, and even there it has had issues. Benchmarks use big arrays and/or dense working sets etc to "prove" how good the vector version is, and then you end up in situations where it's used once per fairly small packet for an interrupt, and it's actually much worse than doing it by hand.

Linus
Re: [PATCH bpf] bpf: fix crash due to inode i_op mismatch with clang/llvm
On Mon, Mar 19, 2018 at 6:50 PM, Linus Torvalds <torva...@linux-foundation.org> wrote:
>
> Add it to everything. If it's an invalid optimization, it shouldn't be on.

IOW, why isn't this just something like

  diff --git a/Makefile b/Makefile
  index d65e2e229017..01abedc2e79f 100644
  --- a/Makefile
  +++ b/Makefile
  @@ -826,6 +826,9 @@ KBUILD_CFLAGS += $(call cc-disable-warning, pointer-sign)
   # disable invalid "can't wrap" optimizations for signed / pointers
   KBUILD_CFLAGS	+= $(call cc-option,-fno-strict-overflow)

  +# disable invalid optimization on clang
  +KBUILD_CFLAGS	+= $(call cc-option,-fno-merge-all-constants)
  +
   # Make sure -fstack-check isn't enabled (like gentoo apparently did)
   KBUILD_CFLAGS  += $(call cc-option,-fno-stack-check,)

(whitespace-damaged, but you get the gist of it).

We disable some optimizations that are technically _valid_, because they are too dangerous and a bad idea. Disabling an optimization that isn't valid EVEN IN THEORY is an absolute no-brainer, particularly if it has already shown itself to cause problems.

We have other situations where we generate multiple static structures and expect them to be unique. I'm not sure any of them would trigger the clang rules, but the clang rules are obviously complete garbage anyway, so who knows?

That optimization seems to truly be pure and utter garbage. Clang can even *see* the address comparison happening in that file.

Some clang person needs to be publicly shamed for enabling this kind of garbage by default, particularly since they apparently _knew_ it was invalid.

Linus
Re: [PATCH bpf] bpf: fix crash due to inode i_op mismatch with clang/llvm
On Mon, Mar 19, 2018 at 6:17 PM, Daniel Borkmann wrote:
>
> Reason for this miscompilation is that clang has the more aggressive
> -fmerge-all-constants enabled by default. In fact, clang source code
> has an 'insightful' comment about it in its source code under
> lib/AST/ExprConstant.cpp:
>
>   // Pointers with different bases cannot represent the same object.
>   // (Note that clang defaults to -fmerge-all-constants, which can
>   // lead to inconsistent results for comparisons involving the address
>   // of a constant; this generally doesn't matter in practice.)
>
> gcc on the other hand does not enable -fmerge-all-constants by default
> and *explicitly* states in its option description that using this
> flag results in non-conforming behavior, quote from man gcc:
>
>   Languages like C or C++ require each variable, including multiple
>   instances of the same variable in recursive calls, to have distinct
>   locations, so using this option results in non-conforming behavior.
>
> Given there are users with clang/LLVM out there today that triggered
> this, fix this mess by explicitly adding -fno-merge-all-constants to
> inode.o as CFLAGS via Kbuild system.

Oh, please do *NOT* add it to just that one file.

Add it to everything. If it's an invalid optimization, it shouldn't be on. And even if it happens to trigger in only that one file, then disabling it globally is just the safe thing to do.

What is the code generation difference if you just enable it globally? I would certainly _hope_ that it's not noticeable, but if it's noticeable that would certainly imply that it's very dangerous somewhere else too!

Linus
Re: [PATCH v5 0/2] Remove false-positive VLAs when using max()
On Mon, Mar 19, 2018 at 2:43 AM, David Laight wrote:
>
> Is it necessary to have the full checks for old versions of gcc?
>
> Even -Wvla could be predicated on very recent gcc - since we aren't
> worried about whether gcc decides to generate a vla, but whether
> the source requests one.

You are correct. We could just ignore the issue with old gcc versions, and disable -Wvla rather than worry about it.

Linus
Re: [RFC PATCH 0/3] kernel: add support for 256-bit IO access
On Mon, Mar 19, 2018 at 8:53 AM, David Laight wrote:
>
> The x87 and SSE registers can't be changed - they can contain callee-saved
> registers.
> But (IIRC) the AVX and AVX2 registers are all caller-saved.

No. The kernel entry is not the usual function call. On kernel entry, *all* registers are callee-saved.

Of course, some may be return values, and I have a slight hope that I can trash %eflags.

But basically, a system call is simply not a function call. We have different calling conventions on the argument side too. In fact, the arguments are in different registers depending on just *which* system call you take.

Linus
Re: [PATCH v5 0/2] Remove false-positive VLAs when using max()
On Sun, Mar 18, 2018 at 3:59 PM, Rasmus Villemoes wrote:
>
> OK, I missed where this was made about side effects of x and y

We never made it explicit, since all we really cared about in the end is the constantness. But yes:

> but I suppose the idea was to use
>
>   no_side_effects(x) && no_side_effects(y) ? trivial_max(x, y) :
>   old_max(x, y)

Exactly. Except with __builtin_choose_expr(), because we need the end result to be seen as an integer constant expression, so that we can then use it as an array size. So it needs that early parse-time resolution.

>> We only really use __builtin_constant_p() as a (bad) approximation of
>> that in this case, since it's the best we can do.
>
> I don't think you should parenthesize bad, rather capitalize it. ({x++;
> 0;}) is constant in the eyes of __builtin_constant_p, but not
> side-effect free.

Hmm. Yeah, but we probably don't much care for the kernel. For example, we do things like this:

  #define __swab64(x) \
	(__builtin_constant_p((__u64)(x)) ? \
	 ___constant_swab64(x) : \
	 __fswab64(x))

where that "___constant_swab64()" very much uses the same argument over and over.

And we do that for related reasons - we really want to do the constant folding at build time for some cases, and this was the only sane way to do it. Eg code like

  return (addr & htonl(0xff000000)) == htonl(0x7f000000);

wants to do the right thing, and long ago gcc didn't have builtins for byte swapping, so we had to just do nasty horrible macros that DTRT for constants.

But since the kernel is standalone, we don't need to really worry about the *generic* case, we just need to worry about our own macros, and if somebody does that example you show I guess we'll just have to shun them ;)

Of course, our own macros are often macros from hell, exactly because they often contain a lot of type-checking and/or type-(size)-based polymorphism. But then we actually *want* gcc to simplify things, and avoid side effects, just have potentially very complex expressions.
But we basically never have that kind of intentionally evil macros, so we are willing to live with a bad substitute. But yes, it would be better to have some more control over things, and actually have a way to say "if X is a constant integer expression, do transformation Y, otherwise call function y()". Actually sparse started out with the idea in the background that it could become not just a checker, but a "transformation tool". Partly because of long gone historical issues (ie gcc people used to be very anti-plugin due to licensing issues). Of course, a more integrated and powerful preprocessor language is almost always what we *really* wanted, but traditionally "powerful preprocessor" has always meant "completely incomprehensible and badly integrated preprocessor". "cpp" may be a horrid language, but it's there and it's fast (when integrated with the front-end, like everybody does now) But sadly, cpp is really bad for the above kind of "if argument is constant" kind of tricks. I suspect we'd use it a lot otherwise. >> So the real use would be to say "can we use the simple direct macro >> that just happens to also fold into a nice integer constant >> expression" for __builtin_choose_expr(). >> >> I tried to find something like that, but it really doesn't exist, even >> though I would actually have expected it to be a somewhat common >> concern for macro writers: write a macro that works in any arbitrary >> environment. > > Yeah, I think the closest is a five year old suggestion > (https://gcc.gnu.org/bugzilla/show_bug.cgi?id=57612) to add a > __builtin_assert_no_side_effects, that could be used in macros that for > some reason cannot be implemented without evaluating some argument > multiple times. It would be more useful to have the predicate version, > which one could always turn into a build bug version. But we have > neither, unfortunately. 
Yeah, and since we're in the situation that *new* gcc versions work for us anyway, and we only have issues with older gcc's (that sadly people still use), even if there was a new cool feature we couldn't use it. Oh well. Thanks for the background. Linus
Re: [PATCH v5 0/2] Remove false-positive VLAs when using max()
On Sun, Mar 18, 2018 at 2:13 PM, Rasmus Villemoes <li...@rasmusvillemoes.dk> wrote:
> On 2018-03-17 19:52, Linus Torvalds wrote:
>>
>> Ok, so it really looks like that same "__builtin_constant_p() doesn't
>> return a constant".
>>
>> Which is really odd, but there you have it.
>
> Not really. We do rely on builtin_constant_p not being folded too
> quickly to a 0/1 answer, so that gcc still generates good code even if
> the argument is only known to be constant at a late(r) optimization
> stage (through inlining and all).

Hmm. That makes sense. It just doesn't work for our case when we really want to choose the expression based on side effects or not.

> So unlike types_compatible_p, which
> can obviously be answered early during parsing, builtin_constant_p is
> most of the time a yes/no/maybe/it's complicated thing.

The silly thing is, the thing we *really* want to know _is_ actually easily accessible during the early parsing, exactly like __builtin_types_compatible_p(): it's not really that we care about the expressions being constant, as much as the "can this have side effects" question.

We only really use __builtin_constant_p() as a (bad) approximation of that in this case, since it's the best we can do.

So the real use would be to say "can we use the simple direct macro that just happens to also fold into a nice integer constant expression" for __builtin_choose_expr().

I tried to find something like that, but it really doesn't exist, even though I would actually have expected it to be a somewhat common concern for macro writers: write a macro that works in any arbitrary environment. I guess array sizes are really the only true limiting environment (static initializers is another one).

How annoying. Odd that newer gcc's seem to do so much better (ie gcc-5 seems to be fine). So _something_ must have changed.

Linus
Re: [PATCH v5 0/2] Remove false-positive VLAs when using max()
On Sat, Mar 17, 2018 at 12:27 AM, Kees Cook wrote:
>
> Unfortunately my 4.4 test fails quickly:
>
>   ./include/linux/jiffies.h: In function ‘jiffies_delta_to_clock_t’:
>   ./include/linux/jiffies.h:444: error: first argument to
>   ‘__builtin_choose_expr’ not a constant

Ok, so it really looks like that same "__builtin_constant_p() doesn't return a constant".

Which is really odd, but there you have it.

I wonder if you can use that "sizeof()" to force evaluation of it, because sizeof() really does end up being magical when it comes to "integer constant expression".

So instead of this:

  #define __no_side_effects(a,b) \
	(__builtin_constant_p(a) && __builtin_constant_p(b))

that just assumes that __builtin_constant_p() itself always counts as a constant expression, what happens if you do

  #define __is_constant(a) \
	(sizeof(char[__builtin_constant_p(a)]))

  #define __no_side_effects(a,b) \
	(__is_constant(a) && __is_constant(b))

I realize that the above looks completely insane: the whole point is to *not* have VLA's, and we know that __builtin_constant_p() isn't always evaluated as a constant.

But hear me out: if the issue is that there's some evaluation ordering between the two builtins, and the problem is that the __builtin_choose_expr() part of the expression is expanded *before* the __builtin_constant_p() has been expanded, then just hiding it inside that bat-shit-crazy sizeof() will force that to be evaluated first (because a sizeof() is defined to be an integer constant expression).

So the above is completely insane, but there is actually a chance that using that completely crazy "x -> sizeof(char[x])" conversion actually helps, because it really does have a (very odd) evaluation-time change. sizeof() has to be evaluated as part of the constant expression evaluation, in ways that "__builtin_constant_p()" isn't specified to be done.

But it is also definitely me grasping at straws. If that doesn't work for 4.4, there's nothing else I can possibly see.

Linus
Re: [PATCH v5 0/2] Remove false-positive VLAs when using max()
On Fri, Mar 16, 2018 at 10:44 AM, David Laight wrote:
>
> I looked at the generated code for one of the constant sized VLA that
> the compiler barfed at.
> It seemed to subtract constants from %sp separately for the VLA.
> So it looks like the compiler treats them as VLA even though it
> knows the size.
> That is probably missing optimisation.

Looking at the code is definitely an option. In fact, instead of depending on -Wvla, we could just make 'objtool' warn about real variable-sized stack frames.

That said, if that "sizeof()" trick of Al's actually works with older gcc versions too (it *should*, but it's not like __builtin_choose_expr() and __builtin_constant_p() have well-defined rules in the standard), that may just be the solution.

And if gcc ends up generating bad code for those "constant sized vlas" anyway, then -Wvla would effectively warn about that code generation problem.

Linus
Re: [PATCH v5 0/2] Remove false-positive VLAs when using max()
On Fri, Mar 16, 2018 at 1:14 PM, Linus Torvalds <torva...@linux-foundation.org> wrote: > > It does not work with gcc-4.1.x, but works with gcc-4.4.x. > > I can't seem to see the errors any way, I wonder if > __builtin_choose_expr() simply didn't exist back then. No, that goes further back. It seems to be -Wvla itself that doesn't exist in 4.1, so the test build failed simply because I used that flag ;) Linus
Re: [PATCH v5 0/2] Remove false-positive VLAs when using max()
On Fri, Mar 16, 2018 at 1:12 PM, Al Viro wrote:
>
> That's C99, straight from N1256.pdf (C99-TC3)...

I checked C90, since the error is "ISO C90 forbids variable length array", and I didn't see anything there. Admittedly I only found a draft copy.

Linus
Re: [PATCH v5 0/2] Remove false-positive VLAs when using max()
On Fri, Mar 16, 2018 at 1:03 PM, Miguel Ojeda wrote:
>>
>> Kees - is there some online "gcc-4.4 checker" somewhere? This does
>> seem to work with my gcc. I actually tested some of those files you
>> pointed at now.
>
> I use this one:
>
> https://godbolt.org/

Well, my *test* code works on that one and -Wvla -Werror.

It does not work with gcc-4.1.x, but works with gcc-4.4.x.

I can't seem to see the errors any way, I wonder if __builtin_choose_expr() simply didn't exist back then. Odd that you can't view warnings/errors with it.

But it's possible that it fails on more complex stuff in the kernel. I've done a "allmodconfig" build with that patch, and the only issue it found was that (real) type issue in tpm_tis_core.h.

Linus
Re: [PATCH -next 00/22] remove in-kernel syscall invocations (part 2 == netdev)
On Fri, Mar 16, 2018 at 11:30 AM, David Miller wrote:
>
> I imagine one of the things you'd like to do is declare that syscall
> entries use a different (better) argument passing scheme. For
> example, passing values in registers instead of on the stack.

Actually, it's almost exactly the reverse.

On x86-64, we'd like to just pass the 'struct pt_regs *' pointer, and have the sys_xyz() function itself just pick out the arguments it needs from there. That has a few reasons for it:

 - we can clear all registers at system call entry, which helps defeat some of the "pass seldom used register with user-controlled value that survives deep into the callchain" things that people used to leak information

 - we can streamline the low-level system call code, which needs to pass around 'struct pt_regs *' anyway, and the system call only picks up the values it actually needs

 - it's really quite easy(*) to just make the SYSCALL_DEFINEx() macros just do it all with a wrapper inline function

but it fundamentally means that you *cannot* call 'sys_xyz()' from within the kernel, unless you then do it with something crazy like

	struct pt_regs myregs;

	... fill in the right registers for this architecture _if_ this architecture uses ptregs ..
	sys_xyz();

which I somehow really doubt you want to do in the networking code.

Now, I did do one version that just created two entrypoints for every single system call - the "kernel version" and the "real" system call version. That sucks, because you have two choices:

 - either pointlessly generate extra code for the 200+ system calls that are *not* used by the kernel

 - or let gcc just merge the two, and make code generation suck where the real system call just loads the registers and jumps to the common code.

That second option really does suck, because if you let the compiler just generate the _single_ system call, it will do the "load actual value from ptregs" much more nicely, and only when it needs it, and schedules it all into the system call code.
So just making the rule be: "you mustn't call the SYSCALL_DEFINEx() functions from anything but the system call code" really makes everything better.

Then you only need to fix up the *handful* or so of system calls that actually have in-kernel callers. Many of them end up being things that could be improved on further anyway (ie there's discussion about further cleanup and trying to avoid using "set_fs()" for arguments etc, because there already exist helper functions that take the kernel-space versions, and the sys_xyz() version is actually just going through stupid extra work for a kernel user).

Linus

(*) The "really quite easy" is only true on 64-bit architectures. 32-bit architectures have issues with packing 64-bit values into two registers, so using macro expansion with just the number of arguments doesn't work.
Re: [PATCH v5 0/2] Remove false-positive VLAs when using max()
On Fri, Mar 16, 2018 at 10:55 AM, Al Viro wrote:
>
> That's not them, that's C standard regarding ICE.

Yes. The C standard talks about "integer constant expression". I know. It's come up in this very thread before.

The C standard at no point talks about - or forbids - "variable length arrays". That never comes up in the whole standard, I checked.

So we are right now hindered by a _syntactic_ check, without any way to have a _semantic_ check. That's my problem. The warnings are misleading and imply semantics. And apparently there is no way to actually check semantics.

> 1,100 is *not* a constant expression as far as the standard is concerned,

I very much know. But it sure isn't "variable" either as far as the standard is concerned, because the standard doesn't even have that concept (it uses "variable" for argument numbers and for variables).

So being pedantic doesn't really change anything.

> Would you argue that in
>	void foo(char c)
>	{
>		int a[(c<<1) + 10 - c + 2 - c];

Yeah, I don't think that even counts as a constant value, even if it can be optimized to one. I would not at all be unhappy to see __builtin_constant_p() return zero.

But that is very much different from the syntax issue. So I would like to get some way to get both type-checking and constant checking without the annoying syntax issue.

> expr, constant_expression is not a constant_expression. And in
> this particular case the standard is not insane - the only reason
> for using that is typechecking and _that_ can be achieved without
> violating 6.6p6:
>	sizeof(expr,0) * 0 + ICE
> *is* an integer constant expression, and it gives you exact same
> typechecking. So if somebody wants to play odd games, they can
> do that just fine, without complicating the logics for compilers...

Now that actually looks like a good trick. Maybe we can use that instead of the comma expression that causes problems.
And using sizeof() to make sure that __builtin_choose_expr() really gets an integer constant expression and that there should be no ambiguity looks good.

Hmm. This works for me, and I'm being *very* careful (those casts to pointer types are inside that sizeof, because it's not an integral type, and non-integral casts are not valid in an ICE either) but somebody needs to check gcc-4.4:

  #define __typecheck(a,b) \
          (!!(sizeof((typeof(a)*)1==(typeof(b)*)1)))
  #define __no_side_effects(a,b) \
          (__builtin_constant_p(a)&&__builtin_constant_p(b))
  #define __safe_cmp(a,b) \
          (__typecheck(a,b) && __no_side_effects(a,b))
  #define __cmp(a,b,op) ((a)op(b)?(a):(b))
  #define __cmp_once(a,b,op) ({ \
          typeof(a) __a = (a); \
          typeof(b) __b = (b); \
          __cmp(__a,__b,op); })
  #define __careful_cmp(a,b,op) \
          __builtin_choose_expr(__safe_cmp(a,b), __cmp(a,b,op), __cmp_once(a,b,op))

  #define min(a,b) __careful_cmp(a,b,<)
  #define max(a,b) __careful_cmp(a,b,>)
  #define min_t(t,a,b) __careful_cmp((t)(a),(t)(b),<)
  #define max_t(t,a,b) __careful_cmp((t)(a),(t)(b),>)

and yes, it does cause new warnings for that comparison between ‘enum tis_defaults’ and ‘enum tpm2_const’ in drivers/char/tpm/tpm_tis_core.h due to

  #define TIS_TIMEOUT_A_MAX max(TIS_SHORT_TIMEOUT, TPM2_TIMEOUT_A)

but technically that warning is actually correct, I'm just confused why gcc cares about the cast placement or something. That warning is easy to fix by turning it into a "max_t(int, enum1, enum2)" and that is technically the right thing to do, it's just not warned about for some odd reason with the current code.

Kees - is there some online "gcc-4.4 checker" somewhere? This does seem to work with my gcc. I actually tested some of those files you pointed at now.
Linus

 include/linux/kernel.h | 77 +-
 1 file changed, 20 insertions(+), 57 deletions(-)

diff --git a/include/linux/kernel.h b/include/linux/kernel.h
index 3fd291503576..23c31bf1d7fb 100644
--- a/include/linux/kernel.h
+++ b/include/linux/kernel.h
@@ -787,37 +787,29 @@ static inline void ftrace_dump(enum ftrace_dump_mode oops_dump_mode) { }
  * strict type-checking.. See the
  * "unnecessary" pointer comparison.
  */
-#define __min(t1, t2, min1, min2, x, y) ({ \
-	t1 min1 = (x); \
-	t2 min2 = (y); \
-	(void) (&min1 == &min2); \
-	min1 < min2 ? min1 : min2; })
+#define __typecheck(a,b) \
+	(!!(sizeof((typeof(a)*)1==(typeof(b)*)1)))

-/**
- * min - return minimum of two values of the same or compatible types
- * @x: first value
- * @y: second value
- */
-#define min(x, y) \
-	__min(typeof(x), typeof(y), \
-	      __UNIQUE_ID(min1_), __UNIQUE_ID(min2_), \
-	      x, y)
+#define __no_side_effects(a,b) \
+	(__builtin_constant_p(a)&&__builtin_constant_p(b))

-#define __max(t1, t2, max1, max2, x, y) ({ \
-	t1 max1 = (x); \
-	t2 max2 = (y); \
-	(void) (&max1 == &max2); \
-	max1
Re: [PATCH v5 0/2] Remove false-positive VLAs when using max()
On Fri, Mar 16, 2018 at 4:47 AM, Florian Weimer wrote:
>
> If you want to catch stack frames which have unbounded size,
> -Werror=stack-usage=1000 or -Werror=vla-larger-than=1000 (with the constant
> adjusted as needed) might be the better approach.

No, we want to catch *variable* stack sizes.

Does "-Werror=vla-larger-than=0" perhaps work for that? No, because the stupid compiler says that is "meaningless". And no, using "-Werror=vla-larger-than=1" doesn't work either, because the moronic compiler continues to think that "vla" is about the _type_, not the code:

  t.c: In function ‘test’:
  t.c:6:6: error: argument to variable-length array is too large [-Werror=vla-larger-than=]
    int array[(1,100)];

Gcc people are crazy. Is there really no way to just say "shut up about the stupid _syntax_ issue that is entirely irrelevant, and give us the _code_ issue".

Linus
Re: [PATCH v4 1/2] kernel.h: Introduce const_max() for VLA removal
On Thu, Mar 15, 2018 at 4:46 PM, Linus Torvalds <torva...@linux-foundation.org> wrote:
>
> Well, the explicit typing allows that mixing, in that you can just
> have "const_max_t(5,sizeof(x))"

I obviously meant "const_max_t(size_t,5,sizeof(x))". Heh.

Linus
Re: [PATCH v4 1/2] kernel.h: Introduce const_max() for VLA removal
On Thu, Mar 15, 2018 at 4:41 PM, Kees Cook wrote:
>
> I much prefer explicit typing, but both you and Rasmus mentioned
> wanting the int/sizeof_t mixing.

Well, the explicit typing allows that mixing, in that you can just have "const_max_t(5,sizeof(x))". So I'm ok with that.

What I'm *not* so much ok with is "const_max(5,sizeof(x))" erroring out, or silently causing insane behavior due to hidden subtle type casts..

Linus
Re: [PATCH v4 1/2] kernel.h: Introduce const_max() for VLA removal
On Thu, Mar 15, 2018 at 3:46 PM, Kees Cook wrote:
>
> So, AIUI, I can either get strict type checking, in which case, this
> is rejected (which I assume there is still a desire to have):
>
>         int foo[const_max(6, sizeof(whatever))];

Ehh, yes, that looks fairly sane, and erroring out would be annoying.

But maybe we should just make the type explicit, and make it "const_max_t()"? I think all the existing users are of type "max_t()" anyway due to the very same issue, no?

At least if there's an explicit type like 'size_t', then passing in "-1" becoming a large unsigned integer is understandable and clear, not just some odd silent behavior.

Put another way: I think it's unacceptable that

  const_max(-1,6)

magically becomes a huge positive number like in that patch of yours, but

  const_max_t(size_t, -1, 6)

*obviously* is a huge positive number. The two things would *do* the same thing, but in the second case the type is explicit and visible.

> due to __builtin_types_compatible_p() rejecting it, or I can construct
> a "positive arguments only" test, in which the above is accepted, but
> this is rejected:

That sounds acceptable too, although the "const_max_t()" thing is presumably going to be simpler?

Linus
Re: [PATCH v4 1/2] kernel.h: Introduce const_max() for VLA removal
On Thu, Mar 15, 2018 at 3:16 PM, Kees Cook wrote:
>
> size_t __error_not_const_arg(void) \
> __compiletime_error("const_max() used with non-compile-time constant arg");
> #define const_max(x, y)                                         \
>         __builtin_choose_expr(__builtin_constant_p(x) &&        \
>                               __builtin_constant_p(y),          \
>                               (typeof(x))(x) > (typeof(y))(y) ? \
>                                         (x) : (y),              \
>                               __error_not_const_arg())
>
> Is typeof() forcing enums to int? Regardless, I'll put this through
> larger testing. How does that look?

Ok, that alleviates my worry about one class of insane behavior, but it does raise a few other questions:

 - what drugs is gcc on where (typeof(x)(x)) makes a difference? Funky.

 - this does have the usual "what happens if you do const_max(-1,sizeof(x))" issue, where the comparison will now be done in 'size_t', and -1 ends up being a very very big unsigned integer.

Is there no way to get that type checking inserted? Maybe now is a good point for that __builtin_types_compatible_p(), and add it to the constness checking (and change the name of that error case function)?

Linus
Re: [PATCH v4 1/2] kernel.h: Introduce const_max() for VLA removal
On Thu, Mar 15, 2018 at 12:47 PM, Kees Cook wrote:
>
> To gain the ability to compare differing types, the arguments are
> explicitly cast to size_t.

Ugh, I really hate this. It silently does insane things if you do

  const_max(-1,6)

and there is nothing in the name that implies that you can't use negative constants.

Linus
Re: [PATCH v3] kernel.h: Skip single-eval logic on literals in min()/max()
On Mon, Mar 12, 2018 at 3:55 PM, Andrew Morton wrote:
>
> Replacing the __builtin_choose_expr() with ?: works of course.

Hmm. That sounds like the right thing to do. We were so myopically staring at the __builtin_choose_expr() problem that we overlooked the obvious solution.

Using __builtin_constant_p() together with a ?: is in fact our common pattern, so that should be fine. The only real reason to use __builtin_choose_expr() is if you want to get the *type* to vary depending on which side you choose, but that's not an issue for min/max.

> What will be the runtime effects?

There should be none. Gcc will turn the conditional for the ?: into a constant, and DTRT.

Linus
Re: [PATCH v3] kernel.h: Skip single-eval logic on literals in min()/max()
On Sun, Mar 11, 2018 at 4:05 AM, Ingo Molnar wrote:
>
> BTW., while I fully agree with everything you said, it's not entirely correct
> to claim that if a C compiler can generate VLA code it is necessarily able to
> parse and evaluate constant array sizes "just fine".
>
> Constant expressions are typically parsed very early on, at the preprocessing
> stage. They can be used with some preprocessor directives as well, such as
> '#if' (with some further limitations on their syntax).

Yes. But constant simplification and CSE etc is just a very fundamental part of a compiler, and anybody who actually implements VLA's would have to do it anyway.

So no, a message like

  warning: Array declaration is not a C90 constant expression, resulting in VLA code generation

would be moronic. Only some completely mindless broken shit would do "oh, it's not a parse-time constant, so it will be variable". The two just do not follow AT ALL.

So the message might be about _possibly_ resulting in VLA code generation, but honestly, at that point you should just add the warning when you actually generate the code to do the stack allocation. Because at that point, you know whether it's variable or not.

And trust me, it won't be variable for things like (2,3), or even for our "max()" thing with other odd builtins. Not unless the compiler doesn't really support VLA at all (maybe some bolted-on crazy thing that just turns a VLA at the front-end time into just an alloca), or the user has explicitly asked to disable some fundamental optimization phase.

Linus
Re: [PATCH v3] kernel.h: Skip single-eval logic on literals in min()/max()
On Sat, Mar 10, 2018 at 9:34 AM, Miguel Ojeda wrote:
>
> So the warning is probably implemented to just trigger whenever VLAs
> are used but the given standard does not allow them, for all
> languages. The problem is why the ISO C90 frontend is not giving an
> error for using invalid syntax for array sizes to begin with?

So in *historical* context - when a compiler didn't do variable length arrays at all - the original semantics of C "constant expressions" actually make a ton of sense. You can basically think of a constant expression as something that can be (historically) handled by the front-end without any real complexity, and no optimization phases - just evaluating a simple parse tree with purely compile-time constants.

So there's a good and perfectly valid reason for why C limits certain expressions to just a very particular subset. It's not just array sizes, it's case statements etc too. And those are very much part of the C standard.

So an error message like

  warning: ISO C90 requires array sizes to be constant-expressions

would be technically correct and useful from a portability angle. It tells you when you're doing something non-portable, and should be automatically enabled with "-ansi -pedantic", for example.

So what's misleading is actually the name of the warning and the message, not that it happens. The warning isn't about "variable length", it's literally about the rules for what a "constant-expression" is. And in C, the expression (2,3) has a constant _value_ (namely 3), but it isn't a constant-expression as specified by the language.

Now, the thing is that once you actually do variable length arrays, those old front-end rules make no sense any more (outside of the "give portability warnings" thing). Because once you do variable length arrays, you obviously _parse_ everything just fine, and you're going to evaluate much more complex expressions than some limited constant-expression rule.
And at that point it would make a whole lot more sense to add a *new* warning that basically says "I have to generate code for a variable-sized array", if you actually talk about VLA's. But that's clearly not what gcc actually did.

So the problem really is that -Wvla doesn't actually warn about VLA's, but about something technically completely different. And that's why those stupid syntactic issues with min/max matter. It's not whether the end result is a compile-time constant or not, it's about completely different issues, like whether there is a comma-expression in it.

Linus
Re: [PATCH v3] kernel.h: Skip single-eval logic on literals in min()/max()
On Sat, Mar 10, 2018 at 7:33 AM, Kees Cook wrote:
>
> Alright, I'm giving up on fixing max(). I'll go back to STACK_MAX() or
> some other name for the simple macro. Bleh.

Oh, and I'm starting to see the real problem. It's not that our current "min/max()" are broken. It's that "-Wvla" is garbage.

Lookie here:

  int array[(1,2)];

results in gcc saying

  warning: ISO C90 forbids variable length array ‘array’ [-Wvla]
    int array[(1,2)];
        ^~~

and that error message - and the name of the flag - is obviously pure garbage.

What is *actually* going on is that ISO C90 requires an array size to be not a constant value, but a constant *expression*. Those are two different things.

A constant expression has little to do with "compile-time constant". It's a more restricted form of it, and has actual syntax requirements. A comma expression is not a constant expression, for example, which was why I tested this.

So "-Wvla" is garbage, with a misleading name, and a misleading warning string. It has nothing to do with "variable length" and whether the compiler can figure it out at build time, and everything to do with a _syntax_ rule.

Linus
Re: [PATCH v3] kernel.h: Skip single-eval logic on literals in min()/max()
On Sat, Mar 10, 2018 at 7:33 AM, Kees Cook wrote:
>
> And sparse freaks out too:
>
>   drivers/net/ethernet/via/via-velocity.c:97:26: sparse: incorrect
>   type in initializer (different address spaces) @@ expected void
>   *addr @@ got struct mac_regs [noderef]
Re: [PATCH v3] kernel.h: Skip single-eval logic on literals in min()/max()
On Fri, Mar 9, 2018 at 11:03 PM, Miguel Ojeda wrote:
>
> Just compiled 4.9.0 and it seems to work -- so that would be the
> minimum required.
>
> Sigh...
>
> Some enterprise distros are either already shipping gcc >= 5 or will
> probably be shipping it soon (e.g. RHEL 8), so how much does it hurt
> to ask for a newer gcc? Are there many users/companies out there using
> enterprise distributions' gcc to compile and run the very latest
> kernels?

I wouldn't mind upping the compiler requirements, and we have other reasons to go to 4.6. But _this_ particular issue doesn't seem worth it to then go even further.

Annoying.

Linus
Re: [PATCH v3] kernel.h: Skip single-eval logic on literals in min()/max()
On Fri, Mar 9, 2018 at 5:31 PM, Kees Cook wrote:
>
> WTF, gmail just blasted HTML into my explicitly plain-text email?!
> Apologies...

There's more now in your email, I think maybe it's triggered by your signature file and some gmail web interface bug. Or it just tries to force quote using html.

There's some html email disease inside google, where some parts of the company seem to think that html email is _good_. The official gmail Android app is a big pain, in that it doesn't *have* a plain-text mode, even though it knows how to format the plain-text part of the email.

You might want to be on the lookout for some bad drugs at the office. Because the world would thank you if you took them away from the gmail app people.

Linus
Re: [PATCH v3] kernel.h: Skip single-eval logic on literals in min()/max()
On Fri, Mar 9, 2018 at 4:32 PM, Andrew Morton wrote:
>
> I wonder which gcc versions actually accept Kees's addition.

Note that we already do have this pattern, as seen by:

  git grep -2 __builtin_choose_expr | grep -2 __builtin_constant_p

which shows drivers/pinctrl/intel/pinctrl-intel.h, so the pattern already exists in current kernels and _works_ - it apparently just doesn't work in slightly more complicated cases.

It's one reason why I wondered if simplifying the expression to have just that single __builtin_constant_p() might not end up working..

Linus
Re: [PATCH v3] kernel.h: Skip single-eval logic on literals in min()/max()
On Fri, Mar 9, 2018 at 4:07 PM, Andrew Morton wrote:
>
> A brief poke failed to reveal a workaround - gcc-4.4.4 doesn't appear
> to know that __builtin_constant_p(x) is a constant. Or something.

LOL.

I suspect it might be that it wants to evaluate __builtin_choose_expr() at an earlier stage than it evaluates __builtin_constant_p(), so it's not that it doesn't know that __builtin_constant_p() is a constant, it just might not know it *yet*. Maybe.

Side note, if it's not that, but just the "complex" expression that has the logical 'and' etc, maybe the code could just use

  __builtin_constant_p((x)+(y))

or something.

But yeah:

> Sigh. Wasn't there some talk about modernizing our toolchain
> requirements? Maybe it's just time to give up on 4.4.

We wanted 4.5 for "asm goto", and once we upgrade to 4.5 I think Arnd said that no distro actually ships it, so we might as well go to 4.6. So maybe this is just the excuse to finally make that official, if there is no clever workaround any more.

Linus
Re: [PATCH v3] kernel.h: Skip single-eval logic on literals in min()/max()
On Fri, Mar 9, 2018 at 12:05 PM, Kees Cook wrote:
> When max() is used in stack array size calculations from literal values
> (e.g. "char foo[max(sizeof(struct1), sizeof(struct2))]", the compiler
> thinks this is a dynamic calculation due to the single-eval logic, which
> is not needed in the literal case. This change removes several accidental
> stack VLAs from an x86 allmodconfig build:

Ok, looks good. I just have a couple of questions about applying it.

In particular, if this will help people working on getting rid of VLA's in the short term, I can apply it directly. But if people who are looking at it (anybody else than Kees?) don't much care, then this might be a 4.17 thing or at least "random -mm queue"?

The other unrelated reaction I had to this was that "we're passing those types down very deep, even though nobody _cares_ about them all that much at that deep level". Honestly, the only case that really cares is the very top level, and the rest could just take the properly cast versions. For example, "max_t/min_t" really don't care at all, since they - by definition - just take the single specified type.

So I'm wondering if we should just drop the types from __max/__min (and everything they call) entirely, and instead do

  #define __check_type(x,y) ((void)((typeof(x)*)1==(typeof(y)*)1))
  #define min(x,y) (__check_type(x,y),__min(x,y))
  #define max(x,y) (__check_type(x,y),__max(x,y))
  #define min_t(t,x,y) __min((t)(x),(t)(y))
  #define max_t(t,x,y) __max((t)(x),(t)(y))

and then __min/__max and friends are much simpler (and can just assume that the type is already fine, and the casting has been done).

This is technically entirely independent of this VLA cleanup thing, but the "passing the types around unnecessarily" just becomes more obvious when there's now another level of macros, _and_ it's a more complex expression too.
Yes, yes, the __single_eval_xyz() functions still end up wanting the types for the declaration of the new single-evaluation variables, but the 'typeof' pattern is the standard pattern, so

  #define __single_eval_max(max1, max2, x, y) ({ \
          typeof (x) max1 = (x); \
          typeof (y) max2 = (y); \
          max1 > max2 ? max1 : max2; })

actually looks more natural to me than passing the two types in as arguments to the macro.

(That form also is amenable to things like "__auto_type" etc simplifications).

Side note: do we *really* need the unique variable names? That's what makes those things _really_ illegible. I think it's done just for a sparse warning that we should probably ignore. It's off by default anyway exactly because it doesn't work that well due to nested macro expansions like this.

There is very real value to keeping our odd macros legible, I feel. Even when they are complicated by issues like this, it would be good to keep them as simple as possible.

Comments?

Linus
Re: [PATCH net-next] modules: allow modprobe load regular elf binaries
On Fri, Mar 9, 2018 at 11:12 AM, Linus Torvalds <torva...@linux-foundation.org> wrote:
>
> How are you going to handle five processes doing the same setup concurrently?

Side note: it's not just serialization. It's also "is it actually up and running".

The rule for "request_module()" (for a real module) has been that it returns when the module is actually alive and active and has done its initcalls.

The UMH_WAIT_EXEC behavior (ignore the serialization - you could do that in the caller) doesn't actually have any semantics AT ALL. It only means that you get the error returns from execve() itself, so you know that the executable file actually existed and parsed right enough to get started.

But you don't actually have any reason to believe that it has *done* anything, and started processing any requests. There's no reason what-so-ever to believe that it has registered itself for any asynchronous requests or anything like that.

So in the real module case, you can do

  request_module("modulename");

and just start using whatever resource you just requested. So the netfilter code literally does

  request_module("nft-chain-%u-%.*s", family, nla_len(nla), (const char *)nla_data(nla));
  nfnl_lock(NFNL_SUBSYS_NFTABLES);
  type = __nf_tables_chain_type_lookup(nla, family);
  if (type != NULL)
          return ERR_PTR(-EAGAIN);

and doesn't even care about error handling for request_module() itself, because it knows that either the module got loaded and is ready, or something failed. And it needs to look that chain type up anyway, so the failure is indicated by _that_.

With a UMH_WAIT_EXEC? No. You have *nothing*. You know the thing started, but it might have SIGSEGV'd immediately, and you have absolutely no way of knowing, and absolutely no way of even figuring it out. You can wait - forever - for something to bind to whatever dynamic resource you're expecting. You'll just fundamentally never know. You can try again, of course. Add a timeout, and try again in five seconds or something.
Maybe it will work then. Maybe it won't. You won't have any way to know the _second_ time around either. Or the third. Or...

See why I say it has to be synchronous? If it's synchronous, you can actually do things like

 (a) maybe you only need a one-time thing, and don't have any state ("load fixed tables, be done") and that's it. If the process returns with no error code, you're all done, and you know it's fine.

 (b) maybe the process wants to start a listener daemon or something like the traditional inetd model. It can open the socket, it can start listening on it, and it can fork off a child and check its status. It can then do exit(0) if everything is fine, and now request_module() returns.

See the difference? Even if you ended up with a background process (like in that (b) case), you did so with *error* handling, and you did so knowing that the state has actually been set up by the time the request_module() returns.

And if you use the proper module loading exclusion, it also means that that (b) can know it's the only process starting up, and it's not racing with another one. It might still want to do the usual lock-files in user space to protect against just the admin starting it manually, but at least you don't have the situation that a hundred threads just had a thundering herd where they all ended up using the same kernel facility, and they all independently started a hundred usermode helpers.

Linus
Re: [PATCH net-next] modules: allow modprobe load regular elf binaries
On Fri, Mar 9, 2018 at 10:57 AM, David Miller wrote:
>
> Once the helper UMH is invoked, it runs asynchronously taking eBPF
> translation requests.

How?

Really. See my comment about mutual exclusion. The current patch is *broken* because it doesn't handle it.

Really. Think of it this way - you may have now started *five* of those things concurrently by mistake. The actual module loading case never does that, because the actual module loading case has per-module serialization that got short-circuited.

How are you going to handle five processes doing the same setup concurrently?

Linus
Re: [PATCH net-next] modules: allow modprobe load regular elf binaries
On Fri, Mar 9, 2018 at 10:48 AM, Andy Lutomirski <l...@amacapital.net> wrote:
>> On Mar 9, 2018, at 10:17 AM, Linus Torvalds <torva...@linux-foundation.org> wrote:
>>
>> Hmm. I wish we had an "execute blob" model, but we really don't, and
>> it would be hard/impossible to do without pinning the pages in memory.
>
> Why so hard? We can already execute a struct file for execveat, and Alexei
> already has this working for umh.
> Surely we can make an immutable (as in even root can’t write it)
> kernel-internal tmpfs file, execveat it, then unlink it.

And what do you think that does? It pins the memory for the whole time. As a *copy* of the original file.

Anyway, see my other suggestion that makes this all irrelevant. Just wait synchronously (until the exit), and just use deny_write_access().

The "synchronous wait" means that you don't have the semantic change (and really, it's *required* anyway for the whole mutual exclusion against another thread racing to load the same module), and the deny_write_access() means that we don't need to make another copy.

Linus
Re: [PATCH net-next] modules: allow modprobe load regular elf binaries
On Fri, Mar 9, 2018 at 10:43 AM, Kees Cook wrote:
>
> Module loading (via kernel_read_file()) already uses
> deny_write_access(), and so does do_open_execat(). As long as module
> loading doesn't call allow_write_access() before the execve() has
> started in the new implementation, I think we'd be covered here.

No. kernel_read_file() only does it *during* the read. So there's a huge big honking gap between the two.

Also, the second part of my suggestion was to be entirely synchronous with the whole execution of the process, and do it within the "we do mutual exclusion for modules with the same name" logic.

Note that Andrei's patch uses UMH_WAIT_EXEC. That's basically "vfork+exec" - it only waits for the exec to have started, it doesn't wait for the whole thing.

So I'm saying "use UMH_WAIT_PROC, do it in a different place, and make sure you cover the whole sequence with deny_write_access()".

Linus
Re: [PATCH net-next] modules: allow modprobe load regular elf binaries
On Thu, Mar 8, 2018 at 9:08 PM, Alexei Starovoitov wrote:
>
> there is not abi breakage and file cannot disappear from running task.
> One cannot umount fs while file is still being used.

I think that "cannot umount" part _is_ the ABI breakage that Andy is talking about.

> Not only "read twice", but "read many".
> If .text sections of elf that are not yet in memory can be modified
> by malicious user, later they will be brought in with different code.
> I think the easiest fix to tighten this "umh modules" to CAP_SYS_ADMIN.

I don't think it actually fixes anything. It might just break things. For all we know, people run modprobe with CAP_SYS_MODULE only, since that is obviously the only capability it needs.

Hmm. I wish we had an "execute blob" model, but we really don't, and it would be hard/impossible to do without pinning the pages in memory.

My gut feel is that the right direction to explore is:

 - consider the module loaded for the whole duration of the execve. So the execution is a *blocking* operation (and we get the correct exclusion semantics)

 - use deny_write_access() to make sure that we don't have active writers and cannot get them during the execve.

The above means that something that executes to load a new ebpf rule will work very well. But a "start and forget" will not work (although you can obviously do so with an internal fork/exec).

Hmm?

Linus
Re: [PATCH net-next] modules: allow modprobe load regular elf binaries
On Thu, Mar 8, 2018 at 7:06 PM, Linus Torvalds <torva...@linux-foundation.org> wrote:
>
> So I don't like Andy's "let's make it a kernel module and then that
> kernel module can execve() a blob". THAT seems like just stupid
> indirection.
>
> But I do like Andy's "execve a blob" part, because it is the *blob*
> that has had its signature verified, not the file!

To be honest, Andy's approach has the advantage that all the synchronization we do for modules still works.

In particular, we have module loading logic where we synchronize loading of modules with the same name, so that two things that do request_module() concurrently will not lead to two modules being loaded. See that whole "module_mutex" thing, and the logic in (for example) "add_unformed_module()".

Andrei's patch short-circuits the module loading before that, so if you have two concurrent request_module() calls, you'll end up with no serialization, and two executables.

So maybe Andy is right. He often is, after all.

Linus
Re: [PATCH net-next] modules: allow modprobe load regular elf binaries
On Thu, Mar 8, 2018 at 5:44 PM, Kees Cook wrote:
>
> My concerns are mostly about crossing namespaces. If a container
> triggers an autoload, the result runs in the init_ns.

Heh. I saw that as an advantage. It's basically the same semantics as a normal module load - in that the "kernel namespace" is init_ns.

My own personal worry is actually different - we do check the signature of the file we're loading, but we're then passing it off to execve() not as the image we loaded, but as the file pointer. So the execve() will end up not using the actual buffer we checked the signature on, but instead just re-reading the file.

Now, that has two issues:

 (a) it means that 'init_module' doesn't work, only 'finit_module'. Not a big deal, but I do think that we should fail the 'info->file == NULL' case explicitly (instead of failing when it does an execve() of /dev/null, which is what I think it does now - but that's just from reading the patch, not from testing it).

 (b) somebody could maybe try to time it and modify the file after-the-fact of the signature check, and then we execute something else.

Honestly, that "read twice" thing may be what scuttles this. Initially, I thought it was a non-issue, because anybody who controls the module subdirectory enough to rewrite files would be in a position to just execute the file itself directly instead.

But it turns out that isn't needed. Some bad actor could just do "finit_module()" with a file that they just *copied* from the module directory.

Yes, yes, they still need CAP_SYS_MODULE to even get that far, but it does worry me. So this does seem to turn "CAP_SYS_MODULE" into meaning "can run any executable as root in the init namespace". They don't have to have the module signing key that I thought would protect us, because they can do that "rewrite after signature check".

So that does seem like a bad problem with the approach.
So I don't like Andy's "let's make it a kernel module and then that kernel module can execve() a blob". THAT seems like just stupid indirection.

But I do like Andy's "execve a blob" part, because it is the *blob* that has had its signature verified, not the file!

Linus
Re: [PATCH net-next] modules: allow modprobe load regular elf binaries
On Thu, Mar 8, 2018 at 4:59 PM, Andy Lutomirski wrote:
>
> Also, I don't see how this is any more exploitable than any other
> init_module().

Absolutely. If Kees doesn't trust the files to be loaded, an executable - even if it's running with root privileges and in the initns - is still fundamentally weaker than a kernel module.

So I don't understand the security argument AT ALL. It's nonsensical. The executable loading does all the same security checks that the module loading does, including the signing check.

And the whole point is that we can now do things with building and loading an ebpf rule instead of having a full module.

Linus
Re: [PATCH] kernel.h: Skip single-eval logic on literals in min()/max()
On Thu, Mar 8, 2018 at 4:45 PM, Kees Cook wrote:
>
> Rasmus mentioned this too. What I said there was that I was shy to
> make that change, since we already can't mix that kind of thing with
> the existing min()/max() implementation. The existing min()/max() is
> already extremely strict, so there are no instances of this in the
> tree.

Yes, but I also didn't want to add any new cases in case people add new min/max() users. But:

> If I explicitly add one, I see this with or without the patch:
>
> In file included from drivers/misc/lkdtm.h:7:0,
>                  from drivers/misc/lkdtm_core.c:33:
> drivers/misc/lkdtm_core.c: In function ‘lkdtm_module_exit’:
> ./include/linux/kernel.h:809:16: warning: comparison of distinct
> pointer types lacks a cast

Oh, ok, in that case, just drop the __builtin_types_compatible_p() entirely. It's not adding anything. I was expecting the non-chosen expression in __builtin_choose_expr() to not cause type warnings. I'm actually surprised it does. Type games is why __builtin_choose_expr() tends to exist in the first place.

> So are you saying you _want_ the type enforcement weakened here, or
> that I should just not use __builtin_types_compatible_p()?

I don't want to weaken the type enforcement, and I _thought_ you had done that __builtin_types_compatible_p() to keep it in place. But if that's not why you did it, then why was it there at all?

If the type warning shows through even if it's in the other expression, then just a

  #define __min(t1, t2, x, y) \
          __builtin_choose_expr( \
                  __builtin_constant_p(x) && \
                  __builtin_constant_p(y), \
                  (t1)(x) < (t2)(y) ? (t1)(x) : (t2)(y), \
                  __single_eval_min(t1, t2, \
                  ...

would seem to be sufficient? Because logically, the only thing that matters is that x and y don't have any side effects and can be evaluated twice, and "__builtin_constant_p()" is already a much stronger version of that.

Hmm? The __builtin_types_compatible_p() just doesn't seem to matter for the only thing I thought it was there for.

Linus
Re: [PATCH] kernel.h: Skip single-eval logic on literals in min()/max()
On Thu, Mar 8, 2018 at 1:40 PM, Kees Cook wrote:
> +#define __min(t1, t2, x, y) \
> +       __builtin_choose_expr(__builtin_constant_p(x) &&       \
> +                             __builtin_constant_p(y) &&       \
> +                             __builtin_types_compatible_p(t1, t2), \
> +                             (t1)(x) < (t2)(y) ? (t1)(x) : (t2)(y), \

I understand why you use __builtin_types_compatible_p(), but please don't.

It will mean that trivial constants like "5" and "sizeof(x)" won't simplify, because they have different types.

The ?: will give the right combined type anyway, and if you want the type comparison warning, just add a comma-expression with something like

  (t1 *)1 == (t2 *)1

to get the type compatibility warning.

Yeah, yeah, maybe none of the VLA cases triggered that, but it seems silly to not just get that obvious constant case right.

Hmm?

Linus
Re: [PATCH net-next] modules: allow modprobe load regular elf binaries
On Tue, Mar 6, 2018 at 12:01 PM, Andy Lutomirski wrote:
>
> I assume I'm missing some context here, but why does this need to be
> handled by the kernel rather than, say, a change to how modprobe
> works?

Honestly, the less we have to mess with user-mode tooling, the better. We've been *so* much better off moving most of the module loading logic to the kernel, we should not go back in the old broken direction.

I do *not* want the kmod project to be taken over by systemd and broken the same way they broke firmware loading.

Keep modprobe doing one thing, and one thing only: track dependencies and mindlessly just load the modules. Do *not* ask for it to do anything else.

Right now kmod is a nice simple project. Lots of testsuite stuff, and a very clear goal. Let's keep kmod doing one thing, and not even have to care about internal kernel decisions like "oh, this module might not be a module, but an executable".

If anything, I think we want to keep our options open, in case we need or want to ever consider short-circuiting things and allowing direct loading of the simple cases and bypassing modprobe entirely.

Linus
Re: [PATCH net-next] modules: allow modprobe load regular elf binaries
On Mon, Mar 5, 2018 at 5:34 PM, Alexei Starovoitov wrote:
> As the first step in development of bpfilter project [1] the request_module()
> code is extended to allow user mode helpers to be invoked. Idea is that
> user mode helpers are built as part of the kernel build and installed as
> traditional kernel modules with .ko file extension into distro specified
> location, such that from a distribution point of view, they are no different
> than regular kernel modules. Thus, allow request_module() logic to load such
> user mode helper (umh) modules via: [,,]

I like this, but I have one request: can we make sure that this action is visible in the system messages?

When we load a regular module, at least it shows in lsmod afterwards, although I have a few times wanted to really see module load as an event in the logs too. When we load a module that just executes a user program, and there is no sign of it in the module list, I think we *really* need to make that event show to the admin some way.

.. and yes, maybe we'll need to rate-limit the messages, and maybe it turns out that I'm entirely wrong and people will hate the messages after they get used to the concept of these pseudo-modules, but particularly for the early implementation when this is a new thing, I really want a message like

  "executed user process xyz-abc as a pseudo-module"

or something in dmesg. I do *not* want this to be a magical way to hide things.

Linus
Re: [4.15-rc9] fs_reclaim lockdep trace
On Sat, Jan 27, 2018 at 2:24 PM, Dave Jones wrote:
> On Tue, Jan 23, 2018 at 08:36:51PM -0500, Dave Jones wrote:
> > Just triggered this on a server I was rsync'ing to.
>
> Actually, I can trigger this really easily, even with an rsync from one
> disk to another. Though that also smells a little like networking in
> the traces. Maybe netdev has ideas.

Is this new to 4.15? Or is it just that you're testing something new? If it's new and easy to repro, can you just bisect it?

And if it isn't new, can you perhaps check whether it's new to 4.14 (ie 4.13 being ok)? Because that fs_reclaim_acquire/release() debugging isn't new to 4.15, but it was rewritten for 4.14. I'm wondering if that remodeling ended up triggering something.

Adding PeterZ to the participants list in case he has ideas. I'm not seeing what would be the problem in that call chain from hell.

Linus
Re: [RFC][PATCH] get rid of the use of set_fs() (by way of kernel_recvmsg()) in sunrpc
On Wed, Jan 17, 2018 at 7:06 PM, Al Viro wrote:
>
> Similar to the way put_cmsg() handles 32bit case on biarch
> targets, introduce a flag telling put_cmsg() to treat
> ->msg_control as kernel pointer, using memcpy instead of
> copy_to_user(). That allows to avoid the use of kernel_recvmsg()
> with its set_fs().

If this is the only case that kernel_recvmsg() exists for, then by all means, that patch certainly looks like a good thing.

Linus
Re: [PATCH v2 00/19] prevent bounds-check bypass via speculative execution
On Fri, Jan 12, 2018 at 4:15 PM, Tony Luck wrote:
>
> Here there isn't any reason for speculation. The core has the
> value of 'x' in a register and the upper bound encoded into the
> "cmp" instruction. Both are right there, no waiting, no speculation.

So this is an argument I haven't seen before (although it was brought up in private long ago), but that is very relevant: the actual scope and depth of speculation.

Your argument basically depends on just what gets speculated, and on the _actual_ order of execution. So your argument depends on "the uarch will actually run the code in order if there are no events that block the pipeline". Or at least it depends on a certain latency of the killing of any OoO execution being low enough that the cache access doesn't even begin.

I realize that that is very much a particular microarchitectural detail, but it's actually a *big* deal. Do we have a set of rules for what is not a worry, simply because the speculated accesses get killed early enough?

Apparently "test a register value against a constant" is good enough, assuming that register is also needed for the address of the access.

Linus
Re: [PATCH v2 00/19] prevent bounds-check bypass via speculative execution
On Thu, Jan 11, 2018 at 4:46 PM, Dan Williams wrote:
>
> This series incorporates Mark Rutland's latest ARM changes and adds
> the x86 specific implementation of 'ifence_array_ptr'. That ifence
> based approach is provided as an opt-in fallback, but the default
> mitigation, '__array_ptr', uses a 'mask' approach that removes
> conditional branches instructions, and otherwise aims to redirect
> speculation to use a NULL pointer rather than a user controlled value.

Do you have any performance numbers and perhaps example code generation? Is this noticeable? Are there any microbenchmarks showing the difference between lfence use and the masking model?

Having both seems good for testing, but wouldn't we want to pick one in the end?

Also, I do think that there is one particular array load that would seem to be pretty obvious: the system call function pointer array.

Yes, yes, the actual call is now behind a retpoline, but that protects against a speculative BTB access, it's not obvious that it protects against the mispredict of the __NR_syscall_max comparison in arch/x86/entry/entry_64.S.

The act of fetching code is a kind of read too. And retpoline protects against BTB stuffing etc, but what if the _actual_ system call function address is wrong (due to mis-prediction of the system call index check)?

Should the array access in entry_SYSCALL_64_fastpath be made to use the masking approach?

Linus
Re: [PATCH 16/18] net: mpls: prevent bounds-check bypass via speculative execution
On Tue, Jan 9, 2018 at 4:48 PM, Dan Williams wrote:
>
> I looks like there is another problem, or I'm misreading the
> cleverness.

I think you were misreading it. I was basically saying that this:

  unsigned long _mask = ~(long)(_m - 1 - _i) >> BITS_PER_LONG - 1;\

doesn't work, and that the "_m - 1 - _i" needs to be replaced by "_i | (_m - 1 - _i)". So you have

  unsigned long _mask = ~(long)(_i | (_m - 1 - _i)) >> BITS_PER_LONG - 1;\

which should give the right result. No?

But as mentioned, I think you can do it with two instructions if you do an architecture-specific inline asm:

  unsigned long mask;
  asm("cmpq %1,%2; sbbq %0,%0"
      :"=r" (mask)
      :"g" (max),"r" (idx));

which is likely much faster, and has much better register usage ("max" can be a constant or loaded directly from memory, and "mask" could be assigned the same register as idx). But once again, I didn't really test it.

Note that the "cmpq/sbbq" version works regardless of max/idx values, since it literally does the math in BITS_PER_LONG+1 bits. In contrast, the "binary or with idx" version only works if the high bit set in idx cannot be valid (put another way: 'max' must not be bigger than MAXLONG+1).

Linus
Re: Re: dvb usb issues since kernel 4.9
On Tue, Jan 9, 2018 at 9:57 AM, Eric Dumazet wrote:
>
> Your patch considers TASKLET_SOFTIRQ being a candidate for 'immediate
> handling', but TCP Small queues heavily use TASKLET,
> so as far as I am concerned a revert would have the same effect.

Does it actually? TCP ends up dropping packets outside of the window etc, so flooding a machine with TCP packets and causing some further processing up the stack sounds very different from the basic packet flooding thing that happens with NET_RX_SOFTIRQ.

Also, honestly, the kinds of people who really worry about flooding tend to have packet filtering in the receive path etc.

So I really think "you can use up 90% of CPU time with a UDP packet flood from the same network" is very very very different - and honestly not at all as important - as "you want to be able to use a USB DVB receiver and watch/record TV". Because that whole "UDP packet flood from the same network" really is something you _fundamentally_ have other mitigations for.

I bet that whole commit was introduced because of a benchmark test, rather than real life. No? In contrast, now people are complaining about real loads not working.

Linus
Re: dvb usb issues since kernel 4.9
On Tue, Jan 9, 2018 at 9:42 AM, Mauro Carvalho Chehab wrote:
>
> On my preliminar tests, writing to a file on an ext4 partition at a
> USB stick loses data up to the point to make it useless (1/4 of the data
> is lost!). However, writing to a class 10 microSD card is doable.

Note that most USB sticks are horrible crap. They can have write latencies counted in _seconds_. You can cause VM issues and various other non-hardware stalls with them, simply because something gets stuck waiting for a page writeout that should take a few ms on any reasonable hardware, but ends up taking half a second or more.

For example, even really well-written software that tries to do things like threaded write-behind to smooth out the IO will be _totally_ screwed by the USB stick behavior (where you might write a few MB at high speeds, and then the next write - however small - takes a second because the stupid USB stick does a synchronous garbage collection). Suddenly all that clever software that tried to keep things moving along smoothly without any hiccups, and tried hard to make the USB bus have a nice constant load, can't do anything at all about the crap hardware.

So when testing writes to USB sticks, I'm not convinced you're actually testing any USB bus limitations or even really any other hardware limitations than the USB stick itself.

Linus
Re: Re: dvb usb issues since kernel 4.9
On Tue, Jan 9, 2018 at 9:27 AM, Eric Dumazet wrote:
>
> So yes, commit 4cd13c21b207 ("softirq: Let ksoftirqd do its job") has
> shown up multiple times in various 'regressions'
> simply because it could surface the problem more often.
> But even if you revert it, you can still make the faulty
> driver/subsystem misbehave by adding more stress to the cpu handling
> the IRQ.

..but that's always true. People sometimes live on the edge - often by design (ie hardware has been designed/selected to be the crappiest possible that still works). That doesn't change anything. A patch that takes "bad things can happen" to "bad things DO happen" is a bad patch.

> Maybe the answer is to tune the kernel for small latencies at the
> price of small throughput (situation before the patch)

Generally we always want to tune for latency. Throughput is "easy", but almost never interesting.

Sure, people do batch jobs. And yes, people often _benchmark_ throughput, because it's easy to benchmark. It's much harder to benchmark latency, even when it's often much more important.

A prime example is the SSD benchmarks in the last few years - they improved _dramatically_ when people noticed that the real problem was latency, not the idiotic maximum big-block bandwidth numbers that have almost zero impact on most people.

Put another way: we already have a very strong implicit bias towards bandwidth just because it's easier to see and measure. That means that we generally should strive to have an explicit bias towards optimizing for latency when that choice comes up. Just to balance things out (and just to not take the easy way out: bandwidth can often be improved by adding more layers of buffering and bigger buffers, and that often ends up really hurting latency).

> 1) Revert the patch

Well, we can revert it only partially - limiting it to just networking for example. Just saying "act the way you used to for tasklets" already seems to have fixed the issue in DVB.
> 2) get rid of ksoftirqd since it adds unexpected latencies.

We can't get rid of it entirely, since the synchronous softirq code can cause problems too. It's why we have that "maximum of ten synchronous events" in __do_softirq().

And we don't *want* to get rid of it. We've _always_ had that small-scale "at some point we can't do it synchronously any more". That is a small-scale "don't have horrible latency for _other_ things" protection. So it's about latency too, it's just about protecting latency of the rest of the system.

The problem with commit 4cd13c21b207 is that it turns the small-scale latency issues in softirq handling (they get larger latencies for lots of hardware interrupts or even from non-preemptible kernel code) into the _huge_ scale latency of scheduling, and does so in a racy way too.

> 3) Let applications that expect to have high throughput make sure to
> pin their threads on cpus that are not processing IRQ.
> (And make sure to not use irqbalance, and setup IRQ cpu affinities)

The only people that really deal in "throughput only" tend to be the HPC people, and they already do things like this.

(The other end of the spectrum is the realtime people that have extreme latency requirements, who do things like that for the reverse reason: keeping one or more CPU's reserved for the particular low-latency realtime job).

Linus
Re: [PATCH 16/18] net: mpls: prevent bounds-check bypass via speculative execution
On Mon, Jan 8, 2018 at 8:13 PM, Linus Torvalds <torva...@linux-foundation.org> wrote:
>
>    # carry will be clear if idx >= max
>    cmpq %idx,%max

Bah. Other way around.

   cmpq %max,%idx

I'm a moron.

>    # mask will be clear if carry was clear, ~0 otherwise
>    sbbq %mask,%mask
>
> to generate mask directly.

I might have screwed that up. Worth perhaps trying? More importantly, worth _testing_ and fixing my hand-waving "asm like this" crap. But I do think that simple two-instruction cmpq/sbbq sequence could get it right in just two trivial ALU instructions.

Linus
Re: [PATCH 16/18] net: mpls: prevent bounds-check bypass via speculative execution
On Mon, Jan 8, 2018 at 7:42 PM, Dan Williams wrote:
>
> originally from Linus and tweaked by Alexei and I:

Sadly, that tweak - while clever - is wrong.

> unsigned long _mask = ~(long)(_m - 1 - _i) >> BITS_PER_LONG - 1;\

Why? Because "(long)(_m-1-_i)" is not negative just because "i >= m". It can still be positive.

Think "m = 100", "i=bignum". The subtraction will overflow and become positive again, the shift will shift to zero, and then the mask will become ~0.

Now, you can fix it, but you need to be a tiny bit more clever. In particular, just make sure that you retain the high bit of "_i", basically making the rule be that a negative index is not ever valid.

And then instead of "(_m - 1 - _i)", you use "(_i | (_m - 1 - _i))". Now the sign bit is set if _i had it set, _or_ if the subtraction turned negative, and you don't have to worry about the overflow situation.

But it does require that extra step to be trustworthy. Still purely cheap arithmetic operations, although there is possibly some additional register pressure there.

Somebody might be able to come up with something even more minimal (or find a fault in my fix of your tweak).

Obviously, with architecture-specific code, you may well be able to do better, using the carry flag of the subtraction. For example, on x86, I think you could do it with just two instructions:

   # carry will be clear if idx >= max
   cmpq %idx,%max

   # mask will be clear if carry was clear, ~0 otherwise
   sbbq %mask,%mask

to generate mask directly. I might have screwed that up. Worth perhaps trying?

Linus
Re: [PATCH 06/18] x86, barrier: stop speculation for failed access_ok
On Mon, Jan 8, 2018 at 3:53 PM, Dan Williams wrote:
>
> I've been thinking the "and" is only suitable for the array bounds
> check, for get_user() we're trying to block speculation past
> access_ok() at which point we can only do the lfence?

Well, we *could* do the "and", at least for the simple cases (ie the true "get_user()" that integrates the access_ok with the access).

IOW, mainly the code in arch/x86/lib/getuser.S.

But it probably is a lot simpler to just add the "lfence" to ASM_STAC, because by definition those cases don't tend to be the truly critical ones - people who use those functions tend to do one or two accesses, and the real cost is likely the I$ misses and the D$ miss to get current->addr_limit. Not to mention the "stac" itself, which is much more expensive than the access on current microarchitectures.

But something like this *might* work:

  index c97d935a29e8..7fa3d293beaf 100644
  --- a/arch/x86/lib/getuser.S
  +++ b/arch/x86/lib/getuser.S
  @@ -38,8 +38,11 @@
   .text
   ENTRY(__get_user_1)
          mov PER_CPU_VAR(current_task), %_ASM_DX
  -       cmp TASK_addr_limit(%_ASM_DX),%_ASM_AX
  +       mov TASK_addr_limit(%_ASM_DX),%_ASM_DX
  +       cmp %_ASM_DX,%_ASM_AX
          jae bad_get_user
  +       or $0xfff,%_ASM_DX
  +       and %_ASM_DX,%_ASM_AX
          ASM_STAC
  1:      movzbl (%_ASM_AX),%edx
          xor %eax,%eax

(this only does the one-byte case - the 2/4/8 byte cases are exactly the same).

The above is completely untested and might have some stupid thinko/typo, so take it purely as a "example patch" to show the concept, rather than actually do it.

But just adding "lfence" to the existing ASM_STAC is a hell of a lot easier, and the performance difference between that trivial patch and the above "let's be clever with 'and'" might not be measurable. I really have no idea how expensive lfence might actually end up being in practice. It's possible that lfence is actually fairly cheap in kernel code, since we tend to not have very high IPC anyway.

Linus
Re: [PATCH 06/18] x86, barrier: stop speculation for failed access_ok
On Mon, Jan 8, 2018 at 1:09 PM, Dan Williams <dan.j.willi...@intel.com> wrote:
> On Sat, Jan 6, 2018 at 5:20 PM, Linus Torvalds
> <torva...@linux-foundation.org> wrote:
>> On Sat, Jan 6, 2018 at 3:31 PM, Dan Williams <dan.j.willi...@intel.com> wrote:
>>>
>>> I assume if we put this in uaccess_begin() we also need audit for
>>> paths that use access_ok but don't do on to call uaccess_begin()? A
>>> quick glance shows a few places where we are open coding the stac().
>>> Perhaps land the lfence in stac() directly?
>>
>> Yeah, we should put it in uaccess_begin(), and in the actual user
>> accessor helpers that do stac. Some of them probably should be changed
>> to use uaccess_begin() instead while at it.
>>
>> One question for the CPU people: do we actually care and need to do
>> this for things that might *write* to something? The speculative write
>> obviously is killed, but does it perhaps bring in a cacheline even
>> when killed?
>
> As far as I understand a write could trigger a request-for-ownership
> read for the target cacheline.

Oh, absolutely. I just wonder at what point that happens.

Honestly, trying to get exclusive access to a cacheline can be _very_ expensive (not just for the local thread), so I would actually expect that doing so for speculative writes is actually bad for performance.

That's doubly true because - unlike reads - there is no critical latency issue, so trying to get the cache access started as early as possible simply isn't all that important. So I suspect that a write won't actually try to allocate the cacheline until the write has actually retired.

End result: writes - unlike reads - *probably* will not speculatively perturb the cache with speculative write addresses.

> Even though writes can trigger reads, as far as I can see the write
> needs to be dependent on the first out-of-bounds read

Yeah.
A write on its own wouldn't matter, even if it were to perturb the cache state, because the address already comes from user space, so there's no new information in the cache perturbation for the attacker. But that all implies that we shouldn't need the lfence for the "put_user()" case, only for the get_user() (where the value we read would then perhaps be used to do another access). So we want to add the lfence (or "and") to get_user(), but not necessarily put_user(). Agreed? Linus
Re: dvb usb issues since kernel 4.9
On Mon, Jan 8, 2018 at 11:15 AM, Alan Stern wrote:
>
> Both dwc2_hsotg and ehci-hcd use the tasklets embedded in the
> giveback_urb_bh member of struct usb_hcd. See usb_hcd_giveback_urb()
> in drivers/usb/core/hcd.c; the calls are
>
>         else if (high_prio_bh)
>                 tasklet_hi_schedule(&bh->bh);
>         else
>                 tasklet_schedule(&bh->bh);
>
> As it turns out, high_prio_bh gets set for interrupt and isochronous
> URBs but not for bulk and control URBs. The DVB driver in question
> uses bulk transfers.

Ok, so we could try out something like the appended?

NOTE! I have not tested this at all. It LooksObvious(tm), but...

Linus

 kernel/softirq.c | 12 ++++++++----
 1 file changed, 8 insertions(+), 4 deletions(-)

diff --git a/kernel/softirq.c b/kernel/softirq.c
index 2f5e87f1bae2..97b080956fea 100644
--- a/kernel/softirq.c
+++ b/kernel/softirq.c
@@ -79,12 +79,16 @@ static void wakeup_softirqd(void)

 /*
  * If ksoftirqd is scheduled, we do not want to process pending softirqs
- * right now. Let ksoftirqd handle this at its own rate, to get fairness.
+ * right now. Let ksoftirqd handle this at its own rate, to get fairness,
+ * unless we're doing some of the synchronous softirqs.
  */
-static bool ksoftirqd_running(void)
+#define SOFTIRQ_NOW_MASK ((1 << HI_SOFTIRQ) | (1 << TASKLET_SOFTIRQ))
+static bool ksoftirqd_running(unsigned long pending)
 {
 	struct task_struct *tsk = __this_cpu_read(ksoftirqd);

+	if (pending & SOFTIRQ_NOW_MASK)
+		return false;
 	return tsk && (tsk->state == TASK_RUNNING);
 }

@@ -325,7 +329,7 @@ asmlinkage __visible void do_softirq(void)

 	pending = local_softirq_pending();

-	if (pending && !ksoftirqd_running())
+	if (pending && !ksoftirqd_running(pending))
 		do_softirq_own_stack();

 	local_irq_restore(flags);

@@ -352,7 +356,7 @@ void irq_enter(void)

 static inline void invoke_softirq(void)
 {
-	if (ksoftirqd_running())
+	if (ksoftirqd_running(local_softirq_pending()))
 		return;

 	if (!force_irqthreads) {
Re: [PATCH bpf] bpf: prevent out-of-bounds speculation
On Mon, Jan 8, 2018 at 9:05 AM, Mark Rutland wrote:
>
> I'm a little worried that in the presence of some CPU/compiler
> optimisations, the masking may effectively be skipped under speculation.
> So I'm not sure how robust this is going to be.

Honestly, I think the masking is a hell of a lot more robust than any of the "official" fixes.

More generic data speculation (as opposed to control speculation) is

 (a) mainly academic masturbation

 (b) hasn't even been shown to be a good idea even in _theory_ yet
     (except for the "completely unreal hardware" kind of theory where
     people assume some data oracle)

 (c) isn't actually done in any real CPU's today that I'm aware of
     (unless you want to call the return stack data speculation).

and the thing is, we should really not then worry about "... but maybe future CPU's will be more aggressive", which is the traditional worry in these kinds of cases.

Why? Because quite honestly, any future CPU's that are more aggressive about speculating things like this are broken shit that we should call out as such, and tell people not to use.

Seriously. In this particular case, we should be very much aware of future CPU's being more _constrained_, because CPU vendors had better start taking this thing into account.

So the masking approach is FUNDAMENTALLY SAFER than the "let's try to limit control speculation".

If somebody can point to a CPU that actually speculates across an address masking operation, I will be very surprised. And unless you can point to that, then stop trying to dismiss the masking approach.

The only thing we need to be really really careful about is to make sure that the mask generation itself is not in a control speculation path. IOW, the mask really has to be a true data dependency, and has to be generated arithmetically. Because if the mask generation has a control dependency on it, then obviously that defeats the whole "make sure we don't have control speculation" approach.

Linus
Re: dvb usb issues since kernel 4.9
On Mon, Jan 8, 2018 at 9:55 AM, Ingo Molnar wrote:
>
> as I doubt we have enough time to root-case this properly.

Well, it's not like this is a new issue, and we don't have to get it fixed for 4.15. It's been around since 4.9, it's not a "have to suddenly fix it this week" issue.

I just think that people should plan on having to maybe revert it and mark the revert for stable.

But if the USB or DVB layers can instead just make the packet queue a bit deeper and not react so badly to the latency of a single softirq, that would obviously be a good thing in general, and maybe fix this issue. So I'm not saying that the revert is inevitable either.

But I have to say that that commit 4cd13c21b207 ("softirq: Let ksoftirqd do its job") was a pretty damn big hammer, and entirely ignored the "softirqs can have latency concerns" issue.

So I do feel like the UDP packet storm thing might want a somewhat more directed fix than that huge hammer of trying to move softirqs aggressively into the softirq thread.

This is not that different from threaded irqs. And while you can set the "thread every irq" flag, that would be largely insane to do by default and in general. So instead, people do it either for specific irqs (ie "request_threaded_irq()") or they have a way to opt out of it (IRQF_NO_THREAD).

I _suspect_ that the softirq thing really just wants the same thing. Have the networking case maybe set the "prefer threaded" flag just for networking, if it's less latency-sensitive for softirq handling than other softirq users.

In fact, even for networking, there are separate TX/RX softirqs, maybe networking would only set it for the RX case? Or maybe even trigger it only for cases where things queue up and it goes into a "polling mode" (like NAPI already does).

Of course, I don't even know _which_ softirq it is that the DVB case has issues with. Maybe it's the same NET_RX case?
But looking at that offending commit, I do note (for example), that we literally have things like tasklet[_hi]_schedule() that might have been explicitly expected to just run the tasklet at a fairly low latency (maybe instead of a workqueue exactly because it doesn't need to sleep and wants lower latency). So saying "just because softirqd is possibly already woken up, let's delay all those tasklets etc" does really seem very wrong to me. Can somebody tell which softirq it is that dvb/usb cares about? Linus
Re: dvb usb issues since kernel 4.9
On Sat, Jan 6, 2018 at 11:54 AM, Mauro Carvalho Chehab wrote:
>
> On Sat, 6 Jan 2018 16:04:16 +0100
> "Josef Griebichler" wrote:
>>
>> the causing commit has been identified.
>> After reverting commit
>> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=4cd13c21b207e80ddb1144c576500098f2d5f882
>> its working again.
>
> Just replying to me won't magically fix this. The ones that were involved on
> this patch should also be c/c, plus USB people. Just added them.

Actually, you seem to have added an odd subset of the people involved. For example, Ingo - who actually committed that patch - wasn't on the cc.

I do think we need to simply revert that patch. It's very simple: it has been reported to lead to actual problems for people, and we don't fix one problem and then say "well, it fixed something else" when something breaks.

When something breaks, we either unbreak it, or we revert the change that caused the breakage. It's really that simple. That's what "no regressions" means. We don't accept changes that cause regressions. This one did.

Linus
Re: [PATCH 06/18] x86, barrier: stop speculation for failed access_ok
On Sun, Jan 7, 2018 at 12:12 PM, Willy Tarreau wrote:
>
> Linus, no need to explain that to me, I'm precisely trying to see how
> to disable PTI for a specific process because I face up to 45% loss in
> certain circumstances, making it a no-go. But while a few of us have
> very specific workloads emphasizing this impact, others have very
> different ones and will not notice. For example my laptop did boot
> pretty fine and I didn't notice anything until I fire a network
> benchmark.

Sure, most people have hardware where the bottleneck is entirely elsewhere (slow network, rotating disk, whatever).

But this whole "normal people won't notice" is dangerous thinking. They may well notice very much, we simply don't know what they are doing.

Quite honestly, it's equally correct to say "normal people won't be affected by the security issue in the first place". That laptop that you didn't have any issues with? Likely it never had an exploit running on it either!

So the whole "normal people" argument is pure and utter garbage. It's wrong. It's pure shit when it comes to performance, but it's also pure shit when it comes to the security issue. Don't use it.

We need to fix the security problem, but we need to do it *without* these braindead arguments that performance is somehow secondary.

Linus
Re: [PATCH 06/18] x86, barrier: stop speculation for failed access_ok
On Sat, Jan 6, 2018 at 10:33 PM, Willy Tarreau wrote:
>
> To be fair there's overreaction on both sides. The vast majority of
> users need to get a 100% safe system and will never notice any
> difference.

There is no such thing as a "100% safe system". Never will be - unless you make sure you have no users.

Also, people definitely *are* noticing the performance issues with the current set of patches, and they are causing real problems. Go search for reports of Amazon AWS slowdowns.

So this whole "security is so important that performance doesn't matter" mindset is pure and utter garbage. And the whole "normal people won't even notice" is pure garbage too. Don't spread that bullshit when you see actual normal people complaining.

Performance matters. A *LOT*.

Linus
Re: [PATCH 06/18] x86, barrier: stop speculation for failed access_ok
On Sat, Jan 6, 2018 at 3:31 PM, Dan Williams wrote:
>
> I assume if we put this in uaccess_begin() we also need audit for
> paths that use access_ok but don't do on to call uaccess_begin()? A
> quick glance shows a few places where we are open coding the stac().
> Perhaps land the lfence in stac() directly?

Yeah, we should put it in uaccess_begin(), and in the actual user accessor helpers that do stac. Some of them probably should be changed to use uaccess_begin() instead while at it.

One question for the CPU people: do we actually care and need to do this for things that might *write* to something? The speculative write obviously is killed, but does it perhaps bring in a cacheline even when killed?

Because maybe we don't need the lfence in put_user(), only in get_user()?

Linus
Re: [PATCH 06/18] x86, barrier: stop speculation for failed access_ok
On Sat, Jan 6, 2018 at 4:32 AM, Alan Cox wrote:
>
> Also for x86-64 if we are trusting that an AND with a constant won't get
> speculated into something else surely we can just and the address with ~(1
> << 63) before copying from/to user space ? The user will then just
> speculatively steal their own memory.

User accesses *may* go to the kernel too, with set_fs(KERNEL_DS). We've been getting rid of those, but they still exist.

Historically we could perhaps have 'and'ed the address with current->thread.addr_limit, but it's no longer a power-of-two mask, so that doesn't work. We'd have to play tricks there, but it might work to do something like

    addr &= current->thread.addr_limit | 0xfff;

or similar.

But this is one area where the 'lfence' is probably not too bad. The cost of the existing 'stac' instruction is much higher, and in fact depending on how lfence works (and how stac affects the memory unit pipeline), it might even _help_ to serialize the stac with the actual address.

Linus
Re: [PATCH 06/18] x86, barrier: stop speculation for failed access_ok
On Fri, Jan 5, 2018 at 6:52 PM, Linus Torvalds <torva...@linux-foundation.org> wrote:
>
> The fact is, we have to stop speculating when access_ok() does *not*
> fail - because that's when we'll actually do the access. And it's that
> access that needs to be non-speculative.

I also suspect we should probably do this entirely differently.

Maybe the whole lfence can be part of uaccess_begin() instead (ie currently 'stac()').

That would fit the existing structure better, I think. And it would avoid any confusion about the whole "when to stop speculation".

Linus
Re: [PATCH 01/18] asm-generic/barrier: add generic nospec helpers
On Fri, Jan 5, 2018 at 5:09 PM, Dan Williams wrote:
> +#ifndef nospec_ptr
> +#define nospec_ptr(ptr, lo, hi)				\

Do we actually want this horrible interface?

It just causes the compiler - or inline asm - to generate worse code, because it needs to compare against both high and low limits.

Basically all users are arrays that are zero-based, and where a comparison against the high _index_ limit would be sufficient.

But the way this is all designed, it's literally designed for bad code generation for the unusual case, and the usual array case is written in the form of the unusual and wrong non-array case.

That really seems excessively stupid.

Linus
Re: [PATCH 06/18] x86, barrier: stop speculation for failed access_ok
On Fri, Jan 5, 2018 at 5:10 PM, Dan Williams wrote:
> From: Andi Kleen
>
> When access_ok fails we should always stop speculating.
> Add the required barriers to the x86 access_ok macro.

Honestly, this seems completely bogus.

The description is pure garbage afaik.

The fact is, we have to stop speculating when access_ok() does *not* fail - because that's when we'll actually do the access. And it's that access that needs to be non-speculative.

That actually seems to be what the code does (it stops speculation when __range_not_ok() returns false, but access_ok() is !__range_not_ok()).

But the explanation is crap, and dangerous.

Linus