Re: [GIT] Networking

2018-10-28 Thread Linus Torvalds
On Sun, Oct 28, 2018 at 7:46 PM David Miller  wrote:
>
> Please pull, thanks a lot!

Pulled,

  Linus


Re: [PATCH 1/2] net: handle NULL ->poll gracefully

2018-06-29 Thread Linus Torvalds
On Fri, Jun 29, 2018 at 6:46 AM Linus Torvalds
 wrote:
>
> Hmm. I'll have to think about it.

I took yours over the revert. It's just smaller and nicer, and the
conditional should predict fine for any case where it might matter.

Linus


Re: [PATCH 1/2] net: handle NULL ->poll gracefully

2018-06-29 Thread Linus Torvalds
On Fri, Jun 29, 2018 at 6:37 AM Christoph Hellwig  wrote:
>
> The big aio poll revert broke various network protocols that don't
> implement ->poll as a patch in the aio poll series removed sock_no_poll
> and made the common code handle this case.

Hmm. I just did the revert of commit 984652dd8b1f ("net: remove
sock_no_poll") in my tree.

But I haven't pushed it out, and your patch is certainly smaller/nicer
than the revert.

That said, the revert does have the advantage of avoiding a
conditional in the network poll() function (at the cost of a more
expensive indirect call for when sock_no_poll actually triggers, but
only "odd" network protocols have that).

Hmm. I'll have to think about it.

   Linus


Re: [PATCH 6/6] fs: replace f_ops->get_poll_head with a static ->f_poll_head pointer

2018-06-29 Thread Linus Torvalds
On Fri, Jun 29, 2018 at 6:29 AM Christoph Hellwig  wrote:
> No need for poll_table_entry, we just need a wait_queue_head.
> poll_table_entry is a select.c internal (except for two nasty drivers) -
> neither epoll nor most in-kernel callers use it.

Well, you need the poll_table for the "poll_wait()", and you would
need to be able to have multiple wait-queues even though you only have
one file you're waiting for.

But yes, it doesn't really need to be the full complex poll_table_entry model.

  Linus


Re: [PATCH 6/6] fs: replace f_ops->get_poll_head with a static ->f_poll_head pointer

2018-06-28 Thread Linus Torvalds
On Thu, Jun 28, 2018 at 4:37 PM Al Viro  wrote:
>
> You underestimate the nastiness of that thing (and for the record, I'm
> a lot *less* fond of AIO than you are, what with having had to read that
> nest of horrors lately).  It does not "copy the data to userland"; what it
> does instead is copying into an array of pages it keeps, right from
> IO completion callback.  I

Ugh.

Oh well. I'd be perfectly happy to have somebody re-write and
re-architect the aio code entirely.  Much rather that than the
->poll() code. Because I know which one I think is well-designed with a
nice usable interface, and which one is a pile of shit.

In the meantime, if AIO wants to do poll() in some irq callback, may I
suggest just using workqueues.

Linus


Re: [PATCH 6/6] fs: replace f_ops->get_poll_head with a static ->f_poll_head pointer

2018-06-28 Thread Linus Torvalds
On Thu, Jun 28, 2018 at 3:49 PM Al Viro  wrote:
>
> aio_poll() is not a problem.  It's wakeup callback that is one.

No, it's not a problem either.

The only thing the wakeup callback wants to do is to remove the wait queue entry.

And *that* doesn't need to sleep, and it has absolutely nothing to do
with whether "->poll()" itself sleeps or not.

All that needs to do is "poll_freewait()". Done. In practice, it's
probably only going to be a single free_poll_entry(), but who cares?
The AIO code should handle the "maybe ->poll() added multiple
wait-queues" just fine.

> You are misreading that mess.  What he's trying to do (other than surviving
> the awful clusterfuck around cancels) is to handle the decision what to
> report to userland right in the wakeup callback.  *That* is what really
> drives the "make the second-pass ->poll() or something similar to it
> non-blocking" (in addition to the fact that it is such in considerable
> majority of instances).

That's just crazy BS.

Just call poll() again when you copy the data to userland (which by
definition can block, again).

Stop the idiotic "let's break poll for stupid AIO reasons, because the
AIO people are morons".

Just make the AIO code do the sane thing.

  Linus


Re: [PATCH 6/6] fs: replace f_ops->get_poll_head with a static ->f_poll_head pointer

2018-06-28 Thread Linus Torvalds
On Thu, Jun 28, 2018 at 3:20 PM Al Viro  wrote:
>
> The rules for drivers change only in one respect - if your ->poll() is going to
> need to block, check poll_requested_events(pt) & EPOLL_ATOMIC and return EPOLLNVAL
> in such case.

I still don't even understand why you care.

Yes, the AIO poll implementation did it under the spinlock.

But there's no good *reason* for that.  The "aio_poll()" function
itself is called in perfectly fine blocking context.

The only reason it does it under the spinlock is that apparently
Christoph didn't understand how poll() worked.

As far as I can tell, Christoph could have just done the first pass
'->poll()' *without* taking a spinlock, and that adds the table entry
to the table. Then, *under the spinlock*, you associate the table with
the kioctx. And then *after* the spinlock, you can call "->poll()"
again (now with a NULL table pointer), to verify that the state is
still not triggered. That's the whole point of the two-phase poll
thing - the first phase adds the entry to the wait queues, and the
second phase checks for the race of "did the event happen in the
meantime".

There is absolutely no excuse for calling '->poll()' itself under the
spinlock. I don't see any reason for it. The whole "AIO needs this to
avoid races" was always complete and utter bullshit, as far as I can
tell.

So stop it with this crazy and pointless "poll() might block".

IT DAMN WELL SHOULD BE ABLE TO BLOCK, AND NOBODY SANE WILL EVER CARE!

If somebody cares, they are doing things wrong. So fix the AIO code,
don't look at the poll() code, for chrissake!

   Linus


Re: [PATCH 6/6] fs: replace f_ops->get_poll_head with a static ->f_poll_head pointer

2018-06-28 Thread Linus Torvalds
On Thu, Jun 28, 2018 at 2:30 PM Al Viro  wrote:
>
> > Again, locking is permitted. It's not great, but it's not against the rules.
>
> Me: a *LOT* of ->poll() instances only block in __pollwait() called
> (indirectly) on the first pass.
>
> You: They are *all* supposed to do it.
>
> Me: 

Oh, I thought you were talking about the whole "first pass" adding to
wait queues, as opposed to doing it on the second pass.

The *blocking* is entirely immaterial. I didn't even react to it,
because it's simply not an issue.

I don't understand why you're even hung up about it.

The only reason "blocking" seems to be an issu eis because AIO has
shit-for-brains and wanted to do poll() under the spinlock.

But that's literally just AIO being confused garbage. It has zero
relevance for anything else.

Linus


Re: [PATCH 6/6] fs: replace f_ops->get_poll_head with a static ->f_poll_head pointer

2018-06-28 Thread Linus Torvalds
On Thu, Jun 28, 2018 at 1:37 PM Al Viro  wrote:
>
>
> Speaking of obvious bogosities (a lot more so than a blocking allocation
> several calls down into helper):
>
> static __poll_t ca8210_test_int_poll(
>         struct file *filp,
>         struct poll_table_struct *ptable
> )

Ok, that's just garbage.

Again, this is not an excuse for changing "->poll()". This is just a
bogus driver that does stupid things.

And again, don't use this as an example of "poll must not block" The
fact is, poll() *can* block for locking and other such things, and
it's perfectly normal and ok. It just shouldn't block for IO.

I will not take any misguided patches to try to "fix" poll functions
that might block.

But I will take patches to fix bugs in drivers. In this case, I think
the fix is to just remove that crazy "wait_event_interruptible()"
entirely.

Not that anybody cares, obviously. Apparently nobody has ever noticed
how broken that poll function is.

   Linus


Re: [PATCH 6/6] fs: replace f_ops->get_poll_head with a static ->f_poll_head pointer

2018-06-28 Thread Linus Torvalds
On Thu, Jun 28, 2018 at 1:28 PM Al Viro  wrote:
>
>
> Sure, but...
>
> static __poll_t binder_poll(struct file *filp,
>                             struct poll_table_struct *wait)
> {
>         struct binder_proc *proc = filp->private_data;
>         struct binder_thread *thread = NULL;
>         bool wait_for_proc_work;
>
>         thread = binder_get_thread(proc);
>         if (!thread)
>                 return POLLERR;

That's actually fine.

In particular, it's ok to *not* add yourself to the wait-queues if you
already return the value that you will always return.

For example, the poll function for a regular file could just always return

EPOLLIN | EPOLLOUT | EPOLLRDNORM | EPOLLWRNORM

because it never changes. Now, it doesn't actually *do* that, because
we have the rule that "no ->poll function" means exactly that, but
it's not wrong.
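
A sketch of that idea, purely for illustration (this is not an actual
kernel poll method):

      static __poll_t always_ready_poll(struct file *filp, struct poll_table_struct *wait)
      {
              /* The state never changes, so there is nothing to wait for and
               * no reason to add ourselves to any wait queue. */
              return EPOLLIN | EPOLLOUT | EPOLLRDNORM | EPOLLWRNORM;
      }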

Similarly, if a poll handler has an error that will not go away, the
above is actually the *right* thing to do. There simply isn't anything
to wait for.

Arguably, it probably should return not just POLLERR, but also
POLLIN/POLLOUT, since an error *also* means that it's going to return
immediately from read/write, but that's a separate thing entirely.

> And that's hardly unique - we have instances playing with timers,
> allocations, whatnot.  Even straight mutex_lock(), as in

So?

Again, locking is permitted. It's not great, but it's not against the rules.

So none of the things you point to are excuses for changing interfaces
or adding any flags.

And no, we don't care at all about "blocking" in general. Somebody who
cares about _performance_ may care, but it's not fundamental. Even
select() and poll() itself will block for allocating select tables
etc, but also for reading (and updating) user space.

Anybody who thinks "select cannot block" or "->poll() musn't block" is
just confused. It has *never* been about that. It waits asynchronously
for IO, but it may well wait synchronously for locks or memory or just
"lazy implementation".

And none of this has anything to do with the ->poll() interface
itself. If you want to improve performance on some _particular_ file
and actually worry about locking etc, you fix that particular
implementation. You don't go and change the interface for everybody.

The fact is, those interface changes were just broken shit. They were
confused. I don't actually believe that AIO even needed them.

Christoph, do you have a test program for IOCB_CMD_POLL and what it's
actually supposed to do?

Because I think that what it can do is simply to do the ->poll() calls
outside the iocb locks, and then just attach the poll table to the
kioctx afterwards.

This whole "poll must not block" is a complete red herring. It doesn't
come from any other requirements than BAD AIO GARBAGE CODE.

Seriously. Stop thinking this has to happen inside some spinlocked
region. That's AIO braindamage, it's irrelevant.

 Linus


Re: [PATCH 6/6] fs: replace f_ops->get_poll_head with a static ->f_poll_head pointer

2018-06-28 Thread Linus Torvalds
On Thu, Jun 28, 2018 at 11:17 AM Al Viro  wrote:
>
> As for what can be salvaged out of the whole mess,
> * probably the most serious lesson is that INDIRECT CALLS ARE
> COSTLY NOW and shouldn't be used lightly.

Note that this has always been true, it's just _way_ more obvious now.

Indirect calls are costly not just because of the nasty 20+ cycle cost
of the stupid spectre overhead (which hopefully will be a thing of the
past in a few years as people upgrade), but because they are a pretty
basic barrier to optimization, not just for compilers but also for
_people_.

Look at a lot of vfs optimization we've done, like all the name
hashing optimization etc. We basically fall flat on our face if a
filesystem implements its own name hash function, not _just_ because
of the cost of the indirect function call, but because it suddenly
means that the filesystem is doing its own thing and all the clever
work we did to integrate name hashing with copying the name no longer
works.

So I really want to avoid indirect calls. And when they *do* happen, I
want to avoid the model where people think of them as low-level object
accessor functions - the C++ disease. I want indirect function calls
to make sense at a higher level, and do some real operation.

End result: I really despised the new poll() model. Yes, the
performance report was what made me *notice*, but then when I looked
at the code I went "No". Using an indirect call as some low-level
accessor function really is fundamentally wrong. Don't do it. It's
badly designed.

Our VFS operations are _high-level_ operations, where we do one single
call for a whole "read()" operation. "->poll()" used to be the same.
The new "->get_poll_head()" and "->poll_mask()" was just bad, bad,
bad.

> * having an optional pointer to wait_queue_head in struct file
> is probably a good idea; a lot of ->poll() instances do have the same
> form.  Even if sockets do not (and I'm not all that happy about the
> solution in the latest series), the instances that do are common and
> important enough.

Right. I don't hate the poll wait-queue pointer. That said, I do hope
that we can simply write things so as to not even need it.

> * a *LOT* of ->poll() instances only block in __pollwait()
> called (indirectly) on the first pass.

They are *all* supposed to do it.

The whole idea with "->poll()" is that the model of operation is:

 -  add_wait_queue() and return state on the first pass

 - on subsequent passes (or if somebody else already returned a state
that means we already know we're not going to wait), the poll table is
NULL, so you *CANNOT* add_wait_queue again, so you just return state.

Additionally, once _anybody_ has returned a "I already have the
event", we also clear the poll table queue, so subsequent accesses
will purely be for returning the poll state.

So I don't understand why you're asking for annotation. The whole "you
add yourself to the poll table" is *fundamentally* only done on the
first pass. You should never do it for later passes.
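
For reference, a minimal sketch of that classic pattern, with a made-up
"foo" device (the helpers and fields are hypothetical):

      static __poll_t foo_poll(struct file *filp, struct poll_table_struct *wait)
      {
              struct foo_dev *dev = filp->private_data;
              __poll_t mask = 0;

              /* First pass: adds us to dev->waitq.  On later passes the
               * table is NULL and poll_wait() is a no-op. */
              poll_wait(filp, &dev->waitq, wait);

              /* Always just report the current state; this is all that
               * happens on subsequent passes. */
              if (foo_data_ready(dev))
                      mask |= EPOLLIN | EPOLLRDNORM;
              if (foo_room_to_write(dev))
                      mask |= EPOLLOUT | EPOLLWRNORM;
              return mask;
      }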

> How much do you intend to revert?  Untangling just the ->poll()-related
> parts from the rest of changes in fs/aio.c will be unpleasant; we might
> end up reverting the whole tail of the series and redoing the things
> that are not poll-related on top of the revert... ;-/

I pushed out my revert. It was fairly straightforward, it just
reverted all the poll_mask/get_poll_head changes, and the aio code
that depended on them.

Btw, I really don't understand the "aio has a race". The old poll()
interface was fundamentally race-free. There simply *is* no way to
race on it, exactly because of the whole "add yourself to the wait
queue first, then ask for state afterwards" model.  The model may be
_odd_, but it has literally worked well for a quarter century exactly
because it's really simple and fundamentally cannot have races.

So I think it's the aio code that needs fixing, not the polling code.

I do want that explanation for why AIO is _so_ special that it can
introduce a race in poll().

Because I suspect it's not so special, and it's just buggy. Maybe
Christoph didn't understand the two-phase model (how you call ->poll()
_twice_ - first to add yourself to the queue, later to check status).
Or maybe AIO interfaces are just shit (again!) and don't work right.

   Linus


Re: [PATCH 6/6] fs: replace f_ops->get_poll_head with a static ->f_poll_head pointer

2018-06-28 Thread Linus Torvalds
On Thu, Jun 28, 2018 at 7:21 AM Christoph Hellwig  wrote:
>
> Note that for this removes the possibility of actually returning an
> error before waiting in poll.  We could still do this with an ERR_PTR
> in f_poll_head with a little bit of WRITE_ONCE/READ_ONCE magic, but
> I'd like to defer that until actually required.

I'm still going to just revert the whole poll mess for now.

It's still completely broken. This helps things, but it doesn't fix
the fundamental issue: the new interface is strictly worse than the
old interface ever was.

So Christoph, if you don't like the traditional ->poll() method, and
you want something else for aio polling, I seriously will suggest that
you introduce a new f_op for *that*. Don't mess with the existing
->poll() function, don't make select() and poll() use a slower and
less capable function just because aio wants something else.

In other words, you need to see AIO as the less important case, not as
the driver for this.

I also want to understand what the AIO race was, and what the problem
with the poll() thing was. You claimed it was racy. I don't see it,
and it was never ever explained in the whole series. I should never
have pulled it in the first place if only for that reason, but I tend
to trust Al when it comes to the VFS layer, so I did. My bad.

So before we try this again, I most definitely want _explanations_.
And I want the whole approach to be very clear that AIO is the ugly
step-sister, not the driving force.

 Linus


Re: [GIT] Networking

2018-05-11 Thread Linus Torvalds
On Fri, May 11, 2018 at 5:10 PM David Miller  wrote:

> I guess this is my reward for trying to break the monotony of
> pull requests :-)

I actually went back and checked a few older pull requests to see if this
had been going on for a while and I just hadn't noticed.

It just took me by surprise :^p

   Linus


Re: [GIT] Networking

2018-05-11 Thread Linus Torvalds
David, is there something you want to tell us?

Drugs are bad, m'kay..

   Linus

On Fri, May 11, 2018 at 2:00 PM David Miller  wrote:

> "from Kevin Easton", "Thanks to Bhadram Varka", "courtesy of Gustavo A.
> R.  Silva", "To Eric Dumazet we are most grateful for this fix", "This
> fix from YU Bo, we do appreciate", "Once again we are blessed by the
> honorable Eric Dumazet with this fix", "This fix is bestowed upon us by
> Andrew Tomt", "another great gift from Eric Dumazet", "to Hangbin Liu we
> give thanks for this", "Paolo Abeni, he gave us this", "thank you Moshe
> Shemesh", "from our good brother David Howells", "Daniel Juergens,
> you're the best!", "Debabrata Benerjee saved us!", "The ship is now
> water tight, thanks to Andrey Ignatov", "from Colin Ian King, man we've
> got holes everywhere!", "Jiri Pirko what would we do without you!


Re: [llc_ui_release] BUG: unable to handle kernel NULL pointer dereference at 0000000000000004

2018-04-28 Thread Linus Torvalds
On Sat, Apr 28, 2018 at 7:12 PM Fengguang Wu  wrote:

> FYI this happens in mainline kernel 4.17.0-rc2.
> It looks like a new regression.

> It occurs in 5 out of 5 boots.

> [main] 375 sockets created based on info from socket cachefile.
> [main] Generating file descriptors
> [main] Added 83 filenames from /dev
> udevd[507]: failed to execute '/sbin/modprobe' '/sbin/modprobe -bv platform:regulatory': No such file or directory
> [  372.057947] caif:caif_disconnect_client(): nothing to disconnect
> [  372.082415] BUG: unable to handle kernel NULL pointer dereference at 0000000000000004

I think this is fixed by commit 3a04ce7130a7 ("llc: fix NULL pointer deref
for SOCK_ZAPPED")

Linus


Re: Boot failures with net-next after rebase to v4.17.0-rc1

2018-04-24 Thread Linus Torvalds
On Tue, Apr 24, 2018 at 12:54 PM, Jesper Dangaard Brouer
 wrote:
> Hi all,
>
> I'm experiencing boot failures with net-next git-tree after it got
> rebased/merged with Linus'es tree at v4.17.0-rc1.

I suspect it's the global bit stuff that came in very late in the
merge window, and had been developed and tested for a while before,
but showed some problems under some configs.

The fix is currently in the x86/pti tree in -tip, see:

   x86/pti: Fix boot problems from Global-bit setting

and I expect it will percolate upstream soon.

In the meantime, it would be good to verify that merging that x86/pti
branch fixes it for you?

There is another candidate for boot problems - do you happen to have
CONFIG_DEFERRED_STRUCT_PAGE_INIT enabled? That can under certain
circumstances get a percpu setup page fault because memory hadn't been
initialized sufficiently.

The fix there is to move the mm_init() call one step earlier in
init/main.c's start_kernel() (to before trap_init()).

And if it's neither of the above, I think you'll need to help bisect it.

   Linus


Re: [Patch net] net_sched: fix a missing idr_remove() in u32_delete_key()

2018-04-07 Thread Linus Torvalds
On Fri, Apr 6, 2018 at 5:19 PM, Cong Wang  wrote:
> When we delete a u32 key via u32_delete_key(), we forget to
> call idr_remove() to remove its handle from IDR.
>
> Fixes: e7614370d6f0 ("net_sched: use idr to allocate u32 filter handles")
> Reported-by: Marcin Kabiesz 

Marcin sent me a tested-by too, so this seems all good.

Thanks,

 Linus


Fwd: Problem with the kernel 4.15 - cutting the band (tc)

2018-04-06 Thread Linus Torvalds
Forwarding a report about what looks like a regression between 4.14 and 4.15.

New ENOSPC issue? I don't even know where to start guessing where to look.

Help me, Davem-Wan Kenobi, you are my only hope.

(But adding netdev just in case somebody else goes "That's obviously Xyz")

  Linus

-- Forwarded message --
From: Marcin Kabiesz 
Date: Thu, Apr 5, 2018 at 10:38 AM
Subject: Problem with the kernel 4.15 - cutting the band (tc)


Hello,
I have a problem with bandwidth cutting on kernel 4.15. On the version
up to 4.15, i.e. 4.14, this problem does not occur.

uname -a: Linux router 4.14.15 #1 SMP x86_64 Intel Xeon E3-1230 v6
command to reproduce:

tc qdisc add dev ifb0 root handle 1: htb r2q 2
tc class add dev ifb0 parent 1: classid 1:1 htb rate 10gbit ceil
10gbit quantum 16000
tc filter add dev ifb0 parent 1: prio 5 handle 1: protocol all u32 divisor 256
tc filter add dev ifb0 protocol all parent 1: prio 5 u32 ht 800::
match ip dst 0.0.0.0/0 hashkey mask 0x00ff at 16 link 1:
tc filter add dev ifb0 parent 1:0 handle ::1 protocol all prio 5 u32
ht 1:2c: match ip dst 192.168.3.44/32 flowid 1:2
tc filter del dev ifb0 parent 1:0 handle 1:2c:1 prio 5 u32
tc filter add dev ifb0 parent 1:0 handle ::1 protocol all prio 5 u32
ht 1:2c: match ip dst 192.168.3.44/32 flowid 1:2
tc filter del dev ifb0 parent 1:0 handle 1:2c:1 prio 5 u32

This ok, no error/warnings and dmesg log.

uname -a: Linux router 4.15.8 #1 SMP x86_64 Intel Xeon E3-1230 v6 (or
4.15.14 this same effect)
command to reproduce:

tc qdisc add dev ifb0 root handle 1: htb r2q 2
tc class add dev ifb0 parent 1: classid 1:1 htb rate 10gbit ceil
10gbit quantum 16000
tc filter add dev ifb0 parent 1: prio 5 handle 1: protocol all u32 divisor 256
tc filter add dev ifb0 protocol all parent 1: prio 5 u32 ht 800::
match ip dst 0.0.0.0/0 hashkey mask 0x00ff at 16 link 1:
tc filter add dev ifb0 parent 1:0 handle ::1 protocol all prio 5 u32
ht 1:2c: match ip dst 192.168.3.44/32 flowid 1:2
tc filter del dev ifb0 parent 1:0 handle 1:2c:1 prio 5 u32
tc filter add dev ifb0 parent 1:0 handle ::1 protocol all prio 5 u32
ht 1:2c: match ip dst 192.168.3.44/32 flowid 1:2
RTNETLINK answers: No space left on device
We have an error talking to the kernel

This not ok, on error/warnings and no dmesg log.

Best Regards
Please forgive my English
Marcin Kabiesz


Re: [GIT PULL] remove in-kernel calls to syscalls

2018-04-02 Thread Linus Torvalds
On Mon, Apr 2, 2018 at 12:04 PM, Dominik Brodowski
 wrote:
>
> This patchset removes all in-kernel calls to syscall functions in the
> kernel with the exception of arch/.

Ok, this finished off my arch updates for today, I'll probably move on
to driver pulls tomorrow.

Anyway, it's in my tree, will push out once my test build finishes.

Linus


Re: RFC on writel and writel_relaxed

2018-03-28 Thread Linus Torvalds
On Tue, Mar 27, 2018 at 6:33 PM, Benjamin Herrenschmidt
 wrote:
>
> This is why, I want (with your agreement) to define clearly and once
> and for all, that the Linux semantics of writel are that it is ordered
> with previous writes to coherent memory (*)

Honestly, I think those are the sane semantics. In fact, make it
"ordered with previous writes" full stop, since it's not only ordered
wrt previous writes to memory, but also previous writel's.
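
For illustration, the driver pattern under discussion then needs no
explicit wmb() between the two stores (the descriptor and register
names here are hypothetical):

      desc->status = DESC_READY;                        /* store to coherent DMA memory */
      writel(RING_DOORBELL, dev->base + DOORBELL_REG);  /* MMIO write, ordered after it */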

> Also, can I assume the above ordering with writel() equally applies to
> readl() or not ?
>
> IE:
> dma_buf->foo = 1;
> readl(STUPID_DEVICE_DMA_KICK_ON_READ);

If that KICK_ON_READ is UC, then that's definitely the case. And
honestly, status registers like that really should always be UC.

But if somebody sets the area WC (which is crazy), then I think it
might be at least debatable. x86 semantics does allow reads to be done
before previous writes (or, put another way, writes to be buffered -
the buffers are ordered so writes don't get re-ordered, but reads can
happen during the buffering).

But UC accesses are always  done entirely ordered, and honestly, any
status register that starts a DMA would not make sense any other way.

Of course, you'd have to be pretty odd to want to start a DMA with a
read anyway - partly exactly because it's bad for performance (reads
will be synchronous and not buffered like a write).

   Linus


Re: RFC on writel and writel_relaxed

2018-03-28 Thread Linus Torvalds
On Tue, Mar 27, 2018 at 5:24 PM, Sinan Kaya  wrote:
>
> Basically changing it to
>
> dma_buffer->foo = 1;                        /* WB */
> wmb()
> writel_relaxed(KICK, DMA_KICK_REGISTER);    /* UC */
> mmiowb()

Why?

Why not  just remove the wmb(), and keep the barrier in the writel()?

The above code makes no sense, and just looks stupid to me. It also
generates pointlessly bad code on x86, so it's bad there too.

   Linus


Re: RFC on writel and writel_relaxed

2018-03-27 Thread Linus Torvalds
On Tue, Mar 27, 2018 at 3:03 PM, Benjamin Herrenschmidt
 wrote:
>
> The discussion at hand is about
>
> dma_buffer->foo = 1;                /* WB */
> writel(KICK, DMA_KICK_REGISTER);    /* UC */

Yes. That certainly is ordered on x86. In fact, afaik it's ordered
even if that writel() might be of type WC, because that only delays
writes, it doesn't move them earlier.

Whether people *do* that or not, I don't know. But I wouldn't be
surprised if they do.

So if it's a DMA buffer, it's "cached". And even cached accesses are
ordered wrt MMIO.

Basically, to get unordered writes on x86, you need to either use
explicitly nontemporal stores, or have a writecombining region with
back-to-back writes that actually combine.

And nobody really does that nontemporal store thing any more because
the hardware that cared pretty much doesn't exist any more. It was too
much pain. People use DMA and maybe an UC store for starting the DMA
(or possibly a WC buffer that gets multiple  stores in ascending order
as a stream of commands).

Things like UC will force everything to be entirely ordered, but even
without UC, loads won't pass loads, and stores won't pass stores.

> Now it appears that this wasn't fully understood back then, and some
> people are now saying that x86 might not even provide that semantic
> always.

Oh, the above UC case is absolutely guaranteed.

And I think even if it's WC, the write to kick off the DMA is ordered
wrt the cached write.

On x86, I think you need barriers only if you do things like

 - do two non-temporal stores and require them to be ordered: put a
sfence or mfence in between them.

 - do two WC stores, and make sure they do not combine: put a sfence
or mfence between them.

 - do a store, and a subsequent from a different address, and neither
of them is UC: put a mfence between them. But note that this is
literally just "load after store". A "store after load" doesn't need
one.

I think that's pretty much it.

For example, the "lfence" instruction is almost entirely pointless on
x86 - it was designed back in the time when people *thought* they
might re-order loads. But loads don't get re-ordered. At least Intel
seems to document that only non-temporal *stores* can get re-ordered
wrt each other.

End result: lfence is a historical oddity that can now be used to
guarantee that a previous  load has finished, and that in turn meant
that it is  now used in the Spectre mitigations. But it basically has
no real memory ordering meaning since nothing passes an earlier load
anyway, it's more of a pipeline thing.

But in the end, one question is just "how much do drivers actually
_rely_ on the x86 strong ordering?"

We so support "smp_wmb()" even though x86 has strong enough ordering
that just a barrier suffices. Somebody might just say "screw the x86
memory ordering, we're relaxed, and we'll fix up the drivers we care
about".

The only issue really is that 99.9% of all testing gets done on x86
unless you look at specific SoC drivers.

On ARM, for example, there is likely little reason to care about x86
memory ordering, because there is almost zero driver overlap between
x86 and ARM.

*Historically*, the reason for following the x86 IO ordering was
simply that a lot of architectures used the drivers that were
developed on x86. The alpha and powerpc workstations were *designed*
with the x86 IO bus (PCI, then PCIe) and to work with the devices that
came with it.

ARM? PCIe is almost irrelevant. For ARM servers, if they ever take
off, sure. But 99.99% of ARM is about their own SoC's, and so "x86
test coverage" is simply not an issue.

How much of an issue is it for Power? Maybe you decide it's not a big deal.

Then all the above is almost irrelevant.

   Linus


Re: RFC on writel and writel_relaxed

2018-03-27 Thread Linus Torvalds
On Tue, Mar 27, 2018 at 11:33 AM, Benjamin Herrenschmidt
 wrote:
>
> Well, we need to clarify that once and for all, because as I wrote
> earlier, it was decreed by Linus more than a decade ago that writel
> would be fully ordered by itself vs. previous memory stores (at least
> on UC memory).

Yes.

So "writel()" needs to be ordered with respect to other writel() uses
on the same thread. Anything else *will* break drivers. Obviously, the
drivers may then do magic to say "do write combining etc", but that
magic will be architecture-specific.

The other issue is that "writel()" needs to be ordered wrt other CPU's
doing "writel()" if those writel's are in a spinlocked region.

So it's not  that "writel()" needs to be ordered wrt the spinlock
itself, but you *do* need to honor ordering if you have something like
this:

   spin_lock();
   writel(a);
   writel(b);
   spin_unlock();

and if two CPU's run the above code "at the same time", then the
*ONLY* acceptable sequence is abab.

You cannot, and must not, ever see "aabb" at the device, for example,
because of how the writel would basically leak out of the spinlock.

That sounds "obvious", but dammit, a lot of architectures got that
wrong, afaik.

Linus


Re: [PATCH v2 bpf-next 5/8] bpf: introduce BPF_RAW_TRACEPOINT

2018-03-23 Thread Linus Torvalds
On Fri, Mar 23, 2018 at 6:43 PM, Alexei Starovoitov  wrote:
>
> it's not the limit I'm hitting, but self referential issue.
> Exactly half gets expanded.
> I don't think there is an easy workaround other
> than duplicating the whole chain of REPEAT macro twice
> with slightly different name.

Take a look at the __MAP() macro in include/linux/syscalls.h.

It basically takes a "transformation" as its argument, and does it 
times, where 'n' is the first argument (but could be self-counting).

Maybe it will give you some ideas.

... and maybe it will just drive you mad and make you gouge out your
eyes with a spoon. Don't blame the messenger.

 Linus


Re: [PATCH v3 bpf-next 01/10] treewide: remove struct-pass-by-value from tracepoints arguments

2018-03-22 Thread Linus Torvalds
On Thu, Mar 22, 2018 at 1:48 PM, Steven Rostedt  wrote:
>
> OK, but instead of changing it to pass by reference, why not just pass
> the value. That is:
>
>  static void xen_set_pte_atomic(pte_t *ptep, pte_t pte)
>  {
> -   trace_xen_mmu_set_pte_atomic(ptep, pte);
> +   trace_xen_mmu_set_pte_atomic(ptep, native_pte_val(pte));
> set_64bit((u64 *)ptep, native_pte_val(pte));
>  }

That looks simple and clean, and makes sense since the function itself
then uses that value anyway.

Certainly simpler than my monster define.

   Linus


Re: [PATCH v3 bpf-next 01/10] treewide: remove struct-pass-by-value from tracepoints arguments

2018-03-22 Thread Linus Torvalds
On Thu, Mar 22, 2018 at 12:31 PM, Alexei Starovoitov  wrote:
>
> yeah. C doesn't allow casting of 'struct s { u64 var };' into u64
> without massive hacks and aliasing warnings by compiler.

Without massive hacks? Yeah, no. But without warnings? You can do it.

  #define UINTTYPE(size) \
        __typeof__(__builtin_choose_expr(size==1,  (u8)1, \
                   __builtin_choose_expr(size==2, (u16)2, \
                   __builtin_choose_expr(size==4, (u32)3, \
                   __builtin_choose_expr(size==8, (u64)4, \
                                         (void)5)))))


  #define CAST_TO_U64(x) ({ \
        typeof(x) __src = (x); \
        UINTTYPE(sizeof(x)) __dst; \
        memcpy(&__dst, &__src, sizeof(__dst)); \
        (u64)__dst; })

Yeah, I'm not proud of the above, but gcc actually seems to do the
right thing for it. Doing

struct d {
unsigned char a,b;
};

it generates

movzwl %di, %eax

for the CAST_TO_U64() (the above *looks* like it only casts to 32-bit,
but it is actually a full cast to 64 bits because movzwl will also
clear the top bits of the register).

No warnings.

But is ot "massively hacky"? You be the judge. Probably.

Going the other way is trivial, you just use that same UINTTYPE()
again and just the memcpy in reverse.
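
A sketch of that reverse direction, with the same caveats (the name
CAST_FROM_U64 is made up here; 'x' only supplies the destination type):

  #define CAST_FROM_U64(x, v) ({ \
        UINTTYPE(sizeof(x)) __src = (v); \
        typeof(x) __dst; \
        memcpy(&__dst, &__src, sizeof(__dst)); \
        __dst; })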

NOTE! The above obviously only works for things that are actually
proper nice easy sizes.

If you want to encode a 6-byte thing in a u64, the sanest thing is
likely to use a union, and just accept crap in some bytes of the u64
(and just expect it to be undone by the reversal operation). Honestly,
I'd suggest against it, because of just horrible code generation
and/or data leakage.

Linus


Re: [RFC PATCH 2/3] x86/io: implement 256-bit IO read and write

2018-03-22 Thread Linus Torvalds
On Thu, Mar 22, 2018 at 3:48 AM, David Laight <david.lai...@aculab.com> wrote:
> From: Linus Torvalds
>>
>> Note that we definitely have seen hardware that *depends* on the
>> regular memcpy_fromio()" not doing big reads. I don't know how
>> hardware people screw it up, but it's clearly possible.

[ That should have been "big writes" ]

> I wonder if that hardware works with the current kernel on recent cpus?
> I bet it doesn't like the byte accesses that get generated either.

The one case we knew about we just fixed to use the special "16-bit
word at a time" scr_memcpyw().

If I recall correctly, it was some "enterprise graphics console".

If there's one thing I've found over the years, it's that "enterprise"
hardware is absolutely the dregs. It's the low-volume stuff that
almost nobody uses and where the customer is used to the whole notion
of boutique hardware, so they're used to having to do special things
for special hardware.

And I very much use the word "special" in the "my mom told me I was
special" sense. It's not the *good* kind of "special". It's the "short
bus" kind of special.

> For x86 being able to request a copy done as 'rep movsx' (for each x)
> would be useful.

The portability issues are horrendous. Particularly if the memory area
(source or dest) might be unaligned.

The rule of thumb should be: don't use PIO, and if you *do* use PIO,
don't be picky about what you get.

And most *definitely* don't complain about performance to software
people. Blame the hardware people. Get a competent piece of hardware,
or a competent hardware designer.

Let's face it, PIO is stupid. Use it for command and status words. Not
for data transfer.

 Linus


Re: [RFC PATCH 0/3] kernel: add support for 256-bit IO access

2018-03-22 Thread Linus Torvalds
On Thu, Mar 22, 2018 at 5:48 AM, David Laight  wrote:
>
> So if we needed to do PIO reads using the AVX2 (or better AVX-512)
> registers would make a significant difference.
> Fortunately we can 'dma' most of the data we need to transfer.

I think this is the really fundamental issue.

A device that expects PIO to do some kind of high-performance
transaction is a *broken* device.

It really is that simple. We don't bend over for misdesigned hardware
crap unless it is really common.

> I've traced writes before, they are a lot faster and are limited
> by things in the fpga fabric (they appear back to back).

The write combine buffer really should be much more effective than any
AVX or similar can ever be.

   Linus


Re: [PATCH v5 0/2] Remove false-positive VLAs when using max()

2018-03-22 Thread Linus Torvalds
On Thu, Mar 22, 2018 at 8:01 AM, Kees Cook  wrote:
>
> Seems like it doesn't like void * arguments:

Yeah, that was discussed separately, I just didn't realize we had any
such users.

As David said, just adding a (long) cast to it should be fine, ie

  #define __is_constant(a) \
 (sizeof(int) == sizeof(*(1 ? ((void*)((long)(a) * 0l)) : (int*)1)))

and is probably a good idea even outside of pointers (because "long
long" constants could cause warnings too due to the cast to (void *)).

Linus


Re: [RFC PATCH 2/3] x86/io: implement 256-bit IO read and write

2018-03-21 Thread Linus Torvalds
On Tue, Mar 20, 2018 at 7:42 AM, Alexander Duyck
 wrote:
>
> Instead of framing this as an enhanced version of the read/write ops
> why not look at replacing or extending something like the
> memcpy_fromio or memcpy_toio operations?

Yes, doing something like "memcpy_fromio_avx()" is much more
palatable, in that it works like the crypto functions do - if you do
big chunks, the "kernel_fpu_begin/end()" isn't nearly the issue it can
be otherwise.

Note that we definitely have seen hardware that *depends* on the
regular memcpy_fromio()" not doing big reads. I don't know how
hardware people screw it up, but it's clearly possible.

So it really needs to be an explicitly named function that basically a
driver can use to say "my hardware really likes big aligned accesses"
and explicitly ask for some AVX version if possible.

Linus


Re: [PATCH v2 bpf-next 4/8] tracepoint: compute num_args at build time

2018-03-21 Thread Linus Torvalds
On Wed, Mar 21, 2018 at 11:54 AM, Alexei Starovoitov  wrote:
>
> add fancy macro to compute number of arguments passed into tracepoint
> at compile time and store it as part of 'struct tracepoint'.

We should probably do this __COUNT() thing in some generic header, we
just talked last week about another use case entirely.

And wouldn't it be nice to just have some generic infrastructure like this:

/*
 * This counts to ten.
 *
 * Any more than that, and we'd need to take off our shoes
 */
#define __GET_COUNT(_0,_1,_2,_3,_4,_5,_6,_7,_8,_9,_10,_n,...) _n
#define __COUNT(...) \
__GET_COUNT(__VA_ARGS__,10,9,8,7,6,5,4,3,2,1,0)
#define COUNT(...) __COUNT(dummy,##__VA_ARGS__)

#define __CONCAT(a,b) a##b
#define __CONCATENATE(a,b) __CONCAT(a,b)

and then you can do things like:

#define fn(...) __CONCATENATE(fn,COUNT(__VA_ARGS__))(__VA_ARGS__)

which turns "fn(x,y,z..)" into "fn(x,y,z)".

That can be useful for things like "max(a,b,c,d)" expanding to
"max4()", and then you can just have the trivial

  #define max3(a,b,c) max2(a,max2(b,c))

etc (with proper parentheses, of course).
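
Putting those pieces together, a toy sketch (arguments get evaluated
more than once here; the kernel's real max() also does type checking):

  #define max2(a, b)       ((a) > (b) ? (a) : (b))
  #define max3(a, b, c)    max2(a, max2(b, c))
  #define max4(a, b, c, d) max2(max2(a, b), max2(c, d))
  #define max(...)         __CONCATENATE(max, COUNT(__VA_ARGS__))(__VA_ARGS__)

  /* max(x, y, z) -> max3(x, y, z) -> max2(x, max2(y, z)) */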

And I'd rather not have that function name concatenation be part of
the counting logic, because we actually may have different ways of
counting, so the concatenation is separate.

In particular, the other situation this came up for, the counting was
in arguments _pairs_, so you'd use a "COUNT_PAIR()" instead of
"COUNT()".

NOTE NOTE NOTE! The above was slightly tested and then cut-and-pasted.
I might have screwed up at any point. Think of it as pseudo-code.

 Linus


Re: [RFC PATCH 0/3] kernel: add support for 256-bit IO access

2018-03-21 Thread Linus Torvalds
On Wed, Mar 21, 2018 at 12:46 AM, Ingo Molnar  wrote:
>
> So I added a bit of instrumentation and the current state of things is that on
> 64-bit x86 every single task has an initialized FPU, every task has the exact
> same, fully filled in xfeatures (XINUSE) value:

Bah. Your CPU is apparently some old crud that barely has AVX. Of
course all those features are enabled.

> Note that this is with an AVX (128-bit) supporting CPU:

That's weak even by modern standards.

I have MPX bounds and MPX CSR on my old Skylake.

And the real worry is things like AVX-512 etc, which is exactly when
things like "save and restore one ymm register" will quite likely
clear the upper bits of the zmm register.

And yes, we can have some statically patched code that takes that into
account, and saves the whole zmm register when AVX512 is on, but the
whole *point* of the dynamic XSAVES thing is actually that Intel wants
to be able to enable new user-space features without having to wait for
OS support. Literally. That's why and how it was designed.

And saving a couple of zmm registers is actually pretty hard. They're
big. Do you want to allocate 128 bytes of stack space, preferably
64-byte aligned, for a save area? No. So now it needs to be some kind
of per-thread (or maybe per-CPU, if we're willing to continue to not
preempt) special save area too.

And even then, it doesn't solve the real worry of "maybe there will be
odd interactions with future extensions that we don't even know of".

All this to do a 32-byte PIO access, with absolutely zero data right
now on what the win is?

Yes, yes, I can find an Intel white-paper that talks about setting WC
and then using xmm and ymm instructions to write a single 64-byte
burst over PCIe, and I assume that is where the code and idea came
from. But I don't actually see any reason why a burst of 8 regular
quad-word writes wouldn't cause a 64-byte burst write too.

So right now this is for _one_ odd rdma controller, with absolutely
_zero_ performance numbers, and a very high likelihood that it does
not matter in the least.

And if there are some atomicity concerns ("we need to guarantee a
single atomic access for race conditions with the hardware"), they are
probably bogus and misplaced anyway, since

 (a) we can't guarantee that AVX2 exists in the first place

 (b) we can't guarantee that %ymm register write will show up on any
bus as a single write transaction anyway

So as far as I can tell, there are basically *zero* upsides, and a lot
of potential downsides.

Linus


Re: [PATCH] kbuild: disable clang's default use of -fmerge-all-constants

2018-03-20 Thread Linus Torvalds
On Tue, Mar 20, 2018 at 5:18 PM, Daniel Borkmann  wrote:
> Prasad reported that he has seen crashes in BPF subsystem with netd
> on Android with arm64 in the form of (note, the taint is unrelated):

Ack. This looks good to me. And thanks for noticing the behavior wrt
the correct gcc merging.

>  [ Hi Linus, feel free to take this fix directly if you want.
>Alternatively, we could route it via bpf tree. Thanks a
>lot for your feedback! ]

So since it's your patch and the only known issue comes from the bpf
side, I think it should just go through the bpf tree, and I expect it
to get to me through all the usual channels.

Thanks,

Linus


Re: [PATCH v5 0/2] Remove false-positive VLAs when using max()

2018-03-20 Thread Linus Torvalds
On Tue, Mar 20, 2018 at 4:23 PM, Linus Torvalds
<torva...@linux-foundation.org> wrote:
>
> Hmm. So thanks to the diseased mind of Martin Uecker, there's a better
> test for "__is_constant()":
>
>   /* Glory to Martin Uecker <martin.uec...@med.uni-goettingen.de> */
>   #define __is_constant(a) \
> (sizeof(int) == sizeof(*(1 ? ((void*)((a) * 0l)) : (int*)1)))
>
> that is actually *specified* by the C standard to work, and doesn't
> even depend on any gcc extensions.

Well, it does depend on 'sizeof(*(void *)X)' being 1 and the compiler
not complaining about it, and that sizeof(int) is not 1.

But since we depend on those things in the kernel anyway, that's fine.

Linus


Re: [PATCH v5 0/2] Remove false-positive VLAs when using max()

2018-03-20 Thread Linus Torvalds
On Sat, Mar 17, 2018 at 1:07 PM, Kees Cook  wrote:
>
> No luck! :( gcc 4.4 refuses to play along. And, hilariously, not only
> does it not change the complaint about __builtin_choose_expr(), it
> also thinks that's a VLA now.

Hmm. So thanks to the diseased mind of Martin Uecker, there's a better
test for "__is_constant()":

  /* Glory to Martin Uecker  */
  #define __is_constant(a) \
(sizeof(int) == sizeof(*(1 ? ((void*)((a) * 0l)) : (int*)1)))

that is actually *specified* by the C standard to work, and doesn't
even depend on any gcc extensions.

The reason is some really subtle pointer conversion rules, where the
type of the ternary operator will depend on whether one of the
pointers is NULL or not.

And the definition of NULL, in turn, very much depends on "integer
constant expression that has the value 0".
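
Spelled out as a compile-time check (illustration only; it leans on
sizeof(void) being 1 as a GNU extension):

  /* (a) constant: (void *)((a) * 0l) is a null pointer constant, the
   * conditional takes the int * type, and sizeof(*...) == sizeof(int).
   * (a) not constant: the result type is void *, and sizeof(*...) == 1. */
  extern int runtime_value;
  static char is_const[ __is_constant(42)             ? 1 : -1];  /* builds */
  static char not_const[!__is_constant(runtime_value) ? 1 : -1];  /* builds */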

Are you willing to do one final try on a generic min/max? Same as my
last patch, but using the above __is_constant() test instead of
__builtin_constant_p?

   Linus


Re: [RFC PATCH 0/3] kernel: add support for 256-bit IO access

2018-03-20 Thread Linus Torvalds
On Tue, Mar 20, 2018 at 1:26 AM, Ingo Molnar  wrote:
>
> So assuming the target driver will only load on modern FPUs I *think* it 
> should
> actually be possible to do something like (pseudocode):
>
> vmovdqa %ymm0, 40(%rsp)
> vmovdqa %ymm1, 80(%rsp)
>
> ...
> # use ymm0 and ymm1
> ...
>
> vmovdqa 80(%rsp), %ymm1
> vmovdqa 40(%rsp), %ymm0
>
> ... without using the heavy XSAVE/XRSTOR instructions.

No. The above is buggy. It may *work*, but it won't work in the long run.

Pretty much every single vector extension has traditionally meant that
touching "old" registers breaks any new register use. Even if you save
the registers carefully like in your example code, it will do magic
and random things to the "future extended" version.

So I absolutely *refuse* to have anything to do with the vector unit.
You can only touch it in the kernel if you own it entirely (ie that
"kernel_fpu_begin()/_end()" thing). Anything else is absolutely
guaranteed to cause problems down the line.
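
That sanctioned pattern is, roughly (a sketch; it assumes a context
where kernel_fpu_begin() is allowed at all):

      kernel_fpu_begin();     /* own the whole FPU/vector state from here on */
      /* ... do all the SIMD work in one big batch ... */
      kernel_fpu_end();       /* hand it back; no vector use outside this window */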

And even if you ignore that "maintenance problems down the line" issue
("we can fix them when they happen") I don't want to see games like
this, because I'm pretty sure it breaks the optimized xsave by tagging
the state as being dirty.

So no. Don't use vector stuff in the kernel. It's not worth the pain.

The *only* valid use is pretty much crypto, and even there it has had
issues. Benchmarks use big arrays and/or dense working sets etc to
"prove" how good the vector version is, and then you end up in
situations where it's used once per fairly small packet for an
interrupt, and it's actually much worse than doing it by hand.

   Linus


Re: [PATCH bpf] bpf: fix crash due to inode i_op mismatch with clang/llvm

2018-03-19 Thread Linus Torvalds
On Mon, Mar 19, 2018 at 6:50 PM, Linus Torvalds
<torva...@linux-foundation.org> wrote:
>
> Add it to everything. If it's an invalid optimization, it shouldn't be on.

IOW, why isn't this just something like

  diff --git a/Makefile b/Makefile
  index d65e2e229017..01abedc2e79f 100644
  --- a/Makefile
  +++ b/Makefile
  @@ -826,6 +826,9 @@ KBUILD_CFLAGS += $(call cc-disable-warning, pointer-sign)
   # disable invalid "can't wrap" optimizations for signed / pointers
   KBUILD_CFLAGS+= $(call cc-option,-fno-strict-overflow)

  +# disable invalid optimization on clang
  +KBUILD_CFLAGS   += $(call cc-option,-fno-merge-all-constants)
  +
   # Make sure -fstack-check isn't enabled (like gentoo apparently did)
   KBUILD_CFLAGS  += $(call cc-option,-fno-stack-check,)

(whitespace-damaged, but you get the gist of it).

We disable some optimizations that are technically _valid_, because
they are too dangerous and a bad idea.

Disabling an optimization that isn't valid EVEN IN THEORY is an
absolute no-brainer, particularly if it has already shown itself to
cause problems.

We have other situations where we generate multiple static structures
and expect them to be unique. I'm not sure any of them would trigger
the clang rules, but the clang rules are obviously complete garbage
anyway, so who knows?

That optimization seems to truly be pure and utter garbage. Clang can
even *see* the address comparison happening in that file.
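
A standalone illustration of that kind of breakage (not the actual bpf
inode.c code):

      /* C requires distinct objects to have distinct addresses, so this
       * must return true.  Merging identical constants can place both
       * arrays at the same address and silently make it false. */
      static const char a[] = "same contents";
      static const char b[] = "same contents";

      static bool distinct(void) { return a != b; }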

Some clang person needs to be publicly shamed for enabling this kind
of garbage by default, particularly since they apparently _knew_ it
was invalid.

  Linus


Re: [PATCH bpf] bpf: fix crash due to inode i_op mismatch with clang/llvm

2018-03-19 Thread Linus Torvalds
On Mon, Mar 19, 2018 at 6:17 PM, Daniel Borkmann  wrote:
>
> Reason for this miscompilation is that clang has the more aggressive
> -fmerge-all-constants enabled by default. In fact, clang source code
> has an 'insightful' comment about it in its source code under
> lib/AST/ExprConstant.cpp:
>
>   // Pointers with different bases cannot represent the same object.
>   // (Note that clang defaults to -fmerge-all-constants, which can
>   // lead to inconsistent results for comparisons involving the address
>   // of a constant; this generally doesn't matter in practice.)
>
> gcc on the other hand does not enable -fmerge-all-constants by default
> and *explicitly* states in it's option description that using this
> flag results in non-conforming behavior, quote from man gcc:
>
>   Languages like C or C++ require each variable, including multiple
>   instances of the same variable in recursive calls, to have distinct
>   locations, so using this option results in non-conforming behavior.
>
> Given there are users with clang/LLVM out there today that triggered
> this, fix this mess by explicitly adding -fno-merge-all-constants to
> inode.o as CFLAGS via Kbuild system.

Oh, please do *NOT* add it to just that one file.

Add it to everything. If it's an invalid optimization, it shouldn't be on.

And even if it happens to trigger in only that one file, then
disabling it globally is just the safe thing to do.

What is the code generation difference if you just enable it globally?
I would certainly _hope_ that it's not noticeable, but if it's
noticeable that would certainly imply that it's very dangerous
somewhere else too!

Linus


Re: [PATCH v5 0/2] Remove false-positive VLAs when using max()

2018-03-19 Thread Linus Torvalds
On Mon, Mar 19, 2018 at 2:43 AM, David Laight  wrote:
>
> Is it necessary to have the full checks for old versions of gcc?
>
> Even -Wvla could be predicated on very recent gcc - since we aren't
> worried about whether gcc decides to generate a vla, but whether
> the source requests one.

You are correct. We could just ignore the issue with old gcc versions,
and disable -Wvla rather than worry about it.

  Linus


Re: [RFC PATCH 0/3] kernel: add support for 256-bit IO access

2018-03-19 Thread Linus Torvalds
On Mon, Mar 19, 2018 at 8:53 AM, David Laight  wrote:
>
> The x87 and SSE registers can't be changed - they can contain callee-saved
> registers.
> But (IIRC) the AVX and AVX2 registers are all caller-saved.

No.

The kernel entry is not the usual function call.

On kernel entry, *all* registers are callee-saved.

Of course, some may be return values, and I have a slight hope that I
can trash %eflags. But basically, a system call is simply not a
function call. We have different calling conventions on the argument
side too. In fact, the arguments are in different registers depending
on just *which* system call you take.

  Linus


Re: [PATCH v5 0/2] Remove false-positive VLAs when using max()

2018-03-18 Thread Linus Torvalds
On Sun, Mar 18, 2018 at 3:59 PM, Rasmus Villemoes
 wrote:
>
> OK, I missed where this was made about side effects of x and y

We never made it explicit, since all we really cared about in the end
is the constantness.

But yes:

> but I suppose the idea was to use
>
>   no_side_effects(x) && no_side_effects(y) ? trivial_max(x, y) :
> old_max(x, y)

Exactly. Except with __builtin_choose_expr(), because we need the end
result to be seen as a integer constant expression, so that we can
then use it as an array size. So it needs that early parse-time
resolution.

>> We only really use __builtin_constant_p() as a (bad) approximation of
>> that in this case, since it's the best we can do.
>
> I don't think you should parenthesize bad, rather capitalize it. ({x++;
> 0;}) is constant in the eyes of __builtin_constant_p, but not
> side-effect free.

Hmm. Yeah, but probably don't much care for the kernel.

For example, we do things like this:

   #define __swab64(x) \
        (__builtin_constant_p((__u64)(x)) ? \
        ___constant_swab64(x) : \
        __fswab64(x))

where that "___constant_swab64()" very much uses the same argument
over and over.

And we do that for related reasons - we really want to do the constant
folding at build time for some cases, and this was the only sane way
to do it. Eg code like

    return (addr & htonl(0xff000000)) == htonl(0x7f000000);

wants to do the right thing, and long ago gcc didn't have builtins for
byte swapping, so we had to just do nasty horribly macros that DTRT
for constants.

But since the kernel is standalone, we don't need to really worry
about the *generic* case, we just need to worry about our own macros,
and if somebody does that example you show I guess we'll just have to
shun them ;)

Of course, our own macros are often macros from hell, exactly because
they often contain a lot of type-checking and/or type-(size)-based
polymorphism. But then we actually *want* gcc to simplify things, and
avoid side effects, just have potentially very complex expressions.

But we basically never have that kind of intentionally evil macros, so
we are willing to live with a bad substitute.

But yes, it would be better to have some more control over things, and
actually have a way to say "if X is a constant integer expression, do
transformation Y, otherwise call function y()".

Actually sparse started out with the idea in the background that it
could become not just a checker, but a "transformation tool". Partly
because of long gone historical issues (ie gcc people used to be very
anti-plugin due to licensing issues).

Of course, a more integrated and powerful preprocessor language is
almost always what we *really* wanted, but traditionally "powerful
preprocessor" has always meant "completely incomprehensible and badly
integrated preprocessor".

"cpp" may be a horrid language, but it's there and it's fast (when
integrated with the front-end, like everybody does now)

But sadly, cpp is really bad for the above kind of "if argument is
constant" kind of tricks. I suspect we'd use it a lot otherwise.

>> So the real use would be to say "can we use the simple direct macro
>> that just happens to also fold into a nice integer constant
>> expression" for __builtin_choose_expr().
>>
>> I tried to find something like that, but it really doesn't exist, even
>> though I would actually have expected it to be a somewhat common
>> concern for macro writers: write a macro that works in any arbitrary
>> environment.
>
> Yeah, I think the closest is a five year old suggestion
> (https://gcc.gnu.org/bugzilla/show_bug.cgi?id=57612) to add a
> __builtin_assert_no_side_effects, that could be used in macros that for
> some reason cannot be implemented without evaluating some argument
> multiple times. It would be more useful to have the predicate version,
> which one could always turn into a build bug version. But we have
> neither, unfortunately.

Yeah, and since we're in the situation that *new* gcc versions work
for us anyway, and we only have issues with older gcc's (that sadly
people still use), even if there was a new cool feature we couldn't
use it.

Oh well. Thanks for the background.

Linus


Re: [PATCH v5 0/2] Remove false-positive VLAs when using max()

2018-03-18 Thread Linus Torvalds
On Sun, Mar 18, 2018 at 2:13 PM, Rasmus Villemoes
<li...@rasmusvillemoes.dk> wrote:
> On 2018-03-17 19:52, Linus Torvalds wrote:
>>
>> Ok, so it really looks like that same "__builtin_constant_p() doesn't
>> return a constant".
>>
>> Which is really odd, but there you have it.
>
> Not really. We do rely on builtin_constant_p not being folded too
> quickly to a 0/1 answer, so that gcc still generates good code even if
> the argument is only known to be constant at a late(r) optimization
> stage (through inlining and all).

Hmm. That makes sense. It just doesn't work for our case when we
really want to choose the expression based on side effects or not.

> So unlike types_compatible_p, which
> can obviously be answered early during parsing, builtin_constant_p is
> most of the time a yes/no/maybe/it's complicated thing.

The silly thing is, the thing we *really* want to know _is_ actually
easily accessible during the early parsing, exactly like
__builtin_types_compatible_p(): it's not really that we care about the
expressions being constant, as much as the "can this have side
effects" question.

We only really use __builtin_constant_p() as a (bad) approximation of
that in this case, since it's the best we can do.

So the real use would be to say "can we use the simple direct macro
that just happens to also fold into a nice integer constant
expression" for __builtin_choose_expr().

I tried to find something like that, but it really doesn't exist, even
though I would actually have expected it to be a somewhat common
concern for macro writers: write a macro that works in any arbitrary
environment.

I guess array sizes are really the only true limiting environment
(static initializers is another one).

How annoying. Odd that newer gcc's seem to do so much better (ie gcc-5
seems to be fine). So _something_ must have changed.

 Linus


Re: [PATCH v5 0/2] Remove false-positive VLAs when using max()

2018-03-17 Thread Linus Torvalds
On Sat, Mar 17, 2018 at 12:27 AM, Kees Cook  wrote:
>
> Unfortunately my 4.4 test fails quickly:
>
> ./include/linux/jiffies.h: In function ‘jiffies_delta_to_clock_t’:
> ./include/linux/jiffies.h:444: error: first argument to
> ‘__builtin_choose_expr’ not a constant

Ok, so it really looks like that same "__builtin_constant_p() doesn't
return a constant".

Which is really odd, but there you have it.

I wonder if you can use that "sizeof()" to force evaluation of it,
because sizeof() really does end up being magical when it comes to
"integer constant expression".

So instead of this:

   #define __no_side_effects(a,b) \
  (__builtin_constant_p(a)&&__builtin_constant_p(b))

that just assumes that __builtin_constant_p() itself always counts as
a constant expression, what happens if you do

  #define __is_constant(a) \
(sizeof(char[__builtin_constant_p(a)]))

  #define __no_side_effects(a,b) \
(__is_constant(a) && __is_constant(b))

I realize that the above looks completely insane: the whole point is
to *not* have VLA's, and we know that __builtin_constant_p() isn't
always evaluated as a constant.

But hear me out: if the issue is that there's some evaluation ordering
between the two builtins, and the problem is that the
__builtin_choose_expr() part of the expression is expanded *before*
the __builtin_constant_p() has been expanded, then just hiding it
inside that bat-shit-crazy sizeof() will force that to be evaluated
first (because a sizeof() is defined to be an integer constant
expression).

So the above is completely insane, but there is actually a chance that
using that completely crazy "x -> sizeof(char[x])" conversion actually
helps, because it really does have a (very odd) evaluation-time
change.  sizeof() has to be evaluated as part of the constant
expression evaluation, in ways that "__builtin_constant_p()" isn't
specified to be done.
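
Spelled out as a stand-alone sketch of that gamble (illustration only):

  /* if gcc folds __builtin_constant_p() early enough, char[1] is an
   * ordinary fixed-size type and the sizeof() is an integer constant
   * expression; if it doesn't, the array type is itself variably
   * modified and nothing is gained */
  #define __is_constant(a) (sizeof(char[__builtin_constant_p(a)]))

  char probe[__is_constant(3) * 4];   /* wants to end up as "char probe[4]" */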

But it is also definitely me grasping at straws. If that doesn't work
for 4.4, there's nothing else I can possibly see.

Linus


Re: [PATCH v5 0/2] Remove false-positive VLAs when using max()

2018-03-16 Thread Linus Torvalds
On Fri, Mar 16, 2018 at 10:44 AM, David Laight  wrote:
>
> I looked at the generated code for one of the constant sized VLA that
> the compiler barfed at.
> It seemed to subtract constants from %sp separately for the VLA.
> So it looks like the compiler treats them as VLA even though it
> knows the size.
> That is probably missing optimisation.

Looking at the code is definitely an option.

In fact, instead of depending on -Wvla, we could just make 'objtool'
warn about real variable-sized stack frames.

That said, if that "sizeof()" trick of Al's actually works with older
gcc versions too (it *should*, but it's not like
__builtin_choose_expr() and __builtin_constant_p() have well-defined
rules in the standard), that may just be the solution.

And if gcc ends up generating bad code for those "constant sized vlas"
anyway, then -Wvla would effectively warn about that code generation
problem.

  Linus


Re: [PATCH v5 0/2] Remove false-positive VLAs when using max()

2018-03-16 Thread Linus Torvalds
On Fri, Mar 16, 2018 at 1:14 PM, Linus Torvalds
<torva...@linux-foundation.org> wrote:
>
> It does not work with gcc-4.1.x, but works with gcc-4.4.x.
>
> I can't seem to see the errors any way, I wonder if
> __builtin_choose_expr() simply didn't exist back then.

No, that goes further back.

It seems to be -Wvla itself that doesn't exist in 4.1, so the test
build failed simply because I used that flag ;)

   Linus


Re: [PATCH v5 0/2] Remove false-positive VLAs when using max()

2018-03-16 Thread Linus Torvalds
On Fri, Mar 16, 2018 at 1:12 PM, Al Viro  wrote:
>
> That's C99, straight from N1256.pdf (C99-TC3)...

I checked C90, since the error is

   ISO C90 forbids variable length array

and I didn't see anything there.

Admittedly I only found a draft copy.

Linus


Re: [PATCH v5 0/2] Remove false-positive VLAs when using max()

2018-03-16 Thread Linus Torvalds
On Fri, Mar 16, 2018 at 1:03 PM, Miguel Ojeda
 wrote:
>>
>> Kees - is there some online "gcc-4.4 checker" somewhere? This does
>> seem to work with my gcc. I actually tested some of those files you
>> pointed at now.
>
> I use this one:
>
>   https://godbolt.org/

Well, my *test* code works on that one and -Wvla -Werror.

It does not work with gcc-4.1.x, but works with gcc-4.4.x.

I can't seem to see the errors any way, I wonder if
__builtin_choose_expr() simply didn't exist back then.

Odd that you can't view warnings/errors with it.

But it's possible that it fails on more complex stuff in the kernel.

I've done a "allmodconfig" build with that patch, and the only issue
it found was that (real) type issue in tpm_tis_core.h.

Linus


Re: [PATCH -next 00/22] remove in-kernel syscall invocations (part 2 == netdev)

2018-03-16 Thread Linus Torvalds
On Fri, Mar 16, 2018 at 11:30 AM, David Miller  wrote:
>
> I imagine one of the things you'd like to do is declare that syscall
> entries use a different (better) argument passing scheme.  For
> example, passing values in registers instead of on the stack.

Actually, it's almost exactly the reverse.

On x86-64, we'd like to just pass the 'struct pt_regs *' pointer, and
have the sys_xyz() function itself just pick out the arguments it
needs from there.

That has a few reasons for it:

 - we can clear all registers at system call entry, which helps defeat
some of the "pass seldom used register with user-controlled value that
survives deep into the callchain" things that people used to leak
information

 - we can streamline the low-level system call code, which needs to
pass around 'struct pt_regs *' anyway, and the system call only picks
up the values it actually needs

 - it's really quite easy(*) to just make the SYSCALL_DEFINEx() macros
just do it all with a wrapper inline function

but it fundamentally means that you *cannot* call 'sys_xyz()' from
within the kernel, unless you then do it with something crazy like

struct pt_regs myregs;
... fill in the right registers for this architecture _if_ this
architecture uses ptregs ..
sys_xyz(&myregs);

which I somehow really doubt you want to do in the networking code.
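
For concreteness, the ptregs calling convention makes the entry point look
roughly like this (a sketch of the shape only, not the exact
SYSCALL_DEFINEx() expansion):

   /* the visible entry point takes only the register frame; the real
    * body is a normal typed inline helper, so in-kernel code has no
    * sane way to call the register-frame version by accident */
   static inline long __do_sys_xyz(unsigned int fd, unsigned long arg)
   {
        /* ... actual system call body ... */
        return 0;
   }

   long __x64_sys_xyz(const struct pt_regs *regs)
   {
        return __do_sys_xyz((unsigned int)regs->di, regs->si);
   }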

Now, I did do one version that just created two entrypoints for every
single system call - the "kernel version" and the "real" system call
version. That sucks, because you have two choices:

 - either pointlessly generate extra code for the 200+ system calls
that are *not* used by the kernel

 - or let gcc just merge the two, and make code generation suck where
the real system call just loads the registers and jumps to the common
code.

That second option really does suck, because if you let the compiler
just generate the _single_ system call, it will do the "load actual
value from ptregs" much more nicely, and only when it needs it, and
schedules it all into the system call code.

So just making the rule be: "you mustn't call the SYSCALL_DEFINEx()
functions from anything but the system call code" really makes
everything better.

Then you only need to fix up the *handful* of so system calls that
actually have in-kernel callers.

Many of them end up being things that could be improved on further
anyway (ie there's discussion about further cleanup and trying to
avoid using "set_fs()" for arguments etc, because there already exists
helper functions that take the kernel-space versions, and the
sys_xyz() version is actually just going through stupid extra work for
a kernel user).

Linus

(*) The "really quite easy" is only true on 64-bit architectures.
32-bit architectures have issues with packing 64-bit values into two
registers, so using macro expansion with just the number of arguments
doesn't work.


Re: [PATCH v5 0/2] Remove false-positive VLAs when using max()

2018-03-16 Thread Linus Torvalds
On Fri, Mar 16, 2018 at 10:55 AM, Al Viro  wrote:
>
> That's not them, that's C standard regarding ICE.

Yes. The C standard talks about "integer constant expression". I know.
It's come up in this very thread before.

The C standard at no point talks about - or forbids - "variable length
arrays". That never comes up in the whole standard, I checked.

So we are right now hindered by a _syntactic_ check, without any way
to have a _semantic_ check.

That's my problem. The warnings are misleading and imply semantics.

And apparently there is no way to actually check semantics.

> 1,100 is *not* a constant expression as far as the standard is concerned,

I very much know.

But it sure isn't "variable" either as far as the standard is
concerned, because the standard doesn't even have that concept (it
uses "variable" for argument numbers and for variables).

So being pedantic doesn't really change anything.

> Would you argue that in
> void foo(char c)
> {
> int a[(c<<1) + 10 - c + 2 - c];

Yeah, I don't think that even counts as a constant value, even if it
can be optimized to one. I would not at all be unhappy to see
__builtin_constant_p() to return zero.

But that is very much different from the syntax issue.

So I would like to get some way to get both type-checking and constant
checking without the annoying syntax issue.

> expr, constant_expression is not a constant_expression.  And in
> this particular case the standard is not insane - the only reason
> for using that is typechecking and _that_ can be achieved without
> violating 6.6p6:
> sizeof(expr,0) * 0 + ICE
> *is* an integer constant expression, and it gives you exact same
> typechecking.  So if somebody wants to play odd games, they can
> do that just fine, without complicating the logics for compilers...

Now that actually looks like a good trick. Maybe we can use that
instead of the comma expression that causes problems.

And using sizeof() to make sure that __builtin_choose_expression()
really gets an integer constant expression and that there should be no
ambiguity looks good.

Hmm.

This works for me, and I'm being *very* careful (those casts to
pointer types are inside that sizeof, because it's not an integral
type, and non-integral casts are not valid in an ICE either) but
somebody needs to check gcc-4.4:

  #define __typecheck(a,b) \
(!!(sizeof((typeof(a)*)1==(typeof(b)*)1)))

  #define __no_side_effects(a,b) \
(__builtin_constant_p(a)&&__builtin_constant_p(b))

  #define __safe_cmp(a,b) \
(__typecheck(a,b) && __no_side_effects(a,b))

  #define __cmp(a,b,op) ((a)op(b)?(a):(b))

  #define __cmp_once(a,b,op) ({ \
typeof(a) __a = (a);\
typeof(b) __b = (b);\
__cmp(__a,__b,op); })

  #define __careful_cmp(a,b,op) \
__builtin_choose_expr(__safe_cmp(a,b), __cmp(a,b,op),
__cmp_once(a,b,op))

  #define min(a,b)  __careful_cmp(a,b,<)
  #define max(a,b)  __careful_cmp(a,b,>)
  #define min_t(t,a,b)  __careful_cmp((t)(a),(t)(b),<)
  #define max_t(t,a,b)  __careful_cmp((t)(a),(t)(b),>)

and yes, it does cause new warnings for that

comparison between ‘enum tis_defaults’ and ‘enum tpm2_const’

in drivers/char/tpm/tpm_tis_core.h due to

   #define TIS_TIMEOUT_A_MAX   max(TIS_SHORT_TIMEOUT, TPM2_TIMEOUT_A)

but technically that warning is actually correct, I'm just confused
why gcc cares about the cast placement or something.

That warning is easy to fix by turning it into a "max_t(int, enum1,
enum2)' and that is technically the right thing to do, it's just not
warned about for some odd reason with the current code.

Kees - is there some online "gcc-4.4 checker" somewhere? This does
seem to work with my gcc. I actually tested some of those files you
pointed at now.

  Linus
 include/linux/kernel.h | 77 +-
 1 file changed, 20 insertions(+), 57 deletions(-)

diff --git a/include/linux/kernel.h b/include/linux/kernel.h
index 3fd291503576..23c31bf1d7fb 100644
--- a/include/linux/kernel.h
+++ b/include/linux/kernel.h
@@ -787,37 +787,29 @@ static inline void ftrace_dump(enum ftrace_dump_mode oops_dump_mode) { }
  * strict type-checking.. See the
  * "unnecessary" pointer comparison.
  */
-#define __min(t1, t2, min1, min2, x, y) ({		\
-	t1 min1 = (x);	\
-	t2 min2 = (y);	\
-	(void) (&min1 == &min2);			\
-	min1 < min2 ? min1 : min2; })
+#define __typecheck(a,b) \
+	(!!(sizeof((typeof(a)*)1==(typeof(b)*)1)))
 
-/**
- * min - return minimum of two values of the same or compatible types
- * @x: first value
- * @y: second value
- */
-#define min(x, y)	\
-	__min(typeof(x), typeof(y),			\
-	  __UNIQUE_ID(min1_), __UNIQUE_ID(min2_),	\
-	  x, y)
+#define __no_side_effects(a,b) \
+	(__builtin_constant_p(a)&&__builtin_constant_p(b))
 
-#define __max(t1, t2, max1, max2, x, y) ({		\
-	t1 max1 = (x);	\
-	t2 max2 = (y);	\
-	(void) (&max1 == &max2);			\
-	max1 > max2 ? max1 : max2; })

Re: [PATCH v5 0/2] Remove false-positive VLAs when using max()

2018-03-16 Thread Linus Torvalds
On Fri, Mar 16, 2018 at 4:47 AM, Florian Weimer  wrote:
>
> If you want to catch stack frames which have unbounded size,
> -Werror=stack-usage=1000 or -Werror=vla-larger-than=1000 (with the constant
> adjusted as needed) might be the better approach.

No, we want to catch *variable* stack sizes.

Does "-Werror=vla-larger-than=0" perhaps work for that? No, because
the stupid compiler says that is "meaningless".

And no, using "-Werror=vla-larger-than=1" doesn't work either, because
the moronic compiler continues to think that "vla" is about the
_type_, not the code:

   t.c: In function ‘test’:
   t.c:6:6: error: argument to variable-length array is too large
[-Werror=vla-larger-than=]
 int array[(1,100)];

Gcc people are crazy.

Is there really no way to just say "shut up about the stupid _syntax_
issue that is entirely irrelevant, and give us the _code_ issue".

 Linus


Re: [PATCH v4 1/2] kernel.h: Introduce const_max() for VLA removal

2018-03-15 Thread Linus Torvalds
On Thu, Mar 15, 2018 at 4:46 PM, Linus Torvalds
<torva...@linux-foundation.org> wrote:
>
> Well, the explicit typing allows that mixing, in that you can just
> have "const_max_t(5,sizeof(x))"

I obviously meant "const_max_t(size_t,5,sizeof(x))". Heh.

Linus


Re: [PATCH v4 1/2] kernel.h: Introduce const_max() for VLA removal

2018-03-15 Thread Linus Torvalds
On Thu, Mar 15, 2018 at 4:41 PM, Kees Cook  wrote:
>
> I much prefer explicit typing, but both you and Rasmus mentioned
> wanting the int/sizeof_t mixing.

Well, the explicit typing allows that mixing, in that you can just
have "const_max_t(5,sizeof(x))"

So I'm ok with that.

What I'm *not* so much ok with is "const_max(5,sizeof(x))" erroring
out, or silently causing insane behavior due to hidden subtle type
casts..

Linus


Re: [PATCH v4 1/2] kernel.h: Introduce const_max() for VLA removal

2018-03-15 Thread Linus Torvalds
On Thu, Mar 15, 2018 at 3:46 PM, Kees Cook  wrote:
>
> So, AIUI, I can either get strict type checking, in which case, this
> is rejected (which I assume there is still a desire to have):
>
> int foo[const_max(6, sizeof(whatever))];

Ehh, yes, that looks fairly sane, and erroring out would be annoying.

But maybe we should just make the type explicit, and make it "const_max_t()"?

I think all the existing users are of type "max_t()" anyway due to the
very same issue, no?

At least if there's an explicit type like 'size_t', then passing in
"-1" becoming a large unsigned integer is understandable and clear,
not just some odd silent behavior.

Put another way: I think it's unacceptable that

 const_max(-1,6)

magically becomes a huge positive number like in that patch of yours, but

 const_max_t(size_t, -1, 6)

*obviously* is a huge positive number.

The two things would *do* the same thing, but in the second case the
type is explicit and visible.

> due to __builtin_types_compatible_p() rejecting it, or I can construct
> a "positive arguments only" test, in which the above is accepted, but
> this is rejected:

That sounds acceptable too, although the "const_max_t()" thing is
presumably going to be simpler?

 Linus


Re: [PATCH v4 1/2] kernel.h: Introduce const_max() for VLA removal

2018-03-15 Thread Linus Torvalds
On Thu, Mar 15, 2018 at 3:16 PM, Kees Cook  wrote:
>
> size_t __error_not_const_arg(void) \
> __compiletime_error("const_max() used with non-compile-time constant arg");
> #define const_max(x, y) \
> __builtin_choose_expr(__builtin_constant_p(x) &&\
>   __builtin_constant_p(y),  \
>   (typeof(x))(x) > (typeof(y))(y) ? \
> (x) : (y),  \
>   __error_not_const_arg())
>
> Is typeof() forcing enums to int? Regardless, I'll put this through
> larger testing. How does that look?

Ok, that alleviates my worry about one class of insane behavior, but
it does raise a few other questions:

 - what drugs is gcc on where (typeof(x)(x)) makes a difference? Funky.

 - this does have the usual "what happens if you do

 const_max(-1,sizeof(x))

where the comparison will now be done in 'size_t', and -1 ends up
being a very very big unsigned integer.
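
Just to make that concrete (illustration only):

   /* usual arithmetic conversions: the int -1 is converted to size_t
    * for the comparison, so it compares as SIZE_MAX */
   int r = (-1 > sizeof(int));    /* r == 1, not 0 */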

Is there no way to get that type checking inserted? Maybe now is a
good point for that __builtin_types_compatible(), and add it to the
constness checking (and change the name of that error case function)?

  Linus


Re: [PATCH v4 1/2] kernel.h: Introduce const_max() for VLA removal

2018-03-15 Thread Linus Torvalds
On Thu, Mar 15, 2018 at 12:47 PM, Kees Cook  wrote:
>
> To gain the ability to compare differing types, the arguments are
> explicitly cast to size_t.

Ugh, I really hate this.

It silently does insane things if you do

   const_max(-1,6)

and there is nothing in the name that implies that you can't use
negative constants.

   Linus


Re: [PATCH v3] kernel.h: Skip single-eval logic on literals in min()/max()

2018-03-12 Thread Linus Torvalds
On Mon, Mar 12, 2018 at 3:55 PM, Andrew Morton
 wrote:
>
> Replacing the __builtin_choose_expr() with ?: works of course.

Hmm. That sounds like the right thing to do. We were so myopically
staring at the __builtin_choose_expr() problem that we overlooked the
obvious solution.

Using __builtin_constant_p() together with a ?: is in fact our common
pattern, so that should be fine. The only real reason to use
__builtin_choose_expr() is if you want to get the *type* to vary
depending on which side you choose, but that's not an issue for
min/max.
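
A rough sketch of that ?:-based shape (not the exact kernel macro):

  #define simple_max(x, y)                                       \
        (__builtin_constant_p(x) && __builtin_constant_p(y)      \
                ? ((x) > (y) ? (x) : (y))                        \
                : ({ typeof(x) _x = (x);                         \
                     typeof(y) _y = (y);                         \
                     _x > _y ? _x : _y; }))

The condition folds to 0 or 1 at compile time, so gcc only ever emits one
of the two arms.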

> What will be the runtime effects?

There should be none. Gcc will turn the conditional for the ?: into a
constant, and DTRT.

  Linus


Re: [PATCH v3] kernel.h: Skip single-eval logic on literals in min()/max()

2018-03-11 Thread Linus Torvalds
On Sun, Mar 11, 2018 at 4:05 AM, Ingo Molnar  wrote:
>
> BTW., while I fully agree with everything you said, it's not entirely correct 
> to
> claim that if a C compiler can generate VLA code it is necessarily able to 
> parse
> and evaluate constant array sizes "just fine".
>
> Constant expressions are typically parsed very early on, at the preprocessing
> stage. They can be used with some preprocessor directives as well, such as 
> '#if'
> (with some further limitations on their syntax).

Yes. But constant simplification and CSE etc is just a very
fundamental part of a compiler, and anybody who actually implements
VLA's would have to do it anyway.

So no, a message like

  warning: Array declaration is not a C90 constant expression,
resulting in VLA code generation

would be moronic. Only some completely mindless broken shit would do
"oh, it's not a parse-time constant, so it will be variable". The two
just do not follow AT ALL.

So the message might be about _possibly_ resulting in VLA code
generation, but honestly, at that point you should just add the
warning when you actually generate the code to do the stack
allocation.

Because at that point, you know whether it's variable or not.

And trust me, it won't be variable for things like (2,3), or even for
our "max()" thing with other odd builtins. Not unless the compiler
doesn't really support VLA at all (maybe some bolted-on crazy thing
that just turns a VLA at the front-end time into just an alloca), or
the user has explicitly asked to disable some fundamental optimization
phase.

   Linus


Re: [PATCH v3] kernel.h: Skip single-eval logic on literals in min()/max()

2018-03-10 Thread Linus Torvalds
On Sat, Mar 10, 2018 at 9:34 AM, Miguel Ojeda
 wrote:
>
> So the warning is probably implemented to just trigger whenever VLAs
> are used but the given standard does not allow them, for all
> languages. The problem is why the ISO C90 frontend is not giving an
> error for using invalid syntax for array sizes to begin with?

So in *historical* context - when a compiler didn't do variable length
arrays at all - the original semantics of C "constant expressions"
actually make a ton of sense.

You can basically think of a constant expression as something that can
be (historically) handled by the front-end without any real
complexity, and no optimization phases - just evaluating a simple
parse tree with purely compile-time constants.

So there's a good and perfectly valid reason for why C limits certain
expressions to just a very particular subset. It's not just array
sizes, it's  case statements etc too. And those are very much part of
the C standard.

So an error message like

   warning: ISO C90 requires array sizes to be constant-expressions

would be technically correct and useful from a portability angle. It
tells you when you're doing something non-portable, and should be
automatically enabled with "-ansi -pedantic", for example.

So what's misleading is actually the name of the warning and the
message, not that it happens. The warning isn't about "variable
length", it's literally about the rules for what a
"constant-expression" is.

And in C, the expression (2,3) has a constant _value_ (namely 3), but
it isn't a constant-expression as specified by the language.

Now, the thing is that once you actually do variable length arrays,
those old front-end rules make no sense any more (outside of the "give
portability warnings" thing).

Because once you do variable length arrays, you obviously _parse_
everything just fine, and you're doing to evaluate much more complex
expressions than some limited constant-expression rule.

And at that point it would make a whole lot more sense to add a *new*
warning that basically says "I have to generate code for a
variable-sized array", if you actually talk about VLA's.

But that's clearly not what gcc actually did.

So the problem really is that -Wvla doesn't actually warn about VLA's,
but about something technically completely different.

And that's why those stupid syntactic issues with min/max matter. It's
not whether the end result is a compile-time constant or not, it's
about completely different issues, like whether there is a
comma-expression in it.

  Linus


Re: [PATCH v3] kernel.h: Skip single-eval logic on literals in min()/max()

2018-03-10 Thread Linus Torvalds
On Sat, Mar 10, 2018 at 7:33 AM, Kees Cook  wrote:
>
> Alright, I'm giving up on fixing max(). I'll go back to STACK_MAX() or
> some other name for the simple macro. Bleh.

Oh, and I'm starting to see the real problem.

It's not that our current "min/max()" are broken. It's that "-Wvla" is garbage.

Lookie here:

int array[(1,2)];

results in gcc saying

 warning: ISO C90 forbids variable length array ‘array’ [-Wvla]
   int array[(1,2)];
   ^~~

and that error message - and the name of the flag - is obviously pure garbage.

What is *actually* going on is that ISO C90 requires an array size to
be not a constant value, but a constant *expression*. Those are two
different things.

A constant expression has little to do with "compile-time constant".
It's a more restricted form of it, and has actual syntax requirements.
A comma expression is not a constant expression, for example, which
was why I tested this.

So "-Wvla" is garbage, with a misleading name, and a misleading
warning string. It has nothing to do with "variable length" and
whether the compiler can figure it out at build time, and everything
to do with a _syntax_ rule.

  Linus


Re: [PATCH v3] kernel.h: Skip single-eval logic on literals in min()/max()

2018-03-10 Thread Linus Torvalds
On Sat, Mar 10, 2018 at 7:33 AM, Kees Cook  wrote:
>
> And sparse freaks out too:
>
>drivers/net/ethernet/via/via-velocity.c:97:26: sparse: incorrect
> type in initializer (different address spaces) @@expected void
> *addr @@got struct mac_regs [noderef] 

Re: [PATCH v3] kernel.h: Skip single-eval logic on literals in min()/max()

2018-03-10 Thread Linus Torvalds
On Fri, Mar 9, 2018 at 11:03 PM, Miguel Ojeda
 wrote:
>
> Just compiled 4.9.0 and it seems to work -- so that would be the
> minimum required.
>
> Sigh...
>
> Some enterprise distros are either already shipping gcc >= 5 or will
> probably be shipping it soon (e.g. RHEL 8), so how much does it hurt
> to ask for a newer gcc? Are there many users/companies out there using
> enterprise distributions' gcc to compile and run the very latest
> kernels?

I wouldn't mind upping the compiler requirements, and we have other
reasons to go to 4.6.

But _this_ particular issue doesn't seem worth it to then go even
further. Annoying.

   Linus


Re: [PATCH v3] kernel.h: Skip single-eval logic on literals in min()/max()

2018-03-09 Thread Linus Torvalds
On Fri, Mar 9, 2018 at 5:31 PM, Kees Cook  wrote:
>
> WTF, gmail just blasted HTML into my explicitly plain-text email?! 
> Apologies...

There's more now in your email, I think maybe it's triggered by your
signature file and some gmail web interface bug. Or it just tries to
force quote using html.

There's some html email disease inside google, where some parts of the
company seems to think that html email is _good_.

The official gmail Android app is a big pain, in that it doesn't
*have* a plain-text mode, even though it knows how to format the
plain-text part of the email.

You might want to be on the lookout for some bad drugs at the office.
Because the world would thank you if you took them away from the gmail
app people.

 Linus


Re: [PATCH v3] kernel.h: Skip single-eval logic on literals in min()/max()

2018-03-09 Thread Linus Torvalds
On Fri, Mar 9, 2018 at 4:32 PM, Andrew Morton  wrote:
>
> I wonder which gcc versions actually accept Kees's addition.

Note that we already do have this pattern, as seen by:

  git grep -2  __builtin_choose_expr | grep -2 __builtin_constant_p

which show drivers/pinctrl/intel/pinctrl-intel.h, so the pattern
already exists in current kernels and _works_ - it apparently just
doesn't work in slightly more complicated cases.

It's one reason why I wondered if simplifying the expression to have
just that single __builtin_constant_p() might not end up working..

  Linus


Re: [PATCH v3] kernel.h: Skip single-eval logic on literals in min()/max()

2018-03-09 Thread Linus Torvalds
On Fri, Mar 9, 2018 at 4:07 PM, Andrew Morton  wrote:
>
> A brief poke failed to reveal a workaround - gcc-4.4.4 doesn't appear
> to know that __builtin_constant_p(x) is a constant.  Or something.

LOL.

I suspect it might be that it wants to evaluate
__builtin_choose_expr() at an earlier stage than it evaluates
__builtin_constant_p(), so it's not that it doesn't know that
__builtin_constant_p() is a constant, it just might not know it *yet*.

Maybe.

Side note, if it's not that, but just the "complex" expression that
has the logical 'and' etc, maybe the code could just use

  __builtin_constant_p((x)+(y))

or something.

But yeah:

> Sigh.  Wasn't there some talk about modernizing our toolchain
> requirements?

Maybe it's just time to give up on 4.4.  We wanted 4.5 for "asm goto",
and once we upgrade to 4.5 I think Arnd said that no distro actually
ships it, so we might as well go to 4.6.

So maybe this is just the excuse to finally make that official, if
there is no clever workaround any more.

Linus


Re: [PATCH v3] kernel.h: Skip single-eval logic on literals in min()/max()

2018-03-09 Thread Linus Torvalds
On Fri, Mar 9, 2018 at 12:05 PM, Kees Cook  wrote:
> When max() is used in stack array size calculations from literal values
> (e.g. "char foo[max(sizeof(struct1), sizeof(struct2))]", the compiler
> thinks this is a dynamic calculation due to the single-eval logic, which
> is not needed in the literal case. This change removes several accidental
> stack VLAs from an x86 allmodconfig build:

Ok, looks good.

I just have a couple of questions about applying it.

In particular, if this will help people working on getting rid of
VLA's in the short term, I can apply it directly. But if people who
are looking at it (anybody else than Kees?) don't much care, then this
might be a 4.17 thing or at least "random -mm queue"?

The other unrelated reaction I had to this was that "we're passing
those types down very deep, even though nobody _cares_ about them all
that much at that deep level".

Honestly, the only case that really cares is the very top level, and
the rest could just take the properly cast versions.

For example, "max_t/min_t" really don't care at all, since they - by
definition - just take the single specified type.

So I'm wondering if we should just drop the types from __max/__min
(and everything they call) entirely, and instead do

#define __check_type(x,y) ((void)((typeof(x)*)1==(typeof(y)*)1))
#define min(x,y)   (__check_type(x,y),__min(x,y))
#define max(x,y)   (__check_type(x,y),__max(x,y))

#define min_t(t,x,y) __min((t)(x),(t)(y))
#define max_t(t,x,y) __max((t)(x),(t)(y))

and then __min/__max and friends are much simpler (and can just assume
that the type is already fine, and the casting has been done).

This is technically entirely independent of this VLA cleanup thing,
but the "passing the types around unnecessarily" just becomes more
obvious when there's now another level of macros, _and_ it's a more
complex expression too.

Yes, yes, the __single_eval_xyz() functions still end up wanting the
types for the declaration of the new single-evaluation variables, but
the 'typeof' pattern is the standard pattern, so

#define __single_eval_max(max1, max2, x, y) ({  \
typeof (x) max1 = (x);  \
typeof (y) max2 = (y);  \
max1 > max2 ? max1 : max2; })

actually looks more natural to me than passing the two types in as
arguments to the macro.

(That form also is amenable to things like "__auto_type" etc simplifications).
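
For reference, a sketch of that __auto_type variant (GNU C extension,
newer gcc only):

   #define __single_eval_max(x, y) ({    \
        __auto_type __x = (x);           \
        __auto_type __y = (y);           \
        __x > __y ? __x : __y; })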

Side note: do we *really* need the unique variable names? That's what
makes those things _really_ illegible. I think it's done just for a
sparse warning that we should probably ignore. It's off by default
anyway exactly because it doesn't work that well due to nested macro
expansions like this.

There is very real value to keeping our odd macros legible, I feel.
Even when they are complicated by issues like this, it would be good
to keep them as simple as possible.

Comments?

Linus


Re: [PATCH net-next] modules: allow modprobe load regular elf binaries

2018-03-09 Thread Linus Torvalds
On Fri, Mar 9, 2018 at 11:12 AM, Linus Torvalds
<torva...@linux-foundation.org> wrote:
>
> How are you going to handle five processes doing the same setup concurrently?

Side note: it's not just serialization. It's also "is it actually up
and running".

The rule for "request_module()" (for a real module) has been that it
returns when the module is actually alive and active and have done
their initcalls.

The UMH_WAIT_EXEC behavior (ignore the serialization - you could do
that in the caller) doesn't actually have any semantics AT
ALL. It only means that you get the error returns from execve()
itself, so you know that the executable file actually existed and
parsed right enough to get started.

But you don't actually have any reason to believe that it has *done*
anything, and started processing any requests. There's no reason
what-so-ever to believe that it has registered itself for any
asynchronous requests or anything like that.

So in the real module case, you can do

request_module("modulename");

and just start using whatever resource you just requested. So the
netfilter code literally does

request_module("nft-chain-%u-%.*s", family,
   nla_len(nla), (const char *)nla_data(nla));
nfnl_lock(NFNL_SUBSYS_NFTABLES);
type = __nf_tables_chain_type_lookup(nla, family);
if (type != NULL)
return ERR_PTR(-EAGAIN);

and doesn't even care about error handling for request_module()
itself, because it knows that either the module got loaded and is
ready, or something failed. And it needs to look that chain type up
anyway, so the failure is indicated by _that_.

With a UMH_WAIT_EXEC? No. You have *nothing*. You know the thing
started, but it might have SIGSEGV'd immediately, and you have
absolutely no way of knowing, and absolutely no way of even figuring
it out. You can wait - forever - for something to bind to whatever
dynamic resource you're expecting. You'll just fundamentally never
know.

You can try again, of course. Add a timeout, and try again in five
seconds or something. Maybe it will work then. Maybe it won't. You
won't have any way to know the _second_ time around either. Or the
third. Or...

See why I say it has to be synchronous?

If it's synchronous, you can actually do things like

 (a) maybe you only need a one-time thing, and don't have any state
("load fixed tables, be done") and that's it. If the process returns
with no error code, you're all done, and you know it's fine.

 (b) maybe the process wants to start a listener daemon or something
like the traditional inetd model. It can open the socket, it can start
listening on it, and it can fork off a child and check it's status. It
can then do exit(0) if everything is fine, and now request_module()
returns.
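
As an illustration of that (b) pattern in miniature (a userspace sketch
with made-up set_up_listener()/serve_requests() helpers):

   #include <sys/socket.h>
   #include <unistd.h>

   int set_up_listener(int fd);   /* made up: bind() + listen() etc */
   void serve_requests(int fd);   /* made up: the long-running part */

   int main(void)
   {
        int fd = socket(AF_INET, SOCK_STREAM, 0);

        if (fd < 0 || set_up_listener(fd) < 0)
                return 1;        /* request_module() sees the failure */

        if (fork() > 0)
                return 0;        /* parent: setup done, caller unblocks */

        serve_requests(fd);      /* child keeps running in the background */
        return 0;
   }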

see the difference? Even if you ended up with a background process
(like in that (b) case), you did so with *error* handling, and you did
so knowing that the state has actually been set up by the time the
request_module() returns.

And if you use the proper module loading exclusion, it also means that
that (b) can know it's the only process starting up, and it's not
racing with another one.  It might still want to do the usual
lock-files in user space to protect against just the admin starting it
manually, but at least you don't have the situation that a hundred
threads just had a thundering herd where they all ended up using the
same kernel facility, and they all independently started a hundred
usermode helpers.

  Linus


Re: [PATCH net-next] modules: allow modprobe load regular elf binaries

2018-03-09 Thread Linus Torvalds
On Fri, Mar 9, 2018 at 10:57 AM, David Miller  wrote:
>
> Once the helper UMH is invoked, it runs asynchronously taking eBPF
> translation requests.

How?

Really. See my comment about mutual exclusion. The current patch is
*broken* because it doesn't handle it. Really.

Think of it this way - you may have now started *five* of those things
concurrently by mistake.

The actual module loading case never does that, because the actual
module loading case has per-module serialization that got
short-circuited.

How are you going to handle five processes doing the same setup concurrently?

  Linus


Re: [PATCH net-next] modules: allow modprobe load regular elf binaries

2018-03-09 Thread Linus Torvalds
On Fri, Mar 9, 2018 at 10:48 AM, Andy Lutomirski <l...@amacapital.net> wrote:
>> On Mar 9, 2018, at 10:17 AM, Linus Torvalds <torva...@linux-foundation.org> 
>> wrote:
>>
>> Hmm. I wish we had an "execute blob" model, but we really don't, and
>> it would be hard/impossible to do without pinning the pages in memory.
>>
>
> Why so hard?  We can already execute a struct file for execveat, and Alexei 
> already has this working for umh.
> Surely we can make an immutable (as in even root can’t write it) 
> kernel-internal tmpfs file, execveat it, then unlink it.

And what do you think that does? It pins the memory for the whole
time. As a *copy* of the original file.

Anyway, see my other suggestion that makes this all irrelevant. Just
wait synchronously (until the exit), and just use deny_write_access().

The "synchronous wait" means that you don't have the semantic change
(and really., it's *required* anyway for the whole mutual exclusion
against another thread racing to load the same module), and the
deny_write_access() means that we don't neeed to make another copy.

Linus


Re: [PATCH net-next] modules: allow modprobe load regular elf binaries

2018-03-09 Thread Linus Torvalds
On Fri, Mar 9, 2018 at 10:43 AM, Kees Cook  wrote:
>
> Module loading (via kernel_read_file()) already uses
> deny_write_access(), and so does do_open_execat(). As long as module
> loading doesn't call allow_write_access() before the execve() has
> started in the new implementation, I think we'd be covered here.

No. kernel_read_file() only does it *during* the read.

So there's a huge big honking gap between the two.

Also, the second part of my suggestion was to be entirely synchronous
with the whole execution of the process, and do it within the "we do
mutual exclusion fo rmodules with the same name" logic.

Note that Andrei's patch uses UMH_WAIT_EXEC. That's basically
"vfork+exec" - it only waits for the exec to have started, it doesn't
wait for the whole thing.

So I'm saying "use UMH_WAIT_PROC, do it in a different place, and make
sure you cover the whole sequence with deny_write_access()".

  Linus


Re: [PATCH net-next] modules: allow modprobe load regular elf binaries

2018-03-09 Thread Linus Torvalds
On Thu, Mar 8, 2018 at 9:08 PM, Alexei Starovoitov  wrote:
>
> there is not abi breakage and file cannot disappear from running task.
> One cannot umount fs while file is still being used.

I think that "cannot umount" part _is_ the ABI breakage that Andy is
talking about.

> Not only "read twice", but "read many".
> If .text sections of elf that are not yet in memory can be modified
> by malicious user, later they will be brought in with different code.
> I think the easiest fix to tighten this "umh modules" to CAP_SYS_ADMIN.

I don't think it actually fixes anything.  It might just break things.
For all we know, people run modprobe with CAP_SYS_MODULE only, since
that is obviously the only capability it needs.

Hmm. I wish we had an "execute blob" model, but we really don't, and
it would be hard/impossible to do without pinning the pages in memory.

My gut feel is that the right direction to explore is:

 - consider the module loaded for the whole duration of the execve. So
the execution is a *blocking* operation (and we get the correct
exclusion semantics)

 - use deny_write_access() to make sure that we don't have active
writers and cannot get them during the execve.

The above mean that something that executes to load a new ebpf rule
will work very well.  But a "start and forget" will not work (although
you can obviously do so with an internal fork/exec).

Hmm?

   Linus


Re: [PATCH net-next] modules: allow modprobe load regular elf binaries

2018-03-08 Thread Linus Torvalds
On Thu, Mar 8, 2018 at 7:06 PM, Linus Torvalds
<torva...@linux-foundation.org> wrote:
>
> So I don't like Andy's "let's make it a kernel module and then that
> kernel module can execve() a blob". THAT seems like just stupid
> indirection.
>
> But I do like Andy's "execve a blob" part, because it is the *blob*
> that has had its signature verified, not the file!

To be honest., Andy's approach has the advantage that all the
synchronization we do for modules still works.

In particular, we have module loading logic where we synchronize
loading of modules with the same name, so that two things that do
request_module() concurrently will not lead to two modules being
loaded.

See that whole "module_mutex" thing, and the logic in (for example)
"add_unformed_module()".

Andrei's patch short-circuits the module loading before that, so if
you have two concurrent request_module() calls, you'll end up with no
serialization, and two executables.

So maybe Andy is right. He often is, after all.

 Linus


Re: [PATCH net-next] modules: allow modprobe load regular elf binaries

2018-03-08 Thread Linus Torvalds
On Thu, Mar 8, 2018 at 5:44 PM, Kees Cook  wrote:
>
> My concerns are mostly about crossing namespaces. If a container
> triggers an autoload, the result runs in the init_ns.

Heh. I saw that as an advantage. It's basically the same semantics as
a normal module load does - in that the "kernel namespace" is init_ns.

My own personal worry is actually different - we do check the
signature of the file we're loading, but we're then passing it off to
execve() not as the image we loaded, but as the file pointer.

So the execve() will end up not using the actual buffer we checked the
signature on, but instead just re-reading the file.

Now, that has two issues:

 (a) it means that 'init_module' doesn't work, only 'finit_module'.

Not a big deal, but I do think that we should fail the 'info->file
== NULL' case explicitly (instead of failing when it does an execve()
of /dev/null, which is what I think it does now - but that's just from
reading the patch, not from testing it).

 (b) somebody could maybe try to time it and modify the file
after-the-fact of the signature check, and then we execute something
else.

Honestly, that "read twice" thing may be what scuttles this.
Initially, I thought it was a non-issue, because anybody who controls
the module subdirectory enough to rewrite files would be in a position
to just execute the file itself directly instead.

But it turns out that isn't needed. Some bad actor could just do
"finit_module()" with a file that they just *copied* from the module
directory.

Yes, yes, they still need CAP_SYS_MODULE to even get that far, but it
does worry me.

So this does seem to turn "CAP_SYS_MODULE" into meaning "can run any
executable as root in the init namespace". They don't have to have the
module signing key that I thought would protect us, because they can
do that "rewrite after signature check".

So that does seem like a bad problem with the approach.

So I don't like Andy's "let's make it a kernel module and then that
kernel module can execve() a blob". THAT seems like just stupid
indirection.

But I do like Andy's "execve a blob" part, because it is the *blob*
that has had its signature verified, not the file!

 Linus


Re: [PATCH net-next] modules: allow modprobe load regular elf binaries

2018-03-08 Thread Linus Torvalds
On Thu, Mar 8, 2018 at 4:59 PM, Andy Lutomirski  wrote:
>
> Also, I don't see how this is any more exploitable than any other
> init_module().

Absolutely. If Kees doesn't trust the files to be loaded, an
executable - even if it's running with root privileges and in the
initns - is still fundamentally weaker than a kernel module.

So I don't understand the security argument AT ALL. It's nonsensical.
The executable loading does all the same security checks that the
module loading does, including the signing check.

And the whole point is that we can now do things with building and
loading a ebpf rule instead of having a full module.

 Linus


Re: [PATCH] kernel.h: Skip single-eval logic on literals in min()/max()

2018-03-08 Thread Linus Torvalds
On Thu, Mar 8, 2018 at 4:45 PM, Kees Cook  wrote:
>
> Rasmus mentioned this too. What I said there was that I was shy to
> make that change, since we already can't mix that kind of thing with
> the existing min()/max() implementation. The existing min()/max() is
> already extremely strict, so there are no instances of this in the
> tree.

Yes, but I also didn't want to add any new cases in case people add
new min/max() users.

But:

> If I explicitly add one, I see this with or without the patch:
>
> In file included from drivers/misc/lkdtm.h:7:0,
>  from drivers/misc/lkdtm_core.c:33:
> drivers/misc/lkdtm_core.c: In function ‘lkdtm_module_exit’:
> ./include/linux/kernel.h:809:16: warning: comparison of distinct
> pointer types lacks a cast

Oh, ok, in that case, just drop the __builtin_types_compatible_p()
entirely. It's not adding anything.

I was expecting the non-chosen expression in __builtin_choose_expr()
to not cause type warnings. I'm actually surprised it does. Type games
is why __builtin_choose_expr() tends to exist in the first place.

> So are you saying you _want_ the type enforcement weakened here, or
> that I should just not use __builtin_types_compatible_p()?

I don't want to weaken the type enforcement, and I _thought_ you had
done that __builtin_types_compatible_p() to keep it in place.

But if that's not why you did it, then why was it there at all? If the
type warning shows through even if it's in the other expression, then
just a


#define __min(t1, t2, x, y) \
__builtin_choose_expr(  \
__builtin_constant_p(x) &   \
__builtin_constant_p(y),\
(t1)(x) < (t2)(y) ? (t1)(x) : (t2)(y),  \
__single_eval_min(t1, t2,   \
   ...

would seem to be sufficient?

Because logically, the only thing that matters is that x and y don't
have any side effects and can be evaluated twice, and
"__builtin_constant_p()" is already a much stronger version of that.

Hmm? The __builtin_types_compatible_p() just doesn't seem to matter
for the only thing I thought it was there for.

  Linus


Re: [PATCH] kernel.h: Skip single-eval logic on literals in min()/max()

2018-03-08 Thread Linus Torvalds
On Thu, Mar 8, 2018 at 1:40 PM, Kees Cook  wrote:
> +#define __min(t1, t2, x, y)\
> +   __builtin_choose_expr(__builtin_constant_p(x) &&\
> + __builtin_constant_p(y) &&\
> + __builtin_types_compatible_p(t1, t2), \
> + (t1)(x) < (t2)(y) ? (t1)(x) : (t2)(y),\

I understand why you use __builtin_types_compatible_p(), but please don't.

It will mean that trivial constants like "5" and "sizeof(x)" won't
simplify, because they have different types.

The ?: will give the right combined type anyway, and if you want the
type comparison warning, just add a comma-expression with something
like like

   (t1 *)1 == (t2 *)1

to get the type compatibility warning.
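
Something like this (a sketch of that suggestion, not the final macro):

  /* evaluates to (void)0, but warns "comparison of distinct pointer
   * types lacks a cast" whenever t1 and t2 differ */
  #define __typecheck_warn(t1, t2)  ((void)((t1 *)1 == (t2 *)1))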

Yeah, yeah, maybe none of the VLA cases triggered that, but it seems
silly to not just get that obvious constant case right.

Hmm?

  Linus


Re: [PATCH net-next] modules: allow modprobe load regular elf binaries

2018-03-06 Thread Linus Torvalds
On Tue, Mar 6, 2018 at 12:01 PM, Andy Lutomirski  wrote:
>
> I assume I'm missing some context here, but why does this need to be
> handled by the kernel rather than, say, a change to how modprobe
> works?

Honestly, the less we have to mess with user-mode tooling, the better.

We've been *so* much better off moving most of the module loading
logic to the kernel, we should not go back in the old broken
direction.

I do *not* want the kmod project that is then taken over by systemd,
and breaks it the same way they broke firmware loading.

Keep modprobe doing one thing, and one thing only: track dependencies
and mindlessly just load the modules. Do *not* ask for it to do
anything else.

Right now kmod is a nice simple project. Lots of testsuite stuff, and
a very clear goal. Let's keep kmod doing one thing, and not even have
to care about internal kernel decisions like "oh, this module might
not be a module, but an executable".

If anything, I think we want to keep our options open, in the case we
need or want to ever consider short-circuiting things and allowing
direct loading of the simple cases and bypassing modprobe entirely.

   Linus


Re: [PATCH net-next] modules: allow modprobe load regular elf binaries

2018-03-06 Thread Linus Torvalds
On Mon, Mar 5, 2018 at 5:34 PM, Alexei Starovoitov  wrote:
> As the first step in development of bpfilter project [1] the request_module()
> code is extended to allow user mode helpers to be invoked. Idea is that
> user mode helpers are built as part of the kernel build and installed as
> traditional kernel modules with .ko file extension into distro specified
> location, such that from a distribution point of view, they are no different
> than regular kernel modules. Thus, allow request_module() logic to load such
> user mode helper (umh) modules via:
[,,]

I like this, but I have one request: can we make sure that this action
is visible in the system messages?

When we load a regular module, at least it shows in lsmod afterwards,
although I have a few times wanted to really see module load as an
event in the logs too.

When we load a module that just executes a user program, and there is
no sign of it in the module list, I think we *really* need to make
that event show to the admin some way.

.. and yes, maybe we'll need to rate-limit the messages, and maybe it
turns out that I'm entirely wrong and people will hate the messages
after they get used to the concept of these pseudo-modules, but
particularly for the early implementation when this is a new thing, I
really want a message like

 executed user process xyz-abc as a pseudo-module

or something in dmesg.

I do *not* want this to be a magical way to hide things.

  Linus


Re: [4.15-rc9] fs_reclaim lockdep trace

2018-01-27 Thread Linus Torvalds
On Sat, Jan 27, 2018 at 2:24 PM, Dave Jones  wrote:
> On Tue, Jan 23, 2018 at 08:36:51PM -0500, Dave Jones wrote:
>  > Just triggered this on a server I was rsync'ing to.
>
> Actually, I can trigger this really easily, even with an rsync from one
> disk to another.  Though that also smells a little like networking in
> the traces. Maybe netdev has ideas.

Is this new to 4.15? Or is it just that you're testing something new?

If it's new and easy to repro, can you just bisect it? And if it isn't
new, can you perhaps check whether it's new to 4.14 (ie 4.13 being
ok)?

Because that fs_reclaim_acquire/release() debugging isn't new to 4.15,
but it was rewritten for 4.14.. I'm wondering if that remodeling ended
up triggering something.

Adding PeterZ to the participants list in case he has ideas. I'm not
seeing what would be the problem in that call chain from hell.

   Linus


Re: [RFC][PATCH] get rid of the use of set_fs() (by way of kernel_recvmsg()) in sunrpc

2018-01-17 Thread Linus Torvalds
On Wed, Jan 17, 2018 at 7:06 PM, Al Viro  wrote:
>
> Similar to the way put_cmsg() handles 32bit case on biarch
> targets, introduce a flag telling put_cmsg() to treat
> ->msg_control as kernel pointer, using memcpy instead of
> copy_to_user().  That allows to avoid the use of kernel_recvmsg()
> with its set_fs().

If this is the only case that kernel_recvmsg() exists for, then by all
means, that patch certainly looks like a good thing.

 Linus


Re: [PATCH v2 00/19] prevent bounds-check bypass via speculative execution

2018-01-13 Thread Linus Torvalds
On Fri, Jan 12, 2018 at 4:15 PM, Tony Luck  wrote:
>
> Here there isn't any reason for speculation. The core has the
> value of 'x' in a register and the upper bound encoded into the
> "cmp" instruction.  Both are right there, no waiting, no speculation.

So this is an argument I haven't seen before (although it was brought
up in private long ago), but that is very relevant: the actual scope
and depth of speculation.

Your argument basically depends on just what gets speculated, and on
the _actual_ order of execution.

So your argument depends on "the uarch will actually run the code in
order if there are no events that block the pipeline".

Or at least it depends on a certain latency of the killing of any OoO
execution being low enough that the cache access doesn't even begin.

I realize that that is very much a particular microarchitectural
detail, but it's actually a *big* deal. Do we have a set of rules for
what is not a worry, simply because the speculated accesses get killed
early enough?

Apparently "test a register value against a constant" is good enough,
assuming that register is also needed for the address of the access.

Linus


Re: [PATCH v2 00/19] prevent bounds-check bypass via speculative execution

2018-01-11 Thread Linus Torvalds
On Thu, Jan 11, 2018 at 4:46 PM, Dan Williams  wrote:
>
> This series incorporates Mark Rutland's latest ARM changes and adds
> the x86 specific implementation of 'ifence_array_ptr'. That ifence
> based approach is provided as an opt-in fallback, but the default
> mitigation, '__array_ptr', uses a 'mask' approach that removes
> conditional branches instructions, and otherwise aims to redirect
> speculation to use a NULL pointer rather than a user controlled value.

Do you have any performance numbers and perhaps example code
generation? Is this noticeable? Are there any microbenchmarks showing
the difference between lfence use and the masking model?

Having both seems good for testing, but wouldn't we want to pick one in the end?

Also, I do think that there is one particular array load that would
seem to be pretty obvious: the system call function pointer array.

Yes, yes, the actual call is now behind a retpoline, but that protects
against a speculative BTB access, it's not obvious that it  protects
against the mispredict of the __NR_syscall_max comparison in
arch/x86/entry/entry_64.S.

The act of fetching code is a kind of read too. And retpoline protects
against BTB stuffing etc, but what if the _actual_ system call
function address is wrong (due to mis-prediction of the system call
index check)?

Should the array access in entry_SYSCALL_64_fastpath be made to use
the masking approach?

Linus


Re: [PATCH 16/18] net: mpls: prevent bounds-check bypass via speculative execution

2018-01-09 Thread Linus Torvalds
On Tue, Jan 9, 2018 at 4:48 PM, Dan Williams  wrote:
>
> I looks like there is another problem, or I'm misreading the
> cleverness.

I think you were misreading it.

I was basically saying that this:

unsigned long _mask = ~(long)(_m - 1 - _i) >> BITS_PER_LONG - 1;\

doesn't work, and that the "_m -1 - _i" needs to be replaced by "_i |
_m -1 -_i".

So you have

unsigned long _mask = ~(long)(_i | (_m - 1 - _i)) >> BITS_PER_LONG - 1;\

which should give the right result. No?

But as mentioned, I think you can do it with two instructions if you
do an architecture-specific inline asm:

unsigned long mask;

asm("cmpq %1,%2; sbbq %0,%0"
:"=r" (mask)
:"g" (max),"r" (idx));

which is likely much faster, and has much better register usage ("max"
can be a constant or loaded directly from memory, and "mask" could be
assigned the same register as idx).
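
In C terms, what the cmpq/sbbq pair computes is simply (sketch only):

   /* all-ones when idx is below max, all-zeroes otherwise; the asm gets
    * there via the carry flag instead of a conditional branch */
   unsigned long mask = (idx < max) ? ~0UL : 0UL;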

But once again, I didn't really test it.

Note that the "cmpq/sbbq" version works regardless of max/idx values,
since it literally does the math in BITS_PER_LONG+1 bits.

In contrast, the "binary or with idx" version only works if the high
bit set in idx cannot be valid (put another way: 'max' must not be
bigger than MAXLONG+1).

   Linus


Re: Re: dvb usb issues since kernel 4.9

2018-01-09 Thread Linus Torvalds
On Tue, Jan 9, 2018 at 9:57 AM, Eric Dumazet  wrote:
>
> Your patch considers TASKLET_SOFTIRQ being a candidate for 'immediate
> handling', but TCP Small queues heavily use TASKLET,
> so as far as I am concerned a revert would have the same effect.

Does it actually?

TCP ends up dropping packets outside of the window etc, so flooding a
machine with TCP packets and causing some further processing up the
stack sounds very different from the basic packet flooding thing that
happens with NET_RX_SOFTIRQ.

Also, honestly, the kinds of people who really worry about flooding
tend to have packet filtering in the receive path etc.

So I really think "you can use up 90% of CPU time with a UDP packet
flood from the same network" is very very very different - and
honestly not at all as important - as "you want to be able to use a
USB DVB receiver and watch/record TV".

Because that whole "UDP packet flood from the same network" really is
something you _fundamentally_ have other mitigations for.

I bet that whole commit was introduced because of a benchmark test,
rather than real life. No?

In contrast, now people are complaining about real loads not working.

 Linus


Re: dvb usb issues since kernel 4.9

2018-01-09 Thread Linus Torvalds
On Tue, Jan 9, 2018 at 9:42 AM, Mauro Carvalho Chehab
 wrote:
>
> On my preliminar tests, writing to a file on an ext4 partition at a
> USB stick loses data up to the point to make it useless (1/4 of the data
> is lost!). However, writing to a class 10 microSD card is doable.

Note that most USB sticks are horrible crap. They can have write
latencies counted in _seconds_.

You can cause VM issues and various other non-hardware stalls with
them, simply because something gets stuck waiting for a page writeout
that should take a few ms on any reasonable hardware, but ends up
talking half a second or more.

For example, even really well-written software that tries to do things
like threaded write-behind to smooth out the IO will be _totally_
screwed by the USB stick behavior (where you might write a few MB at
high speeds, and then the next write - however small - takes a second
because the stupid USB stick does a synchronous garbage collection).
Suddenly all that clever software that tried to keep things moving
along smoothly without any hiccups, and tried hard to make the USB bus
have a nice constant load, can't do anything at all about the crap
hardware.

So when testing writes to USB sticks, I'm not convinced you're
actually testing any USB bus limitations or even really any other
hardware limitations than the USB stick itself.

  Linus


Re: Re: dvb usb issues since kernel 4.9

2018-01-09 Thread Linus Torvalds
On Tue, Jan 9, 2018 at 9:27 AM, Eric Dumazet  wrote:
>
> So yes, commit 4cd13c21b207 ("softirq: Let ksoftirqd do its job") has
> shown up multiple times in various 'regressions'
> simply because it could surface the problem more often.
> But even if you revert it, you can still make the faulty
> driver/subsystem misbehave by adding more stress to the cpu handling
> the IRQ.

..but that's always true. People sometimes live on the edge - often by
design (ie hardware has been designed/selected to be the crappiest
possible that still works).

That doesn't change anything. A patch that takes "bad things can
happen" to "bad things DO happen" is a bad patch.

> Maybe the answer is to tune the kernel for small latencies at the
> price of small throughput (situation before the patch)

Generally we always want to tune for latency. Throughput is "easy",
but almost never interesting.

Sure, people do batch jobs. And yes, people often _benchmark_
throughput, because it's easy to benchmark. It's much harder to
benchmark latency, even when it's often much more important.

A prime example is the SSD benchmarks in the last few years - they
improved _dramatically_ when people noticed that the real problem was
latency, not the idiotic maximum big-block bandwidth numbers that have
almost zero impact on most people.

Put another way: we already have a very strong implicit bias towards
bandwidth just because it's easier to see and measure.

That means that we generally should strive to have an explicit bias
towards optimizing for latency when that choice comes up.  Just to
balance things out (and just to not take the easy way out: bandwidth
can often be improved by adding more layers of buffering and bigger
buffers, and that often ends up really hurting latency).

> 1) Revert the patch

Well, we can revert it only partially - limiting it to just networking
for example.

Just saying "act the way you used to for tasklets" already seems to
have fixed the issue in DVB.

> 2) get rid of ksoftirqd since it adds unexpected latencies.

We can't get rid of it entirely, since the synchronous softirq code
can cause problems too. It's why we have that "maximum of ten
synchronous events" in __do_softirq().

And we don't *want* to get rid of it.

We've _always_ had that small-scale "at some point we can't do it
synchronously any more".

That is a small-scale "don't have horrible latency for _other_ things"
protection. So it's about latency too, it's just about protecting
latency of the rest of the system.

The problem with commit 4cd13c21b207 is that it turns the small-scale
latency issues in softirq handling (they get larger latencies for lots
of hardware interrupts or even from non-preemptible kernel code) into
the _huge_ scale latency of scheduling, and does so in a racy way too.

> 3) Let applications that expect to have high throughput make sure to
> pin their threads on cpus that are not processing IRQ.
> (And make sure to not use irqbalance, and setup IRQ cpu affinities)

The only people that really deal in "throughput only" tend to be the
HPC people, and they already do things like this.

(The other end of the spectrum is the realtime people that have
extreme latency requirements, who do things like that for the reverse
reason: keeping one or more CPU's reserved for the particular
low-latency realtime job).

   Linus


Re: [PATCH 16/18] net: mpls: prevent bounds-check bypass via speculative execution

2018-01-08 Thread Linus Torvalds
On Mon, Jan 8, 2018 at 8:13 PM, Linus Torvalds
<torva...@linux-foundation.org> wrote:
>
> # carry will be clear if idx >= max
> cmpq %idx,%max

Bah. Other way around.

cmpq %max,%idx

I'm a moron.

> # mask will be clear if carry was clear, ~0 otherwise
> sbbq %mask,%mask
>
> to generate mask directly. I might have screwed that up. Worth perhaps trying?

More importantly, worth _testing_ and fixing my hand-waving "asm like
this" crap.

But I do think that simple two-instruction cmpq/sbbq sequence could
get it right in just two trivial ALU instructions.

  Linus


Re: [PATCH 16/18] net: mpls: prevent bounds-check bypass via speculative execution

2018-01-08 Thread Linus Torvalds
On Mon, Jan 8, 2018 at 7:42 PM, Dan Williams  wrote:
>
> originally from Linus and tweaked by Alexei and I:

Sadly, that tweak - while clever - is wrong.

> unsigned long _mask = ~(long)(_m - 1 - _i) >> BITS_PER_LONG - 1;\

Why?

Because "(long)(_m-1-_i)" is not negative just because "i >= m". It
can still be positive.

Think "m = 100", "i=bignum". The subtraction will overflow and become
positive again, the shift will shift to zero, and then the mask will
become ~0.

Now, you can fix it, but you need to be a tiny bit more clever.  In
particular, just make sure that you retain the high bit of "_i",
basically making the rule be that a negative index is not ever valid.

And then instead of "(_m - 1 - _i)", you use "(_i | (_m - 1 - _i))".
Now the sign bit is set if _i had it set, _or_ if the subtraction
turned negative, and you don't have to worry about the overflow
situation.

But it does require that extra step to be trustworthy. Still purely
cheap arithmetic operations, although there is possibly some
additional register pressure there.

Somebody might be able to come up with something even more minimal (or
find a fault in my fix of your tweak).
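
Spelled out, the fixed version would be something like this (sketch
only, same "please test it" caveat as the rest of this thread):

static inline unsigned long index_mask(unsigned long idx, unsigned long max)
{
        /* ~0UL if 0 <= idx < max, 0UL otherwise.  Keeping idx's own
         * sign bit in the OR is what turns the overflow case above
         * into a zero mask.  Assumes max is no bigger than MAXLONG+1,
         * and relies on the arithmetic right shift gcc does for
         * signed types.
         */
        return ~(long)(idx | (max - 1 - idx)) >> (BITS_PER_LONG - 1);
}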

Obviously, with architecture-specific code, you may well be able to do
better, using the carry flag of the subtraction.

For example, on x86, I think you could do it with just two instructions:

# carry will be clear if idx >= max
cmpq %idx,%max

# mask will be clear if carry was clear, ~0 otherwise
sbbq %mask,%mask

to generate mask directly. I might have screwed that up. Worth perhaps trying?

   Linus


Re: [PATCH 06/18] x86, barrier: stop speculation for failed access_ok

2018-01-08 Thread Linus Torvalds
On Mon, Jan 8, 2018 at 3:53 PM, Dan Williams  wrote:
>
> I've been thinking the "and" is only suitable for the array bounds
> check, for get_user() we're trying to block speculation past
> access_ok() at which point we can only do the lfence?

Well, we *could* do the "and", at least for the simple cases (ie the
true "get_user()" that integrates the access_ok with the access).

IOW, mainly the code in arch/x86/lib/getuser.S.

But it probably is a lot simpler to just add the "lfence" to ASM_STAC,
because by definition those cases don't tend to be the truly critical
ones - people who use those functions tend to do one or two accesses,
and the real cost is likely the I$ misses and the D$ miss to get
current->addr_limit. Not to mention the "stac" itself, which is much
more expensive than the access on current microarchitectures.

But something like this *might* work:

   index c97d935a29e8..7fa3d293beaf 100644
   --- a/arch/x86/lib/getuser.S
   +++ b/arch/x86/lib/getuser.S
   @@ -38,8 +38,11 @@
   .text
ENTRY(__get_user_1)
   mov PER_CPU_VAR(current_task), %_ASM_DX
   -   cmp TASK_addr_limit(%_ASM_DX),%_ASM_AX
   +   mov TASK_addr_limit(%_ASM_DX),%_ASM_DX
   +   cmp %_ASM_DX,%_ASM_AX
   jae bad_get_user
   +   or $0xfff,%_ASM_DX
   +   and %_ASM_DX,%_ASM_AX
   ASM_STAC
1: movzbl (%_ASM_AX),%edx
   xor %eax,%eax

(this only does the one-byte case - the 2/4/8 byte cases are exactly the same).

The above is completely untested and might have some stupid
thinko/typo, so take it purely as a "example patch" to show the
concept, rather than actually do it.

But just adding "lfence" to the existing ASM_STAC is a hell of a lot
easier, and the performance difference between that trivial patch and
the above "let's be clever with 'and'" might not be measurable.

I really have no idea how expensive lfence might actually end up being
in practice. It's possible that lfence is actually fairly cheap in
kernel code, since we tend to not have very high IPC anyway.

 Linus


Re: [PATCH 06/18] x86, barrier: stop speculation for failed access_ok

2018-01-08 Thread Linus Torvalds
On Mon, Jan 8, 2018 at 1:09 PM, Dan Williams <dan.j.willi...@intel.com> wrote:
> On Sat, Jan 6, 2018 at 5:20 PM, Linus Torvalds
> <torva...@linux-foundation.org> wrote:
>> On Sat, Jan 6, 2018 at 3:31 PM, Dan Williams <dan.j.willi...@intel.com> 
>> wrote:
>>>
>>> I assume if we put this in uaccess_begin() we also need audit for
>>> paths that use access_ok but don't do on to call uaccess_begin()? A
>>> quick glance shows a few places where we are open coding the stac().
>>> Perhaps land the lfence in stac() directly?
>>
>> Yeah, we should put it in uaccess_begin(), and in the actual user
>> accessor helpers that do stac. Some of them probably should be changed
>> to use uaccess_begin() instead while at it.
>>
>> One question for the CPU people: do we actually care and need to do
>> this for things that might *write* to something? The speculative write
>> obviously is killed, but does it perhaps bring in a cacheline even
>> when killed?
>
> As far as I understand a write could trigger a request-for-ownership
> read for the target cacheline.

Oh, absolutely.

I just wonder at what point that happens.

Honestly, trying to get exclusive access to a cacheline can be _very_
expensive (not just for the local thread), so I would actually expect
that doing so for speculative writes is actually bad for performance.

That's doubly true because - unlike reads - there is no critical
latency issue, so trying to get the cache access started as early as
possible simply isn't all that important.

So I suspect that a write won't actually try to allocate the cacheline
until the write has actually retired.

End result: writes - unlike reads - *probably* will not speculatively
perturb the cache with speculative write addresses.

> Even though writes can trigger reads, as far as I can see the write
> needs to be dependent on the first out-of-bounds read

Yeah. A write on its own wouldn't matter, even if it were to perturb
the cache state, because the address already comes from user space, so
there's no new information in the cache perturbation for the attacker.

But that all implies that we shouldn't need the lfence for the
"put_user()" case, only for the get_user() (where the value we read
would then perhaps be used to do another access).

So we want to add the lfence (or "and") to get_user(), but not
necessarily put_user().

Agreed?

  Linus


Re: dvb usb issues since kernel 4.9

2018-01-08 Thread Linus Torvalds
On Mon, Jan 8, 2018 at 11:15 AM, Alan Stern  wrote:
>
> Both dwc2_hsotg and ehci-hcd use the tasklets embedded in the
> giveback_urb_bh member of struct usb_hcd.  See usb_hcd_giveback_urb()
> in drivers/usb/core/hcd.c; the calls are
>
> else if (high_prio_bh)
> tasklet_hi_schedule(&bh->bh);
> else
> tasklet_schedule(&bh->bh);
>
> As it turns out, high_prio_bh gets set for interrupt and isochronous
> URBs but not for bulk and control URBs.  The DVB driver in question
> uses bulk transfers.

Ok, so we could try out something like the appended?

NOTE! I have not tested this at all. It LooksObvious(tm), but...

Linus
 kernel/softirq.c | 12 ++++++++----
 1 file changed, 8 insertions(+), 4 deletions(-)

diff --git a/kernel/softirq.c b/kernel/softirq.c
index 2f5e87f1bae2..97b080956fea 100644
--- a/kernel/softirq.c
+++ b/kernel/softirq.c
@@ -79,12 +79,16 @@ static void wakeup_softirqd(void)
 
 /*
  * If ksoftirqd is scheduled, we do not want to process pending softirqs
- * right now. Let ksoftirqd handle this at its own rate, to get fairness.
+ * right now. Let ksoftirqd handle this at its own rate, to get fairness,
+ * unless we're doing some of the synchronous softirqs.
  */
-static bool ksoftirqd_running(void)
+#define SOFTIRQ_NOW_MASK ((1 << HI_SOFTIRQ) | (1 << TASKLET_SOFTIRQ))
+static bool ksoftirqd_running(unsigned long pending)
 {
struct task_struct *tsk = __this_cpu_read(ksoftirqd);
 
+   if (pending & SOFTIRQ_NOW_MASK)
+   return false;
return tsk && (tsk->state == TASK_RUNNING);
 }
 
@@ -325,7 +329,7 @@ asmlinkage __visible void do_softirq(void)
 
pending = local_softirq_pending();
 
-   if (pending && !ksoftirqd_running())
+   if (pending && !ksoftirqd_running(pending))
do_softirq_own_stack();
 
local_irq_restore(flags);
@@ -352,7 +356,7 @@ void irq_enter(void)
 
 static inline void invoke_softirq(void)
 {
-   if (ksoftirqd_running())
+   if (ksoftirqd_running(local_softirq_pending()))
return;
 
if (!force_irqthreads) {


Re: [PATCH bpf] bpf: prevent out-of-bounds speculation

2018-01-08 Thread Linus Torvalds
On Mon, Jan 8, 2018 at 9:05 AM, Mark Rutland  wrote:
>
> I'm a little worried that in the presence of some CPU/compiler
> optimisations, the masking may effectively be skipped under speculation.
> So I'm not sure how robust this is going to be.

Honestly, I think the masking is a hell of a lot more robust than any
of the "official" fixes.

More generic data speculation (as opposed to control speculation) is

 (a) mainly academic masturbation

 (b) hasn't even been shown to be a good idea even in _theory_ yet
(except for the "completely unreal hardware" kind of theory where
people assume some data oracle)

 (c) isn't actually done in any real CPU's today that I'm aware of
(unless you want to call the return stack data speculation).

and the thing is, we should really not then worry about "... but maybe
future CPU's will be more aggressive", which is the traditional worry
in these kinds of cases.

Why? Because quite honestly, any future CPU's that are more aggressive
about speculating things like this are broken shit that we should call
out as such, and tell people not to use.

Seriously.

In this particular case, we should be very much aware of future CPU's
being more _constrained_, because CPU vendors had better start taking
this thing into account.

So the masking approach is FUNDAMENTALLY SAFER than the "let's try to
limit control speculation".

If somebody can point to a CPU that actually speculates across an
address masking operation, I will be very surprised. And unless you
can point to that, then stop trying to dismiss the masking approach.

The only thing we need to be really really careful about is to make
sure that the mask generation itself is not in a control speculation
path.

IOW, the mask really has to be a true data dependency, and has to be
generated arithmetically. Because if the mask generation has a control
dependency on it, then obviously that defeats the whole "make sure we
don't have control speculation" approach.
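
Concretely, the shape is something like this (illustrative sketch, not
the actual bpf patch):

unsigned long array_load(const unsigned long *array,
                         unsigned long idx, unsigned long max)
{
        unsigned long mask;

        /* the architectural bounds check stays */
        if (idx >= max)
                return 0;

        /* the mask only exists so a *mispredicted* path can't use a
         * wild index: it's generated with plain ALU ops, and the load
         * address depends on it as data, not through a branch.
         */
        mask = ~(long)(idx | (max - 1 - idx)) >> (BITS_PER_LONG - 1);

        return array[idx & mask];
}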

   Linus


Re: dvb usb issues since kernel 4.9

2018-01-08 Thread Linus Torvalds
On Mon, Jan 8, 2018 at 9:55 AM, Ingo Molnar  wrote:
>
> as I doubt we have enough time to root-case this properly.

Well, it's not like this is a new issue, and we don't have to get it
fixed for 4.15. It's been around since 4.9, it's not a "have to
suddenly fix it this week" issue.

I just think that people should plan on having to maybe revert it and
mark the revert for stable.

But if the USB or DVB layers can instead just make the packet queue a
bit deeper and not react so badly to the latency of a single softirq,
that would obviously be a good thing in general, and maybe fix this
issue. So I'm not saying that the revert is inevitable either.

But I have to say that that commit 4cd13c21b207 ("softirq: Let
ksoftirqd do its job") was a pretty damn big hammer, and entirely
ignored the "softirqs can have latency concerns" issue.

So I do feel like the UDP packet storm thing might want a somewhat
more directed fix than that huge hammer of trying to move softirqs
aggressively into the softirq thread.

This is not that different from threaded irqs. And while you can set
the "thread every irq" flag, that would be largely insane to do by
default and in general. So instead, people do it either for specific
irqs (ie "request_threaded_irq()") or they have a way to opt out of it
(IRQF_NO_THREAD).
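
(For reference, the per-driver opt-in pattern looks roughly like this -
the names are made up, only the API is real:)

static irqreturn_t my_hardirq(int irq, void *dev_id)
{
        /* ack/mask the hardware, grab the minimal state */
        return IRQ_WAKE_THREAD;
}

static irqreturn_t my_thread_fn(int irq, void *dev_id)
{
        /* the heavy lifting, now in a dedicated kernel thread */
        return IRQ_HANDLED;
}

/* in the driver's probe(): */
ret = request_threaded_irq(irq, my_hardirq, my_thread_fn,
                           IRQF_ONESHOT, "mydev", dev);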

I _suspect_ that the softirq thing really just wants the same thing.
Have the networking case maybe set the "prefer threaded" flag just for
networking, if it's less latency-sensitive for softirq handling than
other softirq users are.

In fact, even for networking, there are separate TX/RX softirqs, maybe
networking would only set it for the RX case? Or maybe even trigger it
only for cases where things queue up and it goes into a "polling mode"
(like NAPI already does).

Of course, I don't even know _which_ softirq it is that the DVB case
has issues with. Maybe it's the same NET_RX case?

But looking at that offending commit, I do note (for example), that we
literally have things like tasklet[_hi]_schedule() that might have
been explicitly expected to just run the tasklet at a fairly low
latency (maybe instead of a workqueue exactly because it doesn't need
to sleep and wants lower latency).
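
(The usage pattern in question, as a sketch - the driver names are
invented:)

static void my_drv_rx_work(unsigned long data)
{
        /* push completed buffers up the stack, resubmit URBs, ... */
}

static DECLARE_TASKLET(my_drv_rx_tasklet, my_drv_rx_work, 0);

static void my_drv_urb_complete(struct urb *urb)    /* URB completion callback */
{
        /* the whole point of picking a tasklet over a workqueue:
         * "run this soon", not "whenever a worker thread gets CPU"
         */
        tasklet_schedule(&my_drv_rx_tasklet);
}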

So saying "just because softirqd is possibly already woken up, let's
delay all those tasklets etc" does really seem very wrong to me.

Can somebody tell which softirq it is that dvb/usb cares about?

 Linus


Re: dvb usb issues since kernel 4.9

2018-01-07 Thread Linus Torvalds
On Sat, Jan 6, 2018 at 11:54 AM, Mauro Carvalho Chehab
 wrote:
>
> On Sat, 6 Jan 2018 16:04:16 +0100
> "Josef Griebichler"  wrote:
>>
>> the causing commit has been identified.
>> After reverting commit 
>> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=4cd13c21b207e80ddb1144c576500098f2d5f882
>> its working again.
>
> Just replying to me won't magically fix this. The ones that were involved on
> this patch should also be c/c, plus USB people. Just added them.

Actually, you seem to have added an odd subset of the people involved.

For example, Ingo - who actually committed that patch - wasn't on the cc.

I do think we need to simply revert that patch. It's very simple: it
has been reported to lead to actual problems for people, and we don't
fix one problem and then say "well, it fixed something else" when
something breaks.

When something breaks, we either unbreak it, or we revert the change
that caused the breakage.

It's really that simple. That's what "no regressions" means.  We don't
accept changes that cause regressions. This one did.

  Linus


Re: [PATCH 06/18] x86, barrier: stop speculation for failed access_ok

2018-01-07 Thread Linus Torvalds
On Sun, Jan 7, 2018 at 12:12 PM, Willy Tarreau  wrote:
>
> Linus, no need to explain that to me, I'm precisely trying to see how
> to disable PTI for a specific process because I face up to 45% loss in
> certain circumstances, making it a no-go. But while a few of us have
> very specific workloads emphasizing this impact, others have very
> different ones and will not notice. For example my laptop did boot
> pretty fine and I didn't notice anything until I fire a network
> benchmark.

Sure, most people have hardware where the bottleneck is entirely
elsewhere (slow network, rotating disk, whatever).

But this whole "normal people won't notice" is dangerous thinking.
They may well notice very much, we simply don't know what they are
doing.

Quite honestly, it's equally correct to say "normal people won't be
affected by the security issue in the first place".

That laptop that you didn't have any issues with? Likely it never had
an exploit running on it either!

So the whole "normal people" argument is pure and utter garbage. It's
wrong. It's pure shit when it comes to performance, but it's also pure
shit when it comes to the security issue.

Don't use it.

We need to fix the security problem, but we need to do it *without*
these braindead arguments that performance is somehow secondary.

   Linus


Re: [PATCH 06/18] x86, barrier: stop speculation for failed access_ok

2018-01-07 Thread Linus Torvalds
On Sat, Jan 6, 2018 at 10:33 PM, Willy Tarreau  wrote:
>
> To be fair there's overreaction on both sides. The vast majority of
> users need to get a 100% safe system and will never notice  any
> difference.

There is no such thing as a "100% safe system". Never will be - unless
you make sure you have no users.

Also, people definitely *are* noticing the performance issues with the
current set of patches, and they are causing real problems. Go search
for reports of Amazon AWS slowdowns.

So this whole "security is so important that performance doesn't
matter" mindset is pure and utter garbage.

And the whole "normal people won't even notice" is pure garbage too.
Don't spread that bullshit when you see actual normal people
complaining.

Performance matters. A *LOT*.

Linus


Re: [PATCH 06/18] x86, barrier: stop speculation for failed access_ok

2018-01-06 Thread Linus Torvalds
On Sat, Jan 6, 2018 at 3:31 PM, Dan Williams  wrote:
>
> I assume if we put this in uaccess_begin() we also need audit for
> paths that use access_ok but don't do on to call uaccess_begin()? A
> quick glance shows a few places where we are open coding the stac().
> Perhaps land the lfence in stac() directly?

Yeah, we should put it in uaccess_begin(), and in the actual user
accessor helpers that do stac. Some of them probably should be changed
to use uaccess_begin() instead while at it.

One question for the CPU people: do we actually care and need to do
this for things that might *write* to something? The speculative write
obviously is killed, but does it perhaps bring in a cacheline even
when killed?

Because maybe we don't need the lfence in put_user(), only in get_user()?

   Linus


Re: [PATCH 06/18] x86, barrier: stop speculation for failed access_ok

2018-01-06 Thread Linus Torvalds
On Sat, Jan 6, 2018 at 4:32 AM, Alan Cox  wrote:
>
> Also for x86-64 if we are trusting that an AND with a constant won't get
> speculated into something else surely we can just and the address with ~(1
> << 63) before copying from/to user space ? The user will then just
> speculatively steal their own memory.

User accesses *may* go to the kernel too, with set_fs(KERNEL_DS).
We've been getting rid of those, but they still exist.

We historically perhaps have 'and'ed the address with
current->thread.addr_limit, but it's no longer a power-of-two mask, so
that doesn't work. We'd have to play tricks there, but it might work
to do something like

 addr &= current->thread.addr_limit | 0xfff;

or similar.
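
As a sketch (field names from memory, so treat it as illustration
rather than a patch):

static inline void __user *mask_user_ptr(void __user *ptr)
{
        unsigned long addr = (unsigned long)ptr;

        /* Not a power-of-two mask, so OR in the low page bits to keep
         * in-page offsets intact.  A kernel address fed in here loses
         * its high bits and collapses into the user half of the
         * address space, so a mis-speculated access can at worst
         * touch memory the user could read anyway.
         */
        addr &= current->thread.addr_limit.seg | 0xfff;
        return (void __user *)addr;
}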

But this is one area where the 'lfence' is probably not too bad. The
cost of the existing 'stac' instruction is much higher, and in fact
depending on how lfence works (and how stac affects the memory unit
pipeline), it might even _help_ to serialize the stac with the actual
address.

Linus


Re: [PATCH 06/18] x86, barrier: stop speculation for failed access_ok

2018-01-05 Thread Linus Torvalds
On Fri, Jan 5, 2018 at 6:52 PM, Linus Torvalds
<torva...@linux-foundation.org> wrote:
>
> The fact is, we have to stop speculating when access_ok() does *not*
> fail - because that's when we'll actually do the access. And it's that
> access that needs to be non-speculative.

I also suspect we should probably do this entirely differently.

Maybe the whole lfence can be part of uaccess_begin() instead (ie
currently 'stac()'). That would fit the existing structure better, I
think. And it would avoid any confusion about the whole "when to stop
speculation".

 Linus


Re: [PATCH 01/18] asm-generic/barrier: add generic nospec helpers

2018-01-05 Thread Linus Torvalds
On Fri, Jan 5, 2018 at 5:09 PM, Dan Williams  wrote:
> +#ifndef nospec_ptr
> +#define nospec_ptr(ptr, lo, hi)  
>   \

Do we actually want this horrible interface?

It just causes the compiler - or inline asm - to generate worse code,
because it needs to compare against both high and low limits.

Basically all users are arrays that are zero-based, and where a
comparison against the high _index_ limit would be sufficient.

But the way this is all designed, it's literally designed for bad code
generation for the unusual case, and the usual array case is written
in the form of the unusual and wrong non-array case. That really seems
excessively stupid.

 Linus


Re: [PATCH 06/18] x86, barrier: stop speculation for failed access_ok

2018-01-05 Thread Linus Torvalds
On Fri, Jan 5, 2018 at 5:10 PM, Dan Williams  wrote:
> From: Andi Kleen 
>
> When access_ok fails we should always stop speculating.
> Add the required barriers to the x86 access_ok macro.

Honestly, this seems completely bogus.

The description is pure garbage afaik.

The fact is, we have to stop speculating when access_ok() does *not*
fail - because that's when we'll actually do the access. And it's that
access that needs to be non-speculative.

That actually seems to be what the code does (it stops speculation
when __range_not_ok() returns false, but access_ok() is
!__range_not_ok()). But the explanation is crap, and dangerous.

 Linus

