Re: [GIT PULL] sysctl constification changes for v6.11-rc1

2024-07-25 Thread Linus Torvalds
On Wed, 24 Jul 2024 at 14:00, Joel Granados  wrote:
>
> This is my first time sending out a semantic patch, so get back to me if
> you have issues or prefer some other way of receiving it.

Looks fine to me.

Sometimes if it's just a pure scripting change, people send me the
script itself and just ask me to run it as a final thing before the
rc1 release or something like that.

But since in practice there's almost always some additional manual
cleanup, doing it this way with the script documented in the commit is
typically the right way to go.

This time it was details like whitespace alignment, sometimes it's
"the script did 95%, but there was another call site that also needed
updating", or just a documentation update to go in together with the
change or whatever.

Anyway, pulled and just going through my build tests now.

  Linus

___
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec


Re: [GIT PULL] keys: Miscellaneous fixes/changes

2023-03-21 Thread Linus Torvalds
On Tue, Mar 21, 2023 at 12:16 PM Jens Axboe  wrote:
>
> I haven't seen the patch yet as it hasn't been pushed,

Well, it went out a couple of minutes before your email, so it's out now.

> It may make sense to add some debug check for
> PF_KTHREAD having TIF_NOTIFY_RESUME set, or task_work pending for that
> matter, as that is generally not workable without doing something to
> handle it explicitly.

Yeah, I guess we could have some generic check for that. I'm not sure
where it would be. Scheduler?

   Linus

___
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec


Re: [GIT PULL] keys: Miscellaneous fixes/changes

2023-03-21 Thread Linus Torvalds
On Tue, Mar 21, 2023 at 9:43 AM David Howells  wrote:
>
>  (1) Fix request_key() so that it doesn't cache a looked up key on the
>  current thread if that thread is a kernel thread.  The cache is
>  cleared during notify_resume - but that doesn't happen in kernel
>  threads.  This is causing cifs DNS keys to be un-invalidateable.

I've pulled this, but I'd like people to look a bit more at this.

The issue with TIF_NOTIFY_RESUME is that it is only done on return to
user space.

And these days, PF_KTHREAD isn't the only case that never returns to
user space. PF_IO_WORKER has the exact same behaviour.

Now, to counteract this, as of this merge window (and marked for
stable) IO threads do a fake "return to user mode" handling in
io_run_task_work(), and so I think we're all good, but I'd like people
to at least think about this.

  Linus

___
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec


Re: [PATCH RFC 1/2] coding-style.rst: document BUG() and WARN() rules ("do not crash the kernel")

2022-08-29 Thread Linus Torvalds
On Sun, Aug 28, 2022 at 9:49 PM John Hubbard  wrote:
>
> ...here. I count ~1000 calls to panic() in today's kernel, to a
> function in kernel/panic.c that shows no hint of being removed, nor
> even deprecated.

Heh. I guess we never finished the panic() removal.

It's been decades, I suspect we ended up deciding that the bootup
failures might as well continue to panic.

Anyway, please don't use it. It's one of those things that should
never ever trigger, and mainly for something like "oops, I ran out of
memory during boot" etc.

Oh, I'm sure it's crept into other places too, but that doesn't make it ok.

 Linus

___
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec


Re: [PATCH RFC 1/2] coding-style.rst: document BUG() and WARN() rules ("do not crash the kernel")

2022-08-28 Thread Linus Torvalds
On Sun, Aug 28, 2022 at 6:56 PM Dave Young  wrote:
>
> > John mentioned PANIC_ON().
>
> I would vote for PANIC_ON(), it sounds like a good idea, because
> BUG_ON() is not obvious and, PANIC_ON() can alert the code author that
> this will cause a kernel panic and one will be more careful before
> using it.

People, NO.

We're trying to get rid of BUG_ON() because it kills the machine.

Not replace it with another bogus thing that kills a machine.

So no PANIC_ON(). We used to have "panic()" many many years ago, we
got rid of it. We're not re-introducing it.

People who want to panic on warnings can do so. WARN_ON() _becomes_
PANIC for those people. But those people are the "we have a million
machines, we want to just fail things on any sign of trouble, and we
have MIS people who can look at the logs".

And it's not like we need to get rid of _all_ BUG_ON() cases. If you
have a "this is major internal corruption, there's no way we can
continue", then BUG_ON() is appropriate. It will try to kill that
process and try to keep the machine running, and again, the kind of
people who don't care about one machine (because - again - they have
millions of them) can just turn that into a panic-and-reboot
situation.

But the kind of people for whom the machine they are on IS THEIR ONLY
MACHINE - whether it be a workstation, a laptop, or a cellphone -
there is absolutely zero situation where "let's just kill the machine"
is *EVER* approproate. Even a BUG_ON() will try to continue as well as
it can after killing the current thread, but it's going to be iffy,
because locking etc.

So WARN_ON_ONCE() is the thing to aim for. BUG_ON() is the thing for
"oops, I really don't know what to do, and I physically *cannot*
continue" (and that is *not* "I'm too lazy to do error handling").

There is no room for PANIC. None. Ever.

The only thing there is are "I don't care about this machine because
I've got 999,999 other machines, so I'd rather take one machine
offline for analysis".

Understand? The "should I panic and reboot" is fundamentally not about
the code, and it's not a choice that the kernel code gets to make.
It's purely about the choice of the person maintaining the machine.

As a kernel developer, you do not EVER get to say "panic" or "kill the machine".

End of story.

 Linus

___
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec


Re: [PATCH printk 3/3] printk: remove dict ring

2020-09-17 Thread Linus Torvalds
On Thu, Sep 17, 2020 at 6:16 AM John Ogness  wrote:
>
> Since there is no code that will ever store anything into the dict
> ring, remove it. If any future dictionary properties are to be
> added, these should be added to the struct printk_info.

Ack. I'm very happy to see the dictionary stuff go. It was one of
those over-designed things in the printk rewrite years and years ago.

  Linus

___
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec


Re: POC: Alternative solution: Re: [PATCH 0/4] printk: reimplement LOG_CONT handling

2020-08-14 Thread Linus Torvalds
On Fri, Aug 14, 2020 at 4:52 PM Joe Perches  wrote:
>
> On Fri, 2020-08-14 at 15:46 -0700, Linus Torvalds wrote:
> >
> > This is why I think any discussion that says "people should buffer
> > their lines themselves and we should get rid if pr_cont()" is
> > fundamentally broken.
> >
> > Don't go down that hole. I won't take it. It's wrong.
>
> I don't think it's wrong per se.

It's *absolutely* and 100% wrong.

Yes, any random *user* of pr_cont() can decide to buffer on it's own.

But when the discussion is about printk() - the implementation, not
the users - then it's complete and utter BS to talk about trying to
get rid of pr_cont().

See the difference?

Linus

___
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec


Re: POC: Alternative solution: Re: [PATCH 0/4] printk: reimplement LOG_CONT handling

2020-08-14 Thread Linus Torvalds
On Thu, Aug 13, 2020 at 4:54 AM Sergey Senozhatsky
 wrote:
>
> I think what Linus said a long time ago was that the initial purpose of
> pr_cont was
>
> pr_info("Initialize feature foo...");
> if (init_feature_foo() == 0)
> pr_cont("ok\n");
> else
> pr_cont("not ok\n");
>
> And if init_feature_foo() crashes the kernel then the first printk()
> form panic() will flush the cont buffer.

Right.

This is why I think any discussion that says "people should buffer
their lines themselves and we should get rid if pr_cont()" is
fundamentally broken.

Don't go down that hole. I won't take it. It's wrong.

The fact is, pr_cont() goes back to the original kernel. No, it wasn't
pr_cont() back then, and no, there were no actual explicit markers for
"this is a continuation" at all, it was all just "the last printk
didn't have a newline, so we continue where we left off".

We've added pr_cont (and KERN_CONT) since then, and I realize that a
lot of people hate the complexity it introduces, but it's a
fundamental complexity that you have to live with.

If you can't live with pr_cont(), you shouldn't be working on
printk(), and find some other area of the kernel that you _can_ live
with.

It really is that simple.

Linus

___
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec


Re: [PATCH 2/4] printk: store instead of processing cont parts

2020-07-21 Thread Linus Torvalds
On Tue, Jul 21, 2020 at 7:42 AM Sergey Senozhatsky
 wrote:
>
> OK, so basically, extending printk_caller_id() so that for IRQ/NMI
> we will have more info than just "0x8000 + raw_smp_processor_id()".

I think it's really preempt_count() that we want to have there.

That has the softirq/hardirq/NMI information in it.

So the "context" identifier of an incomplete write would be { task,
cpu, preempt_count() } of the writer. If a new KERN_CONT writer comes
in, it would have to match in that context to actually combine.

You can probably play "hide the bits" tricks to not *ac tually* use
three words for it. The task structure in particular tends to be very
aligned, you could hide bits in the low bits there. The CPU number
would fit in there, for example.

 Linus

___
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec


Re: [PATCH 2/4] printk: store instead of processing cont parts

2020-07-20 Thread Linus Torvalds
On Sun, Jul 19, 2020 at 6:51 PM Sergey Senozhatsky
 wrote:
>
> Do I get it right, what you are saying is - when we process a PR_CONT
> message the cont buffer should already contain previous non-LOG_NEWLINE
> and non-PR_CONT message, otherwise it's a bug?

No.

I'm saying that the code that does PR_CONT should have done *some*
printing before, otherwise it's at the very least questionable.

IOW, you can't just randomly start printing with PR_CONT, without
having established _some_ context for it.

But that context could be a previous newline you created (the PR_CONT
will be a no-op). That's certainly useful for printing a header and
then after that printing possible other complex data that may or may
not have line breaks in it.

So your example looks fine. The context starts out with

pr_warn("which would create a new lock dependency:\n");

and after that you can use KERN_CONT / pr_cont() as much as you want,
since you've established a context for what you're printing.

And then it ends with 'pr_cont("\n")'.

So anything that was interrupted by this, and uses KERN_CONT /
pr_cont() will have no ambiguous issues. The code you pointed at both
started and ended a line.

That said, we have traditionally used not just "current process", but
also "last irq-level" as the context information, so I do think it
would be good to continue to do that.

At that point, "an interrupt printed something in the middle" isn't
even an issue any more, because it's clear that the context has
changed.

 Linus

___
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec


Re: [PATCH 2/4] printk: store instead of processing cont parts

2020-07-19 Thread Linus Torvalds
On Sun, Jul 19, 2020 at 7:35 AM Sergey Senozhatsky
 wrote:
>
> Can we merge lines that we don't want to merge?
>
>pr_cont()  ->  IRQ -> pr_cont() -> NMI -> pr_cont()

That pr_cont in either IRQ or NMI context would be a bug.

You can't validly have a PR_CONT without the non-cont that precedes it.

Linus

___
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec


Re: [PATCH 0/4] printk: reimplement LOG_CONT handling

2020-07-18 Thread Linus Torvalds
On Sat, Jul 18, 2020 at 7:43 AM John Ogness  wrote:
>
> I expect this is handled correctly since the reader is not given any
> parts until a full line is ready, but I will put more focus on testing
> this to make sure.

Yeah, the patches looked fine, but I only scanned them, and just
wanted to make sure.

Over the years, we've gotten printk wrong so many times that I get a
bit paranoid. Things can look fine on the screen, but then have odd
line breaks in the logs. Or vice versa. Or work fine on some machine,
but consistently show some race on another.

And some of the more complex features are hardly ever actually used -
I'm not sure the optional message context (aka dictionary) is ever
actually used.

Yes, all the "dev_printk()" helpers fill it in with the device
information (create_syslog_header()), and you _can_ use them if you
know about them (ie

journalctl -b _KERNEL_SUBSYSTEM=pci_bus

but I sometimes wonder how many people use all this complexity. And
how many people even know about it..)

So there are hidden things in there that can easily break *subtly* and
then take ages for people to notice, because while some are very
obvious indeed ("why is my module list message broken up into a
hundred lines?") others might be things people aren't even aware of.

Maybe a lot of system tools use those kernel dictionaries. Maybe it
would break immediately. I just sometimes wonder...

 Linus

___
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec


Re: [PATCH 0/4] printk: reimplement LOG_CONT handling

2020-07-17 Thread Linus Torvalds
On Fri, Jul 17, 2020 at 4:48 PM John Ogness  wrote:
>
> Using placeholders avoids tools such as systemd-journald from
> erroneously reporting missed messages. However, it also means that
> empty placeholder records are visible in systemd-journald logs and
> displayed in tools such as dmesg.

As long as the readers still reliably do the joining, this is fine.

HOWEVER.

Make sure you test the case of "fast concurrent readers". The last
time we did things like this, it was a disaster, because a concurrent
reader would see and return the _incomplete_ line, and the next entry
was still being generated on another CPU.

The reader would then decide to return that incomplete line, because
it had something.

And while in theory this could then be handled properly in user space,
in practice it wasn't. So you'd see a lot of logging tools that would
then report all those continuations as separate log events.

Which is the whole point of LOG_CONT - for that *not* to happen.

So this is just a heads-up that I will not pull something that breaks
LOG_CONT because it thinks "user space can handle it". No. User space
does not handle it, and we need to handle it for the user.

So as long as the kernel makes sure the joining does happen it at some
point, it's all fine. It obviously doesn't have to happen at printk()
time, just as long as incomplete records aren't exposed even to
concurrent readers.

A test-case with a short delay in between writes might be a good idea,
although the last time this broke it was very clear in peoples syslogs
and dmesg because it turns out there are lots of concurrent readers at
boot time and _somebody_ will hit the race.

 Linus

___
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec


Re: [PATCH] x86/purgatory: Add -fno-stack-protector

2020-06-16 Thread Linus Torvalds
On Tue, Jun 16, 2020 at 3:25 PM Arvind Sankar  wrote:
>
> The purgatory Makefile removes -fstack-protector options if they were
> configured in, but does not currently add -fno-stack-protector.

I took this directly, since the build failure seems so annoying if you
happen to have an affected compiler and config.

Maybe that's not very common, but ..

  Linus

___
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec


Re: [PATCH v2 2/3] printk: add lockless buffer

2020-05-18 Thread Linus Torvalds
On Mon, May 18, 2020 at 6:04 AM John Ogness  wrote:
>
> The cmpxchg() needs to be moved out of the while condition so that a
> continue can be used as intended. Patch below.

This seems pointless and wrong (patch edited to remove the '-' lines
so that you see the end result):

> smp_mb(); /* LMM(data_push_tail:C) */
>
> +   if (atomic_long_try_cmpxchg_relaxed(_ring->tail_lpos,
> +   _lpos,
> +   next_lpos)) { /* LMM(data_push_tail:D) */
> +   break;
> +   }

Doing an "smp_mb()" followed by a "cmpxchg_relaxed" seems all kinds of
odd and pointless, and is very much non-optimal on x86 for example.,

Just remove the smp_mb(), and use the non-relaxed form of cmpxchg.
It's defined to be fully ordered if it succeeds (and if the cmpxchg
doesn't succeed, it's a no-op and the memory barrier shouldn't make
any difference).

Otherwise you'll do two memory ordering operations on x86 (and
probably some other architectures), since the cmpxchg is always
ordered on x86 and there exists no "relaxed" form of it.

  Linus

___
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec


Re: [PATCH] x86/kexec: prefer _THIS_IP_ to current_text_addr

2018-08-20 Thread Linus Torvalds
On Mon, Aug 20, 2018 at 10:58 AM Nick Desaulniers
 wrote:
>
> + akpm, Linus
>
> Bumping for review.

Ugh. I am not personally a huge fan of this endless "fix up one at a time".

Just do a patch that removes current_text_addr() entirely and be done
with it, if that's what we want the end result to be.

Don't bother with these small "let's remove the remaining ones one by
one". Just get it over and done with.

Linus

___
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec


Re: [GIT PULL] remove in-kernel calls to syscalls

2018-04-02 Thread Linus Torvalds
On Mon, Apr 2, 2018 at 12:04 PM, Dominik Brodowski
 wrote:
>
> This patchset removes all in-kernel calls to syscall functions in the
> kernel with the exception of arch/.

Ok, this finished off my arch updates for today, I'll probably move on
to driver pulls tomorrow.

Anyway, it's in my tree, will push out once my test build finishes.

Linus

___
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec


Re: kexec reboot fails with extra wbinvd introduced for AME SME

2018-01-17 Thread Linus Torvalds
On Wed, Jan 17, 2018 at 6:57 PM, Dave Young  wrote:
>
> Could you say more about how to check this? My .config disabled
> CONFIG_X86_MCE, should this be enabled?

By all means, try it.

Some pending machine check exception that we have *not* reacted to,
and that is pending around kexec boot could very possibly be relevant.

I forget your exact symptoms. Was it that the _first_ kexec works, but
the second one hangs? Or was that the other unrelated issue? Or have I
confused this with some other report entirely?

   Linus

___
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec


Re: kexec reboot fails with extra wbinvd introduced for AME SME

2018-01-17 Thread Linus Torvalds
On Wed, Jan 17, 2018 at 6:47 PM, Dave Young  wrote:
> Did several quick tests, probably need more tests, but till now the
> results are:
>
> void stop_this_cpu(void *dummy)
> {
> => add wbinvd here: kexec works
> local_irq_disable();
> => add wbinvd here: kexec works
> /*
>  * Remove this CPU:
>  */
> set_cpu_online(smp_processor_id(), false);
> => add wbinvd here: kexec does not work

Funky.

> So it seems that it will not work after cpu offined..

Well, that set_cpu_online() call really just clears a bit in our
'__cpu_online_mask' CPU mask. It doesn't really do anything to the
*hardware*.

But I do wonder if the wbinvd causes an SMI or something on your
system. I _think_ wbinvd causes some external pin to be wiggled just
to tell possible external cache hardware to flush too, and on a system
level that could be tied to some random thing.

And then if we get an SMI/NMI when we've marked the system offline,
maybe we do something odd.

Very odd. But maybe this makes somebody go "Duh, that's because of xyz.."

Linus

___
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec


Re: kexec reboot fails with extra wbinvd introduced for AME SME

2018-01-17 Thread Linus Torvalds
On Wed, Jan 17, 2018 at 5:47 PM, Dave Young  wrote:
>
> It does not work with just once wbinvd(), and it only works with
> removing the wbinvd() for me.  Tom's new post works for me as well
> since my cpu is an Intel i5-4200U.

Intriguing.

It's not like the wbinvd really should be that much of a deal.

I think Tom's patch is fine and should be applied, but it does worry
me a bit that even a single wbinvd makes that much of a difference for
you. There is very little logical reason I can think of that a wbinvd
should make any difference what-so-ever on an i5-4200U.

I wonder if you have some system issues, and wbinvd just happens to
trigger them. But I think we do wbinvd before a suspend-to-RAM too
(it's "ACPI_FLUSH_CPU_CACHE()" in the ACPI code). And the dmr code
dioes "wbinvd_on_all_cpus()" which does a cross-call etc.

Would you mind experimenting a bit with that wbinvd?

In particular, what happens if you enable it (so it's not hidden by
the SME check), but you move it up to before interrupts are disabled?

I'm wondering if there is some issue with MCE generation and wbinvd
and whatever, and doing it when the CPU is down and interrupts are
disabled causes some system issue..

Does anybody have any other ideas?

Linus

___
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec


Re: kexec reboot fails with extra wbinvd introduced for AME SME

2018-01-17 Thread Linus Torvalds
On Wed, Jan 17, 2018 at 11:42 AM, Linus Torvalds
<torva...@linux-foundation.org> wrote:
>
>  [ .. ]Some of the errata
> around SME have been about machine check exceptions or something.

That should be "some of the errata around wbinvd". I have no idea if
there have been SME issues.

That said, the really bad old wbinvd bug (which hung the system iirc)
was for some old PPro CPU's. Googling around, the errata I find seem
either irrelevant (eip corruption in 16-bit mode) or fairly mild (odd
behavior wrt parity errors).

So I have this memory of wbinvd having problems, but that memory does
seem to be largely historical. But Dave Young clearly sees *something*
odd going on.

   Linus

___
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec


Re: kexec reboot fails with extra wbinvd introduced for AME SME

2018-01-17 Thread Linus Torvalds
On Tue, Jan 16, 2018 at 11:22 PM, Dave Young  wrote:
>
> For the kexec reboot hang, if I remove the wbinvd in stop_this_cpu()
> then kexec works fine. like this:

Honestly, I think we should apply that patch regardless.

Using 'wbinvd' should not be some "just because of random reasons".
There are CPU's with errata on wbinvd, and the thing in general is
slow and nasty.

Doing the wbinvd in a loop sounds even stranger.

If we're only doing it because of some SME issue, why isn't it
dependent on SME? And why is it inside that loop at all?

Anyway, does it work for you if you just do the wbinvd() once, outside
the loop? Admittedly the loop shouldn't actually loop (hlt with
interrupts disabled), but who the hell knows.. Some of the errata
around SME have been about machine check exceptions or something.

See commit a68e5c94f7d3 ("x86, hotplug: Move WBINVD back outside the
play_dead loop") for another example where wbinvd was inside a loop
and apparently caused some odd issues.

  Linus

___
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec


Re: [PATCH v1 1/2] x86/mm, kexec: Fix memory corruption with SME on successive kexecs

2017-07-27 Thread Linus Torvalds
On Thu, Jul 27, 2017 at 7:15 AM, Tom Lendacky  wrote:
>
> I can #ifdef the wbinvd based on whether AMD_MEM_ENCRYPT is configured
> or not so that the wbinvd is avoided if not configured.

I suspect an ifdef will be useless, since things like distro kernels
tend to enable everything.

So it should probably be disabled dynamically, and only done if the
AMD memory encryption thing has actually been active.

[ There have also been various actual errata with wbinvd, so there
tends to be a non-performance reason to try to avoid it unless
strictly required ]

   Linus

___
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec


Re: Removal of the kernel code/data/bss resources does break kexec/kdump

2016-04-19 Thread Linus Torvalds
On Tue, Apr 19, 2016 at 2:04 AM, Dave Young  wrote:
>
> It is not clear how to handle it, maybe we can assume nobody is using it as
> non-root, leave it as is or just add |CAP_SYS_BOOT for /proc/iomem?

Pretty much nobody uses fine-grained capabilities anyway - they are
one of those bad security things that generally add more complexity
than value(*) - so I wouldn't worry about it unless you actually find
something that cares.

 Linus

(*) The one exception tends to be certain network services that can
use CAP_NET_BIND_SERVICE like things to really lower their attack
surface. But certainly not one-time things like kexec.

___
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec


Re: Removal of the kernel code/data/bss resources does break kexec/kdump

2016-04-15 Thread Linus Torvalds
On Fri, Apr 15, 2016 at 8:46 AM, Emrah Demir  wrote:
>
> file_ns_capable bring some problems.

No it does not. file_ns_capable() is _required_ for security. We have
had several security issues with file IO doing "capable()", and it's
wrong and insecure.

> I used capable and now there is no problem as far as I tested.

You just screwed up the security, and with your change, a suid
application can be fooled into making the hidden data available to
non-secure users.

"capable()" is wrong. For file reading,  you *have* to use
file_ns_capable(). It really is that simple. You should not test the
capabilities of the process, you should be testing the capabilities of
the file descriptor, which comes from the *open-time* capabilities.

It sounds like you applied just the patch to kernel/resource.c,
without applying the infrastructure patch.

You also need commit 34dbbcdbf633 ("Make file credentials available to
the seqfile interfaces").

  Linus

___
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec


Re: Removal of the kernel code/data/bss resources does break kexec/kdump

2016-04-14 Thread Linus Torvalds
On Thu, Apr 14, 2016 at 1:27 PM, Emrah Demir <e...@abdsec.com> wrote:
> On 2016-04-14 13:40, Linus Torvalds wrote:
>>
>>
>> Actually, %pK is horrible in /proc and /sys files, and does the wrong
>> thing.
>
> I agree with that, but for now there is no way to make things right in /proc
> or /sys.

Well, there is now.

I've pushed out my attempt at fixing things properly. Please check
that kexec works - and if kexec ends up reading that file as non-root,
I don't know what to say/do.

Here's the three relevant cases:

   cat /proc/iomem
   sudo cat /proc/iomem
   sudo cat < /proc/iomem

and two of them will now show the resource ranges as just plain
zeroes. But yes, it needed extra infrastructure to be able to get this
right.

Note that while %pK is always wrong in /proc and /sys files, in this
case it would have been particularly wrong, since the values can be
64-bit even on a 32-bit architecture, so trying to show them as
pointers would have gotten not just the capability handling wrong, it
would have truncated a 64-bit value to 32 bits in that case.

   Linus

___
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec


Re: Removal of the kernel code/data/bss resources does break kexec/kdump

2016-04-14 Thread Linus Torvalds
On Thu, Apr 14, 2016 at 4:07 AM, Emrah Demir  wrote:
>
> Kees Cook proposed to write a %pK formatted patch. This would solve most of
> the problems.

Actually, %pK is horrible in /proc and /sys files, and does the wrong thing.

It uses the current creds for deciding what to do, which is exactly
the wrong thing (for all the usual reasons) for a file access from a
security standpoint. Sadly, almost every use of %pK gets this wrong.
Thankfully, it's much less of a problem for reads than for writes, but
it's still wrong.

A file access should use "file->f_cred", but the seq_file interface
sadly doesn't expose any way to do that.

I'll take a look, but it's non-trivial to get right. %pK turns out to
have been seriously mis-designed, and is basically almost always a
bug.

  Linus

___
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec


Re: 2.6.34-rc4 : OOPS in unmap_vma

2010-04-14 Thread Linus Torvalds


On Wed, 14 Apr 2010, Borislav Petkov wrote:
 
 hmm, it doesn't look like it. Your code translates to something like
 
0:   b8 00 00 00 00  mov$0x0,%eax
5:   80 ff ffcmp$0xff,%bh
8:   ff 48 21decl   0x21(%rax)
b:   45 80 48 8b 45  rex.RB orb$0x45,-0x75(%r8)
   10:   80 48 ff c8 orb$0xc8,-0x1(%rax)

There's a large constant (0xff80) in there at the beginning, 
and the disassembly hasn't found the start of the next instruction very 
cleanly. The same is true at the end: another large constant is cut off in 
the middle. 

The byte just before the dumped instruction stream is almost certainly 
'48h', and the last byte of the last constant is 0xff, and the disassembly 
ends up being:

   0:   48 b8 00 00 00 00 80mov$0xff80,%rax
   7:   ff ff ff 
   a:   48 21 45 80 and%rax,-0x80(%rbp)
   e:   48 8b 45 80 mov-0x80(%rbp),%rax
  12:   48 ff c8dec%rax
  15:   48 3b 85 40 ff ff ffcmp-0xc0(%rbp),%rax
  1c:   48 8b 85 50 ff ff ffmov-0xb0(%rbp),%rax
  23:   48 0f 42 7d 80  cmovb  -0x80(%rbp),%rdi
  28:   48 89 7d 80 mov%rdi,-0x80(%rbp)
  2c:*  48 8b 38mov(%rax),%rdi -- trapping instruction
  2f:   48 85 fftest   %rdi,%rdi
  32:   0f 84 f5 04 00 00   je 0x52d
  38:   48 b8 fb 0f 00 00 00mov$0xcffb,%rax
  3f:   c0 ff ff 

But yes, you found the right spot (that 0xff80 constant is 
-549755813888 decimal):

 which I could correlate with what I get here (comments added):

Yup. Close enough. Btw, it's often good to look at both the *.s code _and_ 
the *.lst code. If you do make mm/memory.lst, you'll find those big 
constants easily, and then you'll see the code this way:

do {
next = pgd_addr_end(addr, end);
81b2aa45:   48 b8 00 00 00 00 80mov
$0x80,%rax
81b2aa4c:   00 00 00
81b2aa4f:   49 8d 04 04 lea
(%r12,%rax,1),%rax
81b2aa53:   48 89 45 a8 mov%rax,-0x58(%rbp)
81b2aa57:   48 b8 00 00 00 00 80mov
$0xff80,%rax
81b2aa5e:   ff ff ff
81b2aa61:   48 21 45 a8 and%rax,-0x58(%rbp)
81b2aa65:   48 8b 45 b8 mov-0x48(%rbp),%rax
81b2aa69:   48 8b 55 a8 mov-0x58(%rbp),%rdx
81b2aa6d:   48 ff c8dec%rax
81b2aa70:   48 ff cadec%rdx   
81b2aa73:   48 39 c2cmp%rax,%rdx
81b2aa76:   48 8b 45 b8 mov-0x48(%rbp),%rax
81b2aa7a:   48 8b 55 90 mov-0x70(%rbp),%rdx
81b2aa7e:   48 0f 42 45 a8  cmovb  -0x58(%rbp),%rax
81b2aa83:   48 89 45 a8 mov%rax,-0x58(%rbp)
81b2aa87:   48 8b 02mov(%rdx),%rax
void pud_clear_bad(pud_t *);
void pmd_clear_bad(pmd_t *);

static inline int pgd_none_or_clear_bad(pgd_t *pgd)
{
if (pgd_none(*pgd))
81b2aa8a:   48 85 c0test   %rax,%rax
81b2aa8d:   74 20   je 81b2aaaf 
unmap_vmas+0x228
return 1;
if (unlikely(pgd_bad(*pgd))) {
81b2aa8f:   48 ba fb 0f 00 00 00mov
$0xcffb,%rdx
81b2aa96:   c0 ff ff
81b2aa99:   48 21 c2and%rax,%rdx
81b2aa9c:   48 83 fa 63 cmp$0x63,%rdx
81b2aaa0:   0f 84 d9 04 00 00   je 81b2af7f 
unmap_vmas+0x6f8

although Parag's compiler has generated much better code (possibly due to 
config differences, possibly due to compiler versions)

 So you oops when dereferencing that pgd value in %rax (%rdx in my case),
 *pgd in pgd_none_or_clear_bad(pgd) which is called in the below fragment
 of unmap_page_range().
 
   pgd = pgd_offset(vma-vm_mm, addr);
   do {
   next = pgd_addr_end(addr, end);
   if (pgd_none_or_clear_bad(pgd)) {
   (*zap_work)--;
   continue;
   }
   next = zap_pud_range(tlb, vma, pgd, addr, next,
   zap_work, details);
   } while (pgd++, addr = next, (addr != end  *zap_work  0));

Correct.

 so it looks like it tries to find a page table rooted at that address
 but the pointer value of 2203 is bogus.

Yes, it does look like some strange page table corruption, doesn't look 
anon_vma 

Re: [PATCH] kexec jump: fix compiling warning on xchg(kexec_lock, 0) in kernel_kexec()

2008-08-13 Thread Linus Torvalds


On Wed, 13 Aug 2008, Huang Ying wrote:

 - xchg(kexec_lock, 0);
 + locked = xchg(kexec_lock, 0);
 + BUG_ON(!locked);

Why do you want to do this at all? 

And why do you implement your locks with xchg() in the first place? That's 
total and utter crap.

Hint: we have _real_ locking primitives in the kernel.

Linus

___
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec


Re: [PATCH] kexec jump: fix compiling warning on xchg(kexec_lock, 0) in kernel_kexec()

2008-08-13 Thread Linus Torvalds


On Wed, 13 Aug 2008, Andrew Morton wrote:

 We don't need to create that local.  I queued this:

No, please don't. 

Just don't take this whole patch-series until it's cleaned up. There is 
absolutely no excuse for using xchg as a locking primitive. Nothing like 
this should be queued anywhere, it should be burned and the ashes should 
be scattered over the atlantic so that nobody will ever see them again.

F*ck me with a spoon, if you have to use xchg() to do a trylock, why the 
hell isn't the unlock sequence then

smp_mb();
var = 0;

instead? Not that that's really right either, but at least it avoids the 
_ridiculous_ crap. The real solution is probably to use a spinlock and 
trylock/unlock.

Linus

___
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec


Re: [PATCH] kexec jump: fix compiling warning on xchg(kexec_lock, 0) in kernel_kexec()

2008-08-13 Thread Linus Torvalds


On Wed, 13 Aug 2008, Andrew Morton wrote:
 - * in interrupt context :)
 + * Return true if we acquired the lock
   */
 -static int kexec_lock;
 +static inline bool kexec_trylock(void)
 +{
 + return !test_and_set_bit(0, kexec_bitlock);

Nope. That needs to be an unsigned long.

But more importantl, why not just make it a lock in the first place?

static DEFINE_SPINLOCK(kexec_lock);

#define kexec_trylock() spin_trylock(kexec_lock)
#define kexec_unlock() spin_unlock(kexec_lock)

and then you get it all right and clear and obvious.

Yeah, and I didn't check whether there is anything that is supposed to be 
able to sleep. If there is, use a mutex instead of a spinlock, of course.

Linus

___
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec


Re: [PATCH] kexec jump: fix compiling warning on xchg(kexec_lock, 0) in kernel_kexec()

2008-08-13 Thread Linus Torvalds


On Wed, 13 Aug 2008, Andrew Morton wrote:
 
 #2:

I thought you said there were things that want to sleep in the region?

If so, spinlocks will work as long as you don't have CONFIG_PREEMPT or 
lock validation (there's no way to deadlock thanks to all the lock getters 
using the trylock variant), but will blow up because a successful 
trylock will obviously also disable preemption and/or trigger all the lock 
detection.

So if there are potential sleepers, you'd need the mutex instead.

Linus

___
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec