Re: [GIT PULL] sysctl constification changes for v6.11-rc1
On Wed, 24 Jul 2024 at 14:00, Joel Granados wrote: > > This is my first time sending out a semantic patch, so get back to me if > you have issues or prefer some other way of receiving it. Looks fine to me. Sometimes if it's just a pure scripting change, people send me the script itself and just ask me to run it as a final thing before the rc1 release or something like that. But since in practice there's almost always some additional manual cleanup, doing it this way with the script documented in the commit is typically the right way to go. This time it was details like whitespace alignment, sometimes it's "the script did 95%, but there was another call site that also needed updating", or just a documentation update to go in together with the change or whatever. Anyway, pulled and just going through my build tests now. Linus ___ kexec mailing list kexec@lists.infradead.org http://lists.infradead.org/mailman/listinfo/kexec
Re: [GIT PULL] keys: Miscellaneous fixes/changes
On Tue, Mar 21, 2023 at 12:16 PM Jens Axboe wrote: > > I haven't seen the patch yet as it hasn't been pushed, Well, it went out a couple of minutes before your email, so it's out now. > It may make sense to add some debug check for > PF_KTHREAD having TIF_NOTIFY_RESUME set, or task_work pending for that > matter, as that is generally not workable without doing something to > handle it explicitly. Yeah, I guess we could have some generic check for that. I'm not sure where it would be. Scheduler? Linus ___ kexec mailing list kexec@lists.infradead.org http://lists.infradead.org/mailman/listinfo/kexec
Re: [GIT PULL] keys: Miscellaneous fixes/changes
On Tue, Mar 21, 2023 at 9:43 AM David Howells wrote: > > (1) Fix request_key() so that it doesn't cache a looked up key on the > current thread if that thread is a kernel thread. The cache is > cleared during notify_resume - but that doesn't happen in kernel > threads. This is causing cifs DNS keys to be un-invalidateable. I've pulled this, but I'd like people to look a bit more at this. The issue with TIF_NOTIFY_RESUME is that it is only done on return to user space. And these days, PF_KTHREAD isn't the only case that never returns to user space. PF_IO_WORKER has the exact same behaviour. Now, to counteract this, as of this merge window (and marked for stable) IO threads do a fake "return to user mode" handling in io_run_task_work(), and so I think we're all good, but I'd like people to at least think about this. Linus ___ kexec mailing list kexec@lists.infradead.org http://lists.infradead.org/mailman/listinfo/kexec
Re: [PATCH RFC 1/2] coding-style.rst: document BUG() and WARN() rules ("do not crash the kernel")
On Sun, Aug 28, 2022 at 9:49 PM John Hubbard wrote: > > ...here. I count ~1000 calls to panic() in today's kernel, to a > function in kernel/panic.c that shows no hint of being removed, nor > even deprecated. Heh. I guess we never finished the panic() removal. It's been decades, I suspect we ended up deciding that the bootup failures might as well continue to panic. Anyway, please don't use it. It's one of those things that should never ever trigger, and mainly for something like "oops, I ran out of memory during boot" etc. Oh, I'm sure it's crept into other places too, but that doesn't make it ok. Linus ___ kexec mailing list kexec@lists.infradead.org http://lists.infradead.org/mailman/listinfo/kexec
Re: [PATCH RFC 1/2] coding-style.rst: document BUG() and WARN() rules ("do not crash the kernel")
On Sun, Aug 28, 2022 at 6:56 PM Dave Young wrote: > > > John mentioned PANIC_ON(). > > I would vote for PANIC_ON(), it sounds like a good idea, because > BUG_ON() is not obvious and, PANIC_ON() can alert the code author that > this will cause a kernel panic and one will be more careful before > using it. People, NO. We're trying to get rid of BUG_ON() because it kills the machine. Not replace it with another bogus thing that kills a machine. So no PANIC_ON(). We used to have "panic()" many many years ago, we got rid of it. We're not re-introducing it. People who want to panic on warnings can do so. WARN_ON() _becomes_ PANIC for those people. But those people are the "we have a million machines, we want to just fail things on any sign of trouble, and we have MIS people who can look at the logs". And it's not like we need to get rid of _all_ BUG_ON() cases. If you have a "this is major internal corruption, there's no way we can continue", then BUG_ON() is appropriate. It will try to kill that process and try to keep the machine running, and again, the kind of people who don't care about one machine (because - again - they have millions of them) can just turn that into a panic-and-reboot situation. But the kind of people for whom the machine they are on IS THEIR ONLY MACHINE - whether it be a workstation, a laptop, or a cellphone - there is absolutely zero situation where "let's just kill the machine" is *EVER* approproate. Even a BUG_ON() will try to continue as well as it can after killing the current thread, but it's going to be iffy, because locking etc. So WARN_ON_ONCE() is the thing to aim for. BUG_ON() is the thing for "oops, I really don't know what to do, and I physically *cannot* continue" (and that is *not* "I'm too lazy to do error handling"). There is no room for PANIC. None. Ever. The only thing there is are "I don't care about this machine because I've got 999,999 other machines, so I'd rather take one machine offline for analysis". Understand? The "should I panic and reboot" is fundamentally not about the code, and it's not a choice that the kernel code gets to make. It's purely about the choice of the person maintaining the machine. As a kernel developer, you do not EVER get to say "panic" or "kill the machine". End of story. Linus ___ kexec mailing list kexec@lists.infradead.org http://lists.infradead.org/mailman/listinfo/kexec
Re: [PATCH printk 3/3] printk: remove dict ring
On Thu, Sep 17, 2020 at 6:16 AM John Ogness wrote: > > Since there is no code that will ever store anything into the dict > ring, remove it. If any future dictionary properties are to be > added, these should be added to the struct printk_info. Ack. I'm very happy to see the dictionary stuff go. It was one of those over-designed things in the printk rewrite years and years ago. Linus ___ kexec mailing list kexec@lists.infradead.org http://lists.infradead.org/mailman/listinfo/kexec
Re: POC: Alternative solution: Re: [PATCH 0/4] printk: reimplement LOG_CONT handling
On Fri, Aug 14, 2020 at 4:52 PM Joe Perches wrote: > > On Fri, 2020-08-14 at 15:46 -0700, Linus Torvalds wrote: > > > > This is why I think any discussion that says "people should buffer > > their lines themselves and we should get rid if pr_cont()" is > > fundamentally broken. > > > > Don't go down that hole. I won't take it. It's wrong. > > I don't think it's wrong per se. It's *absolutely* and 100% wrong. Yes, any random *user* of pr_cont() can decide to buffer on it's own. But when the discussion is about printk() - the implementation, not the users - then it's complete and utter BS to talk about trying to get rid of pr_cont(). See the difference? Linus ___ kexec mailing list kexec@lists.infradead.org http://lists.infradead.org/mailman/listinfo/kexec
Re: POC: Alternative solution: Re: [PATCH 0/4] printk: reimplement LOG_CONT handling
On Thu, Aug 13, 2020 at 4:54 AM Sergey Senozhatsky wrote: > > I think what Linus said a long time ago was that the initial purpose of > pr_cont was > > pr_info("Initialize feature foo..."); > if (init_feature_foo() == 0) > pr_cont("ok\n"); > else > pr_cont("not ok\n"); > > And if init_feature_foo() crashes the kernel then the first printk() > form panic() will flush the cont buffer. Right. This is why I think any discussion that says "people should buffer their lines themselves and we should get rid if pr_cont()" is fundamentally broken. Don't go down that hole. I won't take it. It's wrong. The fact is, pr_cont() goes back to the original kernel. No, it wasn't pr_cont() back then, and no, there were no actual explicit markers for "this is a continuation" at all, it was all just "the last printk didn't have a newline, so we continue where we left off". We've added pr_cont (and KERN_CONT) since then, and I realize that a lot of people hate the complexity it introduces, but it's a fundamental complexity that you have to live with. If you can't live with pr_cont(), you shouldn't be working on printk(), and find some other area of the kernel that you _can_ live with. It really is that simple. Linus ___ kexec mailing list kexec@lists.infradead.org http://lists.infradead.org/mailman/listinfo/kexec
Re: [PATCH 2/4] printk: store instead of processing cont parts
On Tue, Jul 21, 2020 at 7:42 AM Sergey Senozhatsky wrote: > > OK, so basically, extending printk_caller_id() so that for IRQ/NMI > we will have more info than just "0x8000 + raw_smp_processor_id()". I think it's really preempt_count() that we want to have there. That has the softirq/hardirq/NMI information in it. So the "context" identifier of an incomplete write would be { task, cpu, preempt_count() } of the writer. If a new KERN_CONT writer comes in, it would have to match in that context to actually combine. You can probably play "hide the bits" tricks to not *ac tually* use three words for it. The task structure in particular tends to be very aligned, you could hide bits in the low bits there. The CPU number would fit in there, for example. Linus ___ kexec mailing list kexec@lists.infradead.org http://lists.infradead.org/mailman/listinfo/kexec
Re: [PATCH 2/4] printk: store instead of processing cont parts
On Sun, Jul 19, 2020 at 6:51 PM Sergey Senozhatsky wrote: > > Do I get it right, what you are saying is - when we process a PR_CONT > message the cont buffer should already contain previous non-LOG_NEWLINE > and non-PR_CONT message, otherwise it's a bug? No. I'm saying that the code that does PR_CONT should have done *some* printing before, otherwise it's at the very least questionable. IOW, you can't just randomly start printing with PR_CONT, without having established _some_ context for it. But that context could be a previous newline you created (the PR_CONT will be a no-op). That's certainly useful for printing a header and then after that printing possible other complex data that may or may not have line breaks in it. So your example looks fine. The context starts out with pr_warn("which would create a new lock dependency:\n"); and after that you can use KERN_CONT / pr_cont() as much as you want, since you've established a context for what you're printing. And then it ends with 'pr_cont("\n")'. So anything that was interrupted by this, and uses KERN_CONT / pr_cont() will have no ambiguous issues. The code you pointed at both started and ended a line. That said, we have traditionally used not just "current process", but also "last irq-level" as the context information, so I do think it would be good to continue to do that. At that point, "an interrupt printed something in the middle" isn't even an issue any more, because it's clear that the context has changed. Linus ___ kexec mailing list kexec@lists.infradead.org http://lists.infradead.org/mailman/listinfo/kexec
Re: [PATCH 2/4] printk: store instead of processing cont parts
On Sun, Jul 19, 2020 at 7:35 AM Sergey Senozhatsky wrote: > > Can we merge lines that we don't want to merge? > >pr_cont() -> IRQ -> pr_cont() -> NMI -> pr_cont() That pr_cont in either IRQ or NMI context would be a bug. You can't validly have a PR_CONT without the non-cont that precedes it. Linus ___ kexec mailing list kexec@lists.infradead.org http://lists.infradead.org/mailman/listinfo/kexec
Re: [PATCH 0/4] printk: reimplement LOG_CONT handling
On Sat, Jul 18, 2020 at 7:43 AM John Ogness wrote: > > I expect this is handled correctly since the reader is not given any > parts until a full line is ready, but I will put more focus on testing > this to make sure. Yeah, the patches looked fine, but I only scanned them, and just wanted to make sure. Over the years, we've gotten printk wrong so many times that I get a bit paranoid. Things can look fine on the screen, but then have odd line breaks in the logs. Or vice versa. Or work fine on some machine, but consistently show some race on another. And some of the more complex features are hardly ever actually used - I'm not sure the optional message context (aka dictionary) is ever actually used. Yes, all the "dev_printk()" helpers fill it in with the device information (create_syslog_header()), and you _can_ use them if you know about them (ie journalctl -b _KERNEL_SUBSYSTEM=pci_bus but I sometimes wonder how many people use all this complexity. And how many people even know about it..) So there are hidden things in there that can easily break *subtly* and then take ages for people to notice, because while some are very obvious indeed ("why is my module list message broken up into a hundred lines?") others might be things people aren't even aware of. Maybe a lot of system tools use those kernel dictionaries. Maybe it would break immediately. I just sometimes wonder... Linus ___ kexec mailing list kexec@lists.infradead.org http://lists.infradead.org/mailman/listinfo/kexec
Re: [PATCH 0/4] printk: reimplement LOG_CONT handling
On Fri, Jul 17, 2020 at 4:48 PM John Ogness wrote: > > Using placeholders avoids tools such as systemd-journald from > erroneously reporting missed messages. However, it also means that > empty placeholder records are visible in systemd-journald logs and > displayed in tools such as dmesg. As long as the readers still reliably do the joining, this is fine. HOWEVER. Make sure you test the case of "fast concurrent readers". The last time we did things like this, it was a disaster, because a concurrent reader would see and return the _incomplete_ line, and the next entry was still being generated on another CPU. The reader would then decide to return that incomplete line, because it had something. And while in theory this could then be handled properly in user space, in practice it wasn't. So you'd see a lot of logging tools that would then report all those continuations as separate log events. Which is the whole point of LOG_CONT - for that *not* to happen. So this is just a heads-up that I will not pull something that breaks LOG_CONT because it thinks "user space can handle it". No. User space does not handle it, and we need to handle it for the user. So as long as the kernel makes sure the joining does happen it at some point, it's all fine. It obviously doesn't have to happen at printk() time, just as long as incomplete records aren't exposed even to concurrent readers. A test-case with a short delay in between writes might be a good idea, although the last time this broke it was very clear in peoples syslogs and dmesg because it turns out there are lots of concurrent readers at boot time and _somebody_ will hit the race. Linus ___ kexec mailing list kexec@lists.infradead.org http://lists.infradead.org/mailman/listinfo/kexec
Re: [PATCH] x86/purgatory: Add -fno-stack-protector
On Tue, Jun 16, 2020 at 3:25 PM Arvind Sankar wrote: > > The purgatory Makefile removes -fstack-protector options if they were > configured in, but does not currently add -fno-stack-protector. I took this directly, since the build failure seems so annoying if you happen to have an affected compiler and config. Maybe that's not very common, but .. Linus ___ kexec mailing list kexec@lists.infradead.org http://lists.infradead.org/mailman/listinfo/kexec
Re: [PATCH v2 2/3] printk: add lockless buffer
On Mon, May 18, 2020 at 6:04 AM John Ogness wrote: > > The cmpxchg() needs to be moved out of the while condition so that a > continue can be used as intended. Patch below. This seems pointless and wrong (patch edited to remove the '-' lines so that you see the end result): > smp_mb(); /* LMM(data_push_tail:C) */ > > + if (atomic_long_try_cmpxchg_relaxed(_ring->tail_lpos, > + _lpos, > + next_lpos)) { /* LMM(data_push_tail:D) */ > + break; > + } Doing an "smp_mb()" followed by a "cmpxchg_relaxed" seems all kinds of odd and pointless, and is very much non-optimal on x86 for example., Just remove the smp_mb(), and use the non-relaxed form of cmpxchg. It's defined to be fully ordered if it succeeds (and if the cmpxchg doesn't succeed, it's a no-op and the memory barrier shouldn't make any difference). Otherwise you'll do two memory ordering operations on x86 (and probably some other architectures), since the cmpxchg is always ordered on x86 and there exists no "relaxed" form of it. Linus ___ kexec mailing list kexec@lists.infradead.org http://lists.infradead.org/mailman/listinfo/kexec
Re: [PATCH] x86/kexec: prefer _THIS_IP_ to current_text_addr
On Mon, Aug 20, 2018 at 10:58 AM Nick Desaulniers wrote: > > + akpm, Linus > > Bumping for review. Ugh. I am not personally a huge fan of this endless "fix up one at a time". Just do a patch that removes current_text_addr() entirely and be done with it, if that's what we want the end result to be. Don't bother with these small "let's remove the remaining ones one by one". Just get it over and done with. Linus ___ kexec mailing list kexec@lists.infradead.org http://lists.infradead.org/mailman/listinfo/kexec
Re: [GIT PULL] remove in-kernel calls to syscalls
On Mon, Apr 2, 2018 at 12:04 PM, Dominik Brodowskiwrote: > > This patchset removes all in-kernel calls to syscall functions in the > kernel with the exception of arch/. Ok, this finished off my arch updates for today, I'll probably move on to driver pulls tomorrow. Anyway, it's in my tree, will push out once my test build finishes. Linus ___ kexec mailing list kexec@lists.infradead.org http://lists.infradead.org/mailman/listinfo/kexec
Re: kexec reboot fails with extra wbinvd introduced for AME SME
On Wed, Jan 17, 2018 at 6:57 PM, Dave Youngwrote: > > Could you say more about how to check this? My .config disabled > CONFIG_X86_MCE, should this be enabled? By all means, try it. Some pending machine check exception that we have *not* reacted to, and that is pending around kexec boot could very possibly be relevant. I forget your exact symptoms. Was it that the _first_ kexec works, but the second one hangs? Or was that the other unrelated issue? Or have I confused this with some other report entirely? Linus ___ kexec mailing list kexec@lists.infradead.org http://lists.infradead.org/mailman/listinfo/kexec
Re: kexec reboot fails with extra wbinvd introduced for AME SME
On Wed, Jan 17, 2018 at 6:47 PM, Dave Youngwrote: > Did several quick tests, probably need more tests, but till now the > results are: > > void stop_this_cpu(void *dummy) > { > => add wbinvd here: kexec works > local_irq_disable(); > => add wbinvd here: kexec works > /* > * Remove this CPU: > */ > set_cpu_online(smp_processor_id(), false); > => add wbinvd here: kexec does not work Funky. > So it seems that it will not work after cpu offined.. Well, that set_cpu_online() call really just clears a bit in our '__cpu_online_mask' CPU mask. It doesn't really do anything to the *hardware*. But I do wonder if the wbinvd causes an SMI or something on your system. I _think_ wbinvd causes some external pin to be wiggled just to tell possible external cache hardware to flush too, and on a system level that could be tied to some random thing. And then if we get an SMI/NMI when we've marked the system offline, maybe we do something odd. Very odd. But maybe this makes somebody go "Duh, that's because of xyz.." Linus ___ kexec mailing list kexec@lists.infradead.org http://lists.infradead.org/mailman/listinfo/kexec
Re: kexec reboot fails with extra wbinvd introduced for AME SME
On Wed, Jan 17, 2018 at 5:47 PM, Dave Youngwrote: > > It does not work with just once wbinvd(), and it only works with > removing the wbinvd() for me. Tom's new post works for me as well > since my cpu is an Intel i5-4200U. Intriguing. It's not like the wbinvd really should be that much of a deal. I think Tom's patch is fine and should be applied, but it does worry me a bit that even a single wbinvd makes that much of a difference for you. There is very little logical reason I can think of that a wbinvd should make any difference what-so-ever on an i5-4200U. I wonder if you have some system issues, and wbinvd just happens to trigger them. But I think we do wbinvd before a suspend-to-RAM too (it's "ACPI_FLUSH_CPU_CACHE()" in the ACPI code). And the dmr code dioes "wbinvd_on_all_cpus()" which does a cross-call etc. Would you mind experimenting a bit with that wbinvd? In particular, what happens if you enable it (so it's not hidden by the SME check), but you move it up to before interrupts are disabled? I'm wondering if there is some issue with MCE generation and wbinvd and whatever, and doing it when the CPU is down and interrupts are disabled causes some system issue.. Does anybody have any other ideas? Linus ___ kexec mailing list kexec@lists.infradead.org http://lists.infradead.org/mailman/listinfo/kexec
Re: kexec reboot fails with extra wbinvd introduced for AME SME
On Wed, Jan 17, 2018 at 11:42 AM, Linus Torvalds <torva...@linux-foundation.org> wrote: > > [ .. ]Some of the errata > around SME have been about machine check exceptions or something. That should be "some of the errata around wbinvd". I have no idea if there have been SME issues. That said, the really bad old wbinvd bug (which hung the system iirc) was for some old PPro CPU's. Googling around, the errata I find seem either irrelevant (eip corruption in 16-bit mode) or fairly mild (odd behavior wrt parity errors). So I have this memory of wbinvd having problems, but that memory does seem to be largely historical. But Dave Young clearly sees *something* odd going on. Linus ___ kexec mailing list kexec@lists.infradead.org http://lists.infradead.org/mailman/listinfo/kexec
Re: kexec reboot fails with extra wbinvd introduced for AME SME
On Tue, Jan 16, 2018 at 11:22 PM, Dave Youngwrote: > > For the kexec reboot hang, if I remove the wbinvd in stop_this_cpu() > then kexec works fine. like this: Honestly, I think we should apply that patch regardless. Using 'wbinvd' should not be some "just because of random reasons". There are CPU's with errata on wbinvd, and the thing in general is slow and nasty. Doing the wbinvd in a loop sounds even stranger. If we're only doing it because of some SME issue, why isn't it dependent on SME? And why is it inside that loop at all? Anyway, does it work for you if you just do the wbinvd() once, outside the loop? Admittedly the loop shouldn't actually loop (hlt with interrupts disabled), but who the hell knows.. Some of the errata around SME have been about machine check exceptions or something. See commit a68e5c94f7d3 ("x86, hotplug: Move WBINVD back outside the play_dead loop") for another example where wbinvd was inside a loop and apparently caused some odd issues. Linus ___ kexec mailing list kexec@lists.infradead.org http://lists.infradead.org/mailman/listinfo/kexec
Re: [PATCH v1 1/2] x86/mm, kexec: Fix memory corruption with SME on successive kexecs
On Thu, Jul 27, 2017 at 7:15 AM, Tom Lendackywrote: > > I can #ifdef the wbinvd based on whether AMD_MEM_ENCRYPT is configured > or not so that the wbinvd is avoided if not configured. I suspect an ifdef will be useless, since things like distro kernels tend to enable everything. So it should probably be disabled dynamically, and only done if the AMD memory encryption thing has actually been active. [ There have also been various actual errata with wbinvd, so there tends to be a non-performance reason to try to avoid it unless strictly required ] Linus ___ kexec mailing list kexec@lists.infradead.org http://lists.infradead.org/mailman/listinfo/kexec
Re: Removal of the kernel code/data/bss resources does break kexec/kdump
On Tue, Apr 19, 2016 at 2:04 AM, Dave Youngwrote: > > It is not clear how to handle it, maybe we can assume nobody is using it as > non-root, leave it as is or just add |CAP_SYS_BOOT for /proc/iomem? Pretty much nobody uses fine-grained capabilities anyway - they are one of those bad security things that generally add more complexity than value(*) - so I wouldn't worry about it unless you actually find something that cares. Linus (*) The one exception tends to be certain network services that can use CAP_NET_BIND_SERVICE like things to really lower their attack surface. But certainly not one-time things like kexec. ___ kexec mailing list kexec@lists.infradead.org http://lists.infradead.org/mailman/listinfo/kexec
Re: Removal of the kernel code/data/bss resources does break kexec/kdump
On Fri, Apr 15, 2016 at 8:46 AM, Emrah Demirwrote: > > file_ns_capable bring some problems. No it does not. file_ns_capable() is _required_ for security. We have had several security issues with file IO doing "capable()", and it's wrong and insecure. > I used capable and now there is no problem as far as I tested. You just screwed up the security, and with your change, a suid application can be fooled into making the hidden data available to non-secure users. "capable()" is wrong. For file reading, you *have* to use file_ns_capable(). It really is that simple. You should not test the capabilities of the process, you should be testing the capabilities of the file descriptor, which comes from the *open-time* capabilities. It sounds like you applied just the patch to kernel/resource.c, without applying the infrastructure patch. You also need commit 34dbbcdbf633 ("Make file credentials available to the seqfile interfaces"). Linus ___ kexec mailing list kexec@lists.infradead.org http://lists.infradead.org/mailman/listinfo/kexec
Re: Removal of the kernel code/data/bss resources does break kexec/kdump
On Thu, Apr 14, 2016 at 1:27 PM, Emrah Demir <e...@abdsec.com> wrote: > On 2016-04-14 13:40, Linus Torvalds wrote: >> >> >> Actually, %pK is horrible in /proc and /sys files, and does the wrong >> thing. > > I agree with that, but for now there is no way to make things right in /proc > or /sys. Well, there is now. I've pushed out my attempt at fixing things properly. Please check that kexec works - and if kexec ends up reading that file as non-root, I don't know what to say/do. Here's the three relevant cases: cat /proc/iomem sudo cat /proc/iomem sudo cat < /proc/iomem and two of them will now show the resource ranges as just plain zeroes. But yes, it needed extra infrastructure to be able to get this right. Note that while %pK is always wrong in /proc and /sys files, in this case it would have been particularly wrong, since the values can be 64-bit even on a 32-bit architecture, so trying to show them as pointers would have gotten not just the capability handling wrong, it would have truncated a 64-bit value to 32 bits in that case. Linus ___ kexec mailing list kexec@lists.infradead.org http://lists.infradead.org/mailman/listinfo/kexec
Re: Removal of the kernel code/data/bss resources does break kexec/kdump
On Thu, Apr 14, 2016 at 4:07 AM, Emrah Demirwrote: > > Kees Cook proposed to write a %pK formatted patch. This would solve most of > the problems. Actually, %pK is horrible in /proc and /sys files, and does the wrong thing. It uses the current creds for deciding what to do, which is exactly the wrong thing (for all the usual reasons) for a file access from a security standpoint. Sadly, almost every use of %pK gets this wrong. Thankfully, it's much less of a problem for reads than for writes, but it's still wrong. A file access should use "file->f_cred", but the seq_file interface sadly doesn't expose any way to do that. I'll take a look, but it's non-trivial to get right. %pK turns out to have been seriously mis-designed, and is basically almost always a bug. Linus ___ kexec mailing list kexec@lists.infradead.org http://lists.infradead.org/mailman/listinfo/kexec
Re: 2.6.34-rc4 : OOPS in unmap_vma
On Wed, 14 Apr 2010, Borislav Petkov wrote: hmm, it doesn't look like it. Your code translates to something like 0: b8 00 00 00 00 mov$0x0,%eax 5: 80 ff ffcmp$0xff,%bh 8: ff 48 21decl 0x21(%rax) b: 45 80 48 8b 45 rex.RB orb$0x45,-0x75(%r8) 10: 80 48 ff c8 orb$0xc8,-0x1(%rax) There's a large constant (0xff80) in there at the beginning, and the disassembly hasn't found the start of the next instruction very cleanly. The same is true at the end: another large constant is cut off in the middle. The byte just before the dumped instruction stream is almost certainly '48h', and the last byte of the last constant is 0xff, and the disassembly ends up being: 0: 48 b8 00 00 00 00 80mov$0xff80,%rax 7: ff ff ff a: 48 21 45 80 and%rax,-0x80(%rbp) e: 48 8b 45 80 mov-0x80(%rbp),%rax 12: 48 ff c8dec%rax 15: 48 3b 85 40 ff ff ffcmp-0xc0(%rbp),%rax 1c: 48 8b 85 50 ff ff ffmov-0xb0(%rbp),%rax 23: 48 0f 42 7d 80 cmovb -0x80(%rbp),%rdi 28: 48 89 7d 80 mov%rdi,-0x80(%rbp) 2c:* 48 8b 38mov(%rax),%rdi -- trapping instruction 2f: 48 85 fftest %rdi,%rdi 32: 0f 84 f5 04 00 00 je 0x52d 38: 48 b8 fb 0f 00 00 00mov$0xcffb,%rax 3f: c0 ff ff But yes, you found the right spot (that 0xff80 constant is -549755813888 decimal): which I could correlate with what I get here (comments added): Yup. Close enough. Btw, it's often good to look at both the *.s code _and_ the *.lst code. If you do make mm/memory.lst, you'll find those big constants easily, and then you'll see the code this way: do { next = pgd_addr_end(addr, end); 81b2aa45: 48 b8 00 00 00 00 80mov $0x80,%rax 81b2aa4c: 00 00 00 81b2aa4f: 49 8d 04 04 lea (%r12,%rax,1),%rax 81b2aa53: 48 89 45 a8 mov%rax,-0x58(%rbp) 81b2aa57: 48 b8 00 00 00 00 80mov $0xff80,%rax 81b2aa5e: ff ff ff 81b2aa61: 48 21 45 a8 and%rax,-0x58(%rbp) 81b2aa65: 48 8b 45 b8 mov-0x48(%rbp),%rax 81b2aa69: 48 8b 55 a8 mov-0x58(%rbp),%rdx 81b2aa6d: 48 ff c8dec%rax 81b2aa70: 48 ff cadec%rdx 81b2aa73: 48 39 c2cmp%rax,%rdx 81b2aa76: 48 8b 45 b8 mov-0x48(%rbp),%rax 81b2aa7a: 48 8b 55 90 mov-0x70(%rbp),%rdx 81b2aa7e: 48 0f 42 45 a8 cmovb -0x58(%rbp),%rax 81b2aa83: 48 89 45 a8 mov%rax,-0x58(%rbp) 81b2aa87: 48 8b 02mov(%rdx),%rax void pud_clear_bad(pud_t *); void pmd_clear_bad(pmd_t *); static inline int pgd_none_or_clear_bad(pgd_t *pgd) { if (pgd_none(*pgd)) 81b2aa8a: 48 85 c0test %rax,%rax 81b2aa8d: 74 20 je 81b2aaaf unmap_vmas+0x228 return 1; if (unlikely(pgd_bad(*pgd))) { 81b2aa8f: 48 ba fb 0f 00 00 00mov $0xcffb,%rdx 81b2aa96: c0 ff ff 81b2aa99: 48 21 c2and%rax,%rdx 81b2aa9c: 48 83 fa 63 cmp$0x63,%rdx 81b2aaa0: 0f 84 d9 04 00 00 je 81b2af7f unmap_vmas+0x6f8 although Parag's compiler has generated much better code (possibly due to config differences, possibly due to compiler versions) So you oops when dereferencing that pgd value in %rax (%rdx in my case), *pgd in pgd_none_or_clear_bad(pgd) which is called in the below fragment of unmap_page_range(). pgd = pgd_offset(vma-vm_mm, addr); do { next = pgd_addr_end(addr, end); if (pgd_none_or_clear_bad(pgd)) { (*zap_work)--; continue; } next = zap_pud_range(tlb, vma, pgd, addr, next, zap_work, details); } while (pgd++, addr = next, (addr != end *zap_work 0)); Correct. so it looks like it tries to find a page table rooted at that address but the pointer value of 2203 is bogus. Yes, it does look like some strange page table corruption, doesn't look anon_vma
Re: [PATCH] kexec jump: fix compiling warning on xchg(kexec_lock, 0) in kernel_kexec()
On Wed, 13 Aug 2008, Huang Ying wrote: - xchg(kexec_lock, 0); + locked = xchg(kexec_lock, 0); + BUG_ON(!locked); Why do you want to do this at all? And why do you implement your locks with xchg() in the first place? That's total and utter crap. Hint: we have _real_ locking primitives in the kernel. Linus ___ kexec mailing list kexec@lists.infradead.org http://lists.infradead.org/mailman/listinfo/kexec
Re: [PATCH] kexec jump: fix compiling warning on xchg(kexec_lock, 0) in kernel_kexec()
On Wed, 13 Aug 2008, Andrew Morton wrote: We don't need to create that local. I queued this: No, please don't. Just don't take this whole patch-series until it's cleaned up. There is absolutely no excuse for using xchg as a locking primitive. Nothing like this should be queued anywhere, it should be burned and the ashes should be scattered over the atlantic so that nobody will ever see them again. F*ck me with a spoon, if you have to use xchg() to do a trylock, why the hell isn't the unlock sequence then smp_mb(); var = 0; instead? Not that that's really right either, but at least it avoids the _ridiculous_ crap. The real solution is probably to use a spinlock and trylock/unlock. Linus ___ kexec mailing list kexec@lists.infradead.org http://lists.infradead.org/mailman/listinfo/kexec
Re: [PATCH] kexec jump: fix compiling warning on xchg(kexec_lock, 0) in kernel_kexec()
On Wed, 13 Aug 2008, Andrew Morton wrote: - * in interrupt context :) + * Return true if we acquired the lock */ -static int kexec_lock; +static inline bool kexec_trylock(void) +{ + return !test_and_set_bit(0, kexec_bitlock); Nope. That needs to be an unsigned long. But more importantl, why not just make it a lock in the first place? static DEFINE_SPINLOCK(kexec_lock); #define kexec_trylock() spin_trylock(kexec_lock) #define kexec_unlock() spin_unlock(kexec_lock) and then you get it all right and clear and obvious. Yeah, and I didn't check whether there is anything that is supposed to be able to sleep. If there is, use a mutex instead of a spinlock, of course. Linus ___ kexec mailing list kexec@lists.infradead.org http://lists.infradead.org/mailman/listinfo/kexec
Re: [PATCH] kexec jump: fix compiling warning on xchg(kexec_lock, 0) in kernel_kexec()
On Wed, 13 Aug 2008, Andrew Morton wrote: #2: I thought you said there were things that want to sleep in the region? If so, spinlocks will work as long as you don't have CONFIG_PREEMPT or lock validation (there's no way to deadlock thanks to all the lock getters using the trylock variant), but will blow up because a successful trylock will obviously also disable preemption and/or trigger all the lock detection. So if there are potential sleepers, you'd need the mutex instead. Linus ___ kexec mailing list kexec@lists.infradead.org http://lists.infradead.org/mailman/listinfo/kexec