Re: [PATCH v3 00/17] kexec: Allow preservation of ftrace buffers

2024-02-09 Thread Philipp Rudo
Hi Alex,

On Fri, 2 Feb 2024 13:58:52 +0100
Alexander Graf  wrote:

> Hi Philipp,
> 
> On 29.01.24 17:34, Philipp Rudo wrote:
> > Hi Alex,
> >
> > adding linux-integrity as there are some synergies with IMA_KEXEC (in case 
> > we
> > get KHO to work).
> >
> > Fist of all I believe that having a generic framework to pass information 
> > from
> > one kernel to the other across kexec would be a good thing. But I'm afraid 
> > that  
> 
> 
> Thanks, I'm happy to hear that you agree with the basic motivation :). 
> There are fundamentally 2 problems with passing data:
> 
>    * Passing structured data in a cross-architecture way
>    * Passing memory
> 
> KHO tackles both. It proposes a common FDT based format that allows us 
> to pass per-subsystem properties. That way, a subsystem does not need to 
> know whether it's running on ARM, x86, RISC-V or s390x. It just gains 
> awareness for KHO and can pass data.
> 
> On top of that, it proposes a standardized "mem" property (and some 
> magic around that) which allows subsystems to pass memory.
> 
> 
> > you are ignoring some fundamental problems which makes it extremely hard, if
> > not impossible, to reliably transfer the kernel's state from one kernel to 
> > the
> > other.
> >
> > One thing I don't understand is how reusing the scratch area is working. 
> > Sure
> > you pass it's location via the dt/boot_params but I don't see any code that
> > makes it a CMA region. So IIUC the scratch area won't be available for the 
> > 2nd
> > kernel. Which is probably for the better as IIUC the 2nd kernel gets loaded 
> > and
> > runs inside that area and I don't believe the CMA design ever considered 
> > that
> > the kernel image could be included in a CMA area.  
> 
> 
> That one took me a lot to figure out sensibly (with recursion all the 
> way down) while building KHO :). I hope I detailed it sensibly in the 
> documentation - please let me know how to improve it in case it's 
> unclear: https://lore.kernel.org/lkml/20240117144704.602-8-g...@amazon.com/
> 
> Let me explain inline using different words as well what happens:
> 
> The first (and only the first) kernel that boots allocates a CMA region 
> as "scratch region". It loads the new kernel into that region. It passes 
> that region as "scratch region" to the next kernel. The next kernel now 
> takes it and marks every page block that the scratch region spans as CMA:
> 
> https://lore.kernel.org/lkml/20240117144704.602-3-g...@amazon.com/
> 
> The CMA hint doesn't mean we create an actual CMA region. It mostly 
> means that the kernel won't use this memory for any kernel allocations. 
> Kernel allocations up to this point are allocations we don't need to 
> pass on with KHO again. Kernel allocations past that point may be 
> allocations that we want to pass, so we just never place them into the 
> "scratch region" again.
> 
> And because we now already have a scratch region from the previous 
> kernel, we keep reusing that forever with any new KHO kexec.

Thanks for the explanation. I've missed the memblock_mark_scratch in
kho_populate. The code makes much more sense now :-)

Having that said, for complex series like this one I like to do the
review on a branch in my local git as that to avoid problems like that
(or at least make them less likely). But your patches didn't apply. Can
you tell me what your base is or make your git branch available. That
would be very helpful to me. Thanks!

> > Staying at reusing the scratch area. One thing that is broken for sure is 
> > that
> > you reuse the scratch area without ever checking the kho_scratch parameter 
> > of
> > the 2nd kernel's command line. Remember, with kexec you are dealing with two
> > different kernels with two different command lines. Meaning you can only 
> > reuse
> > the scratch area if the requested size in the 2nd kernel is identical to the
> > one of the 1st kernel. In all other cases you need to adjust the scratch 
> > area's
> > size or reserve a new one.  
> 
> 
> Hm. So you're saying a user may want to change the size of the scratch 
> area with a KHO kexec. That's insanely risky because you (as rightfully 
> pointed out below) may have significant fragmentation at that point. And 
> we will only know when we're in the new kernel so it's too late to 
> abort. IMHO it's better to just declare the scratch region as immutable 
> during KHO to avoid that pitfall.

Yes, a user can set any command line with kexec. My expectation as a
user is that the kernel respects whatever I set on the command line and
doesn't think it knows better and

Re: [PATCH v3 00/17] kexec: Allow preservation of ftrace buffers

2024-01-29 Thread Philipp Rudo
Hi Alex,

adding linux-integrity as there are some synergies with IMA_KEXEC (in case we
get KHO to work).

Fist of all I believe that having a generic framework to pass information from
one kernel to the other across kexec would be a good thing. But I'm afraid that
you are ignoring some fundamental problems which makes it extremely hard, if
not impossible, to reliably transfer the kernel's state from one kernel to the
other.

One thing I don't understand is how reusing the scratch area is working. Sure
you pass it's location via the dt/boot_params but I don't see any code that
makes it a CMA region. So IIUC the scratch area won't be available for the 2nd
kernel. Which is probably for the better as IIUC the 2nd kernel gets loaded and
runs inside that area and I don't believe the CMA design ever considered that
the kernel image could be included in a CMA area.

Staying at reusing the scratch area. One thing that is broken for sure is that
you reuse the scratch area without ever checking the kho_scratch parameter of
the 2nd kernel's command line. Remember, with kexec you are dealing with two
different kernels with two different command lines. Meaning you can only reuse
the scratch area if the requested size in the 2nd kernel is identical to the
one of the 1st kernel. In all other cases you need to adjust the scratch area's
size or reserve a new one.

This directly leads to the next problem. In kho_reserve_previous_mem you are
reusing the different memory regions wherever the 1st kernel allocated them.
But that also means you are handing over the 1st kernel's memory
fragmentation to the 2nd kernel and you do that extremely early during boot.
Which means that users who need to allocate large continuous physical memory,
like the scratch area or the crashkernel memory, will have increasing chance to
not find a suitable area. Which IMHO is unacceptable.

Finally, and that's the big elephant in the room, is your lax handling of the
unstable kernel internal ABI. Remember, you are dealing with two different
kernels, that also means two different source levels and two different configs.
So only because both the 1st and 2nd kernel have a e.g. struct buffer_page
doesn't means that they have the same struct buffer_page. But that's what your
code implicitly assumes. For KHO ever to make it upstream you need to make sure
that both kernels are "speaking the same language".

Personally I see two possible solutions:

1) You introduce a stable intermediate format for every subsystem similar to
what IMA_KEXEC does. This should work for simple types like struct buffer_page
but for complex ones like struct vfio_device that's basically impossible.

2) You also hand over the ABI version for every given type (basically just a
hash over all fields including all the dependencies). So the 2nd kernel can
verify that the data handed over is in a format it can handle and if not bail
out with a descriptive error message rather than reading garbage. Plus side is
that once such a system is in place you can reuse it to automatically resolve
all dependencies so you no longer need to manually store the buffer_page and
its buffer_data_page separately.
Down side is that traversing the debuginfo (including the ones from modules) is
not a simple task and I expect that such a system will be way more complex than
the rest of KHO. In addition there are some cases that the versioning won't be
able to capture. For example if a type contains a "void *"-field. Then although
the definition of the type is identical in both kernels the field can be cast
to different types when used. An other problem will be function pointers which
you first need to resolve in the 1st kernel and then map to the identical
function in the 2nd kernel. This will become particularly "fun" when the
function is part of a module that isn't loaded at the time when you try to
recreate the kernel's state.

So to summarize, while it would be nice to have a generic framework like KHO to
pass data from one kernel to the other via kexec there are good reasons why it
doesn't exist, yet.

Thanks
Philipp


On Wed, 17 Jan 2024 14:46:47 +
Alexander Graf  wrote:

> Kexec today considers itself purely a boot loader: When we enter the new
> kernel, any state the previous kernel left behind is irrelevant and the
> new kernel reinitializes the system.
> 
> However, there are use cases where this mode of operation is not what we
> actually want. In virtualization hosts for example, we want to use kexec
> to update the host kernel while virtual machine memory stays untouched.
> When we add device assignment to the mix, we also need to ensure that
> IOMMU and VFIO states are untouched. If we add PCIe peer to peer DMA, we
> need to do the same for the PCI subsystem. If we want to kexec while an
> SEV-SNP enabled virtual machine is running, we need to preserve the VM
> context pages and physical memory. See James' and my Linux Plumbers
> Conference 2023 presentation for details:
> 
>   

Re: [RFC 0/3] kdump: Check mem_map of CMA area in kdump

2023-12-19 Thread Philipp Rudo
Hi Pingfan,

On Mon, 18 Dec 2023 13:23:22 +0800
Pingfan Liu  wrote:

> From: Pingfan Liu 
> 
> 
> First of all, this series is only for proof of concept. It only passes 
> compilation.
> 
> For years, CMA is proposed to be used as crashkernel reserved memory.
> But DIO prevent us to follow it since DMA may be in-flight and ruin the
> kdump kernel.
> 
> This series exports the crash kernel's CMA area information through
> device-tree, and kdump kernel skips any page, which refcnt!=mapcount and
> has a potential DMA activity.
> 
> The exported information include:
>   u64 kdump_cma_pfn;
>   u64 kdump_cma_pg_cnt;
>   u64 kdump_cma_pg_paddr;
> 
> And they should be filled with Jiri's series "[PATCH 0/4] kdump:
> crashkernel reservation from CMA"
> 
> After the conjunction of two series, the CMA used for kdump has only the
> following risk, where the following conditions:
>   -1.a wrong code forges _refcnt and mapcount to the same value
>   -2.the page is also used by DIO
> 
> 
> Is it acceptable, or any rescue e.g. CRC on page?
> 
> Please share your thoughts.

I don't think your approach will work as intended. The problem is that
we are dealing with two separate kernels and there is no guarantee that
both kernels are identical. So you cannot rely on the definition of
struct page in the crash kernel to be identical to the one in the
panicked kernel. Meaning check_poison_page from the crash kernel cannot
simply operate on the struct pages from the panicked kernel.

To get this approach to work I see three possible "fixes"

1) enforce in kexec that only the currently running kernel can be
loaded as crash kernel.

2) pass all required "debuginfo" to the crash kernel so it can parse
the required data reliably from the dump. This also requires to have
all the mm helper functions to be reimplemented to work in
check_poison_page.

3) the required information is passed via a new data structure which
is designed in a way that it can easily be passed in between different
kernels. But this would require the mm subsystem to maintain the page
states in the CMA in two separate data structures.

Personally I don't think that any of the three "fixes" is desirable.

Thanks
Philipp

> Thanks,
> 
> Pingfan
> 
> 
> Cc: Jiri Bohac 
> Cc: Michal Hocko 
> Cc: Philipp Rudo 
> Cc: Baoquan He 
> Cc: Dave Young 
> To: kexec@lists.infradead.org
> ---
> Pingfan Liu (3):
>   crash_dump: Parse the CMA's mem_map in kdump
>   of: kexec: Set up properties for reusing CMA in kdump
>   of: fdt: Parse properties of reusing CMA in kdump
> 
>  drivers/of/fdt.c  | 43 +++
>  drivers/of/kexec.c| 14 
>  include/linux/kexec.h |  5 +++
>  init/main.c   |  4 +++
>  kernel/crash_dump.c   | 80 +++
>  5 files changed, 146 insertions(+)
> 


___
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec


Re: [PATCH 0/4] kdump: crashkernel reservation from CMA

2023-12-07 Thread Philipp Rudo
On Wed, 6 Dec 2023 16:19:51 +0100
Michal Hocko  wrote:

> On Wed 06-12-23 14:49:51, Michal Hocko wrote:
> > On Wed 06-12-23 12:08:05, Philipp Rudo wrote:  
> [...]
> > > If I understand Documentation/core-api/pin_user_pages.rst correctly you
> > > missed case 1 Direct IO. In that case "short term" DMA is allowed for
> > > pages without FOLL_LONGTERM. Meaning that there is a way you can
> > > corrupt the CMA and with that the crash kernel after the production
> > > kernel has panicked.  
> > 
> > Could you expand on this? How exactly direct IO request survives across
> > into the kdump kernel? I do understand the RMDA case because the IO is
> > async and out of control of the receiving end.  
> 
> OK, I guess I get what you mean. You are worried that there is 
> DIO request
> program DMA controller to read into CMA memory
> 
> boot into crash kernel backed by CMA
> DMA transfer is done.
> 
> DIO doesn't migrate the pinned memory because it is considered a very
> quick operation which doesn't block the movability for too long. That is
> why I have considered that a non-problem. RDMA on the other might pin
> memory for transfer for much longer but that case is handled by
> migrating the memory away.

Right that is the scenario we need to prevent.

> Now I agree that there is a chance of the corruption from DIO. The
> question I am not entirely clear about right now is how big of a real
> problem that is. DMA transfers should be a very swift operation. Would
> it help to wait for a grace period before jumping into the kdump kernel?

Please see my other mail.

> > Also if direct IO is a problem how come this is not a problem for kexec
> > in general. The new kernel usually shares all the memory with the 1st
> > kernel.  
> 
> This is also more clear now. Pure kexec is shutting down all the devices
> which should terminate the in-flight DMA transfers.

Right, it _should_ terminate all transfers. But here we are back at the
shitty device drivers that don't have a working shutdown method. That's
why we have already seen the problem you describe above with kexec. And
please believe me that debugging such a scenario is an absolute pain.
Especially when it's a proprietary, out-of-tree driver that caused the
mess.

Thanks
Philipp


___
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec


Re: [PATCH 0/4] kdump: crashkernel reservation from CMA

2023-12-07 Thread Philipp Rudo
On Thu, 7 Dec 2023 09:55:20 +0100
Michal Hocko  wrote:

> On Thu 07-12-23 12:23:13, Baoquan He wrote:
> [...]
> > We can't guarantee how swift the DMA transfer could be in the cma, case,
> > it will be a venture.  
> 
> We can't guarantee this of course but AFAIK the DMA shouldn't take
> minutes, right? While not perfect, waiting for some time before jumping
> into the crash kernel should be acceptable from user POV and it should
> work around most of those potential lingering programmed DMA transfers.

I don't think that simply waiting is acceptable. For one it doesn't
guarantee that there is no corruption (please also see below) but only
reduces its probability. Furthermore, how long would you wait?
Thing is that users don't only want to reduce the memory usage but also
the downtime of kdump. In the end I'm afraid that "simply waiting" will
make things unnecessarily more complex without really solving any issue.

> So I guess what we would like to hear from you as kdump maintainers is
> this. Is it absolutely imperative that these issue must be proven
> impossible or is a best effort approach something worth investing time
> into? Because if the requirement is an absolute guarantee then I simply
> do not see any feasible way to achieve the goal of reusable memory.
> 
> Let me reiterate that the existing reservation mechanism is showing its
> limits for production systems and I strongly believe this is something
> that needs addressing because crash dumps are very often the only tool
> to investigate complex issues.

Because having a crash dump is so important I want a prove that no
legal operation can corrupt the crashkernel memory. The easiest way to
achieve this is by simply keeping the two memory regions fully
separated like it is today. In theory it should also be possible to
prevent any kind of page pinning in the shared crashkernel memory. But
I don't know which side effect this has for mm. Such an idea needs to
be discussed on the mm mailing list first.

Finally, let me question whether the whole approach actually solves
anything. For me the difficulty in determining the correct crashkernel
memory is only a symptom. The real problem is that most developers
don't expect their code to run outside their typical environment.
Especially not in an memory constraint environment like kdump. But that
problem won't be solved by throwing more memory at it as this
additional memory will eventually run out as well. In the end we are
back at the point where we are today but with more memory.

Finally finally, one tip. Next time a customer complaints about how
much memory the crashkernel "wastes" ask them how much one day of down
time for one machine costs them and how much memory they could buy for
that money. After that calculation I'm pretty sure that an additional
100M of crashkernel memory becomes much more tempting.

Thanks
Philipp


___
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec


Re: [PATCH 0/4] kdump: crashkernel reservation from CMA

2023-12-06 Thread Philipp Rudo
On Fri, 1 Dec 2023 17:59:02 +0100
Michal Hocko  wrote:

> On Fri 01-12-23 16:51:13, Philipp Rudo wrote:
> > On Fri, 1 Dec 2023 12:55:52 +0100
> > Michal Hocko  wrote:
> >   
> > > On Fri 01-12-23 12:33:53, Philipp Rudo wrote:
> > > [...]  
> > > > And yes, those are all what-if concerns but unfortunately that is all
> > > > we have right now.
> > > 
> > > Should theoretical concerns without an actual evidence (e.g. multiple
> > > drivers known to be broken) become a roadblock for this otherwise useful
> > > feature?   
> > 
> > Those concerns aren't just theoretical. They are experiences we have
> > from a related feature that suffers exactly the same problem regularly
> > which wouldn't exist if everybody would simply work "properly".  
> 
> What is the related feature?

kexec

> > And yes, even purely theoretical concerns can become a roadblock for a
> > feature when the cost of those theoretical concerns exceed the benefit
> > of the feature. The thing is that bugs will be reported against kexec.
> > So _we_ need to figure out which of the shitty drivers caused the
> > problem. That puts additional burden on _us_. What we are trying to
> > evaluate at the moment is if the benefit outweighs the extra burden
> > with the information we have at the moment.  
> 
> I do understand your concerns! But I am pretty sure you do realize that
> it is really hard to argue theoreticals.  Let me restate what I consider
> facts. Hopefully we can agree on these points
>   - the CMA region can be used by user space memory which is a
> great advantage because the memory is not wasted and our
> experience has shown that users do care about this a lot. We
> _know_ that pressure on making those reservations smaller
> results in a less reliable crashdump and more resources spent
> on tuning and testing (especially after major upgrades).  A
> larger reservation which is not completely wasted for the
> normal runtime is addressing that concern.
>   - There is no other known mechanism to achieve the reusability
> of the crash kernel memory to stop the wastage without much
> more intrusive code/api impact (e.g. a separate zone or
> dedicated interface to prevent any hazardous usage like RDMA).
>   - implementation wise the patch has a very small footprint. It
> is using an existing infrastructure (CMA) and it adds a
> minimal hooking into crashkernel configuration.
>   - The only identified risk so far is RDMA acting on this memory
> without using proper pinning interface. If it helps to have a
> statement from RDMA maintainers/developers then we can pull
> them in for a further discussion of course.
>   - The feature requires an explicit opt-in so this doesn't bring
> any new risk to existing crash kernel users until they decide
> to use it. AFAIU there is no way to tell that the crash kernel
> memory used to be CMA based in the primary kernel. If you
> believe that having that information available for
> debugability would help then I believe this shouldn't be hard
> to add.  I think it would even make sense to mark this feature
> experimental to make it clear to users that this needs some
> time before it can be marked production ready.
> 
> I hope I haven't really missed anything important. The final

If I understand Documentation/core-api/pin_user_pages.rst correctly you
missed case 1 Direct IO. In that case "short term" DMA is allowed for
pages without FOLL_LONGTERM. Meaning that there is a way you can
corrupt the CMA and with that the crash kernel after the production
kernel has panicked.

With that I don't see a chance this series can be included unless
someone can explain me that that the documentation is wrong or I
understood it wrong.

Having that said
NAcked-by: Philipp Rudo 

> cost/benefit judgment is up to you, maintainers, of course but I would
> like to remind that we are dealing with a _real_ problem that many
> production systems are struggling with and that we don't really have any
> other solution available.


___
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec


Re: [PATCH 0/4] kdump: crashkernel reservation from CMA

2023-12-01 Thread Philipp Rudo
On Fri, 1 Dec 2023 12:55:52 +0100
Michal Hocko  wrote:

> On Fri 01-12-23 12:33:53, Philipp Rudo wrote:
> [...]
> > And yes, those are all what-if concerns but unfortunately that is all
> > we have right now.  
> 
> Should theoretical concerns without an actual evidence (e.g. multiple
> drivers known to be broken) become a roadblock for this otherwise useful
> feature? 

Those concerns aren't just theoretical. They are experiences we have
from a related feature that suffers exactly the same problem regularly
which wouldn't exist if everybody would simply work "properly".

And yes, even purely theoretical concerns can become a roadblock for a
feature when the cost of those theoretical concerns exceed the benefit
of the feature. The thing is that bugs will be reported against kexec.
So _we_ need to figure out which of the shitty drivers caused the
problem. That puts additional burden on _us_. What we are trying to
evaluate at the moment is if the benefit outweighs the extra burden
with the information we have at the moment.

> > Only alternative would be to run extended tests in
> > the field. Which means this user facing change needs to be included.
> > Which also means that we are stuck with it as once a user facing change
> > is in it's extremely hard to get rid of it again...  
> 
> I am not really sure I follow you here. Are you suggesting once
> crashkernel=cma is added it would become a user api and therefore
> impossible to get rid of?

Yes, sort of. I wouldn't rank a command line parameter as user api. So
we still can get rid of it. But there will be long discussions I'd like
to avoid if possible.

Thanks
Philipp


___
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec


Re: [PATCH 0/4] kdump: crashkernel reservation from CMA

2023-12-01 Thread Philipp Rudo
Hi Jiri,

I'd really love to see something like this to work. Although I also
share the concerns about shitty device drivers corrupting the CMA.
Please see my other mail for that.

Anyway, one more comment below.

On Fri, 24 Nov 2023 20:54:36 +0100
Jiri Bohac  wrote:

[...]
 
> Now, specifying
>   crashkernel=100M craskhernel=1G,cma
> on the command line will make a standard crashkernel reservation
> of 100M, where kexec will load the kernel and initrd.
> 
> An additional 1G will be reserved from CMA, still usable by the
> production system. The crash kernel will have 1.1G memory
> available. The 100M can be reliably predicted based on the size
> of the kernel and initrd.

I doubt that the fixed part can be predicted "reliably". For sure it
will be more reliable than today but IMHO we will still be stuck with
some guessing. Otherwise it would mean that you already know during
boot which initrd the user space will be loading later on. Which IMHO is
impossible as the initrd can always be rebuild with a larger size.
Furthermore, I'd be careful when you are dealing with compressed kernel
images. As I'm not sure how the different decompressor phases would
handle scenarios where the (fixed) crashkernel memory is large enough
to hold the compressed kernel (+initrd) but not the decompressed one.

One more thing, I'm not sure I like that you need to reserve two
separate memory regions. Personally I would prefer it if you could
reserve one large region for the crashkernel but allow parts of it to
be reused via CMA. Otherwise I'm afraid there will be people who only
have one ,cma entry on the command line and cannot figure out why they
cannot load the crash kernel.

Thanks
Philipp
 
> When no crashkernel=size,cma is specified, everything works as
> before.
> 


___
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec


Re: [PATCH 0/4] kdump: crashkernel reservation from CMA

2023-12-01 Thread Philipp Rudo
Hi Michal,

On Thu, 30 Nov 2023 14:41:12 +0100
Michal Hocko  wrote:

> On Thu 30-11-23 20:31:44, Baoquan He wrote:
> [...]
> > > > which doesn't use the proper pinning API (which would migrate away from
> > > > the CMA) then what is the worst case? We will get crash kernel corrupted
> > > > potentially and fail to take a proper kernel crash, right? Is this
> > > > worrisome? Yes. Is it a real roadblock? I do not think so. The problem  
> > 
> > We may fail to take a proper kernel crash, why isn't it a roadblock?  
> 
> It would be if the threat was practical. So far I only see very
> theoretical what-if concerns. And I do not mean to downplay those at
> all. As already explained proper CMA users shouldn't ever leak out any
> writes across kernel reboot.

You are right, "proper" CMA users don't do that. But "proper" drivers
also provide a working shutdown() method. Experience shows that there
are enough shitty drivers out there without working shutdown(). So I
think it is naive to assume you are only dealing with "proper" CMA
users.

For me the question is, what is less painful? Hunting down shitty
(potentially out of tree) drivers that cause a memory corruption? Or ...

> > We
> > have stable way with a little more memory, why would we take risk to
> > take another way, just for saving memory? Usually only high end server
> > needs the big memory for crashkernel and the big end server usually have
> > huge system ram. The big memory will be a very small percentage relative
> > to huge system RAM.  
> 
> Jiri will likely talk more specific about that but our experience tells
> that proper crashkernel memory scaling has turned out a real
> maintainability problem because existing setups tend to break with major
> kernel version upgrades or non trivial changes.

... frequently test if the crashkernel memory is still appropriate? The
big advantage of the latter I see is that an OOM situation has very
easy to detect and debug. A memory corruption isn't. Especially when
it was triggered by an other kernel.

And yes, those are all what-if concerns but unfortunately that is all
we have right now. Only alternative would be to run extended tests in
the field. Which means this user facing change needs to be included.
Which also means that we are stuck with it as once a user facing change
is in it's extremely hard to get rid of it again...

Thanks
Philipp

> > > > seems theoretical to me and it is not CMA usage at fault here IMHO. It
> > > > is the said theoretical driver that needs fixing anyway.  
> > 
> > Now, what we want to make clear is if it's a theoretical possibility, or
> > very likely happen. We have met several on-flight DMA stomping into
> > kexec kernel's initrd in the past two years because device driver didn't
> > provide shutdown() methor properly. For kdump, once it happen, the pain
> > is we don't know how to debug. For kexec reboot, customer allows to
> > login their system to reproduce and figure out the stomping. For kdump,
> > the system corruption rarely happend, and the stomping could rarely
> > happen too.  
> 
> yes, this is understood.
>  
> > The code change looks simple and the benefit is very attractive. I
> > surely like it if finally people confirm there's no risk. As I said, we
> > can't afford to take the risk if it possibly happen. But I don't object
> > if other people would rather take risk, we can let it land in kernel.  
> 
> I think it is fair to be cautious and I wouldn't impose the new method
> as a default. Only time can tell how safe this really is. It is hard to
> protect agains theoretical issues though. Bugs should be fixed.
> I believe this option would allow to configure kdump much easier and
> less fragile.
>  
> > My personal opinion, thanks for sharing your thought.  
> 
> Thanks for sharing.
> 


___
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec


Re: [PATCH 0/2] Sign the Image which is zboot's payload

2023-09-25 Thread Philipp Rudo
Hi Dave,

On Fri, 22 Sep 2023 13:41:22 +0800
Dave Young  wrote:

> Hi Jan,
> 
> On Fri, 22 Sept 2023 at 13:19, Jan Hendrik Farr  wrote:
> >
> > Hi Pingfan!
> >
> > On 21 21:37:01, Pingfan Liu wrote:  
> > > From: Pingfan Liu 
> > >  
> >  
> > > For security boot, the vmlinuz.efi will be signed so UEFI boot loader
> > > can check against it. But at present, there is no signature for kexec
> > > file load, this series makes a signature on the zboot's payload -- Image
> > > before it is compressed. As a result, the kexec-tools parses and
> > > decompresses the Image.gz to get the Image, which has signature and can
> > > be checked against during kexec file load  
> >
> > I missed some of the earlier discussion about this zboot kexec support.
> > So just let me know if I'm missing something here. You were exploring
> > these two options in getting this supported:
> >
> > 1. Making kexec_file_load do all the work.
> >
> > This option makes the signature verification easy. kexec_file_load
> > checks the signature on the pe file and then extracts it and does the
> > kexec.
> >
> > This is similar to how I'm approaching UKI support in [1].
> >
> > 2. Extract in userspace and pass decompressed kernel to kexec_file_load
> >
> > This options requires the decompressed kernel to have a valid signature on
> > it. That's why this patch adds the ability to add that signature to the
> > kernel contained inside the zboot image.
> >
> > This option would not make sense for UKI support as it would not
> > validate the signature with respect to the initrd and cmdline that it
> > contains.  
> 
> Another possibility for the cmdline could be using the bootconfig
> facility which was
> introduced for boot time tracking:
> Documentation/admin-guide/bootconfig.rst
> 
> So the initrd+cmdline can be signed as well.  Has this been discussed
> before for UKI?

Not that I know of. But I'm not sure if the bootconfig the way it works
today does the trick.

For one the bootconfig is simply glued to the end of the initrd. But
that makes it part of the UKI as well. So there is no added gain.

Plus, adding the cmdline to the UKI was done on purpose to prevent any
unauthorized editing. That basically means that any change to the
cmdline needs to be signed as well. But I don't see any signature
verification while processing the bootconfig. 

Finally the bootconfig is setup too late in the boot process,
in particular after setup_arch which reserves the crashkernel
memory and needs to parse the kernel command line for that. An even more
extreme example is the decompressor phase on s390. There the command
line is parsed as well. And that is code that runs before start_kernel.

All in all I don't believe that using the bootconfig adds much benefit
for the UKI.

Thanks
Philipp



Re: [PATCH v2 0/2] x86/kexec: UKI Support

2023-09-25 Thread Philipp Rudo
Hi Jan,

On Thu, 21 Sep 2023 00:02:25 +0200
Jan Hendrik Farr  wrote:

[...]

> > Maybe we should do a BoF at LPC to discuss this further?  
> 
> I definetly won't be at LPC, is it possible to join virtually?

Yes, LPC will be hybrid again this year. Virtual access costs $50
although you can apply for an 50% discount when you are a
non-professional.

https://lpc.events/event/17/page/212-attend

Thanks
Philipp


___
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec


Re: [PATCH v2 0/2] x86/kexec: UKI Support

2023-09-18 Thread Philipp Rudo
Hi Jan,

On Thu, 14 Sep 2023 23:04:32 +0200
"Jan Hendrik Farr"  wrote:

> On Thu, Sep 14, 2023, at 8:51 PM, Philipp Rudo wrote:
> > [...]
> >
> > In this context I hope it is also clear to you that when more and more
> > people rely on the spec you need a more formal process when including
> > changes. Especially when the change might break the implementation of
> > others. So no more making the .cmdline optional and allowing it to be
> > overwritten all on the same day.
> >
> > Having that said, what does "local override" exactly mean? Does that
> > mean a distro can allow a user to freely choose the cmdline without
> > checking any signatures?  
> 
> The behavior of systemd-stub is to allow the bootloader (or whatever
> called sd-stub) supplied cmdline when there is no .cmdline section in
> the UKI. That's how I understand "local override" here. For WIP v3 of
> this patch the behavior is to use the cmdline supplied by userspace to
> the kexec_file_load syscall if no .cmdline section is in the UKI.
> 
> empty .cmdline section -> empty cmdline always passed to kernel
> .cmdline section -> use bootloader/user supplied cmdline (which would
> be empty by default)
> 
> This setup does not make sense for a locked down / secure system though.
> 
> Maybe the word "override" is not ideal. There is nothing actually being
> overridden as there is no cmdline in the UKI in the first place.
> 
> sd-stub also allows the bootloader supplied cmdline if not using secure
> boot. So maybe the kernel could allow user supplied cmdline if not in
> lockdown mode for kexec maybe? If not in lockdown mode somebody can just
> kexec an unsigned kernel + unsigned cmdline using the kexec_load syscall
> anyways. For this case the word "override" makes sense.
> 
> The logic for all of this in sd-stub is in [1].
> 
> > I.e. does that mean we can get rid of this
> >  https://github.com/systemd/systemd/issues/24539  
> 
> This is a different usecase IMO.

Yeah, I expected that. The whole question was meant to be rhetorical.
The point I wanted to make was that when a spec uses terms like "local
override" it needs to explain what it means.

Thanks
Philipp

> >> Hence, seeing the spec as set in stone and as inherently low quality
> >> is the wrong way to see it I am sure. Instead, the goal here is to
> >> adjust the spec to make it work really nicely for *both* systemd and
> >> the kernel.  
> >
> > Sorry, I never wanted to intend that the spec inherently low quality.
> > Just that it doesn't meat my expectations, yet. But that is fine. The
> > spec isn't even a year old and there's only a single implementation,
> > yet. So it's more documentation rather than a spec.  
> 
> Let's make it happen.
> 
> 
> [1] 
> https://github.com/systemd/systemd/blob/5898cef22a35ceefa068d5f46929eced2baab0ed/src/boot/efi/stub.c#L140
> 


___
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec


Re: [PATCH v2 0/2] x86/kexec: UKI Support

2023-09-14 Thread Philipp Rudo
Hi Jan,

On Wed, 13 Sep 2023 16:42:33 +0200
"Jan Hendrik Farr"  wrote:

> On Wed, Sep 13, 2023, at 4:00 PM, Philipp Rudo wrote:

[...]

> In [5] Luca writes:
> > [...] we fully intend for the UKI format to be an open and stable
> > specification, that anybody can support and rely on.  
> But that is unfortunately not where the format is at this point.
> 
> What is annoying though is where this leaves a user that actually
> wants this feature. They can carry a patch or they might have to wait
> a long time.
> 
> Can you indicate what it would take for the kernel community to consider
> this spec as stable enough?

I don't think there is a good answer to that question. In fact I
believe if you ask 10 people from the community you will get 20+
different answers.

My guess is that either (1) the spec is moved to some official standard
committee where people spend decades to polish it before it makes it
into the kernel or (2) there's a big flamewar on LKML until Linus had
enough and passes his judgment on it. So definitely (2) ;-)

Thanks
Philipp

> 
> 
> > In the end the only benefit this series brings is to extend the
> > signature checking on the whole UKI except of just the kernel image.
> > Everything else can also be done in user space. Compared to the
> > problems described above this is a very small gain for me.  
> 
> Correct. That is the benefit of pulling the UKI apart in the
> kernel. However having to sign the kernel inside the UKI defeats
> the whole point.
> 
> 
> [1] https://uapi-group.org/specifications/specs/unified_kernel_image/
> [2] https://github.com/uapi-group/specifications/pull/72
> [3] https://github.com/uapi-group/specifications/pull/73
> [4] https://github.com/uapi-group/specifications/issues/74
> [5] https://github.com/systemd/systemd/issues/28538
> 


___
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec


Re: [PATCH v2 0/2] x86/kexec: UKI Support

2023-09-14 Thread Philipp Rudo
Hi Lennart,

On Thu, 14 Sep 2023 11:32:20 +0200
Lennart Poettering  wrote:

> On Mi, 13.09.23 16:00, Philipp Rudo (pr...@redhat.com) wrote:
> 
> > For example there are two definitions for the UKI which contradict each 
> > other.
> > The dedicated one [1] you have cited earlier and the one in the BLS for 
> > type #2
> > entries [2]. In [1] the .linux and .initrd sections are mandatory and the
> > .osrel and .cmdline sections are optional while in [2] it is the other way
> > round. Which definition should the kernel follow?
> >
> > Furthermore, I absolutely don't understand how the spec should be read. All
> > the spec does is defining some file formats. There is no word about which
> > component in the boot chain is supposed to handle them and what exactly this
> > component is supposed to do with it. But that is crucial if we want to add 
> > UKI
> > support for kexec as the kexec systemcall will replace the stub. So we need 
> > to
> > know what tasks the stub is supposed to perform. Currently this is only some
> > implementation detail of the systemd-stub [3] that can change any moment 
> > and I
> > strongly oppose to base any uapi on it.
> >
> > In the end the only benefit this series brings is to extend the signature
> > checking on the whole UKI except of just the kernel image. Everything else 
> > can
> > also be done in user space. Compared to the problems described above this 
> > is a
> > very small gain for me.
> >
> > Until the spec got fixed I don't see a chance to add UKI support for kexec. 
> >  
> 
> So that spec is initially just a generalization of what
> systemd-stub/systemd-boot/ukify does. The descrepancies between the
> cited specs mostly come from the that generalization. If you want to
> enumerate kernels and order them the ".osrel" stuff for example is
> necessary, hence the boot loader spec really wants it. If you don't
> care about the boot loader spec though and just want to register the
> kernel UKI PE directly in BootXXX efi vars for example, then there's
> no need to include .osrel. That all said we should certainly make the
> two specs align better, and clarify the situation. Suggestions/patches
> more than welcome.

Thanks for the explanation. I know that writing a spec isn't easy and
no matter how hard you try it will never be "perfect". That's why I
think it's important to have a proper Introduction at the beginning of
the spec that gives a context to the reader on the problem the spec is
trying to solve, the different use cases/environments considered
while writing the spec, the different components involved and how they
interact with each other, which limitations there are, etc.. The BLS
sort of has it (if you also consider the "Additional discussion") but
for the UKI the introduction basically boils down to "it's a file that
contains stuff", which isn't very helpful. Having such a context not
only makes it easier to understand where such contradictions come from
but also help to prevent problems where people from different
backgrounds interpret the spec differently because they look at it from
their context.

An other thing I don't understand is, why the Extension Images (I
assume describe the "Companion Files" in the systmd-stub man pages) are
a separate spec. With the initrd and cmdline being part of the UKI and
thus fixed you take away a lot of flexibility users have today. These
extensions bring back part of the flexibility which IMHO is needed by
general purpose distros as for them a simple one-size-fits-all solution
doesn't work. That's why for me both belong together and thus should be
described in the same spec.

The extensions are also a good example why you need to properly define
the different components and their responsibility. In a secureboot
environment these extensions need to be signed and verified during
boot. But wich component is responsible to check the signature? Is it
the stub? The kernel? or even the initrd? If you don't define that in
the spec you will eventually end up in situations where nobody checks
the signature because everybody is sure it's "someone else's job".

> Ultimately, I think a spec written as description with a single
> implementation in mind (i.e. systemd) is a generally a bad spec. Hence
> if kexec in the Linux kernel wants to add support for it, that'd be
> great but I'd see that as an opportunity to adjust the spec to the
> needs of the Linux kernel in this area, so that it reflects well more
> than just one backend implementation.

Fully agree. I must admit my first mail sounds pretty negative. But I
don't oppose the UKI. All I wanted to say that at the moment the spec
doesn't work for the kernel. But it can (and should) be fixed.

I

Re: [PATCH v2 0/2] x86/kexec: UKI Support

2023-09-13 Thread Philipp Rudo
Hi Jan,

All in all the code looks fine to me. Nevertheless I don't think UKI support
should be added at the moment. This is because IMHO you basically reinterpret
the kexec_file systemcall and thus add a new uapi to the kernel. Once
introduced it is extremely hard to remove or change an uapi again. The problem
I see is that the spec you based your work on is in such a poor shape that I
don't feel comfortable to add any new uapi based on it.

For example there are two definitions for the UKI which contradict each other.
The dedicated one [1] you have cited earlier and the one in the BLS for type #2
entries [2]. In [1] the .linux and .initrd sections are mandatory and the
.osrel and .cmdline sections are optional while in [2] it is the other way
round. Which definition should the kernel follow?

Furthermore, I absolutely don't understand how the spec should be read. All
the spec does is defining some file formats. There is no word about which
component in the boot chain is supposed to handle them and what exactly this
component is supposed to do with it. But that is crucial if we want to add UKI
support for kexec as the kexec systemcall will replace the stub. So we need to
know what tasks the stub is supposed to perform. Currently this is only some
implementation detail of the systemd-stub [3] that can change any moment and I
strongly oppose to base any uapi on it.

In the end the only benefit this series brings is to extend the signature
checking on the whole UKI except of just the kernel image. Everything else can
also be done in user space. Compared to the problems described above this is a
very small gain for me.

Until the spec got fixed I don't see a chance to add UKI support for kexec.

Thanks
Philipp

[1] https://uapi-group.org/specifications/specs/unified_kernel_image/
[2] 
https://uapi-group.org/specifications/specs/boot_loader_specification/#type-2-efi-unified-kernel-images
[3] https://www.freedesktop.org/software/systemd/man/systemd-stub.html

On Mon, 11 Sep 2023 07:25:33 +0200
Jan Hendrik Farr  wrote:

> Hello,
> 
> this patch (v2) implements UKI support for kexec_file_load. It will require
> support in the kexec-tools userspace utility. For testing purposes the
> following can be used: https://github.com/Cydox/kexec-test/
> 
> Creating UKIs for testing can be done with ukify (included in systemd),
> sbctl, and mkinitcpio, etc.
> 
> There has been discussion on this topic in an issue on GitHub that is linked
> below for reference.
> 
> Changes for v2:
> - .cmdline section is now optional
> - moving pefile_parse_binary is now in a separate commit for clarity
> - parse_pefile.c is now in /lib instead of arch/x86/kernel (not sure if
>   this is the best location, but it definetly shouldn't have been in an
>   architecture specific location)
> - parse_pefile.h is now in include/kernel instead of architecture
>   specific location
> - if initrd or cmdline is manually supplied EPERM is returned instead of
>   being silently ignored
> - formatting tweaks
> 
> 
> Some links:
> - Related discussion: https://github.com/systemd/systemd/issues/28538
> - Documentation of UKIs: 
> https://uapi-group.org/specifications/specs/unified_kernel_image/
> 
> Jan Hendrik Farr (2):
>   move pefile_parse_binary to its own file
>   x86/kexec: UKI support
> 
>  arch/x86/include/asm/kexec-uki.h   |   7 ++
>  arch/x86/kernel/Makefile   |   1 +
>  arch/x86/kernel/kexec-uki.c| 126 +
>  arch/x86/kernel/machine_kexec_64.c |   2 +
>  crypto/asymmetric_keys/mscode_parser.c |   2 +-
>  crypto/asymmetric_keys/verify_pefile.c | 110 +++--
>  crypto/asymmetric_keys/verify_pefile.h |  16 
>  include/linux/parse_pefile.h   |  32 +++
>  lib/Makefile   |   3 +
>  lib/parse_pefile.c | 109 +
>  10 files changed, 292 insertions(+), 116 deletions(-)
>  create mode 100644 arch/x86/include/asm/kexec-uki.h
>  create mode 100644 arch/x86/kernel/kexec-uki.c
>  create mode 100644 include/linux/parse_pefile.h
>  create mode 100644 lib/parse_pefile.c
> 


___
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec


Re: [RFC PATCH 0/4] kdump: add generic functions to simplify crashkernel crashkernel in architecture

2023-07-10 Thread Philipp Rudo
Hi Baoquan,
Hi Dave,

On Sat, 8 Jul 2023 10:15:53 +0800
Dave Young  wrote:

> On 06/19/23 at 01:59pm, Baoquan He wrote:
> > In the current arm64, crashkernel=,high support has been finished after
> > several rounds of posting and careful reviewing. The code in arm64 which
> > parses crashkernel kernel parameters firstly, then reserve memory can be
> > a good example for other ARCH to refer to.
> > 
> > Whereas in x86_64, the code mixing crashkernel parameter parsing and
> > memory reserving is twisted, and looks messy. Refactoring the code to
> > make it more readable maintainable is necessary.
> > 
> > Here, try to abstract the crashkernel parameter parsing code into a
> > generic function parse_crashkernel_generic(), and the crashkernel memory
> > reserving code into a generic function reserve_crashkernel_generic().
> > Then, in ARCH which crashkernel=,high support is needed, a simple
> > arch_reserve_crashkernel() can be added to call above two generic
> > functions. This can remove the duplicated implmentation code in each
> > ARCH, like arm64, x86_64.  
> 
> Hi Baoquan, the parse_crashkernel_common and parse_crashkernel_generic
> are confusion to me.  Thanks for the effort though.

I agree, having both parse_crashkernel_common and
parse_crashkernel_generic is pretty confusing.

> I'm not sure if it will be easy or not, but ideally I think the parse
> function can be arch independent, something like a general funtion
> parse_crashkernel() which can return the whole necessary infomation of
> crashkenrel for arch code to use, for example return like
> below pseudo stucture(just a concept, may need to think more):
> 
> structure crashkernel_range {
>   size,
>   range,
>   struct list_head list;
> }
> 
> structure crashkernel{
>   structure crashkernel_range *range_list;
>   union {
>   offset,
>   low_high
>   }
> }
> 
> So the arch code can just get the data of crashkernel and then check
> about the details, if it does not support low and high reservation then
> it can just ignore the option.
> 
> Thoughts?

Sounds reasonable. The only thing I would make different is for the
parser to take the systems ram into account and simply return the size
that needs to be allocated in case multiple ranges are given.

I've played around a little on how this might look like (probably
wasting way too much time on it) and came up with the patch below. With
that patch parse_crashkernel_{common,high,low} and friends could be
removed once all architectures are ported to the new parse_crashkernel
function.

Please note that I never tested the patch. So it probably doesn't even
compile. Nevertheless I hope it helps to get an idea on what I think
should work :-)

Thanks
Philipp



diff --git a/arch/arm64/mm/init.c b/arch/arm64/mm/init.c
index fb904cc57ab1..fd5d01872c53 100644
--- a/arch/arm64/mm/init.c
+++ b/arch/arm64/mm/init.c
@@ -66,22 +66,12 @@ phys_addr_t __ro_after_init arm64_dma_phys_limit;
 
 static void __init arch_reserve_crashkernel(void)
 {
-   unsigned long long low_size = 0;
-   unsigned long long crash_base, crash_size;
char *cmdline = boot_command_line;
-   bool high = false;
-   int ret;
 
if (!IS_ENABLED(CONFIG_KEXEC_CORE))
return;
 
-   ret = parse_crashkernel_generic(cmdline, _size, _base,
-   _size, );
-   if (ret)
-   return;
-
-   reserve_crashkernel_generic(cmdline, crash_size, crash_base,
-   low_size, high);
+   reserve_crashkernel_generic(cmdline);
 }
 
 /*
diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c
index b26221022283..4a78bf8ad494 100644
--- a/arch/x86/kernel/setup.c
+++ b/arch/x86/kernel/setup.c
@@ -486,28 +486,17 @@ unsigned long crash_low_size_default(void)
 
 static void __init arch_reserve_crashkernel(void)
 {
-   unsigned long long crash_base, crash_size, low_size = 0;
char *cmdline = boot_command_line;
-   bool high = false;
-   int ret;
 
if (!IS_ENABLED(CONFIG_KEXEC_CORE))
return;
 
-   ret = parse_crashkernel_generic(cmdline, _size, _base,
-   _size, );
-   if (ret)
-   return;
-
if (xen_pv_domain()) {
pr_info("Ignoring crashkernel for a Xen PV domain\n");
return;
}
 
-   reserve_crashkernel_generic(cmdline, crash_size, crash_base,
-   low_size, high);
-
-   return;
+   reserve_crashkernel_generic(cmdline);
 }
 
 static struct resource standard_io_resources[] = {
diff --git a/include/linux/crash_core.h b/include/linux/crash_core.h
index 1b12868cad1b..a1ebd6c47467 100644
--- a/include/linux/crash_core.h
+++ b/include/linux/crash_core.h
@@ -84,35 +84,25 @@ int parse_crashkernel_high(char *cmdline, unsigned long 
long system_ram,
 int parse_crashkernel_low(char *cmdline, unsigned long long system_ram,
   

Re: [PATCH v5 1/2] kexec: Support purgatories with .text.hot sections

2023-04-03 Thread Philipp Rudo
Hi Ricardo,

On Thu, 30 Mar 2023 11:44:47 +0200
Ricardo Ribalda  wrote:

> Clang16 links the purgatory text in two sections:
> 
>   [ 1] .text PROGBITS   0040
>11a1    AX   0 0 16
>   [ 2] .rela.textRELA   3498
>0648  0018   I  24 1 8
>   ...
>   [17] .text.hot.PROGBITS   3220
>020b    AX   0 0 1
>   [18] .rela.text.hot.   RELA   4428
>0078  0018   I  2417 8
> 
> And both of them have their range [sh_addr ... sh_addr+sh_size] on the
> area pointed by `e_entry`.
> 
> This causes that image->start is calculated twice, once for .text and
> another time for .text.hot. The second calculation leaves image->start
> in a random location.
> 
> Because of this, the system crashes immediately after:
> 
> kexec_core: Starting new kernel
> 
> Cc: sta...@vger.kernel.org
> Fixes: 930457057abe ("kernel/kexec_file.c: split up __kexec_load_puragory")
> Reviewed-by: Ross Zwisler 
> Signed-off-by: Ricardo Ribalda 
> ---
>  kernel/kexec_file.c | 14 +-
>  1 file changed, 13 insertions(+), 1 deletion(-)
> 
> diff --git a/kernel/kexec_file.c b/kernel/kexec_file.c
> index f1a0e4e3fb5c..c7a0e51a6d87 100644
> --- a/kernel/kexec_file.c
> +++ b/kernel/kexec_file.c
> @@ -901,10 +901,22 @@ static int kexec_purgatory_setup_sechdrs(struct 
> purgatory_info *pi,
>   }
>  
>   offset = ALIGN(offset, align);
> +
> + /*
> +  * Check if the segment contains the entry point, if so,
> +  * calculate the value of image->start based on it.
> +  * If the compiler has produced more than one .text section
> +  * (Eg: .text.hot), they are generally after the main .text
> +  * section, and they shall not be used to calculate
> +  * image->start. So do not re-calculate image->start if it
> +  * is not set to the initial value, and warn the user so they
> +  * have a chance to fix their purgatory's linker script.
> +  */
>   if (sechdrs[i].sh_flags & SHF_EXECINSTR &&
>   pi->ehdr->e_entry >= sechdrs[i].sh_addr &&
>   pi->ehdr->e_entry < (sechdrs[i].sh_addr
> -  + sechdrs[i].sh_size)) {
> +  + sechdrs[i].sh_size) &&
> + !WARN_ON(kbuf->image->start != pi->ehdr->e_entry)) {

Looks good to me. I'm not sure if it is better to use WARN_ON_ONCE to
avoid spamming the log when there are more than two .text sections or
when you reload the kernel. But that's only a rare corner case. So no
strong opinion from my side. Either way 

Reviewed-by: Philipp Rudo 

>   kbuf->image->start -= sechdrs[i].sh_addr;
>   kbuf->image->start += kbuf->mem + offset;
>   }
> 


___
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec


Re: [PATCH v3] kexec: Support purgatories with .text.hot sections

2023-04-03 Thread Philipp Rudo
Hi Ricardo,

sorry for the late reply...

On Mon, 27 Mar 2023 13:52:08 +0200
Ricardo Ribalda  wrote:

[...]

> 
> I tried removing the -r from arch/x86/purgatory/Makefile and that resulted 
> into:
> 
> [  115.631578] BUG: unable to handle page fault for address: 93224d5c8e20
> [  115.631583] #PF: supervisor write access in kernel mode
> [  115.631585] #PF: error_code(0x0002) - not-present page
> [  115.631586] PGD 10067 P4D 10067 PUD 1001ed067 PMD 132b58067 PTE 0
> [  115.631589] Oops: 0002 [#1] PREEMPT SMP NOPTI
> [  115.631592] CPU: 0 PID: 5291 Comm: kexec-lite Tainted: G U
>   5.15.103-17399-g852a928df601-dirty #19
> cd159e0d6a91f03e06035a0a8eb7fc984a8f3e82
> [  115.631594] Hardware name: Google Crota/Crota, BIOS
> Google_Crota.14505.288.0 11/08/2022
> [  115.631595] RIP: 0010:memcpy_erms+0x6/0x10
> [  115.631599] Code: 5d 00 eb bd eb 1e 0f 1f 00 48 89 f8 48 89 d1 48
> c1 e9 03 83 e2 07 f3 48 a5 89 d1 f3 a4 c3 cc cc cc cc 66 90 48 89 f8
> 48 89 d1  a4 c3 cc cc cc cc 0f 1f 00 48 89 f8 48 83 fa 20 72 7e 40
> 38 fe
> [  115.631601] RSP: 0018:93224f65fe50 EFLAGS: 00010246
> [  115.631602] RAX: 93224d5c8e20 RBX: ffea RCX: 
> 0100
> [  115.631603] RDX: 0100 RSI: 9322407bd000 RDI: 
> 93224d5c8e20
> [  115.631604] RBP: 93224f65fe88 R08:  R09: 
> 92133cd3ef08
> [  115.631605] R10: 9322407be000 R11: a1b4f2e0 R12: 
> 
> [  115.631606] R13: 92133cee4c00 R14: 0100 R15: 
> a2b6f14f
> [  115.631607] FS:  78e8b9dbf7c0() GS:92143780()
> knlGS:
> [  115.631609] CS:  0010 DS:  ES:  CR0: 80050033
> [  115.631610] CR2: 93224d5c8e20 CR3: 00015be26001 CR4: 
> 00770ef0
> [  115.631611] PKRU: 5554
> [  115.631612] Call Trace:
> [  115.631614]  
> [  115.631615]  kexec_purgatory_get_set_symbol+0x82/0xd3
> [  115.631619]  __se_sys_kexec_file_load+0x523/0x644
> [  115.631621]  do_syscall_64+0x58/0xa5
> [  115.631623]  entry_SYSCALL_64_after_hwframe+0x61/0xcb

Yeah, simply dropping -r doesn't work. You at least need to add -fPIE
to the CFLAGS. But probably you need more. When you go down this route
you really need to pay attention to some nasty details...

> And I did not continue in that direction.

That's totally fine.

Thanks
Philipp

> I also tried finding a flag for llvm that would avoid splitting .text,
> but was not lucky either.
> 
> I will look into making a linker script for x86, we could combine it
> with something like:
> 
> if (sechdrs[i].sh_flags & SHF_EXECINSTR &&
> pi->ehdr->e_entry >= sechdrs[i].sh_addr &&
> pi->ehdr->e_entry < (sechdrs[i].sh_addr
> -+ sechdrs[i].sh_size) &&
> -   kbuf->image->start == pi->ehdr->e_entry) {
> -   kbuf->image->start -= sechdrs[i].sh_addr;
> -   kbuf->image->start += kbuf->mem + offset;
> ++ sechdrs[i].sh_size)) {
> +   if (!WARN_ON(kbuf->image->start != 
> pi->ehdr->e_entry)) {
> +   kbuf->image->start -= sechdrs[i].sh_addr;
> +   kbuf->image->start += kbuf->mem + offset;
> +   }
> }
> 
> So developers have some hints of what to look at.
> 
> Thanks!
> 
> 
> >
> > Thanks
> > Philipp
> >  
> > >   kbuf->image->start -= sechdrs[i].sh_addr;
> > >   kbuf->image->start += kbuf->mem + offset;
> > >   }
> > >
> > > ---
> > > base-commit: 17214b70a159c6547df9ae204a6275d983146f6b
> > > change-id: 20230321-kexec_clang16-4510c23d129c
> > >
> > > Best regards,  
> >  
> 
> 


___
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec


Re: [PATCH v3] kexec: Support purgatories with .text.hot sections

2023-03-24 Thread Philipp Rudo
Hi Ricardo,

On Wed, 22 Mar 2023 20:09:21 +0100
Ricardo Ribalda  wrote:

> Clang16 links the purgatory text in two sections:
> 
>   [ 1] .text PROGBITS   0040
>11a1    AX   0 0 16
>   [ 2] .rela.textRELA   3498
>0648  0018   I  24 1 8
>   ...
>   [17] .text.hot.PROGBITS   3220
>020b    AX   0 0 1
>   [18] .rela.text.hot.   RELA   4428
>0078  0018   I  2417 8
> 
> And both of them have their range [sh_addr ... sh_addr+sh_size] on the
> area pointed by `e_entry`.
> 
> This causes that image->start is calculated twice, once for .text and
> another time for .text.hot. The second calculation leaves image->start
> in a random location.
> 
> Because of this, the system crashes inmediatly after:
> 
> kexec_core: Starting new kernel

Great analysis!

> Signed-off-by: Ricardo Ribalda 
> ---
> kexec: Fix kexec_file_load for llvm16
> 
> When upreving llvm I realised that kexec stopped working on my test
> platform. This patch fixes it.
> 
> To: Eric Biederman 
> Cc: Baoquan He 
> Cc: Philipp Rudo 
> Cc: kexec@lists.infradead.org
> Cc: linux-ker...@vger.kernel.org
> ---
> Changes in v3:
> - Fix initial value. Thanks Ross!
> - Link to v2: 
> https://lore.kernel.org/r/20230321-kexec_clang16-v2-0-d10e5d517...@chromium.org
> 
> Changes in v2:
> - Fix if condition. Thanks Steven!.
> - Update Philipp email. Thanks Baoquan.
> - Link to v1: 
> https://lore.kernel.org/r/20230321-kexec_clang16-v1-0-a768fc2c7...@chromium.org
> ---
>  kernel/kexec_file.c | 13 -
>  1 file changed, 12 insertions(+), 1 deletion(-)
> 
> diff --git a/kernel/kexec_file.c b/kernel/kexec_file.c
> index f1a0e4e3fb5c..25a37d8f113a 100644
> --- a/kernel/kexec_file.c
> +++ b/kernel/kexec_file.c
> @@ -901,10 +901,21 @@ static int kexec_purgatory_setup_sechdrs(struct 
> purgatory_info *pi,
>   }
>  
>   offset = ALIGN(offset, align);
> +
> + /*
> +  * Check if the segment contains the entry point, if so,
> +  * calculate the value of image->start based on it.
> +  * If the compiler has produced more than one .text sections
> +  * (Eg: .text.hot), they are generally after the main .text
> +  * section, and they shall not be used to calculate
> +  * image->start. So do not re-calculate image->start if it
> +  * is not set to the initial value.
> +  */
>   if (sechdrs[i].sh_flags & SHF_EXECINSTR &&
>   pi->ehdr->e_entry >= sechdrs[i].sh_addr &&
>   pi->ehdr->e_entry < (sechdrs[i].sh_addr
> -  + sechdrs[i].sh_size)) {
> +  + sechdrs[i].sh_size) &&
> + kbuf->image->start == pi->ehdr->e_entry) {

I'm not entirely sure if this is the solution to go with. As you state
in the comment above this solution assumes that the .text section comes
before any other .text.* section. But this assumption isn't much
stronger than the assumption that there is only a single .text section,
which is used nowadays.

The best solution I can come up with right now is to introduce a linker
script for the purgatory that simply merges the .text sections into
one. Similar to what I did for s390 in
arch/s390/purgatory/purgatory.lds.S (although for a different reason).
But that would require every architecture to get one. An alternative
would be to find a way to get rid of the -r option on the LD_FLAGS,
which IIRC is the reason why both section overlap in the first place.

Thanks
Philipp

>   kbuf->image->start -= sechdrs[i].sh_addr;
>   kbuf->image->start += kbuf->mem + offset;
>   }
> 
> ---
> base-commit: 17214b70a159c6547df9ae204a6275d983146f6b
> change-id: 20230321-kexec_clang16-4510c23d129c
> 
> Best regards,


___
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec


Re: [PATCH 1/2] kexec: Default __NR_kexec_file_load is set to undefined

2023-03-03 Thread Philipp Rudo
Hi Baoquan,

On Tue, 28 Feb 2023 21:32:26 +0800
Baoquan He  wrote:

> Hi Philipp,
> 
> On 02/27/23 at 04:19pm, Philipp Rudo wrote:
> .. 
> > > diff --git a/kexec/kexec-syscall.h b/kexec/kexec-syscall.h
> > > index be6ccd5..ea77936 100644
> > > --- a/kexec/kexec-syscall.h
> > > +++ b/kexec/kexec-syscall.h
> > > @@ -59,9 +59,7 @@
> > >  #endif
> > >  #endif /*ifndef __NR_kexec_load*/
> > >  
> > > -#ifdef __arm__
> > >  #undef __NR_kexec_file_load
> > > -#endif
> > >  
> > >  #ifndef __NR_kexec_file_load  
> > 
> > I don't think this will work as intended. 
> > 
> > On the top of the file sys/syscall.h gets included. In there
> > architectures that support kexec_file_load define __NR_kexec_file_load.
> > This also means that if an architecture doesn't support kexec_file_load
> > __NR_kexec_file_load shouldn't be defined in the first place. Thus I
> > suggest that you find out why sys/syscall.h defines
> > __NR_kexec_file_load for LoongArch even when the system call is not
> > supported and fix it there.  
> 
> Checking whether LoongArch has defined __NR_kexec_file_load sounds a
> good suggestion. Wondering why we still need add __NR_kexec_file_load
> definition in kexec-tools. E.g below s390 kexec_file support you added.
>  sometime won't be found or __NR_kexec_file_load could be
> not defined yet?
> 
> commit d4a948c268272cf37c71be820fb02bf40e56292b
> Author: Philipp Rudo 
> Date:   Wed May 16 14:27:18 2018 +0200
> 
> kexec/s390: Add support for kexec_file_load
> 

To be honest I don't remember why I have added it back then. Most
likely I simply copied what others have done before.

The benefit in having the extra definition I see is that you don't need
to rebuild glibc when you implement the syscall and don't use the
generic unistd.h. But that is something only few people should ever
encounter.

Thanks
Philipp


___
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec


Re: [PATCH 1/2] kexec: Default __NR_kexec_file_load is set to undefined

2023-03-03 Thread Philipp Rudo
Hi Youling,

On Tue, 28 Feb 2023 10:02:16 +0800
Youling Tang  wrote:

> Hi, Philipp
> 
> On 02/27/2023 11:19 PM, Philipp Rudo wrote:
> > Hi Youling,
> >
> > On Fri, 24 Feb 2023 17:51:07 +0800
> > Youling Tang  wrote:
> >  
> >> The initial reason is that after the merger of 29fe5067ed07
> >> ("kexec: make -a the default"), kexec cannot be used on LoongArch,
> >> MIPS .etc architectures. We need to add "-c" for normal use. The
> >> current kexec_file_load system call is not implemented in
> >> architectures such as LoongArch, so it needs to pass kexec_load.
> >> So we need to set __NR_kexec_file_load to undefined in unsupported
> >> architectures. This will return EFALLBACK via 
> >> is_kexec_file_load_implemented,
> >> and then via kexec_load.
> >>
> >> Signed-off-by: Youling Tang 
> >> ---
> >>  kexec/kexec-syscall.h | 2 --
> >>  1 file changed, 2 deletions(-)
> >>
> >> diff --git a/kexec/kexec-syscall.h b/kexec/kexec-syscall.h
> >> index be6ccd5..ea77936 100644
> >> --- a/kexec/kexec-syscall.h
> >> +++ b/kexec/kexec-syscall.h
> >> @@ -59,9 +59,7 @@
> >>  #endif
> >>  #endif /*ifndef __NR_kexec_load*/
> >>
> >> -#ifdef __arm__
> >>  #undef __NR_kexec_file_load
> >> -#endif
> >>
> >>  #ifndef __NR_kexec_file_load  
> >
> > I don't think this will work as intended.  
> 
> Works fine in LoongArch after applying this patch.

I believe you. The problem is that it changes the behavior on other
architectures.

> >
> > On the top of the file sys/syscall.h gets included. In there
> > architectures that support kexec_file_load define __NR_kexec_file_load.
> > This also means that if an architecture doesn't support kexec_file_load
> > __NR_kexec_file_load shouldn't be defined in the first place. Thus I
> > suggest that you find out why sys/syscall.h defines
> > __NR_kexec_file_load for LoongArch even when the system call is not
> > supported and fix it there.  
> Yes, in the kernel, LoongArch uses the generic syscall number (in
> include/uapi/asm-generic/unistd.h), so __NR_kexec_file_load is
> defined as 294.

Ok, I see. I have expected that it is wrapped in some ifdef
CONFIG_KEXEC_FILE. But apparently that is not the case. Thanks for the
clarification.

> I think the simpler way is to modify it in kexec-tools as follows,
> #ifdef __loongarch__
> #undef __NR_kexec_file_load
> #endif

Yeah, looks like this is the best solution. Only thing you could do is
to merge it with the arm special handling. I.e. so it reads

#if defined(__arm__) || defined(__longarch__)
#undef __NR_kexec_file_load
#endif

But that is more a matter of taste. So you and Simon need to agree what
you like more.

Thanks
Philipp


___
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec


Re: [PATCH 1/2] kexec: Default __NR_kexec_file_load is set to undefined

2023-02-27 Thread Philipp Rudo
Hi Youling,

On Fri, 24 Feb 2023 17:51:07 +0800
Youling Tang  wrote:

> The initial reason is that after the merger of 29fe5067ed07
> ("kexec: make -a the default"), kexec cannot be used on LoongArch,
> MIPS .etc architectures. We need to add "-c" for normal use. The
> current kexec_file_load system call is not implemented in
> architectures such as LoongArch, so it needs to pass kexec_load.
> So we need to set __NR_kexec_file_load to undefined in unsupported
> architectures. This will return EFALLBACK via is_kexec_file_load_implemented,
> and then via kexec_load.
> 
> Signed-off-by: Youling Tang 
> ---
>  kexec/kexec-syscall.h | 2 --
>  1 file changed, 2 deletions(-)
> 
> diff --git a/kexec/kexec-syscall.h b/kexec/kexec-syscall.h
> index be6ccd5..ea77936 100644
> --- a/kexec/kexec-syscall.h
> +++ b/kexec/kexec-syscall.h
> @@ -59,9 +59,7 @@
>  #endif
>  #endif /*ifndef __NR_kexec_load*/
>  
> -#ifdef __arm__
>  #undef __NR_kexec_file_load
> -#endif
>  
>  #ifndef __NR_kexec_file_load

I don't think this will work as intended. 

On the top of the file sys/syscall.h gets included. In there
architectures that support kexec_file_load define __NR_kexec_file_load.
This also means that if an architecture doesn't support kexec_file_load
__NR_kexec_file_load shouldn't be defined in the first place. Thus I
suggest that you find out why sys/syscall.h defines
__NR_kexec_file_load for LoongArch even when the system call is not
supported and fix it there.

Thanks
Philipp 


___
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec


Re: Unified Kernel Image Support

2023-02-06 Thread Philipp Rudo
Hi,

On Mon, 6 Feb 2023 17:19:33 +0800
Baoquan He  wrote:

> On 02/03/23 at 10:46pm, Yishen Miao wrote:
> > Hello all,
> > 
> > I am experimenting kexec on my box. It uses systemd-boot as the bootloader 
> > and boots from a unified kernel image (objcopy'ed cmdline, kernel, 
> > initrdramfs, and microcode updates). As of kexec-tools 2.0.25 and systemd 
> > 252.5, when I rum systemctl kexec, it returns the following:
> > 
> > 
> > # sudo systemctl kexec
> > Running /usr/bin/kexec --load "/efi/EFI/Linux/ArchLinux-linux.efi" --append 
> > "root=UUID=----"(null)
> > Cannot determine the file type of /efi/EFI/Linux/ArchLinux-linux.efi
> > 
> > It seems that systemctl successfully identified the UKI from systemd-boot, 
> > however, kexec could not recognize it.
> > 
> > Are there any plans to add UKI support to kexec? Your response is greatly 
> > appreciated!  
> 
> My colleageus mentioned UKI recently. We have plan to support it, while
> haven't started to work on that.

I've looked into UKI recently. In order to provide some base support
one should only need to teach kexec_file_load the new file format [1].
However that still leaves two fundamental issues which limit the
usefulness of that support.

1) The way I understand it the initrd contained in the UKI is only a
stub that is supposed to read further "modules" from disk which
together form the initrd needed for the given hardware/system
configuration. The problem is that the kexec_file_load syscall only
accepts one fd for the kernel and one fd for the initrd. So to support
multiple modules we would either need to introduce a new syscall or
define a uABI that allows to pass multiple initrds via this one fd.
Either way it's a new user interface and should be designed with care.

2) As the kernel command line is part of the UKI and is protected by
the signature it cannot be changed by users. So to support kdump with a
UKI a distro would need to find one crashkernel= parameter that works
for all users which is impossible. Thus before kdump can be supported
with UKI there needs to be a  mechanism in place that allows users to
edit the command line. Others have the same problem. There is an open
issue on github [2] to add this support.

So all in all I believe there will be kexec support for UKI but I don't
see it to come anytime soon.

Thanks
Philipp

[1] ...and kexec-tools if you like to support kexec_load. But as the
main use case for UKI is together with Secure Boot I don't think that's
necessary.

[2] https://github.com/systemd/systemd/issues/24539

> I have a testing machine at hand right now, just finished teseting
> upstream patches. If you have the detailed steps about how to make UKI,
> privately or publicly, I can try it now, and see what we can do.


___
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec


Re: [PATCH] kexec: Enable runtime allocation of crash_image

2022-11-29 Thread Philipp Rudo
Ricardo,

On Mon, 28 Nov 2022 18:07:06 +0100
Ricardo Ribalda  wrote:

> Hi Philipp
> 
> 
> Thanks for your review.
> 
> 
> On Mon, 28 Nov 2022 at 18:00, Philipp Rudo  wrote:
> >
> > Hi Ricardo,
> >
> > On Thu, 24 Nov 2022 23:23:36 +0100
> > Ricardo Ribalda  wrote:
> >  
> > > Usually crash_image is defined statically via the crashkernel parameter
> > > or DT.
> > >
> > > But if the crash kernel is not used, or is smaller than then
> > > area pre-allocated that memory is wasted.
> > >
> > > Also, if the crash kernel was not defined at bootime, there is no way to
> > > use the crash kernel.
> > >
> > > Enable runtime allocation of the crash_image if the crash_image is not
> > > defined statically. Following the same memory allocation/validation path
> > > that for the reboot kexec kernel.
> > >
> > > Signed-off-by: Ricardo Ribalda   
> >
> > I don't think this patch will work as intended. For one you omit
> > setting the image->type to KEXEC_TYPE_CRASH. But when you grep for that
> > type you will find that there is a lot of special handling done for it.
> > I don't believe that this can simply be skipped without causing
> > problems.
> >
> > Furthermore I think you have missed one important detail. The memory
> > reserved for the crash kernel is not just a buffer for the image but
> > the memory it runs in! For that it has to be a continuous piece of
> > physical memory with usually some additional arch specific limitations.
> > When allocated dynamically all those limitations need to be considered.
> > But a standard kexec doesn't care about those limitations as it doesn't
> > care about the os running before itself. It can simply overwrite the
> > memory when booting. But if the crash kernel does the same it will
> > corrupt the dump it is supposed to generate.  
> 
> Right now, I do not intend to use it to fetch a kdump, I am using it
> as the image that will run when the system crashes.

the crash_image is currently all about creating a dump. If you want to
change that you need to discuss the new behavior in the commit message!
Please update the commit message.

Thanks
Philipp

> 
> It seems to work fine on the two devices that I am using for tests.
> 
> >
> > Thanks
> > Philipp
> >  
> > > ---
> > > kexec: Enable runtime allocation of crash_image
> > >
> > > To: Eric Biederman 
> > > Cc: kexec@lists.infradead.org
> > > Cc: linux-ker...@vger.kernel.org
> > > Cc: Sergey Senozhatsky 
> > > Cc: linux-ker...@vger.kernel.org
> > > Cc: Ross Zwisler 
> > > Cc: Philipp Rudo 
> > > Cc: Baoquan He 
> > > ---
> > >  include/linux/kexec.h | 1 +
> > >  kernel/kexec.c| 9 +
> > >  kernel/kexec_core.c   | 5 +
> > >  kernel/kexec_file.c   | 7 ---
> > >  4 files changed, 15 insertions(+), 7 deletions(-)
> > >
> > > diff --git a/include/linux/kexec.h b/include/linux/kexec.h
> > > index 41a686996aaa..98ca9a32bc8e 100644
> > > --- a/include/linux/kexec.h
> > > +++ b/include/linux/kexec.h
> > > @@ -427,6 +427,7 @@ extern int kexec_load_disabled;
> > >  extern bool kexec_in_progress;
> > >
> > >  int crash_shrink_memory(unsigned long new_size);
> > > +bool __crash_memory_valid(void);
> > >  ssize_t crash_get_memory_size(void);
> > >
> > >  #ifndef arch_kexec_protect_crashkres
> > > diff --git a/kernel/kexec.c b/kernel/kexec.c
> > > index cb8e6e6f983c..b5c17db25e88 100644
> > > --- a/kernel/kexec.c
> > > +++ b/kernel/kexec.c
> > > @@ -28,7 +28,7 @@ static int kimage_alloc_init(struct kimage **rimage, 
> > > unsigned long entry,
> > >   struct kimage *image;
> > >   bool kexec_on_panic = flags & KEXEC_ON_CRASH;
> > >
> > > - if (kexec_on_panic) {
> > > + if (kexec_on_panic && __crash_memory_valid()) {
> > >   /* Verify we have a valid entry point */
> > >   if ((entry < phys_to_boot_phys(crashk_res.start)) ||
> > >   (entry > phys_to_boot_phys(crashk_res.end)))
> > > @@ -44,7 +44,7 @@ static int kimage_alloc_init(struct kimage **rimage, 
> > > unsigned long entry,
> > >   image->nr_segments = nr_segments;
> > >   memcpy(image->segment, segments, nr_segments * sizeof(*segments));
> > >
> > > - if (kexec_on_panic) {
> > > + if (kexec_on_panic 

Re: [PATCH v1 2/2] kexec: Introduce kexec_reboot_disabled

2022-11-29 Thread Philipp Rudo
Hi Steven,

On Mon, 28 Nov 2022 11:42:00 -0500
Steven Rostedt  wrote:

> On Thu, 24 Nov 2022 16:01:15 +0100
> Philipp Rudo  wrote:
> 
> > No, I think the implementation is fine. I'm currently only struggling
> > to understand what problem kexec_reboot_disabled solves that cannot be
> > solved by kexec_load_disabled.  
> 
> Hi Philipp,
> 
> Thanks for working with us on this.
> 
> Let me try to explain our use case. We want kexec/kdump enabled, but we
> really do not want kexec used for any other purpose. We must have the kexec
> kernel loaded at boot up and not afterward.
> 
> Your recommendation of:
> 
>   kexec -p dump_kernel
>   echo 1 > /proc/sys/kernel/kexec_load_disabled
> 
> can work, and we will probably add it. But we are taking the paranoid
> approach, and what I learned in security 101 ;-) and that is, only open up
> the minimal attack surface as possible.

Well that's sort of my problem. When you go all in on paranoia I would
expect that you also restrict the crash kernel. Otherwise you keep most
of the attack surface. But disabling 'reboot' of the crash kernel is
quite intrusive and probably not what you want. That's why I think it
is better do the restriction on the 'load' rather than the 'reboot'
path.

One solution for that is the script above. But that pushes all the
responsibilities concerning syncing and error handling to the sysadmin.
Depending on your level of paranoia that might be too risky. Personally
I think it's fine as the kernel alone cannot fix all potential security
problems. In my opinion this has to be done in the layer that is
responsible for the task done. So when a security problem arises due to
ill syncing when starting different services it's the job of the init
system to fix the issue.

An alternative approach and sort of compromise I see is to convert
kexec_load_disabled from a simple on/off switch to a counter on how
often a kexec load can be made (in practice a tristate on/off/one-shot
should be sufficient). Ideally the reboot and panic path will
have separate counters. With that you could for example use
kexec_load_limit.reboot=0 and kexec_load_limit.panic=1 to disable the
load of images for reboot while still allow to load a crash kernel
once. With this you have the flexibility you need while also preventing
a race where an attacker overwrites your crash kernel before you can
toggle the switch. What do you think?

> Yes, it's highly unlikely that the above would crash. But as with most
> security vulnerabilities, it's not going to be an attacker that creates a
> new gadget here, but probably another script in the future that causes this
> to be delayed or something, and a new window of opportunity will arise for
> an attacker. Maybe, that new window only works for non panic kernels. Yes,
> this is a contrived scenario, but the work vs risk is very low in adding
> this feature.

True, but that problem is not limited to userspace. For example see
Ricardos other patch [1] where he treats the load of a crash kernel
just like a standard load. In my opinion he creates such a gadget in
that patch.

[1] 
https://lore.kernel.org/all/20221124-kexec-noalloc-v1-0-d78361e99...@chromium.org/

Thanks
Philipp

> Perhaps the attack surface that a reboot kexec could be, is that the
> attacker gets the ability at boot up to load the kexec for reboot and not 
> panic.
> Then the attack must wait for the victim to reboot their machine before
> they have access to the new kernel. Again, I admit this is contrived, but
> just because I can't think of a real situation that this could be a problem
> doesn't mean that one doesn't exist.
> 
> In other words, if we never want to allow a kexec reboot, why allow it at
> all from the beginning? The above allows it, until we don't. That alone
> makes us nervous. Whereas this patch is rather trivial and doesn't add
> complexity.
> 
> Thanks for your time, we appreciate it.
> 
> -- Steve
> 
> ___
> kexec mailing list
> kexec@lists.infradead.org
> http://lists.infradead.org/mailman/listinfo/kexec
> 


___
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec


Re: [PATCH] kexec: Enable runtime allocation of crash_image

2022-11-28 Thread Philipp Rudo
Hi Ricardo,

On Thu, 24 Nov 2022 23:23:36 +0100
Ricardo Ribalda  wrote:

> Usually crash_image is defined statically via the crashkernel parameter
> or DT.
> 
> But if the crash kernel is not used, or is smaller than then
> area pre-allocated that memory is wasted.
> 
> Also, if the crash kernel was not defined at bootime, there is no way to
> use the crash kernel.
> 
> Enable runtime allocation of the crash_image if the crash_image is not
> defined statically. Following the same memory allocation/validation path
> that for the reboot kexec kernel.
> 
> Signed-off-by: Ricardo Ribalda 

I don't think this patch will work as intended. For one you omit
setting the image->type to KEXEC_TYPE_CRASH. But when you grep for that
type you will find that there is a lot of special handling done for it.
I don't believe that this can simply be skipped without causing
problems.

Furthermore I think you have missed one important detail. The memory
reserved for the crash kernel is not just a buffer for the image but
the memory it runs in! For that it has to be a continuous piece of
physical memory with usually some additional arch specific limitations.
When allocated dynamically all those limitations need to be considered.
But a standard kexec doesn't care about those limitations as it doesn't
care about the os running before itself. It can simply overwrite the
memory when booting. But if the crash kernel does the same it will
corrupt the dump it is supposed to generate.

Thanks
Philipp

> ---
> kexec: Enable runtime allocation of crash_image
> 
> To: Eric Biederman 
> Cc: kexec@lists.infradead.org
> Cc: linux-ker...@vger.kernel.org
> Cc: Sergey Senozhatsky 
> Cc: linux-ker...@vger.kernel.org
> Cc: Ross Zwisler 
> Cc: Philipp Rudo 
> Cc: Baoquan He 
> ---
>  include/linux/kexec.h | 1 +
>  kernel/kexec.c| 9 +
>  kernel/kexec_core.c   | 5 +
>  kernel/kexec_file.c   | 7 ---
>  4 files changed, 15 insertions(+), 7 deletions(-)
> 
> diff --git a/include/linux/kexec.h b/include/linux/kexec.h
> index 41a686996aaa..98ca9a32bc8e 100644
> --- a/include/linux/kexec.h
> +++ b/include/linux/kexec.h
> @@ -427,6 +427,7 @@ extern int kexec_load_disabled;
>  extern bool kexec_in_progress;
>  
>  int crash_shrink_memory(unsigned long new_size);
> +bool __crash_memory_valid(void);
>  ssize_t crash_get_memory_size(void);
>  
>  #ifndef arch_kexec_protect_crashkres
> diff --git a/kernel/kexec.c b/kernel/kexec.c
> index cb8e6e6f983c..b5c17db25e88 100644
> --- a/kernel/kexec.c
> +++ b/kernel/kexec.c
> @@ -28,7 +28,7 @@ static int kimage_alloc_init(struct kimage **rimage, 
> unsigned long entry,
>   struct kimage *image;
>   bool kexec_on_panic = flags & KEXEC_ON_CRASH;
>  
> - if (kexec_on_panic) {
> + if (kexec_on_panic && __crash_memory_valid()) {
>   /* Verify we have a valid entry point */
>   if ((entry < phys_to_boot_phys(crashk_res.start)) ||
>   (entry > phys_to_boot_phys(crashk_res.end)))
> @@ -44,7 +44,7 @@ static int kimage_alloc_init(struct kimage **rimage, 
> unsigned long entry,
>   image->nr_segments = nr_segments;
>   memcpy(image->segment, segments, nr_segments * sizeof(*segments));
>  
> - if (kexec_on_panic) {
> + if (kexec_on_panic && __crash_memory_valid()) {
>   /* Enable special crash kernel control page alloc policy. */
>   image->control_page = crashk_res.start;
>   image->type = KEXEC_TYPE_CRASH;
> @@ -101,7 +101,7 @@ static int do_kexec_load(unsigned long entry, unsigned 
> long nr_segments,
>  
>   if (flags & KEXEC_ON_CRASH) {
>   dest_image = _crash_image;
> - if (kexec_crash_image)
> + if (kexec_crash_image && __crash_memory_valid())
>   arch_kexec_unprotect_crashkres();
>   } else {
>   dest_image = _image;
> @@ -157,7 +157,8 @@ static int do_kexec_load(unsigned long entry, unsigned 
> long nr_segments,
>   image = xchg(dest_image, image);
>  
>  out:
> - if ((flags & KEXEC_ON_CRASH) && kexec_crash_image)
> + if ((flags & KEXEC_ON_CRASH) && kexec_crash_image &&
> + __crash_memory_valid())
>   arch_kexec_protect_crashkres();
>  
>   kimage_free(image);
> diff --git a/kernel/kexec_core.c b/kernel/kexec_core.c
> index ca2743f9c634..77083c9760fb 100644
> --- a/kernel/kexec_core.c
> +++ b/kernel/kexec_core.c
> @@ -1004,6 +1004,11 @@ void crash_kexec(struct pt_regs *regs)
>   }
>  }
>  
> +bool __crash_memory_valid(void)
> +{
> + return crashk_res.end != crashk_res.start;

Re: [PATCH v1 2/2] kexec: Introduce kexec_reboot_disabled

2022-11-28 Thread Philipp Rudo
Hi Ricardo,

On Thu, 24 Nov 2022 23:32:34 +0100
Ricardo Ribalda  wrote:

> Hi Philipp
> 
> 
> On Thu, 24 Nov 2022 at 16:01, Philipp Rudo  wrote:
> >
> > On Thu, 24 Nov 2022 13:52:58 +0100
> > Ricardo Ribalda  wrote:
> >  
> > > On Thu, 24 Nov 2022 at 12:40, Philipp Rudo  wrote:  
> > > >
> > > > Hi Ricardo,
> > > >
> > > > On Wed, 23 Nov 2022 09:58:08 +0100
> > > > Ricardo Ribalda  wrote:
> > > >  
> > > > > Hi Philipp
> > > > >
> > > > > Thanks for your review.
> > > > >
> > > > > My scenario is a trusted system, where even if you are root, your
> > > > > access to the system is very limited.
> > > > >
> > > > > Let's assume LOADPIN and verity are enabled.  
> > > >
> > > > My point is that on such systems I expect that a sysadmin also wants to
> > > > control the crash kernel including its initramfs (which also has to be 
> > > > part
> > > > of the signed kernel?). But if that's the case a sysadmin can simply arm
> > > > kdump early during boot and then toggle kexec_load_disabled. With that
> > > > LINUX_REBOOT_CMD_KEXEC also gets disabled as no kexec kernel can be 
> > > > loaded
> > > > while kdump works. Thus there is no need to add the new interface. Or am
> > > > I missing anything?  
> > >
> > > Let's say that you have a script that does something like this
> > >
> > >
> > > kexec -p dump_kernel
> > > echo 1 > /proc/sys/kernel/kexec_load_disabled
> > >
> > > If an attacker can DDos the system and make that script crash... then
> > > kexec is still accessible
> > >
> > > On the other hand, if you load the kernel with the commandline
> > >
> > > sysctl.kernel.kexec_load_disabled=1  
> >   
> >   reboot?  
> 
> yes :)  thanks!
> 
> > Otherwise you shouldn't be able to load the crash kernel at all.
> >  
> > > Then even if the script crashes, the only way to abuse kexec is by
> > > panicing the running kernel  
> >
> > True. But  when an attacker can DDos the system the final workload is
> > already running. So wouldn't it be enough to make sure that the script
> > above has finished before starting you workload. E.g. by setting an
> > appropriate Before=/After= in the systemd.unit?  
> 
> What if the kexec binary crashes and the unit will never succeed?

Then there are options like OnFailure= or FailureAction= the sysadmin
can use do what ever he seems appropriate.

> Or worse, your distro does not use systemd !!!

In that case there are other ways to achieve the same. The two options
are only examples. Why can't this be used as a replacement?

> > Furthermore, I don't think that restricting kexec reboot alone is
> > sufficient when the attacker can still control the crash kernel. At
> > least my assumption is that triggering a panic instead of just
> > rebooting is just a mild inconvenience for somebody who is able to pull
> > off an attack like that.  
> 
> The attacker does not control the crash kernel completely. loadpin is
> still in place.
> Yes, they can downgrade the whole system to a vulnerable kernel image.
> But the choices are limited :)
> 
> With physical access to the device panicing a kernel is easily doable
> (but not trivial). But remotely, it is more challenging.


Well the same holds for kexec. So the only difference is triggering the
panic where I'm still not convinced it's a huge obstacle for someone
who is able to pull off all the steps before for such an attack.

To be honest I don't think we make a progress here at the moment. I
would like to hear from others what they think about this.

Thanks
Philipp

> 
> >  
> > > Would it make you more comfortable if I model this as a kernel config
> > > instead of a runtime option?  
> >
> > No, I think the implementation is fine. I'm currently only struggling
> > to understand what problem kexec_reboot_disabled solves that cannot be
> > solved by kexec_load_disabled.
> >  
> > > Thanks!  
> >
> > Happy to help.
> >
> > Thanks
> > Philipp
> >  
> > >
> > >  
> > > >
> > > > Thanks
> > > > Philipp
> > > >  
> > > > >
> > > > > On Mon, 21 Nov 2022 at 15:10, Philipp Rudo  wrote:  
> > > > > >
> > > > > > Hi Ricardo,
> > > > > >
>

Re: [PATCH v1 2/2] kexec: Introduce kexec_reboot_disabled

2022-11-24 Thread Philipp Rudo
On Thu, 24 Nov 2022 13:52:58 +0100
Ricardo Ribalda  wrote:

> On Thu, 24 Nov 2022 at 12:40, Philipp Rudo  wrote:
> >
> > Hi Ricardo,
> >
> > On Wed, 23 Nov 2022 09:58:08 +0100
> > Ricardo Ribalda  wrote:
> >  
> > > Hi Philipp
> > >
> > > Thanks for your review.
> > >
> > > My scenario is a trusted system, where even if you are root, your
> > > access to the system is very limited.
> > >
> > > Let's assume LOADPIN and verity are enabled.  
> >
> > My point is that on such systems I expect that a sysadmin also wants to
> > control the crash kernel including its initramfs (which also has to be part
> > of the signed kernel?). But if that's the case a sysadmin can simply arm
> > kdump early during boot and then toggle kexec_load_disabled. With that
> > LINUX_REBOOT_CMD_KEXEC also gets disabled as no kexec kernel can be loaded
> > while kdump works. Thus there is no need to add the new interface. Or am
> > I missing anything?  
> 
> Let's say that you have a script that does something like this
> 
> 
> kexec -p dump_kernel
> echo 1 > /proc/sys/kernel/kexec_load_disabled
> 
> If an attacker can DDos the system and make that script crash... then
> kexec is still accessible
> 
> On the other hand, if you load the kernel with the commandline
> 
> sysctl.kernel.kexec_load_disabled=1
  
  reboot?
Otherwise you shouldn't be able to load the crash kernel at all.

> Then even if the script crashes, the only way to abuse kexec is by
> panicing the running kernel

True. But  when an attacker can DDos the system the final workload is
already running. So wouldn't it be enough to make sure that the script
above has finished before starting you workload. E.g. by setting an
appropriate Before=/After= in the systemd.unit?

Furthermore, I don't think that restricting kexec reboot alone is
sufficient when the attacker can still control the crash kernel. At
least my assumption is that triggering a panic instead of just
rebooting is just a mild inconvenience for somebody who is able to pull
off an attack like that.

> Would it make you more comfortable if I model this as a kernel config
> instead of a runtime option?

No, I think the implementation is fine. I'm currently only struggling
to understand what problem kexec_reboot_disabled solves that cannot be
solved by kexec_load_disabled.

> Thanks!

Happy to help.

Thanks
Philipp

> 
> 
> >
> > Thanks
> > Philipp
> >  
> > >
> > > On Mon, 21 Nov 2022 at 15:10, Philipp Rudo  wrote:  
> > > >
> > > > Hi Ricardo,
> > > >
> > > > On Thu, 17 Nov 2022 16:15:07 +0100
> > > > Ricardo Ribalda  wrote:
> > > >  
> > > > > Hi Philipp
> > > > >
> > > > > Thanks for your review!  
> > > >
> > > > happy to help.
> > > >  
> > > > >
> > > > > On Thu, 17 Nov 2022 at 16:07, Philipp Rudo  wrote:  
> > > > > >
> > > > > > Hi Ricardo,
> > > > > >
> > > > > > all in all I think this patch makes sense. However, there is one 
> > > > > > point
> > > > > > I don't like...
> > > > > >
> > > > > > On Mon, 14 Nov 2022 14:18:39 +0100
> > > > > > Ricardo Ribalda  wrote:
> > > > > >  
> > > > > > > Create a new toogle that disables LINUX_REBOOT_CMD_KEXEC, 
> > > > > > > reducing the
> > > > > > > attack surface to a system.
> > > > > > >
> > > > > > > Without this toogle, an attacker can only reboot into a different 
> > > > > > > kernel
> > > > > > > if they can create a panic().
> > > > > > >
> > > > > > > Signed-off-by: Ricardo Ribalda 
> > > > > > >
> > > > > > > diff --git a/Documentation/admin-guide/sysctl/kernel.rst 
> > > > > > > b/Documentation/admin-guide/sysctl/kernel.rst
> > > > > > > index 97394bd9d065..25d019682d33 100644
> > > > > > > --- a/Documentation/admin-guide/sysctl/kernel.rst
> > > > > > > +++ b/Documentation/admin-guide/sysctl/kernel.rst
> > > > > > > @@ -462,6 +462,17 @@ altered.
> > > > > > >  Generally used together with the `modules_disabled`_ sysctl.
> > > > > > >
> > > > > > >
> > > > > > > +kexec_reboot

Re: [PATCH v1 2/2] kexec: Introduce kexec_reboot_disabled

2022-11-24 Thread Philipp Rudo
Hi Ricardo,

On Wed, 23 Nov 2022 09:58:08 +0100
Ricardo Ribalda  wrote:

> Hi Philipp
> 
> Thanks for your review.
> 
> My scenario is a trusted system, where even if you are root, your
> access to the system is very limited.
> 
> Let's assume LOADPIN and verity are enabled.

My point is that on such systems I expect that a sysadmin also wants to
control the crash kernel including its initramfs (which also has to be part
of the signed kernel?). But if that's the case a sysadmin can simply arm
kdump early during boot and then toggle kexec_load_disabled. With that
LINUX_REBOOT_CMD_KEXEC also gets disabled as no kexec kernel can be loaded
while kdump works. Thus there is no need to add the new interface. Or am
I missing anything?

Thanks
Philipp

> 
> On Mon, 21 Nov 2022 at 15:10, Philipp Rudo  wrote:
> >
> > Hi Ricardo,
> >
> > On Thu, 17 Nov 2022 16:15:07 +0100
> > Ricardo Ribalda  wrote:
> >  
> > > Hi Philipp
> > >
> > > Thanks for your review!  
> >
> > happy to help.
> >  
> > >
> > > On Thu, 17 Nov 2022 at 16:07, Philipp Rudo  wrote:  
> > > >
> > > > Hi Ricardo,
> > > >
> > > > all in all I think this patch makes sense. However, there is one point
> > > > I don't like...
> > > >
> > > > On Mon, 14 Nov 2022 14:18:39 +0100
> > > > Ricardo Ribalda  wrote:
> > > >  
> > > > > Create a new toogle that disables LINUX_REBOOT_CMD_KEXEC, reducing the
> > > > > attack surface to a system.
> > > > >
> > > > > Without this toogle, an attacker can only reboot into a different 
> > > > > kernel
> > > > > if they can create a panic().
> > > > >
> > > > > Signed-off-by: Ricardo Ribalda 
> > > > >
> > > > > diff --git a/Documentation/admin-guide/sysctl/kernel.rst 
> > > > > b/Documentation/admin-guide/sysctl/kernel.rst
> > > > > index 97394bd9d065..25d019682d33 100644
> > > > > --- a/Documentation/admin-guide/sysctl/kernel.rst
> > > > > +++ b/Documentation/admin-guide/sysctl/kernel.rst
> > > > > @@ -462,6 +462,17 @@ altered.
> > > > >  Generally used together with the `modules_disabled`_ sysctl.
> > > > >
> > > > >
> > > > > +kexec_reboot_disabled
> > > > > +=
> > > > > +
> > > > > +A toggle indicating if ``LINUX_REBOOT_CMD_KEXEC`` has been disabled.
> > > > > +This value defaults to 0 (false: ``LINUX_REBOOT_CMD_KEXEC`` enabled),
> > > > > +but can be set to 1 (true: ``LINUX_REBOOT_CMD_KEXEC`` disabled).
> > > > > +Once true, kexec can no longer be used for reboot and the toggle
> > > > > +cannot be set back to false.
> > > > > +This toggle does not affect the use of kexec during a crash.
> > > > > +
> > > > > +
> > > > >  kptr_restrict
> > > > >  =
> > > > >
> > > > > diff --git a/include/linux/kexec.h b/include/linux/kexec.h
> > > > > index 41a686996aaa..15c3fad8918b 100644
> > > > > --- a/include/linux/kexec.h
> > > > > +++ b/include/linux/kexec.h
> > > > > @@ -407,6 +407,7 @@ extern int kimage_crash_copy_vmcoreinfo(struct 
> > > > > kimage *image);
> > > > >  extern struct kimage *kexec_image;
> > > > >  extern struct kimage *kexec_crash_image;
> > > > >  extern int kexec_load_disabled;
> > > > > +extern int kexec_reboot_disabled;
> > > > >
> > > > >  #ifndef kexec_flush_icache_page
> > > > >  #define kexec_flush_icache_page(page)
> > > > > diff --git a/kernel/kexec.c b/kernel/kexec.c
> > > > > index cb8e6e6f983c..43063f803d81 100644
> > > > > --- a/kernel/kexec.c
> > > > > +++ b/kernel/kexec.c
> > > > > @@ -196,6 +196,10 @@ static inline int kexec_load_check(unsigned long 
> > > > > nr_segments,
> > > > >   if (!capable(CAP_SYS_BOOT) || kexec_load_disabled)
> > > > >   return -EPERM;
> > > > >
> > > > > + /* Check if the system admin has disabled kexec reboot. */
> > > > > + if (!(flags & KEXEC_ON_CRASH) && kexec_reboot_disabled)
> > > > > + return -EPERM;  
> > > >
> > > > ... Allowing to load a crashkernel doesn't make sense in my opin

Re: [PATCH v1 2/2] kexec: Introduce kexec_reboot_disabled

2022-11-21 Thread Philipp Rudo
Hi Ricardo,

On Thu, 17 Nov 2022 16:15:07 +0100
Ricardo Ribalda  wrote:

> Hi Philipp
> 
> Thanks for your review!

happy to help.

> 
> On Thu, 17 Nov 2022 at 16:07, Philipp Rudo  wrote:
> >
> > Hi Ricardo,
> >
> > all in all I think this patch makes sense. However, there is one point
> > I don't like...
> >
> > On Mon, 14 Nov 2022 14:18:39 +0100
> > Ricardo Ribalda  wrote:
> >  
> > > Create a new toogle that disables LINUX_REBOOT_CMD_KEXEC, reducing the
> > > attack surface to a system.
> > >
> > > Without this toogle, an attacker can only reboot into a different kernel
> > > if they can create a panic().
> > >
> > > Signed-off-by: Ricardo Ribalda 
> > >
> > > diff --git a/Documentation/admin-guide/sysctl/kernel.rst 
> > > b/Documentation/admin-guide/sysctl/kernel.rst
> > > index 97394bd9d065..25d019682d33 100644
> > > --- a/Documentation/admin-guide/sysctl/kernel.rst
> > > +++ b/Documentation/admin-guide/sysctl/kernel.rst
> > > @@ -462,6 +462,17 @@ altered.
> > >  Generally used together with the `modules_disabled`_ sysctl.
> > >
> > >
> > > +kexec_reboot_disabled
> > > +=
> > > +
> > > +A toggle indicating if ``LINUX_REBOOT_CMD_KEXEC`` has been disabled.
> > > +This value defaults to 0 (false: ``LINUX_REBOOT_CMD_KEXEC`` enabled),
> > > +but can be set to 1 (true: ``LINUX_REBOOT_CMD_KEXEC`` disabled).
> > > +Once true, kexec can no longer be used for reboot and the toggle
> > > +cannot be set back to false.
> > > +This toggle does not affect the use of kexec during a crash.
> > > +
> > > +
> > >  kptr_restrict
> > >  =
> > >
> > > diff --git a/include/linux/kexec.h b/include/linux/kexec.h
> > > index 41a686996aaa..15c3fad8918b 100644
> > > --- a/include/linux/kexec.h
> > > +++ b/include/linux/kexec.h
> > > @@ -407,6 +407,7 @@ extern int kimage_crash_copy_vmcoreinfo(struct kimage 
> > > *image);
> > >  extern struct kimage *kexec_image;
> > >  extern struct kimage *kexec_crash_image;
> > >  extern int kexec_load_disabled;
> > > +extern int kexec_reboot_disabled;
> > >
> > >  #ifndef kexec_flush_icache_page
> > >  #define kexec_flush_icache_page(page)
> > > diff --git a/kernel/kexec.c b/kernel/kexec.c
> > > index cb8e6e6f983c..43063f803d81 100644
> > > --- a/kernel/kexec.c
> > > +++ b/kernel/kexec.c
> > > @@ -196,6 +196,10 @@ static inline int kexec_load_check(unsigned long 
> > > nr_segments,
> > >   if (!capable(CAP_SYS_BOOT) || kexec_load_disabled)
> > >   return -EPERM;
> > >
> > > + /* Check if the system admin has disabled kexec reboot. */
> > > + if (!(flags & KEXEC_ON_CRASH) && kexec_reboot_disabled)
> > > + return -EPERM;  
> >
> > ... Allowing to load a crashkernel doesn't make sense in my opinion. If
> > an attacker is capable of creating a malicious kernel, planting it on
> > the victims system and then find a way to boot it via kexec this
> > attacker also knows how to load the malicious kernel as crashkernel and
> > trigger a panic. So you haven't really gained anything. That's why I
> > would simply drop this hunk (and the corresponding one from
> > kexec_file_load) and let users who worry about this use a combination of
> > kexec_load_disabled and kexec_reboot_disabled.  
> 
> If for whatever reason your sysadmin configured kexec_reboot_disabed
> it can be nice that when a user try to load it they get a warning.
> It is easier to debug than waiting two steps later when they run kexec -e

I'm having second thoughts about this patch. My main problem is that I
don't see a real use case where kexec_reboot_disabled is advantageous
over kexec_load_disabled. The point is that disabling
LINUX_REBOOT_CMD_KEXEC is almost identical to toggling kexec_load_disabled 
without
a loaded kernel (when you don't have a kernel loaded you cannot reboot
into it). With this the main use case of kexec_reboot_disabled is
already covered by kexec_load_disabled.

However, there are two differences

1) with kexec_reboot_disable you can still (re-)load a crash kernel
e.g. to update the initramfs after a config change. But as discussed in
my first mail this comes on the cost that an attacker could still load a
malicious crash kernel and then 'panic into it'.
 
2) kexec_load_disabled also prevents unloading of a loaded kernel. So
once loaded kexec_load_disabled cannot prevent the reboot into this
kernel.

Re: [PATCH RFC] kexec: Freeze processes before kexec

2022-11-17 Thread Philipp Rudo
Hi Steve,

On Wed, 16 Nov 2022 15:16:10 -0500
Steven Rostedt  wrote:

> On Wed, 16 Nov 2022 19:56:24 +
> "Joel Fernandes (Google)"  wrote:
> 
> > --- a/kernel/kexec_core.c
> > +++ b/kernel/kexec_core.c
> > @@ -1175,6 +1175,12 @@ int kernel_kexec(void)
> > } else
> >  #endif
> > {
> > +   error = freeze_processes();
> > +   if (error) {
> > +   error = -EBUSY;
> > +   goto Unlock;
> > +   }  
> 
> If this is the path of a kernel panic, do we really want to check the
> return error of freeze_processes()? We are panicing, there's not much more
> we can do.

kernel_kexec isn't called during panic. We don't need to worry about it
here.

Having that said I think this is a problem in the device driver _not_
in kexec. In my opinion it's the job of the driver to prevent such
races during shutdown.

Thanks
Philipp


___
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec


Re: [PATCH v1 2/2] kexec: Introduce kexec_reboot_disabled

2022-11-17 Thread Philipp Rudo
Hi Ricardo,

all in all I think this patch makes sense. However, there is one point
I don't like...

On Mon, 14 Nov 2022 14:18:39 +0100
Ricardo Ribalda  wrote:

> Create a new toogle that disables LINUX_REBOOT_CMD_KEXEC, reducing the
> attack surface to a system.
> 
> Without this toogle, an attacker can only reboot into a different kernel
> if they can create a panic().
> 
> Signed-off-by: Ricardo Ribalda 
> 
> diff --git a/Documentation/admin-guide/sysctl/kernel.rst 
> b/Documentation/admin-guide/sysctl/kernel.rst
> index 97394bd9d065..25d019682d33 100644
> --- a/Documentation/admin-guide/sysctl/kernel.rst
> +++ b/Documentation/admin-guide/sysctl/kernel.rst
> @@ -462,6 +462,17 @@ altered.
>  Generally used together with the `modules_disabled`_ sysctl.
>  
>  
> +kexec_reboot_disabled
> +=
> +
> +A toggle indicating if ``LINUX_REBOOT_CMD_KEXEC`` has been disabled.
> +This value defaults to 0 (false: ``LINUX_REBOOT_CMD_KEXEC`` enabled),
> +but can be set to 1 (true: ``LINUX_REBOOT_CMD_KEXEC`` disabled).
> +Once true, kexec can no longer be used for reboot and the toggle
> +cannot be set back to false.
> +This toggle does not affect the use of kexec during a crash.
> +
> +
>  kptr_restrict
>  =
>  
> diff --git a/include/linux/kexec.h b/include/linux/kexec.h
> index 41a686996aaa..15c3fad8918b 100644
> --- a/include/linux/kexec.h
> +++ b/include/linux/kexec.h
> @@ -407,6 +407,7 @@ extern int kimage_crash_copy_vmcoreinfo(struct kimage 
> *image);
>  extern struct kimage *kexec_image;
>  extern struct kimage *kexec_crash_image;
>  extern int kexec_load_disabled;
> +extern int kexec_reboot_disabled;
>  
>  #ifndef kexec_flush_icache_page
>  #define kexec_flush_icache_page(page)
> diff --git a/kernel/kexec.c b/kernel/kexec.c
> index cb8e6e6f983c..43063f803d81 100644
> --- a/kernel/kexec.c
> +++ b/kernel/kexec.c
> @@ -196,6 +196,10 @@ static inline int kexec_load_check(unsigned long 
> nr_segments,
>   if (!capable(CAP_SYS_BOOT) || kexec_load_disabled)
>   return -EPERM;
>  
> + /* Check if the system admin has disabled kexec reboot. */
> + if (!(flags & KEXEC_ON_CRASH) && kexec_reboot_disabled)
> + return -EPERM;

... Allowing to load a crashkernel doesn't make sense in my opinion. If
an attacker is capable of creating a malicious kernel, planting it on
the victims system and then find a way to boot it via kexec this
attacker also knows how to load the malicious kernel as crashkernel and
trigger a panic. So you haven't really gained anything. That's why I
would simply drop this hunk (and the corresponding one from
kexec_file_load) and let users who worry about this use a combination of
kexec_load_disabled and kexec_reboot_disabled.

Thanks
Philipp

> +
>   /* Permit LSMs and IMA to fail the kexec */
>   result = security_kernel_load_data(LOADING_KEXEC_IMAGE, false);
>   if (result < 0)
> diff --git a/kernel/kexec_core.c b/kernel/kexec_core.c
> index ca2743f9c634..fe82e2525705 100644
> --- a/kernel/kexec_core.c
> +++ b/kernel/kexec_core.c
> @@ -929,6 +929,7 @@ int kimage_load_segment(struct kimage *image,
>  struct kimage *kexec_image;
>  struct kimage *kexec_crash_image;
>  int kexec_load_disabled;
> +int kexec_reboot_disabled;
>  #ifdef CONFIG_SYSCTL
>  static struct ctl_table kexec_core_sysctls[] = {
>   {
> @@ -941,6 +942,16 @@ static struct ctl_table kexec_core_sysctls[] = {
>   .extra1 = SYSCTL_ONE,
>   .extra2 = SYSCTL_ONE,
>   },
> + {
> + .procname   = "kexec_reboot_disabled",
> + .data   = _reboot_disabled,
> + .maxlen = sizeof(int),
> + .mode   = 0644,
> + /* only handle a transition from default "0" to "1" */
> + .proc_handler   = proc_dointvec_minmax,
> + .extra1 = SYSCTL_ONE,
> + .extra2 = SYSCTL_ONE,
> + },
>   { }
>  };
>  
> @@ -1138,7 +1149,7 @@ int kernel_kexec(void)
>  
>   if (!kexec_trylock())
>   return -EBUSY;
> - if (!kexec_image) {
> + if (!kexec_image || kexec_reboot_disabled) {
>   error = -EINVAL;
>   goto Unlock;
>   }
> diff --git a/kernel/kexec_file.c b/kernel/kexec_file.c
> index 45637511e0de..583fba6de5cb 100644
> --- a/kernel/kexec_file.c
> +++ b/kernel/kexec_file.c
> @@ -333,6 +333,11 @@ SYSCALL_DEFINE5(kexec_file_load, int, kernel_fd, int, 
> initrd_fd,
>   if (!capable(CAP_SYS_BOOT) || kexec_load_disabled)
>   return -EPERM;
>  
> + /* Check if the system admin has disabled kexec reboot. */
> + if (!(flags & (KEXEC_FILE_ON_CRASH | KEXEC_FILE_UNLOAD))
> + && kexec_reboot_disabled)
> + return -EPERM;
> +
>   /* Make sure we have a legal set of flags */
>   if (flags != (flags & KEXEC_FILE_FLAGS))
>   return -EINVAL;
> 


___
kexec mailing list

Re: [PATCH] kernel/crash_core.c : Remove redundant checks for ck_cmdline is NULL

2022-05-03 Thread Philipp Rudo
Hi lizhe,

On Mon, 2 May 2022 18:11:20 +0800 (CST)
lizhe   wrote:

> HI Philipp Rudo.
> 
> 
>   When ck_cmdline is NULL. The last three lines of this function are 
> equivalent to : 
>   if ( !  NULL)
>  return NULL;
>   return NULL;
>  This is obviously a redundant check.
> 
> 
>I will use the above description to  describe the patch,

the explanation looks good to me.

Thanks!
Philipp


>  
> 
> thanks.
> 
> lizhe
>   
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> At 2022-04-26 16:39:52, "Philipp Rudo"  wrote:
> >Hi lizhe,
> >
> >On Mon, 25 Apr 2022 08:38:57 -0700
> >lizhe  wrote:
> >  
> >>  When ck_cmdline is NULL, the only caller of get_last_crashkernel()
> >>  has already done non-NULL check(see __parse_crashkernel()),
> >>  so it doesn't make any sense to make a check here  
> >
> >sorry, but I still don't like the description. What I don't understand
> >in particular is why you are mentioning the caller (__parse_crashkernel)
> >here. ck_cmdline is a local variable to get_last_crashkernel. So the
> >caller cannot perform any check on the variable but only the return
> >value of the function. So the patch description should describe why we
> >can remove the additional return NULL without changing the behavior of
> >the function.
> >
> >Thanks
> >Philipp
> >  
> >> Signed-off-by: lizhe 
> >> ---
> >>  kernel/crash_core.c | 3 ---
> >>  1 file changed, 3 deletions(-)
> >> 
> >> diff --git a/kernel/crash_core.c b/kernel/crash_core.c
> >> index 256cf6db573c..c232f01a2c54 100644
> >> --- a/kernel/crash_core.c
> >> +++ b/kernel/crash_core.c
> >> @@ -222,9 +222,6 @@ static __init char *get_last_crashkernel(char *cmdline,
> >>p = strstr(p+1, name);
> >>}
> >>  
> >> -  if (!ck_cmdline)
> >> -  return NULL;
> >> -
> >>return ck_cmdline;
> >>  }
> >>


___
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec


Re: [PATCH] kernel/crash_core.c : Remove redundant checks for ck_cmdline is NULL

2022-04-26 Thread Philipp Rudo
Hi lizhe,

On Mon, 25 Apr 2022 08:38:57 -0700
lizhe  wrote:

>  When ck_cmdline is NULL, the only caller of get_last_crashkernel()
>  has already done non-NULL check(see __parse_crashkernel()),
>  so it doesn't make any sense to make a check here

sorry, but I still don't like the description. What I don't understand
in particular is why you are mentioning the caller (__parse_crashkernel)
here. ck_cmdline is a local variable to get_last_crashkernel. So the
caller cannot perform any check on the variable but only the return
value of the function. So the patch description should describe why we
can remove the additional return NULL without changing the behavior of
the function.

Thanks
Philipp

> Signed-off-by: lizhe 
> ---
>  kernel/crash_core.c | 3 ---
>  1 file changed, 3 deletions(-)
> 
> diff --git a/kernel/crash_core.c b/kernel/crash_core.c
> index 256cf6db573c..c232f01a2c54 100644
> --- a/kernel/crash_core.c
> +++ b/kernel/crash_core.c
> @@ -222,9 +222,6 @@ static __init char *get_last_crashkernel(char *cmdline,
>   p = strstr(p+1, name);
>   }
>  
> - if (!ck_cmdline)
> - return NULL;
> -
>   return ck_cmdline;
>  }
>  


___
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec


Re: [PATCH] kernel/crash_core.c: No judgment required

2022-04-26 Thread Philipp Rudo
Hi,

On Tue, 26 Apr 2022 10:17:18 +0200
Philipp Rudo  wrote:

> Hi lizhe,
> 
> On Mon, 25 Apr 2022 14:22:31 +0800 (CST)
> lizhe   wrote:
> 
> > HI :
> > 
> > 
> > I found the problem at the first time and gave the solution,
> > 
> > 
> > 
> > 
> > Pphilipp Rudo just saw the solution to the problem and gave an explanation.
> > the author of this patch should only be me  
> 
> right, I only commented on the patch you sent.
> 
> Could you please update the commit message and send a v2.


should have checked the rest of my mails first...


___
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec


Re: [PATCH] kernel/crash_core.c: No judgment required

2022-04-26 Thread Philipp Rudo
Hi lizhe,

On Mon, 25 Apr 2022 14:22:31 +0800 (CST)
lizhe   wrote:

> HI :
> 
> 
> I found the problem at the first time and gave the solution,
> 
> 
> 
> 
> Pphilipp Rudo just saw the solution to the problem and gave an explanation.
> the author of this patch should only be me

right, I only commented on the patch you sent.

Could you please update the commit message and send a v2.

Thanks
Philipp

> 
> 
> 
> 
> 
> 
> 
>   
>  lizhe
> 
> 
> 
> 
> 
> 
> 
> 
> At 2022-04-25 09:36:17, "Baoquan He"  wrote:
> >On 12/14/21 at 05:32pm, Philipp Rudo wrote:  
> >> Hi lizhe,
> >> 
> >> On Thu,  9 Dec 2021 19:20:03 -0800
> >> lizhe  wrote:
> >>   
> >> > No judgment required ck_cmdline is NULL
> >> > its caller has alreadly judged, see __parse_crashkernel
> >> > function
> >> > 
> >> > Signed-off-by: lizhe 
> >> > ---
> >> >  kernel/crash_core.c | 3 ---
> >> >  1 file changed, 3 deletions(-)
> >> > 
> >> > diff --git a/kernel/crash_core.c b/kernel/crash_core.c
> >> > index eb53f5ec62c9..9981cf9b9fe4 100644
> >> > --- a/kernel/crash_core.c
> >> > +++ b/kernel/crash_core.c
> >> > @@ -221,9 +221,6 @@ static __init char *get_last_crashkernel(char 
> >> > *cmdline,
> >> >  p = strstr(p+1, name);
> >> >  }
> >> >  
> >> > -if (!ck_cmdline)
> >> > -return NULL;
> >> > -
> >> >  return ck_cmdline;
> >> >  }
> >> >
> >> 
> >> I agree that the if-block is not needed and can be removed. However, I
> >> cannot follow your reasoning in the commit message. Could you please
> >> explain it in more detail.
> >> 
> >> The reason why I think that the 'if' can be removed is that the
> >> expression can only be true when ck_cmdline = NULL. But with that the
> >> last three lines are equivalent to
> >> 
> >>if (!ck_cmdline)
> >>return ck_cmdline;
> >> 
> >>return ck_cmdline;
> >> 
> >> Which simply doesn't make any sense.  
> >
> >Right, the judgement actually introduces redundant codes. As Zhe
> >replied, maybe you can rewrite the log and repost with your
> >Signed-off-by, Philipp. As for Author, you two can discuss in private
> >mail.  


___
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec


Re: [PATCH makedumpfile] Avoid false-positive mem_section validation with vmlinux

2022-04-25 Thread Philipp Rudo
Hi Kazu,

On Wed, 20 Apr 2022 23:58:29 +
HAGIO KAZUHITO(萩尾 一仁)  wrote:

> Currently get_mem_section() validates if SYMBOL(mem_section) is the address
> of the mem_section array first.  But there was a report that the first
> validation wrongly returned TRUE with -x vmlinux and SPARSEMEM_EXTREME
> (4.15+) on s390x.  This leads to crash failing statup with the following
> seek error:
> 
>   crash: seek error: kernel virtual address: 67fffc2800  type: "memory 
> section root table"
> 
> Skip the first validation when satisfying the conditions.
> 
> Reported-by: Dave Wysochanski 
> Signed-off-by: Kazuhito Hagio 

for me this patch looks fine and exactly addresses the problem, that
mem_section has different types in the vmlinux and vmcoreinfo. So, for
what I want

Reviewed-and-Tested-by: Philipp Rudo 

> ---
>  makedumpfile.c | 31 +++
>  1 file changed, 31 insertions(+)
> 
> diff --git a/makedumpfile.c b/makedumpfile.c
> index a2f45c84cee3..65d1c7c2f02c 100644
> --- a/makedumpfile.c
> +++ b/makedumpfile.c
> @@ -3698,6 +3698,22 @@ validate_mem_section(unsigned long *mem_sec,
>   return ret;
>  }
>  
> +/*
> + * SYMBOL(mem_section) varies with the combination of memory model and
> + * its source:
> + *
> + * SPARSEMEM
> + *   vmcoreinfo: address of mem_section root array
> + *   -x vmlinux: address of mem_section root array
> + *
> + * SPARSEMEM_EXTREME v1
> + *   vmcoreinfo: address of mem_section root array
> + *   -x vmlinux: address of mem_section root array
> + *
> + * SPARSEMEM_EXTREME v2 (with 83e3c48729d9 and a0b1280368d1) 4.15+
> + *   vmcoreinfo: address of mem_section root array
> + *   -x vmlinux: address of pointer to mem_section root array
> + */
>  static int
>  get_mem_section(unsigned int mem_section_size, unsigned long *mem_maps,
>   unsigned int num_section)
> @@ -3710,12 +3726,27 @@ get_mem_section(unsigned int mem_section_size, 
> unsigned long *mem_maps,
>   strerror(errno));
>   return FALSE;
>   }
> +
> + /*
> +  * There was a report that the first validation wrongly returned TRUE
> +  * with -x vmlinux and SPARSEMEM_EXTREME v2 on s390x, so skip it.
> +  * Howerver, leave the fallback validation as it is for the -i option.
> +  */
> + if (is_sparsemem_extreme() && info->name_vmlinux) {
> + unsigned long flag = 0;
> + if (get_symbol_type_name("mem_section", 
> DWARF_INFO_GET_SYMBOL_TYPE,
> + NULL, )
> + && !(flag & TYPE_ARRAY))
> + goto skip_1st_validation;
> + }
> +
>   ret = validate_mem_section(mem_sec, SYMBOL(mem_section),
>  mem_section_size, mem_maps, num_section);
>  
>   if (!ret && is_sparsemem_extreme()) {
>   unsigned long mem_section_ptr;
>  
> +skip_1st_validation:
>   if (!readmem(VADDR, SYMBOL(mem_section), _section_ptr,
>sizeof(mem_section_ptr)))
>   goto out;


___
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec


Re: [PATCH 2/2] makedumpfile: break loop after last dumpable page

2022-04-07 Thread Philipp Rudo
Hi Kazu,

On Thu, 7 Apr 2022 06:43:00 +
HAGIO KAZUHITO(萩尾 一仁)  wrote:

> -Original Message-
> > Once the last dumpable page was processed there is no need to finish the
> > loop to the last page. Thus exit early to improve performance.
> > 
> > Signed-off-by: Philipp Rudo 
> > ---
> >  makedumpfile.c | 6 ++
> >  1 file changed, 6 insertions(+)
> > 
> > diff --git a/makedumpfile.c b/makedumpfile.c
> > index 2ef3458..c944d0e 100644
> > --- a/makedumpfile.c
> > +++ b/makedumpfile.c
> > @@ -8884,6 +8884,12 @@ write_kdump_pages_cyclic(struct cache_data 
> > *cd_header, struct cache_data *cd_pag
> > 
> > for (pfn = start_pfn; pfn < end_pfn; pfn++) {
> > 
> > +   /*
> > +* All dumpable pages have been processed. No need to continue.
> > +*/
> > +   if (num_dumped == info->num_dumpable)
> > +   break;  
> 
> This patch is likely to increase the possibility of failure to capture
> /proc/kcore, although this is an unofficial functionality...
> 
>   # makedumpfile -ld31 /proc/kcore kcore.snap
>   # crash vmlinux kcore.snap
>   ...
>   crash: page incomplete: kernel virtual address: 916fbeffed00  type: 
> "pglist node_id"
> 
> In cyclic mode, makedumpfile first calculates only info->num_dumpable [1] and
> frees the used bitmap, and later creates 2nd bitmap again [2] at this time.
> 
>   create_dumpfile
> create_dump_bitmap
>   info->num_dumpable = get_num_dumpable_cyclic()  <<-- [1]
> writeout_dumpfile
>   write_kdump_pages_and_bitmap_cyclic
> foreach cycle
>   create_2nd_bitmap  <<-- [2]
>   write_kdump_pages_cyclic
> 
> So with live system, num_dumped can exceed info->num_dumpable.
> If it stops at info->num_dumpable, some necessary data can be missed.

thanks for the explanation! I haven't considered that case and assumed
info->num_dumpable is constant.

> Capturing /proc/kcore is still fragile and not considered enough, but
> sometimes useful... when I want to capture a snapshot of memory.
> 
> (the bitmap is allocated as block, so probably it's working as some buffer?)
> 
> So I will merge the 1/2 patch, but personally would not like to merge
> this patch.  How necessary is this?

I don't think this patch is absolutely necessary. It's only a small
performance improvement but shouldn't have a huge impact. If you like
you can drop the patch.

Thanks
Philipp


___
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec


[PATCH 2/2] makedumpfile: break loop after last dumpable page

2022-04-04 Thread Philipp Rudo
Once the last dumpable page was processed there is no need to finish the
loop to the last page. Thus exit early to improve performance.

Signed-off-by: Philipp Rudo 
---
 makedumpfile.c | 6 ++
 1 file changed, 6 insertions(+)

diff --git a/makedumpfile.c b/makedumpfile.c
index 2ef3458..c944d0e 100644
--- a/makedumpfile.c
+++ b/makedumpfile.c
@@ -8884,6 +8884,12 @@ write_kdump_pages_cyclic(struct cache_data *cd_header, 
struct cache_data *cd_pag
 
for (pfn = start_pfn; pfn < end_pfn; pfn++) {
 
+   /*
+* All dumpable pages have been processed. No need to continue.
+*/
+   if (num_dumped == info->num_dumpable)
+   break;
+
/*
 * Check the excluded page.
 */
-- 
2.35.1


___
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec


[PATCH 1/2] makedumpfile: omit unnecessary calls to print_progress

2022-04-04 Thread Philipp Rudo
Check first if a page is dumpable before printing the process. Otherwise
there is the chance that num_dumped % per == 0 at the beginning of the
block of undampable pages. In that case num_dumped isn't updated before
the next dumpable page and thus print_process is called for every page
in that block.

This is especially annoying when the block is after the last dumpable
page and thus num_dumped == info->num_dumpable. In that case
print_process will bypass its check to only print the process once every
second and thus flood the console with unnecessary prints. This can lead
to a severe decrease in performance especially when the console is in
line mode.

Signed-off-by: Philipp Rudo 
---
 makedumpfile.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/makedumpfile.c b/makedumpfile.c
index 14556db..2ef3458 100644
--- a/makedumpfile.c
+++ b/makedumpfile.c
@@ -8884,16 +8884,16 @@ write_kdump_pages_cyclic(struct cache_data *cd_header, 
struct cache_data *cd_pag
 
for (pfn = start_pfn; pfn < end_pfn; pfn++) {
 
-   if ((num_dumped % per) == 0)
-   print_progress(PROGRESS_COPY, num_dumped, 
info->num_dumpable, _start);
-
/*
 * Check the excluded page.
 */
if (!is_dumpable(info->bitmap2, pfn, cycle))
continue;
 
+   if ((num_dumped % per) == 0)
+   print_progress(PROGRESS_COPY, num_dumped, 
info->num_dumpable, _start);
num_dumped++;
+
if (!read_pfn(pfn, buf))
goto out;
 
-- 
2.35.1


___
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec


[PATCH 0/2] makedumpfile: improve write_kdump_pages_cyclic

2022-04-04 Thread Philipp Rudo
Hi,

I noticed that on s390 makedumpfile sometimes floods the console with

Copying data   : [100.0 %] -   eta: 0s

causing the dump process to slow down dramatically. The patches attached are
two somewhat separate approaches to fix this issue. In my opinion both patches
make sense to be included even when one of them would be enough to fix the
problem described above.

Thanks
Philipp

Philipp Rudo (2):
  makedumpfile: omit unnecessary calls to print_progress
  makedumpfile: break loop after last dumpable page

 makedumpfile.c | 10 --
 1 file changed, 8 insertions(+), 2 deletions(-)

-- 
2.35.1


___
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec


Re: running into OOM killer with kexec loading large ramdisk

2022-03-25 Thread Philipp Rudo
Hi Tobias,

On Wed, 23 Mar 2022 12:14:59 +0100
Tobias Powalowski  wrote:

> Hi,
> again me,
> I try to load a 900MB ramdisk on a ramfs rootfs  with kexec and a 10MB kernel.
> With 2800MB RAM assigned to qemu.
> Memory free by /proc/memstat: 2.2GB
> It keeps on OOM killed while executing:
> kexec -l kernel --initrd=initrd.img
> I can safely unpack the initrd in the ramfs without getting OOM killed.
> What is kexec doing wrong here?

I don't think that kexec is doing anything wrong here.

The kexec_load syscall is designed in a way that user space prepares
everything in a huge buffer and passes a pointer to it to the system
call. The systemcall then needs to copy everything from the user buffer
to a kernel buffer. So there are three copies of the initrd in memory
(including the one in ramfs). Together they take up 3 * 900MB = 2.7GB.
So basically all the memory of your guest.

You can try using the kexec_file_load systemcall. That at least
eliminates the copy in the user buffer so there are only two copies
left. 400MB of free memory is still quite scarce, though.

Thanks
Philipp


___
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec


[PATCH] util_lib/elf_info: harden parsing of printk buffer

2022-03-23 Thread Philipp Rudo
The old printk mechanism (> v3.5.0 and < v5.10.0) had a fixed size
buffer (log_buf) that contains all messages. The location for the next
message is stored in log_next_idx. In case the log_buf runs full
log_next_idx wraps around and starts overwriting old messages at the
beginning of the buffer. The wraparound is denoted by a message with
msg->len == 0.

Following the behavior described above blindly is dangerous as e.g. a
memory corruption could overwrite (parts of) the log_buf. If the
corruption adds a message with msg->len == 0 this leads to an endless
loop when dumping the dmesg. Fix this by verifying that not wrapped
around before when it encounters a message with msg->len == 0.

While at it also verify that the index is within the log_buf and thus
guard against corruptions with msg->len != 0.

The same bug has been reported and fixed in makedumpfile [1].

[1] http://lists.infradead.org/pipermail/kexec/2022-March/024272.html

Signed-off-by: Philipp Rudo 
---
 util_lib/elf_info.c | 28 +---
 1 file changed, 25 insertions(+), 3 deletions(-)

diff --git a/util_lib/elf_info.c b/util_lib/elf_info.c
index d252eff..ce71c60 100644
--- a/util_lib/elf_info.c
+++ b/util_lib/elf_info.c
@@ -763,8 +763,9 @@ static void dump_dmesg_structured(int fd, void 
(*handler)(char*, unsigned int))
 {
 #define OUT_BUF_SIZE   4096
uint64_t log_buf, log_buf_offset, ts_nsec;
-   uint32_t log_first_idx, log_next_idx, current_idx, len = 0, i;
+   uint32_t log_buf_len, log_first_idx, log_next_idx, current_idx, len = 
0, i;
char *buf, out_buf[OUT_BUF_SIZE];
+   bool has_wrapped_around = false;
ssize_t ret;
char *msg;
uint16_t text_len;
@@ -811,6 +812,7 @@ static void dump_dmesg_structured(int fd, void 
(*handler)(char*, unsigned int))
}
 
log_buf = read_file_pointer(fd, vaddr_to_offset(log_buf_vaddr));
+   log_buf_len = read_file_s32(fd, vaddr_to_offset(log_buf_len_vaddr));
 
log_first_idx = read_file_u32(fd, vaddr_to_offset(log_first_idx_vaddr));
log_next_idx = read_file_u32(fd, vaddr_to_offset(log_next_idx_vaddr));
@@ -882,11 +884,31 @@ static void dump_dmesg_structured(int fd, void 
(*handler)(char*, unsigned int))
 * and read the message at the start of the buffer.
 */
loglen = struct_val_u16(buf, log_offset_len);
-   if (!loglen)
+   if (!loglen) {
+   if (has_wrapped_around) {
+   if (len && handler)
+   handler(out_buf, len);
+   fprintf(stderr, "Cycle when parsing dmesg 
detected.\n");
+   fprintf(stderr, "The prink log_buf is most 
likely corrupted.\n");
+   fprintf(stderr, "log_buf = 0x%lx, idx = 0x%x\n",
+   log_buf, current_idx);
+   exit(68);
+   }
current_idx = 0;
-   else
+   has_wrapped_around = true;
+   } else {
/* Move to next record */
current_idx += loglen;
+   if(current_idx > log_buf_len - log_sz) {
+   if (len && handler)
+   handler(out_buf, len);
+   fprintf(stderr, "Index outside log_buf 
detected.\n");
+   fprintf(stderr, "The prink log_buf is most 
likely corrupted.\n");
+   fprintf(stderr, "log_buf = 0x%lx, idx = 0x%x\n",
+   log_buf, current_idx);
+   exit(69);
+   }
+   }
}
free(buf);
if (len && handler)
-- 
2.35.1


___
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec


[PATCH v2 0/4] makedumpfile: harden parsing of old prink buffer

2022-03-14 Thread Philipp Rudo
Hi,

dumping the dmesg can cause an endless loop for the old prink mechanism (>
v3.5.0 and < v5.10.0) when the log_buf got corrupted. This series fixes those
cases by adding a cycle detection. The cycle detection is implemented in a
generic way so that it can be reused in other parts of makedumpfile.

Thanks
Philipp

v2:
* Rename 'idx' to 'ptr'
* Also print the non-loop part when a cycle was detected. Such a
  situation can happen when log_buf wrapped around in the kernel
  (log_first_idx != 0) and the corruption occurred on an
  idx < log_first_idx.
* Add patch 4 fixing a bug independent from the memory corruption but
  found while investigating it.

Philipp Rudo (4):
  makedumpfile: add generic cycle detection
  makedumpfile: use pointer arithmetics for dump_dmesg
  makedumpfile: use cycle detection when parsing the prink log_buf
  makedumpfile: print error when reading with unsupported compression

 Makefile   |   2 +-
 detect_cycle.c |  99 +
 detect_cycle.h |  40 +++
 makedumpfile.c | 131 -
 4 files changed, 247 insertions(+), 25 deletions(-)
 create mode 100644 detect_cycle.c
 create mode 100644 detect_cycle.h

-- 
2.35.1


___
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec


[PATCH v2 3/4] makedumpfile: use cycle detection when parsing the prink log_buf

2022-03-14 Thread Philipp Rudo
The old printk mechanism (> v3.5.0 and < v5.10.0) had a fixed size
buffer (log_buf) that contains all messages. The location for the next
message is stored in log_next_idx. In case the log_buf runs full
log_next_idx wraps around and starts overwriting old messages at the
beginning of the buffer. The wraparound is denoted by a message with
msg->len == 0.

Following the behavior described above blindly in makedumpfile is
dangerous as e.g. a memory corruption could overwrite (parts of) the
log_buf. If the corruption adds a message with msg->len == 0 this leads
to an endless loop when dumping the dmesg with makedumpfile appending
the messages up to the corruption over and over again to the output file
until file system is full. Fix this by using cycle detection and aboard
once one is detected.

While at it also verify that the index is within the log_buf and thus
guard against corruptions with msg->len != 0.

Reported-by: Audra Mitchell 
Suggested-by: Dave Wysochanski 
Signed-off-by: Philipp Rudo 
---
 makedumpfile.c | 48 +++-
 1 file changed, 47 insertions(+), 1 deletion(-)

diff --git a/makedumpfile.c b/makedumpfile.c
index 9a05c96..b7ac999 100644
--- a/makedumpfile.c
+++ b/makedumpfile.c
@@ -15,6 +15,7 @@
  */
 #include "makedumpfile.h"
 #include "print_info.h"
+#include "detect_cycle.h"
 #include "dwarf_info.h"
 #include "elf_info.h"
 #include "erase_info.h"
@@ -5528,10 +5529,11 @@ dump_dmesg()
unsigned long index, log_buf, log_end;
unsigned int log_first_idx, log_next_idx;
unsigned long long first_idx_sym;
+   struct detect_cycle *dc = NULL;
unsigned long log_end_2_6_24;
unsigned  log_end_2_6_25;
char *log_buffer = NULL, *log_ptr = NULL;
-   char *ptr;
+   char *ptr, *next_ptr;
 
/*
 * log_end has been changed to "unsigned" since linux-2.6.25.
@@ -5679,12 +5681,55 @@ dump_dmesg()
goto out;
}
ptr = log_buffer + log_first_idx;
+   dc = dc_init(ptr, log_buffer, log_next);
while (ptr != log_buffer + log_next_idx) {
log_ptr = log_from_ptr(ptr, log_buffer);
if (!dump_log_entry(log_ptr, info->fd_dumpfile,
info->name_dumpfile))
goto out;
ptr = log_next(ptr, log_buffer);
+   if (dc_next(dc, (void **) _ptr)) {
+   unsigned long len;
+   int in_cycle;
+   char *first;
+
+   /* Clear everything we have already written... 
*/
+   ftruncate(info->fd_dumpfile, 0);
+   lseek(info->fd_dumpfile, 0, SEEK_SET);
+
+   /* ...and only write up to the corruption. */
+   dc_find_start(dc, (void **) , );
+   ptr = log_buffer + log_first_idx;
+   in_cycle = FALSE;
+   while (len) {
+   log_ptr = log_from_ptr(ptr, log_buffer);
+   if (!dump_log_entry(log_ptr,
+   info->fd_dumpfile,
+   
info->name_dumpfile))
+   goto out;
+   ptr = log_next(ptr, log_buffer);
+
+   if (log_ptr == first)
+   in_cycle = TRUE;
+
+   if (in_cycle)
+   len--;
+   }
+   ERRMSG("Cycle when parsing dmesg detected.\n");
+   ERRMSG("The printk log_buf is most likely 
corrupted.\n");
+   ERRMSG("log_buf = 0x%lx, idx = 0x%lx\n", 
log_buf, ptr - log_buffer);
+   close_files_for_creating_dumpfile();
+   goto out;
+   }
+   if (next_ptr < log_buffer ||
+   next_ptr > log_buffer + log_buf_len - 
SIZE(printk_log)) {
+   ERRMSG("Index outside log_buf detected.\n");
+   ERRMSG("The printk log_buf is most likely 
corrupted.\n");
+   ERRMSG("log_buf = 0x%lx, idx = 0x%lx\n", 
log_buf, ptr - log_buffer);
+   close_files_for_creating_dumpfile();
+  

[PATCH v2 4/4] makedumpfile: print error when reading with unsupported compression

2022-03-14 Thread Philipp Rudo
Currently makedumpfile only checks if the required compression algorithm
was enabled during build when compressing a dump but not when reading
from one. This can lead to situations where, one version of makedumpfile
creates the dump using a compression algorithm an other version of
makedumpfile doesn't support. When the second version now tries to, e.g.
extract the dmesg from the dump it will fail with an error similar to

# makedumpfile --dump-dmesg vmcore dmesg.txt
__vtop4_x86_64: Can't get a valid pgd.
readmem: Can't convert a virtual address(92e18284) to physical 
address.
readmem: type_addr: 0, addr:92e18284, size:390
check_release: Can't get the address of system_utsname.

makedumpfile Failed.

That's because readpage_kdump_compressed{_parallel} does not return
with an error if the page it is trying to read is compressed with an
unsupported compression algorithm. Thus readmem copies random data from
the (uninitialized) cachebuf to its caller and thus causing the error
above.

Fix this by checking if the required compression algorithm is supported
in readpage_kdump_compressed{_parallel} and print a proper error message
if it isn't.

Reported-by: Dave Wysochanski 
Signed-off-by: Philipp Rudo 
---
 makedumpfile.c | 56 ++
 1 file changed, 48 insertions(+), 8 deletions(-)

diff --git a/makedumpfile.c b/makedumpfile.c
index b7ac999..56f3b6c 100644
--- a/makedumpfile.c
+++ b/makedumpfile.c
@@ -865,9 +865,14 @@ readpage_kdump_compressed(unsigned long long paddr, void 
*bufptr)
ERRMSG("Uncompress failed: %d\n", ret);
goto out_error;
}
+   } else if ((pd.flags & DUMP_DH_COMPRESSED_LZO)) {
 #ifdef USELZO
-   } else if (info->flag_lzo_support
-  && (pd.flags & DUMP_DH_COMPRESSED_LZO)) {
+   if (!info->flag_lzo_support) {
+   ERRMSG("lzo compression unsupported\n");
+   out = FALSE;
+   goto out_error;
+   }
+
retlen = info->page_size;
ret = lzo1x_decompress_safe((unsigned char *)buf, pd.size,
(unsigned char *)bufptr, ,
@@ -876,9 +881,14 @@ readpage_kdump_compressed(unsigned long long paddr, void 
*bufptr)
ERRMSG("Uncompress failed: %d\n", ret);
goto out_error;
}
+#else
+   ERRMSG("lzo compression unsupported\n");
+   ERRMSG("Try `make USELZO=on` when building.\n");
+   out = FALSE;
+   goto out_error;
 #endif
-#ifdef USESNAPPY
} else if ((pd.flags & DUMP_DH_COMPRESSED_SNAPPY)) {
+#ifdef USESNAPPY
 
ret = snappy_uncompressed_length(buf, pd.size, (size_t 
*));
if (ret != SNAPPY_OK) {
@@ -891,14 +901,24 @@ readpage_kdump_compressed(unsigned long long paddr, void 
*bufptr)
ERRMSG("Uncompress failed: %d\n", ret);
goto out_error;
}
+#else
+   ERRMSG("snappy compression unsupported\n");
+   ERRMSG("Try `make USESNAPPY=on` when building.\n");
+   out = FALSE;
+   goto out_error;
 #endif
-#ifdef USEZSTD
} else if ((pd.flags & DUMP_DH_COMPRESSED_ZSTD)) {
+#ifdef USEZSTD
ret = ZSTD_decompress(bufptr, info->page_size, buf, pd.size);
if (ZSTD_isError(ret) || (ret != info->page_size)) {
ERRMSG("Uncompress failed: %d\n", ret);
goto out_error;
}
+#else
+   ERRMSG("zstd compression unsupported\n");
+   ERRMSG("Try `make USEZSTD=on` when building.\n");
+   out = FALSE;
+   goto out_error;
 #endif
}
 
@@ -964,9 +984,14 @@ readpage_kdump_compressed_parallel(int fd_memory, unsigned 
long long paddr,
ERRMSG("Uncompress failed: %d\n", ret);
goto out_error;
}
+   } else if ((pd.flags & DUMP_DH_COMPRESSED_LZO)) {
 #ifdef USELZO
-   } else if (info->flag_lzo_support
-  && (pd.flags & DUMP_DH_COMPRESSED_LZO)) {
+   if (!info->flag_lzo_support) {
+   ERRMSG("lzo compression unsupported\n");
+   out = FALSE;
+   goto out_error;
+   }
+
retlen = info->page_size;
ret = lzo1x_decompress_safe((unsigned char *)buf, pd.size,
(unsigned char *)bufptr, ,
@@ -975,9 +1000,14 @@ readpage_kdump_compressed_parallel(int fd_memory, 
unsigned long 

[PATCH v2 1/4] makedumpfile: add generic cycle detection

2022-03-14 Thread Philipp Rudo
In order to work makedumpfile needs to interpret data read from the
dump. This can cause problems as the data from the dump cannot be
trusted (otherwise the kernel wouldn't have panicked in the first
place). This also means that every loop which stop condition depend on
data read from the dump has a chance to loop forever. Thus add a generic
cycle detection mechanism that allows to detect and handle such
situations appropriately.

For cycle detection use Brent's algorithm [1] as it has constant memory
usage. With this it can also be used in the kdump kernel without the
danger that it runs oom when iterating large data structures.
Furthermore it only depends on some pointer arithmetic. Thus the
performance impact (as long as no cycle was detected) should be
comparatively small.

[1] https://en.wikipedia.org/wiki/Cycle_detection#Brent's_algorithm

Suggested-by: Dave Wysochanski 
Signed-off-by: Philipp Rudo 
---
 Makefile   |  2 +-
 detect_cycle.c | 99 ++
 detect_cycle.h | 40 
 3 files changed, 140 insertions(+), 1 deletion(-)
 create mode 100644 detect_cycle.c
 create mode 100644 detect_cycle.h

diff --git a/Makefile b/Makefile
index f118b31..3441364 100644
--- a/Makefile
+++ b/Makefile
@@ -45,7 +45,7 @@ CFLAGS_ARCH += -m32
 endif
 
 SRC_BASE = makedumpfile.c makedumpfile.h diskdump_mod.h sadump_mod.h 
sadump_info.h
-SRC_PART = print_info.c dwarf_info.c elf_info.c erase_info.c sadump_info.c 
cache.c tools.c printk.c
+SRC_PART = print_info.c dwarf_info.c elf_info.c erase_info.c sadump_info.c 
cache.c tools.c printk.c detect_cycle.c
 OBJ_PART=$(patsubst %.c,%.o,$(SRC_PART))
 SRC_ARCH = arch/arm.c arch/arm64.c arch/x86.c arch/x86_64.c arch/ia64.c 
arch/ppc64.c arch/s390x.c arch/ppc.c arch/sparc64.c arch/mips64.c
 OBJ_ARCH=$(patsubst %.c,%.o,$(SRC_ARCH))
diff --git a/detect_cycle.c b/detect_cycle.c
new file mode 100644
index 000..6b551a7
--- /dev/null
+++ b/detect_cycle.c
@@ -0,0 +1,99 @@
+/*
+ * detect_cycle.c  --  Generic cycle detection using Brent's algorithm
+ *
+ * Created by: Philipp Rudo 
+ *
+ * Copyright (c) 2022 Red Hat, Inc. All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ */
+
+#include 
+
+#include "detect_cycle.h"
+
+struct detect_cycle {
+   /* First entry of the list */
+   void *head;
+
+   /* Variables required by Brent's algorithm */
+   void *fast_p;
+   void *slow_p;
+   unsigned long length;
+   unsigned long power;
+
+   /* Function to get the next entry in the list */
+   dc_next_t next;
+
+   /* Private data passed to next */
+   void *data;
+};
+
+struct detect_cycle *dc_init(void *head, void *data, dc_next_t next)
+{
+   struct detect_cycle *new;
+
+   new = malloc(sizeof(*new));
+   if (!new)
+   return NULL;
+
+   new->next = next;
+   new->data = data;
+
+   new->head = head;
+   new->slow_p = head;
+   new->fast_p = head;
+   new->length = 0;
+   new->power  = 2;
+
+   return new;
+}
+
+int dc_next(struct detect_cycle *dc, void **next)
+{
+
+   if (dc->length == dc->power) {
+   dc->length = 0;
+   dc->power *= 2;
+   dc->slow_p = dc->fast_p;
+   }
+
+   dc->fast_p = dc->next(dc->fast_p, dc->data);
+   dc->length++;
+
+   if (dc->slow_p == dc->fast_p)
+   return 1;
+
+   *next = dc->fast_p;
+   return 0;
+}
+
+void dc_find_start(struct detect_cycle *dc, void **first, unsigned long *len)
+{
+   void *slow_p, *fast_p;
+   unsigned long tmp;
+
+   slow_p = fast_p = dc->head;
+   tmp = dc->length;
+
+   while (tmp) {
+   fast_p = dc->next(fast_p, dc->data);
+   tmp--;
+   }
+
+   while (slow_p != fast_p) {
+   slow_p = dc->next(slow_p, dc->data);
+   fast_p = dc->next(fast_p, dc->data);
+   }
+
+   *first = slow_p;
+   *len = dc->length;
+}
diff --git a/detect_cycle.h b/detect_cycle.h
new file mode 100644
index 000..2ca75c7
--- /dev/null
+++ b/detect_cycle.h
@@ -0,0 +1,40 @@
+/*
+ * detect_cycle.h  --  Generic cycle detection using Brent's algorithm
+ *
+ * Created by: Philipp Rudo 
+ *
+ * Copyright (c) 2022 Red Hat, Inc. All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU Gene

[PATCH v2 2/4] makedumpfile: use pointer arithmetics for dump_dmesg

2022-03-14 Thread Philipp Rudo
When parsing the printk buffer for the old printk mechanism (> v3.5.0+ and
< 5.10.0) a log entry is currently specified by the offset into the
buffer where the entry starts. Change this to use a pointers instead.
This is done in preparation for using the new cycle detection mechanism.

Signed-off-by: Philipp Rudo 
---
 makedumpfile.c | 29 +
 1 file changed, 13 insertions(+), 16 deletions(-)

diff --git a/makedumpfile.c b/makedumpfile.c
index 7ed9756..9a05c96 100644
--- a/makedumpfile.c
+++ b/makedumpfile.c
@@ -5482,13 +5482,10 @@ dump_log_entry(char *logptr, int fp, const char 
*file_name)
  * get log record by index; idx must point to valid message.
  */
 static char *
-log_from_idx(unsigned int idx, char *logbuf)
+log_from_ptr(char *logptr, char *logbuf)
 {
-   char *logptr;
unsigned int msglen;
 
-   logptr = logbuf + idx;
-
/*
 * A length == 0 record is the end of buffer marker.
 * Wrap around and return the message at the start of
@@ -5502,14 +5499,13 @@ log_from_idx(unsigned int idx, char *logbuf)
return logptr;
 }
 
-static long
-log_next(unsigned int idx, char *logbuf)
+static void *
+log_next(void *_logptr, void *_logbuf)
 {
-   char *logptr;
+   char *logptr = _logptr;
+   char *logbuf = _logbuf;
unsigned int msglen;
 
-   logptr = logbuf + idx;
-
/*
 * A length == 0 record is the end of buffer marker. Wrap around and
 * read the message at the start of the buffer as *this* one, and
@@ -5519,10 +5515,10 @@ log_next(unsigned int idx, char *logbuf)
msglen = USHORT(logptr + OFFSET(printk_log.len));
if (!msglen) {
msglen = USHORT(logbuf + OFFSET(printk_log.len));
-   return msglen;
+   return logbuf + msglen;
}
 
-   return idx + msglen;
+   return logptr + msglen;
 }
 
 int
@@ -5530,11 +5526,12 @@ dump_dmesg()
 {
int log_buf_len, length_log, length_oldlog, ret = FALSE;
unsigned long index, log_buf, log_end;
-   unsigned int idx, log_first_idx, log_next_idx;
+   unsigned int log_first_idx, log_next_idx;
unsigned long long first_idx_sym;
unsigned long log_end_2_6_24;
unsigned  log_end_2_6_25;
char *log_buffer = NULL, *log_ptr = NULL;
+   char *ptr;
 
/*
 * log_end has been changed to "unsigned" since linux-2.6.25.
@@ -5681,13 +5678,13 @@ dump_dmesg()
ERRMSG("Can't open output file.\n");
goto out;
}
-   idx = log_first_idx;
-   while (idx != log_next_idx) {
-   log_ptr = log_from_idx(idx, log_buffer);
+   ptr = log_buffer + log_first_idx;
+   while (ptr != log_buffer + log_next_idx) {
+   log_ptr = log_from_ptr(ptr, log_buffer);
if (!dump_log_entry(log_ptr, info->fd_dumpfile,
info->name_dumpfile))
goto out;
-   idx = log_next(idx, log_buffer);
+   ptr = log_next(ptr, log_buffer);
}
if (!close_files_for_creating_dumpfile())
goto out;
-- 
2.35.1


___
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec


Re: [PATCH 3/3] makedumpfile: use cycle detection when parsing the prink log_buf

2022-03-11 Thread Philipp Rudo
Hi Kazu,


On Fri, 11 Mar 2022 07:59:47 +
HAGIO KAZUHITO(萩尾 一仁)  wrote:

> -Original Message-
> > Hi Dave,
> > 
> > On Wed, 9 Mar 2022 03:48:12 -0500
> > David Wysochanski  wrote:
> >   
> > > On Mon, Mar 7, 2022 at 12:23 PM Philipp Rudo  wrote:  
> > > >
> > > > The old printk mechanism (> v3.5.0 and < v5.10.0) had a fixed size
> > > > buffer (log_buf) that contains all messages. The location for the next
> > > > message is stored in log_next_idx. In case the log_buf runs full
> > > > log_next_idx wraps around and starts overwriting old messages at the
> > > > beginning of the buffer. The wraparound is denoted by a message with
> > > > msg->len == 0.
> > > >
> > > > Following the behavior described above blindly in makedumpfile is
> > > > dangerous as e.g. a memory corruption could overwrite (parts of) the
> > > > log_buf. If the corruption adds a message with msg->len == 0 this leads
> > > > to an endless loop when dumping the dmesg with makedumpfile appending
> > > > the messages up to the corruption over and over again to the output file
> > > > until file system is full. Fix this by using cycle detection and aboard
> > > > once one is detected.
> > > >
> > > > While at it also verify that the index is within the log_buf and thus
> > > > guard against corruptions with msg->len != 0.
> > > >
> > > > Fixes: 36c2458 ("[PATCH] --dump-dmesg fix for post 3.5 kernels.")
> > > > Reported-by: Audra Mitchell 
> > > > Suggested-by: Dave Wysochanski 
> > > > Signed-off-by: Philipp Rudo 
> > > > ---
> > > >  makedumpfile.c | 42 --
> > > >  1 file changed, 40 insertions(+), 2 deletions(-)
> > > >
> > > > diff --git a/makedumpfile.c b/makedumpfile.c
> > > > index edf128b..2738d16 100644
> > > > --- a/makedumpfile.c
> > > > +++ b/makedumpfile.c
> > > > @@ -15,6 +15,7 @@
> > > >   */
> > > >  #include "makedumpfile.h"
> > > >  #include "print_info.h"
> > > > +#include "detect_cycle.h"
> > > >  #include "dwarf_info.h"
> > > >  #include "elf_info.h"
> > > >  #include "erase_info.h"
> > > > @@ -5528,10 +5529,11 @@ dump_dmesg()
> > > > unsigned long index, log_buf, log_end;
> > > > unsigned int log_first_idx, log_next_idx;
> > > > unsigned long long first_idx_sym;
> > > > +   struct detect_cycle *dc = NULL;
> > > > unsigned long log_end_2_6_24;
> > > > unsigned  log_end_2_6_25;
> > > > char *log_buffer = NULL, *log_ptr = NULL;
> > > > -   char *idx;
> > > > +   char *idx, *next_idx;
> > > >  
> > >
> > > Would be clearer to call the above "next_ptr" rather than "next_idx"
> > > (as far as I know 'index' refers to 32-bit quantities).
> > > Same comment about the "idx" variable, maybe "ptr"?  
> > 
> > Hmm... I stuck with idx as the kernel uses the same name. In my
> > opinion using the same name makes it easier to see that both variables
> > contain the same "quantity" even when the implementation is slightly
> > different (in the kernel idx is the offset in the log_buf just like it
> > was in makedumpfile before patch 2). But my opinion isn't very strong
> > on the naming. So when the consent is to rename the variable I'm open
> > to do it.
> > 
> > @Kazu: Do you have a preference here?  
> 
> Personally I think it will be more readable to use "*ptr" for pointers
> in this case, as Dave says.

ok, then I'll rename it in v2.

Thanks
Philipp


___
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec


Re: [PATCH 3/3] makedumpfile: use cycle detection when parsing the prink log_buf

2022-03-10 Thread Philipp Rudo
Hi Dave,

On Wed, 9 Mar 2022 03:48:12 -0500
David Wysochanski  wrote:

> On Mon, Mar 7, 2022 at 12:23 PM Philipp Rudo  wrote:
> >
> > The old printk mechanism (> v3.5.0 and < v5.10.0) had a fixed size
> > buffer (log_buf) that contains all messages. The location for the next
> > message is stored in log_next_idx. In case the log_buf runs full
> > log_next_idx wraps around and starts overwriting old messages at the
> > beginning of the buffer. The wraparound is denoted by a message with
> > msg->len == 0.
> >
> > Following the behavior described above blindly in makedumpfile is
> > dangerous as e.g. a memory corruption could overwrite (parts of) the
> > log_buf. If the corruption adds a message with msg->len == 0 this leads
> > to an endless loop when dumping the dmesg with makedumpfile appending
> > the messages up to the corruption over and over again to the output file
> > until file system is full. Fix this by using cycle detection and aboard
> > once one is detected.
> >
> > While at it also verify that the index is within the log_buf and thus
> > guard against corruptions with msg->len != 0.
> >
> > Fixes: 36c2458 ("[PATCH] --dump-dmesg fix for post 3.5 kernels.")
> > Reported-by: Audra Mitchell 
> > Suggested-by: Dave Wysochanski 
> > Signed-off-by: Philipp Rudo 
> > ---
> >  makedumpfile.c | 42 --
> >  1 file changed, 40 insertions(+), 2 deletions(-)
> >
> > diff --git a/makedumpfile.c b/makedumpfile.c
> > index edf128b..2738d16 100644
> > --- a/makedumpfile.c
> > +++ b/makedumpfile.c
> > @@ -15,6 +15,7 @@
> >   */
> >  #include "makedumpfile.h"
> >  #include "print_info.h"
> > +#include "detect_cycle.h"
> >  #include "dwarf_info.h"
> >  #include "elf_info.h"
> >  #include "erase_info.h"
> > @@ -5528,10 +5529,11 @@ dump_dmesg()
> > unsigned long index, log_buf, log_end;
> > unsigned int log_first_idx, log_next_idx;
> > unsigned long long first_idx_sym;
> > +   struct detect_cycle *dc = NULL;
> > unsigned long log_end_2_6_24;
> > unsigned  log_end_2_6_25;
> > char *log_buffer = NULL, *log_ptr = NULL;
> > -   char *idx;
> > +   char *idx, *next_idx;
> >  
> 
> Would be clearer to call the above "next_ptr" rather than "next_idx"
> (as far as I know 'index' refers to 32-bit quantities).
> Same comment about the "idx" variable, maybe "ptr"?

Hmm... I stuck with idx as the kernel uses the same name. In my
opinion using the same name makes it easier to see that both variables
contain the same "quantity" even when the implementation is slightly
different (in the kernel idx is the offset in the log_buf just like it
was in makedumpfile before patch 2). But my opinion isn't very strong
on the naming. So when the consent is to rename the variable I'm open
to do it.

@Kazu: Do you have a preference here?

Same for your comments in patch 2.

> > /*
> >  * log_end has been changed to "unsigned" since linux-2.6.25.
> > @@ -5679,12 +5681,47 @@ dump_dmesg()
> > goto out;
> > }
> > idx = log_buffer + log_first_idx;
> > +   dc = dc_init(idx, log_buffer, log_next);
> > while (idx != log_buffer + log_next_idx) {
> > log_ptr = log_from_idx(idx, log_buffer);
> > if (!dump_log_entry(log_ptr, info->fd_dumpfile,
> > info->name_dumpfile))
> > goto out;
> > -   idx = log_next(idx, log_buffer);
> > +   if (dc_next(dc, (void **) _idx)) {
> > +   unsigned long len;
> > +   char *first;
> > +
> > +   /* Clear everything we have already 
> > written... */
> > +   ftruncate(info->fd_dumpfile, 0);
> > +   lseek(info->fd_dumpfile, 0, SEEK_SET);
> > +  
> 
> I'm not sure I understand why you're doing this.

That's because in every pass of the loop the entry is written to the
output file. So most likely the output file already contains more
than needed. Thus we somehow need to trim the file to the end of
the cycle. Thus I decided to go brute force and simply clear all
content from the file and write all we need a second time.

In our very 

[PATCH 0/3] makedumpfile: harden parsing of old prink buffer

2022-03-07 Thread Philipp Rudo
Hi,

dumping the dmesg can cause an endless loop for the old prink mechanism (>
v3.5.0 and < v5.10.0) when the log_buf got corrupted. This series fixes those
cases by adding a cycle detection. The cycle detection is implemented in a
generic way so that it can be reused in other parts of makedumpfile.

Thanks
Philipp

Philipp Rudo (3):
  makedumpfile: add generic cycle detection
  makedumpfile: use pointer arithmetics for dump_dmesg
  makedumpfile: use cycle detection when parsing the prink log_buf

 Makefile   |  2 +-
 detect_cycle.c | 99 ++
 detect_cycle.h | 40 
 makedumpfile.c | 65 +
 4 files changed, 190 insertions(+), 16 deletions(-)
 create mode 100644 detect_cycle.c
 create mode 100644 detect_cycle.h

-- 
2.35.1


___
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec


[PATCH 3/3] makedumpfile: use cycle detection when parsing the prink log_buf

2022-03-07 Thread Philipp Rudo
The old printk mechanism (> v3.5.0 and < v5.10.0) had a fixed size
buffer (log_buf) that contains all messages. The location for the next
message is stored in log_next_idx. In case the log_buf runs full
log_next_idx wraps around and starts overwriting old messages at the
beginning of the buffer. The wraparound is denoted by a message with
msg->len == 0.

Following the behavior described above blindly in makedumpfile is
dangerous as e.g. a memory corruption could overwrite (parts of) the
log_buf. If the corruption adds a message with msg->len == 0 this leads
to an endless loop when dumping the dmesg with makedumpfile appending
the messages up to the corruption over and over again to the output file
until file system is full. Fix this by using cycle detection and aboard
once one is detected.

While at it also verify that the index is within the log_buf and thus
guard against corruptions with msg->len != 0.

Fixes: 36c2458 ("[PATCH] --dump-dmesg fix for post 3.5 kernels.")
Reported-by: Audra Mitchell 
Suggested-by: Dave Wysochanski 
Signed-off-by: Philipp Rudo 
---
 makedumpfile.c | 42 --
 1 file changed, 40 insertions(+), 2 deletions(-)

diff --git a/makedumpfile.c b/makedumpfile.c
index edf128b..2738d16 100644
--- a/makedumpfile.c
+++ b/makedumpfile.c
@@ -15,6 +15,7 @@
  */
 #include "makedumpfile.h"
 #include "print_info.h"
+#include "detect_cycle.h"
 #include "dwarf_info.h"
 #include "elf_info.h"
 #include "erase_info.h"
@@ -5528,10 +5529,11 @@ dump_dmesg()
unsigned long index, log_buf, log_end;
unsigned int log_first_idx, log_next_idx;
unsigned long long first_idx_sym;
+   struct detect_cycle *dc = NULL;
unsigned long log_end_2_6_24;
unsigned  log_end_2_6_25;
char *log_buffer = NULL, *log_ptr = NULL;
-   char *idx;
+   char *idx, *next_idx;
 
/*
 * log_end has been changed to "unsigned" since linux-2.6.25.
@@ -5679,12 +5681,47 @@ dump_dmesg()
goto out;
}
idx = log_buffer + log_first_idx;
+   dc = dc_init(idx, log_buffer, log_next);
while (idx != log_buffer + log_next_idx) {
log_ptr = log_from_idx(idx, log_buffer);
if (!dump_log_entry(log_ptr, info->fd_dumpfile,
info->name_dumpfile))
goto out;
-   idx = log_next(idx, log_buffer);
+   if (dc_next(dc, (void **) _idx)) {
+   unsigned long len;
+   char *first;
+
+   /* Clear everything we have already written... 
*/
+   ftruncate(info->fd_dumpfile, 0);
+   lseek(info->fd_dumpfile, 0, SEEK_SET);
+
+   /* ...and only write up to the corruption. */
+   dc_find_start(dc, (void **) , );
+   idx = log_buffer + log_first_idx;
+   while (len) {
+   log_ptr = log_from_idx(idx, log_buffer);
+   if (!dump_log_entry(log_ptr,
+   info->fd_dumpfile,
+   
info->name_dumpfile))
+   goto out;
+   idx = log_next(idx, log_buffer);
+   len--;
+   }
+   ERRMSG("Cycle when parsing dmesg detected.\n");
+   ERRMSG("The printk log_buf is most likely 
corrupted.\n");
+   ERRMSG("log_buf = 0x%lx, idx = 0x%lx\n", 
log_buf, idx - log_buffer);
+   close_files_for_creating_dumpfile();
+   goto out;
+   }
+   if (next_idx < log_buffer ||
+   next_idx > log_buffer + log_buf_len - 
SIZE(printk_log)) {
+   ERRMSG("Index outside log_buf detected.\n");
+   ERRMSG("The printk log_buf is most likely 
corrupted.\n");
+   ERRMSG("log_buf = 0x%lx, idx = 0x%lx\n", 
log_buf, idx - log_buffer);
+   close_files_for_creating_dumpfile();
+   goto out;
+   }
+   idx = next_idx;
}
if (!close_files_for_creating_dumpfile())
goto out;
@@ -5694,6 +5731,7 @@ dump_dmesg()
 o

[PATCH 2/3] makedumpfile: use pointer arithmetics for dump_dmesg

2022-03-07 Thread Philipp Rudo
When parsing the printk buffer for the old printk mechanism (> v3.5.0+ and
< 5.10.0) a log entry is currently specified by the offset into the
buffer where the entry starts. Change this to use a pointers instead.
This is done in preparation for using the new cycle detection mechanism.

Signed-off-by: Philipp Rudo 
---
 makedumpfile.c | 25 +++--
 1 file changed, 11 insertions(+), 14 deletions(-)

diff --git a/makedumpfile.c b/makedumpfile.c
index 7ed9756..edf128b 100644
--- a/makedumpfile.c
+++ b/makedumpfile.c
@@ -5482,13 +5482,10 @@ dump_log_entry(char *logptr, int fp, const char 
*file_name)
  * get log record by index; idx must point to valid message.
  */
 static char *
-log_from_idx(unsigned int idx, char *logbuf)
+log_from_idx(char *logptr, char *logbuf)
 {
-   char *logptr;
unsigned int msglen;
 
-   logptr = logbuf + idx;
-
/*
 * A length == 0 record is the end of buffer marker.
 * Wrap around and return the message at the start of
@@ -5502,14 +5499,13 @@ log_from_idx(unsigned int idx, char *logbuf)
return logptr;
 }
 
-static long
-log_next(unsigned int idx, char *logbuf)
+static void *
+log_next(void *_logptr, void *_logbuf)
 {
-   char *logptr;
+   char *logptr = _logptr;
+   char *logbuf = _logbuf;
unsigned int msglen;
 
-   logptr = logbuf + idx;
-
/*
 * A length == 0 record is the end of buffer marker. Wrap around and
 * read the message at the start of the buffer as *this* one, and
@@ -5519,10 +5515,10 @@ log_next(unsigned int idx, char *logbuf)
msglen = USHORT(logptr + OFFSET(printk_log.len));
if (!msglen) {
msglen = USHORT(logbuf + OFFSET(printk_log.len));
-   return msglen;
+   return logbuf + msglen;
}
 
-   return idx + msglen;
+   return logptr + msglen;
 }
 
 int
@@ -5530,11 +5526,12 @@ dump_dmesg()
 {
int log_buf_len, length_log, length_oldlog, ret = FALSE;
unsigned long index, log_buf, log_end;
-   unsigned int idx, log_first_idx, log_next_idx;
+   unsigned int log_first_idx, log_next_idx;
unsigned long long first_idx_sym;
unsigned long log_end_2_6_24;
unsigned  log_end_2_6_25;
char *log_buffer = NULL, *log_ptr = NULL;
+   char *idx;
 
/*
 * log_end has been changed to "unsigned" since linux-2.6.25.
@@ -5681,8 +5678,8 @@ dump_dmesg()
ERRMSG("Can't open output file.\n");
goto out;
}
-   idx = log_first_idx;
-   while (idx != log_next_idx) {
+   idx = log_buffer + log_first_idx;
+   while (idx != log_buffer + log_next_idx) {
log_ptr = log_from_idx(idx, log_buffer);
if (!dump_log_entry(log_ptr, info->fd_dumpfile,
info->name_dumpfile))
-- 
2.35.1


___
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec


[PATCH 1/3] makedumpfile: add generic cycle detection

2022-03-07 Thread Philipp Rudo
In order to work makedumpfile needs to interpret data read from the
dump. This can cause problems as the data from the dump cannot be
trusted (otherwise the kernel wouldn't have panicked in the first
place). This also means that every loop which stop condition depend on
data read from the dump has a chance to loop forever. Thus add a generic
cycle detection mechanism that allows to detect and handle such
situations appropriately.

For cycle detection use Brent's algorithm [1] as it has constant memory
usage. With this it can also be used in the kdump kernel without the
danger that it runs oom when iterating large data structures.
Furthermore it only depends on some pointer arithmetic. Thus the
performance impact (as long as no cycle was detected) should be
comparatively small.

[1] https://en.wikipedia.org/wiki/Cycle_detection#Brent's_algorithm

Suggested-by: Dave Wysochanski 
Signed-off-by: Philipp Rudo 
---
 Makefile   |  2 +-
 detect_cycle.c | 99 ++
 detect_cycle.h | 40 
 3 files changed, 140 insertions(+), 1 deletion(-)
 create mode 100644 detect_cycle.c
 create mode 100644 detect_cycle.h

diff --git a/Makefile b/Makefile
index 9f9fd22..e9a474f 100644
--- a/Makefile
+++ b/Makefile
@@ -45,7 +45,7 @@ CFLAGS_ARCH += -m32
 endif
 
 SRC_BASE = makedumpfile.c makedumpfile.h diskdump_mod.h sadump_mod.h 
sadump_info.h
-SRC_PART = print_info.c dwarf_info.c elf_info.c erase_info.c sadump_info.c 
cache.c tools.c printk.c
+SRC_PART = print_info.c dwarf_info.c elf_info.c erase_info.c sadump_info.c 
cache.c tools.c printk.c detect_cycle.c
 OBJ_PART=$(patsubst %.c,%.o,$(SRC_PART))
 SRC_ARCH = arch/arm.c arch/arm64.c arch/x86.c arch/x86_64.c arch/ia64.c 
arch/ppc64.c arch/s390x.c arch/ppc.c arch/sparc64.c arch/mips64.c
 OBJ_ARCH=$(patsubst %.c,%.o,$(SRC_ARCH))
diff --git a/detect_cycle.c b/detect_cycle.c
new file mode 100644
index 000..6b551a7
--- /dev/null
+++ b/detect_cycle.c
@@ -0,0 +1,99 @@
+/*
+ * detect_cycle.c  --  Generic cycle detection using Brent's algorithm
+ *
+ * Created by: Philipp Rudo 
+ *
+ * Copyright (c) 2022 Red Hat, Inc. All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ */
+
+#include 
+
+#include "detect_cycle.h"
+
+struct detect_cycle {
+   /* First entry of the list */
+   void *head;
+
+   /* Variables required by Brent's algorithm */
+   void *fast_p;
+   void *slow_p;
+   unsigned long length;
+   unsigned long power;
+
+   /* Function to get the next entry in the list */
+   dc_next_t next;
+
+   /* Private data passed to next */
+   void *data;
+};
+
+struct detect_cycle *dc_init(void *head, void *data, dc_next_t next)
+{
+   struct detect_cycle *new;
+
+   new = malloc(sizeof(*new));
+   if (!new)
+   return NULL;
+
+   new->next = next;
+   new->data = data;
+
+   new->head = head;
+   new->slow_p = head;
+   new->fast_p = head;
+   new->length = 0;
+   new->power  = 2;
+
+   return new;
+}
+
+int dc_next(struct detect_cycle *dc, void **next)
+{
+
+   if (dc->length == dc->power) {
+   dc->length = 0;
+   dc->power *= 2;
+   dc->slow_p = dc->fast_p;
+   }
+
+   dc->fast_p = dc->next(dc->fast_p, dc->data);
+   dc->length++;
+
+   if (dc->slow_p == dc->fast_p)
+   return 1;
+
+   *next = dc->fast_p;
+   return 0;
+}
+
+void dc_find_start(struct detect_cycle *dc, void **first, unsigned long *len)
+{
+   void *slow_p, *fast_p;
+   unsigned long tmp;
+
+   slow_p = fast_p = dc->head;
+   tmp = dc->length;
+
+   while (tmp) {
+   fast_p = dc->next(fast_p, dc->data);
+   tmp--;
+   }
+
+   while (slow_p != fast_p) {
+   slow_p = dc->next(slow_p, dc->data);
+   fast_p = dc->next(fast_p, dc->data);
+   }
+
+   *first = slow_p;
+   *len = dc->length;
+}
diff --git a/detect_cycle.h b/detect_cycle.h
new file mode 100644
index 000..2ca75c7
--- /dev/null
+++ b/detect_cycle.h
@@ -0,0 +1,40 @@
+/*
+ * detect_cycle.h  --  Generic cycle detection using Brent's algorithm
+ *
+ * Created by: Philipp Rudo 
+ *
+ * Copyright (c) 2022 Red Hat, Inc. All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU Gene

Re: [RFC PATCH] kdump: Add support for crashkernel=auto

2022-02-15 Thread Philipp Rudo
Hi Petr,

On Fri, 4 Feb 2022 06:34:19 +0100
Petr Tesařík  wrote:

> Hi Philipp,
> 
> Dne 31. 01. 22 v 11:33 Philipp Rudo napsal(a):
> > Hi,
> > 
> > On Fri, 28 Jan 2022 11:31:49 +0100
> > Petr Tesařík  wrote:
> >   
> >> Hi Tiezhu Yang,
> >>
> >> On Jan 28, 2022 at 02:20 Tiezhu Yang wrote:  
> >>> [...]
> >>> Hi Petr,
> >>>
> >>> Thank you for your reply.
> >>>
> >>> This is a RFC patch, the initial aim of this patch is to discuss what is
> >>> the proper way to support crashkernel=auto.  
> >>
> >> Well, the point I'm trying to make is that crashkernel=auto cannot be
> >> implemented. Your code would have to know what happens in the future,
> >> and AFAIK time travel has not been discovered yet. ;-)
> >>
> >> A better approach is to make a very large allocation initially, e.g.
> >> half of available RAM. The remaining RAM should still be big enough to
> >> start booting the system. Later, when a kdump user-space service knows
> >> what it wants to load, it can shrink the reservation by writing a lower
> >> value into /sys/kernel/kexec_crash_size.  
> > 
> > Even this approach doesn't work in every situation. For example it
> > requires that the system has at least twice the RAM it requires to
> > safely boot. That's not always given for e.g minimalistic VMs or
> > embedded systems.  
> 
> If you reserve more RAM for the panic kernel than for running your 
> actual workload, then you definitely have very special needs, and you 
> should not expect that everything works out of the box.

That was basically the point I was trying to make. There is always a
scenario with special needs so that is is basically impossible to find
that one solution that works for everybody.

> > Furthermore the memory requirement can also change during runtime due
> > to, e.g. workload spikes, device hot plug, moving the dump target from
> > an un-encrypted to an encrypted disk, etc.. So even when your user-space
> > program can exactly calculate the memory requirement at the moment
> > it loads kdump it might be too little at the moment the system panics.
> > In order for it to work the user-space would constantly need to monitor
> > how much memory is needed and adjust the requirement. But that would
> > also require to increase the reservation during runtime which would be
> > extremely expensive (if possible at all).
> > 
> > All in all I support Petr that time travel is the only proper solution
> > for implementing crashkernel=auto. But once we have time travel I
> > would prefer to use the gained knowledge to fix the bug that triggered
> > the panic rather than calculating the memory requirement for kdump.  
> 
> Yeah, long live patching! :-)
> 
> >> The alternative approach does not need any changes to the kernel, except
> >> maybe adding something like "crashkernel=max".  
> > 
> > A slightly different approach is for the user-space tool to simply set
> > the crashkernel= parameter on the kernel commandline for the next boot.
> > This also works for memory restrained systems. Needs a reboot though...  
> 
> The downside is that if you remove some memory while your system is off, 
> then a reservation calculate for the previous RAM size may no longer be 
> possible on the next boot, and the kernel will boot up without any 
> reservation. That's where "crashkernel=max" would come in handy. Let me 
> send a patch and see the discussion.

True, in that situation our approach will fail. I'm looking forward to
see your patch.

Thanks
Philipp

> >>> A moment ago, I find the following patch, it is more flexible, but it is
> >>> not merged into the upstream kernel now.
> >>>
> >>> kernel/crash_core: Add crashkernel=auto for vmcore creation
> >>>
> >>> https://lore.kernel.org/lkml/20210223174153.72802-1-saeed.mirzamohamm...@oracle.com/
> >>>   
> > 
> > The patch was ultimately rejected by Linus
> > 
> > https://lore.kernel.org/linux-mm/20210507010432.in24pudkt%25a...@linux-foundation.org/
> > 
> > Thanks
> > Philipp
> >   
> >>>  
> >>>> 
> >>>>> [...]
> >>>>> diff --git a/kernel/crash_core.c b/kernel/crash_core.c
> >>>>> index 256cf6d..32c51e2 100644
> >>>>> --- a/kernel/crash_core.c
> >>>>> +++ b/kernel/crash_core.c
> >>>>> @@ -252,6 +252,26 @@ static int __init __parse_crashkernel(char
> >

Re: Is it possible to use a stream for initrd=

2022-02-15 Thread Philipp Rudo
Hi Tobias,

On Wed, 9 Feb 2022 10:32:24 +0100
Tobias Powalowski  wrote:

> Sorry you misunderstood, I don't want to create the initrd.img file.
> I want to pass the zstd directly to kexec initrd= option.

sorry, that is not possible. The kexec-tools expect a file name with
--initrd. Simply passing a binary blob to be used as initrd isn't
implemented.

Thanks
Philipp

> greetings
> tpowa
> 
> Am Mi., 9. Feb. 2022 um 10:28 Uhr schrieb Baoquan He :
> >
> > On 02/09/22 at 10:16am, Tobias Powalowski wrote:  
> > > Hi,
> > > I have a tmp directory with all files placed in:
> > > find . -mindepth 1 -printf '%P\0' | sort -z | LANG=C bsdtar --uid 0
> > > --gid 0 --null -cnf - -T - |\
> > > LANG=C bsdtar --null -cf - --format=newc @- | zstd -T0 > /initrd.img
> > >
> > > and this initrd.img I want to write into kexec without creating the 
> > > initrd file.
> > >
> > > kexec -l /vmlinuz-linux --initrd=/initrd.img --reuse-cmdline
> > >  systemctl kexec  
> >
> > See manpage of kexec, the EXAMPLE part:
> >
> > kexec -l /boot/vmlinux  --initrd=/boot/initrd --reuse-cmdline
> > kexec -e
> >  
> 
> 


___
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec


Re: [RFC PATCH] kdump: Add support for crashkernel=auto

2022-01-31 Thread Philipp Rudo
Hi,

On Fri, 28 Jan 2022 11:31:49 +0100
Petr Tesařík  wrote:

> Hi Tiezhu Yang,
> 
> On Jan 28, 2022 at 02:20 Tiezhu Yang wrote:
> >[...]
> > Hi Petr,
> > 
> > Thank you for your reply.
> > 
> > This is a RFC patch, the initial aim of this patch is to discuss what is 
> > the proper way to support crashkernel=auto.  
> 
> Well, the point I'm trying to make is that crashkernel=auto cannot be 
> implemented. Your code would have to know what happens in the future, 
> and AFAIK time travel has not been discovered yet. ;-)
>
> A better approach is to make a very large allocation initially, e.g. 
> half of available RAM. The remaining RAM should still be big enough to 
> start booting the system. Later, when a kdump user-space service knows 
> what it wants to load, it can shrink the reservation by writing a lower 
> value into /sys/kernel/kexec_crash_size.

Even this approach doesn't work in every situation. For example it
requires that the system has at least twice the RAM it requires to
safely boot. That's not always given for e.g minimalistic VMs or
embedded systems.
Furthermore the memory requirement can also change during runtime due
to, e.g. workload spikes, device hot plug, moving the dump target from
an un-encrypted to an encrypted disk, etc.. So even when your user-space
program can exactly calculate the memory requirement at the moment
it loads kdump it might be too little at the moment the system panics.
In order for it to work the user-space would constantly need to monitor
how much memory is needed and adjust the requirement. But that would
also require to increase the reservation during runtime which would be
extremely expensive (if possible at all).

All in all I support Petr that time travel is the only proper solution
for implementing crashkernel=auto. But once we have time travel I
would prefer to use the gained knowledge to fix the bug that triggered
the panic rather than calculating the memory requirement for kdump.
 
> The alternative approach does not need any changes to the kernel, except 
> maybe adding something like "crashkernel=max".

A slightly different approach is for the user-space tool to simply set
the crashkernel= parameter on the kernel commandline for the next boot.
This also works for memory restrained systems. Needs a reboot though...

> > A moment ago, I find the following patch, it is more flexible, but it is 
> > not merged into the upstream kernel now.
> > 
> > kernel/crash_core: Add crashkernel=auto for vmcore creation
> > 
> > https://lore.kernel.org/lkml/20210223174153.72802-1-saeed.mirzamohamm...@oracle.com/
> >  

The patch was ultimately rejected by Linus

https://lore.kernel.org/linux-mm/20210507010432.in24pudkt%25a...@linux-foundation.org/

Thanks
Philipp

> >   
> >>  
> >>> [...]
> >>> diff --git a/kernel/crash_core.c b/kernel/crash_core.c
> >>> index 256cf6d..32c51e2 100644
> >>> --- a/kernel/crash_core.c
> >>> +++ b/kernel/crash_core.c
> >>> @@ -252,6 +252,26 @@ static int __init __parse_crashkernel(char 
> >>> *cmdline,
> >>>   if (suffix)
> >>>   return parse_crashkernel_suffix(ck_cmdline, crash_size,
> >>>   suffix);
> >>> +
> >>> +    if (strncmp(ck_cmdline, "auto", 4) == 0) {
> >>> +#if defined(CONFIG_X86_64) || defined(CONFIG_S390)
> >>> +    ck_cmdline = "1G-4G:160M,4G-64G:192M,64G-1T:256M,1T-:512M";
> >>> +#elif defined(CONFIG_ARM64)
> >>> +    ck_cmdline = "2G-:448M";
> >>> +#elif defined(CONFIG_PPC64)
> >>> +    char *fadump_cmdline;
> >>> +
> >>> +    fadump_cmdline = get_last_crashkernel(cmdline, "fadump=", 
> >>> NULL);
> >>> +    fadump_cmdline = fadump_cmdline ?
> >>> +    fadump_cmdline + strlen("fadump=") : NULL;
> >>> +    if (!fadump_cmdline || (strncmp(fadump_cmdline, "off", 3) == 
> >>> 0))
> >>> +    ck_cmdline =
> >>> "2G-4G:384M,4G-16G:512M,16G-64G:1G,64G-128G:2G,128G-:4G";
> >>> +    else
> >>> +    ck_cmdline =
> >>> "4G-16G:768M,16G-64G:1G,64G-128G:2G,128G-1T:4G,1T-2T:6G,2T-4T:12G,4T-8T:20G,8T-16T:36G,16T-32T:64G,32T-64T:128G,64T-:180G";
> >>>  
> >>>
> >>>
> >>> +#endif
> >>> +    pr_info("Using crashkernel=auto, the size chosen is a best
> >>> effort estimation.\n");
> >>> +    }
> >>> +  
> >>
> >> How did you even arrive at the above numbers?  
> > 
> > Memory requirements for kdump:
> > 
> > https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/8/html/managing_monitoring_and_updating_the_kernel/supported-kdump-configurations-and-targets_managing-monitoring-and-updating-the-kernel#memory-requirements-for-kdump_supported-kdump-configurations-and-targets
> >  
> > 
> > 
> > I've done some research on  
> >> this topic recently (ie. during the last 7 years or so). My x86_64
> >> system with 8G RAM running openSUSE Leap 15.3 seems needs 188M for
> >> saving to the local disk, and 203M to save over the network (using
> >> SFTP). My PPC64 LPAR with 16G RAM running latest Beta of SLES 15 SP4
> >> needs 587M, i.e. with the above 

Re: [PATCHv4 1/4] arm64: make phys_offset signed

2022-01-20 Thread Philipp Rudo
Hi Pingfan,

On Tue, 18 Jan 2022 15:48:09 +0800
Pingfan Liu  wrote:

> After kernel commit 7bc1a0f9e176 ("arm64: mm: use single quantity to
> represent the PA to VA translation"), phys_offset can be negative if
> running 52-bits kernel on 48-bits hardware.
> 
> So changing phys_offset from unsigned to signed.
> 
> Signed-off-by: Pingfan Liu 
> Cc: Kairui Song 
> Cc: Simon Horman 
> Cc: Philipp Rudo 
> To: kexec@lists.infradead.org
> ---
>  kexec/arch/arm64/kexec-arm64.c | 12 ++--
>  kexec/arch/arm64/kexec-arm64.h |  2 +-
>  util_lib/elf_info.c|  2 +-
>  util_lib/include/elf_info.h|  2 +-
>  4 files changed, 9 insertions(+), 9 deletions(-)
> 

[...]

> diff --git a/kexec/arch/arm64/kexec-arm64.h b/kexec/arch/arm64/kexec-arm64.h
> index ed447ac..1844b67 100644
> --- a/kexec/arch/arm64/kexec-arm64.h
> +++ b/kexec/arch/arm64/kexec-arm64.h
> @@ -58,7 +58,7 @@ extern off_t initrd_size;
>   */
>  
>  struct arm64_mem {
> - uint64_t phys_offset;
> + long phys_offset;

I think this one should be int64_t as well.

Other than that
Reviewed-by: Philipp Rudo 

>   uint64_t text_offset;
>   uint64_t image_size;
>   uint64_t vp_offset;


___
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec


Re: [PATCHv4 2/4] arm64/crashdump: unify routine to get page_offset

2022-01-20 Thread Philipp Rudo
Hi Pingfan,

On Tue, 18 Jan 2022 15:48:10 +0800
Pingfan Liu  wrote:

> There are two funcs to get page_offset:
>   get_kernel_page_offset()
>   get_page_offset()
> 
> Since get_kernel_page_offset() does not observe the kernel formula, and
> remove it. Unify them in order to introduce 52-bits VA kernel more
> easily in the coming patch.
> 
> Signed-off-by: Pingfan Liu 
> Cc: Kairui Song 
> Cc: Simon Horman 
> Cc: Philipp Rudo 
> To: kexec@lists.infradead.org

looks good

Reviewed-by: Philipp Rudo 

> ---
>  kexec/arch/arm64/crashdump-arm64.c | 23 +--
>  kexec/arch/arm64/kexec-arm64.c |  8 
>  kexec/arch/arm64/kexec-arm64.h |  1 +
>  3 files changed, 6 insertions(+), 26 deletions(-)
> 
> diff --git a/kexec/arch/arm64/crashdump-arm64.c 
> b/kexec/arch/arm64/crashdump-arm64.c
> index a02019a..0a8d44c 100644
> --- a/kexec/arch/arm64/crashdump-arm64.c
> +++ b/kexec/arch/arm64/crashdump-arm64.c
> @@ -46,27 +46,6 @@ static struct crash_elf_info elf_info = {
>   .machine= EM_AARCH64,
>  };
>  
> -/*
> - * Note: The returned value is correct only if !CONFIG_RANDOMIZE_BASE.
> - */
> -static uint64_t get_kernel_page_offset(void)
> -{
> - int i;
> -
> - if (elf_info.kern_vaddr_start == UINT64_MAX)
> - return UINT64_MAX;
> -
> - /* Current max virtual memory range is 48-bits. */
> - for (i = 48; i > 0; i--)
> - if (!(elf_info.kern_vaddr_start & (1UL << i)))
> - break;
> -
> - if (i <= 0)
> - return UINT64_MAX;
> - else
> - return UINT64_MAX << i;
> -}
> -
>  /*
>   * iomem_range_callback() - callback called for each iomem region
>   * @data: not used
> @@ -198,7 +177,7 @@ int load_crashdump_segments(struct kexec_info *info)
>   if (err)
>   return EFAILED;
>  
> - elf_info.page_offset = get_kernel_page_offset();
> + get_page_offset(_info.page_offset);
>   dbgprintf("%s: page_offset:   %016llx\n", __func__,
>   elf_info.page_offset);
>  
> diff --git a/kexec/arch/arm64/kexec-arm64.c b/kexec/arch/arm64/kexec-arm64.c
> index c6c67e8..33cc258 100644
> --- a/kexec/arch/arm64/kexec-arm64.c
> +++ b/kexec/arch/arm64/kexec-arm64.c
> @@ -909,7 +909,7 @@ static int get_va_bits(void)
>   * get_page_offset - Helper for getting PAGE_OFFSET
>   */
>  
> -static int get_page_offset(void)
> +int get_page_offset(unsigned long *page_offset)
>  {
>   int ret;
>  
> @@ -917,8 +917,8 @@ static int get_page_offset(void)
>   if (ret < 0)
>   return ret;
>  
> - page_offset = (0xUL) << (va_bits - 1);
> - dbgprintf("page_offset : %lx\n", page_offset);
> + *page_offset = UINT64_MAX << (va_bits - 1);
> + dbgprintf("page_offset : %lx\n", *page_offset);
>  
>   return 0;
>  }
> @@ -954,7 +954,7 @@ int get_phys_base_from_pt_load(long *phys_offset)
>   unsigned long long phys_start;
>   unsigned long long virt_start;
>  
> - ret = get_page_offset();
> + ret = get_page_offset(_offset);
>   if (ret < 0)
>   return ret;
>  
> diff --git a/kexec/arch/arm64/kexec-arm64.h b/kexec/arch/arm64/kexec-arm64.h
> index 1844b67..ed99d9d 100644
> --- a/kexec/arch/arm64/kexec-arm64.h
> +++ b/kexec/arch/arm64/kexec-arm64.h
> @@ -69,6 +69,7 @@ extern struct arm64_mem arm64_mem;
>  
>  uint64_t get_phys_offset(void);
>  uint64_t get_vp_offset(void);
> +int get_page_offset(unsigned long *offset);
>  
>  static inline void reset_vp_offset(void)
>  {


___
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec


Re: [PATCHv4 3/4] arm64: read VA_BITS from kcore for 52-bits VA kernel

2022-01-20 Thread Philipp Rudo
Hi Pingfan,

On Tue, 18 Jan 2022 15:48:11 +0800
Pingfan Liu  wrote:

> phys_to_virt() calculates virtual address. As a important factor,
> page_offset is excepted to be accurate.
> 
> Since arm64 kernel exposes va_bits through vmcore, using it.
> 
> Signed-off-by: Pingfan Liu 
> Cc: Kairui Song 
> Cc: Simon Horman 
> Cc: Philipp Rudo 
> To: kexec@lists.infradead.org

looks good

Reviewed-by: Philipp Rudo 

> ---
>  kexec/arch/arm64/kexec-arm64.c | 34 ++
>  util_lib/elf_info.c|  5 +
>  util_lib/include/elf_info.h|  1 +
>  3 files changed, 36 insertions(+), 4 deletions(-)
> 
> diff --git a/kexec/arch/arm64/kexec-arm64.c b/kexec/arch/arm64/kexec-arm64.c
> index 33cc258..793799b 100644
> --- a/kexec/arch/arm64/kexec-arm64.c
> +++ b/kexec/arch/arm64/kexec-arm64.c
> @@ -54,7 +54,7 @@
>  static bool try_read_phys_offset_from_kcore = false;
>  
>  /* Machine specific details. */
> -static int va_bits;
> +static int va_bits = -1;
>  static unsigned long page_offset;
>  
>  /* Global varables the core kexec routines expect. */
> @@ -876,7 +876,18 @@ static inline void set_phys_offset(int64_t v, char 
> *set_method)
>  
>  static int get_va_bits(void)
>  {
> - unsigned long long stext_sym_addr = get_kernel_sym("_stext");
> + unsigned long long stext_sym_addr;
> +
> + /*
> +  * if already got from kcore
> +  */
> + if (va_bits != -1)
> + goto out;
> +
> +
> + /* For kernel older than v4.19 */
> + fprintf(stderr, "Warning, can't get the VA_BITS from kcore\n");
> + stext_sym_addr = get_kernel_sym("_stext");
>  
>   if (stext_sym_addr == 0) {
>   fprintf(stderr, "Can't get the symbol of _stext.\n");
> @@ -900,6 +911,7 @@ static int get_va_bits(void)
>   return -1;
>   }
>  
> +out:
>   dbgprintf("va_bits : %d\n", va_bits);
>  
>   return 0;
> @@ -917,14 +929,27 @@ int get_page_offset(unsigned long *page_offset)
>   if (ret < 0)
>   return ret;
>  
> - *page_offset = UINT64_MAX << (va_bits - 1);
> + if (va_bits < 52)
> + *page_offset = UINT64_MAX << (va_bits - 1);
> + else
> + *page_offset = UINT64_MAX << va_bits;
> +
>   dbgprintf("page_offset : %lx\n", *page_offset);
>  
>   return 0;
>  }
>  
> +static void arm64_scan_vmcoreinfo(char *pos)
> +{
> + const char *str;
> +
> + str = "NUMBER(VA_BITS)=";
> + if (memcmp(str, pos, strlen(str)) == 0)
> + va_bits = strtoul(pos + strlen(str), NULL, 10);
> +}
> +
>  /**
> - * get_phys_offset_from_vmcoreinfo_pt_note - Helper for getting PHYS_OFFSET
> + * get_phys_offset_from_vmcoreinfo_pt_note - Helper for getting PHYS_OFFSET 
> (and va_bits)
>   * from VMCOREINFO note inside 'kcore'.
>   */
>  
> @@ -937,6 +962,7 @@ static int get_phys_offset_from_vmcoreinfo_pt_note(long 
> *phys_offset)
>   return EFAILED;
>   }
>  
> + arch_scan_vmcoreinfo = arm64_scan_vmcoreinfo;
>   ret = read_phys_offset_elf_kcore(fd, phys_offset);
>  
>   close(fd);
> diff --git a/util_lib/elf_info.c b/util_lib/elf_info.c
> index 5574c7f..d252eff 100644
> --- a/util_lib/elf_info.c
> +++ b/util_lib/elf_info.c
> @@ -310,6 +310,8 @@ int get_pt_load(int idx,
>  
>  #define NOT_FOUND_LONG_VALUE (-1)
>  
> +void (*arch_scan_vmcoreinfo)(char *pos);
> +
>  void scan_vmcoreinfo(char *start, size_t size)
>  {
>   char *last = start + size - 1;
> @@ -551,6 +553,9 @@ void scan_vmcoreinfo(char *start, size_t size)
>   }
>   }
>  
> + if (arch_scan_vmcoreinfo != NULL)
> + (*arch_scan_vmcoreinfo)(pos);
> +
>   if (last_line)
>   break;
>   }
> diff --git a/util_lib/include/elf_info.h b/util_lib/include/elf_info.h
> index f550d86..fdf4c3d 100644
> --- a/util_lib/include/elf_info.h
> +++ b/util_lib/include/elf_info.h
> @@ -31,5 +31,6 @@ int get_pt_load(int idx,
>  int read_phys_offset_elf_kcore(int fd, long *phys_off);
>  int read_elf(int fd);
>  void dump_dmesg(int fd, void (*handler)(char*, unsigned int));
> +extern void (*arch_scan_vmcoreinfo)(char *pos);
>  
>  #endif /* ELF_INFO_H */


___
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec


Re: [PATCHv4 4/4] arm64: fix PAGE_OFFSET calc for flipped mm

2022-01-20 Thread Philipp Rudo
Hi Pingfan,

On Tue, 18 Jan 2022 15:48:12 +0800
Pingfan Liu  wrote:

> From: Kairui Song 
> 
> Since kernel commit 14c127c957c1 ('arm64: mm: Flip kernel VA space'),
> the memory layout on arm64 have changed, and kexec-tools can no longer
> get the the right PAGE_OFFSET based on _text symbol.
> 
> Prior to that, the kimage (_text) lays above PAGE_END with this layout:
> 0   -> VA_START : Usespace
> VA_START-> VA_START + 256M  : BPF JIT, Modules
> VA_START + 256M -> PAGE_OFFSET - (~GB misc) : Vmalloc (KERNEL _text HERE)
> PAGE_OFFSET -> ...  : * Linear map *
> 
> And here we have:
> VA_START= -1UL << VA_BITS
> PAGE_OFFSET = -1UL << (VA_BITS - 1)
> _text < -1UL << (VA_BITS - 1)
> 
> Kernel image lays somewhere between VA_START and PAGE_OFFSET, so we just
> calc VA_BITS by getting the highest unset bit of _text symbol address,
> and shift one less bit of VA_BITS to get page offset. This works as long
> as KASLR don't put kernel in a too high location (which is commented inline).
> 
> And after that commit, kernel layout have changed:
> 0   -> PAGE_OFFSET  : Userspace
> PAGE_OFFSET -> PAGE_END : * Linear map *
> PAGE_END-> PAGE_END + 128M  : bpf jit region
> PAGE_END + 128M -> PAGE_END + 256MB : modules
> PAGE_END + 256M -> ...  : vmalloc (KERNEL _text HERE)
> 
> Here we have:
> PAGE_OFFSET = -1UL << VA_BITS
> PAGE_END= -1UL << (VA_BITS - 1)
> _text > -1UL << (VA_BITS - 1)
> 
> Kernel image now lays above PAGE_END, so we have to shift one more bit to
> get the VA_BITS, and shift the exact VA_BITS for PAGE_OFFSET.
> 
> We can simply check if "_text > -1UL << (VA_BITS - 1)" is true to judge
> which layout is being used and shift the page offset occordingly.
> 
> Signed-off-by: Kairui Song 
> (rebased and stripped by Pingfan )
> Signed-off-by: Pingfan Liu 
> Cc: Simon Horman 
> Cc: Philipp Rudo 
> To: kexec@lists.infradead.org
> ---
>  kexec/arch/arm64/kexec-arm64.c | 14 +-
>  1 file changed, 13 insertions(+), 1 deletion(-)
> 
> diff --git a/kexec/arch/arm64/kexec-arm64.c b/kexec/arch/arm64/kexec-arm64.c
> index 793799b..ce7a5bb 100644
> --- a/kexec/arch/arm64/kexec-arm64.c
> +++ b/kexec/arch/arm64/kexec-arm64.c
> @@ -923,13 +923,25 @@ out:
>  
>  int get_page_offset(unsigned long *page_offset)
>  {
> + unsigned long long text_sym_addr, kernel_va_mid;
>   int ret;
>  
> + text_sym_addr = get_kernel_sym("_text");
> + if (text_sym_addr == 0) {
> + fprintf(stderr, "Can't get the symbol of _text to calculate 
> page_offset.\n");
> + return -1;
> + }
> +
>   ret = get_va_bits();
>   if (ret < 0)
>   return ret;
>  
> - if (va_bits < 52)
> + /* Since kernel 5.4, kernel image is put above
> +  * UINT64_MAX << (va_bits - 1)
> +  */
> + kernel_va_mid = UINT64_MAX << (va_bits - 1);
> + /* older kernel */
> + if (text_sym_addr < kernel_va_mid)
>   *page_offset = UINT64_MAX << (va_bits - 1);
>   else
>   *page_offset = UINT64_MAX << va_bits;

I would drop the kernel_va_mid and simply use

*page_offset = UINT64_MAX << (va_bits - 1)
if (*page_offset > text_sym_addr > *page_offset)
*page_offset = UINT64_MAX << va_bits

but that's more a matter of taste.

Reviewed-by: Philipp Rudo 


___
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec


Re: [PATCH v2 3/5] add slurp_proc_file()

2021-12-15 Thread Philipp Rudo
Hi Sven,

sorry, i need to nag again.

On Wed, 15 Dec 2021 11:18:34 +0100
Sven Schnelle  wrote:

> slurp_file() cannot be used to read proc files, as they are returning
> a size of zero in stat(). Add a function slurp_proc_file() which is
> similar to slurp_file(), but doesn't require the size of the file to
> be known.
> 
> Signed-off-by: Sven Schnelle 
> ---
>  kexec/kexec.c | 32 
>  1 file changed, 32 insertions(+)
> 
> diff --git a/kexec/kexec.c b/kexec/kexec.c
> index f63b36b771eb..a1acba2adf2a 100644
> --- a/kexec/kexec.c
> +++ b/kexec/kexec.c
> @@ -1106,6 +1106,38 @@ static void remove_parameter(char *line, const char 
> *param_name)
>   }
>  }
>  
> +static char *slurp_proc_file(const char *filename, size_t *len)
> +{
> + ssize_t ret, startpos = 0;
> + unsigned int size = 64;
> + char *buf = NULL, *tmp;
> + int fd;
> +
> + fd = open(filename, O_RDONLY);
> + if (fd == -1)
> + return NULL;
> +
> + do {
> + size *= 2;
> + tmp = realloc(buf, size);
> + if (!tmp) {
> + free(buf);
> + return NULL;
> + }
> + buf = tmp;
> +
> + ret = read(fd, buf + startpos, size - startpos);
> + if (ret < 0) {
> + free(buf);
> + return NULL;
> + }
> + startpos += ret;
> + *len = startpos;
> + } while (ret == size);

I don't think this will work as intended. 

1) read returns the bytes read. So ret has a maximum value of
   size - startpos and thus can only be equal to size on the first pass
   when startpos = 0.

2) it's not an error when read reads less bytes than requested but can
   happen when, e.g. it get's interrupted.

The simplest solution I see is to simply use 'while (ret)' Even when
that means that there is an extra pass in the loop with an extra
call to realloc.

The cleaner solution probably would be to put the read into a second
loop. I.e. the following should work (no tested)

do {
[...]
do {
ret = read(fd, buf + startpos, size - startpos);
if (ret < 0) {
free(buf);
return NULL;
}
startpos += ret;
while (ret && startpos != size);

*len = startpos;
} while (ret);

Thanks
Philipp


> +
> + return buf;
> +}
> +
>  /*
>   * Returns the contents of the current command line to be used with
>   * --reuse-cmdline option.  The function gets called from architecture 
> specific


___
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec


Re: [PATCH 3/3] arm64: read VA_BITS from kcore for 52-bits VA kernel

2021-12-15 Thread Philipp Rudo
Hi Pingfang,

On Fri, 10 Dec 2021 11:07:35 +0800
Pingfan Liu  wrote:

> phys_to_virt() calculates virtual address. As a important factor,
> page_offset is excepted to be accurate.
> 
> Since arm64 kernel exposes va_bits through vmcore, using it.
> 
> Signed-off-by: Pingfan Liu 
> ---
>  kexec/arch/arm64/kexec-arm64.c | 31 +++
>  kexec/arch/arm64/kexec-arm64.h |  1 +
>  util_lib/elf_info.c|  5 +
>  3 files changed, 33 insertions(+), 4 deletions(-)
> 
> diff --git a/kexec/arch/arm64/kexec-arm64.c b/kexec/arch/arm64/kexec-arm64.c
> index bd650e6..ccc92db 100644
> --- a/kexec/arch/arm64/kexec-arm64.c
> +++ b/kexec/arch/arm64/kexec-arm64.c
> @@ -54,7 +54,7 @@
>  static bool try_read_phys_offset_from_kcore = false;
>  
>  /* Machine specific details. */
> -static int va_bits;
> +static int va_bits = -1;
>  static unsigned long page_offset;
>  
>  /* Global varables the core kexec routines expect. */
> @@ -876,7 +876,15 @@ static inline void set_phys_offset(long v, char 
> *set_method)
>  
>  static int get_va_bits(void)
>  {
> - unsigned long long stext_sym_addr = get_kernel_sym("_stext");
> + unsigned long long stext_sym_addr;
> +
> + /*
> +  * if already got from kcore
> +  */
> + if (va_bits != -1)
> + goto out;
> +
> + stext_sym_addr = get_kernel_sym("_stext");
>  
>   if (stext_sym_addr == 0) {
>   fprintf(stderr, "Can't get the symbol of _stext.\n");
> @@ -900,6 +908,7 @@ static int get_va_bits(void)
>   return -1;
>   }
>  
> +out:
>   dbgprintf("va_bits : %d\n", va_bits);
>  
>   return 0;
> @@ -917,14 +926,27 @@ int get_page_offset(unsigned long *page_offset)
>   if (ret < 0)
>   return ret;
>  
> - page_offset = (0xUL) << (va_bits - 1);
> + if (va_bits < 52)
> + *page_offset = (0xUL) << (va_bits - 1);
> + else
> + *page_offset = (0xUL) << va_bits;

wouldn't it make sense to use ULONG_MAX here? At least for me it would
be much better readable.

>   dbgprintf("page_offset : %lx\n", page_offset);
>  
>   return 0;
>  }
>  
> +static void arm64_scan_vmcoreinfo(char *pos)
> +{
> + const char *str;
> +
> + str = "NUMBER(VA_BITS)=";
> + if (memcmp(str, pos, strlen(str)) == 0)
> + va_bits = strtoul(pos + strlen(str), NULL, 10);
> +}
> +
>  /**
> - * get_phys_offset_from_vmcoreinfo_pt_note - Helper for getting PHYS_OFFSET
> + * get_phys_offset_from_vmcoreinfo_pt_note - Helper for getting PHYS_OFFSET 
> (and va_bits)
>   * from VMCOREINFO note inside 'kcore'.
>   */
>  
> @@ -937,6 +959,7 @@ static int get_phys_offset_from_vmcoreinfo_pt_note(long 
> *phys_offset)
>   return EFAILED;
>   }
>  
> + arch_scan_vmcoreinfo = arm64_scan_vmcoreinfo;
>   ret = read_phys_offset_elf_kcore(fd, phys_offset);
>  
>   close(fd);
> diff --git a/kexec/arch/arm64/kexec-arm64.h b/kexec/arch/arm64/kexec-arm64.h
> index ed99d9d..d291705 100644
> --- a/kexec/arch/arm64/kexec-arm64.h
> +++ b/kexec/arch/arm64/kexec-arm64.h
> @@ -66,6 +66,7 @@ struct arm64_mem {
>  
>  #define arm64_mem_ngv UINT64_MAX
>  extern struct arm64_mem arm64_mem;
> +extern void (*arch_scan_vmcoreinfo)(char *pos);

This definition isn't arm64 specific. I think this should go to
util_lib/include/elf_info.h.

Thanks
Philipp

>  
>  uint64_t get_phys_offset(void);
>  uint64_t get_vp_offset(void);
> diff --git a/util_lib/elf_info.c b/util_lib/elf_info.c
> index 5574c7f..d252eff 100644
> --- a/util_lib/elf_info.c
> +++ b/util_lib/elf_info.c
> @@ -310,6 +310,8 @@ int get_pt_load(int idx,
>  
>  #define NOT_FOUND_LONG_VALUE (-1)
>  
> +void (*arch_scan_vmcoreinfo)(char *pos);
> +
>  void scan_vmcoreinfo(char *start, size_t size)
>  {
>   char *last = start + size - 1;
> @@ -551,6 +553,9 @@ void scan_vmcoreinfo(char *start, size_t size)
>   }
>   }
>  
> + if (arch_scan_vmcoreinfo != NULL)
> + (*arch_scan_vmcoreinfo)(pos);
> +
>   if (last_line)
>   break;
>   }


___
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec


Re: [PATCH 2/3] arm64/crashdump: unify routine to get page_offset

2021-12-15 Thread Philipp Rudo
Hi Pinfang,

On Fri, 10 Dec 2021 11:07:34 +0800
Pingfan Liu  wrote:

> There are two funcs to get page_offset:
>   get_kernel_page_offset()
>   get_page_offset()
> 
> Since get_kernel_page_offset() does not observe the kernel formula, and
> remove it. Unify them in order to introduce 52-bits VA kernel more
> easily in the coming patch.
> 
> Signed-off-by: Pingfan Liu 
> ---
>  kexec/arch/arm64/crashdump-arm64.c | 23 +--
>  kexec/arch/arm64/kexec-arm64.c |  4 ++--
>  kexec/arch/arm64/kexec-arm64.h |  1 +
>  3 files changed, 4 insertions(+), 24 deletions(-)
> 
> diff --git a/kexec/arch/arm64/crashdump-arm64.c 
> b/kexec/arch/arm64/crashdump-arm64.c
> index a02019a..0a8d44c 100644
> --- a/kexec/arch/arm64/crashdump-arm64.c
> +++ b/kexec/arch/arm64/crashdump-arm64.c

[...]

> diff --git a/kexec/arch/arm64/kexec-arm64.c b/kexec/arch/arm64/kexec-arm64.c
> index 7373fa3..bd650e6 100644
> --- a/kexec/arch/arm64/kexec-arm64.c
> +++ b/kexec/arch/arm64/kexec-arm64.c
> @@ -909,7 +909,7 @@ static int get_va_bits(void)
>   * get_page_offset - Helper for getting PAGE_OFFSET
>   */
>  
> -static int get_page_offset(void)
> +int get_page_offset(unsigned long *page_offset)
>  {
>   int ret;
>  

The new page_offset is never used in this patch. There's still the line
left with the global page_offset but that will cause a type miss-match
(unsigend long vs unsigned log *). In the next patch you do the update
(page_offset -> *page_offset) but in the meantime the code is broken.

Thanks
Philipp

> @@ -954,7 +954,7 @@ int get_phys_base_from_pt_load(long *phys_offset)
>   unsigned long long phys_start;
>   unsigned long long virt_start;
>  
> - ret = get_page_offset();
> + ret = get_page_offset(_offset);
>   if (ret < 0)
>   return ret;
>  
> diff --git a/kexec/arch/arm64/kexec-arm64.h b/kexec/arch/arm64/kexec-arm64.h
> index 1844b67..ed99d9d 100644
> --- a/kexec/arch/arm64/kexec-arm64.h
> +++ b/kexec/arch/arm64/kexec-arm64.h
> @@ -69,6 +69,7 @@ extern struct arm64_mem arm64_mem;
>  
>  uint64_t get_phys_offset(void);
>  uint64_t get_vp_offset(void);
> +int get_page_offset(unsigned long *offset);
>  
>  static inline void reset_vp_offset(void)
>  {


___
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec


Re: [PATCH] kernel/crash_core.c: No judgment required

2021-12-14 Thread Philipp Rudo
Hi lizhe,

On Thu,  9 Dec 2021 19:20:03 -0800
lizhe  wrote:

> No judgment required ck_cmdline is NULL
> its caller has alreadly judged, see __parse_crashkernel
> function
> 
> Signed-off-by: lizhe 
> ---
>  kernel/crash_core.c | 3 ---
>  1 file changed, 3 deletions(-)
> 
> diff --git a/kernel/crash_core.c b/kernel/crash_core.c
> index eb53f5ec62c9..9981cf9b9fe4 100644
> --- a/kernel/crash_core.c
> +++ b/kernel/crash_core.c
> @@ -221,9 +221,6 @@ static __init char *get_last_crashkernel(char *cmdline,
>   p = strstr(p+1, name);
>   }
>  
> - if (!ck_cmdline)
> - return NULL;
> -
>   return ck_cmdline;
>  }
>  

I agree that the if-block is not needed and can be removed. However, I
cannot follow your reasoning in the commit message. Could you please
explain it in more detail.

The reason why I think that the 'if' can be removed is that the
expression can only be true when ck_cmdline = NULL. But with that the
last three lines are equivalent to

if (!ck_cmdline)
return ck_cmdline;

return ck_cmdline;

Which simply doesn't make any sense.

Thanks
Philipp


___
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec


Re: [PATCH 1/1] s390: handle R_390_PLT32DBL reloc entries in machine_apply_elf_rel()

2021-12-13 Thread Philipp Rudo
Hi Alexander,

@Alexander: Thanks for taking care of this.

On Wed,  8 Dec 2021 13:53:55 +0100
Alexander Egorenkov  wrote:

> Starting with gcc 11.3, the C compiler will generate PLT-relative function
> calls even if they are local and do not require it. Later on during linking,
> the linker will replace all PLT-relative calls to local functions with
> PC-relative ones. Unfortunately, the purgatory code of kexec/kdump is
> not being linked as a regular executable or shared library would have been,
> and therefore, all PLT-relative addresses remain in the generated purgatory
> object code unresolved. This leads to the situation where the purgatory
> code is being executed during kdump with all PLT-relative addresses
> unresolved. And this results in endless loops within the purgatory code.

Tiny nit. The last two sentences describe the situation in the kernel.
Luckily the kexec-tools do proper error checking and die with

"Unknown rela relocation: 0x14 0x73c0901c"

when they encounter an unknown relocation type.

Anyway, the code is correct

Reviewed-by: Philipp Rudo 

> Furthermore, the clang C compiler has always behaved like described above
> and this commit should fix the purgatory code built with the latter.
> 
> Because the purgatory code is no regular executable or shared library,
> contains only calls to local functions and has no PLT, all R_390_PLT32DBL
> relocation entries can be resolved just like a R_390_PC32DBL one.
> 
> * 
> https://refspecs.linuxfoundation.org/ELF/zSeries/lzsabi0_zSeries/x1633.html#AEN1699
> 
> Relocation entries of purgatory code generated with gcc 11.3
> 
> 
> $ readelf -r purgatory/purgatory.o
> 
> Relocation section '.rela.text' at offset 0x6e8 contains 27 entries:
>   Offset  Info   Type   Sym. ValueSym. Name + 
> Addend
> 000c  00030013 R_390_PC32DBL  .data + 2
> 001a  00100014 R_390_PLT32DBL sha256_starts + 
> 2
> 0030  00110014 R_390_PLT32DBL sha256_update + 
> 2
> 0046  00120014 R_390_PLT32DBL sha256_finish + 
> 2
> 0050  00030013 R_390_PC32DBL  .data + 102
> 005a  00130014 R_390_PLT32DBL memcmp + 2
> ...
> 0118  00160014 R_390_PLT32DBL setup_arch + 2
> 011e  00030013 R_390_PC32DBL  .data + 2
> 012c  000f0014 R_390_PLT32DBL 
> verify_sha256_digest + 2
> 0142  00170014 R_390_PLT32DBL
> post_verification[...] + 2
> 
> Relocation entries of purgatory code generated with gcc 11.2
> 
> 
> $ readelf -r purgatory/purgatory.o
> 
> Relocation section '.rela.text' at offset 0x6e8 contains 27 entries:
>   Offset  Info   Type   Sym. ValueSym. Name + 
> Addend
> 000e  00030013 R_390_PC32DBL  .data + 2
> 001c  00100013 R_390_PC32DBL  sha256_starts + 
> 2
> 0036  00110013 R_390_PC32DBL  sha256_update + 
> 2
> 0048  00120013 R_390_PC32DBL  sha256_finish + 
> 2
> 0052  00030013 R_390_PC32DBL  .data + 102
> 005c  00130013 R_390_PC32DBL  memcmp + 2
> ...
> 011a  00160013 R_390_PC32DBL  setup_arch + 2
> 0120  00030013 R_390_PC32DBL  .data + 122
> 0130  000f0013 R_390_PC32DBL  
> verify_sha256_digest + 2
> 0146  00170013 R_390_PC32DBL  
> post_verification[...] + 2
> 
> Corresponding s390 kernel discussion:
> * 
> https://lore.kernel.org/linux-s390/20211208105801.188140-1-egore...@linux.ibm.com/T/#u
> 
> Signed-off-by: Alexander Egorenkov 
> Reported-by: Tao Liu 
> Suggested-by: Philipp Rudo 
> ---
>  kexec/arch/s390/kexec-elf-rel-s390.c | 3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
> 
> diff --git a/kexec/arch/s390/kexec-elf-rel-s390.c 
> b/kexec/arch/s390/kexec-elf-rel-s390.c
> index a5e1b7345578..91ba86a9991d 100644
> --- a/kexec/arch/s390/kexec-elf-rel-s390.c
> +++ b/kexec/arch/s390/kexec-elf-rel-s390.c
> @@ -56,6 +56,7 @@ void machine_apply_elf_rel(struct mem_ehdr *UNUSED(ehdr),
>   case R_390_PC16:/* PC relative 16 bit.  */
>   case R_390_PC16DBL: /* PC relative 16 bit shifted by 1.  */
>   case R_390_PC32DBL: /* PC relative 32 bit shifted by 1.  */
> + 

[PATCH v2] kernel/crash_core: suppress unknown crashkernel parameter warning

2021-12-08 Thread Philipp Rudo
When booting with crashkernel= on the kernel command line a warning
similar to

[0.038294] Kernel command line: ro console=ttyS0 crashkernel=256M
[0.038353] Unknown kernel command line parameters "crashkernel=256M", will 
be passed to user space.

is printed. This comes from crashkernel= being parsed independent from
the kernel parameter handling mechanism. So the code in init/main.c
doesn't know that crashkernel= is a valid kernel parameter and prints
this incorrect warning. Suppress the warning by adding a dummy
early_param handler for crashkernel=.

Fixes: 86d1919a4fb0 ("init: print out unknown kernel parameters")
Signed-off-by: Philipp Rudo 
Acked-by: Baoquan He 

---
v2:
- improve commit message
- add Acked-by from Baoquan

---
 kernel/crash_core.c | 11 +++
 1 file changed, 11 insertions(+)

diff --git a/kernel/crash_core.c b/kernel/crash_core.c
index eb53f5ec62c9..256cf6db573c 100644
--- a/kernel/crash_core.c
+++ b/kernel/crash_core.c
@@ -6,6 +6,7 @@
 
 #include 
 #include 
+#include 
 #include 
 #include 
 
@@ -295,6 +296,16 @@ int __init parse_crashkernel_low(char *cmdline,
"crashkernel=", suffix_tbl[SUFFIX_LOW]);
 }
 
+/*
+ * Add a dummy early_param handler to mark crashkernel= as a known command line
+ * parameter and suppress incorrect warnings in init/main.c.
+ */
+static int __init parse_crashkernel_dummy(char *arg)
+{
+   return 0;
+}
+early_param("crashkernel", parse_crashkernel_dummy);
+
 Elf_Word *append_elf_note(Elf_Word *buf, char *name, unsigned int type,
  void *data, size_t data_len)
 {
-- 
2.31.1


___
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec


Re: [PATCH v2 0/6] KEXEC_SIG with appended signature

2021-12-08 Thread Philipp Rudo
Hi Michal,

On Tue, 7 Dec 2021 18:32:21 +0100
Michal Suchánek  wrote:

> On Tue, Dec 07, 2021 at 05:10:14PM +0100, Philipp Rudo wrote:
> > Hi Michal,
> > 
> > i finally had the time to take a closer look at the series. Except for
> > the nit in patch 4 and my personal preference in patch 6 the code looks
> > good to me.
> > 
> > What I don't like are the commit messages on the first commits. In my
> > opinion they are so short that they are almost useless. For example in
> > patch 2 there is absolutely no explanation why you can simply copy the
> > s390 over to ppc.  
> 
> They use the same signature format. I suppose I can add a note saying
> that.

The note is what I was asking for. For me the commit message is an
important piece of documentation for other developers (or yourself in a
year). That's why in my opinion it's important to describe _why_ you do
something in it as you cannot get the _why_ by reading the code.

> > Or in patch 3 you are silently changing the error
> > code in kexec from EKEYREJECT to ENODATA. So I would appreciate it if  
> 
> Not sure what I should do about this. The different implementations use
> different random error codes, and when they are unified the error code
> clearly changes for one or the other.

My complaint wasn't that you change the return code. There's no way to
avoid choosing one over the other. It's again that you don't document
the change in the commit message for others.

> Does anything depend on a particular error code returned?

Not that I know of. At least in the kexec-tools ENODATA and EKEYREJECT
are handled the same way.

Thanks
Philipp


> Thanks
> 
> Michal
> 
> > you could improve them a little.
> > 
> > Thanks
> > Philipp
> > 
> > On Thu, 25 Nov 2021 19:02:38 +0100
> > Michal Suchanek  wrote:
> >   
> > > Hello,
> > > 
> > > This is resend of the KEXEC_SIG patchset.
> > > 
> > > The first patch is new because it'a a cleanup that does not require any
> > > change to the module verification code.
> > > 
> > > The second patch is the only one that is intended to change any
> > > functionality.
> > > 
> > > The rest only deduplicates code but I did not receive any review on that
> > > part so I don't know if it's desirable as implemented.
> > > 
> > > The first two patches can be applied separately without the rest.
> > > 
> > > Thanks
> > > 
> > > Michal
> > > 
> > > Michal Suchanek (6):
> > >   s390/kexec_file: Don't opencode appended signature check.
> > >   powerpc/kexec_file: Add KEXEC_SIG support.
> > >   kexec_file: Don't opencode appended signature verification.
> > >   module: strip the signature marker in the verification function.
> > >   module: Use key_being_used_for for log messages in
> > > verify_appended_signature
> > >   module: Move duplicate mod_check_sig users code to mod_parse_sig
> > > 
> > >  arch/powerpc/Kconfig | 11 +
> > >  arch/powerpc/kexec/elf_64.c  | 14 ++
> > >  arch/s390/kernel/machine_kexec_file.c| 42 ++
> > >  crypto/asymmetric_keys/asymmetric_type.c |  1 +
> > >  include/linux/module_signature.h |  1 +
> > >  include/linux/verification.h |  4 ++
> > >  kernel/module-internal.h |  2 -
> > >  kernel/module.c  | 12 +++--
> > >  kernel/module_signature.c| 56 +++-
> > >  kernel/module_signing.c  | 33 +++---
> > >  security/integrity/ima/ima_modsig.c  | 22 ++
> > >  11 files changed, 113 insertions(+), 85 deletions(-)
> > >   
> >   
> 


___
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec


Re: [PATCH v2 6/6] module: Move duplicate mod_check_sig users code to mod_parse_sig

2021-12-07 Thread Philipp Rudo
Hi Michal,

On Thu, 25 Nov 2021 19:02:44 +0100
Michal Suchanek  wrote:

> Multiple users of mod_check_sig check for the marker, then call
> mod_check_sig, extract signature length, and remove the signature.
> 
> Put this code in one place together with mod_check_sig.
> 
> Signed-off-by: Michal Suchanek 
> ---
>  include/linux/module_signature.h|  1 +
>  kernel/module_signature.c   | 56 -
>  kernel/module_signing.c | 26 +++---
>  security/integrity/ima/ima_modsig.c | 22 ++--
>  4 files changed, 63 insertions(+), 42 deletions(-)
> 
> diff --git a/include/linux/module_signature.h 
> b/include/linux/module_signature.h
> index 7eb4b00381ac..1343879b72b3 100644
> --- a/include/linux/module_signature.h
> +++ b/include/linux/module_signature.h
> @@ -42,5 +42,6 @@ struct module_signature {
>  
>  int mod_check_sig(const struct module_signature *ms, size_t file_len,
> const char *name);
> +int mod_parse_sig(const void *data, size_t *len, size_t *sig_len, const char 
> *name);
>  
>  #endif /* _LINUX_MODULE_SIGNATURE_H */
> diff --git a/kernel/module_signature.c b/kernel/module_signature.c
> index 00132d12487c..784b40575ee4 100644
> --- a/kernel/module_signature.c
> +++ b/kernel/module_signature.c
> @@ -8,14 +8,36 @@
>  
>  #include 
>  #include 
> +#include 
>  #include 
>  #include 
>  
> +/**
> + * mod_check_sig_marker - check that the given data has signature marker at 
> the end
> + *
> + * @data:Data with appended signature
> + * @len: Length of data. Signature marker length is subtracted on 
> success.
> + */
> +static inline int mod_check_sig_marker(const void *data, size_t *len)

I personally don't like it when a function has a "check" in it's name
as it doesn't describe what the function is checking for. For me
mod_has_sig_marker is much more precise. I would use that instead.

Thanks
Philipp

> +{
> + const unsigned long markerlen = sizeof(MODULE_SIG_STRING) - 1;
> +
> + if (markerlen > *len)
> + return -ENODATA;
> +
> + if (memcmp(data + *len - markerlen, MODULE_SIG_STRING,
> +markerlen))
> + return -ENODATA;
> +
> + *len -= markerlen;
> + return 0;
> +}
> +
>  /**
>   * mod_check_sig - check that the given signature is sane
>   *
>   * @ms:  Signature to check.
> - * @file_len:Size of the file to which @ms is appended.
> + * @file_len:Size of the file to which @ms is appended (without the 
> marker).
>   * @name:What is being checked. Used for error messages.
>   */
>  int mod_check_sig(const struct module_signature *ms, size_t file_len,
> @@ -44,3 +66,35 @@ int mod_check_sig(const struct module_signature *ms, 
> size_t file_len,
>  
>   return 0;
>  }
> +
> +/**
> + * mod_parse_sig - check that the given signature is sane and determine 
> signature length
> + *
> + * @data:Data with appended signature.
> + * @len: Length of data. Signature and marker length is subtracted on 
> success.
> + * @sig_len: Length of signature. Filled on success.
> + * @name:What is being checked. Used for error messages.
> + */
> +int mod_parse_sig(const void *data, size_t *len, size_t *sig_len, const char 
> *name)
> +{
> + const struct module_signature *sig;
> + int rc;
> +
> + rc = mod_check_sig_marker(data, len);
> + if (rc)
> + return rc;
> +
> + if (*len < sizeof(*sig))
> + return -ENODATA;
> +
> + sig = (const struct module_signature *)(data + (*len - sizeof(*sig)));
> +
> + rc = mod_check_sig(sig, *len, name);
> + if (rc)
> + return rc;
> +
> + *sig_len = be32_to_cpu(sig->sig_len);
> + *len -= *sig_len + sizeof(*sig);
> +
> + return 0;
> +}
> diff --git a/kernel/module_signing.c b/kernel/module_signing.c
> index cef72a6f6b5d..02bbca90f467 100644
> --- a/kernel/module_signing.c
> +++ b/kernel/module_signing.c
> @@ -25,35 +25,17 @@ int verify_appended_signature(const void *data, size_t 
> *len,
> struct key *trusted_keys,
> enum key_being_used_for purpose)
>  {
> - const unsigned long markerlen = sizeof(MODULE_SIG_STRING) - 1;
>   struct module_signature ms;
> - size_t sig_len, modlen = *len;
> + size_t sig_len;
>   int ret;
>  
> - pr_devel("==>%s %s(,%zu)\n", __func__, key_being_used_for[purpose], 
> modlen);  
> + pr_devel("==>%s %s(,%zu)\n", __func__, key_being_used_for[purpose], 
> *len);
>  
> - if (markerlen > modlen)
> - return -ENODATA;
> -
> - if (memcmp(data + modlen - markerlen, MODULE_SIG_STRING,
> -markerlen))
> - return -ENODATA;
> - modlen -= markerlen;
> -
> - if (modlen <= sizeof(ms))
> - return -EBADMSG;
> -
> - memcpy(, data + (modlen - sizeof(ms)), sizeof(ms));
> -
> - ret = mod_check_sig(, modlen, key_being_used_for[purpose]);
> + ret = mod_parse_sig(data, len, 

Re: [PATCH v2 4/6] module: strip the signature marker in the verification function.

2021-12-07 Thread Philipp Rudo
Hi Michal,

On Thu, 25 Nov 2021 19:02:42 +0100
Michal Suchanek  wrote:

> It is stripped by each caller separately.
> 
> Signed-off-by: Michal Suchanek 
> ---
>  arch/powerpc/kexec/elf_64.c   |  9 -
>  arch/s390/kernel/machine_kexec_file.c |  9 -
>  kernel/module.c   |  7 +--
>  kernel/module_signing.c   | 12 ++--

kernel/module_signing.c is only compiled with MODULE_SIG enabled but
KEXEC_SIG only selects MODULE_SIG_FORMAT. In the unlikely case that
KEXEC_SIG is enabled but MODULE_SIG isn't this causes a build breakage.
So you need to update KEXEC_SIG to select MODULE_SIG instead of
MODULE_SIG_FORMAT for s390 and ppc.

Thanks
Philipp

>  4 files changed, 11 insertions(+), 26 deletions(-)
> 
> diff --git a/arch/powerpc/kexec/elf_64.c b/arch/powerpc/kexec/elf_64.c
> index 266cb26d3ca0..63634c95265d 100644
> --- a/arch/powerpc/kexec/elf_64.c
> +++ b/arch/powerpc/kexec/elf_64.c
> @@ -156,15 +156,6 @@ static void *elf64_load(struct kimage *image, char 
> *kernel_buf,
>  int elf64_verify_sig(const char *kernel, unsigned long length)
>  {
>   size_t kernel_len = length;
> - const unsigned long marker_len = sizeof(MODULE_SIG_STRING) - 1;
> -
> - if (marker_len > kernel_len)
> - return -EKEYREJECTED;
> -
> - if (memcmp(kernel + kernel_len - marker_len, MODULE_SIG_STRING,
> -marker_len))
> - return -EKEYREJECTED;
> - kernel_len -= marker_len;
>  
>   return verify_appended_signature(kernel, _len, 
> VERIFY_USE_PLATFORM_KEYRING,
>   "kexec_file");
> diff --git a/arch/s390/kernel/machine_kexec_file.c 
> b/arch/s390/kernel/machine_kexec_file.c
> index 432797249db3..c4632c1a1b59 100644
> --- a/arch/s390/kernel/machine_kexec_file.c
> +++ b/arch/s390/kernel/machine_kexec_file.c
> @@ -27,20 +27,11 @@ const struct kexec_file_ops * const kexec_file_loaders[] 
> = {
>  int s390_verify_sig(const char *kernel, unsigned long length)
>  {
>   size_t kernel_len = length;
> - const unsigned long marker_len = sizeof(MODULE_SIG_STRING) - 1;
>  
>   /* Skip signature verification when not secure IPLed. */
>   if (!ipl_secure_flag)
>   return 0;
>  
> - if (marker_len > kernel_len)
> - return -EKEYREJECTED;
> -
> - if (memcmp(kernel + kernel_len - marker_len, MODULE_SIG_STRING,
> -marker_len))
> - return -EKEYREJECTED;
> - kernel_len -= marker_len;
> -
>   return verify_appended_signature(kernel, _len, 
> VERIFY_USE_PLATFORM_KEYRING,
>   "kexec_file");
>  }
> diff --git a/kernel/module.c b/kernel/module.c
> index 8481933dfa92..d91ca0f93a40 100644
> --- a/kernel/module.c
> +++ b/kernel/module.c
> @@ -2882,7 +2882,6 @@ static inline void kmemleak_load_module(const struct 
> module *mod,
>  static int module_sig_check(struct load_info *info, int flags)
>  {
>   int err = -ENODATA;
> - const unsigned long markerlen = sizeof(MODULE_SIG_STRING) - 1;
>   const char *reason;
>   const void *mod = info->hdr;
>  
> @@ -2890,11 +2889,7 @@ static int module_sig_check(struct load_info *info, 
> int flags)
>* Require flags == 0, as a module with version information
>* removed is no longer the module that was signed
>*/
> - if (flags == 0 &&
> - info->len > markerlen &&
> - memcmp(mod + info->len - markerlen, MODULE_SIG_STRING, markerlen) 
> == 0) {
> - /* We truncate the module to discard the signature */
> - info->len -= markerlen;
> + if (flags == 0) {
>   err = verify_appended_signature(mod, >len,
>   VERIFY_USE_SECONDARY_KEYRING, 
> "module");
>   if (!err) {
> diff --git a/kernel/module_signing.c b/kernel/module_signing.c
> index f492e410564d..4c28cb55275f 100644
> --- a/kernel/module_signing.c
> +++ b/kernel/module_signing.c
> @@ -15,8 +15,7 @@
>  #include "module-internal.h"
>  
>  /**
> - * verify_appended_signature - Verify the signature on a module with the
> - * signature marker stripped.
> + * verify_appended_signature - Verify the signature on a module
>   * @data: The data to be verified
>   * @len: Size of @data.
>   * @trusted_keys: Keyring to use for verification
> @@ -25,12 +24,21 @@
>  int verify_appended_signature(const void *data, size_t *len,
> struct key *trusted_keys, const char *what)
>  {
> + const unsigned long markerlen = sizeof(MODULE_SIG_STRING) - 1;
>   struct module_signature ms;
>   size_t sig_len, modlen = *len;
>   int ret;
>  
>   pr_devel("==>%s(,%zu)\n", __func__, modlen);  
>  
> + if (markerlen > modlen)
> + return -ENODATA;
> +
> + if (memcmp(data + modlen - markerlen, MODULE_SIG_STRING,
> +markerlen))
> + return -ENODATA;
> + modlen -= markerlen;
> +
>   if (modlen <= 

Re: [PATCH v2 0/6] KEXEC_SIG with appended signature

2021-12-07 Thread Philipp Rudo
Hi Michal,

i finally had the time to take a closer look at the series. Except for
the nit in patch 4 and my personal preference in patch 6 the code looks
good to me.

What I don't like are the commit messages on the first commits. In my
opinion they are so short that they are almost useless. For example in
patch 2 there is absolutely no explanation why you can simply copy the
s390 over to ppc. Or in patch 3 you are silently changing the error
code in kexec from EKEYREJECT to ENODATA. So I would appreciate it if
you could improve them a little.

Thanks
Philipp

On Thu, 25 Nov 2021 19:02:38 +0100
Michal Suchanek  wrote:

> Hello,
> 
> This is resend of the KEXEC_SIG patchset.
> 
> The first patch is new because it'a a cleanup that does not require any
> change to the module verification code.
> 
> The second patch is the only one that is intended to change any
> functionality.
> 
> The rest only deduplicates code but I did not receive any review on that
> part so I don't know if it's desirable as implemented.
> 
> The first two patches can be applied separately without the rest.
> 
> Thanks
> 
> Michal
> 
> Michal Suchanek (6):
>   s390/kexec_file: Don't opencode appended signature check.
>   powerpc/kexec_file: Add KEXEC_SIG support.
>   kexec_file: Don't opencode appended signature verification.
>   module: strip the signature marker in the verification function.
>   module: Use key_being_used_for for log messages in
> verify_appended_signature
>   module: Move duplicate mod_check_sig users code to mod_parse_sig
> 
>  arch/powerpc/Kconfig | 11 +
>  arch/powerpc/kexec/elf_64.c  | 14 ++
>  arch/s390/kernel/machine_kexec_file.c| 42 ++
>  crypto/asymmetric_keys/asymmetric_type.c |  1 +
>  include/linux/module_signature.h |  1 +
>  include/linux/verification.h |  4 ++
>  kernel/module-internal.h |  2 -
>  kernel/module.c  | 12 +++--
>  kernel/module_signature.c| 56 +++-
>  kernel/module_signing.c  | 33 +++---
>  security/integrity/ima/ima_modsig.c  | 22 ++
>  11 files changed, 113 insertions(+), 85 deletions(-)
> 


___
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec


Re: [PATCH 3/3] s390: add support for --reuse-cmdline

2021-12-07 Thread Philipp Rudo
Hi Sven,

makes absolutely sense to have this option. One problem though...

On Mon, 22 Nov 2021 08:14:01 +0100
Sven Schnelle  wrote:

> --reuse-cmdline reads the command line of the currently
> running kernel from /proc/cmdline and uses that for the
> kernel that should be kexec'd.
> 
> Signed-off-by: Sven Schnelle 
> Reviewed-by: Alexander Egorenkov 
> ---
>  kexec/arch/s390/include/arch/options.h | 10 ++
>  kexec/arch/s390/kexec-image.c  |  9 +
>  2 files changed, 15 insertions(+), 4 deletions(-)
> 
> diff --git a/kexec/arch/s390/include/arch/options.h 
> b/kexec/arch/s390/include/arch/options.h
> index 76044a301ceb..b030b61d61be 100644
[...]
> diff --git a/kexec/arch/s390/kexec-image.c b/kexec/arch/s390/kexec-image.c
> index dbeb689b830a..310d967ea331 100644
> --- a/kexec/arch/s390/kexec-image.c
> +++ b/kexec/arch/s390/kexec-image.c
> @@ -72,6 +72,10 @@ int image_s390_load_file(int argc, char **argv, struct 
> kexec_info *info)
>   case OPT_RAMDISK:
>   ramdisk = optarg;
>   break;
> + case OPT_REUSE_CMDLINE:
> + free(command_line);
> + command_line = get_command_line();

get_command_line reads a maximum of 2048 bytes from /prc/cmdline. With
the configurable size on s390 defaulting to 4096 bytes this will
ultimately cause problems. So you need to make get_command_line more
flexible first.

Thanks
Philipp

> + break;
>   }
>   }
>  
> @@ -123,6 +127,10 @@ image_s390_load(int argc, char **argv, const char 
> *kernel_buf,
>   if (command_line_add(optarg))
>   return -1;
>   break;
> + case OPT_REUSE_CMDLINE:
> + free(command_line);
> + command_line = get_command_line();
> + break;
>   case OPT_RAMDISK:
>   ramdisk = optarg;
>   break;
> @@ -223,5 +231,6 @@ image_s390_usage(void)
>   printf("--command-line=STRING Set the kernel command line to STRING.\n"
>  "--append=STRING   Set the kernel command line to STRING.\n"
>  "--initrd=FILENAME Use the file FILENAME as a ramdisk.\n"
> +"--reuse-cmdline   Use kernel command line from running 
> system.\n"
>   );
>  }


___
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec


Re: [PATCH 1/3] s390: add variable command line size

2021-12-07 Thread Philipp Rudo
Hi Sven,

On Mon, 22 Nov 2021 08:13:59 +0100
Sven Schnelle  wrote:

> Newer s390 kernels support a command line size longer than 896
> bytes. Such kernels contain a new member in the parameter area,
> which might be utilized by tools like kexec. Older kernels have
> the location initialized to zero, so we check whether there's a
> non-zero number present and use that. If there isn't, we fallback
> to the legacy command line size of 896 bytes.
> 
> Signed-off-by: Sven Schnelle 
> Reviewed-by: Alexander Egorenkov 
> ---
>  kexec/arch/s390/kexec-image.c | 53 ---
>  kexec/arch/s390/kexec-s390.h  | 19 +++--
>  2 files changed, 46 insertions(+), 26 deletions(-)
> 
> diff --git a/kexec/arch/s390/kexec-image.c b/kexec/arch/s390/kexec-image.c
> index 3c24fdfe3c7c..7747d02399db 100644
> --- a/kexec/arch/s390/kexec-image.c
> +++ b/kexec/arch/s390/kexec-image.c
> @@ -25,7 +25,7 @@
>  #include 
>  
>  static uint64_t crash_base, crash_end;
> -static char command_line[COMMAND_LINESIZE];
> +static char *command_line;

isn't this the perfect opportunity to get rid of this global variable
and...

>  static void add_segment_check(struct kexec_info *info, const void *buf,
> size_t bufsz, unsigned long base, size_t memsz)
> @@ -38,11 +38,16 @@ static void add_segment_check(struct kexec_info *info, 
> const void *buf,
>  
>  int command_line_add(const char *str)

... simply pass the pointer as an argument ;)

Thanks
Philipp

>  {
> - if (strlen(command_line) + strlen(str) + 1 > COMMAND_LINESIZE) {
> - fprintf(stderr, "Command line too long.\n");
> + char *tmp = NULL;
> +
> + tmp = concat_cmdline(command_line, str);
> + if (!tmp) {
> + fprintf(stderr, "out of memory\n");
>   return -1;
>   }
> - strcat(command_line, str);
> +
> + free(command_line);
> + command_line = tmp;
>   return 0;
>  }
>  
> @@ -78,6 +83,8 @@ int image_s390_load_file(int argc, char **argv, struct 
> kexec_info *info)
>   if (info->initrd_fd == -1) {
>   fprintf(stderr, "Could not open initrd file %s:%s\n",
>   ramdisk, strerror(errno));
> + free(command_line);
> + command_line = NULL;
>   return -1;
>   }
>   }
> @@ -97,7 +104,7 @@ image_s390_load(int argc, char **argv, const char 
> *kernel_buf,
>   const char *ramdisk;
>   off_t ramdisk_len;
>   unsigned int ramdisk_origin;
> - int opt;
> + int opt, ret = -1;
>  
>   if (info->file_mode)
>   return image_s390_load_file(argc, argv, info);
> @@ -112,7 +119,6 @@ image_s390_load(int argc, char **argv, const char 
> *kernel_buf,
>   };
>   static const char short_options[] = KEXEC_OPT_STR "";
>  
> - command_line[0] = 0;
>   ramdisk = NULL;
>   ramdisk_len = 0;
>   ramdisk_origin = 0;
> @@ -132,7 +138,7 @@ image_s390_load(int argc, char **argv, const char 
> *kernel_buf,
>   if (info->kexec_flags & KEXEC_ON_CRASH) {
>   if (parse_iomem_single("Crash kernel\n", _base,
>  _end))
> - return -1;
> + goto out;
>   }
>  
>   /* Add kernel segment */
> @@ -151,7 +157,7 @@ image_s390_load(int argc, char **argv, const char 
> *kernel_buf,
>   rd_buffer = slurp_file_mmap(ramdisk, _len);
>   if (rd_buffer == NULL) {
>   fprintf(stderr, "Could not read ramdisk.\n");
> - return -1;
> + goto out;
>   }
>   ramdisk_origin = MAX(RAMDISK_ORIGIN_ADDR, kernel_size);
>   ramdisk_origin = _ALIGN_UP(ramdisk_origin, 0x10);
> @@ -160,7 +166,7 @@ image_s390_load(int argc, char **argv, const char 
> *kernel_buf,
>   }
>   if (info->kexec_flags & KEXEC_ON_CRASH) {
>   if (load_crashdump_segments(info, crash_base, crash_end))
> - return -1;
> + goto out;
>   } else {
>   info->entry = (void *) IMAGE_READ_OFFSET;
>   }
> @@ -183,15 +189,28 @@ image_s390_load(int argc, char **argv, const char 
> *kernel_buf,
>   *tmp = crash_end - crash_base + 1;
>   }
>   }
> - /*
> -  * We will write a probably given command line.
> -  * First, erase the old area, then setup the new parameters:
> -  */
> - if (strlen(command_line) != 0) {
> - memset(krnl_buffer + COMMAND_LINE_OFFS, 0, COMMAND_LINESIZE);
> - memcpy(krnl_buffer + COMMAND_LINE_OFFS, command_line, 
> strlen(command_line));
> +
> + if (command_line) {
> + unsigned long maxsize;
> + char *dest = krnl_buffer + COMMAND_LINE_OFFS;
> +
> + maxsize = *(unsigned long *)(krnl_buffer + 
> MAX_COMMAND_LINESIZE_OFFS);
> + if 

Re: [PATCH] kernel/crash_core: suppress unknown crashkernel parameter warning

2021-12-07 Thread Philipp Rudo
Hi Baoquan,

On Tue, 7 Dec 2021 11:34:46 +0800
Baoquan He  wrote:

> On 12/06/21 at 12:17pm, Philipp Rudo wrote:
> > When booting with crashkernel= on the kernel command line a warning
> > similar to
> > 
> > [0.038294] Kernel command line: ro console=ttyS0 crashkernel=256M
> > [0.038353] Unknown kernel command line parameters "crashkernel=256M", 
> > will be passed to user space.
> > 
> > is printed. This originates from crashkernel= being parsed independent from
> > the early_param() mechanism. So the code in init/main.c doesn't know  
> 
> Not only the early_param(), __setup() also takes the same mechanism.
> It's just handled in different stage. You might need to call it kernel
> param handling mechanism, not sure if it's accurate.

you are right, "kernel param handling" is better. I used early_param as
that's where we would need to hook into if we wanted to use the common
kernel param handling. But I don't think it is worth it.

@akpm: do you update the commit message before sending the patch to
   Linus or shall I send a v2?

> > that crashkernel= is a valid kernel parameter and prints this incorrect
> > warning. Suppress the warning by adding a dummy early_param handler for
> > crashkernel=.  
> 
> The fix looks good to me, thanks.
> 
> Acked-by: Baoquan He 

Thanks

> By the way, on which arch did you find this issue? Ask because I am
> wondering whether there's any other similiar independent kernel cmdline
> handling from __setup_param(). If have, is there a chance to take a
> common method to handle them, e.g a generic function or a place to
> identify them. Just wild thought, I have no idea yet. Otherwise, we may
> need several this kind of dummy handler for each one.

The issue was first reported on s390 but I used x86 to test the fix.
The only other reported parameter I encountered was BOOT_IMAGE= which
is not a kernel parameter and thus correct. But in the corresponding
bugzilla Andrew (on cc) said "Gah! I thought I had squashed all of
these interesting uses of the kernel command line, it is like playing 
whack-a-mole."
So I believe there were multiple other parameters that had the same problem.

Thanks
Philipp

> > Fixes: 86d1919a4fb0 ("init: print out unknown kernel parameters")
> > Signed-off-by: Philipp Rudo 
> > ---
> >  kernel/crash_core.c | 11 +++
> >  1 file changed, 11 insertions(+)
> > 
> > diff --git a/kernel/crash_core.c b/kernel/crash_core.c
> > index eb53f5ec62c9..256cf6db573c 100644
> > --- a/kernel/crash_core.c
> > +++ b/kernel/crash_core.c
> > @@ -6,6 +6,7 @@
> >  
> >  #include 
> >  #include 
> > +#include 
> >  #include 
> >  #include 
> >  
> > @@ -295,6 +296,16 @@ int __init parse_crashkernel_low(char *cmdline,
> > "crashkernel=", suffix_tbl[SUFFIX_LOW]);
> >  }
> >  
> > +/*
> > + * Add a dummy early_param handler to mark crashkernel= as a known command 
> > line
> > + * parameter and suppress incorrect warnings in init/main.c.
> > + */
> > +static int __init parse_crashkernel_dummy(char *arg)
> > +{
> > +   return 0;
> > +}
> > +early_param("crashkernel", parse_crashkernel_dummy);
> > +
> >  Elf_Word *append_elf_note(Elf_Word *buf, char *name, unsigned int type,
> >   void *data, size_t data_len)
> >  {
> > -- 
> > 2.31.1
> > 
> > 
> > ___
> > kexec mailing list
> > kexec@lists.infradead.org
> > http://lists.infradead.org/mailman/listinfo/kexec
> >   
> 


___
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec


[PATCH] kernel/crash_core: suppress unknown crashkernel parameter warning

2021-12-06 Thread Philipp Rudo
When booting with crashkernel= on the kernel command line a warning
similar to

[0.038294] Kernel command line: ro console=ttyS0 crashkernel=256M
[0.038353] Unknown kernel command line parameters "crashkernel=256M", will 
be passed to user space.

is printed. This originates from crashkernel= being parsed independent from
the early_param() mechanism. So the code in init/main.c doesn't know
that crashkernel= is a valid kernel parameter and prints this incorrect
warning. Suppress the warning by adding a dummy early_param handler for
crashkernel=.

Fixes: 86d1919a4fb0 ("init: print out unknown kernel parameters")
Signed-off-by: Philipp Rudo 
---
 kernel/crash_core.c | 11 +++
 1 file changed, 11 insertions(+)

diff --git a/kernel/crash_core.c b/kernel/crash_core.c
index eb53f5ec62c9..256cf6db573c 100644
--- a/kernel/crash_core.c
+++ b/kernel/crash_core.c
@@ -6,6 +6,7 @@
 
 #include 
 #include 
+#include 
 #include 
 #include 
 
@@ -295,6 +296,16 @@ int __init parse_crashkernel_low(char *cmdline,
"crashkernel=", suffix_tbl[SUFFIX_LOW]);
 }
 
+/*
+ * Add a dummy early_param handler to mark crashkernel= as a known command line
+ * parameter and suppress incorrect warnings in init/main.c.
+ */
+static int __init parse_crashkernel_dummy(char *arg)
+{
+   return 0;
+}
+early_param("crashkernel", parse_crashkernel_dummy);
+
 Elf_Word *append_elf_note(Elf_Word *buf, char *name, unsigned int type,
  void *data, size_t data_len)
 {
-- 
2.31.1


___
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec


Re: [PATCH] s390/kexec: fix memory leak of ipl report buffer

2021-10-29 Thread Philipp Rudo
Hi Baoquan,

On Fri, 29 Oct 2021 17:26:35 +0800
Baoquan He  wrote:

> A memory leak is reported by kmemleak scanning:
> 
> unreferenced object 0x38000195000 (size 4096):
>   comm "kexec", pid 8548, jiffies 4294953647 (age 32443.270s)
>   hex dump (first 32 bytes):
> 00 00 00 c8 20 00 00 00 00 00 00 c0 02 80 00 00   ...
> 40 40 40 40 40 40 40 40 00 00 00 00 00 00 00 00  
>   backtrace:
> [<11a2f199>] __vmalloc_node_range+0xc0/0x140
> [<81fa2752>] vzalloc+0x5a/0x70
> [<63a4c92d>] ipl_report_finish+0x2c/0x180
> [<553304da>] kexec_file_add_ipl_report+0xf4/0x150
> [<862d033f>] kexec_file_add_components+0x124/0x160
> [<0d2717bb>] arch_kexec_kernel_image_load+0x62/0x90
> [<2e0373b6>] kimage_file_alloc_init+0x1aa/0x2e0
> [<60f2d14f>] __do_sys_kexec_file_load+0x17c/0x2c0
> [<8c86fe5a>] __s390x_sys_kexec_file_load+0x40/0x50
> [<1fdb9dac>] __do_syscall+0x1bc/0x1f0
> [<3ee4258d>] system_call+0x78/0xa0
> 
> The ipl report buffer is allocated via vmalloc, while has no chance to free
> if the kexe loading is unloaded. This will cause obvious memory leak
> when kexec/kdump kernel is reloaded, manually, or triggered, e.g by
> memory hotplug.
> 
> Here, add struct kimage_arch to s390 to pass out the ipl report buffer
> address, and introduce arch dependent function
> arch_kimage_file_post_load_cleanup() to free it when unloaded.
> 
> Signed-off-by: Baoquan He 

other than a missing

Fixes: 99feaa717e55 ("s390/kexec_file: Create ipl report and pass to
next kernel")

the patch looks good to me.

Reviewed-by: Philipp Rudo 

> ---
>  arch/s390/include/asm/kexec.h | 7 +++
>  arch/s390/kernel/machine_kexec_file.c | 9 +
>  2 files changed, 16 insertions(+)
> 
> diff --git a/arch/s390/include/asm/kexec.h b/arch/s390/include/asm/kexec.h
> index ea398a05f643..9f16e99fb882 100644
> --- a/arch/s390/include/asm/kexec.h
> +++ b/arch/s390/include/asm/kexec.h
> @@ -74,6 +74,13 @@ void *kexec_file_add_components(struct kimage *image,
>  int arch_kexec_do_relocs(int r_type, void *loc, unsigned long val,
>unsigned long addr);
>  
> +#define ARCH_HAS_KIMAGE_ARCH
> +
> +struct kimage_arch {
> + void *ipl_buf;
> +};
> +
> +
>  extern const struct kexec_file_ops s390_kexec_image_ops;
>  extern const struct kexec_file_ops s390_kexec_elf_ops;
>  
> diff --git a/arch/s390/kernel/machine_kexec_file.c 
> b/arch/s390/kernel/machine_kexec_file.c
> index f9e4baa64b67..584c48b631b2 100644
> --- a/arch/s390/kernel/machine_kexec_file.c
> +++ b/arch/s390/kernel/machine_kexec_file.c
> @@ -202,6 +202,7 @@ static int kexec_file_add_ipl_report(struct kimage *image,
>   buf.buffer = ipl_report_finish(data->report);
>   buf.bufsz = data->report->size;
>   buf.memsz = buf.bufsz;
> + image->arch.ipl_buf = buf.buffer;
>  
>   data->memsz += buf.memsz;
>  
> @@ -321,3 +322,11 @@ int arch_kexec_kernel_image_probe(struct kimage *image, 
> void *buf,
>  
>   return kexec_image_probe_default(image, buf, buf_len);
>  }
> +
> +int arch_kimage_file_post_load_cleanup(struct kimage *image)
> +{
> + kvfree(image->arch.ipl_buf);
> + image->arch.ipl_buf = NULL;
> +
> + return kexec_image_post_load_cleanup_default(image);
> +}


___
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec


[PATCH 2/2] makedumpfile: Fix --dry-run for incomplete dumps

2021-08-12 Thread Philipp Rudo
When running out of space during a dry run, e.g. by limiting the output
size using the -L option, makedumpfile fails with

[...]
Can't write the dump file(vmcore). Size limit(104857600) reached.
get_nr_pages: Can't seek end of the dump file(vmcore).
__read_disk_dump_header: Can't open a file(vmcore). No such file or 
directory
[...]

This is because for --dry-run no dump file is created and a dummy file
descriptor of -1 is used. Thus the file operations in get_nr_pages and
check_and_modify_*_headers fail. Fix the first error by using
write_bytes as file size for the output file and the second one by
always returning TRUE for check_and_modify_headers.

Fixes: 3422e1d ("[PATCH 1/2] Add --dry-run option to prevent writing the 
dumpfile")
Signed-off-by: Philipp Rudo 
---
 makedumpfile.c | 9 -
 1 file changed, 8 insertions(+), 1 deletion(-)

diff --git a/makedumpfile.c b/makedumpfile.c
index 30f9725..c063267 100644
--- a/makedumpfile.c
+++ b/makedumpfile.c
@@ -5113,6 +5113,9 @@ check_and_modify_multiple_kdump_headers() {
 int
 check_and_modify_headers()
 {
+   if (info->flag_dry_run)
+   return TRUE;
+
if (info->flag_elf_dumpfile)
return check_and_modify_elf_headers(info->name_dumpfile);
else
@@ -7996,7 +7999,11 @@ get_nr_pages(void *buf, struct cache_data *cd_page){
int size, file_end, nr_pages;
page_desc_t *pd = buf;
 
-   file_end = lseek(cd_page->fd, 0, SEEK_END);
+   if (info->flag_dry_run)
+   file_end = write_bytes;
+   else
+   file_end = lseek(cd_page->fd, 0, SEEK_END);
+
if (file_end < 0) {
ERRMSG("Can't seek end of the dump file(%s).\n", 
cd_page->file_name);
return -1;
-- 
2.31.1


___
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec


[PATCH 1/2] makedumpfile: Fix bad file descriptor error when using --dry-run

2021-08-12 Thread Philipp Rudo
When running makedumpfile with the --dry-run option it fails with

[...]
write_and_check_space: Can't seek the dump file(vmcore). Bad file 
descriptor
[...]

This is because for --dry-run no dump file is created and a dummy file
descriptor of -1 is used. Thus the lseek in write_and_check_space will
fail. Fix this by treating a dry run as if writing to STDOUT.

Fixes: f0cfa86 ("[PATCH v2 3/3] Add -L option to limit output file size")
Signed-off-by: Philipp Rudo 
---
 makedumpfile.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/makedumpfile.c b/makedumpfile.c
index b1b3b42..30f9725 100644
--- a/makedumpfile.c
+++ b/makedumpfile.c
@@ -4712,7 +4712,7 @@ write_and_check_space(int fd, void *buf, size_t buf_size, 
const char* desc,
int retval = 0;
off_t pos;
 
-   if (fd == STDOUT_FILENO) {
+   if (fd == STDOUT_FILENO || info->flag_dry_run) {
pos = write_bytes;
} else {
pos = lseek(fd, 0, SEEK_CUR);
-- 
2.31.1


___
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec


[PATCH 0/2] makedumpfile: fix two bugs with --dry-run

2021-08-12 Thread Philipp Rudo
Hi,

While playing with the --dry-run option I noticed two bugs. You can find the
fixes below.

Thanks
Philipp

Philipp Rudo (2):
  makedumpfile: Fix bad file descriptor error when using --dry-run
  makedumpfile: Fix --dry-run for incomplete dumps

 makedumpfile.c | 11 +--
 1 file changed, 9 insertions(+), 2 deletions(-)

-- 
2.31.1


___
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec


Re: [PATCH] Only allow to set crash_kexec_post_notifiers on boot time

2020-09-29 Thread Philipp Rudo
Hi,

On Fri, 25 Sep 2020 10:56:25 -0400
Konrad Rzeszutek Wilk  wrote:

> On Fri, Sep 25, 2020 at 11:05:58AM +0800, Dave Young wrote:
> > Hi,
> > 
> > On 09/24/20 at 01:16pm, boris.ostrov...@oracle.com wrote:  
> > > 
> > > On 9/24/20 12:43 PM, Michael Kelley wrote:  
> > > > From: Eric W. Biederman  Sent: Thursday, 
> > > > September 24, 2020 9:26 AM  
> > > >> Michael Kelley  writes:
> > > >>  
> > > > Added Hyper-V people and people who created the param, it is below
> > > > commit, I also want to remove it if possible, let's see how people
> > > > think, but the least way should be to disable the auto setting in 
> > > > both systemd
> > > > and kernel:  
> > > >>> Hyper-V uses a notifier to inform the host system that a Linux VM has
> > > >>> panic'ed.  Informing the host is particularly important in a public 
> > > >>> cloud
> > > >>> such as Azure so that the cloud software can alert the customer, and 
> > > >>> can
> > > >>> track cloud-wide reliability statistics.   Whether a kdump is taken 
> > > >>> is controlled
> > > >>> entirely by the customer and how he configures the VM, and we want
> > > >>> the host to be informed either way.  
> > > >> Why?
> > > >>
> > > >> Why does the host care?
> > > >> Especially if the VM continues executing into a kdump kernel?  
> > > > The host itself doesn't care.  But the host is a convenient out-of-band
> > > > channel for recording that a panic has occurred and to collect basic 
> > > > data
> > > > about the panic.  This out-of-band channel is then used to notify the 
> > > > end
> > > > customer that his VM has panic'ed.  Sure, the customer should be running
> > > > his own monitoring software, but customers don't always do what they
> > > > should.  Equally important, the out-of-band channel allows the cloud
> > > > infrastructure software to notice trends, such as that the rate of Linux
> > > > panics has increased, and that perhaps there is a cloud problem that
> > > > should be investigated.  
> > > 
> > > 
> > > In many cases (especially in cloud environment) your dump device is 
> > > remote (e.g. iscsi) and kdump sometimes (often?) gets stuck because of 
> > > connectivity issues (which could be cause of the panic in the first 
> > > place). So it is quite desirable to inform the infrastructure that the VM 
> > > is on its way out without waiting for kdump to complete.  
> > 
> > That can probably be done in kdump kernel if it is really needed.  Say
> > informing host that panic happened and a kdump kernel is runnning.  
> 
> If kdump kernel gets to that point. Sometimes (sadly) it ends up being
> misconfigured and it chokes up - and hence having multiple ways to emit
> the crash information before running kdump kernel is a life-saver.
> 
> > 
> > But I think to set crash_kexec_post_notifiers by default is still bad.   
> 
> Because of the way it is run today I presume? If there was some
> safe/unsafe policy that should work right? I would think that the
> safe ones that work properly all the time are:
> 
>  - HyperV CRASH_MSRs,
>  - KVM PVPANIC_[PANIC,CRASHLOAD] push button knob,
>  - pstore EFI variables
>  - Dumping in memory,
> 
> And then some that depend on firmware version (aka BIOS, and vendor) are:
>  - ACPI ERST,
> 
> And then the unsafe:
>  - s390, PowerPC (I don't actually know what they are but that
> was Dave's primary motivator).

that won't work on s390. Let me emphasize that the problems on s390 are not the
notifiers themselves but the fact that they are called before crash_kexec.

On s390 we have multiple dump methods besides kdump. We use a panic notifier to
trigger these dump methods from the panicking kernel. The problem is that these
dump methods are less powerful than kdump so we only want to use them as
fallback, i.e. only use them when either kdump wasn't configured or loading of
the crash kernel failed for whatever reason. That's why (plus historic reasons)
our notifier stops the machine when it is called and none of the methods is
configured. Which means that the second crash_kexec is never reached.

Long story short, the problem on s390 is caused by the two hunks in
kernel/panic.c:panic from f06e5153f4ae ("kernel/panic.c: add
"crash_kexec_post_notifiers" option for kdump after panic_notifers").

Besides the problems on s390 I support Dave and think that setting
crash_kexec_post_notifiers by default is wrong. We should keep in mind that
we are in a panic situation. This means that the kernel is in a state where it
doesn't trust itself anymore. So we should keep the code that is run to the
bare minimum as we cannot rely on it to work properly.

Thanks
Philipp

> 
> >   
> > > 
> > >   
> > > >  
> > > >> Further like I have mentioned everytime something like this has come up
> > > >> a call on the kexec on panic code path should be a direct call (That 
> > > >> can
> > > >> be audited) not something hidden in a notifier call chain (which can 
> > > >> not).
> > > >>  
> > > 
> > > We btw already 

Re: [PATCH] Only allow to set crash_kexec_post_notifiers on boot time

2020-09-22 Thread Philipp Rudo
Hi Konrad,


On Mon, 21 Sep 2020 16:18:12 -0400
Konrad Rzeszutek Wilk  wrote:

> On Fri, Sep 18, 2020 at 05:47:43PM -0700, Andrew Morton wrote:
> > On Fri, 18 Sep 2020 11:25:46 +0800 Dave Young  wrote:
> >   
> > > crash_kexec_post_notifiers enables running various panic notifier
> > > before kdump kernel booting. This increases risks of kdump failure.
> > > It is well documented in kernel-parameters.txt. We do not suggest
> > > people to enable it together with kdump unless he/she is really sure.
> > > This is also not suggested to be enabled by default when users are
> > > not aware in distributions.
> > > 
> > > But unfortunately it is enabled by default in systemd, see below
> > > discussions in a systemd report, we can not convince systemd to change
> > > it:
> > > https://github.com/systemd/systemd/issues/16661
> > > 
> > > Actually we have got reports about kdump kernel hangs in both s390x
> > > and powerpcle cases caused by the systemd change,  also some x86 cases
> > > could also be caused by the same (although that is in Hyper-V code
> > > instead of systemd, that need to be addressed separately).  
> 
> Perhaps it may be better to fix the issus on s390x and PowerPC as well?

There's little s390 can fix. We use the panic_notifier_list to start
other dumpers in case kdump isn't configured or failed. This behavior was
introduced in 2006 long before crash_kexec_post_notifiers were introduced. So I
suggest that crash_kexec_post_notifiers are fixed instead.

> > > 
> > > Thus to avoid the auto enablement here just disable the param writable
> > > permission in sysfs.
> > >   
> > 
> > Well.  I don't think this is at all a desirable way of resolving a
> > disagreement with the systemd developers
> > 
> > At the above github address I'm seeing "ryncsn added a commit to
> > ryncsn/systemd that referenced this issue 9 days ago", "pstore: don't
> > enable crash_kexec_post_notifiers by default".  So didn't that address
> > the issue?  
> 
> It does in systemd, but there is a strong interest in making this on by 
> default.

AFAIK pstore requires UEFI to work. So what's the point to enable it on non-UEFI
systems?

Thanks
Philipp

___
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec


Re: [PATCH V40 08/29] kexec_file: split KEXEC_VERIFY_SIG into KEXEC_SIG and KEXEC_SIG_FORCE

2019-08-30 Thread Philipp Rudo
Hi Matthew,

found a typo ...

On Mon, 19 Aug 2019 17:17:44 -0700
Matthew Garrett  wrote:

[...]

> index 6d0635ceddd0..9b4f37a4edf1 100644
> --- a/arch/s390/kernel/kexec_elf.c
> +++ b/arch/s390/kernel/kexec_elf.c
> @@ -130,7 +130,7 @@ static int s390_elf_probe(const char *buf, unsigned long 
> len)
>  const struct kexec_file_ops s390_kexec_elf_ops = {
>   .probe = s390_elf_probe,
>   .load = s390_elf_load,
> -#ifdef CONFIG_KEXEC_VERIFY_SIG
> +#ifdef CONFIG_KEXEC__SIG
  ^^
... here.
>   .verify_sig = s390_verify_sig,
> -#endif /* CONFIG_KEXEC_VERIFY_SIG */
> +#endif /* CONFIG_KEXEC_SIG */
>  };

Thanks
Philipp



Re: [PATCH RFC] generic ELF support for kexec

2019-07-01 Thread Philipp Rudo
Hi Sven,

On Tue, 25 Jun 2019 20:54:34 +0200
Sven Schnelle  wrote:

> Hi List,
> 
> i recently started working on kexec for PA-RISC. While doing so, i figured
> that powerpc already has support for reading ELF images inside of the Kernel.
> My first attempt was to steal the source code and modify it for PA-RISC, but
> it turned out that i didn't had to change much. Only ARM specific stuff like
> fdt blob fetching had to be removed.
> 
> So instead of duplicating the code, i thought about moving the ELF stuff to
> the core kexec code, and exposing several function to use that code from the
> arch specific code.

That's always the right approach. Well done.

> I'm attaching the patch to this Mail. What do you think about that change?
> s390 also uses ELF files, and (maybe?) could also switch to this 
> implementation.
> But i don't know anything about S/390 and don't have one in my basement. So
> i'll leave s390 to the IBM folks.

I'm afraid there isn't much code here s390 can reuse. I see multiple problems
in kexec_elf_load:

* while loading the phdrs we also need to setup some data structures to pass
  to the next kernel
* the s390 kernel needs to be loaded to a fixed address
* there is no support to load a crash kernel

Of course that could all be fixed/worked around by introducing some arch hooks.
But when you take into account that the whole elf loader on s390 is ~100 lines
of code, I don't think it is worth it.

More comments below.
 
[...]

> diff --git a/include/linux/kexec.h b/include/linux/kexec.h
> index b9b1bc5f9669..49b23b425f84 100644
> --- a/include/linux/kexec.h
> +++ b/include/linux/kexec.h
> @@ -216,6 +216,41 @@ extern int crash_prepare_elf64_headers(struct crash_mem 
> *mem, int kernel_map,
>  void **addr, unsigned long *sz);
>  #endif /* CONFIG_KEXEC_FILE */
>  
> +#ifdef CONFIG_KEXEC_FILE_ELF
> +
> +struct kexec_elf_info {
> + /*
> +  * Where the ELF binary contents are kept.
> +  * Memory managed by the user of the struct.
> +  */
> + const char *buffer;
> +
> + const struct elfhdr *ehdr;
> + const struct elf_phdr *proghdrs;
> + struct elf_shdr *sechdrs;
> +};

Do i understand this right? elf_info->buffer contains the full elf file and
elf_info->ehdr is a (endianness translated) copy of the files ehdr?

If so ...

> +void kexec_free_elf_info(struct kexec_elf_info *elf_info);
> +
> +int kexec_build_elf_info(const char *buf, size_t len, struct elfhdr *ehdr,
> +   struct kexec_elf_info *elf_info);
> +
> +int kexec_elf_kernel_load(struct kimage *image, struct kexec_buf *kbuf,
> +   char *kernel_buf, unsigned long kernel_len,
> +   unsigned long *kernel_load_addr);
> +
> +int kexec_elf_probe(const char *buf, unsigned long len);
> +
> +int kexec_elf_load(struct kimage *image, struct elfhdr *ehdr,
> +  struct kexec_elf_info *elf_info,
> +  struct kexec_buf *kbuf,
> +  unsigned long *lowest_load_addr);
> +
> +int kexec_elf_load(struct kimage *image, struct elfhdr *ehdr,
> +  struct kexec_elf_info *elf_info,
> +  struct kexec_buf *kbuf,
> +  unsigned long *lowest_load_addr);

... you could simplify the arguments by dropping the ehdr argument. The
elf_info should contain all the information needed. Furthermore the kexec_buf
also contains a pointer to its kimage. So you can drop that argument as well.

An other thing is that you kzalloc the memory needed for proghdrs and sechdrs
but expect the user of those functions to provide the memory needed for ehdr.
Wouldn't it be more consistent to also kzalloc the ehdr?

[...]

> diff --git a/kernel/kexec_file_elf.c b/kernel/kexec_file_elf.c
> new file mode 100644
> index ..bb966c93492c
> --- /dev/null
> +++ b/kernel/kexec_file_elf.c
> @@ -0,0 +1,574 @@

[...]

> +static uint64_t elf64_to_cpu(const struct elfhdr *ehdr, uint64_t value)
> +{
> + if (ehdr->e_ident[EI_DATA] == ELFDATA2LSB)
> + value = le64_to_cpu(value);
> + else if (ehdr->e_ident[EI_DATA] == ELFDATA2MSB)
> + value = be64_to_cpu(value);
> +
> + return value;
> +}
> +
> +static uint16_t elf16_to_cpu(const struct elfhdr *ehdr, uint16_t value)
> +{
> + if (ehdr->e_ident[EI_DATA] == ELFDATA2LSB)
> + value = le16_to_cpu(value);
> + else if (ehdr->e_ident[EI_DATA] == ELFDATA2MSB)
> + value = be16_to_cpu(value);
> +
> + return value;
> +}
> +
> +static uint32_t elf32_to_cpu(const struct elfhdr *ehdr, uint32_t value)
> +{
> + if (ehdr->e_ident[EI_DATA] == ELFDATA2LSB)
> + value = le32_to_cpu(value);
> + else if (ehdr->e_ident[EI_DATA] == ELFDATA2MSB)
> + value = be32_to_cpu(value);
> +
> + return value;
> +}

What are the elf*_to_cpu good for? In general I'd assume that kexec loads a
kernel for the same architecture it is running on. So 

Re: [PATCH v13 03/16] s390, kexec_file: drop arch_kexec_mem_walk()

2018-08-01 Thread Philipp Rudo
Hey Akashi,

I kept thinking about this patch and remembered why I didn't made the change
you are suggesting now.

The problem is when you only check for kbuf->mem you are excluding address 0,
which might be a valid address to load the kernel to. On s390 this is actually
done when the kernel is not loaded via a boot loader. For kexec_file however,
we cut off the first few kB of the image and jump directly to 'startup'. So
checking for !0 does not cause a problem here.

Anyway, the long term safer solution would be something like

#define KEXEC_BUF_MEM_UNKNOWN (-1UL)

for architectures to tell common code to search a fitting mem hole.

Back then I didn't do the change because I had the other workaround, which
didn't require a common code change. But when you are touching the code now it
is worth thinking about it.

Just wanted to let you know
Philipp


On Wed,  1 Aug 2018 16:58:07 +0900
AKASHI Takahiro  wrote:

> Since s390 already knows where to locate buffers, calling
> arch_kexec_mem_walk() has no sense. So we can just drop it as kbuf->mem
> indicates this while all other architectures sets it to 0 initially.
> 
> This change is a preparatory work for the next patch, where all the
> variant memory walks, either on system resource or memblock, will be
> put in one common place so that it will satisfy all the architectures'
> need.
> 
> Signed-off-by: AKASHI Takahiro 
> Reviewed-by: Philipp Rudo 
> Cc: Martin Schwidefsky 
> Cc: Heiko Carstens 
> Cc: Dave Young 
> Cc: Vivek Goyal 
> Cc: Baoquan He 
> ---
>  arch/s390/kernel/machine_kexec_file.c | 10 --
>  kernel/kexec_file.c   |  4 
>  2 files changed, 4 insertions(+), 10 deletions(-)
> 
> diff --git a/arch/s390/kernel/machine_kexec_file.c 
> b/arch/s390/kernel/machine_kexec_file.c
> index f413f57f8d20..32023b4f9dc0 100644
> --- a/arch/s390/kernel/machine_kexec_file.c
> +++ b/arch/s390/kernel/machine_kexec_file.c
> @@ -134,16 +134,6 @@ int kexec_file_add_initrd(struct kimage *image, struct 
> s390_load_data *data,
>   return ret;
>  }
>  
> -/*
> - * The kernel is loaded to a fixed location. Turn off kexec_locate_mem_hole
> - * and provide kbuf->mem by hand.
> - */
> -int arch_kexec_walk_mem(struct kexec_buf *kbuf,
> - int (*func)(struct resource *, void *))
> -{
> - return 1;
> -}
> -
>  int arch_kexec_apply_relocations_add(struct purgatory_info *pi,
>Elf_Shdr *section,
>const Elf_Shdr *relsec,
> diff --git a/kernel/kexec_file.c b/kernel/kexec_file.c
> index 63c7ce1c0c3e..bf39df5e5bb9 100644
> --- a/kernel/kexec_file.c
> +++ b/kernel/kexec_file.c
> @@ -534,6 +534,10 @@ int kexec_locate_mem_hole(struct kexec_buf *kbuf)
>  {
>   int ret;
>  
> + /* Arch knows where to place */
> + if (kbuf->mem)
> + return 0;
> +
>   ret = arch_kexec_walk_mem(kbuf, locate_mem_hole_callback);
>  
>   return ret == 1 ? 0 : -EADDRNOTAVAIL;


___
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec


Re: [PATCH v12 03/16] s390, kexec_file: drop arch_kexec_mem_walk()

2018-07-24 Thread Philipp Rudo
Hi AKASHI,

the patch looks good to me.

Reviewed-by: Philipp Rudo 

Thanks
Philipp


On Tue, 24 Jul 2018 15:57:46 +0900
AKASHI Takahiro  wrote:

> Since s390 already knows where to locate buffers, calling
> arch_kexec_mem_walk() has no sense. So we can just drop it as kbuf->mem
> indicates this while all other architectures sets it to 0 initially.
> 
> This change is a preparatory work for the next patch, where all the
> variant memory walks, either on system resource or memblock, will be
> put in one common place so that it will satisfy all the architectures'
> need.
> 
> Signed-off-by: AKASHI Takahiro 
> Cc: Martin Schwidefsky 
> Cc: Heiko Carstens 
> Cc: Dave Young 
> Cc: Vivek Goyal 
> Cc: Baoquan He 
> ---
>  arch/s390/kernel/machine_kexec_file.c | 10 --
>  kernel/kexec_file.c   |  4 
>  2 files changed, 4 insertions(+), 10 deletions(-)
> 
> diff --git a/arch/s390/kernel/machine_kexec_file.c 
> b/arch/s390/kernel/machine_kexec_file.c
> index f413f57f8d20..32023b4f9dc0 100644
> --- a/arch/s390/kernel/machine_kexec_file.c
> +++ b/arch/s390/kernel/machine_kexec_file.c
> @@ -134,16 +134,6 @@ int kexec_file_add_initrd(struct kimage *image, struct 
> s390_load_data *data,
>   return ret;
>  }
>  
> -/*
> - * The kernel is loaded to a fixed location. Turn off kexec_locate_mem_hole
> - * and provide kbuf->mem by hand.
> - */
> -int arch_kexec_walk_mem(struct kexec_buf *kbuf,
> - int (*func)(struct resource *, void *))
> -{
> - return 1;
> -}
> -
>  int arch_kexec_apply_relocations_add(struct purgatory_info *pi,
>Elf_Shdr *section,
>const Elf_Shdr *relsec,
> diff --git a/kernel/kexec_file.c b/kernel/kexec_file.c
> index 63c7ce1c0c3e..bf39df5e5bb9 100644
> --- a/kernel/kexec_file.c
> +++ b/kernel/kexec_file.c
> @@ -534,6 +534,10 @@ int kexec_locate_mem_hole(struct kexec_buf *kbuf)
>  {
>   int ret;
>  
> + /* Arch knows where to place */
> + if (kbuf->mem)
> + return 0;
> +
>   ret = arch_kexec_walk_mem(kbuf, locate_mem_hole_callback);
>  
>   return ret == 1 ? 0 : -EADDRNOTAVAIL;


___
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec


Re: [PATCH 0/1] x86/purgatory: Fix Makefile bug

2018-07-05 Thread Philipp Rudo
Hi Dave,

On Thu, 5 Jul 2018 14:46:20 +0800
Dave Young  wrote:

> On 07/04/18 at 01:00pm, Philipp Rudo wrote:
> > Hi everybody,
> > 
> > when i moved the purgatories sha256 implementation to common code, i forgot
> > to add FORCE to the new Makefile target. This patch fixes it.  
> 
> Hi Philipp,
> 
> Do you have the exact steps of how to reproduce your problem?

I only tested it on s390, but on x86 it should work the same way.

- Build the kernel without the fix
- Add some flag to the purgatories KBUILD_CFLAGS,I used
  -fno-asynchronous-unwind-tables
- Re-build the kernel

When you look at make's output you see that sha256.o is not re-build in the
last step. Also readelf -S still shows the .eh_frame section for sha256.o.

With the fix sha256.o is re-build in the last step.

Hope that helps.

Thanks
Philipp

> > 
> > Note: There is an equivalent bug in the s390 purgatory. The fix for that
> > will go upstream via Martin.
> > 
> > Thanks
> > Philipp
> > 
> > Philipp Rudo (1):
> >   x86/purgatory: Add missing FORCE to Makefile target
> > 
> >  arch/x86/purgatory/Makefile | 2 +-
> >  1 file changed, 1 insertion(+), 1 deletion(-)
> > 
> > -- 
> > 2.16.4
> >   
> 
> Thanks
> Dave
> 


___
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec


[PATCH 0/1] x86/purgatory: Fix Makefile bug

2018-07-04 Thread Philipp Rudo
Hi everybody,

when i moved the purgatories sha256 implementation to common code, i forgot
to add FORCE to the new Makefile target. This patch fixes it.

Note: There is an equivalent bug in the s390 purgatory. The fix for that
will go upstream via Martin.

Thanks
Philipp

Philipp Rudo (1):
  x86/purgatory: Add missing FORCE to Makefile target

 arch/x86/purgatory/Makefile | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

-- 
2.16.4


___
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec


[PATCH 1/1] x86/purgatory: Add missing FORCE to Makefile target

2018-07-04 Thread Philipp Rudo
Without FORCE make does not detect changes only made to the command line
options. So object files might not be re-built even when they should be.
Fix this by adding FORCE where it is missing.

Fixes: df6f2801f511 ("kernel/kexec_file.c: move purgatories sha256 to common 
code")
Cc:  # 4.17
Signed-off-by: Philipp Rudo 
---
 arch/x86/purgatory/Makefile | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/x86/purgatory/Makefile b/arch/x86/purgatory/Makefile
index 2e9ee023e6bc..81a8e33115ad 100644
--- a/arch/x86/purgatory/Makefile
+++ b/arch/x86/purgatory/Makefile
@@ -6,7 +6,7 @@ purgatory-y := purgatory.o stack.o setup-x86_$(BITS).o sha256.o 
entry64.o string
 targets += $(purgatory-y)
 PURGATORY_OBJS = $(addprefix $(obj)/,$(purgatory-y))
 
-$(obj)/sha256.o: $(srctree)/lib/sha256.c
+$(obj)/sha256.o: $(srctree)/lib/sha256.c FORCE
$(call if_changed_rule,cc_o_c)
 
 LDFLAGS_purgatory.ro := -e purgatory_start -r --no-undefined -nostdlib -z 
nodefaultlib
-- 
2.16.4


___
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec


Re: [PATCH] kexec/s390: Add support for kexec_file_load

2018-05-28 Thread Philipp Rudo
On Tue, 22 May 2018 10:11:04 +0200
Simon Horman <ho...@verge.net.au> wrote:

> On Mon, May 21, 2018 at 02:24:40PM +0800, Dave Young wrote:
> > On 05/16/18 at 02:27pm, Philipp Rudo wrote:  
> > > Since kernel 4.17-rc2 s390 supports the kexec_file_load system call. Add
> > > the new system call to kexec-tools and provide the -s 
> > > (--kexec-file-syscall)
> > > option for s390 to support this new feature.
> > > 
> > > Signed-off-by: Philipp Rudo <pr...@linux.ibm.com>  
> 
> ...
> 
> > Acked-by: Dave Young <dyo...@redhat.com>  
> 
> Thanks, applied.
> 

Sorry for the late answer, I had the last week off.
Nevertheless thanks to both of you.

Philipp


___
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec


Re: [PATCH v1 0/2] kexec: Remove "weak" annotations from headers

2018-04-13 Thread Philipp Rudo
Hi Bjorn,

in recent patches AKASHI [1] and I [2] made some changes to the declarations
you are touching and already removed some of the weak statements. The patches
got accepted on linux-next and will (hopefully) be pulled for v4.17. So you
should prepare for some merge conflicts. Nevertheless three weak statements
still remain (arch_kexec_walk_mem & arch_kexec_apply_relocations*) so your
patch still makes totally sense.

Thanks
Philipp

[1] https://lkml.org/lkml/2018/3/6/201
[2] https://lkml.org/lkml/2018/3/21/278

On Thu, 12 Apr 2018 13:23:29 -0500
Bjorn Helgaas  wrote:

> "Weak" annotations in header files are error-prone because they make
> every definition weak.  Remove them from include/linux/kexec.h.
> 
> These were introduced in two separate commits, so this is in two
> patches so they can be easily backported to stable kernels (some of
> them date back to v4.3 and one only goes back to v4.10).
> 
> ---
> 
> Bjorn Helgaas (2):
>   kexec: Remove "weak" from kexec_file function declarations
>   kexec: Remove "weak" from arch_kexec_walk_mem() declaration
> 
> 
>  include/linux/kexec.h |   24 
>  1 file changed, 12 insertions(+), 12 deletions(-)
> 
> ___
> kexec mailing list
> kexec@lists.infradead.org
> http://lists.infradead.org/mailman/listinfo/kexec
> 


___
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec


Re: [PATCH v2 00/11] kexec_file: Clean up purgatory load

2018-03-23 Thread Philipp Rudo
Hi Andrew,
Hi Dave,

i checked out linux-next with the patches and it looks good to me. Also made a
quick test and everything works fine.

Thanks very much to both of you
Philipp


On Wed, 21 Mar 2018 16:00:42 -0700
Andrew Morton <a...@linux-foundation.org> wrote:

> On Wed, 21 Mar 2018 12:27:40 +0100 Philipp Rudo <pr...@linux.vnet.ibm.com> 
> wrote:
> 
> > here is the updated patch set including Dave's comments.  
> 
> There were some syntactic clashes with the "kexec_file, x86, powerpc:
> refactoring for other architecutres" series.  Please check (in
> linux-next) that I got it all correct and that both patch series are
> functioning correctly.
> 
> I expect to send all this into Stephen for -next tomorrow.
> 


___
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec


[PATCH v2 11/11] kexec_file: Move purgatories sha256 to common code

2018-03-21 Thread Philipp Rudo
The code to verify the new kernels sha digest is applicable for all
architectures. Move it to common code.

One problem is the string.c implementation on x86. Currently sha256
includes x86/boot/string.h which defines memcpy and memset to be gcc
builtins. By moving the sha256 implementation to common code and changing
the include to linux/string.h both functions are no longer defined. Thus
definitions have to be provided in x86/purgatory/string.c

Signed-off-by: Philipp Rudo <pr...@linux.vnet.ibm.com>
---
 arch/x86/purgatory/Makefile|  3 +++
 arch/x86/purgatory/purgatory.c |  2 +-
 arch/x86/purgatory/string.c| 12 
 {arch/x86/purgatory => include/linux}/sha256.h | 11 ++-
 {arch/x86/purgatory => lib}/sha256.c   |  4 ++--
 5 files changed, 28 insertions(+), 4 deletions(-)
 rename {arch/x86/purgatory => include/linux}/sha256.h (62%)
 rename {arch/x86/purgatory => lib}/sha256.c (99%)

diff --git a/arch/x86/purgatory/Makefile b/arch/x86/purgatory/Makefile
index 2f15a2ac4209..414eed6b5065 100644
--- a/arch/x86/purgatory/Makefile
+++ b/arch/x86/purgatory/Makefile
@@ -6,6 +6,9 @@ purgatory-y := purgatory.o stack.o setup-x86_$(BITS).o sha256.o 
entry64.o string
 targets += $(purgatory-y)
 PURGATORY_OBJS = $(addprefix $(obj)/,$(purgatory-y))
 
+$(obj)/sha256.o: $(srctree)/lib/sha256.c
+   $(call if_changed_rule,cc_o_c)
+
 LDFLAGS_purgatory.ro := -e purgatory_start -r --no-undefined -nostdlib -z 
nodefaultlib
 targets += purgatory.ro
 
diff --git a/arch/x86/purgatory/purgatory.c b/arch/x86/purgatory/purgatory.c
index 470edad96bb9..025c34ac0d84 100644
--- a/arch/x86/purgatory/purgatory.c
+++ b/arch/x86/purgatory/purgatory.c
@@ -11,9 +11,9 @@
  */
 
 #include 
+#include 
 #include 
 
-#include "sha256.h"
 #include "../boot/string.h"
 
 unsigned long purgatory_backup_dest __section(.kexec-purgatory);
diff --git a/arch/x86/purgatory/string.c b/arch/x86/purgatory/string.c
index d886b1fa36f0..795ca4f2cb3c 100644
--- a/arch/x86/purgatory/string.c
+++ b/arch/x86/purgatory/string.c
@@ -10,4 +10,16 @@
  * Version 2.  See the file COPYING for more details.
  */
 
+#include 
+
 #include "../boot/string.c"
+
+void *memcpy(void *dst, const void *src, size_t len)
+{
+   return __builtin_memcpy(dst, src, len);
+}
+
+void *memset(void *dst, int c, size_t len)
+{
+   return __builtin_memset(dst, c, len);
+}
diff --git a/arch/x86/purgatory/sha256.h b/include/linux/sha256.h
similarity index 62%
rename from arch/x86/purgatory/sha256.h
rename to include/linux/sha256.h
index 2867d9825a57..244fe01a65fb 100644
--- a/arch/x86/purgatory/sha256.h
+++ b/include/linux/sha256.h
@@ -13,9 +13,18 @@
 #include 
 #include 
 
+/*
+ * Stand-alone implementation of the SHA256 algorithm. It is designed to
+ * have as little dependencies as possible so it can be used in the
+ * kexec_file purgatory. In other cases you should use the implementation in
+ * crypto/.
+ *
+ * For details see lib/sha256.c
+ */
+
 extern int sha256_init(struct sha256_state *sctx);
 extern int sha256_update(struct sha256_state *sctx, const u8 *input,
-   unsigned int length);
+unsigned int length);
 extern int sha256_final(struct sha256_state *sctx, u8 *hash);
 
 #endif /* SHA256_H */
diff --git a/arch/x86/purgatory/sha256.c b/lib/sha256.c
similarity index 99%
rename from arch/x86/purgatory/sha256.c
rename to lib/sha256.c
index 548ca675a14a..4400c832e2aa 100644
--- a/arch/x86/purgatory/sha256.c
+++ b/lib/sha256.c
@@ -16,9 +16,9 @@
  */
 
 #include 
+#include 
+#include 
 #include 
-#include "sha256.h"
-#include "../boot/string.h"
 
 static inline u32 Ch(u32 x, u32 y, u32 z)
 {
-- 
2.13.5


___
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec


[PATCH v2 10/11] kexec_file: Allow archs to set purgatory load address

2018-03-21 Thread Philipp Rudo
For s390 new kernels are loaded to fixed addresses in memory before they
are booted. With the current code this is a problem as it assumes the
kernel will be loaded to an 'arbitrary' address. In particular,
kexec_locate_mem_hole searches for a large enough memory region and sets
the load address (kexec_bufer->mem) to it.

Luckily there is a simple workaround for this problem. By returning 1 in
arch_kexec_walk_mem, kexec_locate_mem_hole is turned off. This allows the
architecture to set kbuf->mem by hand. While the trick works fine for the
kernel it does not for the purgatory as here the architectures don't have
access to its kexec_buffer.

Give architectures access to the purgatories kexec_buffer by changing
kexec_load_purgatory to take a pointer to it. With this change
architectures have access to the buffer and can edit it as they need.

A nice side effect of this change is that we can get rid of the
purgatory_info->purgatory_load_address field. As now the information stored
there can directly be accessed from kbuf->mem.

Signed-off-by: Philipp Rudo <pr...@linux.vnet.ibm.com>
Reviewed-by: Martin Schwidefsky <schwidef...@de.ibm.com>
---
 arch/powerpc/kernel/kexec_elf_64.c |  9 +
 arch/x86/kernel/kexec-bzimage64.c  |  8 
 include/linux/kexec.h  | 17 ++---
 kernel/kexec_file.c| 29 -
 4 files changed, 31 insertions(+), 32 deletions(-)

diff --git a/arch/powerpc/kernel/kexec_elf_64.c 
b/arch/powerpc/kernel/kexec_elf_64.c
index 9a42309b091a..82448c03502d 100644
--- a/arch/powerpc/kernel/kexec_elf_64.c
+++ b/arch/powerpc/kernel/kexec_elf_64.c
@@ -572,7 +572,7 @@ static void *elf64_load(struct kimage *image, char 
*kernel_buf,
 {
int ret;
unsigned int fdt_size;
-   unsigned long kernel_load_addr, purgatory_load_addr;
+   unsigned long kernel_load_addr;
unsigned long initrd_load_addr = 0, fdt_load_addr;
void *fdt;
const void *slave_code;
@@ -580,6 +580,8 @@ static void *elf64_load(struct kimage *image, char 
*kernel_buf,
struct elf_info elf_info;
struct kexec_buf kbuf = { .image = image, .buf_min = 0,
  .buf_max = ppc64_rma_size };
+   struct kexec_buf pbuf = { .image = image, .buf_min = 0,
+ .buf_max = ppc64_rma_size, .top_down = true };
 
ret = build_elf_exec_info(kernel_buf, kernel_len, , _info);
if (ret)
@@ -591,14 +593,13 @@ static void *elf64_load(struct kimage *image, char 
*kernel_buf,
 
pr_debug("Loaded the kernel at 0x%lx\n", kernel_load_addr);
 
-   ret = kexec_load_purgatory(image, 0, ppc64_rma_size, true,
-  _load_addr);
+   ret = kexec_load_purgatory(image, );
if (ret) {
pr_err("Loading purgatory failed.\n");
goto out;
}
 
-   pr_debug("Loaded purgatory at 0x%lx\n", purgatory_load_addr);
+   pr_debug("Loaded purgatory at 0x%lx\n", pbuf.mem);
 
if (initrd != NULL) {
kbuf.buffer = initrd;
diff --git a/arch/x86/kernel/kexec-bzimage64.c 
b/arch/x86/kernel/kexec-bzimage64.c
index fb095ba0c02f..df183585928f 100644
--- a/arch/x86/kernel/kexec-bzimage64.c
+++ b/arch/x86/kernel/kexec-bzimage64.c
@@ -334,7 +334,6 @@ static void *bzImage64_load(struct kimage *image, char 
*kernel,
unsigned long setup_header_size, params_cmdline_sz;
struct boot_params *params;
unsigned long bootparam_load_addr, kernel_load_addr, initrd_load_addr;
-   unsigned long purgatory_load_addr;
struct bzimage64_data *ldata;
struct kexec_entry64_regs regs64;
void *stack;
@@ -342,6 +341,8 @@ static void *bzImage64_load(struct kimage *image, char 
*kernel,
unsigned int efi_map_offset, efi_map_sz, efi_setup_data_offset;
struct kexec_buf kbuf = { .image = image, .buf_max = ULONG_MAX,
  .top_down = true };
+   struct kexec_buf pbuf = { .image = image, .buf_min = MIN_PURGATORY_ADDR,
+ .buf_max = ULONG_MAX, .top_down = true };
 
header = (struct setup_header *)(kernel + setup_hdr_offset);
setup_sects = header->setup_sects;
@@ -379,14 +380,13 @@ static void *bzImage64_load(struct kimage *image, char 
*kernel,
 * Load purgatory. For 64bit entry point, purgatory  code can be
 * anywhere.
 */
-   ret = kexec_load_purgatory(image, MIN_PURGATORY_ADDR, ULONG_MAX, 1,
-  _load_addr);
+   ret = kexec_load_purgatory(image, );
if (ret) {
pr_err("Loading purgatory failed\n");
return ERR_PTR(ret);
}
 
-   pr_debug("Loaded purgatory at 0x%lx\n", purgatory_load_addr);
+   pr_debug("Loaded purgatory at 0x%lx\n", pbuf.mem);
 
 
/*
d

[PATCH v2 05/11] kexec_file: Use read-only sections in arch_kexec_apply_relocations*

2018-03-21 Thread Philipp Rudo
When the relocations are applied to the purgatory only the section the
relocations are applied to is writable. The other sections, i.e. the symtab
and .rel/.rela, are in read-only kexec_purgatory. Highlight this by marking
the corresponding variables as 'const'.

While at it also change the signatures of arch_kexec_apply_relocations* to
take section pointers instead of just the index of the relocation section.
This removes the second lookup and sanity check of the sections in arch
code.

Signed-off-by: Philipp Rudo <pr...@linux.vnet.ibm.com>
---
 arch/x86/kernel/machine_kexec_64.c | 56 -
 include/linux/kexec.h  | 13 +---
 kernel/kexec_file.c| 63 +-
 3 files changed, 71 insertions(+), 61 deletions(-)

diff --git a/arch/x86/kernel/machine_kexec_64.c 
b/arch/x86/kernel/machine_kexec_64.c
index 3b7427aa7d85..51667c8b5c9b 100644
--- a/arch/x86/kernel/machine_kexec_64.c
+++ b/arch/x86/kernel/machine_kexec_64.c
@@ -422,52 +422,36 @@ int arch_kexec_kernel_verify_sig(struct kimage *image, 
void *kernel,
 /*
  * Apply purgatory relocations.
  *
- * ehdr: Pointer to elf headers
- * sechdrs: Pointer to section headers.
- * relsec: section index of SHT_RELA section.
+ * @pi:Purgatory to be relocated.
+ * @section:   Section relocations applying to.
+ * @relsec:Section containing RELAs.
+ * @symtabsec: Corresponding symtab.
  *
  * TODO: Some of the code belongs to generic code. Move that in kexec.c.
  */
-int arch_kexec_apply_relocations_add(const Elf64_Ehdr *ehdr,
-Elf64_Shdr *sechdrs, unsigned int relsec)
+int arch_kexec_apply_relocations_add(struct purgatory_info *pi,
+Elf_Shdr *section, const Elf_Shdr *relsec,
+const Elf_Shdr *symtabsec)
 {
unsigned int i;
Elf64_Rela *rel;
Elf64_Sym *sym;
void *location;
-   Elf64_Shdr *section, *symtabsec;
unsigned long address, sec_base, value;
const char *strtab, *name, *shstrtab;
+   const Elf_Shdr *sechdrs;
 
-   /*
-* ->sh_offset has been modified to keep the pointer to section
-* contents in memory
-*/
-   rel = (void *)sechdrs[relsec].sh_offset;
-
-   /* Section to which relocations apply */
-   section = [sechdrs[relsec].sh_info];
-
-   pr_debug("Applying relocate section %u to %u\n", relsec,
-sechdrs[relsec].sh_info);
-
-   /* Associated symbol table */
-   symtabsec = [sechdrs[relsec].sh_link];
-
-   /* String table */
-   if (symtabsec->sh_link >= ehdr->e_shnum) {
-   /* Invalid strtab section number */
-   pr_err("Invalid string table section index %d\n",
-  symtabsec->sh_link);
-   return -ENOEXEC;
-   }
+   /* String & section header string table */
+   sechdrs = (void *)pi->ehdr + pi->ehdr->e_shoff;
+   strtab = (char *)pi->ehdr + sechdrs[symtabsec->sh_link].sh_offset;
+   shstrtab = (char *)pi->ehdr + sechdrs[pi->ehdr->e_shstrndx].sh_offset;
 
-   strtab = (char *)sechdrs[symtabsec->sh_link].sh_offset;
+   rel = (void *)pi->ehdr + relsec->sh_offset;
 
-   /* section header string table */
-   shstrtab = (char *)sechdrs[ehdr->e_shstrndx].sh_offset;
+   pr_debug("Applying relocate section %s to %u\n",
+shstrtab + relsec->sh_name, relsec->sh_info);
 
-   for (i = 0; i < sechdrs[relsec].sh_size / sizeof(*rel); i++) {
+   for (i = 0; i < relsec->sh_size / sizeof(*rel); i++) {
 
/*
 * rel[i].r_offset contains byte offset from beginning
@@ -490,8 +474,8 @@ int arch_kexec_apply_relocations_add(const Elf64_Ehdr *ehdr,
 * to apply. ELF64_R_SYM() and ELF64_R_TYPE() macros get
 * these respectively.
 */
-   sym = (Elf64_Sym *)symtabsec->sh_offset +
-   ELF64_R_SYM(rel[i].r_info);
+   sym = (void *)pi->ehdr + symtabsec->sh_offset;
+   sym += ELF64_R_SYM(rel[i].r_info);
 
if (sym->st_name)
name = strtab + sym->st_name;
@@ -514,12 +498,12 @@ int arch_kexec_apply_relocations_add(const Elf64_Ehdr 
*ehdr,
 
if (sym->st_shndx == SHN_ABS)
sec_base = 0;
-   else if (sym->st_shndx >= ehdr->e_shnum) {
+   else if (sym->st_shndx >= pi->ehdr->e_shnum) {
pr_err("Invalid section %d for symbol %s\n",
   sym->st_shndx, name);
return -ENOEXEC;
} else
-   sec_base = sechdrs[sym->st_shndx].sh_addr;
+   

[PATCH v2 02/11] kexec_file: Remove checks in kexec_purgatory_load

2018-03-21 Thread Philipp Rudo
Before the purgatory is loaded several checks are done whether the ELF file
in kexec_purgatory is valid or not. These checks are incomplete. For
example they don't check for the total size of the sections defined in the
section header table or if the entry point actually points into the
purgatory.

On the other hand the purgatory, although an ELF file on its own, is part
of the kernel. Thus not trusting the purgatory means not trusting the
kernel build itself.

So remove all validity checks on the purgatory and just trust the kernel
build.

Signed-off-by: Philipp Rudo <pr...@linux.vnet.ibm.com>
---
 kernel/kexec_file.c | 14 --
 1 file changed, 14 deletions(-)

diff --git a/kernel/kexec_file.c b/kernel/kexec_file.c
index e5bcd94c1efb..0f044457b40c 100644
--- a/kernel/kexec_file.c
+++ b/kernel/kexec_file.c
@@ -880,22 +880,8 @@ int kexec_load_purgatory(struct kimage *image, unsigned 
long min,
if (kexec_purgatory_size <= 0)
return -EINVAL;
 
-   if (kexec_purgatory_size < sizeof(Elf_Ehdr))
-   return -ENOEXEC;
-
pi->ehdr = (Elf_Ehdr *)kexec_purgatory;
 
-   if (memcmp(pi->ehdr->e_ident, ELFMAG, SELFMAG) != 0
-   || pi->ehdr->e_type != ET_REL
-   || !elf_check_arch(pi->ehdr)
-   || pi->ehdr->e_shentsize != sizeof(Elf_Shdr))
-   return -ENOEXEC;
-
-   if (pi->ehdr->e_shoff >= kexec_purgatory_size
-   || (pi->ehdr->e_shnum * sizeof(Elf_Shdr) >
-   kexec_purgatory_size - pi->ehdr->e_shoff))
-   return -ENOEXEC;
-
ret = __kexec_load_purgatory(image, min, max, top_down);
if (ret)
return ret;
-- 
2.13.5


___
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec


[PATCH v2 08/11] kexec_file: Remove unneeded variables in kexec_purgatory_setup_sechdrs

2018-03-21 Thread Philipp Rudo
The main loop currently uses quite a lot of variables to update the section
headers. Some of them are unnecessary. So clean them up a little.

Signed-off-by: Philipp Rudo <pr...@linux.vnet.ibm.com>
---
 kernel/kexec_file.c | 34 --
 1 file changed, 12 insertions(+), 22 deletions(-)

diff --git a/kernel/kexec_file.c b/kernel/kexec_file.c
index 4938f7aacd09..746b91e46e34 100644
--- a/kernel/kexec_file.c
+++ b/kernel/kexec_file.c
@@ -724,12 +724,8 @@ static int kexec_purgatory_setup_kbuf(struct 
purgatory_info *pi,
 static int kexec_purgatory_setup_sechdrs(struct purgatory_info *pi,
 struct kexec_buf *kbuf)
 {
-   unsigned long curr_load_addr;
-   unsigned long load_addr;
unsigned long bss_addr;
unsigned long offset;
-   unsigned char *buf_addr;
-   unsigned char *src;
Elf_Shdr *sechdrs;
int i;
 
@@ -762,20 +758,18 @@ static int kexec_purgatory_setup_sechdrs(struct 
purgatory_info *pi,
sechdrs[i].sh_offset;
}
 
-   /* Load SHF_ALLOC sections */
-   buf_addr = kbuf->buffer;
-   load_addr = curr_load_addr = kbuf->mem;
-   bss_addr = load_addr + kbuf->bufsz;
+   offset = 0;
+   bss_addr = kbuf->mem + kbuf->bufsz;
kbuf->image->start = pi->ehdr->e_entry;
 
for (i = 0; i < pi->ehdr->e_shnum; i++) {
unsigned long align;
+   void *src, *dst;
 
if (!(sechdrs[i].sh_flags & SHF_ALLOC))
continue;
 
align = sechdrs[i].sh_addralign;
-
if (sechdrs[i].sh_type == SHT_NOBITS) {
bss_addr = ALIGN(bss_addr, align);
sechdrs[i].sh_addr = bss_addr;
@@ -783,31 +777,27 @@ static int kexec_purgatory_setup_sechdrs(struct 
purgatory_info *pi,
continue;
}
 
-   curr_load_addr = ALIGN(curr_load_addr, align);
-   offset = curr_load_addr - load_addr;
-   /* We already modifed ->sh_offset to keep src addr */
-   src = (char *)sechdrs[i].sh_offset;
-   memcpy(buf_addr + offset, src, sechdrs[i].sh_size);
-
+   offset = ALIGN(offset, align);
if (sechdrs[i].sh_flags & SHF_EXECINSTR &&
pi->ehdr->e_entry >= sechdrs[i].sh_addr &&
pi->ehdr->e_entry < (sechdrs[i].sh_addr
 + sechdrs[i].sh_size)) {
kbuf->image->start -= sechdrs[i].sh_addr;
-   kbuf->image->start += curr_load_addr;
+   kbuf->image->start += kbuf->mem + offset;
}
 
-   /* Store load address and source address of section */
-   sechdrs[i].sh_addr = curr_load_addr;
+   src = (void *)sechdrs[i].sh_offset;
+   dst = pi->purgatory_buf + offset;
+   memcpy(dst, src, sechdrs[i].sh_size);
+
+   sechdrs[i].sh_addr = kbuf->mem + offset;
 
/*
 * This section got copied to temporary buffer. Update
 * ->sh_offset accordingly.
 */
-   sechdrs[i].sh_offset = (unsigned long)(buf_addr + offset);
-
-   /* Advance to the next address */
-   curr_load_addr += sechdrs[i].sh_size;
+   sechdrs[i].sh_offset = (unsigned long)dst;
+   offset += sechdrs[i].sh_size;
}
 
return 0;
-- 
2.13.5


___
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec


[PATCH v2 09/11] kexec_file: Remove mis-use of sh_offset field during purgatory load

2018-03-21 Thread Philipp Rudo
The current code uses the sh_offset field in purgatory_info->sechdrs to
store a pointer to the current load address of the section. Depending
whether the section will be loaded or not this is either a pointer into
purgatory_info->purgatory_buf or kexec_purgatory. This is not only a
violation of the ELF standard but also makes the code very hard to
understand as you cannot tell if the memory you are using is read-only or
not.

Remove this mis-use and store the offset of the section in
pugaroty_info->purgatory_buf in sh_offset.

Signed-off-by: Philipp Rudo <pr...@linux.vnet.ibm.com>
---
 arch/x86/kernel/machine_kexec_64.c | 10 ++
 kernel/kexec_file.c| 37 +++--
 2 files changed, 13 insertions(+), 34 deletions(-)

diff --git a/arch/x86/kernel/machine_kexec_64.c 
b/arch/x86/kernel/machine_kexec_64.c
index 51667c8b5c9b..41db74bdc88b 100644
--- a/arch/x86/kernel/machine_kexec_64.c
+++ b/arch/x86/kernel/machine_kexec_64.c
@@ -457,13 +457,15 @@ int arch_kexec_apply_relocations_add(struct 
purgatory_info *pi,
 * rel[i].r_offset contains byte offset from beginning
 * of section to the storage unit affected.
 *
-* This is location to update (->sh_offset). This is temporary
-* buffer where section is currently loaded. This will finally
-* be loaded to a different address later, pointed to by
+* This is location to update. This is temporary buffer
+* where section is currently loaded. This will finally be
+* loaded to a different address later, pointed to by
 * ->sh_addr. kexec takes care of moving it
 *  (kexec_load_segment()).
 */
-   location = (void *)(section->sh_offset + rel[i].r_offset);
+   location = pi->purgatory_buf;
+   location += section->sh_offset;
+   location += rel[i].r_offset;
 
/* Final address of the location */
address = section->sh_addr + rel[i].r_offset;
diff --git a/kernel/kexec_file.c b/kernel/kexec_file.c
index 746b91e46e34..1478766c7754 100644
--- a/kernel/kexec_file.c
+++ b/kernel/kexec_file.c
@@ -729,6 +729,10 @@ static int kexec_purgatory_setup_sechdrs(struct 
purgatory_info *pi,
Elf_Shdr *sechdrs;
int i;
 
+   /*
+* The section headers in kexec_purgatory are read-only. In order to
+* have them modifiable make a temporary copy.
+*/
sechdrs = vzalloc(pi->ehdr->e_shnum * sizeof(Elf_Shdr));
if (!sechdrs)
return -ENOMEM;
@@ -736,28 +740,6 @@ static int kexec_purgatory_setup_sechdrs(struct 
purgatory_info *pi,
   pi->ehdr->e_shnum * sizeof(Elf_Shdr));
pi->sechdrs = sechdrs;
 
-   /*
-* We seem to have multiple copies of sections. First copy is which
-* is embedded in kernel in read only section. Some of these sections
-* will be copied to a temporary buffer and relocated. And these
-* sections will finally be copied to their final destination at
-* segment load time.
-*
-* Use ->sh_offset to reflect section address in memory. It will
-* point to original read only copy if section is not allocatable.
-* Otherwise it will point to temporary copy which will be relocated.
-*
-* Use ->sh_addr to contain final address of the section where it
-* will go during execution time.
-*/
-   for (i = 0; i < pi->ehdr->e_shnum; i++) {
-   if (sechdrs[i].sh_type == SHT_NOBITS)
-   continue;
-
-   sechdrs[i].sh_offset = (unsigned long)pi->ehdr +
-   sechdrs[i].sh_offset;
-   }
-
offset = 0;
bss_addr = kbuf->mem + kbuf->bufsz;
kbuf->image->start = pi->ehdr->e_entry;
@@ -786,17 +768,12 @@ static int kexec_purgatory_setup_sechdrs(struct 
purgatory_info *pi,
kbuf->image->start += kbuf->mem + offset;
}
 
-   src = (void *)sechdrs[i].sh_offset;
+   src = (void *)pi->ehdr + sechdrs[i].sh_offset;
dst = pi->purgatory_buf + offset;
memcpy(dst, src, sechdrs[i].sh_size);
 
sechdrs[i].sh_addr = kbuf->mem + offset;
-
-   /*
-* This section got copied to temporary buffer. Update
-* ->sh_offset accordingly.
-*/
-   sechdrs[i].sh_offset = (unsigned long)dst;
+   sechdrs[i].sh_offset = offset;
offset += sechdrs[i].sh_size;
}
 
@@ -1006,7 +983,7 @@ int kexec_purgatory_get_set_symbol(struct kimage *image, 
const char *name,
return -EINVAL;
}
 
- 

[PATCH v2 06/11] kexec_file: Split up __kexec_load_puragory

2018-03-21 Thread Philipp Rudo
When inspecting __kexec_load_purgatory you find that it has two tasks

1) setting up the kexec_buffer for the new kernel and,
2) setting up pi->sechdrs for the final load address.

The two tasks are independent of each other. To improve readability split
up __kexec_load_purgatory into two functions, one for each task, and call
them directly from kexec_load_purgatory.

Signed-off-by: Philipp Rudo <pr...@linux.vnet.ibm.com>
---
 kernel/kexec_file.c | 200 +++-
 1 file changed, 103 insertions(+), 97 deletions(-)

diff --git a/kernel/kexec_file.c b/kernel/kexec_file.c
index 80c7f658afc0..d1c3ec8dc6b1 100644
--- a/kernel/kexec_file.c
+++ b/kernel/kexec_file.c
@@ -649,39 +649,97 @@ static int kexec_calculate_store_digests(struct kimage 
*image)
return ret;
 }
 
-/* Actually load purgatory. Lot of code taken from kexec-tools */
-static int __kexec_load_purgatory(struct kimage *image, unsigned long min,
- unsigned long max, int top_down)
+/*
+ * kexec_purgatory_setup_kbuf - prepare buffer to load purgatory.
+ * @pi:Purgatory to be loaded.
+ * @kbuf:  Buffer to setup.
+ *
+ * Allocates the memory needed for the buffer. Caller is responsible to free
+ * the memory after use.
+ *
+ * Return: 0 on success, negative errno on error.
+ */
+static int kexec_purgatory_setup_kbuf(struct purgatory_info *pi,
+ struct kexec_buf *kbuf)
 {
-   struct purgatory_info *pi = >purgatory_info;
-   unsigned long align, bss_align, bss_sz, bss_pad;
-   unsigned long entry, load_addr, curr_load_addr, bss_addr, offset;
-   unsigned char *buf_addr, *src;
-   int i, ret = 0, entry_sidx = -1;
-   const Elf_Shdr *sechdrs_c;
-   Elf_Shdr *sechdrs = NULL;
-   struct kexec_buf kbuf = { .image = image, .bufsz = 0, .buf_align = 1,
- .buf_min = min, .buf_max = max,
- .top_down = top_down };
+   const Elf_Shdr *sechdrs;
+   unsigned long bss_align;
+   unsigned long bss_sz;
+   unsigned long align;
+   int i, ret;
 
-   /*
-* sechdrs_c points to section headers in purgatory and are read
-* only. No modifications allowed.
-*/
-   sechdrs_c = (void *)pi->ehdr + pi->ehdr->e_shoff;
+   sechdrs = (void *)pi->ehdr + pi->ehdr->e_shoff;
+   bss_align = 1;
+   bss_sz = 0;
+
+   for (i = 0; i < pi->ehdr->e_shnum; i++) {
+   if (!(sechdrs[i].sh_flags & SHF_ALLOC))
+   continue;
+
+   align = sechdrs[i].sh_addralign;
+   if (sechdrs[i].sh_type != SHT_NOBITS) {
+   if (kbuf->buf_align < align)
+   kbuf->buf_align = align;
+   kbuf->bufsz = ALIGN(kbuf->bufsz, align);
+   kbuf->bufsz += sechdrs[i].sh_size;
+   } else {
+   if (bss_align < align)
+   bss_align = align;
+   bss_sz = ALIGN(bss_sz, align);
+   bss_sz += sechdrs[i].sh_size;
+   }
+   }
+   kbuf->bufsz = ALIGN(kbuf->bufsz, bss_align);
+   kbuf->memsz = kbuf->bufsz + bss_sz;
+   if (kbuf->buf_align < bss_align)
+   kbuf->buf_align = bss_align;
+
+   kbuf->buffer = vzalloc(kbuf->bufsz);
+   if (!kbuf->buffer)
+   return -ENOMEM;
+   pi->purgatory_buf = kbuf->buffer;
+
+   ret = kexec_add_buffer(kbuf);
+   if (ret)
+   goto out;
+   pi->purgatory_load_addr = kbuf->mem;
+
+   return 0;
+out:
+   vfree(pi->purgatory_buf);
+   pi->purgatory_buf = NULL;
+   return ret;
+}
+
+/*
+ * kexec_purgatory_setup_sechdrs - prepares the pi->sechdrs buffer.
+ * @pi:Purgatory to be loaded.
+ * @kbuf:  Buffer prepared to store purgatory.
+ *
+ * Allocates the memory needed for the buffer. Caller is responsible to free
+ * the memory after use.
+ *
+ * Return: 0 on success, negative errno on error.
+ */
+static int kexec_purgatory_setup_sechdrs(struct purgatory_info *pi,
+struct kexec_buf *kbuf)
+{
+   unsigned long curr_load_addr;
+   unsigned long load_addr;
+   unsigned long bss_addr;
+   unsigned long offset;
+   unsigned char *buf_addr;
+   unsigned char *src;
+   Elf_Shdr *sechdrs;
+   int entry_sidx = -1;
+   int i;
 
-   /*
-* We can not modify sechdrs_c[] and its fields. It is read only.
-* Copy it over to a local copy where one can store some temporary
-* data and free it at the end. We need to modify ->sh_addr and
-* ->sh_offset fields to keep track of permanent and temporary
-*

[PATCH v2 00/11] kexec_file: Clean up purgatory load

2018-03-21 Thread Philipp Rudo
Hi everybody

here is the updated patch set including Dave's comments.

Changes v1 -> v2:
* Fix missing memcpy and memset definition in x86 purgatory
introduced by moving the sha256 implementation to common code.
* Update subject of patches #7-9 to be more precise.
* Add missing semicolon in patch #7 found by kbuild bot.
* Fix compile warnings found by kbuild bot by moving all kexec_file
specific function declarations into a 'ifdef CONFIG_KEXEC_FILE'
block (patch #5 and #10).

Thanks
Philipp

---

Hi everybody

following the discussion with Dave and AKASHI, here are the common code
patches extracted from my recent patch set (Add kexec_file_load support to
s390) [1]. The patches were extracted to allow upstream integration together
with AKASHI's common code patches before the arch code gets adjusted to the
new base.

The reason for this series is to prepare common code for adding
kexec_file_load to s390 as well as cleaning up the mis-use of the sh_offset
field during purgatory load. In detail this series contains:

Patch #1&2: Minor cleanups/fixes.

Patch #3-9: Clean up the purgatory load/relocation code. Especially remove
the mis-use of the purgatory_info->sechdrs->sh_offset field, currently
holding a pointer into either kexec_purgatory (ro) or purgatory_buf (rw)
depending on the section. With these patches the section address will be
calculated verbosely and sh_offset will contain the offset of the section
in the stripped purgatory binary (purgatory_buf).

Patch #10: Allows architectures to set the purgatory load address. This
patch is important for s390 as the kernel and purgatory have to be loaded
to fixed addresses. In current code this is impossible as the purgatory
load is opaque to the architecture.

Patch #11: Moves x86 purgatories sha implementation to common lib/
directory to allow reuse in other architectures.

The patches apply to v4.16-rc3. There are no changes compared to [1] (all
requested changes only affected s390 code). Please note that I had to touch
arch code for x86 and power a little. In theory this should not change the
behavior but I don't have a way to test it. Cross-compiling with
defconfig [2] works fine for both.

Thanks
Philipp

[1] http://lists.infradead.org/pipermail/kexec/2018-February/019926.html
[2] On x86 with the orc unwinder and stack validation turned off. objtool
SEGFAULTs on s390...

Philipp Rudo (11):
  kexec_file: Silence compile warnings
  kexec_file: Remove checks in kexec_purgatory_load
  kexec_file: Make purgatory_info->ehdr const
  kexec_file: Search symbols in read-only kexec_purgatory
  kexec_file: Use read-only sections in arch_kexec_apply_relocations*
  kexec_file: Split up __kexec_load_puragory
  kexec_file: Remove unneeded for-loop in kexec_purgatory_setup_sechdrs
  kexec_file: Remove unneeded variables in kexec_purgatory_setup_sechdrs
  kexec_file: Remove mis-use of sh_offset field during purgatory load
  kexec_file: Allow archs to set purgatory load address
  kexec_file: Move purgatories sha256 to common code

 arch/powerpc/kernel/kexec_elf_64.c |   9 +-
 arch/x86/kernel/kexec-bzimage64.c  |   8 +-
 arch/x86/kernel/machine_kexec_64.c |  66 ++---
 arch/x86/purgatory/Makefile|   3 +
 arch/x86/purgatory/purgatory.c |   2 +-
 arch/x86/purgatory/string.c|  12 +
 include/linux/kexec.h  |  49 ++--
 {arch/x86/purgatory => include/linux}/sha256.h |  11 +-
 kernel/kexec_file.c| 379 -
 {arch/x86/purgatory => lib}/sha256.c   |   4 +-
 10 files changed, 267 insertions(+), 276 deletions(-)
 rename {arch/x86/purgatory => include/linux}/sha256.h (62%)
 rename {arch/x86/purgatory => lib}/sha256.c (99%)

-- 
2.13.5


___
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec


[PATCH v2 07/11] kexec_file: Remove unneeded for-loop in kexec_purgatory_setup_sechdrs

2018-03-21 Thread Philipp Rudo
To update the entry point there is an extra loop over all section headers
although this can be done in the main loop. So move it there and eliminate
the extra loop and variable to store the 'entry section index'.

Also, in the main loop, move the usual case, i.e. non-bss section, out of
the extra if-block.

Signed-off-by: Philipp Rudo <pr...@linux.vnet.ibm.com>
Reviewed-by: Martin Schwidefsky <schwidef...@de.ibm.com>
---
 kernel/kexec_file.c | 76 +
 1 file changed, 30 insertions(+), 46 deletions(-)

diff --git a/kernel/kexec_file.c b/kernel/kexec_file.c
index d1c3ec8dc6b1..4938f7aacd09 100644
--- a/kernel/kexec_file.c
+++ b/kernel/kexec_file.c
@@ -731,7 +731,6 @@ static int kexec_purgatory_setup_sechdrs(struct 
purgatory_info *pi,
unsigned char *buf_addr;
unsigned char *src;
Elf_Shdr *sechdrs;
-   int entry_sidx = -1;
int i;
 
sechdrs = vzalloc(pi->ehdr->e_shnum * sizeof(Elf_Shdr));
@@ -763,32 +762,11 @@ static int kexec_purgatory_setup_sechdrs(struct 
purgatory_info *pi,
sechdrs[i].sh_offset;
}
 
-   /*
-* Identify entry point section and make entry relative to section
-* start.
-*/
-   kbuf->image->start = pi->ehdr->e_entry;
-   for (i = 0; i < pi->ehdr->e_shnum; i++) {
-   if (!(sechdrs[i].sh_flags & SHF_ALLOC))
-   continue;
-
-   if (!(sechdrs[i].sh_flags & SHF_EXECINSTR))
-   continue;
-
-   /* Make entry section relative */
-   if (sechdrs[i].sh_addr <= pi->ehdr->e_entry &&
-   ((sechdrs[i].sh_addr + sechdrs[i].sh_size) >
-pi->ehdr->e_entry)) {
-   entry_sidx = i;
-   kbuf->image->start -= sechdrs[i].sh_addr;
-   break;
-   }
-   }
-
/* Load SHF_ALLOC sections */
buf_addr = kbuf->buffer;
load_addr = curr_load_addr = kbuf->mem;
bss_addr = load_addr + kbuf->bufsz;
+   kbuf->image->start = pi->ehdr->e_entry;
 
for (i = 0; i < pi->ehdr->e_shnum; i++) {
unsigned long align;
@@ -797,34 +775,40 @@ static int kexec_purgatory_setup_sechdrs(struct 
purgatory_info *pi,
continue;
 
align = sechdrs[i].sh_addralign;
-   if (sechdrs[i].sh_type != SHT_NOBITS) {
-   curr_load_addr = ALIGN(curr_load_addr, align);
-   offset = curr_load_addr - load_addr;
-   /* We already modifed ->sh_offset to keep src addr */
-   src = (char *) sechdrs[i].sh_offset;
-   memcpy(buf_addr + offset, src, sechdrs[i].sh_size);
-
-   /* Store load address and source address of section */
-   sechdrs[i].sh_addr = curr_load_addr;
-
-   /*
-* This section got copied to temporary buffer. Update
-* ->sh_offset accordingly.
-*/
-   sechdrs[i].sh_offset = (unsigned long)(buf_addr + 
offset);
-
-   /* Advance to the next address */
-   curr_load_addr += sechdrs[i].sh_size;
-   } else {
+
+   if (sechdrs[i].sh_type == SHT_NOBITS) {
bss_addr = ALIGN(bss_addr, align);
sechdrs[i].sh_addr = bss_addr;
bss_addr += sechdrs[i].sh_size;
+   continue;
+   }
+
+   curr_load_addr = ALIGN(curr_load_addr, align);
+   offset = curr_load_addr - load_addr;
+   /* We already modifed ->sh_offset to keep src addr */
+   src = (char *)sechdrs[i].sh_offset;
+   memcpy(buf_addr + offset, src, sechdrs[i].sh_size);
+
+   if (sechdrs[i].sh_flags & SHF_EXECINSTR &&
+   pi->ehdr->e_entry >= sechdrs[i].sh_addr &&
+   pi->ehdr->e_entry < (sechdrs[i].sh_addr
++ sechdrs[i].sh_size)) {
+   kbuf->image->start -= sechdrs[i].sh_addr;
+   kbuf->image->start += curr_load_addr;
}
-   }
 
-   /* Update entry point based on load address of text section */
-   if (entry_sidx >= 0)
-   kbuf->image->start += sechdrs[entry_sidx].sh_addr;
+   /* Store load address and source address of section */
+   sechdrs[i].sh_addr = curr_load_addr;
+
+   /*
+* This section got copied to temporary buffer. Update
+* ->sh_offset accordingl

  1   2   >