Re: kexec reports "Cannot get kernel _text symbol address" on arm64 platform

2023-08-11 Thread b...@redhat.com
On 08/12/23 at 07:11am, Baoquan He wrote:
> On 08/11/23 at 01:27pm, Pandey, Radhey Shyam wrote:
> > > -Original Message-
> > > From: b...@redhat.com 
> > > Sent: Wednesday, August 9, 2023 7:42 AM
> > > To: Pandey, Radhey Shyam ;
> > > pi...@redhat.com
> > > Cc: kexec@lists.infradead.org; linux-ker...@vger.kernel.org
> > > Subject: Re: kexec reports "Cannot get kernel _text symbol address" on
> > > arm64 platform
> > > 
> > > On 08/08/23 at 07:17pm, Pandey, Radhey Shyam wrote:
> > > > Hi,
> > > >
> > > > I am trying to bring up kdump on arm64 platform[1]. But I get "Cannot 
> > > > get
> > > kernel _text symbol address".
> > > >
> > > > Is there some Dump-capture kernel config options that I am missing?
> > > >
> > > > FYI, copied below complete kexec debug log.
> > > >
> > > > [1]: https://www.xilinx.com/products/boards-and-kits/vck190.html
> > > 
> > > Your description isn't clear. You saw the message printed, but did your
> > > kdump kernel loading succeed or not?
> > > 
> > > If no, have you tried applying Pingfan's patchset and still saw the issue?
> > > 
> > > [PATCHv7 0/5] arm64: zboot support
> > > https://lore.kernel.org/all/20230803024152.11663-1-pi...@redhat.com/T/#u
> > 
> > I was able to proceed further with loading the crash kernel and triggering
> > a system crash:
> > echo c > /proc/sysrq-trigger
> > 
> > But when I copy /proc/vmcore it throws a memory abort. Also, the size of
> > /proc/vmcore is really huge (18446603353488633856).
> 
> This is a better symptom description.
> 
> It's very similar to an already solved issue even though the calltrace is not
> completely the same. Can you try the below patch to see if it fixes your problem?

Oops, I was wrong. The below patch is irrelevant because it addresses a kcore
issue, while you hit a vmcore issue; please ignore it. We need to investigate
to see what is happening.

> 
> [PATCH] fs/proc/kcore: reinstate bounce buffer for KCORE_TEXT regions
> https://lore.kernel.org/all/20230731215021.70911-1-lstoa...@gmail.com/T/#u
> 
> > Any possible guess on what could be wrong?
> > 
> > 
> > [   80.733523] Starting crashdump kernel...
> > [   80.737435] Bye!
> > [0.00] Booting Linux on physical CPU 0x01 [0x410fd083]
> > [0.00] Linux version 6.5.0-rc4-ge28001fb4e07 (radheys@xhdradheys41) 
> > (aarch64-xilinx-linux-gcc.real (GCC) 12.2.0, GNU ld (GNU Binutils) 
> > 2.39.0.20220819) #23 SMP Fri Aug 11 16:25:34 IST 2023
> > 
> > 
> > 
> > 
> > xilinx-vck190-20232:/run/media/mmcblk0p1# cat /proc/meminfo | head
> > MemTotal:2092876 kB
> > MemFree: 1219928 kB
> > MemAvailable:1166004 kB
> > Buffers:  32 kB
> > Cached:   756952 kB
> > SwapCached:0 kB
> > Active: 1480 kB
> > Inactive:  24164 kB
> > Active(anon):   1452 kB
> > Inactive(anon):24160 kB
> > xilinx-vck190-20232:/run/media/mmcblk0p1# cp /proc/vmcore dump 
> > [  975.284865] Unable to handle kernel level 3 address size fault at 
> > virtual address 80008d7cf000
> > [  975.293871] Mem abort info:
> > [  975.296669]   ESR = 0x9603
> > [  975.300425]   EC = 0x25: DABT (current EL), IL = 32 bits
> > [  975.305738]   SET = 0, FnV = 0
> > [  975.308788]   EA = 0, S1PTW = 0
> > [  975.311925]   FSC = 0x03: level 3 address size fault
> > [  975.316888] Data abort info:
> > [  975.319763]   ISV = 0, ISS = 0x0003, ISS2 = 0x
> > [  975.325245]   CM = 0, WnR = 0, TnD = 0, TagAccess = 0
> > [  975.330292]   GCS = 0, Overlay = 0, DirtyBit = 0, Xs = 0
> > [  975.335599] swapper pgtable: 4k pages, 48-bit VAs, pgdp=05016ef6b000
> > [  975.342297] [80008d7cf000] pgd=1501eddfe003, 
> > p4d=1501eddfe003, pud=1501eddfd003, pmd=15017b695003, 
> > pte=00687fff84000703
> > [  975.354827] Internal error: Oops: 9603 [#4] SMP
> > [  975.360392] Modules linked in:
> > [  975.363440] CPU: 0 PID: 664 Comm: cp Tainted: G  D
> > 6.5.0-rc4-ge28001fb4e07 #23
> > [  975.372822] Hardware name: Xilinx Versal vck190 Eval board revA (DT)
> > [  975.379165] pstate: a005 (NzCv daif -PAN -UAO -TCO -DIT -SSBS 
> > BTYPE=--)
> > [  975.386119] pc : __memcpy+0x110/0x230
> > [  975.389783] lr : _copy_to_iter+0x3d8/0x4d0
> > [  975.393874] sp : 80008dc939a0
> > [  975.397178] x29: 80008dc939a0 x28: 05013c1bea30 x27: 
> > 1000
> > [  975.404309] x26: 1000 x25: 1000 x24: 
> > 80008d7cf000
> > [  975.411440] x23: 0400 x22: 80008dc93ba0 x21: 
> > 1000
> > [  975.418570] x20:  x19: 1000 x18: 
> > 
> > [  975.425699] x17:  x16:  x15: 
> > 0140
> > [  975.432829] x14: 8500a9919000 x13: 0041 x12: 
> > fffef6831000
> > [  975.439958] x11: 80008d9cf000 x10:  x9 : 
> > 
> > [  975.447088] x8 : 80008d7d x7 : 0501addfd358 x6 : 
> > 0401
> > [  975.454217] x5 

Re: kexec reports "Cannot get kernel _text symbol address" on arm64 platform

2023-08-11 Thread b...@redhat.com
On 08/11/23 at 01:27pm, Pandey, Radhey Shyam wrote:
> > -Original Message-
> > From: b...@redhat.com 
> > Sent: Wednesday, August 9, 2023 7:42 AM
> > To: Pandey, Radhey Shyam ;
> > pi...@redhat.com
> > Cc: kexec@lists.infradead.org; linux-ker...@vger.kernel.org
> > Subject: Re: kexec reports "Cannot get kernel _text symbol address" on
> > arm64 platform
> > 
> > On 08/08/23 at 07:17pm, Pandey, Radhey Shyam wrote:
> > > Hi,
> > >
> > > I am trying to bring up kdump on arm64 platform[1]. But I get "Cannot get
> > kernel _text symbol address".
> > >
> > > Is there some Dump-capture kernel config options that I am missing?
> > >
> > > FYI, copied below complete kexec debug log.
> > >
> > > [1]: https://www.xilinx.com/products/boards-and-kits/vck190.html
> > 
> > Your description isn't clear. You saw the message printed, but did your
> > kdump kernel loading succeed or not?
> > 
> > If no, have you tried applying Pingfan's patchset and still saw the issue?
> > 
> > [PATCHv7 0/5] arm64: zboot support
> > https://lore.kernel.org/all/20230803024152.11663-1-pi...@redhat.com/T/#u
> 
> I was able to proceed further with loading the crash kernel and triggering
> a system crash:
> echo c > /proc/sysrq-trigger
> 
> But when I copy /proc/vmcore it throws a memory abort. Also, the size of
> /proc/vmcore is really huge (18446603353488633856).

This is a better symptom description.

It's very similar to an already solved issue even though the calltrace is not
completely the same. Can you try the below patch to see if it fixes your problem?

[PATCH] fs/proc/kcore: reinstate bounce buffer for KCORE_TEXT regions
https://lore.kernel.org/all/20230731215021.70911-1-lstoa...@gmail.com/T/#u

> Any possible guess on what could be wrong?
> 
> 
> [   80.733523] Starting crashdump kernel...
> [   80.737435] Bye!
> [0.00] Booting Linux on physical CPU 0x01 [0x410fd083]
> [0.00] Linux version 6.5.0-rc4-ge28001fb4e07 (radheys@xhdradheys41) 
> (aarch64-xilinx-linux-gcc.real (GCC) 12.2.0, GNU ld (GNU Binutils) 
> 2.39.0.20220819) #23 SMP Fri Aug 11 16:25:34 IST 2023
> 
> 
> 
> 
> xilinx-vck190-20232:/run/media/mmcblk0p1# cat /proc/meminfo | head
> MemTotal:2092876 kB
> MemFree: 1219928 kB
> MemAvailable:1166004 kB
> Buffers:  32 kB
> Cached:   756952 kB
> SwapCached:0 kB
> Active: 1480 kB
> Inactive:  24164 kB
> Active(anon):   1452 kB
> Inactive(anon):24160 kB
> xilinx-vck190-20232:/run/media/mmcblk0p1# cp /proc/vmcore dump 
> [  975.284865] Unable to handle kernel level 3 address size fault at virtual 
> address 80008d7cf000
> [  975.293871] Mem abort info:
> [  975.296669]   ESR = 0x9603
> [  975.300425]   EC = 0x25: DABT (current EL), IL = 32 bits
> [  975.305738]   SET = 0, FnV = 0
> [  975.308788]   EA = 0, S1PTW = 0
> [  975.311925]   FSC = 0x03: level 3 address size fault
> [  975.316888] Data abort info:
> [  975.319763]   ISV = 0, ISS = 0x0003, ISS2 = 0x
> [  975.325245]   CM = 0, WnR = 0, TnD = 0, TagAccess = 0
> [  975.330292]   GCS = 0, Overlay = 0, DirtyBit = 0, Xs = 0
> [  975.335599] swapper pgtable: 4k pages, 48-bit VAs, pgdp=05016ef6b000
> [  975.342297] [80008d7cf000] pgd=1501eddfe003, p4d=1501eddfe003, 
> pud=1501eddfd003, pmd=15017b695003, pte=00687fff84000703
> [  975.354827] Internal error: Oops: 9603 [#4] SMP
> [  975.360392] Modules linked in:
> [  975.363440] CPU: 0 PID: 664 Comm: cp Tainted: G  D
> 6.5.0-rc4-ge28001fb4e07 #23
> [  975.372822] Hardware name: Xilinx Versal vck190 Eval board revA (DT)
> [  975.379165] pstate: a005 (NzCv daif -PAN -UAO -TCO -DIT -SSBS BTYPE=--)
> [  975.386119] pc : __memcpy+0x110/0x230
> [  975.389783] lr : _copy_to_iter+0x3d8/0x4d0
> [  975.393874] sp : 80008dc939a0
> [  975.397178] x29: 80008dc939a0 x28: 05013c1bea30 x27: 
> 1000
> [  975.404309] x26: 1000 x25: 1000 x24: 
> 80008d7cf000
> [  975.411440] x23: 0400 x22: 80008dc93ba0 x21: 
> 1000
> [  975.418570] x20:  x19: 1000 x18: 
> 
> [  975.425699] x17:  x16:  x15: 
> 0140
> [  975.432829] x14: 8500a9919000 x13: 0041 x12: 
> fffef6831000
> [  975.439958] x11: 80008d9cf000 x10:  x9 : 
> 
> [  975.447088] x8 : 80008d7d x7 : 0501addfd358 x6 : 
> 0401
> [  975.454217] x5 : 0501370e9000 x4 : 80008d7d x3 : 
> 
> [  975.461346] x2 : 1000 x1 : 80008d7cf000 x0 : 
> 0501370e8000
> [  975.468476] Call trace:
> [  975.470912]  __memcpy+0x110/0x230
> [  975.474221]  copy_oldmem_page+0x70/0xac
> [  975.478050]  read_from_oldmem.part.0+0x120/0x188
> [  975.482663]  read_vmcore+0x14c/0x238
> [  975.486231]  proc_reg_read_iter+0x84/0xd8
> [  975.490233]  

Re: [RFC] IMA Log Snapshotting Design Proposal

2023-08-11 Thread Stefan Berger



On 8/11/23 11:57, Tushar Sugandhi wrote:




[1] 
https://patchwork.kernel.org/project/linux-integrity/cover/20230801181917.8535-1-tusha...@linux.microsoft.com/


The shards will need to be written into some sort of standard location, or a
config file needs to be defined, so that everyone knows where to find them and
how they are named.


We thought about a well-known standard location earlier.
Letting the kernel choose the name/location of the snapshot
file comes with its own complexity. Our initial stance is that we don't
want to handle that at the kernel level, and instead let the UM client choose
the location/naming of the snapshot files. But we are happy to
reconsider if the community requests it.


I would also let user space do the snapshotting, but all applications
relying on shards should know where they are located on the system
and what the naming scheme is so they can be processed in proper order.
evmctl, for example, would have to know where the shards are if the keylime
agent had taken snapshots.




Yes. If the “PCR quotes in the snapshot_aggregate event in IMA log”


PCR quote or 'quotes'? Why multiple?

From your proposal, though you may have changed your opinion following what I see
in other messages:
"- The Kernel will get the current TPM PCR values and PCR update counter [2]
    and store them as template data in a new IMA event "snapshot_aggregate"."

Afaik TPM quotes don't give you the state of the individual PCR values, therefore
I would expect to at least find the 'PCR values' of all the PCRs that IMA touched
to be in the snapshot_aggregate, so I can replay all the following events on top
of these PCR values and come up with the values that were used in the "final PCR
quote". This is unless you expect the server to take an automatic snapshot of the
values of the PCRs that it computed while evaluating the log in case it ever
needs to go back.
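To make the replay step concrete, here is a rough user-space sketch of that
computation (my own illustration, assuming SHA-256 PCR banks and OpenSSL's
SHA256(); the helper names are made up): start from the PCR value recorded in a
snapshot_aggregate-style entry and extend it with each subsequent template hash,
ending up at the value the final quote should attest.

/* Illustration only: replay IMA events on top of a recorded PCR value.  */
/* A PCR extend is: new_pcr = SHA256(old_pcr || event_digest).           */
#include <openssl/sha.h>
#include <string.h>

static void pcr_extend_sha256(unsigned char pcr[SHA256_DIGEST_LENGTH],
                              const unsigned char digest[SHA256_DIGEST_LENGTH])
{
        unsigned char buf[2 * SHA256_DIGEST_LENGTH];

        memcpy(buf, pcr, SHA256_DIGEST_LENGTH);
        memcpy(buf + SHA256_DIGEST_LENGTH, digest, SHA256_DIGEST_LENGTH);
        SHA256(buf, sizeof(buf), pcr);          /* pcr now holds the new value */
}

/*
 * start:   PCR 10 value taken from the snapshot_aggregate-style entry
 * digests: template hashes of the log entries that follow it
 * On return, pcr should match the PCR 10 value covered by the "final PCR
 * quote" if the truncated log is consistent with that quote.
 */
static void replay_pcr10(unsigned char pcr[SHA256_DIGEST_LENGTH],
                         const unsigned char start[SHA256_DIGEST_LENGTH],
                         const unsigned char (*digests)[SHA256_DIGEST_LENGTH],
                         size_t count)
{
        size_t i;

        memcpy(pcr, start, SHA256_DIGEST_LENGTH);
        for (i = 0; i < count; i++)
                pcr_extend_sha256(pcr, digests[i]);
}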


I meant a single set of PCR values captured when snapshot_aggregate
is logged. Sorry for the confusion.


Ok.




+ "replay of rest of the events in IMA log" results in the “final PCR quotes”
that matches with the “AK signed PCR quotes” sent by the client, then the 
truncated
IMA log can be trusted. The verifier can either ‘trust’ the “PCR quotes in the
snapshot_aggregate event in IMA log” or it can ask for the (n-1)th snapshot 
shard
to check the past events.


For anything regarding determining the 'trustworthiness of a system', one would
have to be able to go back to the very beginning of the log *or* remember what
state a system was in when the latest snapshot was taken, so that if a restart
happens it can resume with that assumption about the state of trustworthiness
and know what the values of the PCRs were at that time, so it can resume
replaying the log (or the server would get these values from the log).


Correct. We intend to support the above. I hope our proposal
description captures it. BTW, when you say ‘restart’, you mean the UM
process restart, right? Because in case of a Kernel restart


Yes, client restart not reboot.


(i.e. cold-boot) the past IMA log (and the TPM state) is lost,
and old snapshots (if any) are useless.


Right. Some script should run on boot and delete all contents of the directory 
where the log
shards are.




The AK quotes by the kernel (which adds a 2nd AK key) that James is proposing
could be useful if the entire log, consisting of multiple shards, is very large
and cannot be transferred from the client to the server in one go, so that the
server could evaluate the 'final PCR quote' immediately. However, if a client
can indicate 'I will send more next time and I have this much more to transfer'
and the server allows this multiple times (until all the 1MB shards of the 20MB
log are transferred), then that kernel AK key would not be necessary since
presumably the "final PCR quote", created by a user space client, would resolve
whether the entire log is trustworthy.


See my responses to James today [2]

[2] 
https://lore.kernel.org/all/72e39852-1ff1-c7f6-ac7e-593e8142d...@linux.microsoft.com/


I think James was proposing one AK, possibly persisted in the TPM's NVRAM.
Still, the fewer keys that are involved in this, the better...

   Stefan




[PATCH v27 2/8] crash: add generic infrastructure for crash hotplug support

2023-08-11 Thread Eric DeVolder
To support crash hotplug, a mechanism is needed to update the crash
elfcorehdr upon CPU or memory changes (eg. hot un/plug or off/
onlining). The crash elfcorehdr describes the CPUs and memory to
be written into the vmcore.

To track CPU changes, callbacks are registered with the cpuhp
mechanism via cpuhp_setup_state_nocalls(CPUHP_BP_PREPARE_DYN). The
crash hotplug elfcorehdr update has no explicit ordering requirement
(relative to other cpuhp states), so meets the criteria for
utilizing CPUHP_BP_PREPARE_DYN. CPUHP_BP_PREPARE_DYN is a dynamic
state and avoids the need to introduce a new state for crash
hotplug. Also, CPUHP_BP_PREPARE_DYN is the last state in the PREPARE
group, just prior to the STARTING group, which is very close to the
CPU starting up in a plug/online situation, or stopping in a unplug/
offline situation. This minimizes the window of time during an
actual plug/online or unplug/offline situation in which the
elfcorehdr would be inaccurate. Note that for a CPU being unplugged
or offlined, the CPU will still be present in the list of CPUs
generated by crash_prepare_elf64_headers(). However, there is no
need to explicitly omit the CPU, see justification in
'crash: change crash_prepare_elf64_headers() to for_each_possible_cpu()'.

To track memory changes, a notifier is registered to capture the
memblock MEM_ONLINE and MEM_OFFLINE events via register_memory_notifier().

The CPU callbacks and memory notifiers invoke crash_handle_hotplug_event()
which performs needed tasks and then dispatches the event to the
architecture specific arch_crash_handle_hotplug_event() to update the
elfcorehdr with the current state of CPUs and memory. During the
process, the kexec_lock is held.
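For orientation, a minimal sketch of how that registration and dispatch can be
wired up is below. It leans on the helpers and constants introduced by this
series (crash_handle_hotplug_event() as the generic handler, the
KEXEC_CRASH_HP_* values); the exact function names, locking and error handling
in the real patch differ, so treat it as an illustration rather than the patch
itself.

/* Illustrative sketch only; not the exact code added by this patch. */
#include <linux/cpuhotplug.h>
#include <linux/memory.h>
#include <linux/notifier.h>
#include <linux/init.h>
#include <linux/crash_core.h>   /* KEXEC_CRASH_HP_* */

static int crash_cpuhp_online(unsigned int cpu)
{
        /* CPU coming up: have the generic handler refresh the elfcorehdr */
        crash_handle_hotplug_event(KEXEC_CRASH_HP_ADD_CPU, cpu);
        return 0;
}

static int crash_cpuhp_offline(unsigned int cpu)
{
        /* CPU going down: refresh the elfcorehdr */
        crash_handle_hotplug_event(KEXEC_CRASH_HP_REMOVE_CPU, cpu);
        return 0;
}

static int crash_memhp_notifier(struct notifier_block *nb,
                                unsigned long val, void *arg)
{
        switch (val) {
        case MEM_ONLINE:
                crash_handle_hotplug_event(KEXEC_CRASH_HP_ADD_MEMORY,
                                           KEXEC_CRASH_HP_INVALID_CPU);
                break;
        case MEM_OFFLINE:
                crash_handle_hotplug_event(KEXEC_CRASH_HP_REMOVE_MEMORY,
                                           KEXEC_CRASH_HP_INVALID_CPU);
                break;
        }
        return NOTIFY_OK;
}

static struct notifier_block crash_memhp_nb = {
        .notifier_call = crash_memhp_notifier,
};

static int __init crash_hotplug_init(void)
{
        int ret = 0;

        if (IS_ENABLED(CONFIG_MEMORY_HOTPLUG))
                register_memory_notifier(&crash_memhp_nb);

        if (IS_ENABLED(CONFIG_HOTPLUG_CPU))
                ret = cpuhp_setup_state_nocalls(CPUHP_BP_PREPARE_DYN,
                                                "crash/cpuhp",
                                                crash_cpuhp_online,
                                                crash_cpuhp_offline);
        return ret;
}
subsys_initcall(crash_hotplug_init);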

Signed-off-by: Eric DeVolder 
Reviewed-by: Sourabh Jain 
Acked-by: Hari Bathini 
Acked-by: Baoquan He 
---
 include/linux/crash_core.h |   9 +++
 include/linux/kexec.h  |  11 +++
 kernel/Kconfig.kexec   |  31 
 kernel/crash_core.c| 142 +
 kernel/kexec_core.c|   6 ++
 5 files changed, 199 insertions(+)

diff --git a/include/linux/crash_core.h b/include/linux/crash_core.h
index de62a722431e..e14345cc7a22 100644
--- a/include/linux/crash_core.h
+++ b/include/linux/crash_core.h
@@ -84,4 +84,13 @@ int parse_crashkernel_high(char *cmdline, unsigned long long 
system_ram,
 int parse_crashkernel_low(char *cmdline, unsigned long long system_ram,
unsigned long long *crash_size, unsigned long long *crash_base);
 
+#define KEXEC_CRASH_HP_NONE0
+#define KEXEC_CRASH_HP_ADD_CPU 1
+#define KEXEC_CRASH_HP_REMOVE_CPU  2
+#define KEXEC_CRASH_HP_ADD_MEMORY  3
+#define KEXEC_CRASH_HP_REMOVE_MEMORY   4
+#define KEXEC_CRASH_HP_INVALID_CPU -1U
+
+struct kimage;
+
 #endif /* LINUX_CRASH_CORE_H */
diff --git a/include/linux/kexec.h b/include/linux/kexec.h
index 811a90e09698..b9903dd48e24 100644
--- a/include/linux/kexec.h
+++ b/include/linux/kexec.h
@@ -33,6 +33,7 @@ extern note_buf_t __percpu *crash_notes;
 #include 
 #include 
 #include 
+#include 
 #include 
 
 /* Verify architecture specific macros are defined */
@@ -360,6 +361,12 @@ struct kimage {
struct purgatory_info purgatory_info;
 #endif
 
+#ifdef CONFIG_CRASH_HOTPLUG
+   int hp_action;
+   int elfcorehdr_index;
+   bool elfcorehdr_updated;
+#endif
+
 #ifdef CONFIG_IMA_KEXEC
/* Virtual address of IMA measurement buffer for kexec syscall */
void *ima_buffer;
@@ -490,6 +497,10 @@ static inline int arch_kexec_post_alloc_pages(void *vaddr, 
unsigned int pages, g
 static inline void arch_kexec_pre_free_pages(void *vaddr, unsigned int pages) 
{ }
 #endif
 
+#ifndef arch_crash_handle_hotplug_event
+static inline void arch_crash_handle_hotplug_event(struct kimage *image) { }
+#endif
+
 #else /* !CONFIG_KEXEC_CORE */
 struct pt_regs;
 struct task_struct;
diff --git a/kernel/Kconfig.kexec b/kernel/Kconfig.kexec
index ff72e45cfaef..d0a9a5392035 100644
--- a/kernel/Kconfig.kexec
+++ b/kernel/Kconfig.kexec
@@ -113,4 +113,35 @@ config CRASH_DUMP
  For s390, this option also enables zfcpdump.
  See also 
 
+config CRASH_HOTPLUG
+   bool "Update the crash elfcorehdr on system configuration changes"
+   default y
+   depends on CRASH_DUMP && (HOTPLUG_CPU || MEMORY_HOTPLUG)
+   depends on ARCH_SUPPORTS_CRASH_HOTPLUG
+   help
+ Enable direct update to the crash elfcorehdr (which contains
+ the list of CPUs and memory regions to be dumped upon a crash)
+ in response to hot plug/unplug or online/offline of CPUs or
+ memory. This is a much more advanced approach than userspace
+ attempting that.
+
+ If unsure, say Y.
+
+config CRASH_MAX_MEMORY_RANGES
+   int "Specify the maximum number of memory regions for the elfcorehdr"
+   default 8192
+   depends on CRASH_HOTPLUG
+   help
+ For the kexec_file_load() 

[PATCH v27 0/8] crash: Kernel handling of CPU and memory hot un/plug

2023-08-11 Thread Eric DeVolder
This series is dependent upon "refactor Kconfig to consolidate
KEXEC and CRASH options".
 https://lore.kernel.org/lkml/20230712161545.87870-1-eric.devol...@oracle.com/

Once the kdump service is loaded, if changes to CPUs or memory occur,
either by hot un/plug or off/onlining, the crash elfcorehdr must also
be updated.

The elfcorehdr describes to kdump the CPUs and memory in the system,
and any inaccuracies can result in a vmcore with missing CPU context
or memory regions.

The current solution utilizes udev to initiate an unload-then-reload
of the kdump image (eg. kernel, initrd, boot_params, purgatory and
elfcorehdr) by the userspace kexec utility. In the original post I
outlined the significant performance problems related to offloading
this activity to userspace.

This patchset introduces a generic crash handler that registers with
the CPU and memory notifiers. Upon CPU or memory changes, from either
hot un/plug or off/onlining, this generic handler is invoked and
performs important housekeeping, for example obtaining the appropriate
lock, and then invokes an architecture specific handler to do the
appropriate elfcorehdr update.

Note the description in patch 'crash: change crash_prepare_elf64_headers()
to for_each_possible_cpu()' and 'x86/crash: optimize CPU changes' that
enables further optimizations related to CPU plug/unplug/online/offline
performance of elfcorehdr updates.

In the case of x86_64, the arch specific handler generates a new
elfcorehdr, and overwrites the old one in memory; thus no involvement
with userspace needed.

To realize the benefits/test this patchset, one must make a couple
of minor changes to userspace:

 - Prevent udev from updating kdump crash kernel on hot un/plug changes.
   Add the following as the first lines to the RHEL udev rule file
   /usr/lib/udev/rules.d/98-kexec.rules:

   # The kernel updates the crash elfcorehdr for CPU and memory changes
   SUBSYSTEM=="cpu", ATTRS{crash_hotplug}=="1", GOTO="kdump_reload_end"
   SUBSYSTEM=="memory", ATTRS{crash_hotplug}=="1", GOTO="kdump_reload_end"

   With this changeset applied, the two rules evaluate to false for
   CPU and memory change events and thus skip the userspace
   unload-then-reload of kdump.

 - Change to the kexec_file_load for loading the kdump kernel:
   Eg. on RHEL: in /usr/bin/kdumpctl, change to:
standard_kexec_args="-p -d -s"
   which adds the -s to select kexec_file_load() syscall.

This kernel patchset also supports kexec_load() with a modified kexec
userspace utility. A working changeset to the kexec userspace utility
is posted to the kexec-tools mailing list here:

 http://lists.infradead.org/pipermail/kexec/2023-May/027049.html

To use the kexec-tools patch, apply, build and install kexec-tools,
then change the kdumpctl's standard_kexec_args to replace the -s with
--hotplug. The removal of -s reverts to the kexec_load syscall and
the addition of --hotplug invokes the changes put forth in the
kexec-tools patch.

Regards,
eric
---
v27: 11aug2023
 - Rebased onto 6.5.0-rc5
 - The linux-next and akpm test bots found a build error when just
   PROC_KCORE is configured (with no KEXEC or CRASH), which resulted
   in CRASH_CORE enabled by itself. To solve, the struct crash_mem
   moved from include/linux/kexec.h to include/linux/crash_core.h.
   Similarly, the crash_notes also moved from kernel/kexec.c to
   kernel/crash_core.c.
 - Minor adjustment to arch/x86/kernel/crash.c was also needed to
   avoid unused function build errors for just PROC_KCORE.
 - Spot testing of several architectures did not reveal any further
   build problems (PROC_KCORE, KEXEC, CRASH_DUMP, CRASH_HOTPLUG).

v26: 4aug2023
 https://lore.kernel.org/lkml/20230804210359.8321-1-eric.devol...@oracle.com/
 - Rebased onto 6.5.0-rc4
 - Dropped the refactor of files drivers/base/cpu|memory.c as unrelated
   to this series.
 - Minor corrections to documentation, per Randy Dunlap and GregKH.

v25: 29jun2023
 https://lore.kernel.org/lkml/20230629192119.6613-1-eric.devol...@oracle.com/
 - Properly applied IS_ENABLED() to the function bodies of callbacks
   in drivers/base/cpu|memory.c.
 - Re-ran compile and run-time testing of the impacted attributes for
   both enabled and not enabled config settings.

v24: 28jun2023
 https://lore.kernel.org/lkml/20230628185215.40707-1-eric.devol...@oracle.com/
 - Rebased onto 6.4.0
 - Included Documentation/ABI/testing entries for the new sysfs
   crash_hotplug attributes, per Greg Kroah-Hartman.
 - Refactored drivers/base/cpu|memory.c to use the .is_visible()
   method for attributes, per Greg Kroah-Hartman.
 - Retained all existing Acks and RBs as the few changes as a result
   of Greg's requests were trivial.

v23: 12jun2023
 https://lore.kernel.org/lkml/20230612210712.683175-1-eric.devol...@oracle.com/
 - Rebased onto 6.4.0-rc6
 - Refactored Kconfig, per Thomas. See series:
   
https://lore.kernel.org/lkml/20230612172805.681179-1-eric.devol...@oracle.com/
 - Reworked commit messages to conform to 

[PATCH v27 8/8] x86/crash: optimize CPU changes

2023-08-11 Thread Eric DeVolder
crash_prepare_elf64_headers() writes into the elfcorehdr an ELF
PT_NOTE for all possible CPUs. As such, subsequent changes to CPUs
(ie. hot un/plug, online/offline) do not need to rewrite the elfcorehdr.

The kimage->file_mode term covers kdump images loaded via the
kexec_file_load() syscall. Since crash_prepare_elf64_headers()
wrote the initial elfcorehdr, no update to the elfcorehdr is
needed for CPU changes.

The kimage->elfcorehdr_updated term covers kdump images loaded via
the kexec_load() syscall. At least one memory or CPU change must occur
to cause crash_prepare_elf64_headers() to rewrite the elfcorehdr.
Afterwards, no update to the elfcorehdr is needed for CPU changes.

This code is intentionally *NOT* hoisted into
crash_handle_hotplug_event() as it would prevent the arch-specific
handler from running for CPU changes. This would break PPC, for
example, which needs to update other information besides the
elfcorehdr, on CPU changes.

Signed-off-by: Eric DeVolder 
Reviewed-by: Sourabh Jain 
Acked-by: Hari Bathini 
Acked-by: Baoquan He 
---
 arch/x86/kernel/crash.c | 10 ++
 1 file changed, 10 insertions(+)

diff --git a/arch/x86/kernel/crash.c b/arch/x86/kernel/crash.c
index caf22bcb61af..18d2a18d1073 100644
--- a/arch/x86/kernel/crash.c
+++ b/arch/x86/kernel/crash.c
@@ -467,6 +467,16 @@ void arch_crash_handle_hotplug_event(struct kimage *image)
unsigned long mem, memsz;
unsigned long elfsz = 0;
 
+   /*
+* As crash_prepare_elf64_headers() has already described all
+* possible CPUs, there is no need to update the elfcorehdr
+* for additional CPU changes.
+*/
+   if ((image->file_mode || image->elfcorehdr_updated) &&
+   ((image->hp_action == KEXEC_CRASH_HP_ADD_CPU) ||
+   (image->hp_action == KEXEC_CRASH_HP_REMOVE_CPU)))
+   return;
+
/*
 * Create the new elfcorehdr reflecting the changes to CPU and/or
 * memory resources.
-- 
2.31.1




[PATCH v27 7/8] crash: change crash_prepare_elf64_headers() to for_each_possible_cpu()

2023-08-11 Thread Eric DeVolder
The function crash_prepare_elf64_headers() generates the elfcorehdr
which describes the CPUs and memory in the system for the crash kernel.
In particular, it writes out ELF PT_NOTEs for memory regions and the
CPUs in the system.

With respect to the CPUs, the current implementation utilizes
for_each_present_cpu() which means that as CPUs are added and removed,
the elfcorehdr must again be updated to reflect the new set of CPUs.

The reasoning behind the move to use for_each_possible_cpu(), is:

- At kernel boot time, all percpu crash_notes are allocated for all
  possible CPUs; that is, crash_notes are not allocated dynamically
  when CPUs are plugged/unplugged. Thus the crash_notes for each
  possible CPU are always available.

- The crash_prepare_elf64_headers() creates an ELF PT_NOTE per CPU.
  Changing to for_each_possible_cpu() is valid as the crash_notes
  pointed to by each CPU PT_NOTE are present and always valid.

Furthermore, examining a common crash processing path of:

 kernel panic -> crash kernel -> makedumpfile -> 'crash' analyzer
                 elfcorehdr      /proc/vmcore    vmcore

reveals how the ELF CPU PT_NOTEs are utilized:

- Upon panic, each CPU is sent an IPI and shuts itself down, recording
 its state in its crash_notes. When all CPUs are shutdown, the
 crash kernel is launched with a pointer to the elfcorehdr.

- The crash kernel via linux/fs/proc/vmcore.c does not examine or
 use the contents of the PT_NOTEs, it exposes them via /proc/vmcore.

- The makedumpfile utility uses /proc/vmcore and reads the CPU
 PT_NOTEs to craft a nr_cpus variable, which is reported in a
 header but otherwise generally unused. Makedumpfile creates the
 vmcore.

- The 'crash' dump analyzer does not appear to reference the CPU
 PT_NOTEs. Instead it looks up the cpu_[possible|present|online]_mask
 symbols and directly examines those structure contents from vmcore
 memory. From that information it is able to determine which CPUs
 are present and online, and locate the corresponding crash_notes.
 Said differently, it appears that 'crash' analyzer does not rely
 on the ELF PT_NOTEs for CPUs; rather it obtains the information
 directly via kernel symbols and the memory within the vmcore.

(There may be other vmcore generating and analysis tools that do use
these PT_NOTEs, but 'makedumpfile' and 'crash' seem to be the most
common solutions.)

This results in the benefit of having all CPUs described in the
elfcorehdr, and therefore reducing the need to re-generate the
elfcorehdr on CPU changes, at the small expense of an additional
56 bytes per PT_NOTE for not-present-but-possible CPUs.

On systems where kexec_file_load() syscall is utilized, all the above
is valid. On systems where kexec_load() syscall is utilized, there
may be the need for the elfcorehdr to be regenerated once. The reason
being that some archs only populate the 'present' CPUs from the
/sys/devices/system/cpus entries, which the userspace 'kexec' utility
uses to generate the userspace-supplied elfcorehdr. In this situation,
one memory or CPU change will rewrite the elfcorehdr via the
crash_prepare_elf64_headers() function and now all possible CPUs will
be described, just as with kexec_file_load() syscall.

Suggested-by: Sourabh Jain 
Signed-off-by: Eric DeVolder 
Reviewed-by: Sourabh Jain 
Acked-by: Hari Bathini 
Acked-by: Baoquan He 
---
 kernel/crash_core.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/kernel/crash_core.c b/kernel/crash_core.c
index fa918176d46d..7378b501fada 100644
--- a/kernel/crash_core.c
+++ b/kernel/crash_core.c
@@ -364,8 +364,8 @@ int crash_prepare_elf64_headers(struct crash_mem *mem, int 
need_kernel_map,
ehdr->e_ehsize = sizeof(Elf64_Ehdr);
ehdr->e_phentsize = sizeof(Elf64_Phdr);
 
-   /* Prepare one phdr of type PT_NOTE for each present CPU */
-   for_each_present_cpu(cpu) {
+   /* Prepare one phdr of type PT_NOTE for each possible CPU */
+   for_each_possible_cpu(cpu) {
phdr->p_type = PT_NOTE;
notes_addr = per_cpu_ptr_to_phys(per_cpu_ptr(crash_notes, cpu));
phdr->p_offset = phdr->p_paddr = notes_addr;
-- 
2.31.1




[PATCH v27 5/8] x86/crash: add x86 crash hotplug support

2023-08-11 Thread Eric DeVolder
When CPU or memory is hot un/plugged, or off/onlined, the crash
elfcorehdr, which describes the CPUs and memory in the system,
must also be updated.

A new elfcorehdr is generated from the available CPUs and memory
and replaces the existing elfcorehdr. The segment containing the
elfcorehdr is identified at run-time in
crash_core:crash_handle_hotplug_event().

No modifications to purgatory (see 'kexec: exclude elfcorehdr
from the segment digest') or boot_params (as the elfcorehdr=
capture kernel command line parameter pointer remains unchanged
and correct) are needed, just elfcorehdr.

For kexec_file_load(), the elfcorehdr segment size is based on
NR_CPUS and CRASH_MAX_MEMORY_RANGES in order to accommodate a
growing number of CPU and memory resources.

For kexec_load(), the userspace kexec utility needs to size the
elfcorehdr segment in the same/similar manner.
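As a rough illustration of that sizing (mirroring the layout produced by
crash_prepare_elf64_headers(): one PT_NOTE per possible CPU, one PT_NOTE for
vmcoreinfo, one PT_LOAD for the kernel text mapping and one PT_LOAD per memory
range; the helper and parameter names below are mine, not patch code):

#include <elf.h>

#define ELF_CORE_HEADER_ALIGN   4096
#define ALIGN_UP(x, a)          (((x) + (a) - 1) & ~((unsigned long)(a) - 1))

/* Worst-case elfcorehdr buffer size for nr_cpus CPUs and max_mem_ranges regions. */
static unsigned long elfcorehdr_size_estimate(unsigned long nr_cpus,
                                              unsigned long max_mem_ranges)
{
        /* CPU PT_NOTEs + vmcoreinfo PT_NOTE + kernel text PT_LOAD + memory PT_LOADs */
        unsigned long nr_phdr = nr_cpus + 1 + 1 + max_mem_ranges;
        unsigned long sz = sizeof(Elf64_Ehdr) + nr_phdr * sizeof(Elf64_Phdr);

        return ALIGN_UP(sz, ELF_CORE_HEADER_ALIGN);
}

Plugging NR_CPUS and CRASH_MAX_MEMORY_RANGES into nr_cpus and max_mem_ranges
gives the kind of worst-case buffer the kexec_file_load() path reserves, and
what a kexec_load() userspace would want to reserve as well.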

To accommodate kexec_load() syscall in the absence of
kexec_file_load() syscall support, prepare_elf_headers() and
dependents are moved outside of CONFIG_KEXEC_FILE.

Signed-off-by: Eric DeVolder 
Reviewed-by: Sourabh Jain 
Acked-by: Hari Bathini 
Acked-by: Baoquan He 
---
 arch/x86/Kconfig |   3 +
 arch/x86/include/asm/kexec.h |  15 +
 arch/x86/kernel/crash.c  | 103 ---
 3 files changed, 114 insertions(+), 7 deletions(-)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 7082fc10b346..ffc95c3d6abd 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -2069,6 +2069,9 @@ config ARCH_SUPPORTS_KEXEC_JUMP
 config ARCH_SUPPORTS_CRASH_DUMP
def_bool X86_64 || (X86_32 && HIGHMEM)
 
+config ARCH_SUPPORTS_CRASH_HOTPLUG
+   def_bool y
+
 config PHYSICAL_START
hex "Physical address where the kernel is loaded" if (EXPERT || 
CRASH_DUMP)
default "0x100"
diff --git a/arch/x86/include/asm/kexec.h b/arch/x86/include/asm/kexec.h
index 5b77bbc28f96..9143100ea3ea 100644
--- a/arch/x86/include/asm/kexec.h
+++ b/arch/x86/include/asm/kexec.h
@@ -209,6 +209,21 @@ typedef void crash_vmclear_fn(void);
 extern crash_vmclear_fn __rcu *crash_vmclear_loaded_vmcss;
 extern void kdump_nmi_shootdown_cpus(void);
 
+#ifdef CONFIG_CRASH_HOTPLUG
+void arch_crash_handle_hotplug_event(struct kimage *image);
+#define arch_crash_handle_hotplug_event arch_crash_handle_hotplug_event
+
+#ifdef CONFIG_HOTPLUG_CPU
+static inline int crash_hotplug_cpu_support(void) { return 1; }
+#define crash_hotplug_cpu_support crash_hotplug_cpu_support
+#endif
+
+#ifdef CONFIG_MEMORY_HOTPLUG
+static inline int crash_hotplug_memory_support(void) { return 1; }
+#define crash_hotplug_memory_support crash_hotplug_memory_support
+#endif
+#endif
+
 #endif /* __ASSEMBLY__ */
 
 #endif /* _ASM_X86_KEXEC_H */
diff --git a/arch/x86/kernel/crash.c b/arch/x86/kernel/crash.c
index cdd92ab43cda..c70a111c44fa 100644
--- a/arch/x86/kernel/crash.c
+++ b/arch/x86/kernel/crash.c
@@ -158,8 +158,6 @@ void native_machine_crash_shutdown(struct pt_regs *regs)
crash_save_cpu(regs, safe_smp_processor_id());
 }
 
-#ifdef CONFIG_KEXEC_FILE
-
 static int get_nr_ram_ranges_callback(struct resource *res, void *arg)
 {
unsigned int *nr_ranges = arg;
@@ -231,7 +229,7 @@ static int prepare_elf64_ram_headers_callback(struct 
resource *res, void *arg)
 
 /* Prepare elf headers. Return addr and size */
 static int prepare_elf_headers(struct kimage *image, void **addr,
-   unsigned long *sz)
+   unsigned long *sz, unsigned long 
*nr_mem_ranges)
 {
struct crash_mem *cmem;
int ret;
@@ -249,6 +247,9 @@ static int prepare_elf_headers(struct kimage *image, void 
**addr,
if (ret)
goto out;
 
+   /* Return the computed number of memory ranges, for hotplug usage */
+   *nr_mem_ranges = cmem->nr_ranges;
+
/* By default prepare 64bit headers */
ret =  crash_prepare_elf64_headers(cmem, IS_ENABLED(CONFIG_X86_64), 
addr, sz);
 
@@ -257,6 +258,7 @@ static int prepare_elf_headers(struct kimage *image, void 
**addr,
return ret;
 }
 
+#ifdef CONFIG_KEXEC_FILE
 static int add_e820_entry(struct boot_params *params, struct e820_entry *entry)
 {
unsigned int nr_e820_entries;
@@ -371,18 +373,42 @@ int crash_setup_memmap_entries(struct kimage *image, 
struct boot_params *params)
 int crash_load_segments(struct kimage *image)
 {
int ret;
+   unsigned long pnum = 0;
struct kexec_buf kbuf = { .image = image, .buf_min = 0,
  .buf_max = ULONG_MAX, .top_down = false };
 
/* Prepare elf headers and add a segment */
-   ret = prepare_elf_headers(image, &kbuf.buffer, &kbuf.bufsz);
+   ret = prepare_elf_headers(image, &kbuf.buffer, &kbuf.bufsz, &pnum);
if (ret)
return ret;
 
-   image->elf_headers = kbuf.buffer;
-   image->elf_headers_sz = kbuf.bufsz;
+   image->elf_headers  = kbuf.buffer;
+   image->elf_headers_sz   = kbuf.bufsz;
+   kbuf.memsz  = 

[PATCH v27 3/8] kexec: exclude elfcorehdr from the segment digest

2023-08-11 Thread Eric DeVolder
When a crash kernel is loaded via the kexec_file_load() syscall, the
kernel places the various segments (ie crash kernel, crash initrd,
boot_params, elfcorehdr, purgatory, etc) in memory. For those
architectures that utilize purgatory, a hash digest of the segments
is calculated for integrity checking. The digest is embedded into
the purgatory image prior to placing in memory.

Updates to the elfcorehdr in response to CPU and memory changes
would cause the purgatory integrity checking to fail (at crash time,
and no vmcore created). Therefore, the elfcorehdr segment is
explicitly excluded from the purgatory digest, enabling updates to
the elfcorehdr while also avoiding the need to recompute the hash
digest and reload purgatory.

Suggested-by: Baoquan He 
Signed-off-by: Eric DeVolder 
Reviewed-by: Sourabh Jain 
Acked-by: Hari Bathini 
Acked-by: Baoquan He 
---
 kernel/kexec_file.c | 6 ++
 1 file changed, 6 insertions(+)

diff --git a/kernel/kexec_file.c b/kernel/kexec_file.c
index 453b7a513540..e2ec9d7b9a1f 100644
--- a/kernel/kexec_file.c
+++ b/kernel/kexec_file.c
@@ -726,6 +726,12 @@ static int kexec_calculate_store_digests(struct kimage 
*image)
for (j = i = 0; i < image->nr_segments; i++) {
struct kexec_segment *ksegment;
 
+#ifdef CONFIG_CRASH_HOTPLUG
+   /* Exclude elfcorehdr segment to allow future changes via 
hotplug */
+   if (j == image->elfcorehdr_index)
+   continue;
+#endif
+
ksegment = &image->segment[i];
/*
 * Skip purgatory as it will be modified once we put digest
-- 
2.31.1




[PATCH v27 4/8] crash: memory and CPU hotplug sysfs attributes

2023-08-11 Thread Eric DeVolder
Introduce the crash_hotplug attribute for memory and CPUs for
use by userspace.  These attributes directly facilitate the udev
rule for managing userspace re-loading of the crash kernel upon
hot un/plug changes.

For memory, expose the crash_hotplug attribute to the
/sys/devices/system/memory directory. For example:

 # udevadm info --attribute-walk /sys/devices/system/memory/memory81
  looking at device '/devices/system/memory/memory81':
KERNEL=="memory81"
SUBSYSTEM=="memory"
DRIVER==""
ATTR{online}=="1"
ATTR{phys_device}=="0"
ATTR{phys_index}=="0051"
ATTR{removable}=="1"
ATTR{state}=="online"
ATTR{valid_zones}=="Movable"

  looking at parent device '/devices/system/memory':
KERNELS=="memory"
SUBSYSTEMS==""
DRIVERS==""
ATTRS{auto_online_blocks}=="offline"
ATTRS{block_size_bytes}=="800"
ATTRS{crash_hotplug}=="1"

For CPUs, expose the crash_hotplug attribute to the
/sys/devices/system/cpu directory. For example:

 # udevadm info --attribute-walk /sys/devices/system/cpu/cpu0
  looking at device '/devices/system/cpu/cpu0':
KERNEL=="cpu0"
SUBSYSTEM=="cpu"
DRIVER=="processor"
ATTR{crash_notes}=="277c38600"
ATTR{crash_notes_size}=="368"
ATTR{online}=="1"

  looking at parent device '/devices/system/cpu':
KERNELS=="cpu"
SUBSYSTEMS==""
DRIVERS==""
ATTRS{crash_hotplug}=="1"
ATTRS{isolated}==""
ATTRS{kernel_max}=="8191"
ATTRS{nohz_full}=="  (null)"
ATTRS{offline}=="4-7"
ATTRS{online}=="0-3"
ATTRS{possible}=="0-7"
ATTRS{present}=="0-3"

With these sysfs attributes in place, it is possible to efficiently
instruct the udev rule to skip crash kernel reloading for kernels
configured with crash hotplug support.

For example, the following is the proposed udev rule change for RHEL
system 98-kexec.rules (as the first lines of the rule file):

 # The kernel updates the crash elfcorehdr for CPU and memory changes
 SUBSYSTEM=="cpu", ATTRS{crash_hotplug}=="1", GOTO="kdump_reload_end"
 SUBSYSTEM=="memory", ATTRS{crash_hotplug}=="1", GOTO="kdump_reload_end"

When examined in the context of 98-kexec.rules, the above rules
test if crash_hotplug is set, and if so, the userspace initiated
unload-then-reload of the crash kernel is skipped.

CPU and memory checks are separated in accordance with
CONFIG_HOTPLUG_CPU and CONFIG_MEMORY_HOTPLUG kernel config options.
If an architecture supports, for example, memory hotplug but not
CPU hotplug, then the /sys/devices/system/memory/crash_hotplug
attribute file is present, but the /sys/devices/system/cpu/crash_hotplug
attribute file will NOT be present. Thus the udev rule skips
userspace processing of memory hot un/plug events, but the udev
rule will evaluate false for CPU events, thus allowing userspace to
process CPU hot un/plug events (ie the unload-then-reload of the kdump
capture kernel).
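For illustration, exposing such a read-only attribute can look roughly like the
sketch below (it uses the crash_hotplug_cpu_support() helper introduced
elsewhere in this series; the actual drivers/base/cpu.c change wires the
attribute into the existing attribute groups via .is_visible(), so this is a
simplification):

/* Illustrative sketch only; the real drivers/base/cpu.c change differs in detail. */
#include <linux/device.h>
#include <linux/sysfs.h>
#include <linux/kexec.h>        /* crash_hotplug_cpu_support() */

static ssize_t crash_hotplug_show(struct device *dev,
                                  struct device_attribute *attr, char *buf)
{
        /* 1 if the kernel updates the crash elfcorehdr itself on CPU hot un/plug */
        return sysfs_emit(buf, "%d\n", crash_hotplug_cpu_support());
}
static DEVICE_ATTR_RO(crash_hotplug);

/* dev_attr_crash_hotplug would then be added to the cpu subsystem attribute group. */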

Signed-off-by: Eric DeVolder 
Reviewed-by: Sourabh Jain 
Acked-by: Hari Bathini 
Acked-by: Baoquan He 
---
 Documentation/ABI/testing/sysfs-devices-memory |  8 
 .../ABI/testing/sysfs-devices-system-cpu   |  8 
 .../admin-guide/mm/memory-hotplug.rst  |  8 
 Documentation/core-api/cpu_hotplug.rst | 18 ++
 drivers/base/cpu.c | 13 +
 drivers/base/memory.c  | 13 +
 include/linux/kexec.h  |  8 
 7 files changed, 76 insertions(+)

diff --git a/Documentation/ABI/testing/sysfs-devices-memory 
b/Documentation/ABI/testing/sysfs-devices-memory
index d8b0f80b9e33..a95e0f17c35a 100644
--- a/Documentation/ABI/testing/sysfs-devices-memory
+++ b/Documentation/ABI/testing/sysfs-devices-memory
@@ -110,3 +110,11 @@ Description:
link is created for memory section 9 on node0.
 
/sys/devices/system/node/node0/memory9 -> ../../memory/memory9
+
+What:  /sys/devices/system/memory/crash_hotplug
+Date:  Aug 2023
+Contact:   Linux kernel mailing list 
+Description:
+   (RO) indicates whether or not the kernel directly supports
+   modifying the crash elfcorehdr for memory hot un/plug and/or
+   on/offline changes.
diff --git a/Documentation/ABI/testing/sysfs-devices-system-cpu 
b/Documentation/ABI/testing/sysfs-devices-system-cpu
index 77942eedf4f6..b52564de2b18 100644
--- a/Documentation/ABI/testing/sysfs-devices-system-cpu
+++ b/Documentation/ABI/testing/sysfs-devices-system-cpu
@@ -687,3 +687,11 @@ Description:
(RO) the list of CPUs that are isolated and don't
participate in load balancing. These CPUs are set by
boot parameter "isolcpus=".
+
+What:  /sys/devices/system/cpu/crash_hotplug
+Date:  Aug 2023
+Contact:   Linux kernel mailing list 
+Description:
+   (RO) indicates whether or not the kernel directly supports
+   modifying the crash elfcorehdr 

[PATCH v27 1/8] crash: move a few code bits to setup support of crash hotplug

2023-08-11 Thread Eric DeVolder
The crash hotplug support leans on the work for the kexec_file_load()
syscall. To also support the kexec_load() syscall, a few bits of code
need to be move outside of CONFIG_KEXEC_FILE. As such, these bits are
moved out of kexec_file.c and into a common location crash_core.c.

No functionality change intended.

Signed-off-by: Eric DeVolder 
Reviewed-by: Sourabh Jain 
Acked-by: Hari Bathini 
Acked-by: Baoquan He 
---
 include/linux/kexec.h |  30 +++
 kernel/crash_core.c   | 182 ++
 kernel/kexec_file.c   | 181 -
 3 files changed, 197 insertions(+), 196 deletions(-)

diff --git a/include/linux/kexec.h b/include/linux/kexec.h
index 22b5cd24f581..811a90e09698 100644
--- a/include/linux/kexec.h
+++ b/include/linux/kexec.h
@@ -105,6 +105,21 @@ struct compat_kexec_segment {
 };
 #endif
 
+/* Alignment required for elf header segment */
+#define ELF_CORE_HEADER_ALIGN   4096
+
+struct crash_mem {
+   unsigned int max_nr_ranges;
+   unsigned int nr_ranges;
+   struct range ranges[];
+};
+
+extern int crash_exclude_mem_range(struct crash_mem *mem,
+  unsigned long long mstart,
+  unsigned long long mend);
+extern int crash_prepare_elf64_headers(struct crash_mem *mem, int 
need_kernel_map,
+  void **addr, unsigned long *sz);
+
 #ifdef CONFIG_KEXEC_FILE
 struct purgatory_info {
/*
@@ -230,21 +245,6 @@ static inline int arch_kexec_locate_mem_hole(struct 
kexec_buf *kbuf)
 }
 #endif
 
-/* Alignment required for elf header segment */
-#define ELF_CORE_HEADER_ALIGN   4096
-
-struct crash_mem {
-   unsigned int max_nr_ranges;
-   unsigned int nr_ranges;
-   struct range ranges[];
-};
-
-extern int crash_exclude_mem_range(struct crash_mem *mem,
-  unsigned long long mstart,
-  unsigned long long mend);
-extern int crash_prepare_elf64_headers(struct crash_mem *mem, int 
need_kernel_map,
-  void **addr, unsigned long *sz);
-
 #ifndef arch_kexec_apply_relocations_add
 /*
  * arch_kexec_apply_relocations_add - apply relocations of type RELA
diff --git a/kernel/crash_core.c b/kernel/crash_core.c
index 90ce1dfd591c..b7c30b748a16 100644
--- a/kernel/crash_core.c
+++ b/kernel/crash_core.c
@@ -10,6 +10,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include 
 #include 
@@ -314,6 +315,187 @@ static int __init parse_crashkernel_dummy(char *arg)
 }
 early_param("crashkernel", parse_crashkernel_dummy);
 
+int crash_prepare_elf64_headers(struct crash_mem *mem, int need_kernel_map,
+ void **addr, unsigned long *sz)
+{
+   Elf64_Ehdr *ehdr;
+   Elf64_Phdr *phdr;
+   unsigned long nr_cpus = num_possible_cpus(), nr_phdr, elf_sz;
+   unsigned char *buf;
+   unsigned int cpu, i;
+   unsigned long long notes_addr;
+   unsigned long mstart, mend;
+
+   /* extra phdr for vmcoreinfo ELF note */
+   nr_phdr = nr_cpus + 1;
+   nr_phdr += mem->nr_ranges;
+
+   /*
+* kexec-tools creates an extra PT_LOAD phdr for kernel text mapping
+* area (for example, 8000 - a000 on x86_64).
+* I think this is required by tools like gdb. So same physical
+* memory will be mapped in two ELF headers. One will contain kernel
+* text virtual addresses and other will have __va(physical) addresses.
+*/
+
+   nr_phdr++;
+   elf_sz = sizeof(Elf64_Ehdr) + nr_phdr * sizeof(Elf64_Phdr);
+   elf_sz = ALIGN(elf_sz, ELF_CORE_HEADER_ALIGN);
+
+   buf = vzalloc(elf_sz);
+   if (!buf)
+   return -ENOMEM;
+
+   ehdr = (Elf64_Ehdr *)buf;
+   phdr = (Elf64_Phdr *)(ehdr + 1);
+   memcpy(ehdr->e_ident, ELFMAG, SELFMAG);
+   ehdr->e_ident[EI_CLASS] = ELFCLASS64;
+   ehdr->e_ident[EI_DATA] = ELFDATA2LSB;
+   ehdr->e_ident[EI_VERSION] = EV_CURRENT;
+   ehdr->e_ident[EI_OSABI] = ELF_OSABI;
+   memset(ehdr->e_ident + EI_PAD, 0, EI_NIDENT - EI_PAD);
+   ehdr->e_type = ET_CORE;
+   ehdr->e_machine = ELF_ARCH;
+   ehdr->e_version = EV_CURRENT;
+   ehdr->e_phoff = sizeof(Elf64_Ehdr);
+   ehdr->e_ehsize = sizeof(Elf64_Ehdr);
+   ehdr->e_phentsize = sizeof(Elf64_Phdr);
+
+   /* Prepare one phdr of type PT_NOTE for each present CPU */
+   for_each_present_cpu(cpu) {
+   phdr->p_type = PT_NOTE;
+   notes_addr = per_cpu_ptr_to_phys(per_cpu_ptr(crash_notes, cpu));
+   phdr->p_offset = phdr->p_paddr = notes_addr;
+   phdr->p_filesz = phdr->p_memsz = sizeof(note_buf_t);
+   (ehdr->e_phnum)++;
+   phdr++;
+   }
+
+   /* Prepare one PT_NOTE header for vmcoreinfo */
+   phdr->p_type = PT_NOTE;
+   phdr->p_offset = phdr->p_paddr = paddr_vmcoreinfo_note();
+   

[PATCH v27 6/8] crash: hotplug support for kexec_load()

2023-08-11 Thread Eric DeVolder
The hotplug support for kexec_load() requires changes to the
userspace kexec-tools and a little extra help from the kernel.

Given a kdump capture kernel loaded via kexec_load(), and a
subsequent hotplug event, the crash hotplug handler finds the
elfcorehdr and rewrites it to reflect the hotplug change.
That is the desired outcome, however, at kernel panic time,
the purgatory integrity check fails (because the elfcorehdr
changed), and the capture kernel does not boot and no vmcore
is generated.

Therefore, the userspace kexec-tools/kexec must indicate to the
kernel that the elfcorehdr can be modified (because the kexec
excluded the elfcorehdr from the digest, and sized the elfcorehdr
memory buffer appropriately).

To facilitate hotplug support with kexec_load():
 - a new kexec flag KEXEC_UPDATE_ELFCOREHDR indicates that it is
   safe for the kernel to modify the kexec_load()'d elfcorehdr
 - the /sys/kernel/crash_elfcorehdr_size node communicates the
   preferred size of the elfcorehdr memory buffer
 - The sysfs crash_hotplug nodes (ie.
   /sys/devices/system/[cpu|memory]/crash_hotplug) dynamically
   take into account kexec_file_load() vs kexec_load() and
   KEXEC_UPDATE_ELFCOREHDR.
   This is critical so that the udev rule processing of crash_hotplug
   is all that is needed to determine if the userspace unload-then-load
   of the kdump image is to be skipped, or not. The proposed udev
   rule change looks like:
   # The kernel updates the crash elfcorehdr for CPU and memory changes
   SUBSYSTEM=="cpu", ATTRS{crash_hotplug}=="1", GOTO="kdump_reload_end"
   SUBSYSTEM=="memory", ATTRS{crash_hotplug}=="1", GOTO="kdump_reload_end"

The table below indicates the behavior of kexec_load()'d kdump image
updates (with the new udev crash_hotplug rule in place):

 Kernel |     Kexec
 -------+-----+-----
  Old   | Old | New
        |  a  |  a
 -------+-----+-----
  New   |  a  |  b
 -------+-----+-----

where kexec 'old' and 'new' delineate kexec-tools has the needed
modifications for the crash hotplug feature, and kernel 'old' and
'new' delineate the kernel supports this crash hotplug feature.

Behavior 'a' indicates the unload-then-reload of the entire kdump
image. For the kexec 'old' column, the unload-then-reload occurs
due to the missing flag KEXEC_UPDATE_ELFCOREHDR. An 'old' kernel
(with 'new' kexec) does not present the crash_hotplug sysfs node,
which leads to the unload-then-reload of the kdump image.

Behavior 'b' indicates the desired optimized behavior of the kernel
directly modifying the elfcorehdr and avoiding the unload-then-reload
of the kdump image.

If the udev rule is not updated with crash_hotplug node check, then
no matter any combination of kernel or kexec is new or old, the
kdump image continues to be unload-then-reload on hotplug changes.

To fully support crash hotplug feature, there needs to be a rollout
of kernel, kexec-tools and udev rule changes. However, the order of
the rollout of these pieces does not matter; kexec_load()'d kdump
images still function for hotplug as-is.
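For illustration, the user-space side could look roughly like the sketch below.
It assumes uapi headers from a kernel carrying this series (which provide
KEXEC_UPDATE_ELFCOREHDR); the segment setup is elided and the helper names are
mine, not actual kexec-tools code:

#include <stdio.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <linux/kexec.h>        /* struct kexec_segment, KEXEC_* flags */

/* Read the kernel's preferred elfcorehdr buffer size; 0 if not advertised. */
static unsigned long read_elfcorehdr_size(void)
{
        unsigned long sz = 0;
        FILE *f = fopen("/sys/kernel/crash_elfcorehdr_size", "r");

        if (f) {
                if (fscanf(f, "%lu", &sz) != 1)
                        sz = 0;
                fclose(f);
        }
        return sz;
}

/* Load the kdump image, telling the kernel it may rewrite the elfcorehdr. */
static long load_crash_kernel(unsigned long entry,
                              struct kexec_segment *segments,
                              unsigned long nr_segments)
{
        unsigned long flags = KEXEC_ON_CRASH | KEXEC_UPDATE_ELFCOREHDR;

        /* The elfcorehdr segment should be sized per read_elfcorehdr_size(). */
        return syscall(SYS_kexec_load, entry, nr_segments, segments, flags);
}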

Suggested-by: Hari Bathini 
Signed-off-by: Eric DeVolder 
Acked-by: Hari Bathini 
Acked-by: Baoquan He 
---
 arch/x86/include/asm/kexec.h | 11 +++
 arch/x86/kernel/crash.c  | 27 +++
 include/linux/kexec.h| 14 --
 include/uapi/linux/kexec.h   |  1 +
 kernel/Kconfig.kexec |  4 
 kernel/crash_core.c  | 31 +++
 kernel/kexec.c   |  5 +
 kernel/ksysfs.c  | 15 +++
 8 files changed, 102 insertions(+), 6 deletions(-)

diff --git a/arch/x86/include/asm/kexec.h b/arch/x86/include/asm/kexec.h
index 9143100ea3ea..3be6a98751f0 100644
--- a/arch/x86/include/asm/kexec.h
+++ b/arch/x86/include/asm/kexec.h
@@ -214,14 +214,17 @@ void arch_crash_handle_hotplug_event(struct kimage 
*image);
 #define arch_crash_handle_hotplug_event arch_crash_handle_hotplug_event
 
 #ifdef CONFIG_HOTPLUG_CPU
-static inline int crash_hotplug_cpu_support(void) { return 1; }
-#define crash_hotplug_cpu_support crash_hotplug_cpu_support
+int arch_crash_hotplug_cpu_support(void);
+#define crash_hotplug_cpu_support arch_crash_hotplug_cpu_support
 #endif
 
 #ifdef CONFIG_MEMORY_HOTPLUG
-static inline int crash_hotplug_memory_support(void) { return 1; }
-#define crash_hotplug_memory_support crash_hotplug_memory_support
+int arch_crash_hotplug_memory_support(void);
+#define crash_hotplug_memory_support arch_crash_hotplug_memory_support
 #endif
+
+unsigned int arch_crash_get_elfcorehdr_size(void);
+#define crash_get_elfcorehdr_size arch_crash_get_elfcorehdr_size
 #endif
 
 #endif /* __ASSEMBLY__ */
diff --git a/arch/x86/kernel/crash.c b/arch/x86/kernel/crash.c
index c70a111c44fa..caf22bcb61af 100644
--- a/arch/x86/kernel/crash.c
+++ b/arch/x86/kernel/crash.c
@@ -427,6 +427,33 @@ int crash_load_segments(struct kimage *image)
 #undef pr_fmt
 #define pr_fmt(fmt) "crash hp: " fmt
 
+/* These functions provide the value for the sysfs 

Re: [RFC] IMA Log Snapshotting Design Proposal

2023-08-11 Thread Tushar Sugandhi



On 8/10/23 07:12, Stefan Berger wrote:



On 8/9/23 21:15, Tushar Sugandhi wrote:

Thanks a lot Stefan for looking into this proposal,
and providing your feedback. We really appreciate it.

On 8/7/23 15:49, Stefan Berger wrote:



On 8/1/23 17:21, James Bottomley wrote:

On Tue, 2023-08-01 at 12:12 -0700, Sush Shringarputale wrote:
[...]

Truncating IMA log to reclaim memory is not feasible, since it makes
the log go out of sync with the TPM PCR quote making remote
attestation fail.


This assumption isn't entirely true.  It's perfectly possible to shard
an IMA log using two TPM2_Quote's for the beginning and end PCR values
to validate the shard.  The IMA log could be truncated in the same way
(replace the removed part of the log with a TPM2_Quote and AK, so the
log still validates from the beginning quote to the end).

If you use a TPM2_Quote mechanism to save the log, all you need to do
is have the kernel generate the quote with an internal AK.  You can
keep a record of the quote and the AK at the beginning of the truncated
kernel log.  If the truncated entries are saved in a file shard it


The truncation seems dangerous to me. Maybe not all the scenarios with an
attestation client (client = reading logs and quoting) are possible anymore
then, such as starting an attestation client only after truncation; a verifier
must have witnessed the system's PCRs and log state before the truncation
occurred.

You are correct that truncation on its own is dangerous. It needs to be
accompanied by (a) saving the IMA log data to disk as snapshots, (b) adding the
necessary TPM PCR quotes to the current IMA log (as James mentioned above),
(c) attestation clients having the ability to send the past snapshots to the
remote-attestation service (verifiers), and (d) verifiers having the ability
to use the snapshots along with current IMA logs for the purpose of attestation.
All these points are explained in the original RFC email in sections
B.1 through B.5 [1].


I read it.

Maybe you have dismissed the PCR update counter already...
I am not sure what the PCR update counter is supposed to help with. It 
won't allow you to detect
missing log events but rather will confuse anyone looking at it when my 
application extends PCR 12
for example, which also affects the update counter. It's a global 
counter that increases with every
PCR extension (except PCR 16, 21, 22, 23) and if used as proposed would 
prevent any application from

extending PCRs.

https://github.com/stefanberger/libtpms/blob/master/src/tpm2/PCR.c#L667
https://github.com/stefanberger/libtpms/blob/master/src/tpm2/PCR.c#L629
https://github.com/stefanberger/libtpms/blob/master/src/tpm2/PCR.c#L161



Agree with your point about TPM PCR update counter Stefan.
I will bring it up in the update counter patch series discussion [1].

[1] 
https://patchwork.kernel.org/project/linux-integrity/cover/20230801181917.8535-1-tusha...@linux.microsoft.com/ 



The shards will need to be written into some sort of standard location, or a
config file needs to be defined, so that everyone knows where to find them and
how they are named.



We thought about a well-known standard location earlier.
Letting the kernel choose the name/location of the snapshot
file comes with its own complexity. Our initial stance is that we don't
want to handle that at the kernel level, and instead let the UM client choose
the location/naming of the snapshot files. But we are happy to
reconsider if the community requests it.




I think an ima-buf (or similar) log entry in the IMA log would have to appear
at the beginning of the truncated log stating the value of all PCRs that IMA
touched (typically only PCR 10, but it can be others). This needs to be done
since the quote itself doesn't provide the state of the individual PCRs. This
would at least allow an attestation client to re-read the log from the
beginning (when it is restarted or started for the first time after the
truncation). 

  Agreed. See the description of snapshot_aggregate in Section B.5 in the
original RFC email [1].

However, this alone (without the
internal AK quoting the old state) could lead to abuse where I could 
create totally
fake IMA logs stating the state of the PCRs at the beginning (so the 
verifier
syncs its internal PCR state to this state). 

Yes, the PCR quotes sent to the verifier must be signed by the AK that
is trusted by the verifier. That assumption is true regardless of IMA log
snapshotting feature.

Further, even with the AK-quote that
you propose, I may be able to create fake logs and trick a verifier into
trusting the machine IFF it doesn't know what kernel this system was booted
with; I may have hacked that kernel to provide a fake AK-quote that just
happens to match the PCR state presented at the beginning of the log.


If the Kernel is compromised, then all bets are off
(regardless of the IMA log snapshotting feature).
=> Can a truncated log be made safe for attestation when the 
attestation starts

only after the 

Re: [RFC] IMA Log Snapshotting Design Proposal

2023-08-11 Thread Tushar Sugandhi



On 8/10/23 04:43, James Bottomley wrote:

On Wed, 2023-08-09 at 21:43 -0700, Tushar Sugandhi wrote:

On 8/8/23 14:41, James Bottomley wrote:

On Tue, 2023-08-08 at 16:09 -0400, Stefan Berger wrote:

[...]

   at this point doesn't seem necessary since one presumably can
verify the log and PCR states at the end with the 'regular'
quote.
  
I don't understand this.  A regular quote is a signature over PCR

state by an AK.  The point about saving the AK in the log for the
original is that if the *kernel* truncates the log and saves it to
a file, it needs to generate both the AK and the quote for the top
of the file shard. That means the AK/EK binding is unverified, but
can be verified by loading the AK and running the usual tests,
which can only be done if you have the loadable AK, which is why
you need it as part of the log saving proposal.
  
I had this question about the usability of AK/EK in this

context. Although AK/EK + PCR quote is needed to verify the snapshot
shards / IMA logs are not tampered with, I am still not sure why
AK/EK needs to be part of the shard/IMA log. The client sending AK/EK
to attestation service separately would still serve the purpose,
right?


Well, the EK doesn't need to be part of the log: it's just a permanent
part of the TPM identity.  To verify the log, you need access to the
TPM that was used to create it, so that's the point at which you get
the EK.


Agreed. The EK is part of the TPM identity. But to verify the log,
you don't need to have physical access to the TPM. You only need
access to the public part of the EK and the AK/AIK certs (the TPM on the
system would sign the quote using the private AK).
I believe you already know this, just stating it for the sake of
completing the conversation. :)

An AK is simply a TPM generated signing key (meaning the private part
of the key is secured by the TPM and known to no-one else).  In the
literature a TPM generated signing key doesn't become an Attestation
Key until it's been verified using an EK property (either a certify for
a signing EK or a make/activate credential round trip for the more
usual encryption EK).


Yes. That aligns with my understanding of EK/AK in general.
Thanks for the description.


So the proposal is that, for each quote that's used to verify a log
shard, the TPM simply generates a random signing key and uses that to sign

I believe you are suggesting creating a new AK each time you
want to sign a PCR quote. It is doable in TPM 2.0, and it provides
benefits like privacy and untraceability. But it comes with its own
costs: the cost of generating a new AK each time you want to sign,
maintaining the mapping of AKs to their signed quotes, maintaining
multiple public AK certs, etc.


the quote.  You need to save the TPM form of the generated key so it
can be loaded later and the reason for that is you can do the EK
verification at any time after the quote was given by loading the saved
key and running the verification protocol.  In the normal attestation
you do the EK verification of the AK *before* the quote, but there's no
property of the quote that depends on this precedence provided you do
the quote with a TPM generated signing key.

Yes.
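
For reference, the PCR-composite check against the quote boils down to
hashing the selected PCR values in selection order and comparing the
result with the digest carried in the quoted attestation structure; a
sketch, assuming a SHA-256 bank and that the quote blob has already been
parsed and its AK signature verified elsewhere:

```
#include <openssl/sha.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Digest of the selected PCR values, concatenated in selection order. */
static bool pcr_composite_matches(uint8_t pcrs[][SHA256_DIGEST_LENGTH],
				  size_t npcrs,
				  const uint8_t quoted_digest[SHA256_DIGEST_LENGTH])
{
	SHA256_CTX ctx;
	uint8_t digest[SHA256_DIGEST_LENGTH];

	SHA256_Init(&ctx);
	for (size_t i = 0; i < npcrs; i++)
		SHA256_Update(&ctx, pcrs[i], SHA256_DIGEST_LENGTH);
	SHA256_Final(digest, &ctx);

	return memcmp(digest, quoted_digest, sizeof(digest)) == 0;
}

int main(void)
{
	/* Dummy data: one all-zero PCR; recompute the digest and check it. */
	uint8_t pcrs[1][SHA256_DIGEST_LENGTH] = { { 0 } };
	uint8_t expected[SHA256_DIGEST_LENGTH];

	SHA256(pcrs[0], SHA256_DIGEST_LENGTH, expected);
	return pcr_composite_matches(pcrs, 1, expected) ? 0 : 1;
}
```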



The underlying point is that the usual way an EK verifies an AK
requires a remote observer, which the kernel won't have, so the kernel

Agreed.


must do all its stuff locally (generate key, get quote) and then at

I believe the Kernel doesn't have to generate a key while
taking the snapshot. In the current proposal, the Kernel can simply get
the (unsigned) PCR quote and log it in the IMA log as part of the
snapshot_aggregate event. We don't need to sign the quote while
logging it in the IMA log as snapshot_aggregate. And the act of
logging that event in the IMA log extends the PCR bank. Sometime later,
when a remote observer wants to validate the log, it can do so by
comparing against the PCR quote that was signed at that point.


some point later the system can become remote connected and prove to
whatever external entity that the log shard is valid.  So we have to
have all the components necessary for that proof: the log shard, the
quote and the TPM form of the AK.


For instance, PCR quotes will be signed by the AK. So as long as the
verifier trusts the AK/EK,


Right, but if you're sharding a log, the kernel doesn't know if a
verifier has been in contact yet.  The point of the protocol above is
to make that not matter.  The verifier can contact the system after the
log has been saved and the verification will still work.


The Kernel doesn't need to know. And it still doesn't matter.
The benefit of our approach is that the PCR values that represent the
previous snapshot (shard) are now logged in the IMA log as
snapshot_aggregate, and the PCRs are extended again as part of
logging that event in the IMA log.


  it can verify the quotes are not tampered with.
Replaying the IMA log/snapshot produces the PCR values, which can be
matched against the signed PCR quotes. If they match, the verifier can
conclude that the IMA log is

RE: kexec reports "Cannot get kernel _text symbol address" on arm64 platform

2023-08-11 Thread Pandey, Radhey Shyam
> -Original Message-
> From: b...@redhat.com 
> Sent: Wednesday, August 9, 2023 7:42 AM
> To: Pandey, Radhey Shyam ;
> pi...@redhat.com
> Cc: kexec@lists.infradead.org; linux-ker...@vger.kernel.org
> Subject: Re: kexec reports "Cannot get kernel _text symbol address" on
> arm64 platform
> 
> On 08/08/23 at 07:17pm, Pandey, Radhey Shyam wrote:
> > Hi,
> >
> > I am trying to bring up kdump on arm64 platform[1]. But I get "Cannot get
> kernel _text symbol address".
> >
> > Is there some Dump-capture kernel config options that I am missing?
> >
> > FYI, copied below complete kexec debug log.
> >
> > [1]: https://www.xilinx.com/products/boards-and-kits/vck190.html
> 
> Your description isn't clear. You saw the printing, then your kdump kernel
> loading succeeded or not?
> 
> If no, have you tried applying Pingfan's patchset and still saw the issue?
> 
> [PATCHv7 0/5] arm64: zboot support
> https://lore.kernel.org/all/20230803024152.11663-1-pi...@redhat.com/T/#u

I was able to proceed further with loading with crash kernel on triggering 
system crash.
echo c > /proc/sysrq-trigger

But when I copy /proc/vmcore it throws memory abort. Also I see size of 
/proc/vmcore really huge (18446603353488633856).
Any possible guess on what could be wrong?


[   80.733523] Starting crashdump kernel...
[   80.737435] Bye!
[0.00] Booting Linux on physical CPU 0x01 [0x410fd083]
[0.00] Linux version 6.5.0-rc4-ge28001fb4e07 (radheys@xhdradheys41) 
(aarch64-xilinx-linux-gcc.real (GCC) 12.2.0, GNU ld (GNU Binutils) 
2.39.0.20220819) #23 SMP Fri Aug 11 16:25:34 IST 2023




xilinx-vck190-20232:/run/media/mmcblk0p1# cat /proc/meminfo | head
MemTotal:2092876 kB
MemFree: 1219928 kB
MemAvailable:1166004 kB
Buffers:  32 kB
Cached:   756952 kB
SwapCached:0 kB
Active: 1480 kB
Inactive:  24164 kB
Active(anon):   1452 kB
Inactive(anon):24160 kB
xilinx-vck190-20232:/run/media/mmcblk0p1# cp /proc/vmcore dump 
[  975.284865] Unable to handle kernel level 3 address size fault at virtual 
address 80008d7cf000
[  975.293871] Mem abort info:
[  975.296669]   ESR = 0x9603
[  975.300425]   EC = 0x25: DABT (current EL), IL = 32 bits
[  975.305738]   SET = 0, FnV = 0
[  975.308788]   EA = 0, S1PTW = 0
[  975.311925]   FSC = 0x03: level 3 address size fault
[  975.316888] Data abort info:
[  975.319763]   ISV = 0, ISS = 0x0003, ISS2 = 0x
[  975.325245]   CM = 0, WnR = 0, TnD = 0, TagAccess = 0
[  975.330292]   GCS = 0, Overlay = 0, DirtyBit = 0, Xs = 0
[  975.335599] swapper pgtable: 4k pages, 48-bit VAs, pgdp=05016ef6b000
[  975.342297] [80008d7cf000] pgd=1501eddfe003, p4d=1501eddfe003, 
pud=1501eddfd003, pmd=15017b695003, pte=00687fff84000703
[  975.354827] Internal error: Oops: 9603 [#4] SMP
[  975.360392] Modules linked in:
3  975.
63440] CBPrUo:a d0c aPID: 664 Comm: cp Tainted: G  D
6.5.0-rc4-ge28001fb4e07 #23
[  975.372822] Hardware name: Xilinx Versal vck190 Eval board revA (DT)
[  975.379165] pstate: a005 (NzCv daif -PAN -UAO -TCO -DIT -SSBS BTYPE=--)
[  975.386119] pc : __memcpy+0x110/0x230
[  975.389783] lr : _copy_to_iter+0x3d8/0x4d0
[  975.393874] sp : 80008dc939a0
[  975.397178] x29: 80008dc939a0 x28: 05013c1bea30 x27: 1000
[  975.404309] x26: 1000 x25: 1000 x24: 80008d7cf000
[  975.411440] x23: 0400 x22: 80008dc93ba0 x21: 1000
[  975.418570] x20:  x19: 1000 x18: 
[  975.425699] x17:  x16:  x15: 0140
[  975.432829] x14: 8500a9919000 x13: 0041 x12: fffef6831000
[  975.439958] x11: 80008d9cf000 x10:  x9 : 
[  975.447088] x8 : 80008d7d x7 : 0501addfd358 x6 : 0401
[  975.454217] x5 : 0501370e9000 x4 : 80008d7d x3 : 
[  975.461346] x2 : 1000 x1 : 80008d7cf000 x0 : 0501370e8000
[  975.468476] Call trace:
[  975.470912]  __memcpy+0x110/0x230
[  975.474221]  copy_oldmem_page+0x70/0xac
[  975.478050]  read_from_oldmem.part.0+0x120/0x188
[  975.482663]  read_vmcore+0x14c/0x238
[  975.486231]  proc_reg_read_iter+0x84/0xd8
[  975.490233]  copy_splice_read+0x160/0x288
[  975.494236]  vfs_splice_read+0xac/0x10c
[  975.498063]  splice_direct_to_actor+0xa4/0x26c
[  975.502498]  do_splice_direct+0x90/0xdc
[  975.506325]  do_sendfile+0x344/0x454
[  975.509892]  __arm64_sys_sendfile64+0x134/0x140
[  975.514415]  invoke_syscall+0x54/0x124
[  975.518157]  el0_svc_common.constprop.0+0xc4/0xe4
[  975.522854]  do_el0_svc+0x38/0x98
[  975.526162]  el0_svc+0x2c/0x84
[  975.529211]  el0t_64_sync_handler+0x100/0x12c
[  975.533562]  el0t_64_sync+0x190/0x194
[  975.537218] Code: cb01000e b4fffc2e eb0201df 540004a3 (a940342c) 
[  975.543302] ---[ end trace  ]---

Re: [RFC] IMA Log Snapshotting Design Proposal

2023-08-11 Thread Mimi Zohar
Hi Sush, Tushar,

On Tue, 2023-08-01 at 12:12 -0700, Sush Shringarputale wrote:
> 
> | A. Problem Statement |
> 
> Depending on the IMA policy, the IMA log can consume a lot of Kernel 
> memory on
> the device.  For instance, the events for the following IMA policy 
> entries may
> need to be measured in certain scenarios, but they can also lead to a 
> verbose
> IMA log when the device is running for a long period of time.
> ┌───┐
> │# PROC_SUPER_MAGIC │
> │measure fsmagic=0x9fa0 │
> │# SYSFS_MAGIC  │
> │measure fsmagic=0x62656572 │
> │# DEBUGFS_MAGIC│
> │measure fsmagic=0x64626720 │
> │# TMPFS_MAGIC  │
> │measure fsmagic=0x01021994 │
> │# RAMFS_MAGIC  │
> │measure fsmagic=0x858458f6 │
> │# SECURITYFS_MAGIC │
> │measure fsmagic=0x73636673 │
> │# OVERLAYFS_MAGIC  │
> │measure fsmagic=0x794c7630 │
> │# log, audit or tmp files  │
> │measure obj_type=var_log_t │
> │measure obj_type=auditd_log_t  │
> │measure obj_type=tmp_t │
> └───┘
> 
> Secondly, certain devices are configured to take Kernel updates using Kexec
> soft-boot.  The IMA log from the previous Kernel gets carried over and the
> Kernel memory consumption problem worsens when such devices undergo multiple
> Kexec soft-boots over a long period of time.
> 
> The above two scenarios can cause IMA log to grow and consume Kernel memory.
> 
> In addition, a large IMA log can add pressure on the network bandwidth when
> the attestation client sends it to remote-attestation-service.
> 
> Truncating IMA log to reclaim memory is not feasible, since it makes the 
> log go
> out of sync with the TPM PCR quote making remote attestation fail.
> 
> A sophisticated solution is required which will help relieve the memory
> pressure on the device and continue supporting remote attestation without
> disruptions.

If the problem is kernel memory, then using a single tmpfs file has
already been proposed [1].  As entries are added to the measurement
list, they are copied to the tmpfs file and removed from kernel memory.
Userspace would still access the measurement list via the existing
securityfs file.

The IMA measurement list is a sequential file, allowing it to be read
from an offset.  How much or how little of the measurement list is read
by the attestation client and sent to the attestation server is up to
the attestation client/server.
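
For example, an attestation client that remembers how far it has read
can resume from a saved offset; a sketch, assuming the usual securityfs
path and leaving the offset bookkeeping and the transfer to the server
up to the client:

```
#include <fcntl.h>
#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>

#define IMA_LIST "/sys/kernel/security/ima/binary_runtime_measurements"

int main(void)
{
	off_t saved_offset = 0;  /* loaded from the client's own state */
	char buf[4096];
	ssize_t n;
	int fd;

	fd = open(IMA_LIST, O_RDONLY);
	if (fd < 0) {
		perror("open");
		return 1;
	}
	if (lseek(fd, saved_offset, SEEK_SET) < 0) {
		perror("lseek");
		return 1;
	}

	/* Only the records added since the last run are read here;
	 * each chunk would be forwarded to the attestation server. */
	while ((n = read(fd, buf, sizeof(buf))) > 0)
		saved_offset += n;

	close(fd);
	return 0;  /* persist saved_offset for the next run */
}
```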

If the problem is not kernel memory, but memory pressure in general,
then instead of a tmpfs file, the measurement list could similarly be
copied to a single persistent file [1].

> 
> ---
> 
> | B. Proposed Solution |
> 
> In this document, we propose an enhancement to the IMA subsystem to improve
> the long-running performance by snapshotting the IMA log, while still
> providing mechanisms to verify its integrity using the PCR quotes.
> 
> The remainder of the document describes details of the proposed solution 
> in the
> following sub-sections.
>   - High-level Work-flow
>   - Snapshot Triggering Mechanism
>   - Design Choices for Storing Snapshots
>   - Attestation-Client and Remote-Attestation-Service Side Changes
>   - Example Walk-through
>   - Open Questions
> ---
> 
> | B.1 High-level Work-flow |
> 
> Pre-requisites:
> - IMA Integrity guarantees are maintained.
> 
> The proposed high level work-flow of IMA log snapshotting is as follows:
> - A user-mode process will trigger the snapshot by opening a file in SysFS
>say /sys/kernel/security/ima/snapshot (referred to as 
> sysk_ima_snapshot_file
>here onwards).

Please fix the mailer so that it doesn't wrap sentences.   Adding blank
lines between bullets would improve readability.

> - The Kernel will get the current TPM PCR values and PCR update counter [2]
>and store them as template data in a new IMA event "snapshot_aggregate".
>This event will be measured by IMA using critical data measurement
>functionality [1].  Recording regular IMA events will be paused while
>"snapshot_aggregate" is being computed using the existing IMA mutex lock.

> - Once the "snapshot_aggregate" is computed and measured in IMA log, the 
> prior
>IMA events will be made available in the sysk_ima_snapshot_file.

> - 

Re: [PATCH V3 01/14] blk-mq: add blk_mq_max_nr_hw_queues()

2023-08-11 Thread Christoph Hellwig
On Thu, Aug 10, 2023 at 08:09:27AM +0800, Ming Lei wrote:
> 1) some archs support 'nr_cpus=1' for kdump kernel, which is fine, since
> num_possible_cpus becomes 1.
> 
> 2) some archs do not support 'nr_cpus=1', and have to rely on
> 'max_cpus=1', so num_possible_cpus isn't changed, and kernel just boots
> with single online cpu. That causes trouble because blk-mq limits single
> queue.

And we need to fix case 2.  We need to drop the is_kdump support, and
if they want to force fewer CPUs they need to make nr_cpus=1 work.




[PATCH 1/2] RISC-V: Use linux,usable-memory-range for crash kernel

2023-08-11 Thread Song Shuai
Now we use "memeory::linux,usable-memory" to indicate the available
memory for the crash kernel.

While booting with UEFI, the crash kernel would use efi.memmap to
re-populate memblock and then first kernel's memory would be corrputed.
Consequently, the /proc/vmcore file failed to create in my local test.

And according to "chosen" dtschema [1], the available memory for the
crash kernel should be held via "chosen::linux,usable-memory-range"
property which will re-cap memblock even after UEFI's re-population.

[1]:
https://github.com/devicetree-org/dt-schema/blob/main/dtschema/schemas/chosen.yaml

Signed-off-by: Song Shuai 
---
 kexec/arch/riscv/kexec-riscv.c | 8 
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/kexec/arch/riscv/kexec-riscv.c b/kexec/arch/riscv/kexec-riscv.c
index fe5dd2d..5aea035 100644
--- a/kexec/arch/riscv/kexec-riscv.c
+++ b/kexec/arch/riscv/kexec-riscv.c
@@ -79,20 +79,20 @@ int load_extra_segments(struct kexec_info *info, uint64_t kernel_base,
 		}
 
 		ret = dtb_add_range_property(&fdt->buf, &fdt->size, start, end,
-					     "memory", "linux,usable-memory");
+					     "chosen", "linux,usable-memory-range");
 		if (ret) {
-			fprintf(stderr, "Couldn't add usable-memory to fdt\n");
+			fprintf(stderr, "Couldn't add usable-memory-range to fdt\n");
 			return ret;
 		}
 
 		max_usable = end;
 	} else {
 		/*
-		 * Make sure we remove elfcorehdr and usable-memory
+		 * Make sure we remove elfcorehdr and usable-memory-range
 		 * when switching from crash kernel to a normal one.
 		 */
 		dtb_delete_property(fdt->buf, "chosen", "linux,elfcorehdr");
-		dtb_delete_property(fdt->buf, "memory", "linux,usable-memory");
+		dtb_delete_property(fdt->buf, "chosen", "linux,usable-memory-range");
 	}
 
 	/* Do we need to include an initrd image ? */
-- 
2.20.1




[PATCH 2/2] RISC-V: Fix the undeclared ‘EM_RISCV’ build failure

2023-08-11 Thread Song Shuai
Use local `elf.h` instead of `linux/elf.h` to fix this build error:

```
kexec/arch/riscv/crashdump-riscv.c:17:13: error: ‘EM_RISCV’ undeclared here (not in a function); did you mean ‘EM_CRIS’?
  .machine = EM_RISCV,
 ^~~~
 EM_CRIS
```

Signed-off-by: Song Shuai 
---
 kexec/arch/riscv/crashdump-riscv.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kexec/arch/riscv/crashdump-riscv.c b/kexec/arch/riscv/crashdump-riscv.c
index 3ed4fe3..336d7a7 100644
--- a/kexec/arch/riscv/crashdump-riscv.c
+++ b/kexec/arch/riscv/crashdump-riscv.c
@@ -1,5 +1,5 @@
 #include 
-#include <linux/elf.h>
+#include <elf.h>
 #include 
 
 #include "kexec.h"
-- 
2.20.1




Re: [PATCH V3 01/14] blk-mq: add blk_mq_max_nr_hw_queues()

2023-08-11 Thread Hari Bathini




On 10/08/23 8:31 am, Baoquan He wrote:

On 08/10/23 at 10:06am, Ming Lei wrote:

On Thu, Aug 10, 2023 at 09:18:27AM +0800, Baoquan He wrote:

On 08/10/23 at 08:09am, Ming Lei wrote:

On Wed, Aug 09, 2023 at 03:44:01PM +0200, Christoph Hellwig wrote:

I'm starting to sound like a broken record, but we can't just do random
is_kdump checks, and it's not going to get better by resending it again and
again.  If kdump kernels limit the number of possible CPUs, it needs to
reflected in cpu_possible_map and we need to use that information.



Can you look at previous kdump/arch guys' comment about kdump usage &
num_possible_cpus?

 
https://lore.kernel.org/linux-block/caf+s44ruqswbosy9kmdx35crviqnxoeuvgnsue75bb0y2jg...@mail.gmail.com/
 https://lore.kernel.org/linux-block/ZKz912KyFQ7q9qwL@MiWiFi-R3L-srv/

The point is that kdump kernels do not limit the number of possible CPUs.

1) some archs support 'nr_cpus=1' for kdump kernel, which is fine, since
num_possible_cpus becomes 1.


Yes, "nr_cpus=" is strongly suggested in kdump kernel because "nr_cpus="
limits the possible cpu numbers, while "maxcpuss=" only limits the cpu
number which can be brought up during bootup. We noticed this diference
because a large number of possible cpus will cost more memory in kdump
kernel. e.g percpu initialization, even though kdump kernel have set
"maxcpus=1".

Currently both x86 and arm64 support "nr_cpus=". Pingfan once spent much
effort on patches to add "nr_cpus=" support to ppc64, but the ppc64
developers and maintainers did not seem to care about it. In the end the
patches were not accepted, and the work was not continued.

Now, I am wondering what the barrier is to adding "nr_cpus=" to the power
arch. Can we reconsider adding 'nr_cpus=' to the power arch since a real
issue occurred in the kdump kernel?


If 'nr_cpus=' can be supported on ppc64, this patchset isn't needed.



As for this patchset, can it be accepted so that no failure is seen in
the kdump kernel on ARCHes w/o "nr_cpus=" support? My personal opinion.


IMO 'nr_cpus=' support should be preferred, given it is annoying to
maintain two kinds of implementation for the kdump kernel from the
driver viewpoint. I guess kdump things could be simplified too by
supporting 'nr_cpus=' only.


Yes, 'nr_cpus=' is ideal. Not sure if there are some underlying concerns
that led the power people to decide not to support it.


Though "nr_cpus=1" is an ideal solution, maintainer was not happy with
the patch as the code changes have impact for regular boot path and
it is likely to cause breakages. So, even if "nr_cpus=1" support for
ppc64 is revived, the change is going to take time to be accepted
upstream.

Also, I see is_kdump_kernel() being used irrespective of "nr_cpus=1"
support for other optimizations in the driver for the special dump
capture environment kdump is.

If there is no other downside to using is_kdump_kernel() in driver code,
other than the maintainability aspect, I think the above changes are
worth considering.

Thanks
Hari



[ANNOUNCE] kexec-tools v2.0.27 preparation

2023-08-11 Thread Simon Horman
Hi all,

I am planning to release kexec-tools v2.0.27 in the next two weeks
to roughly coincide with the release of the v6.5 kernel.

I would like to ask interested parties to send any patches they would like
included in v2.0.27 within one week so they can be considered for inclusion
in an rc release.

For reference the patches queued up since v2.0.26 are as follows.

Thanks to everyone who has contributed to kexec-tools!

f67c4146d7b5 arm64: Hook up the ZBOOT support as vmlinuz
fc7b83bdf734 arm64: Add ZBOOT PE containing compressed image support
f41c4182b0c4 kexec/zboot: Add arch independent zboot support
1572b91da7c4 kexec: Introduce a member kernel_fd in kexec_info
714fa11590fe kexec/arm64: Simplify the code for zImage
a8de94e5f033 LoongArch: kdump: Set up kernel image segment
4203eaccfa92 kexec: __NR_kexec_file_load is set to undefined on LoongArch
63e9a012112e ppc64: Add elf-ppc64 file types/options and an arch specific flag 
to man page
806711fca9e9 x86: add devicetree support
29fe5067ed07 kexec: make -a the default
e63fefd4fc35 ppc64: add --reuse-cmdline parameter support
8fc55927f700 kexec-tools 2.0.26.git



Re: [PATCHv7 0/5] arm64: zboot support

2023-08-11 Thread Simon Horman
On Thu, Aug 03, 2023 at 10:41:47AM +0800, Pingfan Liu wrote:
> From: root 
> 
> As more complicated capsule kernel formats occur, like zboot, where the
> compressed kernel is stored as a payload, straightforward decompression
> can no longer meet the demand.
>   
> As the first step, on aarch64, the kernel file is read in by a probe
> method, which itself decides how to unfold the content.
> 
> This series consists of two parts
> [1/5], simplify the current aarch64 image probe
> [2-5/5], return the kernel fd by the image load interface, and let the
> handling of zboot image built on it. (Thanks for Dave Young, who
> contributes the original idea and the code)
>  
>  
> To ease the review, a branch is also available at 
> https://github.com/pfliu/kexec-tools.git
> branch zbootV7
>  
> To: kexec@lists.infradead.org
> Cc: Dave Young 
> Cc: ho...@verge.net.au
> Cc: a...@kernel.org
> Cc: jeremy.lin...@arm.com

Thanks everyone,

applied.
