
[Kernel-packages] [Bug 2070358] Re: [Ubuntu 24.04] FW1060.00 (NH1060_026) sosreport is running to Kernel OOPS crash

2024-10-04 Thread Frank Heimes

Many thanks for testing and verification!

-- 
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/2070358

Title:
  [Ubuntu 24.04] FW1060.00 (NH1060_026) sosreport is running to Kernel
  OOPS crash

Status in The Ubuntu-power-systems project:
  Fix Released
Status in linux package in Ubuntu:
  Invalid
Status in sosreport package in Ubuntu:
  Invalid
Status in linux source package in Noble:
  Fix Released
Status in sosreport source package in Noble:
  Invalid

Bug description:
  SRU Justification:

  [Impact]
   * When the sosreport command is executed, a kernel oops occurs and the
     system crashes; depending on the configuration (but by default),
     the system/LPAR reboots.

  [Fix]
   * e0011bca603c101f2a3c007bdb77f7006fa78fb1 e0011bca603c
     "nfsd: initialise nfsd_info.mutex early"

  [Test Case]
   * Have an Ubuntu Server 24.04 LTS installation on ppc64el.
   * One option is to simply run sosreport on the system; the crash
     is seen when sosreport starts to capture the dump.
   * A second option (without sosreport) is:
     * CONFIG_NFSD=m (or y) must be set.
     * Mount nfsd, if not already mounted, using the command:
       $ mount -t nfsd nfsd /proc/fs/nfsd
   * The kernel oops will happen and the logs will show:
     ...
     BUG: Kernel NULL pointer dereference on read at 0x
     Faulting instruction address: 0xc16ff114
     Oops: Kernel access of bad area, sig: 11 [#1]
     ...
   * On a system with a kernel that includes the above patch,
     no oops will occur and the sosreport command will execute normally.
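
  A minimal sketch for checking a captured kernel log against the
  signatures quoted in the test case above. The function name and the
  pattern list are assumptions made for illustration; only the two
  message fragments come from the log excerpt in this bug.

```python
import re

# Message fragments copied from the log excerpt in the test case.
# Treating these two as "the" oops signature is an assumption.
OOPS_PATTERNS = (
    re.compile(r"BUG: Kernel NULL pointer dereference"),
    re.compile(r"Oops: Kernel access of bad area"),
)


def find_oops_lines(log_text):
    """Return the log lines that match one of the oops fragments."""
    return [
        line
        for line in log_text.splitlines()
        if any(pattern.search(line) for pattern in OOPS_PATTERNS)
    ]
```

  The text passed in could be, for example, the saved output of dmesg
  after mounting nfsd; an empty result would be consistent with the
  patched behaviour.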

  [Regression Potential]
  * As with any code modification, there is a certain risk of regression,
    here in particular because the mutex handling in nfsd is modified.

  * But the changes are pretty traceable.

  * On top of that, the commit has already been reviewed and accepted upstream.

  * The modifications were done by the NFSD maintainer and also tested
  by IBM.

  [Other]
  * The fix/commit was accepted upstream in kernel v6.10-rc7,
    hence Oracular (with a planned kernel of >=6.10) is not affected.

  == Comment: #0 - Tasmiya Nalatwad - 2024-05-28 04:35:50 ==
  --- Description ---
  When the sosreport command is executed, a kernel oops occurs and the
  LPAR reboots. As kdump was enabled, the dump is captured.

  Note: The bug looks similar to Bug 206504, which is seen on z LPARs.

  --- Lpar Details ---
  1. PowerVM
  2. FW: FW1060.00 (NH1060_026)
  3. OS: Ubuntu 24.04
  4. Kernel: 6.8.0-31-generic
  5. Mem (free -mh): 47Gi
  6. cpus: 40

  --- Steps to reproduce ---
  1. Run the sosreport command on the LPAR; the crash is seen when
     sosreport starts to capture the dump.

  --- Traces ---
  root@ubuntulp2host:~# sosreport
  Please note the 'sosreport' command has been deprecated in favor of the new 'sos' command, E.G. 'sos report'.
  Redirecting to 'sos report '

  sosreport (version 4.5.6)

  This command will collect system configuration and diagnostic
  information from this Ubuntu system.

  For more information on Canonical visit:

  Community Website  : https://www.ubuntu.com/
  Commercial Support : https://www.canonical.com

  The generated archive may contain data considered sensitive and its
  content should be reviewed by the originating organization before being
  passed to any third party.

  No changes will be made to system configuration.

  Press ENTER to continue, or CTRL-C to quit.

  Optionally, please enter the case id that you are generating this
  report for []:

   Setting up archive ...
   Setting up plugins ...
  [plugin:lxd] skipped command 'lxc image list': required kmods missing: ip6table_nat, ip6table_raw, bpfilter, iptable_mangle, iptable_filter, iptable_raw, ebtable_filter, ip6table_mangle, ebtables, iptable_nat, ip6_tables, ip6table_filter.
  [plugin:lxd] skipped command 'lxc list': required kmods missing: ip6table_nat, ip6table_raw, bpfilter, iptable_mangle, iptable_filter, iptable_raw, ebtable_filter, ip6table_mangle, ebtables, iptable_nat, ip6_tables, ip6table_filter.
  [plugin:lxd] skipped command 'lxc network list': required kmods missing: ip6table_nat, ip6table_raw, bpfilter, iptable_mangle, iptable_filter, iptable_raw, ebtable_filter, ip6table_mangle, ebtables, iptable_nat, ip6_tables, ip6table_filter.
  [plugin:lxd] skipped command 'lxc profile list': required kmods missing: ip6table_nat, ip6table_raw, bpfilter, iptable_mangle, iptable_filter, iptable_raw, ebtable_filter, ip6table_mangle, ebtables, iptable_nat, ip6_tables, ip6table_filter.
  [plugin:lxd] skipped command 'lxc storage list': required kmods missing: ip6table_nat, ip6table_raw, bpfilter, iptable_mangle, iptable_filter, iptable_raw, ebtable_filter, ip6table_mangle, ebtables, iptable_nat, ip6_tables, ip6table_filter.
  [plugin:networking] skipped command 'ip -s macsec show': required kmods missing: macsec.   Use '--allow-system-changes' to enable collection.

[Kernel-packages] [Bug 2075575] Re: kexec fails in LPAR when some cpus are disabled

2024-10-04 Thread Frank Heimes

Many thanks for testing and verification!

-- 
https://bugs.launchpad.net/bugs/2075575

Title:
  kexec fails in LPAR when some cpus are disabled

Status in The Ubuntu-power-systems project:
  In Progress
Status in linux package in Ubuntu:
  Fix Released
Status in linux source package in Jammy:
  Triaged
Status in linux source package in Noble:
  Fix Committed
Status in linux source package in Oracular:
  Fix Released

Bug description:
  == Comment: #0 - SEETEENA THOUFEEK - 2024-08-02 03:11:31 ==
  +++ This bug was initially created as a clone of Bug #206083 +++

  ---Problem Description---
  kexec fails in LPAR when some cpus are disabled
   
  Contact Information = sthou...@in.ibm.com 
   
  Machine Type = na 
   
  ---uname output---
  na
   
  ---Debugger---
  A debugger is not configured
   
  ---Steps to Reproduce---
   Summary:
  At L1 level, kexec fails if some of the cpus in the machine are disabled.

  
  Distros and kernel versions used:
  1. Distro versions used
     a. L1 LPAR :
     b. L2 :

  
  Repro steps:
  1. Boot into an L1 lpar
  2. Disable some cpus (eg: ppc64_cpu --cores-on=3)
  3. Try to kexec. 

  
  This bug is reproducible only when we load the target kernel/initrd and
  use "kexec -e" as follows:

  kexec -l --initrd initramfs-$(uname -r).img vmlinuz-$(uname -r) \
    --append="$(cat /proc/cmdline)"

  kexec -e

  
  kexec works fine if we do a normal kexec without skipping the shutdown path:

  kexec --initrd initramfs-$(uname -r).img vmlinuz-$(uname -r) \
    --append="$(cat /proc/cmdline)"
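
  The two code paths above differ only in whether the kernel is staged
  with "kexec -l" and then started with "kexec -e", or loaded and
  started in one step. A small sketch that assembles the argument lists
  may make the difference easier to script. The function names are
  hypothetical; the flags and file name patterns are the ones used in
  the commands above, and nothing is executed here.

```python
def kexec_load_argv(release, cmdline):
    """argv for staging the kernel only (first step of the failing path)."""
    return [
        "kexec", "-l",
        "--initrd", f"initramfs-{release}.img",
        f"vmlinuz-{release}",
        f"--append={cmdline}",
    ]


def kexec_exec_argv():
    """argv for jumping into the staged kernel, skipping the shutdown path."""
    return ["kexec", "-e"]


def kexec_direct_argv(release, cmdline):
    """argv for the normal kexec, reported above as working."""
    return [
        "kexec",
        "--initrd", f"initramfs-{release}.img",
        f"vmlinuz-{release}",
        f"--append={cmdline}",
    ]
```

  On a test system the release could be taken from os.uname().release
  and the command line from /proc/cmdline, mirroring $(uname -r) and
  $(cat /proc/cmdline) in the shell commands.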

  
  Fix is upstream now:
  https://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux.git/commit/?id=21a741eb75f80397e5f7d3739e24d7d75e619011

  Thanks,
  Sourabh Jain

  Please include in Ubuntu.

   
  Oops output:
   no
   
  Stack trace output:
   no
   
  System Dump Info:
   The system is not configured to capture a system dump.
   
  *Additional Instructions for sthou...@in.ibm.com: 
  -Attach sysctl -a output to the bug.

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu-power-systems/+bug/2075575/+subscriptions


-- 
Mailing list: https://launchpad.net/~kernel-packages
Post to : kernel-packages@lists.launchpad.net
Unsubscribe : https://launchpad.net/~kernel-packages
More help   : https://help.launchpad.net/ListHelp

[Kernel-packages] [Bug 2072641] Re: [UBUNTU 22.04] KVM: s390: unhandled guest LPSWEY instruction

2024-10-02 Thread Frank Heimes

** Tags removed: petest-448

-- 
https://bugs.launchpad.net/bugs/2072641

Title:
  [UBUNTU 22.04] KVM: s390: unhandled guest LPSWEY instruction

Status in Ubuntu on IBM z Systems:
  Fix Committed
Status in linux package in Ubuntu:
  Fix Released
Status in linux source package in Jammy:
  Fix Committed
Status in linux source package in Noble:
  Fix Committed
Status in linux source package in Oracular:
  Fix Released

Bug description:
  Description:   KVM: s390: unhandled guest LPSWEY instruction

  Symptom:       guest kernel oops on LPSWEY instruction

  Problem:       in rare cases like machine check injection with
                 PSW disabled, all load PSW instructions are
                 intercepted. LPSW and LPSWE are handled by KVM but
                 not the new LPSWEY

  Solution:      Provide an LPSWEY handler in KVM.

  Reproduction:  hotplug a device while a CPU is disabled for machine
                 checks, e.g. during early boot

  Upstream-ID:   4c6abb7f7b349f00c0f7ed5045bf67759c012892

  Preventive:    yes
  Reported:      upstream
  Component:     kernel

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu-z-systems/+bug/2072641/+subscriptions



[Kernel-packages] [Bug 2074380] Re: [UBUNTU 22.04] s390/cpum_cf: make crypto counters upward compatible

2024-10-01 Thread Frank Heimes

I was able to successfully validate this on kernel 6.8.0-48:

$ lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description:    Ubuntu 24.04.1 LTS
Release:        24.04
Codename:       noble
$ uname -a
Linux s1lp15 6.8.0-48-generic #48-Ubuntu SMP Fri Sep 27 13:02:00 UTC 2024 s390x s390x s390x GNU/Linux
$ ls -l /sys/devices/cpum_cf/events/ | grep AES
-r--r--r-- 1 root root 4096 Oct  2 06:22 AES_BLOCKED_CYCLES
-r--r--r-- 1 root root 4096 Oct  2 06:22 AES_BLOCKED_FUNCTIONS
-r--r--r-- 1 root root 4096 Oct  2 06:22 AES_CYCLES
-r--r--r-- 1 root root 4096 Oct  2 06:22 AES_FUNCTIONS

(hence adjusting tags accordingly)

** Tags removed: verification-needed-noble-linux
** Tags added: verification-done-noble-linux

-- 
https://bugs.launchpad.net/bugs/2074380

Title:
  [UBUNTU 22.04] s390/cpum_cf: make crypto counters upward compatible

Status in Ubuntu on IBM z Systems:
  Fix Committed
Status in linux package in Ubuntu:
  Fix Released
Status in linux source package in Jammy:
  Fix Committed
Status in linux source package in Noble:
  Fix Committed

Bug description:
  SRU Justification:

  [ Impact ]

   * The CPU Measurement Facility (CPU MF) crypto counter set
     is not listed in the device sysfs tree - it's not exported
     in the sysfs directory /sys/devices/cpum_cf/events.

   * The attribute files for each CPU-MF counter defined
     in the crypto counter set are missing.

   * This is caused by the counter second version number of CPU MF
     hardware being incremented on new machines.

   * This causes a sanity check to fail,
     but the counters are supported by hardware.

   * The solution is to remove the upper limit in counter second
     version number check.

  [ Fix ]

   * f10933cbd2df f10933cbd2dfddf6273698a45f76db9bafd8150f
     "s390/cpum_cf: make crypto counters upward compatible across machine types"

   * The fix was accepted upstream in kernel v6.10(-rc1).

   * The upstream commit applies cleanly on noble master-next,
     but needed to be backported to jammy master-next due to different code
     and context in kernel 5.15.

  [ Test Plan ]

   * Run the following commands on a new machine generation:
     (hence only doable by IBM)
     # ls -l /sys/devices/cpum_cf/events/ | grep AES

   * If the output is empty then this patch is required.

   * With a patched kernel the output should be like:
     # ls /sys/devices/cpum_cf/events/ | grep AES
     AES_BLOCKED_CYCLES
     AES_BLOCKED_FUNCTIONS
     AES_CYCLES
     AES_FUNCTIONS
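
  The check in the test plan above could also be scripted. The
  following is a minimal sketch that assumes only the sysfs path named
  in this bug; the function names are hypothetical, and an empty result
  is only meaningful on an s390x machine of the new generation.

```python
import os

# Path taken from the test plan; it only exists on s390x systems.
EVENTS_DIR = "/sys/devices/cpum_cf/events"


def aes_counters(events_dir=EVENTS_DIR):
    """Return the sorted event attribute names that contain 'AES'."""
    try:
        names = os.listdir(events_dir)
    except FileNotFoundError:
        # No cpum_cf events directory at all (e.g. not an s390x system).
        return []
    return sorted(name for name in names if "AES" in name)


def looks_unpatched(events_dir=EVENTS_DIR):
    """True if no AES counters are exported, as in the failing case."""
    return not aes_counters(events_dir)
```

  With a patched kernel, aes_counters() would be expected to return the
  four names listed above.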

  [ Where problems could occur ]

   * This affects s390x only - CPU MF is s390-specific,
     and only s390 specific code is modified.

   * And it furthermore is limited to the crypto counter set
     of CPU MF.

   * So any impact is likely limited to hardware crypto counters
     on s390x only.

   * In s390/kernel/perf_cpum_cf.c the else if case got changed from
     explicitly checking for 6 or 7 to >= 6 which seems to require
     attention for future 8 and more cases.

   * In s390/kernel/perf_cpum_cf_events.c the switch (ci.csvn) statement
     was changed to an if / else if with similar logic.
     Again, this needs attention for any potential future cases >= 8.

   * It does not look like currently used cases (1..5 and 6..7)
     are affected by the modification, just >7.

   * Test builds of patched jammy and noble s390x kernels were built
     and are available here:
     https://launchpad.net/~fheimes/+archive/ubuntu/lp2074380
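
  To illustrate which counter second version numbers (CSVN) are
  affected by the change described above, here is a simplified model.
  It is not the kernel code: the function names are hypothetical, and
  it only reproduces the comparison described in the text (explicit 6
  or 7 before the fix, >= 6 after it).

```python
def crypto_branch_old(csvn):
    """Model of the pre-fix check: only CSVN 6 or 7 take this branch."""
    return csvn in (6, 7)


def crypto_branch_new(csvn):
    """Model of the post-fix check: every CSVN >= 6 takes this branch."""
    return csvn >= 6


def changed_csvn_values(max_csvn=10):
    """CSVN values in 1..max_csvn for which the two checks disagree."""
    return [
        csvn
        for csvn in range(1, max_csvn + 1)
        if crypto_branch_old(csvn) != crypto_branch_new(csvn)
    ]
```

  Under this model only values above 7 behave differently, which
  matches the observation that the currently used cases (1..5 and 6..7)
  are not affected.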

  [ Other Info ]

   * Since the code/fix was upstream accepted with kernel v6.10(-rc1)
     it does not affect the current development release oracular.

   * This SRU can also be seen under the umbrella of new
     hardware enablement.

   * Since it requires special hw, the verification needs to be
     done by IBM.

  __

  Description:   kernel: s390/cpum_cf: make crypto counters upward
  compatible

  Symptom:   The CPU Measurement facility crypto counter set is not
     listed in the device sysfs tree.

  Problem:   The CPU Measurement facility crypto counter set is not
     exported in the sysfs directory
     /sys/devices/cpum_cf/events.
     The attribute files for each CPU-MF counter defined
     in the crypto counter set are missing. This is caused
     by the counter second version number of the CPU
     Measurement Facility hardware being incremented on
     new machines.  This causes a sanity check to fail,
     but the counters are supported by hardware.

  Solution:  Remove upper limit in counter second version number
     check.

  Reproduction:  Run command on a new machine generation:
     # ls -l /sys/devices/cpum_cf/events/ | grep AES
     #
     If the output is empty then this patch is required.

[Kernel-packages] [Bug 2070329] Re: KOP L2 guest fails to boot with 1 core - SMT8 topology

2024-09-26 Thread Frank Heimes

** Changed in: ubuntu-power-systems
   Status: In Progress => Fix Committed

-- 
https://bugs.launchpad.net/bugs/2070329

Title:
  KOP L2 guest fails to boot with 1 core - SMT8 topology

Status in The Ubuntu-power-systems project:
  Fix Committed
Status in linux package in Ubuntu:
  Fix Released
Status in linux source package in Noble:
  Fix Committed
Status in linux source package in Oracular:
  Fix Released

Bug description:
  SRU Justification:

  [ Impact ]

   * On a P10 system with SMT-8 configured,
     a level 2 guest (VM) fails to boot in case
     it only has one core assigned.

  [ Test Plan ]

   * Set up an IBM Power 10 system that supports up to SMT-8,
     with firmware 1060 (which offers support for KVM),
     using Ubuntu Server 24.04 for ppc64el.

   * Set up qemu/KVM on this system.

   * Configure a KVM guest (e.g. using virtinst or
     qemu-system-ppc64 directly) with SMT-8,
     but only one virtual core.

   * Try to boot this specific guest:
     qemu-system-ppc64 \
       -drive file=rhel.qcow2,format=qcow2 \
       -m 20G \
       -smp 8,cores=1,threads=8 \
       -cpu host \
       -nographic \
       -machine pseries,ic-mode=xics -accel kvm

   * It will fail to boot with a kernel that does not
     have the two patches in place.

   * Since this setup requires a special firmware level,
     the verification will be done by the IBM Power team.
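
  The failing topology is one core with eight threads, so the -smp
  value has to be consistent with cores x threads. A small sketch for
  deriving the -smp argument, under the simplifying assumption that
  only sockets, cores and threads are used (the helper name is
  hypothetical):

```python
def smp_arg(cores, threads, sockets=1):
    """Build a QEMU -smp value; the vCPU count is sockets*cores*threads."""
    if min(cores, threads, sockets) < 1:
        raise ValueError("sockets, cores and threads must all be >= 1")
    vcpus = sockets * cores * threads
    return f"{vcpus},cores={cores},sockets={sockets},threads={threads}"
```

  For the case in this bug, smp_arg(1, 8) gives the same value as used
  in the reproducer: 8,cores=1,sockets=1,threads=8.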

  [ Where problems could occur ]

   * Primarily, support for using DPDES (register) is required,
     since it's needed for enabling usage of doorbells in L2 guests.
     This is mainly done by adding DEFINEs, stubs and a case.
     If the definitions are not correct, or if the code executed by
     the new case (KVMPPC_GSID_DPDES) is wrong,
     the guest state could be incorrect, harming the L2 guest doorbell.
     (DPDES provides the means for the hypervisor to save a
      [sub-]processor's Directed Privileged Doorbell exception state
      when the set of programs running on the [sub-]processor is
      swapped out or moved from one [sub-]processor to another.)

   * The missing doorbell emulation was added by a 4-line if statement
     in powerpc/kvm/book3s_hv.c, which is relatively traceable.

   * The main issue I can think of is that kvmppc_set_dpdes is called
     with wrong arguments.

   * And kvmppc_set_dpdes will not work (at all) if the above DPDES
     support (and commit/patch) is missing.

  [ Other Info ]

   * Since (nested) KVM support is new on P10,
     this does not affect older Power generations
     (P9 is the only other hw generation that is supported by 24.04,
     but it only supports native virtualization).

   * Both patches have been accepted upstream since v6.11(-rc1),
     hence will be in oracular,
     and are also tagged upstream as stable updates.

   * Since the required firmware FW1060 is relatively new,
     we can assume that not many users have run into this issue yet.
  __

  == Comment: #0 - SEETEENA THOUFEEK - 2024-06-25 01:24:11 ==
  +++ This bug was initially created as a clone of Bug #205277 +++

  ---Problem Description---
  KOP L2 guest fails to boot with 1 core - SMT8 topology

  ---Additional Hardware Info---
  na

  ---Debugger Data---
  na

  ---Steps to Reproduce---
   KOP L2 guest fails to boot when we set the CPU topology as 1 core - SMT 8

  command line used to verify the issue:
  #!/bin/sh

  QEMU="/home/mgautam/qemu"
  qemu-system-ppc64 -s \
  -drive file=/root/debian-12-nocloud-ppc64el.qcow2,format=qcow2 \
  -m 20G \
  -smp 8,cores=1,sockets=1,threads=8 \
  -cpu host \
  -nographic \
  -machine pseries,ic-mode=xics -accel kvm  \
  -net nic,model=virtio \
  -net user,host=10.0.2.10,hostfwd=tcp:127.0.0.1:10022-:22

  NOTE: L2 boots fine when doorbells are turned off in L1 kernel

  As per the investigation so far, the doorbell exception is not getting
  fired inside L2 guest. At L1 level, if we set DPDES=1 in the GSB for
  L2, the guest never receives the doorbell and also it is never cleared
  from the GSB. We are discussing this behaviour with phyp team.

  The root cause of this issue is lack of DPDES support at L1. I've
  posted the fix upstream -
  https://lore.kernel.org/linuxppc-dev/20240522084949.123148-1-gau...@linux.ibm.com/T/#u

  The fix has been accepted upstream and will be backported for kernels
  >= 6.7

  https://lore.kernel.org/linuxppc-dev/20240605113913.83715-1-gau...@linux.ibm.com/

  ---Patches Installed---
  na

  ---System Hang---
   na

  ---uname output---
  na

  Contact Information = na

  Machine Type = na

  Userspace rpm: na

  Userspace tool common name: na

  The userspace tool has the following bit modes: na

  Userspace tool obtained from project website:  na

  *Additional Instructions for na:
  -Post a private note with access information to the machine that is
   currently in the debugger.
  -Attach ltrace

[Kernel-packages] [Bug 2076406] Re: L2 Guest migration: continuously dumping while running NFS guest migration

2024-09-26 Thread Frank Heimes

** Changed in: ubuntu-power-systems
   Status: In Progress => Fix Committed

-- 
https://bugs.launchpad.net/bugs/2076406

Title:
  L2 Guest migration: continuously dumping while running NFS guest
  migration

Status in The Ubuntu-power-systems project:
  Fix Committed
Status in linux package in Ubuntu:
  Fix Released
Status in linux source package in Noble:
  Fix Committed
Status in linux source package in Oracular:
  Fix Released

Bug description:
  SRU Justification:

  [ Impact ]

   * While doing ISST testing it turned out that a 2nd level (KVM)
     guest (aka VM) continuously dumped when running an NFS
     guest migration.

  [ Test Plan ]

   * Set up two IBM Power 10 systems (with firmware 1060, which offers
     support for KVM) with Ubuntu Server 24.04 for ppc64el.

   * Set up qemu/KVM on both of these systems to allow guest migration.

   * Set up a KVM guest and place its disk on an NFS volume.

   * Now initiate a guest migration.

   * Without the two patches the initiator system will start to dump.

   * Since this setup requires a special firmware level,
     the verification will be done by the IBM Power team.

  [ Where problems could occur ]

   * Although the patch set looks huge,
     the patches themselves are relatively small and not very invasive,
     and I would consider them mainly as fixes.

   * kvmppc_set_one_reg_hv() wrongly gets the value instead of
     setting it for MMCR3.

   * And kvmppc_get_one_reg_hv() for SDAR wrongly gets
     the SIAR instead of the SDAR - which is quite traceable.

   * Then a one-reg interface for the DEXCR register, KVM_REG_PPC_DEXCR,
     is introduced. Here issues can happen if the initialization
     or the code in the case statement is done wrong.
     A fix was added to keep the nested guest DEXCR in sync.
     The guest state element defined for DEXCR was already there,
     but not really considered - this is fixed now (DEXCR GSID).
     If the initialization or the code in the case statement is wrong,
     this can harm the guest state.
     The guest state may get out of sync.

   * Another one-reg register identifier was introduced
     that is used to read and set the virtual HASHKEYR
     for the guest during enter/exit with KVM_REG_PPC_HASHKEYR.
     Again, the initialization and the case code are critical.
     Code was added to keep the nested guest HASHKEYR in sync.
     Again, the state element defined for HASHKEYR was there,
     but not considered, which is fixed now (HASHKEYR GSID).
     If the initialization or the code in the case statement is wrong,
     this can harm the guest state.
     This can harm the L2 guest during enter or exit.

   * Yet another one-reg identifier was introduced
     that is used to read and set the virtual HASHPKEYR
     for the guest during enter/exit with KVM_REG_PPC_HASHPKEYR.
     And again, the guest state element defined for HASHPKEYR
     was there but ignored, which is now fixed (HASHPKEYR GSID).
     If the initialization or the code in the case statement is wrong,
     this can harm the guest state.
     This can harm the L2 guest during enter or exit.

  [ Other Info ]

   * Since (nested) KVM support is new on P10,
     this does not affect older Power generations
     (P9 is the only other hw generation that is supported by 24.04,
     but it only supports native virtualization).

   * Both patches have been accepted upstream since v6.11(-rc1),
     hence will be in oracular,
     and are also tagged upstream as stable updates.

   * Since the required firmware FW1060 is relatively new,
     we can assume that not many users have run into this issue yet.
  __

  == Comment: #0 - SEETEENA THOUFEEK - 2024-08-09 03:50:24 ==
  +++ This bug was initially created as a clone of Bug #206737 +++

  ---Problem Description---
  L2 Guest migration: evelp2g4[L2]: while running NFS guest migration,
  continuously dumping smp_call_function_many_cond+0x500/0x738 (unreliable)
  and watchdog: BUG: soft lockup - CPU#14 stuck for 223s! [systemd-homed]

  ---uname output---
  NA

  Machine Type = NA

  Contact Information = NA

  [79205.163691] Hardware name: IBM pSeries (emulated by qemu) POWER10 (raw) 0x800200 0xf06 of:SLOF,HEAD hv:linux,kvm pSeries
  [79205.163834] NIP:  c02bb7a4 LR: c02bb750 CTR: c00d192c
  [79205.163929] REGS: c003871cf1b0 TRAP: 0900   Tainted: G L
  [79205.165041] MSR:  8280b033   CR: 4404  XER: 20040004
  [79205.165266] CFAR:  IRQMASK: 0
     GPR00: c02bbc58 c003871cf450 c20ded00 0009
     GPR04: 0009 0009 0080 0200
     GPR08: 01ff 0001 c00740f57ee0 44048222
     GPR12: c00d192c c00743ddc980  

     GPR16:  cd86e200 0

[Kernel-packages] [Bug 2076866] Re: Guest crashes post migration with migrate_misplaced_folio+0x4cc/0x5d0

2024-09-26 Thread Frank Heimes

** Changed in: ubuntu-power-systems
   Status: In Progress => Fix Committed

-- 
https://bugs.launchpad.net/bugs/2076866

Title:
  Guest crashes post migration with migrate_misplaced_folio+0x4cc/0x5d0

Status in The Ubuntu-power-systems project:
  Fix Committed
Status in linux package in Ubuntu:
  Fix Released
Status in linux source package in Noble:
  Fix Committed
Status in linux source package in Oracular:
  Fix Released

Bug description:
  SRU Justification:

  [ Impact ]

   * A KVM guest (VM) that got live migrated between two Power 10 systems
     (using nested virtualization, meaning KVM on top of PowerVM) is
     highly likely to crash after about an hour.

   * At that point it looked like the live migration itself had already
     been successful, but it wasn't, and the crash is a consequence of that.

  [ Test Plan ]

   * Set up two Power 10 systems (with firmware level FW1060 or newer,
     which supports nested KVM) with Ubuntu Server 24.04 for ppc64el.

   * Set up a qemu/KVM environment that allows live migration of a KVM
     guest from one P10 system to the other.

   * (The disk type does not seem to matter, hence NFS-based disk storage
     can be used, for example.)

   * After about an hour the live migrated guest is likely to crash.
     Hence wait for 2 hours (which increases the likelihood) until
     a crash due to:
     "migrate_misplaced_folio+0x540/0x5d0"
     occurs.

  [ Where problems could occur ]

   * The 'fix' to avoid calling folio_likely_mapped_shared for cases where
     folio might have already been unmapped and the move of the checks
     might have an impact on page table locks if done wrong,
     which may lead to wrong locks, blocked memory and finally crashes.

   * The direct folio calls in mm/huge_memory.c and mm/memory.c got now
     'in-directed', which may lead to a different behaviour and side-effects.
     However, isolation is still done, just slightly different and
     instead of using numamigrate_isolate_folio, now in (the renamed)
     migrate_misplaced_folio_prepare.

   * Further upstream conversations:
     https://lkml.kernel.org/r/8f85c31a-e603-4578-bf49-136dae0d4...@redhat.com
     https://lkml.kernel.org/r/20240626191129.658cfc32...@smtp.kernel.org
     https://lkml.kernel.org/r/20240620212935.656243-3-da...@redhat.com

   * Fixing a confusing return code, now to just return 0, on success is
     clarifying the return code handling and usage, and was mainly done in
     preparation of further changes,
     but can have bad side effects if the return code was used in other
     code places already as is.

   * Further upstream conversations:
     https://lkml.kernel.org/r/20240620212935.656243-1-da...@redhat.com
     https://lkml.kernel.org/r/20240620212935.656243-2-da...@redhat.com

   * Fixing the fact that NUMA balancing prohibits mTHP
     (multi-size Transparent Hugepage Support), which seems to be
     unreasonable since it's an exclusive mapping.
     Allowing this seems to bring significant performance improvements
     (see commit message d2136d749d76), but introduces significant changes
     to PTE mapping and modifications, and even relies on further commits:
     859d4adc3415 ("mm: numa: do not trap faults on shared data section pages")
     80d47f5de5e3 ("mm: don't try to NUMA-migrate COW pages that have other uses")
     This can cause issues on systems configured for THP and
     may confuse the ordering, which may even lead to memory corruption.
     And this may especially hit (NUMA) systems with high core numbers,
     where balancing is needed more often.

   * Further upstream conversations:
     https://lore.kernel.org/all/20231117100745.fnpijbk4xgmal...@techsingularity.net/
     https://lkml.kernel.org/r/c33a5c0b0a0323b1f8ed53772f50501f4b196e25.1712132950.git.baolin.w...@linux.alibaba.com
     https://lkml.kernel.org/r/d28d276d599c26df7f38c9de8446f60e22dd1950.1711683069.git.baolin.w...@linux.alibaba.com

   * The refactoring of the code for NUMA mapping rebuilding and moving
     it into a new helper seems to be straightforward, since the active code
     stays unchanged; however, the new function needs to be callable, but this
     is the case since it's all in mm/memory.c.

   * Further upstream conversations:
     https://lkml.kernel.org/r/cover.1712132950.git.baolin.w...@linux.alibaba.com
     https://lkml.kernel.org/r/cover.1711683069.git.baolin.w...@linux.alibaba.com
     https://lkml.kernel.org/r/8bc2586bdd8dbbe6d83c09b77b360ec8fcac3736.1711683069.git.baolin.w...@linux.alibaba.com

   * The refactoring of folio_estimated_sharers to folio_likely_mapped_shared
     is more significant, since the logic changed from
     (folio_estimated_sharers) 'estimate the number of sharers of a folio' to
     (folio_likely_mapped_shared) 'estimate if the folio is mapped into the page
     tables of more than one MM'.

   * Since th

[Kernel-packages] [Bug 2076147] Re: Add 'mm: hold PTL from the first PTE while reclaiming a large folio' to fix L2 Guest hang during LTP Test

2024-09-26 Thread Frank Heimes

** Changed in: ubuntu-power-systems
   Status: In Progress => Fix Committed

-- 
https://bugs.launchpad.net/bugs/2076147

Title:
  Add 'mm: hold PTL from the first PTE while reclaiming a large folio'
  to fix L2 Guest hang during LTP Test

Status in The Ubuntu-power-systems project:
  Fix Committed
Status in linux package in Ubuntu:
  Fix Released
Status in linux source package in Noble:
  Fix Committed
Status in linux source package in Oracular:
  Fix Released

Bug description:
  SRU Justification:

  [ Impact ]

   * A KVM 2nd level guest (meaning a KVM VM that runs nested on top of a
     Power 10 PowerVM hypervisor) hangs during the LTP (Linux Test Project)
     test suite.

   * It hangs with:
     "Back trace of paca->saved_r1 (0xc00c1bc8bb00) (possibly stale) @ new_slab"

   * Diagnosing the issue points to this fix/upstream commit:
     [commit message, by Barry Song]
     Within try_to_unmap_one(), page_vma_mapped_walk() races with other PTE
     modifications preceded by pte clear. While iterating over PTEs of a
     large folio,
     it only starts acquiring PTL from the first valid (present) PTE.
     PTE modifications can temporarily set PTEs to pte_none.
     Consequently, the initial PTEs of a large folio might be skipped
     in try_to_unmap_one().
     For example, for an anon folio, if we skip PTE0, we may have PTE0 which is
     still present, while PTE1 ~ PTE(nr_pages - 1) are swap entries after
     try_to_unmap_one().
     So folio will be still mapped, the folio fails to be reclaimed and is put
     back to LRU in this round.
     This also breaks up PTEs optimization such as CONT-PTE on this large folio
     and may lead to accident folio_split() afterwards.
     And since a part of PTEs are now swap entries, accessing those parts will
     introduce overhead - do_swap_page.
     Although the kernel can withstand all of the above issues, the situation
     still seems quite awkward and warrants making it more ideal.
     The same race also occurs with small folios, but they have only one PTE,
     thus, it won't be possible for them to be partially unmapped.
     This patch [see below] holds PTL from PTE0, allowing us to avoid reading
     PTE values that are in the process of being transformed. With stable PTE
     values, we can ensure that this large folio is either completely reclaimed
     or that all PTEs remain untouched in this round.
     A corner case is that if we hold PTL from PTE0 and most initial PTEs have
     been really unmapped before that, we may increase the duration of holding
     PTL. Thus we only apply this optimization to folios which are still
     entirely mapped (not in deferred_split list).

  [ Fix ]

   * 73bc32875ee9 73bc32875ee9b1881dd780308c6793fe463fe803
     "mm: hold PTL from the first PTE while reclaiming a large folio"

  [ Test Plan ]

   * An IBM Power 10 system (where PowerVM is mandatory)
     running Ubuntu Server 24.04 (kernel 6.8) or later
     with (nested) KVM setup (so KVM on top of PowerVM).

   * Run LTP test suite
     Tests running: SLS(io,base)

   * Without the patch the above test will hang with
     Back trace of paca->saved_r1 (0xc00c1bc8bb00) (possibly stale) @ new_slab

  [ Where problems could occur ]

   * This is a common code change in the memory management sub-system,
     hence great care needs to be taken, even if it was discussed upfront
     at the https://lore.kernel.org/ mailing list and the upstream commit
     provenance shows that many eyes had a look at this.

   * The modification is relatively small with just one if statement
     (across two lines) in mm/vmscan.c.

   * This change is to assist 'try_to_unmap' to acquire page table locks (PTL)
     from the first page table entry (PTE) and to eliminate the influence of
     temporary and volatile PTE values.

   * If done wrong, it can have a negative impact, especially in the case of
     large folios, and wrong hints might be given to try_to_unmap,
     which may lead to bad page swapping.

   * In case of an issue with this patch the result can also be decreased
     performance and efficiency in the page table handling - the opposite
     of what the patch is supposed to address.

   * Fortunately several developers had their eyes on this commit,
     as the provenance of the patch and the discussion at LKML shows.

   * Further upstream conversation:
 Link: https://lkml.kernel.org/r/20240306095219.71086-1-21cn...@gmail.com

  [ Other Info ]

   * The commit is upstream since v6.10(-rc1), hence it will be included
     in oracular with the planned target kernel of 6.11.

   * And since (nested) KVM virtualization on ppc64el was (re-)introduced
     just with noble, no Ubuntu releases older than noble are affected.

  __

  == Comment: #0 - SEETEENA THOUFEEK - 2024-08-06 00:20:57 ==
  +++ This bug was initially created as

[Kernel-packages] [Bug 2070329] Re: KOP L2 guest fails to boot with 1 core - SMT8 topology

2024-09-26 Thread Frank Heimes



-- 
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/2070329

Title:
  KOP L2 guest fails to boot with 1 core - SMT8 topology

Status in The Ubuntu-power-systems project:
  In Progress
Status in linux package in Ubuntu:
  Fix Released
Status in linux source package in Noble:
  In Progress
Status in linux source package in Oracular:
  Fix Released

Bug description:
  SRU Justification:

  [ Impact ]

   * On a P10 system with SMT-8 configured
 a level 2 guest (VM) fails to boot in case
 it only has one core assigned.

  [ Test Plan ]

   * Set up an IBM Power 10 system (which supports up to SMT-8)
     with firmware 1060, which offers support for KVM,
     using Ubuntu Server 24.04 for ppc64el.

   * Set up qemu/KVM on this system.

   * Configure a KVM guest (e.g. using virtinst or
     qemu-system-ppc64 directly) with SMT-8,
     but only one virtual core.

   * Try to boot this specific guest:
     qemu-system-ppc64 \
       -drive file=rhel.qcow2,format=qcow2 \
       -m 20G \
       -smp 8,cores=1,threads=8 \
       -cpu host \
       -nographic \
       -machine pseries,ic-mode=xics -accel kvm

   * It will fail to boot with a kernel that does not
 have the two patches in place.

   * Since this setup requires a special firmware level,
 the verification will be done by the IBM Power team.
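
The topology used in the test plan can be sketched as follows. This is a
minimal illustration, not part of the original report: the helper names
smp_option() and guest_command() are hypothetical, and the script only
builds the command line shown above without running it. It makes explicit
that one core with eight threads results in eight vCPUs in the -smp option.

```python
import shlex


def smp_option(cores, threads, sockets=1):
    """Return the value for qemu's -smp option for the given topology.

    The total vCPU count is sockets * cores * threads, so one core
    with SMT-8 yields 8 vCPUs.
    """
    if min(cores, threads, sockets) < 1:
        raise ValueError("cores, threads and sockets must all be >= 1")
    vcpus = sockets * cores * threads
    return f"{vcpus},sockets={sockets},cores={cores},threads={threads}"


def guest_command(disk_image, cores=1, threads=8, memory="20G"):
    """Build (but do not run) the qemu command line from the test plan."""
    argv = [
        "qemu-system-ppc64",
        "-drive", f"file={disk_image},format=qcow2",
        "-m", memory,
        "-smp", smp_option(cores, threads),
        "-cpu", "host",
        "-nographic",
        "-machine", "pseries,ic-mode=xics",
        "-accel", "kvm",
    ]
    return shlex.join(argv)


if __name__ == "__main__":
    # rhel.qcow2 is the image name used in the test plan above.
    print(guest_command("rhel.qcow2"))
```

Changing cores to 2 would give a 16 vCPU guest, which is outside the
failing one-core case described in the impact section.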

  [ Where problems could occur ]

   * Primarily, support for using the DPDES register is required,
     since it's needed for enabling the usage of doorbells in L2 guests.
     This is mainly done by adding DEFINEs, stubs and a case.
     If the definitions are not correct or if the code executed by
     the new case (KVMPPC_GSID_DPDES) is wrong,
     the guest state could be incorrect, harming the L2 guest doorbell.
     (DPDES provides the means for the hypervisor to save a
      [sub-]processor's Directed Privileged Doorbell exception state
      when the set of programs running on the [sub-]processor is
      swapped out or moved from one [sub-]processor to another.)

   * The missing Doorbell emulation got added by a 4 line if statement
 in powerpc/kvm/book3s_hv.c, which is relatively traceable.

   * The main issue I can think of is that kvmppc_set_dpdes is called
 with wrong arguments.

   * And kvmppc_set_dpdes will not work (at all) if the above DPDES
 support (and commit/patch) is missing.

  [ Other Info ]

   * Since (nested) KVM support is new on P10,
 this does not affect older Power generation
 (P9 is the only other hw generation that is supported by 24.04,
 but it only supports native virtualization).

   * Both patches are upstream accepted since v6.11(-rc1),
 hence will be in oracular
 and are also upstream tagged as stable updates.

   * Since the required firmware FW1060 is relatively new,
     we can assume that not many users have run into this issue yet.
  __

  == Comment: #0 - SEETEENA THOUFEEK - 2024-06-25 01:24:11 ==
  +++ This bug was initially created as a clone of Bug #205277 +++

  ---Problem Description---
  KOP L2 guest fails to boot with 1 core - SMT8 topology

  ---Additional Hardware Info---
  na

  ---Debugger Data---
  na

  ---Steps to Reproduce---
   KOP L2 guest fails to boot when we set the CPU topology as 1 core - SMT 8

  command line used to verify the issue:
  #!/bin/sh

  QEMU="/home/mgautam/qemu"
  qemu-system-ppc64 -s \
  -drive file=/root/debian-12-nocloud-ppc64el.qcow2,format=qcow2 \
  -m 20G \
  -smp 8,cores=1,sockets=1,threads=8 \
  -cpu host \
  -nographic \
  -machine pseries,ic-mode=xics -accel kvm  \
  -net nic,model=virtio \
  -net user,host=10.0.2.10,hostfwd=tcp:127.0.0.1:10022-:22

  NOTE: L2 boots fine when doorbells are turned off in L1 kernel

  As per the investigation so far, the doorbell exception is not getting
  fired inside L2 guest. At L1 level, if we set DPDES=1 in the GSB for
  L2, the guest never receives the doorbell and also it is never cleared
  from the GSB. We are discussing this behaviour with phyp team.

  The root cause of this issue is lack of DPDES support at L1. I've
  posted the fix upstream -
  https://lore.kernel.org/linuxppc-dev/20240522084949.123148-1-gau...@linux.ibm.com/T/#u

  The fix has been accepted upstream and will be backported for kernels
  >= 6.7

  https://lore.kernel.org/linuxppc-dev/20240605113913.83715-1-gau...@linux.ibm.com/

  ---Patches Installed---
  na

  ---System Hang---
   na

  ---uname output---
  na

  Contact Information = na

  Machine Type = na

  Userspace rpm: na

  Userspace tool common name: na

  The userspace tool has the following bit modes: na

  Userspace tool obtained from project website:  na

  *Additional Instructions for na:
  -Post a private note with access information to the machine that is currently in the debugger.
  -Attach ltrace and strace of userspace application.

To manage notifications about this bug go to

[Kernel-packages] [Bug 2076406] Re: L2 Guest migration: continuously dumping while running NFS guest migration

2024-09-26 Thread Frank Heimes



-- 
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/2076406

Title:
  L2 Guest migration: continuously dumping while running NFS guest
  migration

Status in The Ubuntu-power-systems project:
  In Progress
Status in linux package in Ubuntu:
  Fix Released
Status in linux source package in Noble:
  In Progress
Status in linux source package in Oracular:
  Fix Released

Bug description:
  SRU Justification:

  [ Impact ]

   * While doing ISST testing it turned out that a 2nd level (KVM)
 guest (aka VM) continuously dumped when running an NFS
 guest migration.

  [ Test Plan ]

   * Set up two IBM Power 10 systems (with firmware 1060, which offers
     support for KVM) with Ubuntu Server 24.04 for ppc64el.

   * Set up qemu/KVM on both of these systems to allow guest migration.

   * Set up a KVM guest and place its disk on an NFS volume.

   * Now initiate a guest migration.

   * Without the two patches the initiator system will start to dump.

   * Since this setup requires a special firmware level,
 the verification will be done by the IBM Power team.

  [ Where problems could occur ]

   * Although the patch set looks huge,
 the patches themselves are relatively small and less invasive
 and I would consider them mainly as fixes.

   * kvmppc_set_one_reg_hv() wrongly did a get() of the value instead of
     a set() for MMCR3.

   * And kvmppc_get_one_reg_hv() for SDAR was wrongly getting
     the SIAR instead of the SDAR - which is quite traceable.

   * Then a one-reg interface for the DEXCR register (KVM_REG_PPC_DEXCR)
     is introduced. Here issues can happen if the initialization
     or the code in the case statement is wrong.
     A fix was added to keep the nested guest DEXCR in sync.
     The guest state element defined for DEXCR was already there,
     but not really considered - this is fixed now (DEXCR GSID).
     If the initialization or the code in the case statement is wrong,
     this can harm the guest state.
     Guest state may get out of sync.

   * Another one-reg register identifier was introduced
     that is used to read and set the virtual HASHKEYR
     for the guest during enter/exit with KVM_REG_PPC_HASHKEYR.
     Again, the initialization and the case code are critical.
     Code was added to keep the nested guest HASHKEYR in sync.
     Again, the state element defined for HASHKEYR was there,
     but not considered, which is fixed now (HASHKEYR GSID).
     If the initialization or the code in the case statement is wrong,
     this can harm the guest state.
     This can harm the L2 guest during enter or exit.

   * Yet another one-reg identifier was introduced
     that is used to read and set the virtual HASHPKEYR
     for the guest during enter/exit with KVM_REG_PPC_HASHPKEYR.
     And again, the guest state element defined for HASHPKEYR
     was there but ignored, which is now fixed (HASHPKEYR GSID).
     If the initialization or the code in the case statement is wrong,
     this can harm the guest state.
     This can harm the L2 guest during enter or exit.

  [ Other Info ]

   * Since (nested) KVM support is new on P10,
 this does not affect older Power generation
 (P9 is the only other hw generation that is supported by 24.04,
 but it only supports native virtualization).

   * Both patches are upstream accepted since v6.11(-rc1),
 hence will be in oracular
 and are also upstream tagged as stable updates.

   * Since the required firmware FW1060 is relatively new,
     we can assume that not many users have run into this issue yet.
  __

  == Comment: #0 - SEETEENA THOUFEEK - 2024-08-09 03:50:24 ==
  +++ This bug was initially created as a clone of Bug #206737 +++

  ---Problem Description---
  L2 Guest migration: evelp2g4[L2]: while running NFS guest migration
  continuously dumping smp_call_function_many_cond+0x500/0x738 (unreliable) and
  watchdog: BUG: soft lockup - CPU#14 stuck for 223s! [systemd-homed}

  ---uname output---
  NA

  Machine Type = NA

  Contact Information = NA

  [79205.163691] Hardware name: IBM pSeries (emulated by qemu) POWER10 (raw) 0x800200 0xf06 of:SLOF,HEAD hv:linux,kvm pSeries
  [79205.163834] NIP:  c02bb7a4 LR: c02bb750 CTR: c00d192c
  [79205.163929] REGS: c003871cf1b0 TRAP: 0900   Tainted: G L
  [79205.165041] MSR:  8280b033   CR: 4404  XER: 20040004
  [79205.165266] CFAR:  IRQMASK: 0
     GPR00: c02bbc58 c003871cf450 c20ded00 0009
     GPR04: 0009 0009 0080 0200
     GPR08: 01ff 0001 c00740f57ee0 44048222
     GPR12: c00d192c c00743ddc980
     GPR16:  cd86e200 0001 0001
     GPR20: 000c c3d0

[Kernel-packages] [Bug 2076866] Re: Guest crashes post migration with migrate_misplaced_folio+0x4cc/0x5d0

2024-09-26 Thread Frank Heimes



-- 
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/2076866

Title:
  Guest crashes post migration with migrate_misplaced_folio+0x4cc/0x5d0

Status in The Ubuntu-power-systems project:
  In Progress
Status in linux package in Ubuntu:
  Fix Released
Status in linux source package in Noble:
  In Progress
Status in linux source package in Oracular:
  Fix Released

Bug description:
  SRU Justification:

  [ Impact ]

   * A KVM guest (VM) that got live migrated between two Power 10 systems
     (using nested virtualization, i.e. KVM on top of PowerVM) is
     highly likely to crash after about an hour.

   * At that point it looked like the live migration itself was already
     successful, but it wasn't, and the crash is a consequence of that.

  [ Test Plan ]

   * Set up two Power 10 systems (with firmware level FW1060 or newer,
     which supports nested KVM) with Ubuntu Server 24.04 for ppc64el.

   * Set up a qemu/KVM environment that allows live migration of a KVM
     guest from one P10 system to the other.

   * (The disk type does not seem to matter, hence NFS based disk storage
     can be used for example).

   * After about an hour the live migrated guest is likely to crash.
     Hence wait for 2 hours (which increases the likelihood) until
     a crash due to:
     "migrate_misplaced_folio+0x540/0x5d0"
     occurs.

  [ Where problems could occur ]

   * The 'fix' to avoid calling folio_likely_mapped_shared for cases where
     folio might have already been unmapped and the move of the checks
     might have an impact on page table locks if done wrong,
     which may lead to wrong locks, blocked memory and finally crashes.

   * The direct folio calls in mm/huge_memory.c and mm/memory.c got now
     'in-directed', which may lead to a different behaviour and side-effects.
     However, isolation is still done, just slightly different and
     instead of using numamigrate_isolate_folio, now in (the renamed)
     migrate_misplaced_folio_prepare.

   * Further upstream conversations:
     https://lkml.kernel.org/r/8f85c31a-e603-4578-bf49-136dae0d4...@redhat.com
     https://lkml.kernel.org/r/20240626191129.658cfc32...@smtp.kernel.org
     https://lkml.kernel.org/r/20240620212935.656243-3-da...@redhat.com

   * Fixing a confusing return code, now to just return 0, on success is
     clarifying the return code handling and usage, and was mainly done in
     preparation of further changes,
     but can have bad side effects if the return code was used in other
     code places already as is.

   * Further upstream conversations:
     https://lkml.kernel.org/r/20240620212935.656243-1-da...@redhat.com
     https://lkml.kernel.org/r/20240620212935.656243-2-da...@redhat.com

   * Fixing the fact that NUMA balancing prohibits mTHP
     (multi-size Transparent Hugepage Support), which seems to be unreasonable
     since it's an exclusive mapping.
     Allowing this seems to bring significant performance improvements
     (see commit message of d2136d749d76), but introduces significant changes
     to PTE mapping and modifications and even relies on further commits:
     859d4adc3415 ("mm: numa: do not trap faults on shared data section pages")
     80d47f5de5e3 ("mm: don't try to NUMA-migrate COW pages that have other uses")
     This can cause issues on systems configured for THP and
     may confuse the ordering, which may even lead to memory corruption.
     And this may especially hit (NUMA) systems with high core numbers,
     where balancing is more often needed.

   * Further upstream conversations:
     https://lore.kernel.org/all/20231117100745.fnpijbk4xgmal...@techsingularity.net/
     https://lkml.kernel.org/r/c33a5c0b0a0323b1f8ed53772f50501f4b196e25.1712132950.git.baolin.w...@linux.alibaba.com
     https://lkml.kernel.org/r/d28d276d599c26df7f38c9de8446f60e22dd1950.1711683069.git.baolin.w...@linux.alibaba.com

   * The refactoring of the code for NUMA mapping rebuilding and moving
     it into a new helper seems to be straightforward, since the active code
     stays unchanged. However, the new function needs to be callable, which
     is the case since it's all in mm/memory.c.

   * Further upstream conversations:
     https://lkml.kernel.org/r/cover.1712132950.git.baolin.w...@linux.alibaba.com
     https://lkml.kernel.org/r/cover.1711683069.git.baolin.w...@linux.alibaba.com
     https://lkml.kernel.org/r/8bc2586bdd8dbbe6d83c09b77b360ec8fcac3736.1711683069.git.baolin.w...@linux.alibaba.com

   * The refactoring of folio_estimated_sharers to folio_likely_mapped_shared
     is more significant, since the logic changed from
     (folio_estimated_sharers) 'estimate the number of sharers of a folio' to
     (folio_likely_mapped_shared) 'estimate if the folio is mapped into the page
     tables of more than one MM'.

   * Since this is an estimation, the results may be unpredictable
     (especially for bigger f

[Kernel-packages] [Bug 2075575] Re: kexec fails in LPAR when some cpus are disabled

2024-09-26 Thread Frank Heimes

Patch/commit has landed in noble master-next, hence updating noble to
Fix Committed.

** Changed in: ubuntu-power-systems
   Status: Triaged => In Progress

** Changed in: linux (Ubuntu Noble)
   Status: Triaged => Fix Committed

-- 
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/2075575

Title:
  kexec fails in LPAR when some cpus are disabled

Status in The Ubuntu-power-systems project:
  In Progress
Status in linux package in Ubuntu:
  Fix Released
Status in linux source package in Jammy:
  Triaged
Status in linux source package in Noble:
  Fix Committed
Status in linux source package in Oracular:
  Fix Released

Bug description:
  == Comment: #0 - SEETEENA THOUFEEK - 2024-08-02 03:11:31 ==
  +++ This bug was initially created as a clone of Bug #206083 +++

  ---Problem Description---
  kexec fails in LPAR when some cpus are disabled
   
  Contact Information = sthou...@in.ibm.com 
   
  Machine Type = na 
   
  ---uname output---
  na
   
  ---Debugger---
  A debugger is not configured
   
  ---Steps to Reproduce---
   Summary:
  At L1 level, kexec fails if some of the cpus in the machine are disabled.

  
  Distros and kernel  versions used:
  1. Distro versions used

a. L1 LPAR :

b. L2 :

  
  Repro steps:
  1. Boot into an L1 lpar
  2. Disable some cpus (eg: ppc64_cpu --cores-on=3)
  3. Try to kexec. 

  
  This bug is reproducible only when we load the target kernel/initrd and use
  "kexec -e" as follows:

  kexec -l --initrd initramfs-$(uname -r).img vmlinuz-$(uname -r) \
    --append="$(cat /proc/cmdline)"

  kexec -e

  
  kexec works fine if we do a normal kexec without skipping the shutdown path

  kexec --initrd initramfs-$(uname -r).img vmlinuz-$(uname -r) \
    --append="$(cat /proc/cmdline)"
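
The difference between the two invocation styles above can be sketched as
follows. This is only an illustration under stated assumptions: the helper
name kexec_commands() is hypothetical, the kernel release and command line
in the usage part are placeholders, and nothing is executed. The argument
lists mirror the commands quoted in this report.

```python
import shlex


def kexec_commands(kernel_release, cmdline, skip_shutdown):
    """Return the kexec invocations as argument lists (nothing is run).

    skip_shutdown=True  -> load with "kexec -l", then jump with "kexec -e"
                           (the variant that fails with disabled CPUs)
    skip_shutdown=False -> a single plain "kexec" call
                           (the variant reported to work)
    """
    initrd = f"initramfs-{kernel_release}.img"
    kernel = f"vmlinuz-{kernel_release}"
    common = ["--initrd", initrd, kernel, f"--append={cmdline}"]
    if skip_shutdown:
        return [["kexec", "-l", *common], ["kexec", "-e"]]
    return [["kexec", *common]]


if __name__ == "__main__":
    # Placeholder values; on a real system they would come from
    # "uname -r" and /proc/cmdline as in the commands above.
    for argv in kexec_commands("6.8.0-test", "root=/dev/sda1 ro", True):
        print(shlex.join(argv))
```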

  
  Fix is upstream now:
  https://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux.git/commit/?id=21a741eb75f80397e5f7d3739e24d7d75e619011

  Thanks,
  Sourabh Jain

  please include in Ubuntu

   
  Oops output:
   no
   
  Stack trace output:
   no
   
  System Dump Info:
The system is not configured to capture a system dump.
   
  *Additional Instructions for sthou...@in.ibm.com: 
  -Attach sysctl -a output to the bug.

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu-power-systems/+bug/2075575/+subscriptions


-- 
Mailing list: https://launchpad.net/~kernel-packages
Post to : kernel-packages@lists.launchpad.net
Unsubscribe : https://launchpad.net/~kernel-packages
More help   : https://help.launchpad.net/ListHelp

[Kernel-packages] [Bug 1564475] Re: 128M is not enough for kdump on s390 LPARs

2024-09-24 Thread Frank Heimes

** Tags added: petest-459

-- 
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to makedumpfile in Ubuntu.
https://bugs.launchpad.net/bugs/1564475

Title:
  128M is not enough for kdump on s390 LPARs

Status in Ubuntu on IBM z Systems:
  Fix Released
Status in makedumpfile package in Ubuntu:
  Invalid
Status in s390-tools package in Ubuntu:
  Fix Released
Status in zipl-installer package in Ubuntu:
  Fix Released
Status in makedumpfile source package in Xenial:
  Invalid
Status in s390-tools source package in Xenial:
  Fix Released
Status in zipl-installer source package in Xenial:
  Fix Released

Bug description:
  == Comment: #0 - Michael Holzheu - 2016-03-31 10:59:26 ==
  With the current Ubuntu default setting "crashkernel=128M" kdump on LPARs
  crashes with out-of-memory (see attachment "dmesg_lpar_out_of_mem_128M.txt").

  On z/VM guests 128M seems to be sufficient.

  One reason on our test LPAR is that a lot of devices are attached (see
  attachment "lscss_lpar.txt") which are not required for kdump but
  consume a lot of memory because the s390 CIO layer allocates data
  structures in the kernel for those devices.

  We can disable the devices by using the "cio_ignore=" kernel parameter
  in "/etc/default/kdump-tools". For example, on our LPAR that uses DASD
  0.0.e934 for /var/crash, we added the following line to disable the
  devices:

  KDUMP_CMDLINE_APPEND="irqpoll maxcpus=1 cio_ignore=all,!condev,!0.0.e934"

  For more information on the "cio_ignore=" kernel parameter see:
  https://github.com/torvalds/linux/blob/master/Documentation/s390/CommonIO

  Even with "cio_ignore=" we still get out-of-memory with
  "crashkernel=128M".

  With "crashkernel=196M" and "cio_ignore=" we are able to create a dump
  on our LPAR. We currently do not know why kdump with "cio_ignore=" on
  LPAR consumes more memory than on z/VM guests.
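
The KDUMP_CMDLINE_APPEND line shown above can be generated from a list of
devices that kdump still needs. The following is a minimal sketch and not
part of the original report: the function name kdump_cmdline_append() is
hypothetical, and the bus ID pattern (three hex fields such as 0.0.e934)
is an assumption based on the example device in this comment.

```python
import re

# Assumed shape of a device bus ID, e.g. "0.0.e934".
_BUS_ID = re.compile(r"[0-9a-f]\.[0-9a-f]\.[0-9a-f]{4}")


def kdump_cmdline_append(required_devices, base_options=("irqpoll", "maxcpus=1")):
    """Build the KDUMP_CMDLINE_APPEND line for /etc/default/kdump-tools.

    All devices are ignored except the console device (!condev) and the
    devices listed in required_devices.
    """
    for dev in required_devices:
        if not _BUS_ID.fullmatch(dev):
            raise ValueError(f"unexpected device bus ID: {dev!r}")
    exceptions = ["!condev"] + [f"!{dev}" for dev in required_devices]
    cio_ignore = ",".join(["all"] + exceptions)
    options = list(base_options) + [f"cio_ignore={cio_ignore}"]
    return 'KDUMP_CMDLINE_APPEND="{}"'.format(" ".join(options))


if __name__ == "__main__":
    # 0.0.e934 is the DASD used for /var/crash in the example above.
    print(kdump_cmdline_append(["0.0.e934"]))
```

For stacked setups (LVM, network dump) the list of required devices has to
be determined by other means, which is the hard part of automating this.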

  == Comment: #1 - Michael Holzheu - 2016-03-31 11:03:15 ==
  Kernel messages of kdump out-of-memory crash on LPAR with many devices
  without cio_ignore parameter and 128M crashkernel memory.

  == Comment: #2 - Michael Holzheu - 2016-03-31 11:04:10 ==
  Output of lscss showing all attached (not online) devices on the LPAR.

  == Comment: #3 - Michael Holzheu - 2016-03-31 11:07:35 ==
  To solve this issue our recommendation is:

  1) Increase "crashkernel=" default to 196M on Ubuntu for s390.

  2) Document that KDUMP_CMDLINE_APPEND with "cio_ignore=" can be used
  to decrease memory consumption for kdump on systems with many devices
  that are not required for kdump.

  The most user friendly solution would be to automatically determine
  the required kdump devices and set the correct "cio_ignore=" kernel
  parameter. But this is not trivial, because it can be difficult to
  find out the required devices for stacked setups like LVM or for
  network dump.

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu-z-systems/+bug/1564475/+subscriptions



[Kernel-packages] [Bug 1959940] Re: [22.10 FEAT] KVM: Secure Execution guest dump encryption with customer keys - kernel part

2024-09-23 Thread Frank Heimes

** Tags added: petest-436

-- 
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/1959940

Title:
  [22.10 FEAT] KVM: Secure Execution guest dump encryption with customer
  keys - kernel part

Status in Ubuntu on IBM z Systems:
  Fix Released
Status in linux package in Ubuntu:
  Fix Released
Status in linux source package in Jammy:
  Fix Committed

Bug description:
  SRU Justification:

  [ Impact ]

   * Hypervisor-initiated dumps for Secure Execution
 (aka confidential computing) guests are not helpful,
 because memory and CPU state is encrypted by a
 transient key only available to the Ultravisor (uv).

   * Workload owners can still configure kdump in order to obtain kernel
     crash information, but there are situations where kdump doesn't work.

   * In such situations problem determination is severely impeded.

   * This patch set solves this by implementing dumps created in a way
 that can only be decrypted by the owner of the guest image
 and be used for problem determination.

  [ Test Plan ]

   * The setup of a Secure Execution environment is not trivial
     and requires a certain set of hardware (IBM z15 or higher
     with FC 115).

   * On top of the modifications of qemu that are handled in this
     LP bug, modifications of the kernel (LP#1959940) and
     the s390-tools (LP#1959965) are required.

   * So at least a modified kernel and qemu test builds are needed
 or both should be in -proposed at the same time (which might
 be difficult).
 A modified s390-tools is not urgently needed, since for the
 verification of the kernel and qemu part a newer version
 can be used (but a modified s390-tools is also available in PPA).

   * A detailed description (using Ubuntu as example) on how to setup
 secure execution is available here:
 Introducing IBM Secure Execution for Linux, April 2024 update
 https://www.ibm.com/docs/en/linuxonibm/pdf/lx24se04.pdf

   * And information on 'Working with dumps of KVM guests in
     IBM Secure Execution mode' is available here:
     https://www.ibm.com/docs/en/linux-on-systems?topic=commands-zgetdump#czgetdump__se_dump_examples

  [ Where problems could occur ]

   * Ultravisor (uv) return codes are introduced, which is
 generally appreciated. Just the right return codes need to be set
 (and reacted upon).

   * Protected virtual machine dumps are newly introduced on top of
 dump of 'normal' KVM VMs.
 Since code is shared, it could have an unforeseen impact.

   * The doc renaming could lead to confusion,
 if people rely on old doc structure.

   * The new capability case (217) could cause issues,
     for example in case of issues during initialization.
   
   * CPU dump functionality was added (mainly as new s390x specific code
 under s390/kvm), but CPU dump is only one part,
 if not working correctly, it may lead to partially useless dump data.

   * Configuration dump functionality was also added
 (again mainly as new s390x specific code under s390/kvm),
 similar to CPU dump.
 And moving from dumping inside of a VM to dumping from outside
 (due to potential failures if done inside), might lead to a more
 complex flow (now involving the uv), hence could be more error prone.

   * Adding query dump information, requires user space buffers.
 Here it's crucial that buffer size is big enough.

   * The newly added constants and structure definitions that are
 needed for dump support could become problematic in case wrong
 data types were used (applies to all header modifications).

   * An IOCTL for PV information retrieval got introduced
     (kvm_s390_handle_pv_info, kvm_s390_handle_pv).
     There are potential side effects (see man ioctl),
     hence all potential failure cases should be covered.

   * The new dump feature needs to know how much memory is required, but if
     the call that determines this is incorrect, it could break the dump process.

   * uv_cb_header struct changed to offset representation,
 but using wrong offsets will lead to a wrong struct,
 dump issues and potential crashes.

  [ Other Info ]

   * Since 22.04 is a popular LTS release, it is already in use by many
     secure execution customers.
     But in case of severe crashes or issues in the secure execution
     (KVM) guests, dumps cannot be used as of today.

   * This enables customers, IBM and Canonical to get support in case of
 crashes/dumps on hardware that runs secure execution environments.

  __

  KVM: Secure Execution guest dump encryption with customer keys -
  kernel part

  Description:
  Hypervisor-initiated dumps for Secure Execution guests are not helpful
  because memory and CPU state is encrypted by a transient key only available to
  the Ultravisor.  Workload owners can still configure kdump in order to obtain
  kernel crash information, bu

[Kernel-packages] [Bug 2003374] Re: Undefined Behavior Sanitizer (UBSAN) causes failure to match symbols

2024-09-23 Thread Frank Heimes

** Tags added: petest-450

-- 
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/2003374

Title:
  Undefined Behavior Sanitizer (UBSAN) causes failure to match symbols

Status in dh-kpatches:
  Unknown
Status in Ubuntu on IBM z Systems:
  Fix Committed
Status in kpatch package in Ubuntu:
  In Progress
Status in linux package in Ubuntu:
  Fix Released
Status in kpatch source package in Jammy:
  In Progress
Status in linux source package in Jammy:
  Fix Released
Status in kpatch source package in Kinetic:
  Won't Fix

Bug description:
  [ Impact ]

   * When UBSAN is enabled in an s390x kernel configuration,
  kpatch-build can fail to find matching symbols in the vmlinux symbol
  table (see attached example_livepatch.patch). This was discovered in
  both Jammy 5.15 and Kinetic 5.19 kernels, where UBSAN was first enabled
  (releases up to Focal did not enable UBSAN). See attached kpatch-build
  console output (output.log) and kpatch-build log (build.log).

  * Disabling UBSAN in s390x kernel configurations resolved the issue
  for both Jammy 5.15 and Kinetic 5.19. Possibly this could be fixed in
  kpatch/kpatch-build to continue to enable UBSAN while still allowing
  Livepatch functionality.

  [ Test Plan ]

   * Use kpatch-build testcases to build and load a fs/proc/meminfo.c
  Livepatch on s390x kernel (see attached example_livepatch.patch). This
  should be successful.

  [ Where problems could occur ]

   * A fix in kpatch/kpatch-build to properly handle UBSAN objects
  shouldn't yield any regressions. If UBSAN is disabled to ultimately
  get past this issue, it could lead to undefined behavior not being
  caught.

To manage notifications about this bug go to:
https://bugs.launchpad.net/dh-kpatches/+bug/2003374/+subscriptions



[Kernel-packages] [Bug 2072641] Re: [UBUNTU 22.04] KVM: s390: unhandled guest LPSWEY instruction

2024-09-23 Thread Frank Heimes

** Tags added: petest-448

-- 
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/2072641

Title:
  [UBUNTU 22.04] KVM: s390: unhandled guest LPSWEY instruction

Status in Ubuntu on IBM z Systems:
  Fix Committed
Status in linux package in Ubuntu:
  Fix Committed
Status in linux source package in Jammy:
  Fix Committed
Status in linux source package in Noble:
  Fix Committed
Status in linux source package in Oracular:
  Fix Committed

Bug description:
  Description:   KVM: s390: unhandled guest LPSWEY instruction

  Symptom:       guest kernel oops on LPSWEY instruction

  Problem:       in rare cases like machine check injection with
                 PSW disabled, all load PSW instructions are
                 intercepted. LPSW and LPSWE are handled by KVM but
                 not the new LPSWEY

  Solution:      Provide an LPSWEY handler in KVM.

  Reproduction:  hotplug a device while a CPU is disabled for machine
                 checks, e.g. during early boot

  Upstream-ID:   4c6abb7f7b349f00c0f7ed5045bf67759c012892

  Preventive:    yes
  Reported:      upstream
  Component:     kernel

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu-z-systems/+bug/2072641/+subscriptions



[Kernel-packages] [Bug 2071471] Re: [UBUNTU 24.04] IOMMU DMA mode changed in kernel config causes massive throughput degradation for PCI-related network workloads

2024-09-23 Thread Frank Heimes

** Tags added: petest-441

-- 
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/2071471

Title:
  [UBUNTU 24.04] IOMMU DMA mode changed in kernel config causes massive
  throughput degradation for PCI-related network workloads

Status in Ubuntu on IBM z Systems:
  Fix Released
Status in linux package in Ubuntu:
  Fix Released
Status in linux source package in Noble:
  Fix Released
Status in linux source package in Oracular:
  Fix Released

Bug description:
  SRU Justification:

  [Impact]

   * With the introduction of c76c067e488c "s390/pci: Use dma-iommu layer"
     (upstream since kernel v6.7-rc1) there was a move (on s390x only)
     to a different dma-iommu implementation.

   * And with 92bce97f0c34 "s390/pci: Fix reset of IOMMU software counters"
     (again upstream since v6.7-rc1) the IOMMU_DEFAULT_DMA_LAZY kernel config
     option should now be set to 'yes' by default for s390x.

   * Since CONFIG_IOMMU_DEFAULT_DMA_STRICT and IOMMU_DEFAULT_DMA_LAZY
     are related to each other CONFIG_IOMMU_DEFAULT_DMA_STRICT needs to be
     set to "no" by default, which was upstream done by b2b97a62f055
     "Revert "s390: update defconfigs"".

   * These changes are all upstream, but were not picked up by the Ubuntu
     kernel config.

   * And not having these config options set properly is causing significant
     PCI-related network throughput degradation (up to -72%).

   * This shows for almost all workloads and numbers of connections,
     deteriorating with the number of connections increasing.

   * Especially drastic is the drop for a high number of parallel connections
     (50 and 250) and for small and medium-size transactional workloads.
     However, also for streaming-type workloads the degradation is clearly
     visible (up to 48% degradation).

  [Fix]

   * The (upstream accepted) fix is to set
     IOMMU_DEFAULT_DMA_STRICT=no
     and
     IOMMU_DEFAULT_DMA_LAZY=y
     (which is needed for the changed DMA IOMMU implementation since v6.7).

  [Test Case]

   * Set up two Ubuntu Server 24.04 systems (with kernel 6.8)
     (one acting as server and one as client)
     that have (PCIe attached) RoCE Express devices attached
     and that are connected to each other.

   * Verify that the iommu_group type of the used PCI device is DMA-FQ:
 cat /sys/bus/pci/devices/\:00\:00.0/iommu_group/type
 DMA-FQ

   * Sample workload rr1c-200x1000-250 with rr1c-200x1000-250.xml:
     
     
     
     
     
     
     
     
     
     
     
     
     
     
     

   * Install uperf on both systems, client and server.

   * Start uperf at server: uperf -s

   * Start uperf at client: uperf -vai 5 -m uperf-profile.xml

   * Switch from strict to lazy mode
     either using the new kernel (or the test build below)
     or using kernel cmd-line parameter iommu.strict=0.

   * Restart uperf on server and client, like before.

   * Verification will be performed by IBM.
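
The sysfs check in the test case can be repeated for every PCI device at once. The following is a minimal Python sketch, not part of the original bug report: the function names `iommu_group_types` and `devices_not_in_mode` are made up for illustration, and the only facts taken from the test case are the `iommu_group/type` attribute and the expected value DMA-FQ.

```python
from pathlib import Path


def iommu_group_types(pci_root="/sys/bus/pci/devices"):
    """Map each PCI device address to the content of its iommu_group/type file."""
    result = {}
    root = Path(pci_root)
    if not root.is_dir():
        return result
    for dev in sorted(root.iterdir()):
        type_file = dev / "iommu_group" / "type"
        try:
            result[dev.name] = type_file.read_text().strip()
        except OSError:
            # Device without an IOMMU group, or attribute not readable: skip it.
            continue
    return result


def devices_not_in_mode(types, expected="DMA-FQ"):
    """Return the addresses whose IOMMU group type differs from the expected one."""
    return sorted(addr for addr, mode in types.items() if mode != expected)


if __name__ == "__main__":
    types = iommu_group_types()
    for addr, mode in types.items():
        print(f"{addr}: {mode}")
    print("not DMA-FQ:", devices_not_in_mode(types))
```

Running it before and after switching the mode (new kernel or iommu.strict=0) gives a quick per-device comparison without typing the device address by hand.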

  [Regression Potential]

   * There is a certain regression potential, since the behavior with
     the two modified kernel config options will change significantly.

   * This may solve the (network) throughput issue with PCI devices,
     but may also come with side-effects on other PCIe based devices
     (the old compression adapters or the new NVMe carrier cards).

  [Other]

   * CCW devices are not affected.

   * This is s390x-specific only, hence will not affect any other
  architecture.

  __

  Symptom:
  Comparing Ubuntu 24.04 (kernel version: 6.8.0-31-generic) against Ubuntu 
22.04, all of our PCI-related network measurements on LPAR show massive 
throughput degradations (up to -72%). This shows for almost all workloads and 
numbers of connections, deteriorating with the number of connections 
increasing. Especially drastic is the drop for a high number of parallel 
connections (50 and 250) and for small and medium-size transactional workloads. 
However, also for streaming-type workloads the degradation is clearly visible 
(up to 48% degradation).

  Problem:
  With kernel config setting CONFIG_IOMMU_DEFAULT_DMA_STRICT=y, IOMMU DMA mode 
changed from lazy to strict, causing these massive degradations.
  Behavior can also be changed with a kernel commandline parameter 
(iommu.strict) for easy verification.

  The issue is known and was quickly fixed upstream in December 2023, after 
being present for a little less than two weeks.
  Upstream fix: 
https://github.com/torvalds/linux/commit/b2b97a62f055dd638f7f02087331a8380d8f139a

  Repro:
  rr1c-200x1000-250 with rr1c-200x1000-250.xml:

[Kernel-packages] [Bug 2072760] Re: [24.10 FEAT] [KRN1911] Vertical CPU Polarization Support Stage 2

2024-09-23 Thread Frank Heimes

** Tags added: petest-443

-- 
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/2072760

Title:
  [24.10 FEAT] [KRN1911] Vertical CPU Polarization Support Stage 2

Status in Ubuntu on IBM z Systems:
  Fix Released
Status in linux package in Ubuntu:
  Fix Released

Bug description:
  Description:

  The Linux kernel already provides a mechanism to switch from horizontal to 
vertical polarization. However, the current implementation does not tell the 
Linux scheduler about the "cpu capacity" of the different types of vertical 
cpus.
  For vertical high cpus, cpu capacity is 100%. For vertical low cpus it should 
be close to 0% (should not be used). The difficulty is telling the cpu capacity 
of vertical medium cpus.

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu-z-systems/+bug/2072760/+subscriptions



[Kernel-packages] [Bug 2072661] Re: [24.10 FEAT] [KRN1905] Kernel image in vmalloc space (V!=R)

2024-09-23 Thread Frank Heimes

** Tags added: petest-442

-- 
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/2072661

Title:
  [24.10 FEAT] [KRN1905] Kernel image in vmalloc space (V!=R)

Status in Ubuntu on IBM z Systems:
  Fix Released
Status in linux package in Ubuntu:
  Fix Released

Bug description:
  The kernel is currently mapped in a memory area which is backed with physical 
pages. Therefore virtual addresses are always the same as real addresses (V=R).
  This item is about moving the kernel image into vmalloc space, where random 
physical pages are used to map virtual pages (V!=R).

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu-z-systems/+bug/2072661/+subscriptions



[Kernel-packages] [Bug 2075575] Re: kexec fails in LPAR when some cpus are disabled

2024-09-19 Thread Frank Heimes

** Changed in: linux (Ubuntu Oracular)
   Status: Fix Committed => Fix Released

-- 
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/2075575

Title:
  kexec fails in LPAR when some cpus are disabled

Status in The Ubuntu-power-systems project:
  Triaged
Status in linux package in Ubuntu:
  Fix Released
Status in linux source package in Jammy:
  Triaged
Status in linux source package in Noble:
  Triaged
Status in linux source package in Oracular:
  Fix Released

Bug description:
  == Comment: #0 - SEETEENA THOUFEEK - 2024-08-02 03:11:31 ==
  +++ This bug was initially created as a clone of Bug #206083 +++

  ---Problem Description---
  kexec fails in LPAR when some cpus are disabled
   
  Contact Information = sthou...@in.ibm.com 
   
  Machine Type = na 
   
  ---uname output---
  na
   
  ---Debugger---
  A debugger is not configured
   
  ---Steps to Reproduce---
   Summary:
  At L1 level, kexec fails if some of the cpus in the machine are disabled.

  
  Distros and kernel  versions used:
  1. Distro versions used

a. L1 LPAR :

b. L2 :

  
  Repro steps:
  1. Boot into an L1 lpar
  2. Disable some cpus (eg: ppc64_cpu --cores-on=3)
  3. Try to kexec. 

  
  This bug is reproducible only when we load the target kernel/initrd and use 
"kexec -e" as follows:

  kexec -l --initrd initramfs-$(uname -r).img vmlinuz-$(uname -r)
  --append="$(cat /proc/cmdline)"

  kexec -e

  
  kexec works fine if we do a normal kexec without skipping the shutdown path

  kexec --initrd initramfs-$(uname -r).img vmlinuz-$(uname -r)
  --append="$(cat /proc/cmdline)"
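
The reproduction depends on some CPUs actually being offline after step 2. The following is a small Python sketch, not taken from the bug report, for confirming that precondition before running kexec -l and kexec -e. It reads the standard `present` and `online` CPU lists under /sys/devices/system/cpu; the helper names `parse_cpu_list` and `offline_cpus` are hypothetical.

```python
def parse_cpu_list(text):
    """Expand a kernel CPU list such as '0-3,8,10-11' into a set of ints."""
    cpus = set()
    for part in text.strip().split(","):
        if not part:
            continue
        if "-" in part:
            low, high = part.split("-", 1)
            cpus.update(range(int(low), int(high) + 1))
        else:
            cpus.add(int(part))
    return cpus


def offline_cpus(present_text, online_text):
    """Return the sorted CPU numbers that are present but not online."""
    return sorted(parse_cpu_list(present_text) - parse_cpu_list(online_text))


if __name__ == "__main__":
    base = "/sys/devices/system/cpu/"
    with open(base + "present") as f:
        present = f.read()
    with open(base + "online") as f:
        online = f.read()
    missing = offline_cpus(present, online)
    print("offline CPUs:", missing if missing else "none")
```

If the list is empty, the system is not in the state the bug report describes and the kexec -e failure is not expected to reproduce.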

  
  Fix is upstream now:
  
https://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux.git/commit/?id=21a741eb75f80397e5f7d3739e24d7d75e619011

  Thanks,
  Sourabh Jain

  please include in Ubuntu

   
  Oops output:
   no
   
  Stack trace output:
   no
   
  System Dump Info:
The system is not configured to capture a system dump.
   
  *Additional Instructions for sthou...@in.ibm.com: 
  -Attach sysctl -a output to the bug.

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu-power-systems/+bug/2075575/+subscriptions



[Kernel-packages] [Bug 2072661] Re: [24.10 FEAT] [KRN1905] Kernel image in vmalloc space (V!=R)

2024-09-16 Thread Frank Heimes

** Changed in: linux (Ubuntu)
   Status: Fix Committed => Fix Released

** Changed in: ubuntu-z-systems
   Status: Fix Committed => Fix Released

-- 
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/2072661

Title:
  [24.10 FEAT] [KRN1905] Kernel image in vmalloc space (V!=R)

Status in Ubuntu on IBM z Systems:
  Fix Released
Status in linux package in Ubuntu:
  Fix Released

Bug description:
  The kernel is currently mapped in a memory area which is backed with physical 
pages. Therefore virtual addresses are always the same as real addresses (V=R).
  This item is about moving the kernel image into vmalloc space, where random 
physical pages are used to map virtual pages (V!=R).

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu-z-systems/+bug/2072661/+subscriptions



[Kernel-packages] [Bug 2071471] Re: [UBUNTU 24.04] IOMMU DMA mode changed in kernel config causes massive throughput degradation for PCI-related network workloads

2024-09-16 Thread Frank Heimes

** Changed in: ubuntu-z-systems
   Status: Fix Committed => Fix Released

-- 
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/2071471

Title:
  [UBUNTU 24.04] IOMMU DMA mode changed in kernel config causes massive
  throughput degradation for PCI-related network workloads

Status in Ubuntu on IBM z Systems:
  Fix Released
Status in linux package in Ubuntu:
  Fix Released
Status in linux source package in Noble:
  Fix Released
Status in linux source package in Oracular:
  Fix Released

Bug description:
  SRU Justification:

  [Impact]

   * With the introduction of c76c067e488c "s390/pci: Use dma-iommu layer"
     (upstream since kernel v6.7-rc1) there was a move (on s390x only)
     to a different dma-iommu implementation.

   * And with 92bce97f0c34 "s390/pci: Fix reset of IOMMU software counters"
     (again upstream since v6.7-rc1) the IOMMU_DEFAULT_DMA_LAZY kernel config
     option should now be set to 'yes' by default for s390x.

   * Since CONFIG_IOMMU_DEFAULT_DMA_STRICT and IOMMU_DEFAULT_DMA_LAZY
     are related to each other CONFIG_IOMMU_DEFAULT_DMA_STRICT needs to be
     set to "no" by default, which was upstream done by b2b97a62f055
     "Revert "s390: update defconfigs"".

   * These changes are all upstream, but were not picked up by the Ubuntu
     kernel config.

   * And not having these config options set properly is causing significant
     PCI-related network throughput degradation (up to -72%).

   * This shows for almost all workloads and numbers of connections,
     deteriorating with the number of connections increasing.

   * Especially drastic is the drop for a high number of parallel connections
     (50 and 250) and for small and medium-size transactional workloads.
     However, also for streaming-type workloads the degradation is clearly
     visible (up to 48% degradation).

  [Fix]

   * The (upstream accepted) fix is to set
     IOMMU_DEFAULT_DMA_STRICT=no
     and
     IOMMU_DEFAULT_DMA_LAZY=y
     (which is needed for the changed DMA IOMMU implementation since v6.7).

  [Test Case]

   * Set up two Ubuntu Server 24.04 systems (with kernel 6.8)
     (one acting as server and one as client)
     that have (PCIe attached) RoCE Express devices attached
     and that are connected to each other.

   * Verify that the iommu_group type of the used PCI device is DMA-FQ:
 cat /sys/bus/pci/devices/\:00\:00.0/iommu_group/type
 DMA-FQ

   * Sample workload rr1c-200x1000-250 with rr1c-200x1000-250.xml:
     
     
     
     
     
     
     
     
     
     
     
     
     
     
     

   * Install uperf on both systems, client and server.

   * Start uperf at server: uperf -s

   * Start uperf at client: uperf -vai 5 -m uperf-profile.xml

   * Switch from strict to lazy mode
     either using the new kernel (or the test build below)
     or using kernel cmd-line parameter iommu.strict=0.

   * Restart uperf on server and client, like before.

   * Verification will be performed by IBM.

  [Regression Potential]

   * There is a certain regression potential, since the behavior with
     the two modified kernel config options will change significantly.

   * This may solve the (network) throughput issue with PCI devices,
     but may also come with side-effects on other PCIe based devices
     (the old compression adapters or the new NVMe carrier cards).

  [Other]

   * CCW devices are not affected.

   * This is s390x-specific only, hence will not affect any other
  architecture.

  __

  Symptom:
  Comparing Ubuntu 24.04 (kernel version: 6.8.0-31-generic) against Ubuntu 
22.04, all of our PCI-related network measurements on LPAR show massive 
throughput degradations (up to -72%). This shows for almost all workloads and 
numbers of connections, deteriorating with the number of connections 
increasing. Especially drastic is the drop for a high number of parallel 
connections (50 and 250) and for small and medium-size transactional workloads. 
However, also for streaming-type workloads the degradation is clearly visible 
(up to 48% degradation).

  Problem:
  With kernel config setting CONFIG_IOMMU_DEFAULT_DMA_STRICT=y, IOMMU DMA mode 
changed from lazy to strict, causing these massive degradations.
  Behavior can also be changed with a kernel commandline parameter 
(iommu.strict) for easy verification.

  The issue is known and was quickly fixed upstream in December 2023, after 
being present for a little less than two weeks.
  Upstream fix: 
https://github.com/torvalds/linux/commit/b2b97a62f055dd638f7f02087331a8380d8f139a

  Repro:
  rr1c-200x1000-250 with rr1c-200x1000-250.xml:

[Kernel-packages] [Bug 2072760] Re: [24.10 FEAT] [KRN1911] Vertical CPU Polarization Support Stage 2

2024-09-16 Thread Frank Heimes

** Changed in: ubuntu-z-systems
   Status: Fix Committed => Fix Released

-- 
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/2072760

Title:
  [24.10 FEAT] [KRN1911] Vertical CPU Polarization Support Stage 2

Status in Ubuntu on IBM z Systems:
  Fix Released
Status in linux package in Ubuntu:
  Fix Released

Bug description:
  Description:

  The Linux kernel already provides a mechanism to switch from horizontal to 
vertical polarization. However, the current implementation does not tell the 
Linux scheduler about the "cpu capacity" of the different types of vertical 
cpus.
  For vertical high cpus, cpu capacity is 100%. For vertical low cpus it should 
be close to 0% (should not be used). The difficulty is telling the cpu capacity 
of vertical medium cpus.

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu-z-systems/+bug/2072760/+subscriptions



[Kernel-packages] [Bug 2074376] Re: Disable PCI_DYNAMIC_OF_NODES in Ubuntu

2024-09-16 Thread Frank Heimes

** Changed in: ubuntu-power-systems
   Status: Fix Committed => Fix Released

-- 
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/2074376

Title:
  Disable PCI_DYNAMIC_OF_NODES in Ubuntu

Status in The Ubuntu-power-systems project:
  Fix Released
Status in linux package in Ubuntu:
  Fix Released
Status in linux source package in Noble:
  Fix Released
Status in linux source package in Oracular:
  Fix Released

Bug description:
  This came in via KTML from upstream. It is part of a discussion
  between upstream and IBM reporting a bug which occurs in KVM:

  Rob Herring  writes:

  >> On 2024/07/11 06:20 AM, Rob Herring wrote:
  >>> On Wed, Jul 3, 2024 at 8:17 AM Amit Machhiwal wrote:
  >>>>
  >>>> With CONFIG_PCI_DYNAMIC_OF_NODES [1], a hot-plug and hot-unplug sequence
  >>>> of a PCI device attached to a PCI-bridge causes following kernel Oops on
  >>>> a pseries KVM guest:
  >>>
  >>> Can I ask why you have this option on in the first place? Do you have
  >>> a use for it or it's just a case of distros turn on every kconfig
  >>> option.
  >>
  >> Yes, this option is turned on in Ubuntu's distro kernel config where the
  >> issue was originally reported, while Fedora is keeping this turned off.
  >>
  >> root@ubuntu:~# cat /boot/config-6.8.0-38-generic | grep PCI_DYN
  >> CONFIG_PCI_DYNAMIC_OF_NODES=y
  > 
  > Ubuntu should turn off this option. For starters, it is not complete
  > to be usable. Eventually, it should get removed in favor of some TBD
  > runtime option.
  > 
  > (And we should fix the crash too)

  This option is described in the config system as:

This option enables support for generating device tree nodes for some
PCI devices. Thus, the driver of this kind can load and overlay
flattened device tree for its downstream devices.
.
Once this option is selected, the device tree nodes will be generated
for all PCI bridges.

  Open Firmware (OF) would be used for KVM for UEFI mode. The reported
  bug was related to hot-unplugging PCI devices. My guess would be that
  this probably is not of much use to the majority of users and might
  even go away. So it should really be disabled in Ubuntu, too.
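
The grep shown in the quoted mail can be generalized to report the state of any option, including ones that are explicitly unset. The following is a minimal Python sketch under stated assumptions: the function name `parse_kconfig` is made up for illustration, the "# CONFIG_... is not set" line is the usual kernel .config convention for a disabled option, and /boot/config-<release> is the location used in the quoted command.

```python
import platform


def parse_kconfig(text):
    """Parse kernel config text into {option: value}; unset options map to 'n'."""
    options = {}
    suffix = " is not set"
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith("# CONFIG_") and line.endswith(suffix):
            # Explicitly disabled option: "# CONFIG_FOO is not set"
            options[line[2:-len(suffix)]] = "n"
        elif line.startswith("CONFIG_") and "=" in line:
            name, value = line.split("=", 1)
            options[name] = value
    return options


if __name__ == "__main__":
    path = f"/boot/config-{platform.release()}"
    with open(path) as f:
        config = parse_kconfig(f.read())
    # An option that does not appear at all is reported as "absent".
    print("CONFIG_PCI_DYNAMIC_OF_NODES =",
          config.get("CONFIG_PCI_DYNAMIC_OF_NODES", "absent"))
```

On a kernel carrying the change requested here, the option would be expected to show up as unset or absent rather than `y`.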

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu-power-systems/+bug/2074376/+subscriptions



[Kernel-packages] [Bug 2060039] Re: [Ubuntu-24.04] FADump with recommended crash size is making the L1 hang

2024-09-13 Thread Frank Heimes

** Changed in: ubuntu-power-systems
   Status: In Progress => Fix Committed

-- 
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/2060039

Title:
  [Ubuntu-24.04] FADump with recommended crash size is making the L1
  hang

Status in The Ubuntu-power-systems project:
  Fix Committed
Status in linux package in Ubuntu:
  Fix Committed
Status in linux source package in Noble:
  Fix Committed
Status in linux source package in Oracular:
  Fix Committed

Bug description:
  SRU Justification:

  [Impact]
   * L1 host hangs when triggering FADump that results in crash

  [Fix]
   * 353d7a84c214f184d5a6b62acdec8b4424159b7c 353d7a84c214 
"powerpc/64s/radix/kfence: map __kfence_pool at page granularity"

  [Test Case]
   * Have an Ubuntu Server 24.04 LTS installation on ppc64el.
   * Enable FADump with 1GB: fadump=on crashkernel=1024M
   * A kernel panic will happen when the dump is triggered.
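
Before triggering the dump it can help to confirm that the running kernel was booted with the parameters from the test case. The following is a minimal Python sketch, not part of the original report: `parse_cmdline` and `fadump_settings` are hypothetical helper names, and the parser splits on whitespace only, so it does not handle quoted values containing spaces.

```python
def parse_cmdline(text):
    """Split a kernel command line into {key: value}; bare flags map to True."""
    params = {}
    for token in text.split():
        key, sep, value = token.partition("=")
        params[key] = value if sep else True
    return params


def fadump_settings(text):
    """Return the (fadump, crashkernel) values, or None where a key is missing."""
    params = parse_cmdline(text)
    return params.get("fadump"), params.get("crashkernel")


if __name__ == "__main__":
    with open("/proc/cmdline") as f:
        fadump, crashkernel = fadump_settings(f.read())
    print(f"fadump={fadump} crashkernel={crashkernel}")
```

For the test case above the expected result is fadump=on and crashkernel=1024M.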

  [Regression Potential]
  * There is a certain risk of a regression, but it is mapping only the memory
    allocated for KFENCE pool at page granularity, reducing memory consumption
    when KFENCE is used.

  * On top the commit is already upstream reviewed and accepted.

  * The modifications were done and tested by IBM.

  * The fadump feature is supported only on IBM POWER systems.

  [Other]
  * The fix/commit got upstream accepted with kernel v6.11-rc4,
    hence Oracular (with a planned kernel of 6.11) is not affected.

  ...

  Problem description :
  ==

  Triggered FADump with the recommended crash kernel size. The L1 host hung.

  As per the public document
  https://wiki.ubuntu.com/ppc64el/Recommendations the recommended crash
  kernel size is 1024M for the system. But with 1024M and 2048M, the L1
  hangs. With 4096, the crash dump is generated and collected.

  root@ubuntu2404:~# uname -ar
  Linux ubuntu2404 6.8.0-11-generic #11-Ubuntu SMP Wed Feb 14 00:33:03 UTC 2024 
ppc64le ppc64le ppc64le GNU/Linux

  root@ubuntu2404:~# free -h
                 total        used        free      shared  buff/cache   available
  Mem:            48Gi       1.7Gi        46Gi        13Mi       687Mi        46Gi
  Swap:          8.0Gi          0B       8.0Gi

  root@ubuntu2404:~# cat /proc/cmdline
  BOOT_IMAGE=/vmlinux-6.8.0-11-generic root=/dev/mapper/ubuntu--vg-ubuntu--lv 
ro fadump=on crashkernel=1024M

  root@ubuntu2404:~# dmesg | grep -i reser
  [0.00] fadump: Reserved 1024MB of memory at 0x004000 (System 
RAM: 51200MB)
  [0.00] fadump: Initialized 0x4000 bytes cma area at 1024MB from 
0x4007 bytes of memory reserved for firmware-assisted dump
  [0.00] Memory: 49316672K/52428800K available (23616K kernel code, 
4096K rwdata, 25536K rodata, 8832K init, 2487K bss, 2063552K reserved, 1048576K 
cma-reserved)
  [0.396408] ibmvscsi 3066: Client reserve enabled

  root@ubuntu2404:~# kdump-config show
  DUMP_MODE:        fadump
  USE_KDUMP:        1
  KDUMP_COREDIR:    /var/crash
     /var/lib/kdump/vmlinuz
  kdump initrd:
     /var/lib/kdump/initrd.img
  current state:    ready to fadump

  IBM is looking to update the crash kernel reservations section of the
  wiki for Power.

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu-power-systems/+bug/2060039/+subscriptions



[Kernel-packages] [Bug 2080474] Re: ubuntu installation failing for systems having SAN disk

2024-09-12 Thread Frank Heimes

** Also affects: subiquity
   Importance: Undecided
   Status: New

** Also affects: ubuntu-power-systems
   Importance: Undecided
   Status: New

** No longer affects: linux (Ubuntu)

** Changed in: ubuntu-power-systems
 Assignee: (unassigned) => Ubuntu on IBM Power Systems Bug Triage 
(ubuntu-power-triage)

** Changed in: ubuntu-power-systems
   Importance: Undecided => High

-- 
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/2080474

Title:
  ubuntu installation failing for systems having SAN disk

Status in subiquity:
  New
Status in The Ubuntu-power-systems project:
  New

Bug description:
  == Comment: - Anushree Mathur ==
  OS: 24.04 LTS (Noble Numbat)
  I started the Ubuntu installation for 24.04 LTS (Noble Numbat) on an L1 (host) 
system with a SAN disk. It failed with the following error just after I chose the disk.

  Ubuntu 24.04 LTS ubuntu-server hvc0

  
  connecting...  
  waiting for cloud-init...  
  generating crash report
  report saved to /var/crash/1724388235.797082424.ui.crash
  Traceback (most recent call last):
    File "/snap/subiquity/5745/lib/python3.10/site-packages/subiquity/client/controllers/filesystem.py", line 273, in _guided_choice
      self.ui.set_body(FilesystemView(self.model, self))
    File "/snap/subiquity/5745/lib/python3.10/site-packages/subiquity/ui/views/filesystem/filesystem.py", line 485, in __init__
      self.refresh_model_inputs()
    File "/snap/subiquity/5745/lib/python3.10/site-packages/subiquity/ui/views/filesystem/filesystem.py", line 540, in refresh_model_inputs
      self.avail_list.refresh_model_inputs()
    File "/snap/subiquity/5745/lib/python3.10/site-packages/subiquity/ui/views/filesystem/filesystem.py", line 417, in refresh_model_inputs
      for obj, cells in summarize_device(device, filter):
    File "/snap/subiquity/5745/lib/python3.10/site-packages/subiquity/ui/views/filesystem/helpers.py", line 32, in summarize_device
      anns = labels.annotations(device) + labels.usage_labels(device)
    File "/snap/subiquity/5745/usr/lib/python3.10/functools.py", line 889, in wrapper
      return dispatch(args[0].__class__)(*args, **kw)
    File "/snap/subiquity/5745/lib/python3.10/site-packages/subiquity/common/filesystem/labels.py", line 100, in _annotations_vg
      member = next(iter(vg.devices))
  StopIteration

  The above exception was the direct cause of the following exception:

  Traceback (most recent call last):
    File "/snap/subiquity/5745/usr/bin/subiquity", line 8, in
      sys.exit(main())
    File "/snap/subiquity/5745/lib/python3.10/site-packages/subiquity/cmd/tui.py", line 158, in main
      asyncio.run(run_with_loop())
    File "/snap/subiquity/5745/usr/lib/python3.10/asyncio/runners.py", line 44, in run
      return loop.run_until_complete(main)
    File "/snap/subiquity/5745/usr/lib/python3.10/asyncio/base_events.py", line 649, in run_until_complete
      return future.result()
    File "/snap/subiquity/5745/lib/python3.10/site-packages/subiquity/cmd/tui.py", line 156, in run_with_loop
      await subiquity_interface.run()
    File "/snap/subiquity/5745/lib/python3.10/site-packages/subiquity/client/client.py", line 403, in run
      await super().run()
    File "/snap/subiquity/5745/lib/python3.10/site-packages/subiquitycore/tui.py", line 351, in run
      await super().run()
    File "/snap/subiquity/5745/lib/python3.10/site-packages/subiquitycore/core.py", line 134, in run
      raise exc
  RuntimeError: coroutine raised StopIteration
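
The two tracebacks are linked by a Python rule: a StopIteration that escapes a coroutine is converted into a RuntimeError. The following is a minimal, self-contained Python sketch of that mechanism. It is not subiquity code and not the project's fix; the functions `first_member_unsafe`, `first_member_safe`, `show_view` and `run` are made up for illustration, and the empty list stands in for a volume group with no member devices.

```python
import asyncio


def first_member_unsafe(devices):
    # Same pattern as in the traceback: next() on an empty iterator
    # raises StopIteration.
    return next(iter(devices))


def first_member_safe(devices):
    # Supplying a default makes next() return None instead of raising.
    return next(iter(devices), None)


async def show_view(devices, pick):
    # The StopIteration propagates out of this coroutine, where Python
    # replaces it with RuntimeError("coroutine raised StopIteration").
    return pick(devices)


def run(devices, pick):
    try:
        return asyncio.run(show_view(devices, pick))
    except RuntimeError as exc:
        return f"RuntimeError: {exc}"


if __name__ == "__main__":
    print(run([], first_member_unsafe))
    print(run([], first_member_safe))
    print(run(["sda"], first_member_unsafe))
```

With an empty device list the unsafe variant reproduces the error class seen in the crash report, while the variant with a default returns None and leaves the decision about an empty volume group to the caller.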

  Ubuntu 24.04 LTS ubuntu-server hvc0

  
  connecting...  
  ProblemType: Bug
  Architecture: ppc64el
  CrashDB: {'impl': 'launchpad', 'project': 'subiquity'}

  I tried the following 2 installation methods; it fails with both:
  1) kexec method
  2) attaching vdvd and starting installer

  NOTE: This happens only when the system has SAN disks; otherwise the 
installation works fine.
  I will be attaching the crash report for this!

  == Comment:- Hariharan T S ==
  Verified the following cases. 
  Installation on Disk from VIOS  - PASSED
  Installation on Disk from VIOS and system had Disks from SAN - FAILED
  Installation on Disk from SAN and system had Disk from VIOS - FAILED

  == Comment:- Vaibhav Jain ==
  The problem seems to happen when Subiquity enters the disk partition view. The 
system has a SAN disk and an existing multipath DM volume on it.

  Mirroring to distro

To manage notifications about this bug go to:
https://bugs.launchpad.net/subiquity/+bug/2080474/+subscriptions



[Kernel-packages] [Bug 2075721] Re: [Ubuntu24.04] virsh detach-interface is crashing the guest

2024-09-10 Thread Frank Heimes

*** This bug is a duplicate of bug 2074376 ***
https://bugs.launchpad.net/bugs/2074376

** Changed in: linux (Ubuntu)
   Status: Fix Committed => Fix Released

** Changed in: ubuntu-power-systems
   Status: Fix Committed => Fix Released

-- 
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/2075721

Title:
  [Ubuntu24.04] virsh detach-interface is crashing the guest

Status in The Ubuntu-power-systems project:
  Fix Released
Status in linux package in Ubuntu:
  Fix Released

Bug description:
  == Comment: #0 - Kowshik Jois B S - 2024-05-28 01:07:02 ==
  ---Problem Description---
  While trying virsh attach-interface and virsh detach-interface, attaching an 
interface succeeds, but detaching the same interface crashes the guest with 
the trace messages below on the console.

  
  root@ubuntulp3guest1:~# [ 5363.726428] Kernel attempted to read user page 
(10ec0058) - exploit attempt? (uid: 0)
  [ 5363.726570] BUG: Unable to handle kernel data access on read at 
0x10ec0058
  [ 5363.726662] Faulting instruction address: 0xc12d4828
  [ 5363.726739] Oops: Kernel access of bad area, sig: 11 [#1]
  [ 5363.726800] LE PAGE_SIZE=64K MMU=Radix SMP NR_CPUS=2048 NUMA pSeries
  [ 5363.726880] Modules linked in: 8139too 8139cp mii qrtr cfg80211 
binfmt_misc uio_pdrv_genirq vmx_crypto uio dm_multipath nfnetlink ip_tables 
x_tables autofs4 btrfs blake2b_generic raid10 raid456 async_raid6_recov 
async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c raid1 raid0 
poly1305_p10_crypto chacha_p10_crypto libchacha crct10dif_vpmsum crc32c_vpmsum 
xhci_pci xhci_pci_renesas aes_gcm_p10_crypto
  [ 5363.727302] CPU: 0 PID: 1614 Comm: drmgr Not tainted 6.8.0-31-generic 
#31-Ubuntu
  [ 5363.727426] Hardware name: IBM pSeries (emulated by qemu) POWER10 (raw) 
0x800200 0xf06 of:SLOF,HEAD hv:linux,kvm pSeries
  [ 5363.727563] NIP:  c12d4828 LR: c12d68f0 CTR: 

  [ 5363.727653] REGS: c000149cb440 TRAP: 0300   Not tainted  
(6.8.0-31-generic)
  [ 5363.727742] MSR:  8280b033   CR: 
44088282  XER: 2004
  [ 5363.727855] CFAR: c12d68ec DAR: 10ec0058 DSISR: 4000 
IRQMASK: 0 
  [ 5363.727855] GPR00: c12d68f0 c000149cb6e0 c2254800 
10ec0048 
  [ 5363.727855] GPR04: c000149cb748   
 
  [ 5363.727855] GPR08:    
 
  [ 5363.727855] GPR12:  c3e8  
 
  [ 5363.727855] GPR16:    
 
  [ 5363.727855] GPR20:    
 
  [ 5363.727855] GPR24:   c48585a0 
c000149cb7d4 
  [ 5363.727855] GPR28: 0001 c00014de9400 10ec0048 
 
  [ 5363.728644] NIP [c12d4828] __of_changeset_entry_invert+0x10/0x1ac
  [ 5363.728732] LR [c12d68f0] __of_changeset_revert_entries+0x98/0x180
  [ 5363.728813] Call Trace:
  [ 5363.728845] [c000149cb7b0] [c12d6b60] 
of_changeset_revert+0x58/0xd8
  [ 5363.728937] [c000149cb800] [c0d0d498] 
of_pci_remove_node+0x74/0xb0
  [ 5363.729029] [c000149cb830] [c0cdbde0] 
pci_stop_bus_device+0xf4/0x138
  [ 5363.729126] [c000149cb870] [c0cdbf40] 
pci_stop_and_remove_bus_device_locked+0x34/0x64
  [ 5363.729232] [c000149cb8a0] [c0cf2950] remove_store+0xf0/0x108
  [ 5363.729311] [c000149cb8f0] [c0e88384] dev_attr_store+0x34/0x78
  [ 5363.729389] [c000149cb910] [c07f8234] sysfs_kf_write+0x70/0xa4
  [ 5363.729467] [c000149cb930] [c07f66a8] 
kernfs_fop_write_iter+0x1d0/0x2e0
  [ 5363.729558] [c000149cb980] [c06c8fc8] vfs_write+0x27c/0x558
  [ 5363.729639] [c000149cba30] [c06c9628] ksys_write+0x90/0x170
  [ 5363.729716] [c000149cba80] [c0033248] 
system_call_exception+0xf8/0x290
  [ 5363.729811] [c000149cbe50] [c000d05c] 
system_call_vectored_common+0x15c/0x2ec
  [ 5363.729903] --- interrupt: 3000 at 0x74191e15c720
  [ 5363.729964] NIP:  74191e15c720 LR: 74191e15c720 CTR: 

  [ 5363.730053] REGS: c000149cbe80 TRAP: 3000   Not tainted  
(6.8.0-31-generic)
  [ 5363.730143] MSR:  8280f033   
CR: 48088202  XER: 
  [ 5363.730257] IRQMASK: 0 
  [ 5363.730257] GPR00: 0004 7bdfb730 74191e296d00 
000b 
  [ 5363.730257] GPR04: 0be4ed58d640 0001  
0031 
  [ 5363.730257] GPR08:    
 
  [ 5363.730257] GPR12:  74191e3eb300  
 
  [ 5363.730

[Kernel-packages] [Bug 2070358] Re: [Ubuntu 24.04] FW1060.00 (NH1060_026) sosreport is running to Kernel OOPS crash

2024-09-10 Thread Frank Heimes

** Changed in: ubuntu-power-systems
   Status: Fix Committed => Fix Released

-- 
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/2070358

Title:
  [Ubuntu 24.04] FW1060.00 (NH1060_026) sosreport is running to Kernel
  OOPS crash

Status in The Ubuntu-power-systems project:
  Fix Released
Status in linux package in Ubuntu:
  Invalid
Status in sosreport package in Ubuntu:
  Invalid
Status in linux source package in Noble:
  Fix Released
Status in sosreport source package in Noble:
  Invalid

Bug description:
  SRU Justification:

  [Impact]
   * When the sosreport command is executed, a kernel oops occurs and the
     system crashes; depending on the configuration (this is the default),
     the system/LPAR reboots.

  [Fix]
   * e0011bca603c101f2a3c007bdb77f7006fa78fb1 e0011bca603c
     "nfsd: initialise nfsd_info.mutex early"

  [Test Case]
   * Have an Ubuntu Server 24.04 LTS installation on ppc64el.
   * One option is to simply run sosreport on the system; the crash is
     seen when sosreport starts to capture the dump.
   * The second option (without sosreport) is:
     * CONFIG_NFSD=m (or y) must be set.
     * Mount nfsd, if not already mounted, using the
       "$ mount -t nfsd nfsd /proc/fs/nfsd" command.
   * The kernel oops will happen and the logs will show:
     ...
     BUG: Kernel NULL pointer dereference on read at 0x
     Faulting instruction address: 0xc16ff114
     Oops: Kernel access of bad area, sig: 11 [#1]
     ...
   * On a system with a kernel that includes the above patch,
     no oops will occur and the sosreport command will execute normally.
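
The log check from the test case can be scripted. What follows is a minimal
sketch, assuming the kernel log has been saved as text (for example the
output of dmesg). The function name find_oops_lines is hypothetical and not
part of the bug report; the two patterns are copied from the log excerpt in
the test case.

```python
import re
from typing import List

# Signatures copied from the oops excerpt in the test case.
OOPS_PATTERNS = [
    re.compile(r"BUG: Kernel NULL pointer dereference"),
    re.compile(r"Oops: Kernel access of bad area"),
]


def find_oops_lines(log_text: str) -> List[str]:
    """Return every log line that matches one of the oops signatures."""
    return [
        line.strip()
        for line in log_text.splitlines()
        if any(pattern.search(line) for pattern in OOPS_PATTERNS)
    ]
```

On a kernel without the fix, the list is expected to be non-empty after
mounting nfsd; on a patched kernel it should stay empty.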

  [Regression Potential]
  * There is a certain risk of a regression, as with any code modification,
    and here in particular because the mutex handling in nfsd is modified.

  * But the changes are easy to trace.

  * In addition, the commit has already been reviewed and accepted upstream.

  * The modifications were done by the NFSD maintainer and also tested
    by IBM.

  [Other]
  * The fix/commit was accepted upstream in kernel v6.10-rc7,
    hence Oracular (with a planned kernel of >=6.10) is not affected.

  == Comment: #0 - Tasmiya Nalatwad - 2024-05-28 04:35:50 ==
  --- Description ---
  When the sosreport command is executed, a kernel oops occurs and the
  LPAR reboots. As kdump was enabled, the dump is captured.

  Note: The bug looks similar to Bug 206504, which is seen on z LPARs.

  --- Lpar Details ---
  1. PowerVM
  2. FW: FW1060.00 (NH1060_026)
  3. OS: Ubuntu 24.04
  4. Kernel: 6.8.0-31-generic
  5. Mem (free -mh): 47Gi
  6. cpus: 40

  --- Steps to reproduce ---
  1. Run the sosreport command on the LPAR; the crash is seen when sosreport
     starts to capture the dump.

  --- Traces ---
  root@ubuntulp2host:~# sosreport
  Please note the 'sosreport' command has been deprecated in favor of the new 'sos' command, E.G. 'sos report'.
  Redirecting to 'sos report '

  sosreport (version 4.5.6)

  This command will collect system configuration and diagnostic
  information from this Ubuntu system.

  For more information on Canonical visit:

  Community Website  : https://www.ubuntu.com/
  Commercial Support : https://www.canonical.com

  The generated archive may contain data considered sensitive and its
  content should be reviewed by the originating organization before being
  passed to any third party.

  No changes will be made to system configuration.

  Press ENTER to continue, or CTRL-C to quit.

  Optionally, please enter the case id that you are generating this
  report for []:

   Setting up archive ...
   Setting up plugins ...
  [plugin:lxd] skipped command 'lxc image list': required kmods missing: 
ip6table_nat, ip6table_raw, bpfilter, iptable_mangle, iptable_filter, 
iptable_raw, ebtable_filter, ip6table_mangle, ebtables, iptable_nat, 
ip6_tables, ip6table_filter.
  [plugin:lxd] skipped command 'lxc list': required kmods missing: 
ip6table_nat, ip6table_raw, bpfilter, iptable_mangle, iptable_filter, 
iptable_raw, ebtable_filter, ip6table_mangle, ebtables, iptable_nat, 
ip6_tables, ip6table_filter.
  [plugin:lxd] skipped command 'lxc network list': required kmods missing: 
ip6table_nat, ip6table_raw, bpfilter, iptable_mangle, iptable_filter, 
iptable_raw, ebtable_filter, ip6table_mangle, ebtables, iptable_nat, 
ip6_tables, ip6table_filter.
  [plugin:lxd] skipped command 'lxc profile list': required kmods missing: 
ip6table_nat, ip6table_raw, bpfilter, iptable_mangle, iptable_filter, 
iptable_raw, ebtable_filter, ip6table_mangle, ebtables, iptable_nat, 
ip6_tables, ip6table_filter.
  [plugin:lxd] skipped command 'lxc storage list': required kmods missing: 
ip6table_nat, ip6table_raw, bpfilter, iptable_mangle, iptable_filter, 
iptable_raw, ebtable_filter, ip6table_mangle, ebtables, iptable_nat, 
ip6_tables, ip6table_filter.
  [plugin:networking] skipped command 'ip -s macsec show': required kmods 
missing: macsec.   Use '--a

[Kernel-packages] [Bug 2072760] Re: [24.10 FEAT] [KRN1911] Vertical CPU Polarization Support Stage 2

2024-09-06 Thread Frank Heimes

The patch set was applied to the oracular 6.11 master-next tree;
updating the ticket to Fix Committed.

** Changed in: linux (Ubuntu)
   Status: In Progress => Fix Committed

** Changed in: ubuntu-z-systems
   Status: In Progress => Fix Committed

-- 
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/2072760

Title:
  [24.10 FEAT] [KRN1911] Vertical CPU Polarization Support Stage 2

Status in Ubuntu on IBM z Systems:
  Fix Committed
Status in linux package in Ubuntu:
  Fix Committed

Bug description:
  Description:

  The Linux kernel already provides a mechanism to switch from horizontal to
  vertical polarization. However, the current implementation does not tell the
  Linux scheduler about the "CPU capacity" of the different types of vertical
  CPUs.
  For vertical high CPUs the CPU capacity is 100%. For vertical low CPUs it
  should be close to 0% (they should not be used). The difficult part is
  determining the CPU capacity of vertical medium CPUs.
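
The capacity mapping described above can be sketched as follows. This is an
illustration only, not the kernel implementation: the function name
cpu_capacity, the polarization labels and the medium_fraction and
low_fraction parameters are hypothetical. The only value taken from Linux is
the scheduler's full-capacity scale of 1024. The vertical medium value is
deliberately left as a parameter, since the description names it as the open
question.

```python
SCHED_CAPACITY_SCALE = 1024  # value the Linux scheduler uses for 100% capacity


def cpu_capacity(polarization: str,
                 medium_fraction: float = 0.5,
                 low_fraction: float = 0.0) -> int:
    """Map a polarization type to a capacity on the 0..1024 scale.

    medium_fraction and low_fraction are placeholders (assumptions); the
    bug description does not state which values are actually used.
    """
    fractions = {
        "vertical_high": 1.0,                # always fully available
        "vertical_medium": medium_fraction,  # the hard case
        "vertical_low": low_fraction,        # should not be used
    }
    if polarization not in fractions:
        raise ValueError(f"unknown polarization: {polarization!r}")
    return round(SCHED_CAPACITY_SCALE * fractions[polarization])
```

With such a mapping a capacity-aware scheduler would prefer vertical high
CPUs and place little or no load on vertical low ones.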

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu-z-systems/+bug/2072760/+subscriptions


-- 
Mailing list: https://launchpad.net/~kernel-packages
Post to : kernel-packages@lists.launchpad.net
Unsubscribe : https://launchpad.net/~kernel-packages
More help   : https://help.launchpad.net/ListHelp

[Kernel-packages] [Bug 2072641] Re: [UBUNTU 22.04] KVM: s390: unhandled guest LPSWEY instruction

2024-09-05 Thread Frank Heimes

** Changed in: ubuntu-z-systems
   Status: In Progress => Fix Committed

-- 
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/2072641

Title:
  [UBUNTU 22.04] KVM: s390: unhandled guest LPSWEY instruction

Status in Ubuntu on IBM z Systems:
  Fix Committed
Status in linux package in Ubuntu:
  Fix Committed
Status in linux source package in Jammy:
  Fix Committed
Status in linux source package in Noble:
  Fix Committed
Status in linux source package in Oracular:
  Fix Committed

Bug description:
  Description:   KVM: s390: unhandled guest LPSWEY instruction

  Symptom:       guest kernel oops on LPSWEY instruction

  Problem:       in rare cases like machine check injection with
                 PSW disabled, all load PSW instructions are
                 intercepted. LPSW and LPSWE are handled by KVM but
                 not the new LPSWEY

  Solution:      Provide an LPSWEY handler in KVM.

  Reproduction:  hotplug a device while a CPU is disabled for machine
                 checks, e.g. during early boot

  Upstream-ID:   4c6abb7f7b349f00c0f7ed5045bf67759c012892

  Preventive:    yes
  Reported:      upstream
  Component:     kernel

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu-z-systems/+bug/2072641/+subscriptions



[Kernel-packages] [Bug 2076866] Re: Guest crashes post migration with migrate_misplaced_folio+0x4cc/0x5d0

2024-09-05 Thread Frank Heimes

Pull request submitted to kernel team's mailing list:
https://lists.ubuntu.com/archives/kernel-team/2024-September/thread.html#153390
changing status to 'In Progress'.

** Changed in: linux (Ubuntu Noble)
   Status: Triaged => In Progress

** Changed in: ubuntu-power-systems
   Status: Triaged => In Progress

** Changed in: linux (Ubuntu Noble)
 Assignee: (unassigned) => Canonical Kernel Team (canonical-kernel-team)

-- 
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/2076866

Title:
  Guest crashes post migration with migrate_misplaced_folio+0x4cc/0x5d0

Status in The Ubuntu-power-systems project:
  In Progress
Status in linux package in Ubuntu:
  Fix Committed
Status in linux source package in Noble:
  In Progress
Status in linux source package in Oracular:
  Fix Committed

Bug description:
  SRU Justification:

  [ Impact ]

   * A KVM guest (VM) that was live migrated between two Power 10 systems
     (using nested virtualization, i.e. KVM on top of PowerVM) will
     very likely crash after about an hour.

   * At that point the live migration itself appears to have been
     successful, but it was not, and this is what causes the crash.

  [ Test Plan ]

   * Set up two Power 10 systems (with firmware level FW1060 or newer,
     which supports nested KVM) with Ubuntu Server 24.04 for ppc64el.

   * Set up a qemu/KVM environment that allows live migration of a KVM
     guest from one P10 system to the other.

   * (The disk type does not seem to matter, hence NFS-based disk storage
     can be used, for example.)

   * After about an hour the live migrated guest is likely to crash.
     Hence wait for 2 hours (which increases the likelihood) until
     a crash due to:
     "migrate_misplaced_folio+0x540/0x5d0"
     occurs.

  [ Where problems could occur ]

   * The 'fix' to avoid calling folio_likely_mapped_shared for cases where
     the folio might have already been unmapped, and the move of the checks,
     might have an impact on page table locks if done wrong,
     which may lead to wrong locks, blocked memory and finally crashes.

   * The direct folio calls in mm/huge_memory.c and mm/memory.c are now
     indirect, which may lead to different behaviour and side effects.
     However, isolation is still done, just slightly differently:
     instead of using numamigrate_isolate_folio, it now happens in (the
     renamed) migrate_misplaced_folio_prepare.

   * Further upstream conversations:
     https://lkml.kernel.org/r/8f85c31a-e603-4578-bf49-136dae0d4...@redhat.com
     https://lkml.kernel.org/r/20240626191129.658cfc32...@smtp.kernel.org
     https://lkml.kernel.org/r/20240620212935.656243-3-da...@redhat.com

   * Fixing a confusing return code, so that it now just returns 0 on success,
     clarifies the return code handling and usage, and was mainly done in
     preparation for further changes,
     but it can have bad side effects if the return code was already used
     as is in other places in the code.

   * Further upstream conversations:
     https://lkml.kernel.org/r/20240620212935.656243-1-da...@redhat.com
     https://lkml.kernel.org/r/20240620212935.656243-2-da...@redhat.com

   * The change addresses the fact that NUMA balancing prohibits mTHP
     (multi-size Transparent Hugepage Support), which seems to be
     unreasonable since it's an exclusive mapping.
     Allowing this seems to bring significant performance improvements
     (see commit message d2136d749d76), but it introduced significant changes
     to PTE mapping and modifications, and even relies on further commits:
     859d4adc3415 ("mm: numa: do not trap faults on shared data section pages")
     80d47f5de5e3 ("mm: don't try to NUMA-migrate COW pages that have other uses")
     This can cause issues on systems configured for THP and
     may confuse the ordering, which may even lead to memory corruption.
     And this may especially hit (NUMA) systems with high core counts,
     where balancing is needed more often.

   * Further upstream conversations:
     
https://lore.kernel.org/all/20231117100745.fnpijbk4xgmal...@techsingularity.net/
     
https://lkml.kernel.org/r/c33a5c0b0a0323b1f8ed53772f50501f4b196e25.1712132950.git.baolin.w...@linux.alibaba.com
     
https://lkml.kernel.org/r/d28d276d599c26df7f38c9de8446f60e22dd1950.1711683069.git.baolin.w...@linux.alibaba.com

   * The refactoring of the code for NUMA mapping rebuilding, and moving
     it into a new helper, seems to be straightforward, since the active code
     stays unchanged; however, the new function needs to be callable, which
     is the case since it's all in mm/memory.c.

   * Further upstream conversations:
     
https://lkml.kernel.org/r/cover.1712132950.git.baolin.w...@linux.alibaba.com
     
https://lkml.kernel.org/r/cover.1711683069.git.baolin.w...@linux.alibaba.com
     
https://lkml.kernel.org/r/8bc2586bdd8dbbe6d83c09b77b360ec8fcac3736.1711683069.git.baolin.w...@linux.ali

[Kernel-packages] [Bug 2076866] Re: Guest crashes post migration with migrate_misplaced_folio+0x4cc/0x5d0

2024-09-05 Thread Frank Heimes

** Description changed:

  SRU Justification:
  
  [ Impact ]
  
   * A KVM guest (VM) that was live migrated between two Power 10 systems
     (using nested virtualization, i.e. KVM on top of PowerVM) will
     very likely crash after about an hour.
  
   * At that point the live migration itself appears to have been
     successful, but it was not, and this is what causes the crash.
  
  [ Test Plan ]
  
   * Set up two Power 10 systems (with firmware level FW1060 or newer,
     which supports nested KVM) with Ubuntu Server 24.04 for ppc64el.
  
   * Set up a qemu/KVM environment that allows live migration of a KVM
     guest from one P10 system to the other.
  
   * (The disk type does not seem to matter, hence NFS-based disk storage
     can be used, for example.)
  
   * After about an hour the live migrated guest is likely to crash.
     Hence wait for 2 hours (which increases the likelihood) until
     a crash due to:
     "migrate_misplaced_folio+0x540/0x5d0"
     occurs.
  
  [ Where problems could occur ]
  
   * The 'fix' to avoid calling folio_likely_mapped_shared for cases where
     the folio might have already been unmapped, and the move of the checks,
     might have an impact on page table locks if done wrong,
     which may lead to wrong locks, blocked memory and finally crashes.
  
   * The direct folio calls in mm/huge_memory.c and mm/memory.c are now
     indirect, which may lead to different behaviour and side effects.
     However, isolation is still done, just slightly differently:
     instead of using numamigrate_isolate_folio, it now happens in (the
     renamed) migrate_misplaced_folio_prepare.
  
   * Further upstream conversations:
     https://lkml.kernel.org/r/8f85c31a-e603-4578-bf49-136dae0d4...@redhat.com
     https://lkml.kernel.org/r/20240626191129.658cfc32...@smtp.kernel.org
     https://lkml.kernel.org/r/20240620212935.656243-3-da...@redhat.com
  
   * Fixing a confusing return code, so that it now just returns 0 on success,
     clarifies the return code handling and usage, and was mainly done in
     preparation for further changes,
     but it can have bad side effects if the return code was already used
     as is in other places in the code.
  
   * Further upstream conversations:
     https://lkml.kernel.org/r/20240620212935.656243-1-da...@redhat.com
     https://lkml.kernel.org/r/20240620212935.656243-2-da...@redhat.com
  
   * The change addresses the fact that NUMA balancing prohibits mTHP
     (multi-size Transparent Hugepage Support), which seems to be
     unreasonable since it's an exclusive mapping.
     Allowing this seems to bring significant performance improvements
     (see commit message d2136d749d76), but it introduced significant changes
     to PTE mapping and modifications, and even relies on further commits:
     859d4adc3415 ("mm: numa: do not trap faults on shared data section pages")
     80d47f5de5e3 ("mm: don't try to NUMA-migrate COW pages that have other uses")
     This can cause issues on systems configured for THP and
     may confuse the ordering, which may even lead to memory corruption.
     And this may especially hit (NUMA) systems with high core counts,
     where balancing is needed more often.
  
   * Further upstream conversations:
     
https://lore.kernel.org/all/20231117100745.fnpijbk4xgmal...@techsingularity.net/
     
https://lkml.kernel.org/r/c33a5c0b0a0323b1f8ed53772f50501f4b196e25.1712132950.git.baolin.w...@linux.alibaba.com
     
https://lkml.kernel.org/r/d28d276d599c26df7f38c9de8446f60e22dd1950.1711683069.git.baolin.w...@linux.alibaba.com
  
   * The refactoring of the code for NUMA mapping rebuilding, and moving
     it into a new helper, seems to be straightforward, since the active code
     stays unchanged; however, the new function needs to be callable, which
     is the case since it's all in mm/memory.c.
  
   * Further upstream conversations:
     
https://lkml.kernel.org/r/cover.1712132950.git.baolin.w...@linux.alibaba.com
     
https://lkml.kernel.org/r/cover.1711683069.git.baolin.w...@linux.alibaba.com
     
https://lkml.kernel.org/r/8bc2586bdd8dbbe6d83c09b77b360ec8fcac3736.1711683069.git.baolin.w...@linux.alibaba.com
  
   * The refactoring of folio_estimated_sharers to folio_likely_mapped_shared
     is more significant, since the logic changed from
     (folio_estimated_sharers) 'estimate the number of sharers of a folio' to
     (folio_likely_mapped_shared) 'estimate if the folio is mapped into the page
     tables of more than one MM'.
  
   * Since this is an estimation, the results may be unpredictable
     (especially for bigger folios), and not as expected or assumed
     (there are quite a few side notes in the code comments of bb34f78d72c2
     that mention potentially fuzzy results), hence this
     may lead to unforeseen behavior.
  
   * The condition statements became clearer since it's now based on
     (more or less obvious) number counts, but can still be erroneous in
     case folio_estimated_sharers does incorrect calculations.
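
The difference between a count-style estimate and a boolean "likely shared"
estimate, and why such an estimate can be fuzzy for bigger folios, can be
illustrated with a toy model. This is a sketch under stated assumptions and
not the kernel code: a folio is modelled as a mapping from a made-up MM name
to the set of folio page indices that MM maps, and the estimate is assumed to
sample only the first page. All names below are hypothetical.

```python
from typing import Dict, Set

# Toy folio: MM identifier -> indices of the folio pages that MM maps.
Folio = Dict[str, Set[int]]


def estimated_sharers(folio: Folio) -> int:
    """Count-style estimate: number of MMs that map the first page."""
    return sum(1 for pages in folio.values() if 0 in pages)


def likely_mapped_shared(folio: Folio) -> bool:
    """Boolean-style estimate: does more than one MM appear to map the folio?"""
    return estimated_sharers(folio) > 1


def actually_shared(folio: Folio) -> bool:
    """Ground truth in this toy model: more than one MM maps any page."""
    return sum(1 for pages in folio.values() if pages) > 1


# A partially mapped large folio: only mm_a maps page 0, but mm_b maps page 1.
partial = {"mm_a": {0, 1}, "mm_b": {1}}
```

For the partial folio the estimate reports "not shared" although two MMs map
it, which is the kind of fuzzy result the bullet above warns about.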

[Kernel-packages] [Bug 2072641] Re: [UBUNTU 22.04] KVM: s390: unhandled guest LPSWEY instruction

2024-09-05 Thread Frank Heimes

The commit landed in Ubuntu-5.15.0-120.130 (and newer),
and we have 5.15.0.121.121 in proposed.
Hence updating the status to Fix Committed.
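
The reasoning above (a kernel at or after the first fixed version carries the
commit) can be sketched as a version comparison. This is a minimal sketch with
hypothetical helper names. It assumes that the first four numeric fields of
both strings (upstream base version plus ABI number) are comparable, which is
an assumption about the version layout and not something stated in this
thread.

```python
import re
from typing import Tuple


def version_fields(version: str) -> Tuple[int, ...]:
    """Extract all numeric fields, e.g. 'Ubuntu-5.15.0-120.130' -> (5, 15, 0, 120, 130)."""
    return tuple(int(number) for number in re.findall(r"\d+", version))


def contains_fix(candidate: str, first_fixed: str) -> bool:
    """Compare base version and ABI number (first four fields) only."""
    return version_fields(candidate)[:4] >= version_fields(first_fixed)[:4]
```

For example, contains_fix("5.15.0.121.121", "Ubuntu-5.15.0-120.130") is true
under this assumption, which matches the status update above.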

** Also affects: linux (Ubuntu Noble)
   Importance: Undecided
   Status: New

** Also affects: linux (Ubuntu Oracular)
   Importance: Medium
   Status: Triaged

** Also affects: linux (Ubuntu Jammy)
   Importance: Undecided
   Status: New

** Changed in: linux (Ubuntu Oracular)
   Status: Triaged => Fix Committed

** Changed in: linux (Ubuntu Jammy)
   Status: New => Fix Committed

** Changed in: ubuntu-z-systems
   Status: Triaged => In Progress

-- 
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/2072641

Title:
  [UBUNTU 22.04] KVM: s390: unhandled guest LPSWEY instruction

Status in Ubuntu on IBM z Systems:
  In Progress
Status in linux package in Ubuntu:
  Fix Committed
Status in linux source package in Jammy:
  Fix Committed
Status in linux source package in Noble:
  New
Status in linux source package in Oracular:
  Fix Committed

Bug description:
  Description:   KVM: s390: unhandled guest LPSWEY instruction

  Symptom:       guest kernel oops on LPSWEY instruction

  Problem:       in rare cases like machine check injection with
                 PSW disabled, all load PSW instructions are
                 intercepted. LPSW and LPSWE are handled by KVM but
                 not the new LPSWEY

  Solution:      Provide an LPSWEY handler in KVM.

  Reproduction:  hotplug a device while a CPU is disabled for machine
                 checks, e.g. during early boot

  Upstream-ID:   4c6abb7f7b349f00c0f7ed5045bf67759c012892

  Preventive:    yes
  Reported:      upstream
  Component:     kernel

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu-z-systems/+bug/2072641/+subscriptions



[Kernel-packages] [Bug 2076866] Re: Guest crashes post migration with migrate_misplaced_folio+0x4cc/0x5d0

2024-09-05 Thread Frank Heimes

** Summary changed:

- Guest crahses post migration with migrate_misplaced_folio+0x4cc/0x5d0
+ Guest crashes post migration with migrate_misplaced_folio+0x4cc/0x5d0

-- 
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/2076866

Title:
  Guest crashes post migration with migrate_misplaced_folio+0x4cc/0x5d0

Status in The Ubuntu-power-systems project:
  Triaged
Status in linux package in Ubuntu:
  Fix Committed
Status in linux source package in Noble:
  Triaged
Status in linux source package in Oracular:
  Fix Committed

Bug description:
  SRU Justification:

  [ Impact ]

   * A KVM guest (VM) that was live migrated between two Power 10 systems
     (using nested virtualization, i.e. KVM on top of PowerVM) will
     very likely crash after about an hour.

   * At that point the live migration itself appears to have been
     successful, but it was not, and this is what causes the crash.

  [ Test Plan ]

   * Set up two Power 10 systems (with firmware level FW1060 or newer,
     which supports nested KVM) with Ubuntu Server 24.04 for ppc64el.

   * Set up a qemu/KVM environment that allows live migration of a KVM
     guest from one P10 system to the other.

   * (The disk type does not seem to matter, hence NFS-based disk storage
     can be used, for example.)

   * After about an hour the live migrated guest is likely to crash.
     Hence wait for 2 hours (which increases the likelihood) until
     a crash due to:
     "migrate_misplaced_folio+0x540/0x5d0"
     occurs.

  [ Where problems could occur ]

   * The 'fix' to avoid calling folio_likely_mapped_shared for cases where
     the folio might have already been unmapped, and the move of the checks,
     might have an impact on page table locks if done wrong,
     which may lead to wrong locks, blocked memory and finally crashes.

   * The direct folio calls in mm/huge_memory.c and mm/memory.c are now
     indirect, which may lead to different behaviour and side effects.
     However, isolation is still done, just slightly differently:
     instead of using numamigrate_isolate_folio, it now happens in (the
     renamed) migrate_misplaced_folio_prepare.

   * Further upstream conversations:
     https://lkml.kernel.org/r/8f85c31a-e603-4578-bf49-136dae0d4...@redhat.com
     https://lkml.kernel.org/r/20240626191129.658cfc32...@smtp.kernel.org
     https://lkml.kernel.org/r/20240620212935.656243-3-da...@redhat.com

   * Fixing a confusing return code, so that it now just returns 0 on success,
     clarifies the return code handling and usage, and was mainly done in
     preparation for further changes,
     but it can have bad side effects if the return code was already used
     as is in other places in the code.

   * Further upstream conversations:
     https://lkml.kernel.org/r/20240620212935.656243-1-da...@redhat.com
     https://lkml.kernel.org/r/20240620212935.656243-2-da...@redhat.com

   * The change addresses the fact that NUMA balancing prohibits mTHP
     (multi-size Transparent Hugepage Support), which seems to be
     unreasonable since it's an exclusive mapping.
     Allowing this seems to bring significant performance improvements
     (see commit message d2136d749d76), but it introduced significant changes
     to PTE mapping and modifications, and even relies on further commits:
     859d4adc3415 ("mm: numa: do not trap faults on shared data section pages")
     80d47f5de5e3 ("mm: don't try to NUMA-migrate COW pages that have other uses")
     This can cause issues on systems configured for THP and
     may confuse the ordering, which may even lead to memory corruption.
     And this may especially hit (NUMA) systems with high core counts,
     where balancing is needed more often.

   * Further upstream conversations:
     
https://lore.kernel.org/all/20231117100745.fnpijbk4xgmal...@techsingularity.net/
     
https://lkml.kernel.org/r/c33a5c0b0a0323b1f8ed53772f50501f4b196e25.1712132950.git.baolin.w...@linux.alibaba.com
     
https://lkml.kernel.org/r/d28d276d599c26df7f38c9de8446f60e22dd1950.1711683069.git.baolin.w...@linux.alibaba.com

   * The refactoring of the code for NUMA mapping rebuilding, and moving
     it into a new helper, seems to be straightforward, since the active code
     stays unchanged; however, the new function needs to be callable, which
     is the case since it's all in mm/memory.c.

   * Further upstream conversations:
     
https://lkml.kernel.org/r/cover.1712132950.git.baolin.w...@linux.alibaba.com
     
https://lkml.kernel.org/r/cover.1711683069.git.baolin.w...@linux.alibaba.com
     
https://lkml.kernel.org/r/8bc2586bdd8dbbe6d83c09b77b360ec8fcac3736.1711683069.git.baolin.w...@linux.alibaba.com

   * The refactoring of folio_estimated_sharers to folio_likely_mapped_shared
     is more significant, since the logic changed from
     (folio_estimated_sharers) 'estimate the number of sharers of a folio' to
     (folio_likely_mapped_shared) 'estimate if the fol

[Kernel-packages] [Bug 2072661] Re: [24.10 FEAT] [KRN1905] Kernel image in vmalloc space (V!=R)

2024-09-05 Thread Frank Heimes

** Information type changed from Private to Public

-- 
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/2072661

Title:
  [24.10 FEAT] [KRN1905] Kernel image in vmalloc space (V!=R)

Status in Ubuntu on IBM z Systems:
  Fix Committed
Status in linux package in Ubuntu:
  Fix Committed

Bug description:
  The kernel is currently mapped in a memory area which is backed with
  physical pages. Therefore virtual addresses are always the same as real
  addresses (V=R).
  This item is about moving the kernel image into vmalloc space, where
  random physical pages are used to map virtual pages (V!=R).

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu-z-systems/+bug/2072661/+subscriptions



[Kernel-packages] [Bug 2076147] Re: Add 'mm: hold PTL from the first PTE while reclaiming a large folio' to fix L2 Guest hang during LTP Test

2024-09-05 Thread Frank Heimes

Pull request submitted to kernel team's mailing list:
https://lists.ubuntu.com/archives/kernel-team/2024-September/thread.html#153383
changing status to 'In Progress', assigning kernel team.

** Changed in: linux (Ubuntu Noble)
   Status: Triaged => In Progress

** Changed in: ubuntu-power-systems
   Status: Triaged => In Progress

** Changed in: linux (Ubuntu Noble)
 Assignee: (unassigned) => Canonical Kernel Team (canonical-kernel-team)

-- 
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/2076147

Title:
  Add 'mm: hold PTL from the first PTE while reclaiming a large folio'
  to fix L2 Guest hang during LTP Test

Status in The Ubuntu-power-systems project:
  In Progress
Status in linux package in Ubuntu:
  Fix Committed
Status in linux source package in Noble:
  In Progress
Status in linux source package in Oracular:
  Fix Committed

Bug description:
  SRU Justification:

  [ Impact ]

   * A KVM 2nd-level guest (i.e. a KVM VM that runs nested on top of a Power 10
     PowerVM hypervisor) hangs during the LTP (Linux Test Project) test suite.

   * It hangs with:
     "Back trace of paca->saved_r1 (0xc00c1bc8bb00) (possibly stale) @ new_slab"

   * Diagnosing the issue points to this fix/upstream commit:
     [commit message, by Barry Song]
     Within try_to_unmap_one(), page_vma_mapped_walk() races with other PTE
     modifications preceded by pte clear. While iterating over PTEs of a large folio,
     it only starts acquiring PTL from the first valid (present) PTE.
     PTE modifications can temporarily set PTEs to pte_none.
     Consequently, the initial PTEs of a large folio might be skipped
     in try_to_unmap_one().
     For example, for an anon folio, if we skip PTE0, we may have PTE0 which is
     still present, while PTE1 ~ PTE(nr_pages - 1) are swap entries after
     try_to_unmap_one().
     So folio will be still mapped, the folio fails to be reclaimed and is put
     back to LRU in this round.
     This also breaks up PTEs optimization such as CONT-PTE on this large folio
     and may lead to accident folio_split() afterwards.
     And since a part of PTEs are now swap entries, accessing those parts will
     introduce overhead - do_swap_page.
     Although the kernel can withstand all of the above issues, the situation
     still seems quite awkward and warrants making it more ideal.
     The same race also occurs with small folios, but they have only one PTE,
     thus, it won't be possible for them to be partially unmapped.
     This patch [see below] holds PTL from PTE0, allowing us to avoid reading
     PTE values that are in the process of being transformed. With stable PTE
     values, we can ensure that this large folio is either completely reclaimed
     or that all PTEs remain untouched in this round.
     A corner case is that if we hold PTL from PTE0 and most initial PTEs have
     been really unmapped before that, we may increase the duration of holding
     PTL. Thus we only apply this optimization to folios which are still entirely
     mapped (not in deferred_split list).
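
The race described in the commit message can be made concrete with a small
toy model. This is not kernel code and makes simplifying assumptions: the
PTEs of a large folio are a list of strings, an entry "none" stands for a PTE
that another thread has cleared temporarily and will restore, and holding the
PTL from PTE0 is modelled as the walker only ever seeing settled values. The
function name unmap_large_folio and its parameter are hypothetical.

```python
from typing import List

PRESENT, NONE, SWAP = "present", "none", "swap"


def unmap_large_folio(ptes: List[str], hold_ptl_from_first: bool) -> List[str]:
    """Toy model of one unmap pass over the PTEs of a large folio."""
    ptes = list(ptes)
    if hold_ptl_from_first:
        # Lock taken at PTE0: the concurrent modifier must finish first,
        # so the walker only sees stable (restored) values.
        ptes = [PRESENT if pte == NONE else pte for pte in ptes]
        start = 0
    else:
        # Lock taken only at the first present PTE: leading entries that
        # are transiently "none" are skipped without being unmapped.
        start = next(
            (i for i, pte in enumerate(ptes) if pte == PRESENT), len(ptes)
        )
    for i in range(start, len(ptes)):
        if ptes[i] == PRESENT:
            ptes[i] = SWAP
    # The concurrent modifier completes and restores what it had cleared.
    return [PRESENT if pte == NONE else pte for pte in ptes]
```

Starting from a folio whose PTE0 is transiently cleared, the old behaviour
leaves PTE0 present while the remaining entries become swap entries (a
partially unmapped folio), whereas holding the lock from PTE0 unmaps all of
them in the same round.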

  [ Fix ]

   * 73bc32875ee9 73bc32875ee9b1881dd780308c6793fe463fe803
     "mm: hold PTL from the first PTE while reclaiming a large folio"

  [ Test Plan ]

   * An IBM Power 10 system (where PowerVM is mandatory)
     running Ubuntu Server 24.04 (kernel 6.8) or later
     with a (nested) KVM setup (so KVM on top of PowerVM).

   * Run the LTP test suite.
     Tests running: SLS(io,base)

   * Without the patch the above test will hang with
     Back trace of paca->saved_r1 (0xc00c1bc8bb00) (possibly stale) @ new_slab

  [ Where problems could occur ]

   * This is a common code change in the memory management sub-system,
     hence great care needs to be taken, even if it was discussed upfront
     at the https://lore.kernel.org/ mailing list and the upstream commit
     provenance shows that many eyes had a look at this.

   * The modification is relatively small with just one if statement
     (across two lines) in mm/vmscan.c.

   * This change is to assist 'try_to_unmap' to acquire page table locks (PTL)
     from the first page table entry (PTE) and to eliminate the influence of
     temporary and volatile PTE values.

   * If done wrong, it can have a negative impact, especially in the case of
     large folios, and wrong hints might be given to try_to_unmap,
     which may lead to bad page swapping.

   * In case of an issue with this patch the result can also be decreased
     performance and efficiency in the page table handling - the opposite
     of what the patch is supposed to address.

   * Fortunately several developers had their eyes on this commit,
     as the provenance of the patch and the discussion at LKML shows.

   * Further upstream conversation:
 Link: https://lkml.kernel.org/r/20240306095219.71086-1-21cn...@gmail.com

  [ Other Info ]

   * The commit is upstream

[Kernel-packages] [Bug 2076147] Re: Add 'mm: hold PTL from the first PTE while reclaiming a large folio' to fix L2 Guest hang during LTP Test

2024-09-04 Thread Frank Heimes

** Description changed:

  SRU Justification:
  
  [ Impact ]
  
   * A KVM 2nd level guest (i.e. a KVM VM that runs nested on top of a Power 10
     PowerVM hypervisor) hangs during the LTP (Linux Test Project) test suite.
  
   * It hangs with:
     "Back trace of paca->saved_r1 (0xc00c1bc8bb00) (possibly stale) @ new_slab"
  
   * Diagnosing the issue points to this fix/upstream commit:
     [commit message, by Barry Song]
     Within try_to_unmap_one(), page_vma_mapped_walk() races with other PTE
     modifications preceded by pte clear. While iterating over PTEs of a large
     folio, it only starts acquiring PTL from the first valid (present) PTE.
     PTE modifications can temporarily set PTEs to pte_none.
     Consequently, the initial PTEs of a large folio might be skipped
     in try_to_unmap_one().
     For example, for an anon folio, if we skip PTE0, we may have PTE0 which is
     still present, while PTE1 ~ PTE(nr_pages - 1) are swap entries after
     try_to_unmap_one().
     So folio will be still mapped, the folio fails to be reclaimed and is put
     back to LRU in this round.
     This also breaks up PTEs optimization such as CONT-PTE on this large folio
     and may lead to accident folio_split() afterwards.
     And since a part of PTEs are now swap entries, accessing those parts will
     introduce overhead - do_swap_page.
     Although the kernel can withstand all of the above issues, the situation
     still seems quite awkward and warrants making it more ideal.
     The same race also occurs with small folios, but they have only one PTE,
     thus, it won't be possible for them to be partially unmapped.
     This patch [see below] holds PTL from PTE0, allowing us to avoid reading
     PTE values that are in the process of being transformed. With stable PTE
     values, we can ensure that this large folio is either completely reclaimed
     or that all PTEs remain untouched in this round.
     A corner case is that if we hold PTL from PTE0 and most initial PTEs have
     been really unmapped before that, we may increase the duration of holding
     PTL. Thus we only apply this optimization to folios which are still
     entirely mapped (not in deferred_split list).
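
The race described in the commit message can be illustrated with a small toy model. This is only a sketch for this note, not kernel code: the `Pte` states, the function `unmap_large_folio` and the flag `hold_ptl_from_first_pte` are hypothetical names, and the model assumes that every transiently cleared entry is later restored to "present" by the concurrent modifier.

```python
from enum import Enum


class Pte(Enum):
    PRESENT = "present"
    NONE = "none"   # transient: another thread cleared the PTE and will restore it
    SWAP = "swap"


def unmap_large_folio(ptes, hold_ptl_from_first_pte):
    """Toy walk over the PTEs of one large folio (hypothetical model).

    ptes holds the PTE states as seen when the walk starts.
    Returns the states after both the walk and the modifier have finished.
    """
    result = list(ptes)
    locked = hold_ptl_from_first_pte
    for i, pte in enumerate(ptes):
        if locked and pte is Pte.NONE:
            # With the lock held the modifier cannot be mid-update, so the
            # walker sees the stable value instead of the transient one.
            pte = Pte.PRESENT
        if pte is Pte.PRESENT:
            locked = True              # at the latest, lock at the first present PTE
            result[i] = Pte.SWAP       # entry is unmapped and becomes a swap entry
        else:
            # Skipped without the lock; the modifier restores the entry later.
            result[i] = Pte.PRESENT
    return result


def fully_unmapped(ptes):
    return all(pte is Pte.SWAP for pte in ptes)


folio = [Pte.NONE, Pte.PRESENT, Pte.PRESENT, Pte.PRESENT]
print(unmap_large_folio(folio, hold_ptl_from_first_pte=False))
print(unmap_large_folio(folio, hold_ptl_from_first_pte=True))
```

In this model the walk without the early lock leaves PTE0 present while the remaining entries become swap entries, which is the partially unmapped state the commit message calls awkward; with the lock held from PTE0 the folio is unmapped as a whole.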
  
  [ Fix ]
  
   * 73bc32875ee9 73bc32875ee9b1881dd780308c6793fe463fe803
     "mm: hold PTL from the first PTE while reclaiming a large folio"
  
  [ Test Plan ]
  
   * An IBM Power 10 system (where PowerVM is mandatory)
     running Ubuntu Server 24.04 (kernel 6.8) or later
     with (nested) KVM setup (so KVM on top of PowerVM).
  
   * Run LTP test suite
     Tests running: SLS(io,base)
  
   * Without the patch the above test will hang with
     Back trace of paca->saved_r1 (0xc00c1bc8bb00) (possibly stale) @ new_slab
  
  [ Where problems could occur ]
  
   * This is a common code change in the memory management sub-system,
     hence great care needs to be taken, even if it was discussed upfront
     at the https://lore.kernel.org/ mailing list and the upstream commit
     provenance shows that many eyes had a look at this.
  
   * The modification is relatively small with just one if statement
     (across two lines) in mm/vmscan.c.
  
   * This change is to assist 'try_to_unmap' to acquire page table locks (PTL)
     from the first page table entry (PTE) and to eliminate the influence of
     temporary and volatile PTE values.
  
   * If done wrong, it can have a negative impact, especially in case of
     large folios, and wrong hints might be given to try_to_unmap,
     which may lead to bad page swapping.
  
   * In case of an issue with this patch the result can also be decreased
     performance and efficiency in the page table handling - the opposite
     of what the patch is supposed to address.
  
   * Fortunately several developers had their eyes on this commit,
-    as the provenance of the patch and the discussion at lkml shows.
+    as the provenance of the patch and the discussion at LKML shows.
+ 
+  * Further upstream conversation:
+Link: https://lkml.kernel.org/r/20240306095219.71086-1-21cn...@gmail.com
  
  [ Other Info ]
  
   * The commit is upstream since v6.10(-rc1), hence it will be included
-    in oracular with the planned target kernel.
+    in oracular with the planned target kernel of 6.11.
+ 
+  * And since (nested) KVM virtualization on ppc64el was (re-)introduced
+just with noble, no Ubuntu releases older than noble are affected.
  
  __
  
  == Comment: #0 - SEETEENA THOUFEEK - 2024-08-06 00:20:57 ==
  +++ This bug was initially created as a clone of Bug #206372 +++
  
  ---Problem Description---
  L2 Guest hung during LTP Tests. Back trace of paca->saved_r1
  (0xc00c1bc8bb00) (possibly stale) @ new_slab
  
  ---uname output---
  NA
  
  ---Additional Hardware Info---
  NA
  
  Contact Information = na
  
  ---Debugger Data---
  NA
  
  ---Patches Installed---
  NA
  
  ---Steps to Reproduce---
  
  Tests running: SLS(io,base)
  LPAR Confi

[Kernel-packages] [Bug 2076866] Re: Guest crahses post migration with migrate_misplaced_folio+0x4cc/0x5d0

2024-09-04 Thread Frank Heimes

** Description changed:

  SRU Justification:
  
  [ Impact ]
  
-  * A KVM guest (VM) that got live migrated between two Power 10 systems
-(using nested virtualization, means KVM on top of PowerVM) will
-highly likely crash after about an hour.
- 
-  * At that point it looked like the live migration itself was already
-successful, but it wasn't, and the crash is caused due to it.
+  * A KVM guest (VM) that got live migrated between two Power 10 systems
+    (using nested virtualization, means KVM on top of PowerVM) will
+    highly likely crash after about an hour.
+ 
+  * At that point it looked like the live migration itself was already
+    successful, but it wasn't, and the crash is caused due to it.
  
  [ Test Plan ]
  
-  * Setting up two Power 10 systems (with firmware level FW1060 or newer,
-that supports nested KVM) with Ubuntu Server 24.04 for ppc64el.
- 
-  * Setup a qemu/KVM environment that allows to live migrate a KVM
-guest from one P10 system to the other.
- 
-  * (The disk type does not seem to matter, hence NFS based disk storage
- can be used for example).
- 
-  * After about an hour the live migrated guest is likely to crash.
-Hence wait for 2 hours (which increases the likeliness) and
-a crash due to:
-"migrate_misplaced_folio+0x540/0x5d0"
-occurs.
+  * Setting up two Power 10 systems (with firmware level FW1060 or newer,
+    that supports nested KVM) with Ubuntu Server 24.04 for ppc64el.
+ 
+  * Setup a qemu/KVM environment that allows to live migrate a KVM
+    guest from one P10 system to the other.
+ 
+  * (The disk type does not seem to matter, hence NFS based disk storage
+ can be used for example).
+ 
+  * After about an hour the live migrated guest is likely to crash.
+    Hence wait for 2 hours (which increases the likeliness) and
+    a crash due to:
+    "migrate_misplaced_folio+0x540/0x5d0"
+    occurs.
  
  [ Where problems could occur ]
  
-  * The 'fix' to avoid calling folio_likely_mapped_shared for cases where
-folio might have already been unmapped and the move of the checks
-might have an impact on page table locks if done wrong,
-which may lead to wrong locks, blocked memory and finally crashes.
- 
-  * The direct folio calls in mm/huge_memory.c and mm/memory.c got now
-'in-directed', which may lead to a different behaviour and side-effects.
-However, isolation is still done, just slightly different and
-instead of using numamigrate_isolate_folio, now in (the renamed)
-migrate_misplaced_folio_prepare.
- 
-  * Further upstream conversations:
-https://lkml.kernel.org/r/8f85c31a-e603-4578-bf49-136dae0d4...@redhat.com
-https://lkml.kernel.org/r/20240626191129.658cfc32...@smtp.kernel.org
-https://lkml.kernel.org/r/20240620212935.656243-3-da...@redhat.com
- 
-  * Fixing a confusing return code, now to just return 0, on success is
-clarifying the return code handling and usage, and was mainly done in
-preparation of further changes,
-but can have bad side effects if the return code was used in other
-code places already as is.
- 
-  * Further upstream conversations:
-https://lkml.kernel.org/r/20240620212935.656243-1-da...@redhat.com
-https://lkml.kernel.org/r/20240620212935.656243-2-da...@redhat.com
- 
-  * Fixing the fact that NUMA balancing prohibits mTHP
-(multi-size Transparent Hugepage Support), which seems to be unreasonable
-since it's an exclusive mapping.
-Allowing this seems to bring significant performance improvements
-(see commit message d2136d749d76), but introduced significant changes to
-PTE mapping and modifications and even relies on further commits:
-859d4adc3415 ("mm: numa: do not trap faults on shared data section pages")
-80d47f5de5e3 ("mm: don't try to NUMA-migrate COW pages that have other uses")
-This can cause issues on systems configured for THP,
-may confuse the ordering, which may even lead to memory corruption.
-And this may especially hit (NUMA) systems with high core numbers,
-where balancing is more often needed.
- 
-  * Further upstream conversations:
-https://lore.kernel.org/all/20231117100745.fnpijbk4xgmal...@techsingularity.net/
-https://lkml.kernel.org/r/c33a5c0b0a0323b1f8ed53772f50501f4b196e25.1712132950.git.baolin.w...@linux.alibaba.com
-https://lkml.kernel.org/r/d28d276d599c26df7f38c9de8446f60e22dd1950.1711683069.git.baolin.w...@linux.alibaba.com
- 
-  * The refactoring of the code for NUMA mapping rebuilding and moving
-it into a new helper seems to be straightforward, since the active code
-stays unchanged. However, the new function needs to be callable, but this
-is the case since it's all in mm/memory.c.
- 
-  * Further upstream conversations:
-https://lkml.kernel.org/r/cover.1712132950.git.baolin.w...@linux.alibaba.com
-https://lkml.kernel.org/r/cover.1711683069.git.baolin.w...@linux.alibaba.com
-https://lkml.kernel.org/r/8bc2586bdd8dbbe6d83c09

[Kernel-packages] [Bug 2076866] Re: Guest crahses post migration with migrate_misplaced_folio+0x4cc/0x5d0

2024-09-04 Thread Frank Heimes

** Description changed:

+ SRU Justification:
+ 
+ [ Impact ]
+ 
+  * A KVM guest (VM) that got live migrated between two Power 10 systems
+(using nested virtualization, means KVM on top of PowerVM) will
+highly likely crash after about an hour.
+ 
+  * At that point it looked like the live migration itself was already
+successful, but it wasn't, and the crash is caused due to it.
+ 
+ [ Test Plan ]
+ 
+  * Setting up two Power 10 systems (with firmware level FW1060 or newer,
+that supports nested KVM) with Ubuntu Server 24.04 for ppc64el.
+ 
+  * Setup a qemu/KVM environment that allows to live migrate a KVM
+guest from one P10 system to the other.
+ 
+  * (The disk type does not seem to matter, hence NFS based disk storage
+ can be used for example).
+ 
+  * After about an hour the live migrated guest is likely to crash.
+Hence wait for 2 hours (which increases the likeliness) and
+a crash due to:
+"migrate_misplaced_folio+0x540/0x5d0"
+occurs.
+ 
+ [ Where problems could occur ]
+ 
+  * The 'fix' to avoid calling folio_likely_mapped_shared for cases where
+folio might have already been unmapped and the move of the checks
+might have an impact on page table locks if done wrong,
+which may lead to wrong locks, blocked memory and finally crashes.
+ 
+  * The direct folio calls in mm/huge_memory.c and mm/memory.c got now
+'in-directed', which may lead to a different behaviour and side-effects.
+However, isolation is still done, just slightly different and
+instead of using numamigrate_isolate_folio, now in (the renamed)
+migrate_misplaced_folio_prepare.
+ 
+  * Further upstream conversations:
+https://lkml.kernel.org/r/8f85c31a-e603-4578-bf49-136dae0d4...@redhat.com
+https://lkml.kernel.org/r/20240626191129.658cfc32...@smtp.kernel.org
+https://lkml.kernel.org/r/20240620212935.656243-3-da...@redhat.com
+ 
+  * Fixing a confusing return code, now to just return 0, on success is
+clarifying the return code handling and usage, and was mainly done in
+preparation of further changes,
+but can have bad side effects if the return code was used in other
+code places already as is.
+ 
+  * Further upstream conversations:
+https://lkml.kernel.org/r/20240620212935.656243-1-da...@redhat.com
+https://lkml.kernel.org/r/20240620212935.656243-2-da...@redhat.com
+ 
+  * Fixing the fact that NUMA balancing prohibits mTHP
+(multi-size Transparent Hugepage Support), which seems to be unreasonable
+since it's an exclusive mapping.
+Allowing this seems to bring significant performance improvements
+(see commit message d2136d749d76), but introduced significant changes to
+PTE mapping and modifications and even relies on further commits:
+859d4adc3415 ("mm: numa: do not trap faults on shared data section pages")
+80d47f5de5e3 ("mm: don't try to NUMA-migrate COW pages that have other uses")
+This can cause issues on systems configured for THP,
+may confuse the ordering, which may even lead to memory corruption.
+And this may especially hit (NUMA) systems with high core numbers,
+where balancing is more often needed.
+ 
+  * Further upstream conversations:
+https://lore.kernel.org/all/20231117100745.fnpijbk4xgmal...@techsingularity.net/
+https://lkml.kernel.org/r/c33a5c0b0a0323b1f8ed53772f50501f4b196e25.1712132950.git.baolin.w...@linux.alibaba.com
+https://lkml.kernel.org/r/d28d276d599c26df7f38c9de8446f60e22dd1950.1711683069.git.baolin.w...@linux.alibaba.com
+ 
+  * The refactoring of the code for NUMA mapping rebuilding and moving
+it into a new helper seems to be straightforward, since the active code
+stays unchanged. However, the new function needs to be callable, but this
+is the case since it's all in mm/memory.c.
+ 
+  * Further upstream conversations:
+https://lkml.kernel.org/r/cover.1712132950.git.baolin.w...@linux.alibaba.com
+https://lkml.kernel.org/r/cover.1711683069.git.baolin.w...@linux.alibaba.com
+https://lkml.kernel.org/r/8bc2586bdd8dbbe6d83c09b77b360ec8fcac3736.1711683069.git.baolin.w...@linux.alibaba.com
+ 
+  * The refactoring of folio_estimated_sharers to folio_likely_mapped_shared
+is more significant, since the logic changed from
+(folio_estimated_sharers) 'estimate the number of sharers of a folio' to
+(folio_likely_mapped_shared) 'estimate if the folio is mapped into the page
+tables of more than one MM'.
+ 
+  * Since this is an estimation, the results may be unpredictable
+(especially for bigger folios), and not as expected or assumed
+(there are quite some side-notes in the code comments of ebb34f78d72c,
+that mention potential fuzzy results), hence this
+may lead to unforeseen behavior.
+ 
+  * The condition statements became clearer since it's now based on
+(more or less obvious) number counts, but can still be erroneous in
+case folio_estimated_sharers does incorrect calculations.
+ 
+

[Kernel-packages] [Bug 2072760] Re: [24.10 FEAT] [KRN1911] Vertical CPU Polarization Support Stage 2

2024-09-04 Thread Frank Heimes

Pull request submitted to kernel team's mailing list:
https://lists.ubuntu.com/archives/kernel-team/2024-September/thread.html#153380
changing status to 'In Progress', assigning kernel team.

** Changed in: ubuntu-z-systems
   Status: Incomplete => In Progress

** Changed in: linux (Ubuntu)
   Status: Incomplete => In Progress

** Information type changed from Private to Public

** Changed in: linux (Ubuntu)
 Assignee: (unassigned) => Canonical Kernel Team (canonical-kernel-team)

-- 
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/2072760

Title:
  [24.10 FEAT] [KRN1911] Vertical CPU Polarization Support Stage 2

Status in Ubuntu on IBM z Systems:
  In Progress
Status in linux package in Ubuntu:
  In Progress

Bug description:
  Description:

  The Linux kernel already provides a mechanism to switch from horizontal to
  vertical polarization. However, the current implementation does not tell the
  Linux scheduler about the "cpu capacity" of the different types of vertical
  cpus.
  For vertical high cpus, the cpu capacity is 100%. For vertical low cpus, it
  should be close to 0% (they should not be used). It is difficult to tell the
  cpu capacity of vertical medium cpus.

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu-z-systems/+bug/2072760/+subscriptions


-- 
Mailing list: https://launchpad.net/~kernel-packages
Post to : kernel-packages@lists.launchpad.net
Unsubscribe : https://launchpad.net/~kernel-packages
More help   : https://help.launchpad.net/ListHelp

[Kernel-packages] [Bug 2077722] Re: [Ubuntu 24.04] MultiVM - L2 guest(s) running stress-ng getting stuck at booting after triggering crash

2024-09-03 Thread Frank Heimes

Looks like it is not yet clear if xive is the problem.

So isn't it too early to revert the patches? Is it really safe to do so?
I see they got introduced with kernel 6.8, but are still in the later kernels.
I don't see any upstream revert (ideally as a "stable update"), which would be
the right approach, I think.

** Changed in: ubuntu-power-systems
   Status: New => Incomplete

** Changed in: linux (Ubuntu)
   Status: New => Incomplete

-- 
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/2077722

Title:
  [Ubuntu 24.04] MultiVM - L2 guest(s) running stress-ng getting stuck
  at booting after triggering crash

Status in The Ubuntu-power-systems project:
  Incomplete
Status in linux package in Ubuntu:
  Incomplete

Bug description:
  Problem:
  While bringing up 2 Ubuntu 24.04 guests, running stress-ng (90% load) on
  both and triggering a crash simultaneously, the 1st guest gets stuck and does
  not boot up. In one of the attempts, both guests got stuck on booting with
  a console hang.

  Attempts:
  Reproducible 3/3 consecutive times
  Run 1: L2-1 guest got stuck 
  Run 2: L2-1 guest got stuck
  Run 3: L2-1 and L2-2 guest got stuck

  
  =
  L1 Host:
  1. PowerVM
  2. OS: Ubuntu 24.04
  3. Kernel: 6.8.0-31-generic
  4. Mem (free -mh): 47Gi
  5. cpus: 40

  Guest L2-1:
  1. OS: Ubuntu 24.04
  2. Kernel: 6.8.0-31-generic
  3. Mem (free -mh): 9.5Gi
  4. cpus: 8
  5. Stress: stress-ng - 90% load
  6. XML configuration:
 16
 10971520
 

  Guest L2-2:
  1. OS: Ubuntu 24.04
  2. Kernel: 6.8.0-31-generic
  3. Mem (free -mh): 9.5Gi
  4. cpus: 8
  5. Stress: stress-ng - 90% load
  6. XML configuration:
 16
 10971520
 

  
  =
  Steps to reproduce:
  1. Bring up 2 Ubuntu 24.04 L2 guests with configuration mentioned as above
  2. Run the attached stress-ng.sh script on both L2 guests
  3. Trigger crash: echo c >/proc/sysrq-trigger on both L2 guests at the same
     time

  After triggering the crash, one or both guest consoles will get stuck.
  We are then able neither to enter the guest nor to shut it down.
  In order to boot into the guest again, a virsh destroy of the guest is
  required.

  
  =
  Run1: Console.log Error message of L2-1
Booting `Ubuntu'

  Loading Linux 6.8.0-31-generic ...
  Loading initial ramdisk ...
  OF stdout device is: /vdevice/vty@3000
  Preparing to boot Linux version 6.8.0-31-generic (buildd@bos02-ppc64el-018) (powerpc64le-linux-gnu-gcc-13 (Ubuntu 13.2.0-23ubuntu4) 13.2.0, GNU ld (GNU Binutils for Ubuntu) 2.42) #31-Ubuntu SMP Sat Apr 20 00:05:55 UTC 2024 (Ubuntu 6.8.0-31.31-generic 6.8.1)
  Detected machine type: 0101
  command line: BOOT_IMAGE=/vmlinux-6.8.0-31-generic root=/dev/mapper/ubuntu--vg-ubuntu--lv ro crashkernel=2G-4G:320M,4G-32G:512M,32G-64G:1024M,64G-128G:2048M,128G-:4096M
  Max number of cores passed to firmware: 1024 (NR_CPUS = 2048)
  Calling ibm,client-architecture-support... done
  memory layout at init:
memory_limit :  (16 MB aligned)
alloc_bottom : 09d7
alloc_top: 3000
alloc_top_hi : 0002a000
rmo_top  : 3000
ram_top  : 0002a000
  instantiating rtas at 0x2fff... done
  prom_hold_cpus: skipped
  copying OF device tree...
  Building dt strings...
  Building dt structure...
  Device tree strings 0x09d8 -> 0x09d80bc6
  Device tree struct  0x09d9 -> 0x09da
  Quiescing Open Firmware ...
  Booting Linux via __start() @ 0x0023 ...
  [0.00] random: crng init done
  [0.00] Reserving 512MB of memory at 512MB for crashkernel (System RAM: 10752MB)
  [0.00] radix-mmu: Page sizes from device-tree:
  [0.00] radix-mmu: Page size shift = 12 AP=0x0
  [0.00] radix-mmu: Page size shift = 16 AP=0x5
  [0.00] radix-mmu: Page size shift = 21 AP=0x1
  [0.00] radix-mmu: Page size shift = 30 AP=0x2
  [0.00] Activating Kernel Userspace Access Prevention
  [0.00] Activating Kernel Userspace Execution Prevention
  [0.00] radix-mmu: Mapped 0x-0x038a with 64.0 KiB pages (exec)
  [0.00] radix-mmu: Mapped 0x038a-0x0002a000 with 64.0 KiB pages
  [0.00] lpar: Using radix MMU under hypervisor
  [0.00] Linux version 6.8.0-31-generic (buildd@bos02-ppc64el-018) (powerpc64le-linux-gnu-gcc-13 (Ubuntu 13.2.0-23ubuntu4) 13.2.0, GNU ld (GNU Binutils for Ubuntu) 2.42) #31-Ubuntu SMP Sat Apr 20 00:05:55 UTC 2024 (Ubuntu 6.8.0-31.31-generic 6.8.1)
  [0.00] Secure boot mode disabled
  [0.00] Found initrd at 0xc620:0xc000
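
The crashkernel= value on the command line in the log above uses the kernel's range syntax (start-[end]:size, comma separated). As a rough illustration of how a reservation is picked for a given amount of system RAM, here is a minimal Python sketch. The helper names `parse_size` and `crashkernel_reservation` are made up for this note, the optional @offset suffix is ignored, and this is not the kernel's implementation.

```python
UNITS = {"K": 1 << 10, "M": 1 << 20, "G": 1 << 30, "T": 1 << 40}


def parse_size(text):
    """Convert a size such as '512M' or '2G' to bytes."""
    text = text.strip()
    if text and text[-1].upper() in UNITS:
        return int(text[:-1]) * UNITS[text[-1].upper()]
    return int(text)


def crashkernel_reservation(spec, system_ram):
    """Return the reservation in bytes for system_ram, or 0 if no range matches.

    spec is the value after 'crashkernel=' in range form; the start of a
    range is inclusive, the end is exclusive, and an empty end means "no
    upper limit".
    """
    for entry in spec.split(","):
        ram_range, size = entry.split(":")
        start, _, end = ram_range.partition("-")
        lower = parse_size(start)
        upper = parse_size(end) if end else float("inf")
        if lower <= system_ram < upper:
            return parse_size(size)
    return 0


SPEC = "2G-4G:320M,4G-32G:512M,32G-64G:1024M,64G-128G:2048M,128G-:4096M"
guest_ram = 10752 * (1 << 20)  # "System RAM: 10752MB" from the log
print(crashkernel_reservation(SPEC, guest_ram) // (1 << 20), "MB")
```

With the 10752MB of system RAM reported in the log, the 4G-32G range applies, which agrees with the "Reserving 512MB of memory at 512MB for crashkernel" message.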

[Kernel-packages] [Bug 2060039] Re: [Ubuntu-24.04] FADump with recommended crash size is making the L1 hang

2024-09-03 Thread Frank Heimes

** Description changed:

  SRU Justification:
-  
+ 
  [Impact]
-  * L1 host hangs when triggering FADump with recommended crash
-   


+  * L1 host hangs when triggering FADump that results in crash
+ 
  [Fix]
-  * 353d7a84c214f184d5a6b62acdec8b4424159b7c 353d7a84c214 
"powerpc/64s/radix/kfence: map __kfence_pool at page granularity"
-  
+  * 353d7a84c214f184d5a6b62acdec8b4424159b7c 353d7a84c214 
"powerpc/64s/radix/kfence: map __kfence_pool at page granularity"
+ 
  [Test Case]
-  * Have a Ubuntu Server 24.04 LTS installation on ppc64el.
-  * Enable FADump with 1GB: fadump=on crashkernel=1024M
-  * A kernel panic will happen when dump got triggered
-  
+  * Have a Ubuntu Server 24.04 LTS installation on ppc64el.
+  * Enable FADump with 1GB: fadump=on crashkernel=1024M
+  * A kernel panic will happen when dump got triggered
+ 
  [Regression Potential]
  * There is a certain risk of a regression, but it is mapping only the memory
-   allocated for KFENCE pool at page granularity, reducing memory consumption
-   when KFENCE is used.
-  
+   allocated for KFENCE pool at page granularity, reducing memory consumption
+   when KFENCE is used.
+ 
  * On top the commit is already upstream reviewed and accepted.
-  
+ 
  * The modifications were done and tested by IBM.
-  
+ 
  * The fadump feature is supported only on IBM POWER systems.
-  
+ 
  [Other]
  * The fix/commit got upstream accepted with kernel v6.11-rc4,
-   hence Oracular (with a planned kernel of 6.11) is not affected.
+   hence Oracular (with a planned kernel of 6.11) is not affected.
  
  ...
  
  Problem description :
  ==
  
  Triggered FADump with the recommended crash kernel size. The L1 host hung.
  
  As per the public document
  https://wiki.ubuntu.com/ppc64el/Recommendations the recommended crash kernel
  size is 1024M for the system. But with 1024M and 2048M, the L1 host
  hangs. With 4096M, the crash dump is generated and collected.
  
  root@ubuntu2404:~# uname -ar
  Linux ubuntu2404 6.8.0-11-generic #11-Ubuntu SMP Wed Feb 14 00:33:03 UTC 2024 ppc64le ppc64le ppc64le GNU/Linux
  
  root@ubuntu2404:~# free -h
                 total        used        free      shared  buff/cache   available
  Mem:            48Gi       1.7Gi        46Gi        13Mi       687Mi        46Gi
  Swap:          8.0Gi          0B       8.0Gi
  
  root@ubuntu2404:~# cat /proc/cmdline
  BOOT_IMAGE=/vmlinux-6.8.0-11-generic root=/dev/mapper/ubuntu--vg-ubuntu--lv ro fadump=on crashkernel=1024M
  
  root@ubuntu2404:~# dmesg | grep -i reser
  [0.00] fadump: Reserved 1024MB of memory at 0x004000 (System RAM: 51200MB)
  [0.00] fadump: Initialized 0x4000 bytes cma area at 1024MB from 0x4007 bytes of memory reserved for firmware-assisted dump
  [0.00] Memory: 49316672K/52428800K available (23616K kernel code, 4096K rwdata, 25536K rodata, 8832K init, 2487K bss, 2063552K reserved, 1048576K cma-reserved)
  [0.396408] ibmvscsi 3066: Client reserve enabled
  
  root@ubuntu2404:~# kdump-config show
  DUMP_MODE:        fadump
  USE_KDUMP:        1
  KDUMP_COREDIR:    /var/crash
     /var/lib/kdump/vmlinuz
  kdump initrd:
     /var/lib/kdump/initrd.img
  current state:    ready to fadump
  
  IBM is looking to update the crash kernel reservations section of the
  wiki for Power.
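
The figures in the "Memory:" line of the dmesg output above can be cross-checked against the FADump reservation. The following Python sketch is only an illustration; the function name `parse_memory_line` is made up for this note, and the regular expressions assume the message layout shown in the log.

```python
import re


def parse_memory_line(line):
    """Extract the KiB figures from a kernel 'Memory:' boot message."""
    match = re.search(r"Memory: (\d+)K/(\d+)K available", line)
    available, total = int(match.group(1)), int(match.group(2))
    # "K reserved" does not match "K cma-reserved", so this picks the
    # plain reserved figure.
    reserved = int(re.search(r"(\d+)K reserved", line).group(1))
    cma_reserved = int(re.search(r"(\d+)K cma-reserved", line).group(1))
    return {
        "available": available,
        "total": total,
        "reserved": reserved,
        "cma_reserved": cma_reserved,
    }


LINE = (
    "Memory: 49316672K/52428800K available (23616K kernel code, "
    "4096K rwdata, 25536K rodata, 8832K init, 2487K bss, "
    "2063552K reserved, 1048576K cma-reserved)"
)
mem = parse_memory_line(LINE)
print(mem["total"] // 1024, "MB total,", mem["cma_reserved"] // 1024, "MB CMA")
print(mem["available"] + mem["reserved"] + mem["cma_reserved"] == mem["total"])
```

The total of 52428800K corresponds to the 51200MB of system RAM, the 1048576K of cma-reserved memory corresponds to the 1024MB FADump reservation, and available, reserved and cma-reserved memory add up to the total.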

-- 
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/2060039

Title:
  [Ubuntu-24.04] FADump with recommended crash size is making the L1
  hang

Status in The Ubuntu-power-systems project:
  In Progress
Status in linux package in Ubuntu:
  Fix Committed
Status in linux source package in Noble:
  In Progress
Status in linux source package in Oracular:
  Fix Committed

Bug description:
  SRU Justification:

  [Impact]
   * L1 host hangs when triggering FADump that results in crash

  [Fix]
   * 353d7a84c214f184d5a6b62acdec8b4424159b7c 353d7a84c214
     "powerpc/64s/radix/kfence: map __kfence_pool at page granularity"

  [Test Case]
   * Have a Ubuntu Server 24.04 LTS installation on ppc64el.
   * Enable FADump with 1GB: fadump=on crashkernel=1024M
   * A kernel panic will happen when dump got triggered

  [Regression Potential]
  * There is a certain risk of a regression, but it is mapping only the memory
    allocated for KFENCE pool at page granularity, reducing memory consumption
    when KFENCE is used.

  * On top the commit is already upstream reviewed and accepted.

  * The modifications were done and tested by IBM.

  * The fadump feature is supported only on IBM POWER systems.

  [Other]
  * The fix/commit got upstream accepted with kernel v6.11-rc4,
    hence Oracular (wi

[Kernel-packages] [Bug 2076866] Re: Guest crahses post migration with migrate_misplaced_folio+0x4cc/0x5d0

2024-09-03 Thread Frank Heimes

** Summary changed:

- ISST-LTE:KOP:1060.1:doodlp1g8:Post Migration Non-MDC L1 eralp1 crashed with migrate_misplaced_folio+0x4cc/0x5d0
+ Guest crahses post migration with migrate_misplaced_folio+0x4cc/0x5d0

-- 
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/2076866

Title:
  Guest crahses post migration with migrate_misplaced_folio+0x4cc/0x5d0

Status in The Ubuntu-power-systems project:
  Triaged
Status in linux package in Ubuntu:
  Triaged
Status in linux source package in Noble:
  Triaged
Status in linux source package in Oracular:
  Triaged

Bug description:
  == Comment: #0 - SEETEENA THOUFEEK - 2024-08-12 23:50:17 ==
  +++ This bug was initially created as a clone of Bug #207985 +++

  ---Problem Description---
  Post Migration Non-MDC L1 eralp1 crashed with
  migrate_misplaced_folio+0x4cc/0x5d0
   
  Machine Type = na 
   
  Contact Information = sthou...@in.ibm.com 
   
  ---Steps to Reproduce---
   Problem description  : 
  After 1 hour of successful migration from doodlp1 [MDC mode] to eralp1
  [non-MDC mode], the eralp1 guest crashed and a dump was collected.
   
  ---uname output---
  na
   
  ---Debugger---
  A debugger is not configured

  
  [281827.975244] NIP [c05f0620] migrate_misplaced_folio+0x4f0/0x5d0
  [281827.975251] LR [c05f067c] migrate_misplaced_folio+0x54c/0x5d0
  [281827.975258] Call Trace:
  [281827.975260] [c01e19ff7140] [c05f0670] migrate_misplaced_folio+0x540/0x5d0 (unreliable)
  [281827.975268] [c01e19ff71d0] [c054c9f0] __handle_mm_fault+0xf70/0x28e0
  [281827.975276] [c01e19ff7310] [c054e478] handle_mm_fault+0x118/0x400
  [281827.975284] [c01e19ff7360] [c053598c] __get_user_pages+0x1ec/0x5b0
  [281827.975291] [c01e19ff7420] [c0536920] get_user_pages_unlocked+0x120/0x4f0
  [281827.975298] [c01e19ff74c0] [c0081894ea9c] hva_to_pfn+0xf4/0x630 [kvm]
  [281827.975316] [c01e19ff7550] [c00818b4efc4] kvmppc_book3s_instantiate_page+0xec/0x790 [kvm_hv]
  [281827.975326] [c01e19ff7660] [c00818b4f750] kvmppc_book3s_radix_page_fault+0xe8/0x380 [kvm_hv]
  [281827.975335] [c01e19ff7700] [c00818b488fc] kvmppc_book3s_hv_page_fault+0x294/0xd60 [kvm_hv]
  [281827.975344] [c01e19ff77e0] [c00818b43f5c] kvmppc_vcpu_run_hv+0xf94/0x11d0 [kvm_hv]
  [281827.975352] [c01e19ff78a0] [c0081896131c] kvmppc_vcpu_run+0x34/0x48 [kvm]
  [281827.975365] [c01e19ff78c0] [c0081895c164] kvm_arch_vcpu_ioctl_run+0x39c/0x570 [kvm]
  [281827.975379] [c01e19ff7950] [c0081894a104] kvm_vcpu_ioctl+0x20c/0x9a8 [kvm]
  [281827.975391] [c01e19ff7b30] [c0683974] sys_ioctl+0x574/0x16a0
  [281827.975395] [c01e19ff7c30] [c0030838] system_call_exception+0x168/0x310
  [281827.975400] [c01e19ff7e50] [c000d05c] system_call_vectored_common+0x15c/0x2ec
  [281827.975406] --- interrupt: 3000 at 0x7fffb7d4d2bc
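
When comparing traces like the one above across crashes, it can help to reduce each frame to symbol, offset, function size and module. The following Python sketch shows one way to do that; the names `FRAME_RE` and `parse_frame` are made up for this note, and the pattern only assumes the "symbol+0xoffset/0xsize [module]" layout visible in this trace.

```python
import re

# symbol+0xOFFSET/0xSIZE, optionally followed by a module name in brackets
FRAME_RE = re.compile(
    r"(?P<symbol>[A-Za-z_][\w.]*)\+0x(?P<offset>[0-9a-f]+)/0x(?P<size>[0-9a-f]+)"
    r"(?:\s+\[(?P<module>\w+)\])?"
)


def parse_frame(line):
    """Return symbol, offset, size and module of a trace line, or None."""
    match = FRAME_RE.search(line)
    if match is None:
        return None
    return {
        "symbol": match.group("symbol"),
        "offset": int(match.group("offset"), 16),
        "size": int(match.group("size"), 16),
        "module": match.group("module"),  # None for built-in kernel code
    }


frame = parse_frame(
    "[281827.975298] [c01e19ff74c0] [c0081894ea9c] hva_to_pfn+0xf4/0x630 [kvm]"
)
print(frame)
```

Lines without a frame, such as "Call Trace:", yield None, so the helper can be applied to every line of a console log.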

  Mirroring to distro as per message in group channel

  
  Please pick these patches for this bug:

  ee86814b0562 ("mm/migrate: move NUMA hinting fault folio isolation + checks under PTL")
  4b88c23ab8c9 ("mm/migrate: make migrate_misplaced_folio() return 0 on success")
  d2136d749d76 ("mm: support multi-size THP numa balancing")
  6b0ed7b3c775 ("mm: factor out the numa mapping rebuilding into a new helper")
  ebb34f78d72c ("mm: convert folio_estimated_sharers() to folio_likely_mapped_shared()")
  133d04b1eee9 ("mm/numa_balancing: allow migrate on protnone reference with MPOL_PREFERRED_MANY policy")
  f8fd525ba3a2 ("mm/mempolicy: use numa_node_id() instead of cpu_to_node()")

  Thanks,
  Amit

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu-power-systems/+bug/2076866/+subscriptions



[Kernel-packages] [Bug 2076406] Re: L2 Guest migration: continuously dumping while running NFS guest migration

2024-09-02 Thread Frank Heimes

Pull request submitted to kernel team's mailing list:
https://lists.ubuntu.com/archives/kernel-team/2024-September/thread.html#153261
changing status to 'In Progress', assigning kernel team.

** Changed in: linux (Ubuntu Noble)
   Status: Triaged => In Progress

** Changed in: ubuntu-power-systems
   Status: Triaged => In Progress

** Changed in: linux (Ubuntu Noble)
 Assignee: (unassigned) => Canonical Kernel Team (canonical-kernel-team)

-- 
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/2076406

Title:
  L2 Guest migration: continuously dumping while running NFS guest
  migration

Status in The Ubuntu-power-systems project:
  In Progress
Status in linux package in Ubuntu:
  Fix Committed
Status in linux source package in Noble:
  In Progress
Status in linux source package in Oracular:
  Fix Committed

Bug description:
  SRU Justification:

  [ Impact ]

   * While doing ISST testing it turned out that a 2nd level (KVM)
     guest (aka VM) continuously dumped when running an NFS
     guest migration.

  [ Test Plan ]

   * Set up two IBM Power 10 systems (with firmware 1060, that offers
     support for KVM) with Ubuntu Server 24.04 for ppc64el.

   * Set up qemu/KVM on both of these systems to allow guest migration.

   * Set up a KVM guest and place its disk on an NFS volume.

   * Now initiate a guest migration.

   * Without the two patches the initiator system will start to dump.

   * Since this setup requires a special firmware level,
     the verification will be done by the IBM Power team.

  [ Where problems could occur ]

   * Although the patch set looks huge,
     the patches themselves are relatively small and less invasive
     and I would consider them mainly as fixes.

   * kvmppc_set_one_reg_hv() wrongly calls get() instead of
     set() for MMCR3.

   * And kvmppc_get_one_reg_hv() for SDAR wrongly gets
     the SIAR instead of the SDAR - which is quite traceable.

   * Then a one-reg interface for the DEXCR register, KVM_REG_PPC_DEXCR,
     is introduced. Here issues can happen if the initialization
     or the code in the case statement is done wrong.
     A fix was added to keep the nested guest DEXCR in sync.
     The guest state element defined for DEXCR was already there,
     but not really considered - this is fixed now (DEXCR GSID).
     If the initialization or the code in the case statement is wrong,
     this can harm the guest state.
     The guest state may get out of sync.

   * Another one-reg register identifier was introduced
     that is used to read and set the virtual HASHKEYR
     for the guest during enter/exit with KVM_REG_PPC_HASHKEYR.
     Again the initialization and the case code are critical.
     Code was added to keep the nested guest HASHKEYR in sync.
     Again the state element defined for HASHKEYR was there,
     but not considered, which is fixed now (HASHKEYR GSID).
     If the initialization or the code in the case statement is wrong,
     this can harm the guest state.
     This can harm the L2 guest during enter or exit.

   * Yet another one-reg identifier was introduced
     that is used to read and set the virtual HASHPKEYR
     for the guest during enter/exit with KVM_REG_PPC_HASHPKEYR.
     And again the guest state element defined for HASHPKEYR
     was there but ignored, which is now fixed (HASHPKEYR GSID).
     If the initialization or the code in the case statement is wrong,
     this can harm the guest state.
     This can harm the L2 guest during enter or exit.

  [ Other Info ]

   * Since (nested) KVM support is new on P10,
 this does not affect older Power generations
 (P9 is the only other hw generation that is supported by 24.04,
 but it only supports native virtualization).

   * Both patches have been accepted upstream since v6.11(-rc1),
 hence will be in oracular,
 and are also tagged upstream as stable updates.

   * Since the required firmware FW1060 is relatively new,
 we can assume that not many users have run into this issue yet.
  __

  == Comment: #0 - SEETEENA THOUFEEK - 2024-08-09 03:50:24 ==
  +++ This bug was initially created as a clone of Bug #206737 +++

  ---Problem Description---
  L2 Guest migration: evelp2g4[L2]: while running NFS guest migration
  continuously dumping smp_call_function_many_cond+0x500/0x738 (unreliable)
  and watchdog: BUG: soft lockup - CPU#14 stuck for 223s! [systemd-homed]

  ---uname output---
  NA

  Machine Type = NA

  Contact Information = NA

  [79205.163691] Hardware name: IBM pSeries (emulated by qemu) POWER10 (raw) 0x800200 0xf06 of:SLOF,HEAD hv:linux,kvm pSeries
  [79205.163834] NIP:  c02bb7a4 LR: c02bb750 CTR: c00d192c
  [79205.163929] REGS: c003871cf1b0 TRAP: 0900   Tainted: G L
  [79205.165041] MSR:  8280b033   CR: 4404  XER: 20040004
  [79205.165266] CFAR:  IRQMASK: 0
     GPR00: c02bbc58 c003871cf450 c000

[Kernel-packages] [Bug 2076406] Re: L2 Guest migration: continuously dumping while running NFS guest migration

2024-09-02 Thread Frank Heimes

A test kernel was built in this PPA:
https://launchpad.net/~fheimes/+archive/ubuntu/lp2076406

https://bugs.launchpad.net/bugs/2076406

Title:
  L2 Guest migration: continuously dumping while running NFS guest
  migration

Status in The Ubuntu-power-systems project:
  In Progress
Status in linux package in Ubuntu:
  Fix Committed
Status in linux source package in Noble:
  In Progress
Status in linux source package in Oracular:
  Fix Committed

Bug description:
  SRU Justification:

  [ Impact ]

   * While doing ISST testing it turned out that a 2nd level (KVM)
 guest (aka VM) continuously dumped when running an NFS
 guest migration.

  [ Test Plan ]

   * Set up two IBM Power 10 systems (with firmware 1060, which offers
 support for KVM) with Ubuntu Server 24.04 for ppc64el.

   * Set up qemu/KVM on both of these systems to allow guest migration.

   * Set up a KVM guest and place its disk on an NFS volume.

   * Now initiate a guest migration.

   * Without the two patches the initiator system will start to dump.

   * Since this setup requires a special firmware level,
 the verification will be done by the IBM Power team.

  [ Where problems could occur ]

   * Although the patch set looks huge,
 the patches themselves are relatively small and not very invasive,
 and I would consider them mainly as fixes.

   * kvmppc_set_one_reg_hv() wrongly called get() instead of
 set() for MMCR3.

   * And kvmppc_get_one_reg_hv() for SDAR wrongly returned
 the SIAR instead of the SDAR - which is quite traceable.

   * Then a one-reg interface for the DEXCR register (KVM_REG_PPC_DEXCR)
 is introduced. Issues can happen here if the initialization
 or the code in the case statement is wrong.
 A fix was added to keep the nested guest DEXCR in sync.
 The guest state element defined for DEXCR was already there,
 but not really considered - this is fixed now (DEXCR GSID).
 If the initialization or the code in the case statement is wrong,
 this can harm the guest state,
 and the guest state may get out of sync.

   * Another one-reg identifier was introduced
 that is used to read and set the virtual HASHKEYR
 for the guest during enter/exit with KVM_REG_PPC_HASHKEYR.
 Again, the initialization and the case code are critical.
 Code was added to keep the nested guest HASHKEYR in sync.
 Again, the guest state element defined for HASHKEYR was there,
 but not considered, which is fixed now (HASHKEYR GSID).
 If the initialization or the code in the case statement is wrong,
 this can harm the guest state
 and thus the L2 guest during enter or exit.

   * Yet another one-reg identifier was introduced
 that is used to read and set the virtual HASHPKEYR
 for the guest during enter/exit with KVM_REG_PPC_HASHPKEYR.
 And again the guest state element defined for HASHPKEYR
 was there but ignored, which is now fixed (HASHPKEYR GSID).
 If the initialization or the code in the case statement is wrong,
 this can harm the guest state
 and thus the L2 guest during enter or exit.

  [ Other Info ]

   * Since (nested) KVM support is new on P10,
 this does not affect older Power generations
 (P9 is the only other hw generation that is supported by 24.04,
 but it only supports native virtualization).

   * Both patches have been accepted upstream since v6.11(-rc1),
 hence will be in oracular,
 and are also tagged upstream as stable updates.

   * Since the required firmware FW1060 is relatively new,
 we can assume that not many users have run into this issue yet.
  __

  == Comment: #0 - SEETEENA THOUFEEK - 2024-08-09 03:50:24 ==
  +++ This bug was initially created as a clone of Bug #206737 +++

  ---Problem Description---
  L2 Guest migration: evelp2g4[L2]: while running NFS guest migration
  continuously dumping smp_call_function_many_cond+0x500/0x738 (unreliable)
  and watchdog: BUG: soft lockup - CPU#14 stuck for 223s! [systemd-homed]

  ---uname output---
  NA

  Machine Type = NA

  Contact Information = NA

  [79205.163691] Hardware name: IBM pSeries (emulated by qemu) POWER10 (raw) 0x800200 0xf06 of:SLOF,HEAD hv:linux,kvm pSeries
  [79205.163834] NIP:  c02bb7a4 LR: c02bb750 CTR: c00d192c
  [79205.163929] REGS: c003871cf1b0 TRAP: 0900   Tainted: G L
  [79205.165041] MSR:  8280b033   CR: 4404  XER: 20040004
  [79205.165266] CFAR:  IRQMASK: 0
     GPR00: c02bbc58 c003871cf450 c20ded00 0009
     GPR04: 0009 0009 0080 0200
     GPR08: 01ff 0001 c00740f57ee0 44048222
     GPR12: c00d192c c00743ddc980  

     GPR16:  cd

[Kernel-packages] [Bug 2070329] Re: KOP L2 guest fails to boot with 1 core - SMT8 topology

2024-09-02 Thread Frank Heimes

Pull request submitted to the kernel team's mailing list:
https://lists.ubuntu.com/archives/kernel-team/2024-September/thread.html#153260
Changing status to 'In Progress' and assigning the kernel team.

** Changed in: linux (Ubuntu Noble)
   Status: Confirmed => In Progress

** Changed in: ubuntu-power-systems
   Status: Confirmed => In Progress

** Changed in: linux (Ubuntu Noble)
 Assignee: (unassigned) => Canonical Kernel Team (canonical-kernel-team)

https://bugs.launchpad.net/bugs/2070329

Title:
  KOP L2 guest fails to boot with 1 core - SMT8 topology

Status in The Ubuntu-power-systems project:
  In Progress
Status in linux package in Ubuntu:
  Fix Committed
Status in linux source package in Noble:
  In Progress
Status in linux source package in Oracular:
  Fix Committed

Bug description:
  SRU Justification:

  [ Impact ]

   * On a P10 system with SMT-8 configured
 a level 2 guest (VM) fails to boot in case
 it only has one core assigned.

  [ Test Plan ]

   * Set up an IBM Power 10 system - one that supports up to SMT-8
 and has firmware 1060, which offers support for KVM -
 using Ubuntu Server 24.04 for ppc64el.

   * Set up qemu/KVM on this system.

   * Configure a KVM guest (e.g. using virtinst or
 qemu-system-ppc64 directly) with SMT-8,
 but only one core.

   * Try to boot this specific guest:
 qemu-system-ppc64 \
   -drive file=rhel.qcow2,format=qcow2 \
   -m 20G \
   -smp 8,cores=1,threads=8 \
   -cpu host \
   -nographic \
   -machine pseries,ic-mode=xics -accel kvm

   * It will fail to boot with a kernel that does not
 have the two patches in place.

   * Since this setup requires a special firmware level,
 the verification will be done by the IBM Power team.
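
Once a guest with this topology does boot (that is, on a kernel with the fix), the topology it sees can be checked from inside the guest. The following is a minimal sketch, assuming lscpu from util-linux is available in the guest; the helper name guest_topology is made up for this example.

```shell
#!/bin/sh
# Sketch: summarize the CPU topology from lscpu-style text on stdin.
# Prints "cores=<n> threads=<m>" based on the "Core(s) per socket"
# and "Thread(s) per core" lines.
guest_topology() {
    awk -F: '
        /^Thread\(s\) per core/ { gsub(/[ \t]/, "", $2); threads = $2 }
        /^Core\(s\) per socket/ { gsub(/[ \t]/, "", $2); cores = $2 }
        END { printf "cores=%s threads=%s\n", cores, threads }
    '
}
```

Inside the guest this would be used as "lscpu | guest_topology"; for the -smp line above one would expect one core with eight threads.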

  [ Where problems could occur ]

   * Primarily, support for using the DPDES register is required,
 since it's needed for enabling the usage of doorbells in L2 guests.
 This is mainly done by adding DEFINEs, stubs and a case.
 If the definitions are not correct or if the code executed by
 the new case (KVMPPC_GSID_DPDES) is wrong,
 the guest state could be incorrect, harming the L2 guest doorbell.
 (DPDES is to provide the means for the hypervisor to save a
  [sub-]processor's Directed Privileged Doorbell exception state
  when the set of programs running on the [sub-]processor is
  swapped out or moved from one [sub-]processor to another.)

   * The missing doorbell emulation got added by a 4-line if statement
 in powerpc/kvm/book3s_hv.c, which is relatively traceable.

   * The main issue I can think of is that kvmppc_set_dpdes is called
 with wrong arguments.

   * And kvmppc_set_dpdes will not work (at all) if the above DPDES
 support (and commit/patch) is missing.

  [ Other Info ]

   * Since (nested) KVM support is new on P10,
 this does not affect older Power generations
 (P9 is the only other hw generation that is supported by 24.04,
 but it only supports native virtualization).

   * Both patches have been accepted upstream since v6.11(-rc1),
 hence will be in oracular,
 and are also tagged upstream as stable updates.

   * Since the required firmware FW1060 is relatively new,
 we can assume that not many users have run into this issue yet.
  __

  == Comment: #0 - SEETEENA THOUFEEK - 2024-06-25 01:24:11 ==
  +++ This bug was initially created as a clone of Bug #205277 +++

  ---Problem Description---
  KOP L2 guest fails to boot with 1 core - SMT8 topology

  ---Additional Hardware Info---
  na

  ---Debugger Data---
  na

  ---Steps to Reproduce---
   KOP L2 guest fails to boot when we set the CPU topology as 1 core - SMT 8

  command line used to verify the issue:
  #!/bin/sh

  QEMU="/home/mgautam/qemu"
  qemu-system-ppc64 -s \
  -drive file=/root/debian-12-nocloud-ppc64el.qcow2,format=qcow2 \
  -m 20G \
  -smp 8,cores=1,sockets=1,threads=8 \
  -cpu host \
  -nographic \
  -machine pseries,ic-mode=xics -accel kvm  \
  -net nic,model=virtio \
  -net user,host=10.0.2.10,hostfwd=tcp:127.0.0.1:10022-:22

  NOTE: L2 boots fine when doorbells are turned off in L1 kernel

  As per the investigation so far, the doorbell exception is not getting
  fired inside L2 guest. At L1 level, if we set DPDES=1 in the GSB for
  L2, the guest never receives the doorbell and also it is never cleared
  from the GSB. We are discussing this behaviour with phyp team.

  The root cause of this issue is the lack of DPDES support at L1. I've
  posted the fix upstream:
  https://lore.kernel.org/linuxppc-dev/20240522084949.123148-1-gau...@linux.ibm.com/T/#u

  The fix has been accepted upstream and will be backported for kernels
  >= 6.7

  https://lore.kernel.org/linuxppc-dev/20240605113913.83715-1-gau...@linux.ibm.com/
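
As a rough aid for the ">= 6.7" statement above, a version string can be compared against 6.7 as sketched below. This assumes GNU sort with the -V option; the helper name and the sample version are made up for this example. It only compares version numbers and cannot tell whether a particular distribution kernel already carries the backport.

```shell
#!/bin/sh
# Sketch: version comparison with "sort -V" (GNU coreutils assumed).

# Succeed if version $1 is greater than or equal to version $2.
version_at_least() {
    lowest=$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)
    [ "$lowest" = "$2" ]
}

# Made-up sample value; on a real system "$(uname -r)" could be passed.
if version_at_least "6.8.0-45-generic" "6.7"; then
    echo "version is in the range named for the backport (>= 6.7)"
fi
```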

  ---Patches Installed---
  na

  ---System Hang---
   na

  ---uname

[Kernel-packages] [Bug 2070329] Re: KOP L2 guest fails to boot with 1 core - SMT8 topology

2024-09-02 Thread Frank Heimes

A test kernel is currently being built in this PPA:
https://launchpad.net/~fheimes/+archive/ubuntu/lp2070329

https://bugs.launchpad.net/bugs/2070329

Title:
  KOP L2 guest fails to boot with 1 core - SMT8 topology

Status in The Ubuntu-power-systems project:
  Confirmed
Status in linux package in Ubuntu:
  Fix Committed
Status in linux source package in Noble:
  Confirmed
Status in linux source package in Oracular:
  Fix Committed

Bug description:
  SRU Justification:

  [ Impact ]

   * On a P10 system with SMT-8 configured
 a level 2 guest (VM) fails to boot in case
 it only has one core assigned.

  [ Test Plan ]

   * Set up an IBM Power 10 system - one that supports up to SMT-8
 and has firmware 1060, which offers support for KVM -
 using Ubuntu Server 24.04 for ppc64el.

   * Set up qemu/KVM on this system.

   * Configure a KVM guest (e.g. using virtinst or
 qemu-system-ppc64 directly) with SMT-8,
 but only one core.

   * Try to boot this specific guest:
 qemu-system-ppc64 \
   -drive file=rhel.qcow2,format=qcow2 \
   -m 20G \
   -smp 8,cores=1,threads=8 \
   -cpu host \
   -nographic \
   -machine pseries,ic-mode=xics -accel kvm

   * It will fail to boot with a kernel that does not
 have the two patches in place.

   * Since this setup requires a special firmware level,
 the verification will be done by the IBM Power team.

  [ Where problems could occur ]

   * Primarily, support for using the DPDES register is required,
 since it's needed for enabling the usage of doorbells in L2 guests.
 This is mainly done by adding DEFINEs, stubs and a case.
 If the definitions are not correct or if the code executed by
 the new case (KVMPPC_GSID_DPDES) is wrong,
 the guest state could be incorrect, harming the L2 guest doorbell.
 (DPDES is to provide the means for the hypervisor to save a
  [sub-]processor's Directed Privileged Doorbell exception state
  when the set of programs running on the [sub-]processor is
  swapped out or moved from one [sub-]processor to another.)

   * The missing doorbell emulation got added by a 4-line if statement
 in powerpc/kvm/book3s_hv.c, which is relatively traceable.

   * The main issue I can think of is that kvmppc_set_dpdes is called
 with wrong arguments.

   * And kvmppc_set_dpdes will not work (at all) if the above DPDES
 support (and commit/patch) is missing.

  [ Other Info ]

   * Since (nested) KVM support is new on P10,
 this does not affect older Power generations
 (P9 is the only other hw generation that is supported by 24.04,
 but it only supports native virtualization).

   * Both patches have been accepted upstream since v6.11(-rc1),
 hence will be in oracular,
 and are also tagged upstream as stable updates.

   * Since the required firmware FW1060 is relatively new,
 we can assume that not many users have run into this issue yet.
  __

  == Comment: #0 - SEETEENA THOUFEEK - 2024-06-25 01:24:11 ==
  +++ This bug was initially created as a clone of Bug #205277 +++

  ---Problem Description---
  KOP L2 guest fails to boot with 1 core - SMT8 topology

  ---Additional Hardware Info---
  na

  ---Debugger Data---
  na

  ---Steps to Reproduce---
   KOP L2 guest fails to boot when we set the CPU topology as 1 core - SMT 8

  command line used to verify the issue:
  #!/bin/sh

  QEMU="/home/mgautam/qemu"
  qemu-system-ppc64 -s \
  -drive file=/root/debian-12-nocloud-ppc64el.qcow2,format=qcow2 \
  -m 20G \
  -smp 8,cores=1,sockets=1,threads=8 \
  -cpu host \
  -nographic \
  -machine pseries,ic-mode=xics -accel kvm  \
  -net nic,model=virtio \
  -net user,host=10.0.2.10,hostfwd=tcp:127.0.0.1:10022-:22

  NOTE: L2 boots fine when doorbells are turned off in L1 kernel

  As per the investigation so far, the doorbell exception is not getting
  fired inside L2 guest. At L1 level, if we set DPDES=1 in the GSB for
  L2, the guest never receives the doorbell and also it is never cleared
  from the GSB. We are discussing this behaviour with phyp team.

  The root cause of this issue is the lack of DPDES support at L1. I've
  posted the fix upstream:
  https://lore.kernel.org/linuxppc-dev/20240522084949.123148-1-gau...@linux.ibm.com/T/#u

  The fix has been accepted upstream and will be backported for kernels
  >= 6.7

  https://lore.kernel.org/linuxppc-dev/20240605113913.83715-1-gau...@linux.ibm.com/

  ---Patches Installed---
  na

  ---System Hang---
   na

  ---uname output---
  na

  Contact Information = na

  Machine Type = na

  Userspace rpm: na

  Userspace tool common name: na

  The userspace tool has the following bit modes: na

  Userspace tool obtained from project website:  na

  *Additional Instructions for na:
  -Post a private note with access information to the machine that is currently 
in the deb

[Kernel-packages] [Bug 2070329] Re: KOP L2 guest fails to boot with 1 core - SMT8 topology

2024-08-30 Thread Frank Heimes

** Description changed:

+ SRU Justification:
+ 
+ [ Impact ]
+ 
+  * On a P10 system with SMT-8 configured
+a level 2 guest (VM) fails to boot in case
+it only has one core assigned.
+ 
+ [ Test Plan ]
+ 
+  * Set up an IBM Power 10 system - one that supports up to SMT-8
+and has firmware 1060, which offers support for KVM -
+using Ubuntu Server 24.04 for ppc64el.
+ 
+  * Set up qemu/KVM on this system.
+ 
+  * Configure a KVM guest (e.g. using virtinst or
+qemu-system-ppc64 directly) with SMT-8,
+but only one core.
+ 
+  * Try to boot this specific guest:
+qemu-system-ppc64 \
+   -drive file=rhel.qcow2,format=qcow2 \
+   -m 20G \
+   -smp 8,cores=1,threads=8 \
+   -cpu host \
+   -nographic \
+   -machine pseries,ic-mode=xics -accel kvm
+ 
+  * It will fail to boot with a kernel that does not
+have the two patches in place.
+ 
+  * Since this setup requires a special firmware level,
+the verification will be done by the IBM Power team.
+ 
+ [ Where problems could occur ]
+ 
+  * Primarily, support for using the DPDES register is required,
+since it's needed for enabling the usage of doorbells in L2 guests.
+This is mainly done by adding DEFINEs, stubs and a case.
+If the definitions are not correct or if the code executed by
+the new case (KVMPPC_GSID_DPDES) is wrong,
+the guest state could be incorrect, harming the L2 guest doorbell.
+(DPDES is to provide the means for the hypervisor to save a
+ [sub-]processor's Directed Privileged Doorbell exception state
+ when the set of programs running on the [sub-]processor is
+ swapped out or moved from one [sub-]processor to another.)
+ 
+  * The missing doorbell emulation got added by a 4-line if statement
+in powerpc/kvm/book3s_hv.c, which is relatively traceable.
+ 
+  * The main issue I can think of is that kvmppc_set_dpdes is called
+with wrong arguments.
+ 
+  * And kvmppc_set_dpdes will not work (at all) if the above DPDES
+support (and commit/patch) is missing.
+ 
+ [ Other Info ]
+ 
+  * Since (nested) KVM support is new on P10,
+this does not affect older Power generations
+(P9 is the only other hw generation that is supported by 24.04,
+but it only supports native virtualization).
+ 
+  * Both patches have been accepted upstream since v6.11(-rc1),
+hence will be in oracular,
+and are also tagged upstream as stable updates.
+ 
+  * Since the required firmware FW1060 is relatively new,
+we can assume that not many users have run into this issue yet.
+ __
+ 
  == Comment: #0 - SEETEENA THOUFEEK  - 2024-06-25 
01:24:11 ==
  +++ This bug was initially created as a clone of Bug #205277 +++
  
  ---Problem Description---
  KOP L2 guest fails to boot with 1 core - SMT8 topology
-  
+ 
  ---Additional Hardware Info---
- na 
+ na
  
-  
  ---Debugger Data---
- na 
-  
+ na
+ 
  ---Steps to Reproduce---
-  KOP L2 guest fails to boot when we set the CPU topology as 1 core - SMT 8
+  KOP L2 guest fails to boot when we set the CPU topology as 1 core - SMT 8
  
  command line used to verify the issue:
  #!/bin/sh
  
  QEMU="/home/mgautam/qemu"
  qemu-system-ppc64 -s \
  -drive file=/root/debian-12-nocloud-ppc64el.qcow2,format=qcow2 \
  -m 20G \
  -smp 8,cores=1,sockets=1,threads=8 \
  -cpu host \
  -nographic \
  -machine pseries,ic-mode=xics -accel kvm  \
  -net nic,model=virtio \
  -net user,host=10.0.2.10,hostfwd=tcp:127.0.0.1:10022-:22
  
- 
  NOTE: L2 boots fine when doorbells are turned off in L1 kernel
  
- 
- As per the investigation so far, the doorbell exception is not getting fired 
inside L2 guest. At L1 level, if we set DPDES=1 in the GSB for L2, the guest 
never receives the doorbell and also it is never cleared from the GSB. We are 
discussing this behaviour with phyp team.
+ As per the investigation so far, the doorbell exception is not getting
+ fired inside L2 guest. At L1 level, if we set DPDES=1 in the GSB for L2,
+ the guest never receives the doorbell and also it is never cleared from
+ the GSB. We are discussing this behaviour with phyp team.
  
  The root cause of this issue is lack of DPDES support at L1. I've posted
  the fix upstream - https://lore.kernel.org/linuxppc-
  dev/20240522084949.123148-1-gau...@linux.ibm.com/T/#u
  
+ The fix has been accepted upstream and will be backported for kernels >=
+ 6.7
  
- The fix has been accepted upstream and will be backported for kernels >= 6.7
+ https://lore.kernel.org/linuxppc-
+ dev/20240605113913.83715-1-gau...@linux.ibm.com/
  
- 
https://lore.kernel.org/linuxppc-dev/20240605113913.83715-1-gau...@linux.ibm.com/
-  
  ---Patches Installed---
  na
-  
+ 
  ---System Hang---
-  na
-  
+  na
+ 
  ---uname output---
  na
-  
- Contact Information = na 
-  
- Machine Type = na 
  
- Userspace rpm: na 
-  
- Userspace tool common name: na 
-  
- The userspace tool has the following bit modes: na 
+ Contact Information = na
  
- Userspace tool obtained from project website:  n

[Kernel-packages] [Bug 2076406] Re: L2 Guest migration: continuously dumping while running NFS guest migration

2024-08-30 Thread Frank Heimes

** Description changed:

+ SRU Justification:
+ 
+ [ Impact ]
+ 
+  * While doing ISST testing it turned out that a 2nd level (KVM)
+guest (aka VM) continuously dumped when running an NFS
+guest migration.
+ 
+ [ Test Plan ]
+ 
+  * Set up two IBM Power 10 systems (with firmware 1060, which offers
+support for KVM) with Ubuntu Server 24.04 for ppc64el.
+ 
+  * Set up qemu/KVM on both of these systems to allow guest migration.
+ 
+  * Set up a KVM guest and place its disk on an NFS volume.
+ 
+  * Now initiate a guest migration.
+ 
+  * Without the two patches the initiator system will start to dump.
+ 
+  * Since this setup requires a special firmware level,
+the verification will be done by the IBM Power team.
+ 
+ [ Where problems could occur ]
+ 
+  * Although the patch set looks huge,
+the patches themselves are relatively small and not very invasive,
+and I would consider them mainly as fixes.
+ 
+  * kvmppc_set_one_reg_hv() wrongly called get() instead of
+set() for MMCR3.
+ 
+  * And kvmppc_get_one_reg_hv() for SDAR wrongly returned
+the SIAR instead of the SDAR - which is quite traceable.
+ 
+  * Then a one-reg interface for the DEXCR register (KVM_REG_PPC_DEXCR)
+is introduced. Issues can happen here if the initialization
+or the code in the case statement is wrong.
+A fix was added to keep the nested guest DEXCR in sync.
+The guest state element defined for DEXCR was already there,
+but not really considered - this is fixed now (DEXCR GSID).
+If the initialization or the code in the case statement is wrong,
+this can harm the guest state,
+and the guest state may get out of sync.
+ 
+  * Another one-reg identifier was introduced
+that is used to read and set the virtual HASHKEYR
+for the guest during enter/exit with KVM_REG_PPC_HASHKEYR.
+Again, the initialization and the case code are critical.
+Code was added to keep the nested guest HASHKEYR in sync.
+Again, the guest state element defined for HASHKEYR was there,
+but not considered, which is fixed now (HASHKEYR GSID).
+If the initialization or the code in the case statement is wrong,
+this can harm the guest state
+and thus the L2 guest during enter or exit.
+ 
+  * Yet another one-reg identifier was introduced
+that is used to read and set the virtual HASHPKEYR
+for the guest during enter/exit with KVM_REG_PPC_HASHPKEYR.
+And again the guest state element defined for HASHPKEYR
+was there but ignored, which is now fixed (HASHPKEYR GSID).
+If the initialization or the code in the case statement is wrong,
+this can harm the guest state
+and thus the L2 guest during enter or exit.
+ 
+ [ Other Info ]
+ 
+  * Since (nested) KVM support is new on P10,
+this does not affect older Power generations
+(P9 is the only other hw generation that is supported by 24.04,
+but it only supports native virtualization).
+ 
+  * Both patches have been accepted upstream since v6.11(-rc1),
+hence will be in oracular,
+and are also tagged upstream as stable updates.
+ 
+  * Since the required firmware FW1060 is relatively new,
+we can assume that not many users have run into this issue yet.
+ __
+ 
  == Comment: #0 - SEETEENA THOUFEEK  - 2024-08-09 
03:50:24 ==
  +++ This bug was initially created as a clone of Bug #206737 +++
  
  ---Problem Description---
  L2 Guest migration: evelp2g4[L2]: while running NFS guest migration 
continuously dumping smp_call_function_many_cond+0x500/0x738 (unreliable) and 
watchdog: BUG: soft lockup - CPU#14 stuck for 223s! [systemd-homed}
-  
+ 
  ---uname output---
  NA
-  
- Machine Type = NA 
-  
+ 
+ Machine Type = NA
+ 
  Contact Information = NA
  
  [79205.163691] Hardware name: IBM pSeries (emulated by qemu) POWER10 (raw) 
0x800200 0xf06 of:SLOF,HEAD hv:linux,kvm pSeries
  [79205.163834] NIP:  c02bb7a4 LR: c02bb750 CTR: 
c00d192c
- [79205.163929] REGS: c003871cf1b0 TRAP: 0900   Tainted: G L   
   
+ [79205.163929] REGS: c003871cf1b0 TRAP: 0900   Tainted: G L
  [79205.165041] MSR:  8280b033   CR: 
4404  XER: 20040004
  [79205.165266] CFAR:  IRQMASK: 0
-GPR00: c02bbc58 c003871cf450 c20ded00 
0009
-GPR04: 0009 0009 0080 
0200
-GPR08: 01ff 0001 c00740f57ee0 
44048222
-GPR12: c00d192c c00743ddc980  

-GPR16:  cd86e200 0001 
0001
-GPR20: 000c c3d06188 c00ac4d0 
ca374e00
-GPR24: c3d06840  c00741193188 
c00741193188
-GPR28: c00741193180 c3d06840 0048 
0009
+    GPR00: c02bbc58 c003871cf450 c20ded0

[Kernel-packages] [Bug 2076406] Re: L2 Guest migration: continuously dumping while running NFS guest migration

2024-08-30 Thread Frank Heimes

** Summary changed:

- ISST-LTE:KOP:1060FW:evelp2 :L2 Guest migration: evelp2g4[L2]: while running 
NFS guest migration  continuously  dumping 
smp_call_function_many_cond+0x500/0x738 (unreliable) and watchdog: BUG: soft 
lockup - CPU#14 stuck for 223s! [systemd-homed} (Fedora)
+ L2 Guest migration: continuously dumping while running NFS guest migration

https://bugs.launchpad.net/bugs/2076406

Title:
  L2 Guest migration: continuously dumping while running NFS guest
  migration

Status in The Ubuntu-power-systems project:
  Triaged
Status in linux package in Ubuntu:
  Fix Committed
Status in linux source package in Noble:
  Triaged
Status in linux source package in Oracular:
  Fix Committed

Bug description:
  == Comment: #0 - SEETEENA THOUFEEK - 2024-08-09 03:50:24 ==
  +++ This bug was initially created as a clone of Bug #206737 +++

  ---Problem Description---
  L2 Guest migration: evelp2g4[L2]: while running NFS guest migration
  continuously dumping smp_call_function_many_cond+0x500/0x738 (unreliable)
  and watchdog: BUG: soft lockup - CPU#14 stuck for 223s! [systemd-homed]
   
  ---uname output---
  NA
   
  Machine Type = NA 
   
  Contact Information = NA

  [79205.163691] Hardware name: IBM pSeries (emulated by qemu) POWER10 (raw) 0x800200 0xf06 of:SLOF,HEAD hv:linux,kvm pSeries
  [79205.163834] NIP:  c02bb7a4 LR: c02bb750 CTR: c00d192c
  [79205.163929] REGS: c003871cf1b0 TRAP: 0900   Tainted: G L
  [79205.165041] MSR:  8280b033   CR: 4404  XER: 20040004
  [79205.165266] CFAR:  IRQMASK: 0
 GPR00: c02bbc58 c003871cf450 c20ded00 0009
 GPR04: 0009 0009 0080 0200
 GPR08: 01ff 0001 c00740f57ee0 44048222
 GPR12: c00d192c c00743ddc980
 GPR16:  cd86e200 0001 0001
 GPR20: 000c c3d06188 c00ac4d0 ca374e00
 GPR24: c3d06840  c00741193188 c00741193188
 GPR28: c00741193180 c3d06840 0048 0009
  [79205.171660] NIP [c02bb7a4] smp_call_function_many_cond+0x1e0/0x738
  [79205.171752] LR [c02bb750] smp_call_function_many_cond+0x18c/0x738
  [79205.171835] Call Trace:
  [79205.171869] [c003871cf450] [c02bbc58] smp_call_function_many_cond+0x694/0x738 (unreliable)
  [79205.171986] [c003871cf520] [c00ac4d0] radix__tlb_flush+0x4c/0x140
  [79205.173636] [c003871cf560] [c052e900] tlb_finish_mmu+0x130/0x1f0
  [79205.173754] [c003871cf590] [c052a280] exit_mmap+0x1cc/0x574
  [79205.173848] [c003871cf6c0] [c016ec9c] __mmput+0x54/0x1d4
  [79205.173939] [c003871cf6f0] [c06385c4] begin_new_exec+0x6dc/0xefc
  [79205.174037] [c003871cf780] [c06edea8] load_elf_binary+0x4c8/0x1a50
  [79205.174136] [c003871cf880] [c06361c8] bprm_execve+0x2b4/0x7a0
  [79205.174219] [c003871cf950] [c0637988] do_execveat_common+0x1c0/0x2d8
  [79205.174316] [c003871cf9f0] [c0638e38] sys_execve+0x54/0x6c
  [79205.174399] [c003871cfa20] [c002fec8] system_call_exception+0x168/0x310
  [79205.174497] [c003871cfe50] [c000d05c] system_call_vectored_common+0x15c/0x2ec
  [79205.176245] --- interrupt: 3000 at 0x7fff95b10b08
  [79205.176326] NIP:  7fff95b10b08 LR: 7fff95b10b08 CTR:
  [79205.176438] REGS: c003871cfe80 TRAP: 3000   Tainted: G L   (
  [79205.176558] MSR:  8280f033   CR: 48044424  XER:
  [79205.176686] IRQMASK: 0
 GPR00: 000b 7fffe6919aa0 7fff95c47c00 000152598c80
 GPR04: 7fffe6919bf8 0001525db6e0  7fffe6919a20
 GPR08: 000152598c88
 GPR12:  7fff969a4220 000152585570
 GPR16: 7fffe6919c48 0570 000152598c80
 GPR20:  9998 00015259a450 000152586460
 GPR24: 0001525bca90 7fffe6919e48  0001525db6e0
 GPR28: 000117e98448 0001525d0b00  0010
  [79205.177505] NIP [7fff95b10b08] 0x7fff95b10b08
  [79205.177578] LR [7fff95b10b08] 0x7fff95b10b08
  [79205.177649] --- interrupt: 3000
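
To see how often the guest reported the symptom shown above, the kernel log can be scanned for the watchdog message. The following is a minimal sketch; the helper name soft_lockup_count is made up for this example, and the search string is taken from the problem description.

```shell
#!/bin/sh
# Sketch: count soft-lockup reports in kernel log text read from stdin.
# "|| true" keeps the exit status zero when grep finds no match
# (grep -c still prints 0 in that case).
soft_lockup_count() {
    grep -c 'watchdog: BUG: soft lockup' || true
}
```

Inside the affected guest this could be fed from "dmesg" or "journalctl -k", for example "dmesg | soft_lockup_count".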

  
  Steps to reproduce: Install the  build on NFS storage  guest kernel 
6.8.10-300 

  Start the

[Kernel-packages] [Bug 1959940] Re: [22.10 FEAT] KVM: Secure Execution guest dump encryption with customer keys - kernel part

2024-08-30 Thread Frank Heimes

Pull request submitted to the kernel team's mailing list:
https://lists.ubuntu.com/archives/kernel-team/2024-August/thread.html#153225
Changing status to 'In Progress' and assigning the kernel team.

** Changed in: linux (Ubuntu)
 Assignee: Canonical Kernel Team (canonical-kernel-team) => (unassigned)

** Changed in: linux (Ubuntu Jammy)
     Assignee: Frank Heimes (fheimes) => Canonical Kernel Team (canonical-kernel-team)

https://bugs.launchpad.net/bugs/1959940

Title:
  [22.10 FEAT] KVM: Secure Execution guest dump encryption with customer
  keys - kernel part

Status in Ubuntu on IBM z Systems:
  Fix Released
Status in linux package in Ubuntu:
  Fix Released
Status in linux source package in Jammy:
  In Progress

Bug description:
  SRU Justification:

  [ Impact ]

   * Hypervisor-initiated dumps for Secure Execution
 (aka confidential computing) guests are not helpful,
 because memory and CPU state is encrypted by a
 transient key only available to the Ultravisor (uv).

   * Workload owners can still configure kdump in order to obtain kernel
 crash information, but there are situations where kdump doesn't work.

   * In such situations problem determination is severely impeded.

   * This patch set solves this by implementing dumps created in a way
 that can only be decrypted by the owner of the guest image
 and be used for problem determination.

  [ Test Plan ]

   * The setup of a Secure Execution environment is not trivial
 and requires a certain set of hardware (IBM z15 or higher,
 with FC 115).

   * On top of the modifications of qemu that are handled in this
 LP bug, modifications of the Kernel (LP#1959940) and
 the s390-tools (LP#1959965) are required.

   * So at least a modified kernel and qemu test builds are needed
 or both should be in -proposed at the same time (which might
 be difficult).
 A modified s390-tools is not urgently needed, since for the
 verification of the kernel and qemu part a newer version
 can be used (but a modified s390-tools is also available in PPA).

   * A detailed description (using Ubuntu as an example) on how to set up
 secure execution is available here:
 Introducing IBM Secure Execution for Linux, April 2024 update
 https://www.ibm.com/docs/en/linuxonibm/pdf/lx24se04.pdf

   * And information on 'Working with dumps of KVM guests in
 IBM Secure Execution mode' is available here:
 
https://www.ibm.com/docs/en/linux-on-systems?topic=commands-zgetdump#czgetdump__se_dump_examples

  [ Where problems could occur ]

   * Ultravisor (uv) return codes are introduced, which is
 generally appreciated. Just the right return codes need to be set
 (and reacted upon).

   * Protected virtual machine dumps are newly introduced on top of
 dump of 'normal' KVM VMs.
 Since code is shared, it could have an unforeseen impact.

   * The doc renaming could lead to confusion,
 if people rely on old doc structure.

   * The new capability case (217) could cause issues,
 for example in case of issues during initialization.

   * CPU dump functionality was added (mainly as new s390x specific code
 under s390/kvm), but CPU dump is only one part;
 if it is not working correctly, it may lead to partially useless dump data.

   * Configuration dump functionality was also added
 (again mainly as new s390x specific code under s390/kvm),
 similar to CPU dump.
 And moving from dumping inside of a VM to dumping from outside
 (due to potential failures if done inside), might lead to a more
 complex flow (now involving the uv), hence could be more error prone.

   * Adding query dump information requires user space buffers.
 Here it's crucial that buffer size is big enough.

   * The newly added constants and structure definitions that are
 needed for dump support could become problematic in case wrong
 data types were used (applies to all header modifications).

   * IOCTL for PV information retrieval got introduced
 (kvm_s390_handle_pv_info, kvm_s390_handle_pv).
 There are potential side effects (see man ioctl),
 hence all potential failure cases should be covered.

   * The new dump feature needs to know how much memory is required, but if
 the call for this is incorrect, it could break the dump process.

   * uv_cb_header struct changed to offset representation,
 but using wrong offsets will lead to a wrong struct,
 dump issues and potential crashes.

  [ Other Info ]

   * Since 22.04 is a popular LTS release, it is already in use by many
 secure execution customers.
 But in case of severe crashes or issues in the secure execution
 (KVM) guests, dumps cannot be used as of today.

   * This enables customers, IBM and Canonical to get support in case

[Kernel-packages] [Bug 1959940] Re: [22.10 FEAT] KVM: Secure Execution guest dump encryption with customer keys - kernel part

2024-08-30 Thread Frank Heimes

** Description changed:

+ SRU Justification:
+
+ [ Impact ]
+
+ * Hypervisor-initiated dumps for Secure Execution
+(aka confidential computing) guests are not helpful,
+because memory and CPU state is encrypted by a
+transient key only available to the Ultravisor (uv).
+
+ * Workload owners can still configure kdump in order to obtain kernel
+crash information, but there are situations where kdump doesn't work.
+
+ * In such situations problem determination is severely impeded.
+
+ * This patch set solves this by implementing dumps created in a way
+that can only be decrypted by the owner of the guest image
+and be used for problem determination.
+
+ [ Test Plan ]
+
+ * The setup of a Secure Execution environment is not trivial
+and requires a certain set of hardware (IBM z15 or higher,
+with FC 115).
+
+ * On top of the modifications of qemu that are handled in this
+LP bug, modifications of the Kernel (LP#1959940) and
+the s390-tools (LP#1959965) are required.
+
+ * So at least a modified kernel and qemu test builds are needed
+or both should be in -proposed at the same time (which might
+be difficult).
+A modified s390-tools is not urgently needed, since for the
+verification of the kernel and qemu part a newer version
+can be used (but a modified s390-tools is also available in PPA).
+
+ * A detailed description (using Ubuntu as an example) on how to set up
+secure execution is available here:
+Introducing IBM Secure Execution for Linux, April 2024 update
+https://www.ibm.com/docs/en/linuxonibm/pdf/lx24se04.pdf
+
+ * And information on 'Working with dumps of KVM guests in
+IBM Secure Execution mode' is available here:
+https://www.ibm.com/docs/en/linux-on-systems?topic=commands-zgetdump#czgetdump__se_dump_examples
+
+ [ Where problems could occur ]
+
+ * Ultravisor (uv) return codes are introduced, which is
+generally appreciated. Just the right return codes need to be set
+(and reacted upon).
+
+ * Protected virtual machine dumps are newly introduced on top of
+dump of 'normal' KVM VMs.
+Since code is shared, it could have an unforeseen impact.
+
+ * The doc renaming could lead to confusion,
+if people rely on old doc structure.
+
+ * The new capability case (217) could cause issues,
+for example in case of issues during initialization.
+
+ * CPU dump functionality was added (mainly as new s390x specific code
+under s390/kvm), but CPU dump is only one part;
+if it is not working correctly, it may lead to partially useless dump data.
+
+ * Configuration dump functionality was also added
+(again mainly as new s390x specific code under s390/kvm),
+similar to CPU dump.
+And moving from dumping inside of a VM to dumping from outside
+(due to potential failures if done inside), might lead to a more
+complex flow (now involving the uv), hence could be more error prone.
+
+ * Adding query dump information requires user space buffers.
+Here it's crucial that buffer size is big enough.
+
+ * The newly added constants and structure definitions that are
+needed for dump support could become problematic in case wrong
+data types were used (applies to all header modifications).
+
+ * IOCTL for PV information retrieval got introduced
+(kvm_s390_handle_pv_info, kvm_s390_handle_pv).
+There are potential side effects (see man ioctl),
+hence all potential failure cases should be covered.
+
+ * The new dump feature needs to know how much memory is required, but if
+the call for this is incorrect, it could break the dump process.
+
+ * uv_cb_header struct changed to offset representation,
+but using wrong offsets will lead to a wrong struct,
+dump issues and potential crashes.
+
+ [ Other Info ]
+
+ * Since 22.04 is a popular LTS release, it is already in use by many
+secure execution customers.
+But in case of severe crashes or issues in the secure execution
+(KVM) guests, dumps cannot be used as of today.
+
+ * This enables customers, IBM and Canonical to get support in case of
+crashes/dumps on hardware that runs secure execution environments.
+
+ __
+
KVM: Secure Execution guest dump encryption with customer keys - kernel
part

Description:
Hypervisor-initiated dumps for Secure Execution guests are not helpful
because memory and CPU state is encrypted by a transient key only available to
the Ultravisor. Workload owners can still configure kdump in order to obtain
kernel crash information, but there are situations where kdump doesn't work. In
such situations problem determination is severely impeded. This feature will
implement dumps created in a way that can only be decrypted by the owner of the
guest image and be used for problem determination.

Request Type: Kernel - Enhancement from IBM
Upstream Acceptance: In Progress
Code Contribution: IBM code


[Kernel-packages] [Bug 2071774] Re: zfs-dkms FTBFS on Linux 6.10/s390x

2024-08-29 Thread Frank Heimes

Thanks for confirming, Timo!

** Changed in: ubuntu-z-systems
   Status: In Progress => Fix Released

-- 
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to zfs-linux in Ubuntu.
https://bugs.launchpad.net/bugs/2071774

Title:
  zfs-dkms FTBFS on Linux 6.10/s390x

Status in Ubuntu on IBM z Systems:
  Fix Released
Status in zfs-linux package in Ubuntu:
  Fix Released
Status in zfs-linux source package in Oracular:
  Fix Released

Bug description:
  ZFS FTBFS on Linux 6.10/s390x - full build log:
  https://launchpadlibrarian.net/737125941/buildlog_ubuntu-oracular-s390x.linux-unstable_6.10.0-12.12_BUILDING.txt.gz

  ```
CC [M]  <>/build/zfs/2.2.4/build/module/os/linux/zfs/zvol_os.o
LD [M]  <>/build/zfs/2.2.4/build/module/spl.o
LD [M]  <>/build/zfs/2.2.4/build/module/zfs.o
MODPOST <>/build/zfs/2.2.4/build/module/Module.symvers
  ERROR: modpost: GPL-incompatible module zfs.ko uses GPL-only symbol 'vm_layout'
  make[6]: *** [scripts/Makefile.modpost:145: <>/build/zfs/2.2.4/build/module/Module.symvers] Error 1
  make[5]: *** [<>/headers/linux-headers-6.10.0-12-generic/Makefile:1891: modpost] Error 2
  ```

  The same issue is 100% reproducible with upstream zfs (plus the 6.10
  compatibility patches on top):
  https://github.com/openzfs/zfs/pull/16250#issuecomment-2202351855
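
A small sketch, assuming one wants to scan a build log for this class of modpost failure. The function name `gpl_only_violations` is hypothetical, and the pattern matches only the error wording quoted in the log above (including the case where the mail client wrapped the symbol name onto the next line).

```python
import re

# Matches the modpost wording quoted in the build log; \s+ tolerates a line wrap
MODPOST_RE = re.compile(
    r"ERROR: modpost: GPL-incompatible module (\S+) uses GPL-only symbol\s+'([^']+)'"
)


def gpl_only_violations(log_text):
    """Return (module, symbol) pairs for GPL-only symbol errors in a build log."""
    return MODPOST_RE.findall(log_text)


LOG = """\
MODPOST <>/build/zfs/2.2.4/build/module/Module.symvers
  ERROR: modpost: GPL-incompatible module zfs.ko uses GPL-only symbol
'vm_layout'
"""

for module, symbol in gpl_only_violations(LOG):
    print(f"{module} needs GPL-only symbol {symbol}")
```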

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu-z-systems/+bug/2071774/+subscriptions


-- 
Mailing list: https://launchpad.net/~kernel-packages
Post to : kernel-packages@lists.launchpad.net
Unsubscribe : https://launchpad.net/~kernel-packages
More help   : https://help.launchpad.net/ListHelp

[Kernel-packages] [Bug 1959940] Re: [22.10 FEAT] KVM: Secure Execution guest dump encryption with customer keys - kernel part

2024-08-29 Thread Frank Heimes

I think I got it: there are just two backports, for
commit e9bf3acb23f0a6e18438c35944d6cb618d16cf05 and
commit 437cfd714db9c1d28878a6e2555e9a730f3490c8.
The rest are cherry-picks.

With that, a test kernel is currently being built in this PPA:
https://launchpad.net/~fheimes/+archive/ubuntu/lp1959940j

** Changed in: linux (Ubuntu Jammy)
   Status: Triaged => In Progress

** Changed in: linux (Ubuntu Jammy)
   Importance: Undecided => High

-- 
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/1959940

Title:
  [22.10 FEAT] KVM: Secure Execution guest dump encryption with customer
  keys - kernel part

Status in Ubuntu on IBM z Systems:
  Fix Released
Status in linux package in Ubuntu:
  Fix Released
Status in linux source package in Jammy:
  In Progress

Bug description:
  KVM: Secure Execution guest dump encryption with customer keys -
  kernel part

  Description:
  Hypervisor-initiated dumps for Secure Execution guests are not helpful
because memory and CPU state is encrypted by a transient key only available to
the Ultravisor.  Workload owners can still configure kdump in order to obtain
kernel crash information, but there are situations where kdump doesn't work. In
such situations problem determination is severely impeded. This feature will
implement dumps created in a way that can only be decrypted by the owner of the
guest image and be used for problem determination.

  Request Type: Kernel - Enhancement from IBM
  Upstream Acceptance: In Progress
  Code Contribution: IBM code

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu-z-systems/+bug/1959940/+subscriptions



[Kernel-packages] [Bug 2076406] Re: ISST-LTE:KOP:1060FW:evelp2 :L2 Guest migration: evelp2g4[L2]: while running NFS guest migration continuously dumping smp_call_function_many_cond+0x500/0x738 (unreli

2024-08-28 Thread Frank Heimes

** Package changed: kernel-package (Ubuntu) => linux (Ubuntu)

** Also affects: ubuntu-power-systems
   Importance: Undecided
   Status: New

** Changed in: ubuntu-power-systems
 Assignee: (unassigned) => Ubuntu on IBM Power Systems Bug Triage 
(ubuntu-power-triage)

** Changed in: ubuntu-power-systems
   Importance: Undecided => Critical

** Changed in: linux (Ubuntu)
   Importance: Undecided => High

** Also affects: linux (Ubuntu Noble)
   Importance: Undecided
   Status: New

** Also affects: linux (Ubuntu Oracular)
   Importance: High
 Assignee: Ubuntu on IBM Power Systems Bug Triage (ubuntu-power-triage)
   Status: New

** Changed in: linux (Ubuntu Noble)
   Importance: Undecided => High

** Changed in: linux (Ubuntu Noble)
   Status: New => Triaged

** Changed in: ubuntu-power-systems
   Status: New => Triaged

** Changed in: linux (Ubuntu Oracular)
 Assignee: Ubuntu on IBM Power Systems Bug Triage (ubuntu-power-triage) => 
(unassigned)

** Changed in: linux (Ubuntu Oracular)
   Status: New => Fix Committed

-- 
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/2076406

Title:
  ISST-LTE:KOP:1060FW:evelp2 :L2 Guest migration: evelp2g4[L2]: while
  running NFS guest migration  continuously  dumping
  smp_call_function_many_cond+0x500/0x738 (unreliable) and watchdog:
  BUG: soft lockup - CPU#14 stuck for 223s! [systemd-homed} (Fedora)

Status in The Ubuntu-power-systems project:
  Triaged
Status in linux package in Ubuntu:
  Fix Committed
Status in linux source package in Noble:
  Triaged
Status in linux source package in Oracular:
  Fix Committed

Bug description:
  == Comment: #0 - SEETEENA THOUFEEK - 2024-08-09 03:50:24 ==
  +++ This bug was initially created as a clone of Bug #206737 +++

  ---Problem Description---
  L2 Guest migration: evelp2g4[L2]: while running NFS guest migration 
continuously dumping smp_call_function_many_cond+0x500/0x738 (unreliable) and 
watchdog: BUG: soft lockup - CPU#14 stuck for 223s! [systemd-homed}
   
  ---uname output---
  NA
   
  Machine Type = NA 
   
  Contact Information = NA

  [79205.163691] Hardware name: IBM pSeries (emulated by qemu) POWER10 (raw) 
0x800200 0xf06 of:SLOF,HEAD hv:linux,kvm pSeries
  [79205.163834] NIP:  c02bb7a4 LR: c02bb750 CTR: 
c00d192c
  [79205.163929] REGS: c003871cf1b0 TRAP: 0900   Tainted: G L   
   
  [79205.165041] MSR:  8280b033   CR: 
4404  XER: 20040004
  [79205.165266] CFAR:  IRQMASK: 0
 GPR00: c02bbc58 c003871cf450 c20ded00 
0009
 GPR04: 0009 0009 0080 
0200
 GPR08: 01ff 0001 c00740f57ee0 
44048222
 GPR12: c00d192c c00743ddc980  

 GPR16:  cd86e200 0001 
0001
 GPR20: 000c c3d06188 c00ac4d0 
ca374e00
 GPR24: c3d06840  c00741193188 
c00741193188
 GPR28: c00741193180 c3d06840 0048 
0009
  [79205.171660] NIP [c02bb7a4] smp_call_function_many_cond+0x1e0/0x738
  [79205.171752] LR [c02bb750] smp_call_function_many_cond+0x18c/0x738
  [79205.171835] Call Trace:
  [79205.171869] [c003871cf450] [c02bbc58] 
smp_call_function_many_cond+0x694/0x738 (unreliable)
  [79205.171986] [c003871cf520] [c00ac4d0] 
radix__tlb_flush+0x4c/0x140
  [79205.173636] [c003871cf560] [c052e900] 
tlb_finish_mmu+0x130/0x1f0
  [79205.173754] [c003871cf590] [c052a280] exit_mmap+0x1cc/0x574
  [79205.173848] [c003871cf6c0] [c016ec9c] __mmput+0x54/0x1d4
  [79205.173939] [c003871cf6f0] [c06385c4] 
begin_new_exec+0x6dc/0xefc
  [79205.174037] [c003871cf780] [c06edea8] 
load_elf_binary+0x4c8/0x1a50
  [79205.174136] [c003871cf880] [c06361c8] bprm_execve+0x2b4/0x7a0
  [79205.174219] [c003871cf950] [c0637988] 
do_execveat_common+0x1c0/0x2d8
  [79205.174316] [c003871cf9f0] [c0638e38] sys_execve+0x54/0x6c
  [79205.174399] [c003871cfa20] [c002fec8] 
system_call_exception+0x168/0x310
  [79205.174497] [c003871cfe50] [c000d05c] 
system_call_vectored_common+0x15c/0x2ec
  [79205.176245] --- interrupt: 3000 at 0x7fff95b10b08
  [79205.176326] NIP:  7fff95b10b08 LR: 7fff95b10b08 CTR: 

  [79205.176438] REGS: c003871cfe80 TRAP: 3000   Tainted: G L   
   (
  [79205.176558] MSR:  8280f033   
CR: 48044424  XER: 
  [79205.176686] IRQMASK: 0
 GPR00: 000b 7fffe6919aa0 7fff95c47c00 
000152598c80

[Kernel-packages] [Bug 2060039] Re: [Ubuntu-24.04] FADump with recommended crash size is making the L1 hang

2024-08-28 Thread Frank Heimes

** Also affects: linux (Ubuntu Noble)
   Importance: Undecided
   Status: New

** Also affects: linux (Ubuntu Oracular)
   Importance: Undecided
   Status: Confirmed

** Changed in: linux (Ubuntu Noble)
   Status: New => Confirmed

** Changed in: linux (Ubuntu Noble)
   Status: Confirmed => In Progress

** Changed in: linux (Ubuntu Oracular)
   Status: Confirmed => Fix Committed

** Changed in: ubuntu-power-systems
   Status: Confirmed => In Progress

-- 
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/2060039

Title:
  [Ubuntu-24.04] FADump with recommended crash size is making the L1
  hang

Status in The Ubuntu-power-systems project:
  In Progress
Status in linux package in Ubuntu:
  Fix Committed
Status in linux source package in Noble:
  In Progress
Status in linux source package in Oracular:
  Fix Committed

Bug description:
  SRU Justification:
   
  [Impact]
   * The L1 host hangs when triggering FADump with the recommended crash
     kernel size.

  [Fix]
   * 353d7a84c214f184d5a6b62acdec8b4424159b7c 353d7a84c214 
"powerpc/64s/radix/kfence: map __kfence_pool at page granularity"
   
  [Test Case]
   * Have a Ubuntu Server 24.04 LTS installation on ppc64el.
   * Enable FADump with 1GB: fadump=on crashkernel=1024M
   * A kernel panic will happen when the dump is triggered.
   
  [Regression Potential]
  * There is a certain risk of a regression, but it is mapping only the memory
allocated for KFENCE pool at page granularity, reducing memory consumption
when KFENCE is used.
   
  * On top of that, the commit has already been reviewed and accepted upstream.
   
  * The modifications were done and tested by IBM.
   
  * The fadump feature is supported only on IBM POWER systems.
   
  [Other]
  * The fix/commit got upstream accepted with kernel v6.11-rc4,
hence Oracular (with a planned kernel of 6.11) is not affected.

  ...

  Problem description :
  ==

  Triggered FADump with the recommended crash kernel size. The L1 host hung.

  As per the public document
  https://wiki.ubuntu.com/ppc64el/Recommendations the recommended crash
  kernel size is 1024M for the system. But with 1024M and 2048M, the L1
  hangs. With 4096M, the crash dump is generated and collected.

  root@ubuntu2404:~# uname -ar
  Linux ubuntu2404 6.8.0-11-generic #11-Ubuntu SMP Wed Feb 14 00:33:03 UTC 2024 ppc64le ppc64le ppc64le GNU/Linux

  root@ubuntu2404:~# free -h
                 total        used        free      shared  buff/cache   available
  Mem:            48Gi       1.7Gi        46Gi        13Mi       687Mi        46Gi
  Swap:          8.0Gi          0B       8.0Gi

  root@ubuntu2404:~# cat /proc/cmdline
  BOOT_IMAGE=/vmlinux-6.8.0-11-generic root=/dev/mapper/ubuntu--vg-ubuntu--lv ro fadump=on crashkernel=1024M

  root@ubuntu2404:~# dmesg | grep -i reser
  [0.00] fadump: Reserved 1024MB of memory at 0x004000 (System RAM: 51200MB)
  [0.00] fadump: Initialized 0x4000 bytes cma area at 1024MB from 0x4007 bytes of memory reserved for firmware-assisted dump
  [0.00] Memory: 49316672K/52428800K available (23616K kernel code, 4096K rwdata, 25536K rodata, 8832K init, 2487K bss, 2063552K reserved, 1048576K cma-reserved)
  [0.396408] ibmvscsi 3066: Client reserve enabled

  root@ubuntu2404:~# kdump-config show
  DUMP_MODE:        fadump
  USE_KDUMP:        1
  KDUMP_COREDIR:    /var/crash
     /var/lib/kdump/vmlinuz
  kdump initrd:
     /var/lib/kdump/initrd.img
  current state:    ready to fadump

  IBM is looking to update the crash kernel reservations section of the
  wiki for Power.

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu-power-systems/+bug/2060039/+subscriptions



[Kernel-packages] [Bug 2070253] Re: KVM on PowerVM: L2 Guest-Aggressively entering CEDE results in low performance. Possible tuning opportunity.

2024-08-28 Thread Frank Heimes

The commit is now included in the 24.04 / noble kernel:
Ubuntu-6.8.0-44.44 (and newer)
which is currently in -proposed:
linux-generic | 6.8.0-44.44 | noble-proposed | amd64, arm64, armhf, ppc64el, s390x
Hence updating this ticket for noble to Fix Committed.
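
A simplified sketch of the "6.8.0-44.44 (and newer)" check, assuming full package version strings made of numeric components only. The function names are hypothetical. This is not the complete Debian version ordering (epochs, tildes and letters are ignored), so `dpkg --compare-versions` remains the authoritative comparison.

```python
import re


def kernel_version_key(version):
    """Turn '6.8.0-44.44' or 'Ubuntu-6.8.0-44.44' into a tuple of ints."""
    return tuple(int(number) for number in re.findall(r"\d+", version))


def includes_fix(installed, first_fixed="6.8.0-44.44"):
    """True if the installed package version is at least the first fixed one."""
    return kernel_version_key(installed) >= kernel_version_key(first_fixed)


for candidate in ("6.8.0-41.41", "Ubuntu-6.8.0-44.44", "6.8.0-45.45"):
    print(candidate, includes_fix(candidate))
```

Note that a `uname -r` string such as 6.8.0-44-generic lacks the upload number, so the package version should be used as input.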

** Changed in: linux (Ubuntu Noble)
   Status: Triaged => Fix Committed

** Changed in: ubuntu-power-systems
   Status: Triaged => Fix Committed

-- 
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/2070253

Title:
  KVM on PowerVM: L2 Guest-Aggressively entering CEDE results in low
  performance. Possible tuning opportunity.

Status in The Ubuntu-power-systems project:
  Fix Committed
Status in linux package in Ubuntu:
  Fix Committed
Status in linux source package in Noble:
  Fix Committed
Status in linux source package in Oracular:
  Fix Committed

Bug description:
  KVM on PowerVM: L2 Guest-Aggressively entering CEDE results in low
  performance. Possible tuning opportunity.

  ---uname output---
  Linux rhel86edb1 #1 SMP Sun Jan 21 11:45:44 EST 2024 ppc64le ppc64le ppc64le GNU/Linux

  ---Steps to Reproduce---
  Example: run a read-only test using the EDB-PGBENCH and DT7 workloads on
   1. L1-Host
   2. L2-Guest CEDE ON
   3. L2-Guest CEDE OFF

  A significant performance drop is observed in the L2-Guest CEDE ON case
  compared to the L2-Guest CEDE OFF case.

  Note: The host and guest configurations used for the performance
  experiments are listed below.

  Location of EDB-PGBENCH:
  #wget http://ci-http-results.aus.stglabs.ibm.com/perfTest/scripts/Bug_Scripts/pgbench_install.sh
  #chmod 777 pgbench_install.sh
  #./pgbench_install.sh -->> it will install EDB (pgbench) and run EDB on the target LPAR.

  Location of DT7 workload:

  #wget http://ci-http-results.aus.stglabs.ibm.com/perfTest/scripts/Bug_Scripts/DT7-Install.sh
  #chmod 777 DT7-Install.sh
  #./DT7-Install.sh -->> It will install DT7.

  Sample Commands: Once the installation is successful, run the commands
  below on the target LPAR.

  EDB-PGBENCH Commands:

  # su - enterprisedb
  # vi t1.tc -->> copy the lines below into the t1.tc file.

  ##t1.tc##
  runname=select
  SCALE=100
  runtime=300
  thread="40"
  smtlist="8"
  mode=select
  recreateinstance=yes
  recreateduringrun=yes
  warmup=no
  perf_stat=yes
  PGSQL=/usr/local/pgsql/bin
  #PGSQL=/usr/edb/as14/bin
  #PGPORT=5432
  cores=5
  ##t1.tc##

  #cp t1.tc tc/
  #./auto-run-test.sh
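
A minimal sketch for sanity-checking the t1.tc settings above before a run. It assumes the file is a flat list of KEY=value lines with '#' comments; how auto-run-test.sh actually reads the file is not stated in this report, so the parser and the name `parse_tc` are illustrative only.

```python
def parse_tc(text):
    """Parse KEY=value lines; '#' lines are comments, double quotes are stripped."""
    settings = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        key, sep, value = line.partition("=")
        if not sep:
            continue
        settings[key.strip()] = value.strip().strip('"')
    return settings


T1_TC = """\
##t1.tc##
runname=select
SCALE=100
runtime=300
thread="40"
smtlist="8"
mode=select
recreateinstance=yes
recreateduringrun=yes
warmup=no
perf_stat=yes
PGSQL=/usr/local/pgsql/bin
#PGSQL=/usr/edb/as14/bin
#PGPORT=5432
cores=5
##t1.tc##
"""

cfg = parse_tc(T1_TC)
# In this configuration, SMT level times cores equals the thread count
hw_threads = int(cfg["smtlist"]) * int(cfg["cores"])
print("threads requested:", cfg["thread"], "hardware threads:", hw_threads)
```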

  DT7 Commands :

  After installing DT7, run the command below:
  #cd /root
  #./DayTrader7_Run.sh -u 20 -l 900 -i 2  

  ##
  Machine Type: Power 10  LPAR (RHEL9.3)
  gcc   : 11.4.1
  Memory: 300GB
  Test type : pgbench-edb, DT7
  ##
  KVM Host lscpu output : 

  # lscpu
  Architecture:            ppc64le
    Byte Order:            Little Endian
  CPU(s):                  96
    On-line CPU(s) list:   0-39
    Off-line CPU(s) list:  40-95
  Model name:              POWER10 (architected), altivec supported
    Model:                 2.0 (pvr 0080 0200)
    Thread(s) per core:    8
    Core(s) per socket:    5
    Socket(s):             1
    Physical sockets:      1
    Physical chips:        4
    Physical cores/chip:   12
  Virtualization features:
    Hypervisor vendor:     pHyp
    Virtualization type:   para
  Caches (sum of all):
    L1d:                   320 KiB (10 instances)
    L1i:                   480 KiB (10 instances)
    L2:                    10 MiB (10 instances)
    L3:                    40 MiB (10 instances)
  NUMA:
    NUMA node(s):          1
    NUMA node2 CPU(s):     0-39
  Vulnerabilities:
    Gather data sampling:  Not affected
    Itlb multihit:         Not affected
    L1tf:                  Not affected
    Mds:                   Not affected
    Meltdown:              Not affected
    Mmio stale data:       Not affected
    Retbleed:              Not affected
    Spec rstack overflow:  Not affected
    Spec store bypass:     Not affected
    Spectre v1:            Vulnerable, ori31 speculation barrier enabled
    Spectre v2:            Vulnerable
    Srbds:                 Not affected
    Tsx async abort:       Not affected

  
  ##

  KVM on PowerVM setup:

  KVM (Kernel-based Virtual Machine) is a virtualization module for Linux
  that allows the kernel to function as a hypervisor.

  We used a P10 2S4U system for this experiment.

  Workloads: DT7 and PGBENCH in detail:

  DT7 is an open source benchmark application emulating an online stock trading system.
  DT7 consists of 3 components:
  1) Jmeter
  2) WAS (WebSphere Application Server)
  3) DB2

  The DayTrader benchmark/application will be installed/deployed on WAS and
  this uses DB2 as a backbone database.  Jme

[Kernel-packages] [Bug 2070253] Re: KVM on PowerVM: L2 Guest-Aggressively entering CEDE results in low performance. Possible tuning opportunity.

2024-08-27 Thread Frank Heimes

** Also affects: linux (Ubuntu Noble)
   Importance: Undecided
   Status: New

** Also affects: linux (Ubuntu Oracular)
   Importance: High
   Status: New

** Changed in: ubuntu-power-systems
   Status: New => Triaged

** Changed in: linux (Ubuntu Noble)
   Status: New => Triaged

** Changed in: linux (Ubuntu Oracular)
   Status: New => Fix Committed

-- 
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/2070253

Title:
  KVM on PowerVM: L2 Guest-Aggressively entering CEDE results in low
  performance. Possible tuning opportunity.

Status in The Ubuntu-power-systems project:
  Triaged
Status in linux package in Ubuntu:
  Fix Committed
Status in linux source package in Noble:
  Triaged
Status in linux source package in Oracular:
  Fix Committed

Bug description:
  KVM on PowerVM: L2 Guest-Aggressively entering CEDE results in low
  performance. Possible tuning opportunity.

  ---uname output---
  Linux rhel86edb1 #1 SMP Sun Jan 21 11:45:44 EST 2024 ppc64le ppc64le ppc64le GNU/Linux

  ---Steps to Reproduce---
  Example: run a read-only test using the EDB-PGBENCH and DT7 workloads on
   1. L1-Host
   2. L2-Guest CEDE ON
   3. L2-Guest CEDE OFF

  A significant performance drop is observed in the L2-Guest CEDE ON case
  compared to the L2-Guest CEDE OFF case.

  Note: The host and guest configurations used for the performance
  experiments are listed below.

  Location of EDB-PGBENCH:
  #wget http://ci-http-results.aus.stglabs.ibm.com/perfTest/scripts/Bug_Scripts/pgbench_install.sh
  #chmod 777 pgbench_install.sh
  #./pgbench_install.sh -->> it will install EDB (pgbench) and run EDB on the target LPAR.

  Location of DT7 workload:

  #wget http://ci-http-results.aus.stglabs.ibm.com/perfTest/scripts/Bug_Scripts/DT7-Install.sh
  #chmod 777 DT7-Install.sh
  #./DT7-Install.sh -->> It will install DT7.

  Sample Commands: Once the installation is successful, run the commands
  below on the target LPAR.

  EDB-PGBENCH Commands:

  # su - enterprisedb
  # vi t1.tc -->> copy the lines below into the t1.tc file.

  ##t1.tc##
  runname=select
  SCALE=100
  runtime=300
  thread="40"
  smtlist="8"
  mode=select
  recreateinstance=yes
  recreateduringrun=yes
  warmup=no
  perf_stat=yes
  PGSQL=/usr/local/pgsql/bin
  #PGSQL=/usr/edb/as14/bin
  #PGPORT=5432
  cores=5
  ##t1.tc##

  #cp t1.tc tc/
  #./auto-run-test.sh

  DT7 Commands :

  After installing DT7, run the command below:
  #cd /root
  #./DayTrader7_Run.sh -u 20 -l 900 -i 2  

  ##
  Machine Type: Power 10  LPAR (RHEL9.3)
  gcc   : 11.4.1
  Memory: 300GB
  Test type : pgbench-edb, DT7
  ##
  KVM Host lscpu output : 

  # lscpu
  Architecture:            ppc64le
    Byte Order:            Little Endian
  CPU(s):                  96
    On-line CPU(s) list:   0-39
    Off-line CPU(s) list:  40-95
  Model name:              POWER10 (architected), altivec supported
    Model:                 2.0 (pvr 0080 0200)
    Thread(s) per core:    8
    Core(s) per socket:    5
    Socket(s):             1
    Physical sockets:      1
    Physical chips:        4
    Physical cores/chip:   12
  Virtualization features:
    Hypervisor vendor:     pHyp
    Virtualization type:   para
  Caches (sum of all):
    L1d:                   320 KiB (10 instances)
    L1i:                   480 KiB (10 instances)
    L2:                    10 MiB (10 instances)
    L3:                    40 MiB (10 instances)
  NUMA:
    NUMA node(s):          1
    NUMA node2 CPU(s):     0-39
  Vulnerabilities:
    Gather data sampling:  Not affected
    Itlb multihit:         Not affected
    L1tf:                  Not affected
    Mds:                   Not affected
    Meltdown:              Not affected
    Mmio stale data:       Not affected
    Retbleed:              Not affected
    Spec rstack overflow:  Not affected
    Spec store bypass:     Not affected
    Spectre v1:            Vulnerable, ori31 speculation barrier enabled
    Spectre v2:            Vulnerable
    Srbds:                 Not affected
    Tsx async abort:       Not affected

  
  ##

  KVM on PowerVM setup:

  KVM (Kernel-based Virtual Machine) is a virtualization module for Linux
  that allows the kernel to function as a hypervisor.

  We used a P10 2S4U system for this experiment.

  Workloads: DT7 and PGBENCH in detail:

  DT7 is an open source benchmark application emulating an online stock trading system.
  DT7 consists of 3 components:
  1) Jmeter
  2) WAS (WebSphere Application Server)
  3) DB2

  The DayTrader benchmark/application will be installed/deployed on WAS and
  this uses DB2 as a backbone database.  Jmeter generates the requests and
  interacts with the WAS. which would

[Kernel-packages] [Bug 2076147] Re: Add 'mm: hold PTL from the first PTE while reclaiming a large folio' to fix L2 Guest hang during LTP Test

2024-08-27 Thread Frank Heimes

** Also affects: linux (Ubuntu Noble)
   Importance: Undecided
   Status: New

** Also affects: linux (Ubuntu Oracular)
   Importance: High
 Assignee: Ubuntu on IBM Power Systems Bug Triage (ubuntu-power-triage)
   Status: Triaged

** Changed in: linux (Ubuntu Oracular)
   Status: Triaged => Fix Committed

** Changed in: linux (Ubuntu Noble)
   Status: New => Triaged

-- 
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/2076147

Title:
  Add 'mm: hold PTL from the first PTE while reclaiming a large folio'
  to fix L2 Guest hang during LTP Test

Status in The Ubuntu-power-systems project:
  Triaged
Status in linux package in Ubuntu:
  Fix Committed
Status in linux source package in Noble:
  Triaged
Status in linux source package in Oracular:
  Fix Committed

Bug description:
  SRU Justification:

  [ Impact ]

   * A KVM 2nd-level guest (meaning a KVM VM that runs nested on top of a
     Power 10 PowerVM hypervisor) hangs during the LTP (Linux Test Project)
     test suite.

   * It hangs with:
     "Back trace of paca->saved_r1 (0xc00c1bc8bb00) (possibly stale) @ new_slab"

   * Diagnosing the issue points to this fix/upstream commit:
     [commit message, by Barry Song]
     Within try_to_unmap_one(), page_vma_mapped_walk() races with other PTE
     modifications preceded by pte clear. While iterating over PTEs of a large
     folio, it only starts acquiring PTL from the first valid (present) PTE.
     PTE modifications can temporarily set PTEs to pte_none.
     Consequently, the initial PTEs of a large folio might be skipped
     in try_to_unmap_one().
     For example, for an anon folio, if we skip PTE0, we may have PTE0 which is
     still present, while PTE1 ~ PTE(nr_pages - 1) are swap entries after
     try_to_unmap_one().
     So folio will be still mapped, the folio fails to be reclaimed and is put
     back to LRU in this round.
     This also breaks up PTEs optimization such as CONT-PTE on this large folio
     and may lead to accident folio_split() afterwards.
     And since a part of PTEs are now swap entries, accessing those parts will
     introduce overhead - do_swap_page.
     Although the kernel can withstand all of the above issues, the situation
     still seems quite awkward and warrants making it more ideal.
     The same race also occurs with small folios, but they have only one PTE,
     thus, it won't be possible for them to be partially unmapped.
     This patch [see below] holds PTL from PTE0, allowing us to avoid reading
     PTE values that are in the process of being transformed. With stable PTE
     values, we can ensure that this large folio is either completely reclaimed
     or that all PTEs remain untouched in this round.
     A corner case is that if we hold PTL from PTE0 and most initial PTEs have
     been really unmapped before that, we may increase the duration of holding
     PTL. Thus we only apply this optimization to folios which are still
     entirely mapped (not in deferred_split list).
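
The race described in the commit message can be illustrated with a toy model. This is only a sketch under simplifying assumptions: PTEs are plain strings, `reclaim_pass` and its parameters are hypothetical names, and no real locking takes place. `transient_none` stands for the PTEs that a concurrent writer has temporarily cleared and will restore afterwards.

```python
# Toy model only: these strings stand in for page table entry states.
PRESENT = "present"
SWAP = "swap"


def reclaim_pass(nr_pages, transient_none, hold_ptl_from_pte0):
    """Return the PTE states of one large folio after a simulated unmap pass.

    transient_none: indices a concurrent writer has temporarily set to
        pte_none (assumed to be restored to present after the walk).
    hold_ptl_from_pte0: True models the patched behaviour (lock taken
        before PTE0 is read), False models the old behaviour (lock taken
        at the first PTE that is observed as present).
    """
    locked = hold_ptl_from_pte0
    after = []
    for index in range(nr_pages):
        if not locked and index in transient_none:
            # Observed as pte_none without the lock: the walker skips it,
            # so it is still present once the writer restores it.
            after.append(PRESENT)
            continue
        # From here on the walker holds the lock and sees stable values.
        locked = True
        after.append(SWAP)
    return after


print(reclaim_pass(4, {0}, hold_ptl_from_pte0=False))
print(reclaim_pass(4, {0}, hold_ptl_from_pte0=True))
```

In this model, a transiently cleared PTE0 leaves the folio partially mapped when the lock is taken late, while taking it from PTE0 unmaps the folio completely.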

  [ Fix ]

   * 73bc32875ee9 73bc32875ee9b1881dd780308c6793fe463fe803
     "mm: hold PTL from the first PTE while reclaiming a large folio"

  [ Test Plan ]

   * An IBM Power 10 system (where PowerVM is mandatory)
     running Ubuntu Server 24.04 (kernel 6.8) or later
     with (nested) KVM setup (so KVM on top of PowerVM).

   * Run LTP test suite
     Tests running: SLS(io,base)

   * Without the patch the above test will hang with
     Back trace of paca->saved_r1 (0xc00c1bc8bb00) (possibly stale) @ new_slab

  [ Where problems could occur ]

   * This is a common code change in the memory management sub-system,
     hence great care needs to be taken, even if it was discussed upfront
     at the https://lore.kernel.org/ mailing list and the upstream commit
     provenance shows that many eyes had a look at this.

   * The modification is relatively small with just one if statement
     (across two lines) in mm/vmscan.c.

   * This change is to assist 'try_to_unmap' to acquire page table locks (PTL)
     from the first page table entry (PTE) and to eliminate the influence of
     temporary and volatile PTE values.

   * If done wrong, it can have a negative impact, especially in the case of
     large folios: wrong hints might be given to try_to_unmap,
     which may lead to bad page swapping.

   * In case of an issue with this patch the result can also be decreased
     performance and efficiency in the page table handling - the opposite
     of what the patch is supposed to address.

   * Fortunately several developers had their eyes on this commit,
     as the provenance of the patch and the discussion at lkml shows.

  [ Other Info ]

   * The commit is upstream since v6.10(-rc1), hence it will be included
     in oracular with the planned target kernel.

  __

  == Comment: #0 - SEETEENA THOUFEEK - 2024-08-06 00:20:57 ==
  +++ This bug

[Kernel-packages] [Bug 2070329] Re: KOP L2 guest fails to boot with 1 core - SMT8 topology

2024-08-27 Thread Frank Heimes

** Also affects: linux (Ubuntu Noble)
   Importance: Undecided
   Status: New

** Also affects: linux (Ubuntu Oracular)
   Importance: High
   Status: Confirmed

** Changed in: linux (Ubuntu Noble)
   Status: New => Confirmed

** Changed in: linux (Ubuntu Oracular)
   Status: Confirmed => Fix Committed

-- 
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/2070329

Title:
  KOP L2 guest fails to boot with 1 core - SMT8 topology

Status in The Ubuntu-power-systems project:
  Confirmed
Status in linux package in Ubuntu:
  Fix Committed
Status in linux source package in Noble:
  Confirmed
Status in linux source package in Oracular:
  Fix Committed

Bug description:
  == Comment: #0 - SEETEENA THOUFEEK - 2024-06-25 01:24:11 ==
  +++ This bug was initially created as a clone of Bug #205277 +++

  ---Problem Description---
  KOP L2 guest fails to boot with 1 core - SMT8 topology
   
  ---Additional Hardware Info---
  na 

   
  ---Debugger Data---
  na 
   
  ---Steps to Reproduce---
   KOP L2 guest fails to boot when we set the CPU topology as 1 core - SMT 8

  command line used to verify the issue:
  #!/bin/sh

  QEMU="/home/mgautam/qemu"
  qemu-system-ppc64 -s \
  -drive file=/root/debian-12-nocloud-ppc64el.qcow2,format=qcow2 \
  -m 20G \
  -smp 8,cores=1,sockets=1,threads=8 \
  -cpu host \
  -nographic \
  -machine pseries,ic-mode=xics -accel kvm  \
  -net nic,model=virtio \
  -net user,host=10.0.2.10,hostfwd=tcp:127.0.0.1:10022-:22
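
As a small illustration of the topology used above: the first value of -smp is the total number of vCPUs, which for this command equals sockets x cores x threads. The helper below is a hypothetical name (not part of QEMU) that rebuilds the argument string from the three factors; it is a sketch, not a tool from the report.

```python
def smp_argument(sockets, cores, threads):
    """Build the value passed to QEMU's -smp option for a given topology."""
    # Total vCPU count implied by the topology.
    cpus = sockets * cores * threads
    return f"{cpus},cores={cores},sockets={sockets},threads={threads}"


# The failing configuration from the report: 1 socket, 1 core, SMT8.
print(smp_argument(sockets=1, cores=1, threads=8))
```

For the reported case this yields the same string as in the command line, 8 vCPUs on a single SMT8 core.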

  
  NOTE: L2 boots fine when doorbells are turned off in L1 kernel

  
  As per the investigation so far, the doorbell exception is not getting fired
  inside the L2 guest. At L1 level, if we set DPDES=1 in the GSB for L2, the
  guest never receives the doorbell and it is also never cleared from the GSB.
  We are discussing this behaviour with the phyp team.

  The root cause of this issue is lack of DPDES support at L1. I've
  posted the fix upstream:
  https://lore.kernel.org/linuxppc-dev/20240522084949.123148-1-gau...@linux.ibm.com/T/#u

  
  The fix has been accepted upstream and will be backported for kernels >= 6.7

  
https://lore.kernel.org/linuxppc-dev/20240605113913.83715-1-gau...@linux.ibm.com/
   
  ---Patches Installed---
  na
   
  ---System Hang---
   na
   
  ---uname output---
  na
   
  Contact Information = na 
   
  Machine Type = na 

  Userspace rpm: na 
   
  Userspace tool common name: na 
   
  The userspace tool has the following bit modes: na 

  Userspace tool obtained from project website:  na 
   
  *Additional Instructions for na: 
  -Post a private note with access information to the machine that is currently 
in the debugger.
  -Attach ltrace and strace of userspace application.

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu-power-systems/+bug/2070329/+subscriptions


-- 
Mailing list: https://launchpad.net/~kernel-packages
Post to : kernel-packages@lists.launchpad.net
Unsubscribe : https://launchpad.net/~kernel-packages
More help   : https://help.launchpad.net/ListHelp

[Kernel-packages] [Bug 2075575] Re: kexec fails in LPAR when some cpus are disabled

2024-08-27 Thread Frank Heimes

** Changed in: linux (Ubuntu Oracular)
   Status: In Progress => Fix Committed

-- 
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/2075575

Title:
  kexec fails in LPAR when some cpus are disabled

Status in The Ubuntu-power-systems project:
  Triaged
Status in linux package in Ubuntu:
  Fix Committed
Status in linux source package in Jammy:
  Triaged
Status in linux source package in Noble:
  Triaged
Status in linux source package in Oracular:
  Fix Committed

Bug description:
  == Comment: #0 - SEETEENA THOUFEEK - 2024-08-02 03:11:31 ==
  +++ This bug was initially created as a clone of Bug #206083 +++

  ---Problem Description---
  kexec fails in LPAR when some cpus are disabled
   
  Contact Information = sthou...@in.ibm.com 
   
  Machine Type = na 
   
  ---uname output---
  na
   
  ---Debugger---
  A debugger is not configured
   
  ---Steps to Reproduce---
   Summary:
  At L1 level, kexec fails if some of the cpus in the machine are disabled.

  
  Distros and kernel  versions used:
  1. Distro versions used

a. L1 LPAR :

b. L2 :

  
  Repro steps:
  1. Boot into an L1 lpar
  2. Disable some cpus (eg: ppc64_cpu --cores-on=3)
  3. Try to kexec. 

  
  This bug is reproducible only when we load the target kernel/initrd and use
  "kexec -e" as follows:

  kexec -l --initrd initramfs-$(uname -r).img vmlinuz-$(uname -r) \
    --append="$(cat /proc/cmdline)"

  kexec -e

  
  kexec works fine if we do a normal kexec without skipping the shutdown path:

  kexec --initrd initramfs-$(uname -r).img vmlinuz-$(uname -r) \
    --append="$(cat /proc/cmdline)"
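
The two paths can be put side by side in a short Python sketch. It only builds the argument lists and does not run kexec; the function names and the file names in the usage lines are placeholders, not taken from the report.

```python
def kexec_load_then_exec(kernel, initrd, cmdline):
    """Failing path from the report: load with -l, then jump with -e."""
    load = ["kexec", "-l", "--initrd", initrd, kernel, f"--append={cmdline}"]
    execute = ["kexec", "-e"]
    return [load, execute]


def kexec_normal(kernel, initrd, cmdline):
    """Working path from the report: a single kexec call without -l."""
    return [["kexec", "--initrd", initrd, kernel, f"--append={cmdline}"]]


# Placeholder file names and command line, for illustration only.
for argv in kexec_load_then_exec("vmlinuz-X", "initramfs-X.img", "root=/dev/sda1"):
    print(" ".join(argv))
for argv in kexec_normal("vmlinuz-X", "initramfs-X.img", "root=/dev/sda1"):
    print(" ".join(argv))
```

According to the report, only the first sequence (load, then "kexec -e") fails when some CPUs are disabled.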

  
  Fix is upstream now:
  
https://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux.git/commit/?id=21a741eb75f80397e5f7d3739e24d7d75e619011

  Thanks,
  Sourabh Jain

  please include in Ubuntu

   
  Oops output:
   no
   
  Stack trace output:
   no
   
  System Dump Info:
The system is not configured to capture a system dump.
   
  *Additional Instructions for sthou...@in.ibm.com: 
  -Attach sysctl -a output to the bug.

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu-power-systems/+bug/2075575/+subscriptions



[Kernel-packages] [Bug 2070358] Re: [Ubuntu 24.04] FW1060.00 (NH1060_026) sosreport is running to Kernel OOPS crash

2024-08-27 Thread Frank Heimes

Thank you Tasmiya and Jamie - I'm updating the tags accordingly ...

** Tags removed: verification-needed-noble-linux
** Tags added: verification-done-noble-linux

-- 
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/2070358

Title:
  [Ubuntu 24.04] FW1060.00 (NH1060_026) sosreport is running to Kernel
  OOPS crash

Status in The Ubuntu-power-systems project:
  Fix Committed
Status in linux package in Ubuntu:
  Invalid
Status in sosreport package in Ubuntu:
  Invalid
Status in linux source package in Noble:
  Fix Committed
Status in sosreport source package in Noble:
  Invalid

Bug description:
  SRU Justification:

  [Impact]
   * When the sosreport command is executed, a kernel oops occurs and the
     system crashes; depending on the configuration (and by default), the
     system/LPAR reboots.

  [Fix]
   * e0011bca603c101f2a3c007bdb77f7006fa78fb1 e0011bca603c
     "nfsd: initialise nfsd_info.mutex early"

  [Test Case]
   * Have a Ubuntu Server 24.04 LTS installation on ppc64el.
   * One option is to simply run sosreport on the system;
     the crash is seen when sosreport starts to capture the dump.
   * The second option (without sosreport) is:
   * CONFIG_NFSD=m (or y) must be set
   * mount nfsd if not already mounted, using the command
     "$ mount -t nfsd nfsd /proc/fs/nfsd"
   * The kernel oops will happen and the logs will show:
     ...
     BUG: Kernel NULL pointer dereference on read at 0x
     Faulting instruction address: 0xc16ff114
     Oops: Kernel access of bad area, sig: 11 [#1]
     ...
   * On a system with a kernel that includes the above patch,
     no oops will occur and the sosreport command will execute normally.
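
A minimal sketch of how the log check in this test case could be automated. The function name and marker tuple are illustrative assumptions; the two marker strings are copied from the log excerpt above, and where the log text comes from (for example the kernel ring buffer) is left to the caller.

```python
# Marker strings taken from the oops excerpt in the test case.
OOPS_MARKERS = (
    "BUG: Kernel NULL pointer dereference",
    "Oops: Kernel access of bad area",
)


def has_oops_signature(log_text):
    """Return True if every marker of the reported oops appears in the log."""
    return all(marker in log_text for marker in OOPS_MARKERS)


sample = "\n".join([
    "BUG: Kernel NULL pointer dereference on read at 0x",
    "Faulting instruction address: 0xc16ff114",
    "Oops: Kernel access of bad area, sig: 11 [#1]",
])
print(has_oops_signature(sample))
```

On a patched kernel the same check is expected to find no match after mounting nfsd or running sosreport.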

  [Regression Potential]
  * As with any code modification, there is a certain risk of regression,
    here in particular because the mutex handling in nfsd is modified.

  * But the changes are pretty traceable.

  * On top of that, the commit has already been reviewed and accepted upstream.

  * The modifications were done by the NFSD maintainer and also tested
  by IBM.

  [Other]
  * The fix/commit was accepted upstream in kernel v6.10-rc7,
    hence Oracular (with a planned kernel of >=6.10) is not affected.

  == Comment: #0 - Tasmiya Nalatwad - 2024-05-28 04:35:50 ==
  --- Description ---
  When the sosreport command is executed, a kernel oops occurs and the LPAR
  reboots. As kdump was enabled, the dump is captured.

  Note: The bug looks similar to Bug 206504, which is seen on z LPARs.

  --- Lpar Details ---
  1. PowerVM
  2. FW: FW1060.00 (NH1060_026)
  3. OS: Ubuntu 24.04
  4. Kernel: 6.8.0-31-generic
  5. Mem (free -mh): 47Gi
  6. cpus: 40

  --- Steps to reproduce ---
  1. Run the sosreport command on the LPAR; the crash is seen when sosreport
     starts to capture the dump.

  --- Traces ---
  root@ubuntulp2host:~# sosreport
  Please note the 'sosreport' command has been deprecated in favor of the new 
'sos' command, E.G. 'sos report'.
  Redirecting to 'sos report '

  sosreport (version 4.5.6)

  This command will collect system configuration and diagnostic
  information from this Ubuntu system.

  For more information on Canonical visit:

  Community Website  : https://www.ubuntu.com/
  Commercial Support : https://www.canonical.com

  The generated archive may contain data considered sensitive and its
  content should be reviewed by the originating organization before being
  passed to any third party.

  No changes will be made to system configuration.

  Press ENTER to continue, or CTRL-C to quit.

  Optionally, please enter the case id that you are generating this
  report for []:

   Setting up archive ...
   Setting up plugins ...
  [plugin:lxd] skipped command 'lxc image list': required kmods missing: 
ip6table_nat, ip6table_raw, bpfilter, iptable_mangle, iptable_filter, 
iptable_raw, ebtable_filter, ip6table_mangle, ebtables, iptable_nat, 
ip6_tables, ip6table_filter.
  [plugin:lxd] skipped command 'lxc list': required kmods missing: 
ip6table_nat, ip6table_raw, bpfilter, iptable_mangle, iptable_filter, 
iptable_raw, ebtable_filter, ip6table_mangle, ebtables, iptable_nat, 
ip6_tables, ip6table_filter.
  [plugin:lxd] skipped command 'lxc network list': required kmods missing: 
ip6table_nat, ip6table_raw, bpfilter, iptable_mangle, iptable_filter, 
iptable_raw, ebtable_filter, ip6table_mangle, ebtables, iptable_nat, 
ip6_tables, ip6table_filter.
  [plugin:lxd] skipped command 'lxc profile list': required kmods missing: 
ip6table_nat, ip6table_raw, bpfilter, iptable_mangle, iptable_filter, 
iptable_raw, ebtable_filter, ip6table_mangle, ebtables, iptable_nat, 
ip6_tables, ip6table_filter.
  [plugin:lxd] skipped command 'lxc storage list': required kmods missing: 
ip6table_nat, ip6table_raw, bpfilter, iptable_mangle, iptable_filter, 
iptable_raw, ebtable_filter, ip6table_mangle, ebtables, iptable_nat, 
ip6_tables, ip6table_filter.
  [plugin:networki

[Kernel-packages] [Bug 2064539] Re: Revert back frame pointers for ppc64el (remove -fno-omit-frame-pointer)

2024-08-26 Thread Frank Heimes

Did the verification for the kernel (linux-generic) like above and based
on representative kernel modules (arch-specific and common):

noble:
$ lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description:    Ubuntu 24.04 LTS
Release:        24.04
Codename:       noble
$ uname -a
Linux P10d-LPAR06 6.8.0-41-generic #41-Ubuntu SMP Fri Aug  2 21:00:36 UTC 2024 ppc64le ppc64le ppc64le GNU/Linux
$ sudo unzstd /usr/lib/debug/lib/modules/6.8.0-41-generic/kernel/arch/powerpc/crypto/aes-gcm-p10-crypto.ko.zst
/usr/lib/debug/lib/modules/6.8.0-41-generic/kernel/arch/powerpc/crypto/aes-gcm-p10-crypto.ko.zst: 338761 bytes
$ readelf -wi /usr/lib/debug/lib/modules/6.8.0-41-generic/kernel/arch/powerpc/crypto/aes-gcm-p10-crypto.ko | grep DW_AT_produce | grep -c mbackchain
0
$ readelf -wi /usr/lib/debug/lib/modules/6.8.0-41-generic/kernel/arch/powerpc/crypto/aes-gcm-p10-crypto.ko | grep DW_AT_produce | grep -c no-omit-frame-pointer
0
$ sudo unzstd /usr/lib/debug/lib/modules/6.8.0-41-generic/kernel/fs//nfs/nfsv4.ko.zst
/usr/lib/debug/lib/modules/6.8.0-41-generic/kernel/fs//nfs/nfsv4.ko.zst: 18009201 bytes
$ readelf -wi /usr/lib/debug/lib/modules/6.8.0-41-generic/kernel/fs//nfs/nfsv4.ko | grep DW_AT_produce | grep -c mbackchain
0
$ readelf -wi /usr/lib/debug/lib/modules/6.8.0-41-generic/kernel/fs//nfs/nfsv4.ko | grep DW_AT_produce | grep -c no-omit-frame-pointer
0

oracular:
$ lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description:    Ubuntu Oracular Oriole (development branch)
Release:        24.10
Codename:       oracular
$ uname -a
Linux P10d-LPAR06 6.11.0-4-generic #4-Ubuntu SMP Tue Aug 20 14:54:18 UTC 2024 ppc64le ppc64le ppc64le GNU/Linux
$ sudo unzstd /usr/lib/debug/lib/modules/6.11.0-4-generic/kernel/arch/powerpc/crypto/aes-gcm-p10-crypto.ko.zst
/usr/lib/debug/lib/modules/6.11.0-4-generic/kernel/arch/powerpc/crypto/aes-gcm-p10-crypto.ko.zst: 354129 bytes
$ readelf -wi /usr/lib/debug/lib/modules/6.11.0-4-generic/kernel/arch/powerpc/crypto/aes-gcm-p10-crypto.ko | grep DW_AT_produce | grep -c mbackchain
0
$ readelf -wi /usr/lib/debug/lib/modules/6.11.0-4-generic/kernel/arch/powerpc/crypto/aes-gcm-p10-crypto.ko | grep DW_AT_produce | grep -c no-omit-frame-pointer
0
$ sudo unzstd /usr/lib/debug/lib/modules/6.11.0-4-generic/kernel/fs/nfs/nfsv4.ko.zst
/usr/lib/debug/lib/modules/6.11.0-4-generic/kernel/fs/nfs/nfsv4.ko.zst: 19618441 bytes
$ readelf -wi /usr/lib/debug/lib/modules/6.11.0-4-generic/kernel/fs/nfs/nfsv4.ko | grep DW_AT_produce | grep -c mbackchain
0
$ readelf -wi /usr/lib/debug/lib/modules/6.11.0-4-generic/kernel/fs/nfs/nfsv4.ko | grep DW_AT_produce | grep -c no-omit-frame-pointer
0

(Btw. the negative test for mbackchain was done intentionally, since this
needs to be set for s390x, but for s390x only. The changes were done at
the same time, so this verifies that it was really not set for ppc64el.)

As one can see, "-fno-omit-frame-pointer" (and '-mbackchain') is not set,
hence successfully verified.
Closing the affected 'linux (Ubuntu)' as 'Fix Released' - and with that
the project entry as well.

** Changed in: linux (Ubuntu Noble)
   Status: New => Fix Released

** Changed in: linux (Ubuntu Oracular)
   Status: New => Fix Released

** Changed in: ubuntu-power-systems
   Status: New => Fix Released

-- 
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/2064539

Title:
  Revert back frame pointers for ppc64el (remove -fno-omit-frame-pointer)

Status in The Ubuntu-power-systems project:
  Fix Released
Status in dpkg package in Ubuntu:
  Fix Released
Status in glibc package in Ubuntu:
  Fix Released
Status in linux package in Ubuntu:
  Fix Released
Status in dpkg source package in Noble:
  Fix Released
Status in glibc source package in Noble:
  Fix Released
Status in linux source package in Noble:
  Fix Released
Status in dpkg source package in Oracular:
  Fix Released
Status in glibc source package in Oracular:
  Fix Released
Status in linux source package in Oracular:
  Fix Released

Bug description:
  SRU Justification:

  [ Impact ]

   * Power's Linux ABIs all require an explicit call chain be stored on
  the call stack frames which are all accessible via the stack pointer.

   * Therefore, having a (soft/simulated) frame pointer does not improve
  backtraces at all on Power.

   * However, forcing a frame pointer via the -fno-omit-frame-pointer
  option negatively affects performance for multiple reasons: extra
  prologue/epilogue overhead and fewer shrink-wrapping opportunities.

   * Given -fno-omit-frame-pointer does not provide any improvements
  (backtraces or otherwise) and only reduces performance,
  -fno-omit-frame-pointer should not be used on Power.

   * So we are facing here a performance penalty without any gain - on
  this particular platform.

   * And sometimes (in rare cases like LP#2060108) frame pointers

[Kernel-packages] [Bug 2064538] Re: Revert back frame pointers for s390x (remove -fno-omit-frame-pointer but use -mbackchain)

2024-08-26 Thread Frank Heimes

Did the verification for the kernel (linux-generic) like above and based
on representative kernel modules (arch-specific and common):

noble:
$ lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description:    Ubuntu 24.04 LTS
Release:        24.04
Codename:       noble
$ uname -a
Linux testlpar1 6.8.0-35-generic #35-Ubuntu SMP Mon May 20 15:36:54 UTC 2024 s390x s390x s390x GNU/Linux
$ sudo unzstd /usr/lib/debug/lib/modules/6.8.0-35-generic/kernel/arch/s390/crypto/aes_s390.ko.zst
/usr/lib/debug/lib/modules/6.8.0-35-generic/kernel/arch/s390/crypto/aes_s390.ko.zst: 507329 bytes
$ readelf -wi /usr/lib/debug/lib/modules/6.8.0-35-generic/kernel/arch/s390/crypto/aes_s390.ko | grep DW_AT_produce | grep -c mbackchain
2
$ readelf -wi /usr/lib/debug/lib/modules/6.8.0-35-generic/kernel/arch/s390/crypto/aes_s390.ko | grep DW_AT_produce | grep -c no-omit-frame-pointer
0
$ sudo unzstd /usr/lib/debug/lib/modules/6.8.0-35-generic/kernel/fs/nfs/nfsv4.ko.zst
/usr/lib/debug/lib/modules/6.8.0-35-generic/kernel/fs/nfs/nfsv4.ko.zst: 17383593 bytes
$ readelf -wi /usr/lib/debug/lib/modules/6.8.0-35-generic/kernel/fs/nfs/nfsv4.ko | grep DW_AT_produce | grep -c mbackchain
24
$ readelf -wi /usr/lib/debug/lib/modules/6.8.0-35-generic/kernel/fs/nfs/nfsv4.ko | grep DW_AT_produce | grep -c no-omit-frame-pointer
0

oracular:
$ lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description:    Ubuntu Oracular Oriole (development branch)
Release:        24.10
Codename:       oracular
$ uname -a
Linux s1lp14 6.11.0-4-generic #4-Ubuntu SMP Tue Aug 20 14:03:40 UTC 2024 s390x s390x s390x GNU/Linux
$ sudo unzstd /usr/lib/debug/lib/modules/6.11.0-4-generic/kernel/arch/s390/crypto/aes_s390.ko.zst
/usr/lib/debug/lib/modules/6.11.0-4-generic/kernel/arch/s390/crypto/aes_s390.ko.zst: 529441 bytes
$ readelf -wi /usr/lib/debug/lib/modules/6.11.0-4-generic/kernel/arch/s390/crypto/aes_s390.ko | grep DW_AT_produce | grep -c mbackchain
2
$ readelf -wi /usr/lib/debug/lib/modules/6.11.0-4-generic/kernel/arch/s390/crypto/aes_s390.ko | grep DW_AT_produce | grep -c no-omit-frame-pointer
0
$ sudo unzstd /usr/lib/debug/lib/modules/6.11.0-4-generic/kernel/fs/nfs/nfsv4.ko.zst
[sudo] password for ubuntu: 
/usr/lib/debug/lib/modules/6.11.0-4-generic/kernel/fs/nfs/nfsv4.ko.zst: 18990945 bytes
$ readelf -wi /usr/lib/debug/lib/modules/6.11.0-4-generic/kernel/fs/nfs/nfsv4.ko  | grep DW_AT_produce | grep -c mbackchain
24
$ readelf -wi /usr/lib/debug/lib/modules/6.11.0-4-generic/kernel/fs/nfs/nfsv4.ko  | grep DW_AT_produce | grep -c no-omit-frame-pointer
0

As one can see, '-mbackchain' is set, but "-fno-omit-frame-pointer" is
not set, hence successfully verified.
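
The pass criteria used in these verifications differ per architecture: on s390x '-mbackchain' must be present and the frame-pointer flag absent, while on ppc64el both must be absent. A minimal sketch of that decision, taking the grep counts as input; the table and function names are hypothetical:

```python
# Expected presence of each build flag per architecture, as described
# in the verification comments for the two bugs.
EXPECTED = {
    "ppc64el": {"mbackchain": False, "no-omit-frame-pointer": False},
    "s390x": {"mbackchain": True, "no-omit-frame-pointer": False},
}


def verify_counts(arch, counts):
    """Return True if the grep -c counts match the expectation for `arch`.

    counts maps a flag name to the number of DW_AT_producer lines
    that contain it.
    """
    expected = EXPECTED[arch]
    return all((counts[flag] > 0) == wanted for flag, wanted in expected.items())


# Counts reported for nfsv4.ko on s390x in the transcript.
print(verify_counts("s390x", {"mbackchain": 24, "no-omit-frame-pointer": 0}))
```

The same counts would fail the check for ppc64el, where '-mbackchain' is not supposed to appear.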

Closing the affected 'linux (Ubuntu)' as 'Fix Released' - and with that
the project entry as well.


** Changed in: linux (Ubuntu Noble)
   Status: New => Fix Released

** Changed in: linux (Ubuntu Oracular)
   Status: New => Fix Released

** Changed in: ubuntu-z-systems
   Status: New => Fix Released

-- 
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/2064538

Title:
  Revert back frame pointers for s390x (remove -fno-omit-frame-pointer
  but use -mbackchain)

Status in Ubuntu on IBM z Systems:
  Fix Released
Status in dpkg package in Ubuntu:
  Fix Released
Status in glibc package in Ubuntu:
  Fix Released
Status in linux package in Ubuntu:
  Fix Released
Status in dpkg source package in Noble:
  Fix Released
Status in glibc source package in Noble:
  Fix Released
Status in linux source package in Noble:
  Fix Released
Status in dpkg source package in Oracular:
  Fix Released
Status in glibc source package in Oracular:
  Fix Released
Status in linux source package in Oracular:
  Fix Released

Bug description:
  SRU Justification:

  [ Impact ]

   * The preferred way of doing stack unwinding on Linux on Z is via DWARF
  call frame information.
  In the absence of a DWARF unwinder (as in the Linux kernel), a stack chain
  can be maintained at runtime in addition to the DWARF unwinding information.

   * This allows for simple backtrace implementations, but imposes a
  small runtime overhead. For this to work, all code that might be part
  of a backtrace must be built with the -mbackchain GCC option.

   * The -fno-omit-frame-pointer switch is neither necessary nor helpful in
  this context.
    Having a (soft/simulated) frame pointer does not improve backtraces at all
  on IBM Z.

   * However, forcing a frame pointer via the -fno-omit-frame-pointer
  option negatively affects performance for multiple reasons: extra
  prologue/epilogue overhead and fewer shrink-wrapping opportunities.

   * Given -fno-omit-frame-pointer does not provide any improvements
  (backtraces or otherwise) and only reduces performance,
  -fno-omit-frame-pointer should not be used on IBM Z.

   * So we are facing here a performance penalty without any gain - on
  this part

[Kernel-packages] [Bug 2064538] Re: Revert back frame pointers for s390x (remove -fno-omit-frame-pointer but use -mbackchain)

2024-08-26 Thread Frank Heimes

** Description changed:

  SRU Justification:
  
  [ Impact ]
  
   * The preferred way of doing stack unwinding on Linux on Z is via dwarf call 
frame information.
  In absence of a dwarf unwinder (as in the Linux kernel) a stack chain can be 
maintained at runtime in addition to the dwarf unwinding information.
  
   * This allows for simple backtrace implementations, but imposes a small
  runtime overhead. For this to work, all code that might be part of
  backtrace must be built with the -mbackchain GCC option.
  
   * The -fno-omit-framepointer switch is neither necessary nor helpful in this 
context.
    Having a (soft/simulated) frame pointer does not improve backtraces at all 
on IBM Z.
  
   * However, forcing a frame pointer via the -fno-omit-frame-pointer
  option negatively affects performance for multiple reasons: extra
  prologue/epilogue overhead and fewer shrink-wrapping opportunities.
  
   * Given -fno-omit-frame-pointer does not provide any improvements
  (backtraces or otherwise) and only reduces performance, -fno-omit-frame-
  pointers should not be used on IBM Z.
  
   * So we are facing here a performance penalty without any gain - on
  this particular platform.
  
   * And sometimes (in rare cases like LP#2060108) frame pointers may even
  lead to failed builds.
  
  [ Test Plan ]
  
   * Due to the above description of the impact and rationale,
     this pragmatic approach for testing is given:
  
   * Build the affected packages where frame-pointers should be reverted
     using the updated dpkg package (that incl. the modified build defaults)
     on (or for) this particular platform.
  
   * Now frame-pointer usage be checked in the following different ways:
  
   * 1) For the ease of use (and thanks to Julian Klode), there is this python
    test script available that allows to verify a binary in regard to
    frame pointers:
    https://gist.github.com/julian-klode/85e3f85c410a1b856a93dce77208
-   
https://gist.githubusercontent.com/julian-klode/85e3f85c410a1b856a93dce77208/raw/488b8509e6f23fe48f917961fe711b285dcb2e28/dwprod.py
+   
https://gist.githubusercontent.com/julian-klode/85e3f85c410a1b856a93dce77208/raw/488b8509e6f23fe48f917961fe711b285dcb2e28/dwprod.py
+   requires python3-pyelftools
  
   * 2) Another more manual way is to verify based on debug symbols like this:
    - find and install the ddeb package
    - maybe extract the  file (e.g. unzstd)
    - use 'readelf -wi'
    - and grep for 'DW_AT_produce' (build options)
    - look for entries regarding frame-pointer
    The output may look similar to this:
    readelf -wi 
./usr/lib/debug/lib/modules/6.8.0-38-generic/kernel/arch/s390/crypto/aes_s390.ko
 | grep DW_AT_produce
    <23>   DW_AT_producer: (indirect string, offset: 0x7d): GNU AS 
2.42
    <129>   DW_AT_producer: (indirect string, offset: 0x3eef): GNU 
C11 13.2.0 -m64 -mpacked-stack -mbackchain -msoft-float -march=z13 -mtune=z16 
-mindirect-branch=thunk-extern -mfunction-return=thunk-extern 
-mindirect-branch-table -mrecord-mcount -mnop-mcount -mfentry -mzarch -g 
-gdwarf-5 -O2 -std=gnu11 -p -fshort-wchar -funsigned-char -fno-common 
-fno-strict-aliasing -fno-asynchronous-unwind-tables 
-fno-delete-null-pointer-checks -fno-allow-store-data-races 
-fno-stack-protector -ftrivial-auto-var-init=zero -fno-stack-clash-protection 
-fzero-call-used-regs=used-gpr -fno-inline-functions-called-once 
-falign-functions=8 -fstrict-flex-arrays=3 -fno-strict-overflow 
-fstack-check=no -fconserve-stack -fsanitize=bounds-strict -fsanitize=shift 
-fsanitize=bool -fsanitize=enum -fPIC
  
   * 3) And maybe watching the build messages / log for the build options that
    were used (but that is probably not sufficient - it's better to inspect
    the output.)
  
  [ Where problems could occur ]
  
   * The dpkg modifications could have been done erroneously.
     A dpkg test build and/or builds of other packages with the modified dpkg
     version in place would show this.
  
   * The settings in dpkg might be overwritten by other settings/packages.
     Tests like above, would show this.
  
   * One may think there could be issues in an environment where some packages
     have frame-pointer enabled and other don't.
     This is fine and was confirmed by IBM toolchain team and ours
     (as well as by a longer running  test system,
  with FP disabled in kernel, that showed no issues - like expected).
  
  [ Other Info ]
  
   * These changes were implemented during the opening of the oracular series.
     The very same changes are backported to 24.04 LTS.
  
   * These only affect the ppc64el and s390x architectures,
     for other architectures it's a no-change upload.
  
   * We didn't see any fallout for these changes during the development
     on the oracular series, and therefore don't expect any fallout or
     regressions in 24.04 LTS either.

-- 
You received this bug not

[Kernel-packages] [Bug 2064539] Re: Revert back frame pointers for ppc64el (remove -fno-omit-frame-pointer)

2024-08-26 Thread Frank Heimes

** Description changed:

  SRU Justification:
  
  [ Impact ]
  
-  * Power's Linux ABIs all require an explicit call chain be stored on
+  * Power's Linux ABIs all require an explicit call chain be stored on
  the call stack frames which are all accessible via the stack pointer.
  
-  * Therefore, having a (soft/simulated) frame pointer does not improve
+  * Therefore, having a (soft/simulated) frame pointer does not improve
  backtraces at all on Power.
  
-  * However, forcing a frame pointer via the -fno-omit-frame-pointer
+  * However, forcing a frame pointer via the -fno-omit-frame-pointer
  option negatively affects performance for multiple reasons: extra
  prologue/epilogue overhead and fewer shrink-wrapping opportunities.
  
-  * Given -fno-omit-frame-pointer does not provide any improvements
+  * Given -fno-omit-frame-pointer does not provide any improvements
  (backtraces or otherwise) and only reduces performance, -fno-omit-frame-
  pointers should not be used on Power.
  
-  * So we are facing here a performance penalty without any gain - on
+  * So we are facing here a performance penalty without any gain - on
  this particular platform.
  
-  * And sometimes (in rare cases like LP#2060108) frame pointers may even
+  * And sometimes (in rare cases like LP#2060108) frame pointers may even
  lead to failed builds.
  
  [ Test Plan ]
  
-  * Due to the above description of the impact and rationale,
-this pragmatic approach for testing is given:
+  * Due to the above description of the impact and rationale,
+    this pragmatic approach for testing is given:
  
-  * Build the affected packages where frame-pointers should be reverted
-using the updated dpkg package (that incl. the modified build defaults)
-on (or for) this particular platform.
+  * Build the affected packages where frame-pointers should be reverted
+    using the updated dpkg package (that incl. the modified build defaults)
+    on (or for) this particular platform.
  
-  * Now frame-pointer usage be checked in the following different ways:
+  * Now frame-pointer usage be checked in the following different ways:
  
-  * 1) For the ease of use (and thanks to Julian Klode), there is this python
-   test script available that allows to verify a binary in regard to
-   frame pointers:
-   https://gist.github.com/julian-klode/85e3f85c410a1b856a93dce77208
+  * 1) For the ease of use (and thanks to Julian Klode), there is this python
+   test script available that allows to verify a binary in regard to
+   frame pointers:
+   https://gist.github.com/julian-klode/85e3f85c410a1b856a93dce77208
+   
https://gist.githubusercontent.com/julian-klode/85e3f85c410a1b856a93dce77208/raw/488b8509e6f23fe48f917961fe711b285dcb2e28/dwprod.py
+   requires python3-pyelftools
  
-  * 2) Another more manual way is to verify based on debug symbols like this:
-   - find and install the ddeb package
-   - maybe extract the  file (e.g. unzstd)
-   - use 'readelf -wi'
-   - and grep for 'DW_AT_produce' (build options)
-   - look for entries regarding frame-pointer
-   The output may look similar to this:
-   readelf -wi 
./usr/lib/debug/lib/modules/6.8.0-38-generic/kernel/arch/s390/crypto/aes_s390.ko
 | grep DW_AT_produce
-   <23>   DW_AT_producer: (indirect string, offset: 0x7d): GNU AS 
2.42
-   <129>   DW_AT_producer: (indirect string, offset: 0x3eef): GNU 
C11 13.2.0 -m64 -mpacked-stack -mbackchain -msoft-float -march=z13 -mtune=z16 
-mindirect-branch=thunk-extern -mfunction-return=thunk-extern 
-mindirect-branch-table -mrecord-mcount -mnop-mcount -mfentry -mzarch -g 
-gdwarf-5 -O2 -std=gnu11 -p -fshort-wchar -funsigned-char -fno-common 
-fno-strict-aliasing -fno-asynchronous-unwind-tables 
-fno-delete-null-pointer-checks -fno-allow-store-data-races 
-fno-stack-protector -ftrivial-auto-var-init=zero -fno-stack-clash-protection 
-fzero-call-used-regs=used-gpr -fno-inline-functions-called-once 
-falign-functions=8 -fstrict-flex-arrays=3 -fno-strict-overflow 
-fstack-check=no -fconserve-stack -fsanitize=bounds-strict -fsanitize=shift 
-fsanitize=bool -fsanitize=enum -fPIC
+  * 2) Another more manual way is to verify based on debug symbols like this:
+   - find and install the ddeb package
+   - maybe extract the  file (e.g. unzstd)
+   - use 'readelf -wi'
+   - and grep for 'DW_AT_produce' (build options)
+   - look for entries regarding frame-pointer
+   The output may look similar to this:
+   readelf -wi 
./usr/lib/debug/lib/modules/6.8.0-38-generic/kernel/arch/s390/crypto/aes_s390.ko
 | grep DW_AT_produce
+   <23>   DW_AT_producer: (indirect string, offset: 0x7d): GNU AS 
2.42
+   <129>   DW_AT_producer: (indirect string, offset: 0x3eef): GNU 
C11 13.2.0 -m64 -mpacked-stack -mbackchain -msoft-float -march=z13 -mtune=z16 
-mindirect-branch=thunk-extern -mfunction-return=thunk-extern 
-mindirect-branch-t

[Kernel-packages] [Bug 2064538] Re: Revert back frame pointers for s390x (remove -fno-omit-frame-pointer but use -mbackchain)

2024-08-26 Thread Frank Heimes

** Description changed:

SRU Justification:

[ Impact ]

* The preferred way of doing stack unwinding on Linux on Z is via dwarf call
frame information.
In absence of a dwarf unwinder (as in the Linux kernel) a stack chain can be
maintained at runtime in addition to the dwarf unwinding information.

* This allows for simple backtrace implementations, but imposes a small
runtime overhead. For this to work, all code that might be part of
backtrace must be built with the -mbackchain GCC option.

* The -fno-omit-frame-pointer switch is neither necessary nor helpful in this
context.
Having a (soft/simulated) frame pointer does not improve backtraces at all
on IBM Z.

* However, forcing a frame pointer via the -fno-omit-frame-pointer
option negatively affects performance for multiple reasons: extra
prologue/epilogue overhead and fewer shrink-wrapping opportunities.

* Given -fno-omit-frame-pointer does not provide any improvements
(backtraces or otherwise) and only reduces performance,
-fno-omit-frame-pointer should not be used on IBM Z.

* So we are facing here a performance penalty without any gain - on
this particular platform.

* And sometimes (in rare cases like LP#2060108) frame pointers may even
lead to failed builds.

[ Test Plan ]

* Due to the above description of the impact and rationale,
this pragmatic approach for testing is given:

* Build the affected packages where frame-pointers should be reverted
using the updated dpkg package (that incl. the modified build defaults)
on (or for) this particular platform.

* Now frame-pointer usage can be checked in the following different ways:

* 1) For the ease of use (and thanks to Julian Klode), there is this python
test script available that allows to verify a binary in regard to
frame pointers:
https://gist.github.com/julian-klode/85e3f85c410a1b856a93dce77208
+
https://gist.githubusercontent.com/julian-klode/85e3f85c410a1b856a93dce77208/raw/488b8509e6f23fe48f917961fe711b285dcb2e28/dwprod.py

* 2) Another more manual way is to verify based on debug symbols like this:
- find and install the ddeb package
- maybe extract the file (e.g. unzstd)
- use 'readelf -wi'
- and grep for 'DW_AT_produce' (build options)
- look for entries regarding frame-pointer
The output may look similar to this:
readelf -wi ./usr/lib/debug/lib/modules/6.8.0-38-generic/kernel/arch/s390/crypto/aes_s390.ko | grep DW_AT_produce
<23> DW_AT_producer: (indirect string, offset: 0x7d): GNU AS
2.42
<129> DW_AT_producer: (indirect string, offset: 0x3eef): GNU
C11 13.2.0 -m64 -mpacked-stack -mbackchain -msoft-float -march=z13 -mtune=z16
-mindirect-branch=thunk-extern -mfunction-return=thunk-extern
-mindirect-branch-table -mrecord-mcount -mnop-mcount -mfentry -mzarch -g
-gdwarf-5 -O2 -std=gnu11 -p -fshort-wchar -funsigned-char -fno-common
-fno-strict-aliasing -fno-asynchronous-unwind-tables
-fno-delete-null-pointer-checks -fno-allow-store-data-races
-fno-stack-protector -ftrivial-auto-var-init=zero -fno-stack-clash-protection
-fzero-call-used-regs=used-gpr -fno-inline-functions-called-once
-falign-functions=8 -fstrict-flex-arrays=3 -fno-strict-overflow
-fstack-check=no -fconserve-stack -fsanitize=bounds-strict -fsanitize=shift
-fsanitize=bool -fsanitize=enum -fPIC

* 3) And maybe watching the build messages / log for the build options that
were used (but that is probably not sufficient - it's better to inspect
the output.)
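
The manual check in 2) can be sketched in Python. This is only an illustrative sketch and not the dwprod.py script linked in 1): the function name frame_pointer_flags and the selection of flags are assumptions made for this example, and the input is assumed to be the text printed by 'readelf -wi'.

```python
# Flags of interest for this check (an assumed, illustrative selection).
FLAGS_OF_INTEREST = (
    "-fno-omit-frame-pointer",
    "-fomit-frame-pointer",
    "-mbackchain",
)


def frame_pointer_flags(readelf_output: str) -> dict:
    """Report which flags of interest occur as whole tokens in the text.

    Splitting on whitespace also copes with producer strings that were
    wrapped across several lines, as in the mail above.
    """
    tokens = set(readelf_output.split())
    return {flag: flag in tokens for flag in FLAGS_OF_INTEREST}


# Shortened excerpt of the DW_AT_producer entry quoted above.
sample = (
    "<129> DW_AT_producer: (indirect string, offset: 0x3eef): GNU C11 13.2.0 "
    "-m64 -mpacked-stack -mbackchain -msoft-float -march=z13 -mtune=z16 -O2 -fPIC"
)

for flag, present in frame_pointer_flags(sample).items():
    print(f"{flag}: {'present' if present else 'absent'}")
```

For the excerpt above, only -mbackchain is reported as present, which is the desired state on s390x after the revert.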

[ Where problems could occur ]

* The dpkg modifications could have been done erroneously.
A dpkg test build and/or builds of other packages with the modified dpkg
version in place would show this.

* The settings in dpkg might be overwritten by other settings/packages.
Tests like above, would show this.

* One may think there could be issues in an environment where some packages
have frame-pointer enabled and others don't.
This is fine and was confirmed by the IBM toolchain team and ours
(as well as by a longer running test system,
with FP disabled in kernel, that showed no issues - as expected).

[ Other Info ]

- * These changes were implemented during the opening of the oracular series.
-The very same changes are backported to 24.04 LTS.
+ * These changes were implemented during the opening of the oracular series.
+ The very same changes are backported to 24.04 LTS.

- * These only affect the ppc64el and s390x architectures,
-for other architectures it's a no-change upload.
+ * These only affect the ppc64el and s390x architectures,
+ for other architectures it's a no-change upload.

- * We didn't see any fallout for these changes during the development
-on the oracular series, and therefore don't expect any fallout or

[Kernel-packages] [Bug 2077722] Re: [Ubuntu 24.04] MultiVM - L2 guest(s) running stress-ng getting stuck at booting after triggering crash

2024-08-26 Thread Frank Heimes

** Package changed: kernel-package (Ubuntu) => linux (Ubuntu)

** Also affects: ubuntu-power-systems
   Importance: Undecided
   Status: New

** Changed in: ubuntu-power-systems
 Assignee: (unassigned) => Ubuntu on IBM Power Systems Bug Triage 
(ubuntu-power-triage)

-- 
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/2077722

Title:
  [Ubuntu 24.04] MultiVM - L2 guest(s) running stress-ng getting stuck
  at booting after triggering crash

Status in The Ubuntu-power-systems project:
  New
Status in linux package in Ubuntu:
  New

Bug description:
  Problem:
  While bringing up 2 Ubuntu 24.04 guests, running stress-ng (90% load) on
  both, and triggering a crash on both simultaneously, the 1st guest gets
  stuck and does not boot up. In one of the attempts, both guests got stuck
  while booting, with a hung console.

  Attempts:
  Reproducible 3/3 consecutive times
  Run 1: L2-1 guest got stuck 
  Run 2: L2-1 guest got stuck
  Run 3: L2-1 and L2-2 guest got stuck

  
  =
  L1 Host:
  1. PowerVM
  2. OS: Ubuntu 24.04
  3. Kernel: 6.8.0-31-generic
  4. Mem (free -mh): 47Gi
  5. cpus: 40

  Guest L2-1:
  1. OS: Ubuntu 24.04
  2. Kernel: 6.8.0-31-generic
  3. Mem (free -mh): 9.5Gi
  4. cpus: 8
  5. Stress: stress-ng - 90% load
  6. XML configuration:
 16
 10971520
 

  Guest L2-2:
  1. OS: Ubuntu 24.04
  2. Kernel: 6.8.0-31-generic
  3. Mem (free -mh): 9.5Gi
  4. cpus: 8
  5. Stress: stress-ng - 90% load
  6. XML configuration:
 16
 10971520
 

  
  =
  Steps to reproduce:
  1. Bring up 2 Ubuntu 24.04 L2 guests with the configuration described above
  2. Run the attached stress-ng.sh script on both L2 guests
  3. Trigger crash: echo c >/proc/sysrq-trigger on both L2 guests at the same time

  After triggering the crash, one or both guest consoles will get stuck.
  It is then possible neither to enter the guest nor to shut it down.
  In order to boot into the guest again, a virsh destroy of the guest is
  required.

  
  =
  Run1: Console.log Error message of L2-1
Booting `Ubuntu'

  Loading Linux 6.8.0-31-generic ...
  Loading initial ramdisk ...
  OF stdout device is: /vdevice/vty@3000
  Preparing to boot Linux version 6.8.0-31-generic (buildd@bos02-ppc64el-018) 
(powerpc64le-linux-gnu-gcc-13 (Ubuntu 13.2.0-23ubuntu4) 13.2.0, GNU ld (GNU 
Binutils for Ubuntu) 2.42) #31-Ubuntu SMP Sat Apr 20 00:05:55 UTC 2024 (Ubuntu 
6.8.0-31.31-generic 6.8.1)
  Detected machine type: 0101
  command line: BOOT_IMAGE=/vmlinux-6.8.0-31-generic 
root=/dev/mapper/ubuntu--vg-ubuntu--lv ro 
crashkernel=2G-4G:320M,4G-32G:512M,32G-64G:1024M,64G-128G:2048M,128G-:4096M
  Max number of cores passed to firmware: 1024 (NR_CPUS = 2048)
  Calling ibm,client-architecture-support... done
  memory layout at init:
memory_limit :  (16 MB aligned)
alloc_bottom : 09d7
alloc_top: 3000
alloc_top_hi : 0002a000
rmo_top  : 3000
ram_top  : 0002a000
  instantiating rtas at 0x2fff... done
  prom_hold_cpus: skipped
  copying OF device tree...
  Building dt strings...
  Building dt structure...
  Device tree strings 0x09d8 -> 0x09d80bc6
  Device tree struct  0x09d9 -> 0x09da
  Quiescing Open Firmware ...
  Booting Linux via __start() @ 0x0023 ...
  [0.00] random: crng init done
  [0.00] Reserving 512MB of memory at 512MB for crashkernel (System 
RAM: 10752MB)
  [0.00] radix-mmu: Page sizes from device-tree:
  [0.00] radix-mmu: Page size shift = 12 AP=0x0
  [0.00] radix-mmu: Page size shift = 16 AP=0x5
  [0.00] radix-mmu: Page size shift = 21 AP=0x1
  [0.00] radix-mmu: Page size shift = 30 AP=0x2
  [0.00] Activating Kernel Userspace Access Prevention
  [0.00] Activating Kernel Userspace Execution Prevention
  [0.00] radix-mmu: Mapped 0x-0x038a with 
64.0 KiB pages (exec)
  [0.00] radix-mmu: Mapped 0x038a-0x0002a000 with 
64.0 KiB pages
  [0.00] lpar: Using radix MMU under hypervisor
  [0.00] Linux version 6.8.0-31-generic (buildd@bos02-ppc64el-018) 
(powerpc64le-linux-gnu-gcc-13 (Ubuntu 13.2.0-23ubuntu4) 13.2.0, GNU ld (GNU 
Binutils for Ubuntu) 2.42) #31-Ubuntu SMP Sat Apr 20 00:05:55 UTC 2024 (Ubuntu 
6.8.0-31.31-generic 6.8.1)
  [0.00] Secure boot mode disabled
  [0.00] Found initrd at 0xc620:0xc9d6da29
  [0.00] Hardware name: IBM pSeries (emulated by qemu) POWER10 (raw) 
0x800200 0xf06 of:SLOF,HEAD hv:linux,kvm pSeries
  [0.00] printk: legacy bootconsole [udbg
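
The crashkernel= parameter on the command line in this log uses the range syntax (RAM range, then the size to reserve). As a rough sketch of how such a specification selects a reservation, assuming that each range includes its lower bound and excludes its upper bound, the following Python reproduces the "Reserving 512MB" line for the reported 10752MB of system RAM. The function names are hypothetical and only serve this example.

```python
UNITS = {"K": 1 << 10, "M": 1 << 20, "G": 1 << 30}


def parse_size(text: str) -> int:
    """Convert a size such as '512M' or '2G' to bytes."""
    text = text.strip()
    if text and text[-1].upper() in UNITS:
        return int(text[:-1]) * UNITS[text[-1].upper()]
    return int(text)


def crashkernel_reservation(spec: str, system_ram: int) -> int:
    """Return the reservation in bytes for system_ram, or 0 if no range matches."""
    for entry in spec.split(","):
        ram_range, size = entry.split(":")
        size = size.split("@")[0]          # ignore an optional @offset
        start, end = ram_range.split("-")
        low = parse_size(start)
        high = parse_size(end) if end else None  # "128G-" is open-ended
        if system_ram >= low and (high is None or system_ram < high):
            return parse_size(size)
    return 0


# Values taken from the console log above.
spec = "2G-4G:320M,4G-32G:512M,32G-64G:1024M,64G-128G:2048M,128G-:4096M"
system_ram = 10752 * (1 << 20)

print(crashkernel_reservation(spec, system_ram) // (1 << 20), "MB")
```

With 10752MB of RAM the 4G-32G range applies, so the sketch prints 512 MB, in line with the boot message.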

[Kernel-packages] [Bug 1959940] Re: [22.10 FEAT] KVM: Secure Execution guest dump encryption with customer keys - kernel part

2024-08-26 Thread Frank Heimes

Hello Janosch,
many thanks for the patch set!
I'm just back and will work on this soon-ish.
But just to clarify, are these all plain cherry-picks from upstream - or did
you have to do any real backport work to get some of the commits applied to the
jammy kernel (I mean if any real modifications of code or context were needed)?
I just need to add this to the PR/provenance - for our kernel team.
I could of course also compare the patches you've sent with what exists 
upstream, but I think you know it right away ...
(Of course talking about the kernel patches/commits only.)

** Changed in: linux (Ubuntu Jammy)
 Assignee: (unassigned) => Frank Heimes (fheimes)

** Changed in: linux (Ubuntu Jammy)
   Status: New => Triaged

-- 
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/1959940

Title:
  [22.10 FEAT] KVM: Secure Execution guest dump encryption with customer
  keys - kernel part

Status in Ubuntu on IBM z Systems:
  Fix Released
Status in linux package in Ubuntu:
  Fix Released
Status in linux source package in Jammy:
  Triaged

Bug description:
  KVM: Secure Execution guest dump encryption with customer keys -
  kernel part

  Description:
  Hypervisor-initiated dumps for Secure Execution guests are not helpful 
because memory and CPU state is encrypted by a transient key only available to 
the Ultravisor.  Workload owners can still configure kdump in order to obtain 
kernel crash information, but there are situations where kdump doesn't work. In 
such situations problem determination is severely impeded. This feature will 
implement dumps created in a way that can only be decrypted by the owner of the 
guest image and be used for problem determination.

  Request Type: Kernel - Enhancement from IBM
  Upstream Acceptance: In Progress
  Code Contribution: IBM code

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu-z-systems/+bug/1959940/+subscriptions


-- 
Mailing list: https://launchpad.net/~kernel-packages
Post to : kernel-packages@lists.launchpad.net
Unsubscribe : https://launchpad.net/~kernel-packages
More help   : https://help.launchpad.net/ListHelp

[Kernel-packages] [Bug 2074376] Re: Disable PCI_DYNAMIC_OF_NODES in Ubuntu

2024-08-26 Thread Frank Heimes

** Changed in: ubuntu-power-systems
 Assignee: (unassigned) => Ubuntu on IBM Power Systems Bug Triage 
(ubuntu-power-triage)

-- 
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/2074376

Title:
  Disable PCI_DYNAMIC_OF_NODES in Ubuntu

Status in The Ubuntu-power-systems project:
  Fix Committed
Status in linux package in Ubuntu:
  Fix Committed
Status in linux source package in Noble:
  Fix Committed
Status in linux source package in Oracular:
  Fix Committed

Bug description:
  This came in via KTML from upstream. It is part of a discussion
  between upstream and IBM reporting a bug which occurs in KVM:

  Rob Herring  writes:

  >> On 2024/07/11 06:20 AM, Rob Herring wrote:
  >>> On Wed, Jul 3, 2024 at 8:17 AM Amit Machhiwal  
wrote:
  
   With CONFIG_PCI_DYNAMIC_OF_NODES [1], a hot-plug and hot-unplug sequence
   of a PCI device attached to a PCI-bridge causes following kernel Oops on
   a pseries KVM guest:
  >>>
  >>> Can I ask why you have this option on in the first place? Do you have
  >>> a use for it or it's just a case of distros turn on every kconfig
  >>> option.
  >>
  >> Yes, this option is turned on in Ubuntu's distro kernel config where the 
issue
  >> was originally reported, while Fedora is keeping this turned off.
  >>
  >> root@ubuntu:~# cat /boot/config-6.8.0-38-generic | grep PCI_DYN
  >> CONFIG_PCI_DYNAMIC_OF_NODES=y
  > 
  > Ubuntu should turn off this option. For starters, it is not complete
  > to be usable. Eventually, it should get removed in favor of some TBD
  > runtime option.
  > 
  > (And we should fix the crash too)

  This option is described in the config system as:

This option enables support for generating device tree nodes for some
PCI devices. Thus, the driver of this kind can load and overlay
flattened device tree for its downstream devices.
.
Once this option is selected, the device tree nodes will be generated
for all PCI bridges.

  Open Firmware (OF) would be used for KVM for UEFI mode. The reported
  bug was related to hot-unplugging PCI devices. My guess would be that
  this probably is not of much use to the majority of users and might
  even go away. So it should really be disabled in Ubuntu, too.
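
The 'cat /boot/config-... | grep PCI_DYN' check quoted above can also be sketched in Python, for example to verify that a new kernel build no longer enables the option. This is a minimal sketch: the helper name config_value is hypothetical, and the sample text is a made-up two-line excerpt, not a real Ubuntu config.

```python
def config_value(config_text: str, option: str):
    """Return the value of a kernel config option, or None if it is not set."""
    for line in config_text.splitlines():
        line = line.strip()
        if line.startswith(option + "="):
            return line.split("=", 1)[1]
        if line == f"# {option} is not set":
            return None
    return None


# Made-up excerpts in the usual kernel config format.
enabled = "CONFIG_PCI=y\nCONFIG_PCI_DYNAMIC_OF_NODES=y\n"
disabled = "CONFIG_PCI=y\n# CONFIG_PCI_DYNAMIC_OF_NODES is not set\n"

print(config_value(enabled, "CONFIG_PCI_DYNAMIC_OF_NODES"))
print(config_value(disabled, "CONFIG_PCI_DYNAMIC_OF_NODES"))
```

On a real system the text would come from the kernel's config file under /boot, such as the /boot/config-6.8.0-38-generic file shown in the quoted mail.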

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu-power-systems/+bug/2074376/+subscriptions



[Kernel-packages] [Bug 1959940] Re: [22.10 FEAT] KVM: Secure Execution guest dump encryption with customer keys - kernel part

2024-08-26 Thread Frank Heimes

** Also affects: linux (Ubuntu Jammy)
   Importance: Undecided
   Status: New

-- 
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/1959940

Title:
  [22.10 FEAT] KVM: Secure Execution guest dump encryption with customer
  keys - kernel part

Status in Ubuntu on IBM z Systems:
  Fix Released
Status in linux package in Ubuntu:
  Fix Released
Status in linux source package in Jammy:
  New

Bug description:
  KVM: Secure Execution guest dump encryption with customer keys -
  kernel part

  Description:
  Hypervisor-initiated dumps for Secure Execution guests are not helpful 
because memory and CPU state is encrypted by a transient key only available to 
the Ultravisor.  Workload owners can still configure kdump in order to obtain 
kernel crash information, but there are situations where kdump doesn't work. In 
such situations problem determination is severely impeded. This feature will 
implement dumps created in a way that can only be decrypted by the owner of the 
guest image and be used for problem determination.

  Request Type: Kernel - Enhancement from IBM
  Upstream Acceptance: In Progress
  Code Contribution: IBM code

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu-z-systems/+bug/1959940/+subscriptions



[Kernel-packages] [Bug 2077540] Re: [24.10] Please test secure-boot and lockdown on the 6.11 kernel (s390x) for Oracular

2024-08-23 Thread Frank Heimes

Updating this ticket with a message that I received via mail:

Secureboot lockdown was successfully tested by Grgo/IBM.
Test completed!

With that I'm closing this ticket as Fix Released.

** Changed in: ubuntu-z-systems
 Assignee: (unassigned) => bugproxy (bugproxy)

** Changed in: ubuntu-z-systems
   Importance: Undecided => Critical

** Changed in: linux (Ubuntu)
   Importance: Undecided => High

** Changed in: ubuntu-z-systems
   Importance: Critical => High

** Changed in: ubuntu-z-systems
   Status: New => Fix Released

** Changed in: linux (Ubuntu)
   Status: New => Fix Released

-- 
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/2077540

Title:
   [24.10] Please test secure-boot and lockdown on the 6.11 kernel
  (s390x) for Oracular

Status in Ubuntu on IBM z Systems:
  Fix Released
Status in linux package in Ubuntu:
  Fix Released

Bug description:
  The Canonical kernel team is working on a new 6.11 kernel for
  'oracular' (24.10) and has an early build ready for secure-boot and
  lockdown testing (version 6.11.0-4.4).

  To avoid potentially negative implications that a broken secure-boot
  lockdown functionality would cause (esp. using the production key), we
  ask to get secure-boot tested early in the cycle using Canonical
  kernel team's PPA key for signature.

  The early test build is available at: ppa:canonical-kernel-team/unstable
  (https://launchpad.net/~canonical-kernel-team/+archive/ubuntu/unstable/)

  The PPA key used for signing can be found in the tarball available here:
  
https://ppa.launchpad.net/canonical-kernel-team/unstable/ubuntu/dists/devel/main/signed/linux-generate-unstable-s390x/current/

  (Please note that this kernel is coming from the 'canonical-kernel-
  team' PPA, hence it is NOT signed with the regular
  archive/release/production key, instead with the above PPA's key!)

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu-z-systems/+bug/2077540/+subscriptions



[Kernel-packages] [Bug 2076569] Re: ISST-LTE:KOP:doodlp1g3:L2 guest hung and Call traces seen with Snapshot tests

2024-08-16 Thread Frank Heimes

*** This bug is a duplicate of bug 2076147 ***
https://bugs.launchpad.net/bugs/2076147

This is a duplicate of:
LP#2076147 - "Add 'mm: hold PTL from the first PTE while reclaiming a large 
folio' to fix L2 Guest hang during LTP Test"
https://bugs.launchpad.net/bugs/2076147
Marking this LP bug as such ...

** Package changed: kernel-package (Ubuntu) => linux (Ubuntu)

** Changed in: ubuntu-power-systems
 Assignee: (unassigned) => Ubuntu on IBM Power Systems Bug Triage 
(ubuntu-power-triage)

** Changed in: linux (Ubuntu)
 Assignee: Ubuntu on IBM Power Systems Bug Triage (ubuntu-power-triage) => 
(unassigned)

** Changed in: ubuntu-power-systems
   Importance: Undecided => High

** Changed in: linux (Ubuntu)
   Importance: Undecided => High

** This bug has been marked a duplicate of bug 2076147
   Add 'mm: hold PTL from the first PTE while reclaiming a large folio' to fix 
L2 Guest hang during LTP Test

-- 
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/2076569

Title:
  ISST-LTE:KOP:doodlp1g3:L2 guest hung and Call traces seen with
  Snapshot tests

Status in The Ubuntu-power-systems project:
  New
Status in linux package in Ubuntu:
  New

Bug description:
  == Comment: #0 - SEETEENA THOUFEEK - 2024-08-12 00:15:56 ==
  +++ This bug was initially created as a clone of Bug #206986 +++

  ---Problem Description---
  :doodlp1g3:L2 guest hung and Call traces seen with Snapshot test
   
  ---Steps to Reproduce---
   Problem description  : Problem on L2 Guest

  doodlp1g3 is hung and call traces are seen

  [254596.011652] watchdog: BUG: soft lockup - CPU#0 stuck for 805533275s! 
[systemd-userwor:2849437]
  [254596.011796] Modules linked in: chacha_generic wp512 streebog_generic 
rmd160 poly1305_generic nhpoly1305 michael_mic md4 crc32_generic 
twofish_generic twofish_common serpent_generic fcrypt des_generic libdes 
cast6_generic cast5_generic cast_common camellia_generic blowfish_generic 
blowfish_common aegis128 tun rpcsec_gss_krb5 rpcrdma rdma_cm iw_cm nfsv4 ib_cm 
dns_resolver ib_core nfs netfs nft_fib_inet nft_fib_ipv4 nft_fib_ipv6 nft_fib 
nft_reject_inet nf_reject_ipv4 nf_reject_ipv6 nft_reject nft_ct nft_chain_nat 
nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 bonding tls rfkill nf_tables 
binfmt_misc virtio_net net_failover aes_gcm_p10_crypto failover virtio_balloon 
crct10dif_vpmsum nfsd auth_rpcgss nfs_acl lockd grace sunrpc fuse loop 
dm_multipath nfnetlink zram xfs vmx_crypto crc32c_vpmsum virtio_scsi 
pseries_wdt scsi_dh_rdac scsi_dh_emc scsi_dh_alua
  [254596.012693] CPU: 0 PID: 2849437 Comm: systemd-userwor Tainted: G  
   L 
  [254596.012817] Hardware name: IBM pSeries (emulated by qemu) POWER10 (raw) 
0x800200 0xf06 of:SLOF,HEAD hv:linux,kvm pSeries
  [254596.012947] NIP:  c00b9054 LR: c00b9104 CTR: 
c053c400
  [254596.013040] REGS: c001626d6b40 TRAP: 0900   Tainted: G L  

  [254596.013151] MSR:  80009033   CR: 28822824  
XER: 
  [254596.013266] CFAR:  IRQMASK: 0
  [254596.013266] GPR00: 28822824 c001626d6de0 c20ced00 

  [254596.013266] GPR04:  0a858fa1 00018017 
c0077fff
  [254596.013266] GPR08: c0077fffc400   
28822824
  [254596.013266] GPR12: c053c400 c3f0 c001626d7238 
c3cf5fe8
  [254596.013266] GPR16: 00073d64 c3251500 0001 
1c8c6203
  [254596.013266] GPR20: c3cf5fe8 a28f850a  
c3251500
  [254596.013266] GPR24: 0002 0018 c00740891500 
c00740891508
  [254596.013266] GPR28: 0003 0001 c3cf6120 
c00c00fe4ca8
  [254596.014319] NIP [c00b9054] queued_spin_lock_slowpath+0xf20/0x163c
  [254596.014397] LR [c00b9104] queued_spin_lock_slowpath+0xfd0/0x163c
  [254596.014474] Call Trace:
  [254596.014507] [c001626d6de0] [c001626d6e34] 0xc001626d6e34 
(unreliable)
  [254596.014604] [c001626d6f00] [c14f9080] _raw_spin_lock+0x68/0x88
  [254596.014681] [c001626d6f20] [c053548c] 
page_vma_mapped_walk+0x738/0x1220
  [254596.014781] [c001626d6fd0] [c053c5b8] 
try_to_unmap_one+0x1b8/0xe84
  [254596.014876] [c001626d7110] [c0539a98] 
rmap_walk_anon+0x15c/0x324
  [254596.014970] [c001626d7170] [c053e950] try_to_unmap+0xc8/0xf0
  [254596.015048] [c001626d71d0] [c04c6118] 
shrink_folio_list+0xa84/0xf8c
  [254596.015141] [c001626d72f0] [c04c69b8] evict_folios+0x398/0xdd0
  [254596.015218] [c001626d74a0] [c04c7640] 
try_to_shrink_lruvec+0x250/0x5ac
  [254596.015309] [c001626d7580] [c04c7ae4] shrink_one+0x148/0x2d8
  [254596.015388] [c001626d75e0] [c

[Kernel-packages] [Bug 2074376] Re: Disable PCI_DYNAMIC_OF_NODES in Ubuntu

2024-08-16 Thread Frank Heimes

** Changed in: ubuntu-power-systems
   Importance: Undecided => Medium

** Changed in: ubuntu-power-systems
   Status: New => Fix Committed

-- 
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/2074376

Title:
  Disable PCI_DYNAMIC_OF_NODES in Ubuntu

Status in The Ubuntu-power-systems project:
  Fix Committed
Status in linux package in Ubuntu:
  Fix Committed
Status in linux source package in Noble:
  Fix Committed
Status in linux source package in Oracular:
  Fix Committed

Bug description:
  This came in via KTML from upstream. It is part of a discussion
  between upstream and IBM reporting a bug which occurs in KVM:

  Rob Herring  writes:

  >> On 2024/07/11 06:20 AM, Rob Herring wrote:
  >>> On Wed, Jul 3, 2024 at 8:17 AM Amit Machhiwal  
wrote:
  
   With CONFIG_PCI_DYNAMIC_OF_NODES [1], a hot-plug and hot-unplug sequence
   of a PCI device attached to a PCI-bridge causes following kernel Oops on
   a pseries KVM guest:
  >>>
  >>> Can I ask why you have this option on in the first place? Do you have
  >>> a use for it or it's just a case of distros turn on every kconfig
  >>> option.
  >>
  >> Yes, this option is turned on in Ubuntu's distro kernel config where the 
issue
  >> was originally reported, while Fedora is keeping this turned off.
  >>
  >> root@ubuntu:~# cat /boot/config-6.8.0-38-generic | grep PCI_DYN
  >> CONFIG_PCI_DYNAMIC_OF_NODES=y
  > 
  > Ubuntu should turn off this option. For starters, it is not complete
  > to be usable. Eventually, it should get removed in favor of some TBD
  > runtime option.
  > 
  > (And we should fix the crash too)

  This option is described in the config system as:

This option enables support for generating device tree nodes for some
PCI devices. Thus, the driver of this kind can load and overlay
flattened device tree for its downstream devices.
.
Once this option is selected, the device tree nodes will be generated
for all PCI bridges.

  Open Firmware (OF) would be used for KVM for UEFI mode. The reported
  bug was related to hot-unplugging PCI devices. My guess would be that
  this probably is not of much use to the majority of users and might
  even go away. So it should really be disabled in Ubuntu, too.

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu-power-systems/+bug/2074376/+subscriptions



[Kernel-packages] [Bug 2076866] Re: ISST-LTE:KOP:1060.1:doodlp1g8:Post Migration Non-MDC L1 eralp1 crashed with migrate_misplaced_folio+0x4cc/0x5d0

2024-08-13 Thread Frank Heimes

** Also affects: ubuntu-power-systems
   Importance: Undecided
   Status: New

** Changed in: ubuntu-power-systems
 Assignee: (unassigned) => Ubuntu on IBM Power Systems Bug Triage 
(ubuntu-power-triage)

** Changed in: linux (Ubuntu)
 Assignee: Ubuntu on IBM Power Systems Bug Triage (ubuntu-power-triage) => 
(unassigned)

** Changed in: linux (Ubuntu)
   Importance: Undecided => High

** Changed in: ubuntu-power-systems
   Importance: Undecided => High

** Changed in: linux (Ubuntu)
   Status: New => Triaged

** Changed in: ubuntu-power-systems
   Status: New => Triaged

** Also affects: linux (Ubuntu Noble)
   Importance: Undecided
   Status: New

** Also affects: linux (Ubuntu Oracular)
   Importance: High
   Status: Triaged

** Changed in: linux (Ubuntu Noble)
   Status: New => Triaged

-- 
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/2076866

Title:
  ISST-LTE:KOP:1060.1:doodlp1g8:Post Migration Non-MDC L1 eralp1 crashed
  with  migrate_misplaced_folio+0x4cc/0x5d0

Status in The Ubuntu-power-systems project:
  Triaged
Status in linux package in Ubuntu:
  Triaged
Status in linux source package in Noble:
  Triaged
Status in linux source package in Oracular:
  Triaged

Bug description:
  == Comment: #0 - SEETEENA THOUFEEK - 2024-08-12 23:50:17 ==
  +++ This bug was initially created as a clone of Bug #207985 +++

  ---Problem Description---
  Post Migration Non-MDC L1 eralp1 crashed with 
migrate_misplaced_folio+0x4cc/0x5d0 (
   
  Machine Type = na 
   
  Contact Information = sthou...@in.ibm.com 
   
  ---Steps to Reproduce---
   Problem description  : 
  After 1 hour of successful migration from doodlp1 [MDC mode] to eralp1
  [non-MDC mode], the eralp1 guest crashed and a dump was collected
   
  ---uname output---
  na
   
  ---Debugger---
  A debugger is not configured

  
  [281827.975244] NIP [c05f0620] migrate_misplaced_folio+0x4f0/0x5d0
  [281827.975251] LR [c05f067c] migrate_misplaced_folio+0x54c/0x5d0
  [281827.975258] Call Trace:
  [281827.975260] [c01e19ff7140] [c05f0670] 
migrate_misplaced_folio+0x540/0x5d0 (unreliable)
  [281827.975268] [c01e19ff71d0] [c054c9f0] 
__handle_mm_fault+0xf70/0x28e0
  [281827.975276] [c01e19ff7310] [c054e478] 
handle_mm_fault+0x118/0x400
  [281827.975284] [c01e19ff7360] [c053598c] 
__get_user_pages+0x1ec/0x5b0
  [281827.975291] [c01e19ff7420] [c0536920] 
get_user_pages_unlocked+0x120/0x4f0
  [281827.975298] [c01e19ff74c0] [c0081894ea9c] hva_to_pfn+0xf4/0x630 
[kvm]
  [281827.975316] [c01e19ff7550] [c00818b4efc4] 
kvmppc_book3s_instantiate_page+0xec/0x790 [kvm_hv]
  [281827.975326] [c01e19ff7660] [c00818b4f750] 
kvmppc_book3s_radix_page_fault+0xe8/0x380 [kvm_hv]
  [281827.975335] [c01e19ff7700] [c00818b488fc] 
kvmppc_book3s_hv_page_fault+0x294/0xd60 [kvm_hv]
  [281827.975344] [c01e19ff77e0] [c00818b43f5c] 
kvmppc_vcpu_run_hv+0xf94/0x11d0 [kvm_hv]
  [281827.975352] [c01e19ff78a0] [c0081896131c] 
kvmppc_vcpu_run+0x34/0x48 [kvm]
  [281827.975365] [c01e19ff78c0] [c0081895c164] 
kvm_arch_vcpu_ioctl_run+0x39c/0x570 [kvm]
  [281827.975379] [c01e19ff7950] [c0081894a104] 
kvm_vcpu_ioctl+0x20c/0x9a8 [kvm]
  [281827.975391] [c01e19ff7b30] [c0683974] sys_ioctl+0x574/0x16a0
  [281827.975395] [c01e19ff7c30] [c0030838] 
system_call_exception+0x168/0x310
  [281827.975400] [c01e19ff7e50] [c000d05c] 
system_call_vectored_common+0x15c/0x2ec
  [281827.975406] --- interrupt: 3000 at 0x7fffb7d4d2bc

  Mirroring to distro as per message in group channel

  
  Please pick these patches for this bug:

  ee86814b0562 ("mm/migrate: move NUMA hinting fault folio isolation + checks 
under PTL")
  4b88c23ab8c9 ("mm/migrate: make migrate_misplaced_folio() return 0 on 
success")
  d2136d749d76 ("mm: support multi-size THP numa balancing")
  6b0ed7b3c775 ("mm: factor out the numa mapping rebuilding into a new helper")
  ebb34f78d72c ("mm: convert folio_estimated_sharers() to 
folio_likely_mapped_shared()")
  133d04b1eee9 ("mm/numa_balancing: allow migrate on protnone reference with 
MPOL_PREFERRED_MANY policy")
  f8fd525ba3a2 ("mm/mempolicy: use numa_node_id() instead of cpu_to_node()")

  Thanks,
  Amit

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu-power-systems/+bug/2076866/+subscriptions


-- 
Mailing list: https://launchpad.net/~kernel-packages
Post to : kernel-packages@lists.launchpad.net
Unsubscribe : https://launchpad.net/~kernel-packages
More help   : https://help.launchpad.net/ListHelp

[Kernel-packages] [Bug 2076866] Re: ISST-LTE:KOP:1060.1:doodlp1g8:Post Migration Non-MDC L1 eralp1 crashed with migrate_misplaced_folio+0x4cc/0x5d0

2024-08-13 Thread Frank Heimes

** Package changed: kernel-package (Ubuntu) => linux (Ubuntu)

-- 
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/2076866

Title:
  ISST-LTE:KOP:1060.1:doodlp1g8:Post Migration Non-MDC L1 eralp1 crashed
  with  migrate_misplaced_folio+0x4cc/0x5d0

Status in linux package in Ubuntu:
  New

Bug description:
  == Comment: #0 - SEETEENA THOUFEEK  - 2024-08-12 
23:50:17 ==
  +++ This bug was initially created as a clone of Bug #207985 +++

  ---Problem Description---
  Post Migration Non-MDC L1 eralp1 crashed with 
migrate_misplaced_folio+0x4cc/0x5d0 (
   
  Machine Type = na 
   
  Contact Information = sthou...@in.ibm.com 
   
  ---Steps to Reproduce---
   Problem description  : 
  After 1 hour of successful migration from doodlp1 [MDC MODE] to eralp1[NON 
MDC mode],eralp1 guest 
  and dump is collected
   
  ---uname output---
  na
   
  ---Debugger---
  A debugger is not configured

  
  [281827.975244] NIP [c05f0620] migrate_misplaced_folio+0x4f0/0x5d0
  [281827.975251] LR [c05f067c] migrate_misplaced_folio+0x54c/0x5d0
  [281827.975258] Call Trace:
  [281827.975260] [c01e19ff7140] [c05f0670] 
migrate_misplaced_folio+0x540/0x5d0 (unreliable)
  [281827.975268] [c01e19ff71d0] [c054c9f0] 
__handle_mm_fault+0xf70/0x28e0
  [281827.975276] [c01e19ff7310] [c054e478] 
handle_mm_fault+0x118/0x400
  [281827.975284] [c01e19ff7360] [c053598c] 
__get_user_pages+0x1ec/0x5b0
  [281827.975291] [c01e19ff7420] [c0536920] 
get_user_pages_unlocked+0x120/0x4f0
  [281827.975298] [c01e19ff74c0] [c0081894ea9c] hva_to_pfn+0xf4/0x630 
[kvm]
  [281827.975316] [c01e19ff7550] [c00818b4efc4] 
kvmppc_book3s_instantiate_page+0xec/0x790 [kvm_hv]
  [281827.975326] [c01e19ff7660] [c00818b4f750] 
kvmppc_book3s_radix_page_fault+0xe8/0x380 [kvm_hv]
  [281827.975335] [c01e19ff7700] [c00818b488fc] 
kvmppc_book3s_hv_page_fault+0x294/0xd60 [kvm_hv]
  [281827.975344] [c01e19ff77e0] [c00818b43f5c] 
kvmppc_vcpu_run_hv+0xf94/0x11d0 [kvm_hv]
  [281827.975352] [c01e19ff78a0] [c0081896131c] 
kvmppc_vcpu_run+0x34/0x48 [kvm]
  [281827.975365] [c01e19ff78c0] [c0081895c164] 
kvm_arch_vcpu_ioctl_run+0x39c/0x570 [kvm]
  [281827.975379] [c01e19ff7950] [c0081894a104] 
kvm_vcpu_ioctl+0x20c/0x9a8 [kvm]
  [281827.975391] [c01e19ff7b30] [c0683974] sys_ioctl+0x574/0x16a0
  [281827.975395] [c01e19ff7c30] [c0030838] 
system_call_exception+0x168/0x310
  [281827.975400] [c01e19ff7e50] [c000d05c] 
system_call_vectored_common+0x15c/0x2ec
  [281827.975406] --- interrupt: 3000 at 0x7fffb7d4d2bc

  Mirroring to distro as per message in group channel

  
  Please pick these patches for this bug:

  ee86814b0562 ("mm/migrate: move NUMA hinting fault folio isolation + checks 
under PTL")
  4b88c23ab8c9 ("mm/migrate: make migrate_misplaced_folio() return 0 on 
success")
  d2136d749d76 ("mm: support multi-size THP numa balancing")
  6b0ed7b3c775 ("mm: factor out the numa mapping rebuilding into a new helper")
  ebb34f78d72c ("mm: convert folio_estimated_sharers() to 
folio_likely_mapped_shared()")
  133d04b1eee9 ("mm/numa_balancing: allow migrate on protnone reference with 
MPOL_PREFERRED_MANY policy")
  f8fd525ba3a2 ("mm/mempolicy: use numa_node_id() instead of cpu_to_node()")

  Thanks,
  Amit

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/2076866/+subscriptions



[Kernel-packages] [Bug 2071471] Re: [UBUNTU 24.04] IOMMU DMA mode changed in kernel config causes massive throughput degradation for PCI-related network workloads

2024-08-09 Thread Frank Heimes

Many thanks for the successful verification, Barbara!

-- 
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/2071471

Title:
  [UBUNTU 24.04] IOMMU DMA mode changed in kernel config causes massive
  throughput degradation for PCI-related network workloads

Status in Ubuntu on IBM z Systems:
  Fix Committed
Status in linux package in Ubuntu:
  Fix Committed
Status in linux source package in Noble:
  Fix Released
Status in linux source package in Oracular:
  Fix Committed

Bug description:
  SRU Justification:

  [Impact]

   * With the introduction of c76c067e488c "s390/pci: Use dma-iommu layer"
     (upstream since kernel v6.7-rc1) there was a move (on s390x only)
     to a different dma-iommu implementation.

   * And with 92bce97f0c34 "s390/pci: Fix reset of IOMMU software counters"
     (again upstream since v6.7-rc1) the IOMMU_DEFAULT_DMA_LAZY kernel config
     option should now be set to 'yes' by default for s390x.

   * Since CONFIG_IOMMU_DEFAULT_DMA_STRICT and IOMMU_DEFAULT_DMA_LAZY
     are related to each other CONFIG_IOMMU_DEFAULT_DMA_STRICT needs to be
     set to "no" by default, which was done upstream by b2b97a62f055
     "Revert "s390: update defconfigs"".

   * These changes are all upstream, but were not picked up by the Ubuntu
     kernel config.

   * And not having these config options set properly is causing significant
     PCI-related network throughput degradation (up to -72%).

   * This shows for almost all workloads and numbers of connections,
     deteriorating with the number of connections increasing.

   * Especially drastic is the drop for a high number of parallel connections
     (50 and 250) and for small and medium-size transactional workloads.
     However, also for streaming-type workloads the degradation is clearly
     visible (up to 48% degradation).

  [Fix]

   * The (upstream accepted) fix is to set
     IOMMU_DEFAULT_DMA_STRICT=no
     and
     IOMMU_DEFAULT_DMA_LAZY=y
     (which is needed for the changed DMA IOMMU implementation since v6.7).
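
  As a rough illustration of the fix, the state of the two options can be
  read from a kernel config file. The following Python sketch is not part
  of the SRU: the function name and the sample text are made up for
  illustration, and it assumes the usual config format where a disabled
  option appears as "# CONFIG_X is not set".

```python
import re


def config_option_state(config_text, option):
    """Return the value of a kernel config option ('y', 'm', ...).

    Returns 'n' when the option is marked "is not set" or is absent.
    The helper name is hypothetical and only serves this example.
    """
    name = option if option.startswith("CONFIG_") else "CONFIG_" + option
    for line in config_text.splitlines():
        line = line.strip()
        if line == f"# {name} is not set":
            return "n"
        match = re.fullmatch(rf"{re.escape(name)}=(.*)", line)
        if match:
            return match.group(1)
    return "n"


# Made-up sample showing the state described in the [Fix] section.
sample_config = """\
# CONFIG_IOMMU_DEFAULT_DMA_STRICT is not set
CONFIG_IOMMU_DEFAULT_DMA_LAZY=y
"""
```

  On a real system the text would come from a file such as
  /boot/config-<version>, read with open(...).read().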

  [Test Case]

   * Set up two Ubuntu Server 24.04 systems (with kernel 6.8)
     (one acting as server and one as client)
     that have (PCIe attached) RoCE Express devices attached
     and that are connected to each other.

   * Verify that the iommu_group type of the used PCI device is DMA-FQ:
 cat /sys/bus/pci/devices/\:00\:00.0/iommu_group/type
 DMA-FQ

   * Sample workload rr1c-200x1000-250 with rr1c-200x1000-250.xml:

   * Install uperf on both systems, client and server.

   * Start uperf at server: uperf -s

   * Start uperf at client: uperf -vai 5 -m uperf-profile.xml

   * Switch from strict to lazy mode
     either using the new kernel (or the test build below)
     or using kernel cmd-line parameter iommu.strict=0.

   * Restart uperf on server and client, like before.

   * Verification will be performed by IBM.
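
  The iommu_group type check from the test case could be scripted roughly
  as below. This is a sketch under assumptions: the helper names are
  hypothetical, and the mapping of "DMA-FQ" to lazy mode and "DMA" to
  strict mode follows the kernel's sysfs documentation for IOMMU groups
  as far as I understand it.

```python
from pathlib import Path

# Values of the iommu_group "type" attribute that matter for this bug.
LAZY_TYPE = "DMA-FQ"   # DMA API with batched (flush queue) invalidation
STRICT_TYPE = "DMA"    # DMA API with strict invalidation


def dma_mode(type_text):
    """Map the content of an iommu_group/type file to a mode name."""
    value = type_text.strip()
    if value == LAZY_TYPE:
        return "lazy"
    if value == STRICT_TYPE:
        return "strict"
    return "other"


def device_dma_mode(pci_address, sysfs_root="/sys/bus/pci/devices"):
    """Read the iommu_group type of one PCI device and classify it."""
    type_file = Path(sysfs_root) / pci_address / "iommu_group" / "type"
    return dma_mode(type_file.read_text())
```

  With the fixed kernel (or iommu.strict=0) the expectation from the test
  case is that the device used for the uperf runs reports "lazy".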

  [Regression Potential]

   * There is a certain regression potential, since the behavior with
     the two modified kernel config options will change significantly.

   * This may solve the (network) throughput issue with PCI devices,
     but may also come with side-effects on other PCIe based devices
     (the old compression adapters or the new NVMe carrier cards).

  [Other]

   * CCW devices are not affected.

   * This is s390x-specific only, hence will not affect any other
  architecture.

  __

  Symptom:
  Comparing Ubuntu 24.04 (kernel version: 6.8.0-31-generic) against Ubuntu 
22.04, all of our PCI-related network measurements on LPAR show massive 
throughput degradations (up to -72%). This shows for almost all workloads and 
numbers of connections, deteriorating with the number of connections 
increasing. Especially drastic is the drop for a high number of parallel 
connections (50 and 250) and for small and medium-size transactional workloads. 
However, also for streaming-type workloads the degradation is clearly visible 
(up to 48% degradation).

  Problem:
  With kernel config setting CONFIG_IOMMU_DEFAULT_DMA_STRICT=y, IOMMU DMA mode 
changed from lazy to strict, causing these massive degradations.
  Behavior can also be changed with a kernel commandline parameter 
(iommu.strict) for easy verification.
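
  To see which value of iommu.strict a system was booted with, the kernel
  command line can be inspected. A minimal Python sketch, assuming the
  parameter is passed as iommu.strict=<value>; the function name is made
  up for this example:

```python
def iommu_strict_setting(cmdline):
    """Return the last iommu.strict value on a kernel command line.

    Returns None when the parameter is not present at all.
    """
    value = None
    for token in cmdline.split():
        key, sep, val = token.partition("=")
        if key == "iommu.strict" and sep:
            value = val  # later occurrences replace earlier ones here
    return value
```

  On a running system the input would typically be the content of
  /proc/cmdline.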

  The issue is known and was quickly fixed upstream in December 2023, after 
being present for a little less than two weeks.
  Upstream fix: 
https://github.com/torvalds/linux/commit/b2b97a62f055dd638f7f02087331a8380d8f139a

  Repro:
  rr1c-200x1000-250 with rr1c-200x1000-250.xml:

[Kernel-packages] [Bug 2074376] Re: Disable PCI_DYNAMIC_OF_NODES in Ubuntu

2024-08-09 Thread Frank Heimes

Many thanks Kowshik Jois for the successful verification.
The verification results are at the duplicate bug:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/2075721/comments/7
re-posting here:

--- Comment From kowshik.j...@in.ibm.com 2024-08-09 11:36 EDT---
I have tested this scenario with the noble-proposed kernel. I could attach and 
detach interfaces successfully. No crash/trace messages found.

Guest Env:
===
Linux ubuntu 6.8.0-43-generic #43-Ubuntu SMP Fri Aug 2 19:46:18 UTC 2024 
ppc64le ppc64le ppc64le GNU/Linux

root@ubuntu:~# cat /boot/config-6.8.0-43-generic | grep PCI_DYNAMIC
# CONFIG_PCI_DYNAMIC_OF_NODES is not set

Before Attaching the Interface:
=

root@ubuntu:~# ip addr
1: lo:  mtu 65536 qdisc noqueue state UNKNOWN group 
default qlen 1000
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
inet 127.0.0.1/8 scope host lo
valid_lft forever preferred_lft forever
inet6 ::1/128 scope host noprefixroute
valid_lft forever preferred_lft forever
2: enp0s1:  mtu 1500 qdisc pfifo_fast state UP 
group default qlen 1000
link/ether 52:54:00:24:e5:58 brd ff:ff:ff:ff:ff:ff
inet 192.168.122.187/24 metric 100 brd 192.168.122.255 scope global dynamic 
enp0s1
valid_lft 2749sec preferred_lft 2749sec
inet6 fe80::5054:ff:fe24:e558/64 scope link
valid_lft forever preferred_lft forever

After Attaching the Interface:
=

# virsh attach-interface Ubuntu2404 bridge --source virbr0
Interface attached successfully

root@ubuntu:~# ip addr
1: lo:  mtu 65536 qdisc noqueue state UNKNOWN group 
default qlen 1000
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
inet 127.0.0.1/8 scope host lo
valid_lft forever preferred_lft forever
inet6 ::1/128 scope host noprefixroute
valid_lft forever preferred_lft forever
2: enp0s1:  mtu 1500 qdisc pfifo_fast state UP 
group default qlen 1000
link/ether 52:54:00:24:e5:58 brd ff:ff:ff:ff:ff:ff
inet 192.168.122.187/24 metric 100 brd 192.168.122.255 scope global dynamic 
enp0s1
valid_lft 2738sec preferred_lft 2738sec
inet6 fe80::5054:ff:fe24:e558/64 scope link
valid_lft forever preferred_lft forever
3: enp0s7:  mtu 1500 qdisc noop state DOWN group default 
qlen 1000
link/ether 52:54:00:96:d9:83 brd ff:ff:ff:ff:ff:ff

After Detaching the Interface:
=

# virsh detach-interface Ubuntu2404 bridge 52:54:00:96:d9:83
Interface detached successfully

root@ubuntu:~# ip addr
1: lo:  mtu 65536 qdisc noqueue state UNKNOWN group 
default qlen 1000
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
inet 127.0.0.1/8 scope host lo
valid_lft forever preferred_lft forever
inet6 ::1/128 scope host noprefixroute
valid_lft forever preferred_lft forever
2: enp0s1:  mtu 1500 qdisc pfifo_fast state UP 
group default qlen 1000
link/ether 52:54:00:24:e5:58 brd ff:ff:ff:ff:ff:ff
inet 192.168.122.187/24 metric 100 brd 192.168.122.255 scope global dynamic 
enp0s1
valid_lft 2720sec preferred_lft 2720sec
inet6 fe80::5054:ff:fe24:e558/64 scope link
valid_lft forever preferred_lft forever
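
The before/after comparison of the `ip addr` listings above can also be
done mechanically. The Python sketch below is only an illustration, not
part of the verification that was performed; the function name is
hypothetical, and it assumes interface lines start with "<index>: <name>:".

```python
import re


def interface_names(ip_addr_output):
    """Collect interface names from `ip addr` style output."""
    names = set()
    for line in ip_addr_output.splitlines():
        # Interface header lines look like "3: enp0s7: ... mtu 1500 ..."
        match = re.match(r"^(\d+):\s+([^:@\s]+)", line)
        if match:
            names.add(match.group(2))
    return names
```

Subtracting the set taken before the attach from the set taken after it
should yield only the new interface (enp0s7 in the listing above), and the
set after the detach should match the initial one again.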

** Tags removed: verification-needed-noble-linux
** Tags added: verification-done-noble-linux

-- 
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/2074376

Title:
  Disable PCI_DYNAMIC_OF_NODES in Ubuntu

Status in linux package in Ubuntu:
  Fix Committed
Status in linux source package in Noble:
  Fix Committed
Status in linux source package in Oracular:
  Fix Committed

Bug description:
  This came in via KTML from upstream. It is part of a discussion
  between upstream and IBM reporting a bug which occurs in KVM:

  Rob Herring  writes:

  >> On 2024/07/11 06:20 AM, Rob Herring wrote:
  >>> On Wed, Jul 3, 2024 at 8:17 AM Amit Machhiwal  
wrote:
  
   With CONFIG_PCI_DYNAMIC_OF_NODES [1], a hot-plug and hot-unplug sequence
   of a PCI device attached to a PCI-bridge causes following kernel Oops on
   a pseries KVM guest:
  >>>
  >>> Can I ask why you have this option on in the first place? Do you have
  >>> a use for it or it's just a case of distros turn on every kconfig
  >>> option.
  >>
  >> Yes, this option is turned on in Ubuntu's distro kernel config where the 
issue
  >> was originally reported, while Fedora is keeping this turned off.
  >>
  >> root@ubuntu:~# cat /boot/config-6.8.0-38-generic | grep PCI_DYN
  >> CONFIG_PCI_DYNAMIC_OF_NODES=y
  > 
  > Ubuntu should turn off this option. For starters, it is not complete
  > to be usable. Eventually, it should get removed in favor of some TBD
  > runtime option.
  > 
  > (And we should fix the crash too)

  This option is described in the config system as:

This option enables support for generating device tree nodes for some
PCI devices. Thus, the driver of this kind can load and overlay
flattened device tree for its downstream devices.
.
Once this option is selected, the device tree nodes will be generated

[Kernel-packages] [Bug 2075721] Re: [Ubuntu24.04] virsh detach-interface is crashing the guest

2024-08-09 Thread Frank Heimes

*** This bug is a duplicate of bug 2074376 ***
https://bugs.launchpad.net/bugs/2074376

Many thanks Kowshik Jois for the successful verification!

** Changed in: ubuntu-power-systems
   Status: New => Fix Committed

** Changed in: linux (Ubuntu)
   Status: New => Fix Committed

-- 
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/2075721

Title:
  [Ubuntu24.04] virsh detach-interface is crashing the guest

Status in The Ubuntu-power-systems project:
  Fix Committed
Status in linux package in Ubuntu:
  Fix Committed

Bug description:
  == Comment: #0 - Kowshik Jois B S  - 2024-05-28 
01:07:02 ==
  ---Problem Description---
  While trying virsh attach-interface and virsh detach-interface, it is 
observed that attaching an interface is successful, but trying to detach the 
same interface results in a guest crash with the trace messages below on the console.

  
  root@ubuntulp3guest1:~# [ 5363.726428] Kernel attempted to read user page 
(10ec0058) - exploit attempt? (uid: 0)
  [ 5363.726570] BUG: Unable to handle kernel data access on read at 
0x10ec0058
  [ 5363.726662] Faulting instruction address: 0xc12d4828
  [ 5363.726739] Oops: Kernel access of bad area, sig: 11 [#1]
  [ 5363.726800] LE PAGE_SIZE=64K MMU=Radix SMP NR_CPUS=2048 NUMA pSeries
  [ 5363.726880] Modules linked in: 8139too 8139cp mii qrtr cfg80211 
binfmt_misc uio_pdrv_genirq vmx_crypto uio dm_multipath nfnetlink ip_tables 
x_tables autofs4 btrfs blake2b_generic raid10 raid456 async_raid6_recov 
async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c raid1 raid0 
poly1305_p10_crypto chacha_p10_crypto libchacha crct10dif_vpmsum crc32c_vpmsum 
xhci_pci xhci_pci_renesas aes_gcm_p10_crypto
  [ 5363.727302] CPU: 0 PID: 1614 Comm: drmgr Not tainted 6.8.0-31-generic 
#31-Ubuntu
  [ 5363.727426] Hardware name: IBM pSeries (emulated by qemu) POWER10 (raw) 
0x800200 0xf06 of:SLOF,HEAD hv:linux,kvm pSeries
  [ 5363.727563] NIP:  c12d4828 LR: c12d68f0 CTR: 

  [ 5363.727653] REGS: c000149cb440 TRAP: 0300   Not tainted  
(6.8.0-31-generic)
  [ 5363.727742] MSR:  8280b033   CR: 
44088282  XER: 2004
  [ 5363.727855] CFAR: c12d68ec DAR: 10ec0058 DSISR: 4000 
IRQMASK: 0 
  [ 5363.727855] GPR00: c12d68f0 c000149cb6e0 c2254800 
10ec0048 
  [ 5363.727855] GPR04: c000149cb748   
 
  [ 5363.727855] GPR08:    
 
  [ 5363.727855] GPR12:  c3e8  
 
  [ 5363.727855] GPR16:    
 
  [ 5363.727855] GPR20:    
 
  [ 5363.727855] GPR24:   c48585a0 
c000149cb7d4 
  [ 5363.727855] GPR28: 0001 c00014de9400 10ec0048 
 
  [ 5363.728644] NIP [c12d4828] __of_changeset_entry_invert+0x10/0x1ac
  [ 5363.728732] LR [c12d68f0] __of_changeset_revert_entries+0x98/0x180
  [ 5363.728813] Call Trace:
  [ 5363.728845] [c000149cb7b0] [c12d6b60] 
of_changeset_revert+0x58/0xd8
  [ 5363.728937] [c000149cb800] [c0d0d498] 
of_pci_remove_node+0x74/0xb0
  [ 5363.729029] [c000149cb830] [c0cdbde0] 
pci_stop_bus_device+0xf4/0x138
  [ 5363.729126] [c000149cb870] [c0cdbf40] 
pci_stop_and_remove_bus_device_locked+0x34/0x64
  [ 5363.729232] [c000149cb8a0] [c0cf2950] remove_store+0xf0/0x108
  [ 5363.729311] [c000149cb8f0] [c0e88384] dev_attr_store+0x34/0x78
  [ 5363.729389] [c000149cb910] [c07f8234] sysfs_kf_write+0x70/0xa4
  [ 5363.729467] [c000149cb930] [c07f66a8] 
kernfs_fop_write_iter+0x1d0/0x2e0
  [ 5363.729558] [c000149cb980] [c06c8fc8] vfs_write+0x27c/0x558
  [ 5363.729639] [c000149cba30] [c06c9628] ksys_write+0x90/0x170
  [ 5363.729716] [c000149cba80] [c0033248] 
system_call_exception+0xf8/0x290
  [ 5363.729811] [c000149cbe50] [c000d05c] 
system_call_vectored_common+0x15c/0x2ec
  [ 5363.729903] --- interrupt: 3000 at 0x74191e15c720
  [ 5363.729964] NIP:  74191e15c720 LR: 74191e15c720 CTR: 

  [ 5363.730053] REGS: c000149cbe80 TRAP: 3000   Not tainted  
(6.8.0-31-generic)
  [ 5363.730143] MSR:  8280f033   
CR: 48088202  XER: 
  [ 5363.730257] IRQMASK: 0 
  [ 5363.730257] GPR00: 0004 7bdfb730 74191e296d00 
000b 
  [ 5363.730257] GPR04: 0be4ed58d640 0001  
0031 
  [ 5363.730257] GPR08:    
 
  [ 5363.730257] GPR12:  74191e3eb300 0

[Kernel-packages] [Bug 2074380] Re: [UBUNTU 22.04] s390/cpum_cf: make crypto counters upward compatible

2024-08-08 Thread Frank Heimes

** Changed in: ubuntu-z-systems
   Status: In Progress => Fix Committed

-- 
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/2074380

Title:
  [UBUNTU 22.04] s390/cpum_cf: make crypto counters upward compatible

Status in Ubuntu on IBM z Systems:
  Fix Committed
Status in linux package in Ubuntu:
  Fix Released
Status in linux source package in Jammy:
  Fix Committed
Status in linux source package in Noble:
  Fix Committed

Bug description:
  SRU Justification:

  [ Impact ]

   * The CPU Measurement Facility (CPU MF) crypto counter set
     is not listed in the device sysfs tree - it's not exported
     in the sysfs directory /sys/devices/cpum_cf/events.

   * The attribute files for each CPU-MF counter defined
     in the crypto counter set are missing.

   * This is caused by the counter second version number of CPU MF
     hardware being incremented on new machines.

   * This causes a sanity check to fail,
     but the counters are supported by hardware.

   * The solution is to remove the upper limit in counter second
     version number check.

  [ Fix ]

   * f10933cbd2df f10933cbd2dfddf6273698a45f76db9bafd8150f
 "s390/cpum_cf: make crypto counters upward compatible across machine types"

   * The fix was upstream accepted with kernel v6.10(-rc1).

   * Upstream commit applies cleanly on noble master-next, 
 but needed to be backported to jammy master-next due to different code
 and context in kernel 5.15.

  [ Test Plan ]

   * Run the following commands on a new machine generation:
     (hence only doable by IBM)
     # ls -l /sys/devices/cpum_cf/events/ | grep AES

   * If the output is empty then this patch is required.

   * With a patched kernel the output should be like:
     # ls /sys/devices/cpum_cf/events/ | grep AES
     AES_BLOCKED_CYCLES
     AES_BLOCKED_FUNCTIONS
     AES_CYCLES
     AES_FUNCTIONS
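
  The check from the test plan can be expressed as a small script. The
  following Python sketch assumes the sysfs path quoted above; the helper
  names are hypothetical and it has not been run on the hardware in
  question.

```python
import os


def filter_events(names, pattern="AES"):
    """Return the sorted event names that contain the pattern."""
    return sorted(name for name in names if pattern in name)


def crypto_counter_events(events_dir="/sys/devices/cpum_cf/events"):
    """List AES-related CPU-MF events; empty if the directory is absent."""
    try:
        entries = os.listdir(events_dir)
    except FileNotFoundError:
        return []
    return filter_events(entries)
```

  An empty result on a new machine generation corresponds to the empty
  grep output that indicates the patch is required.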

  [ Where problems could occur ]

   * This affects s390x only - CPU MF is s390-specific,
     and only s390 specific code is modified.

   * And it furthermore is limited to the crypto counter set
     of CPU MF.

   * So any impact is likely limited to hardware crypto counters
     on s390x only.

   * In s390/kernel/perf_cpum_cf.c the else if case got changed from
     explicitly checking for 6 or 7 to >= 6 which seems to require
     attention for future 8 and more cases.

   * In s390/kernel/perf_cpum_cf_events.c the switch (ci.csvn) statement
     was changed to an if / else if with similar logic.
     Again, attention is needed for any potential future cases >= 8.

   * It does not look like currently used cases (1..5 and 6..7)
     are affected by the modification, just >7.

   * Test builds of patched jammy and noble s390x kernels were built
     and are available here:
     https://launchpad.net/~fheimes/+archive/ubuntu/lp2074380
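
  The effect of relaxing the counter second version number (csvn) check
  can be illustrated as follows. This is not the kernel code, only a
  Python sketch of the two conditions as described above, with made-up
  function names:

```python
def crypto_set_accepted_old(csvn):
    """Old behaviour as described: only csvn 6 or 7 is accepted."""
    return csvn in (6, 7)


def crypto_set_accepted_new(csvn):
    """New behaviour as described: any csvn >= 6 is accepted."""
    return csvn >= 6


# Versions for which the two checks disagree, within a small sample range.
changed = [v for v in range(1, 10)
           if crypto_set_accepted_old(v) != crypto_set_accepted_new(v)]
```

  In this sketch the two checks agree for csvn 1..7 and differ only for
  values above 7, which matches the statement that currently used cases
  are not affected.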

  [ Other Info ]

   * Since the code/fix was upstream accepted with kernel v6.10(-rc1)
     it does not affect the current development release oracular.

   * This SRU can also be seen under the umbrella of new
  hardware enablement.

   * Since it requires special hw, the verification needs to be
     done by IBM.

  __

  Description:   kernel: s390/cpum_cf: make crypto counters upward
  compatible

  Symptom:   The CPU Measurement facility crypto counter set is not
     listed in the device sysfs tree.

  Problem:   The CPU Measurement facility crypto counter set is not
     exported in the sysfs directory
     /sys/devices/cpum_cf/events.
     The attribute files for each CPU-MF counter defined
     in the crypto counter set are missing. This is caused
     by the counter second version number of the CPU
     Measurement Facility hardware being incremented on
     new machines.  This causes a sanity check to fail,
     but the counters are supported by hardware.

  Solution:  Remove upper limit in counter second version number
     check.

  Reproduction:  Run command on a new machine generation:
  # ls -l /sys/devices/cpum_cf/events/ | grep AES
  #
     If the output is empty then this patch is required.
     The output should be:
  # ls  /sys/devices/cpum_cf/events/ | grep AES
  AES_BLOCKED_CYCLES
  AES_BLOCKED_FUNCTIONS
  AES_CYCLES
  AES_FUNCTIONS
  #

  Upstream-ID of fix:   f10933cbd2dfddf6273698a45f76db9bafd8150f

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu-z-systems/+bug/2074380/+subscriptions



[Kernel-packages] [Bug 2064539] Re: Revert back frame pointers for ppc64el (remove -fno-omit-frame-pointer)

2024-08-08 Thread Frank Heimes

Hi @schopin, I agree - that is what is now under option 1) under test
plan in the SRU Justification.

-- 
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/2064539

Title:
  Revert back frame pointers for ppc64el (remove -fno-omit-frame-
  pointer)

Status in The Ubuntu-power-systems project:
  New
Status in dpkg package in Ubuntu:
  Fix Released
Status in glibc package in Ubuntu:
  New
Status in linux package in Ubuntu:
  New
Status in dpkg source package in Noble:
  Fix Committed
Status in glibc source package in Noble:
  New
Status in linux source package in Noble:
  New
Status in dpkg source package in Oracular:
  Fix Released
Status in glibc source package in Oracular:
  New
Status in linux source package in Oracular:
  New

Bug description:
  SRU Justification:

  [ Impact ]

   * Power's Linux ABIs all require an explicit call chain be stored on
  the call stack frames which are all accessible via the stack pointer.

   * Therefore, having a (soft/simulated) frame pointer does not improve
  backtraces at all on Power.

   * However, forcing a frame pointer via the -fno-omit-frame-pointer
  option negatively affects performance for multiple reasons: extra
  prologue/epilogue overhead and fewer shrink-wrapping opportunities.

   * Given -fno-omit-frame-pointer does not provide any improvements
  (backtraces or otherwise) and only reduces performance,
  -fno-omit-frame-pointer should not be used on Power.

   * So we are facing here a performance penalty without any gain - on
  this particular platform.

   * And sometimes (in rare cases like LP#2060108) frame pointers may
  even lead to failed builds.

  [ Test Plan ]

   * Due to the above description of the impact and rationale,
 this pragmatic approach for testing is given:

   * Build the affected packages where frame-pointers should be reverted
 using the updated dpkg package (that incl. the modified build defaults)
 on (or for) this particular platform.

   * Now frame-pointer usage can be checked in the following different ways:

   * 1) For ease of use (and thanks to Julian Klode), there is a Python
test script available that verifies a binary with regard to
frame pointers:
https://gist.github.com/julian-klode/85e3f85c410a1b856a93dce77208

   * 2) Another more manual way is to verify based on debug symbols like this:
- find and install the ddeb package
- maybe extract the  file (e.g. unzstd)
- use 'readelf -wi'
- and grep for 'DW_AT_produce' (build options)
- look for entries regarding frame-pointer
The output may look similar to this:
readelf -wi 
./usr/lib/debug/lib/modules/6.8.0-38-generic/kernel/arch/s390/crypto/aes_s390.ko
 | grep DW_AT_produce
<23>   DW_AT_producer: (indirect string, offset: 0x7d): GNU AS 
2.42
<129>   DW_AT_producer: (indirect string, offset: 0x3eef): GNU 
C11 13.2.0 -m64 -mpacked-stack -mbackchain -msoft-float -march=z13 -mtune=z16 
-mindirect-branch=thunk-extern -mfunction-return=thunk-extern 
-mindirect-branch-table -mrecord-mcount -mnop-mcount -mfentry -mzarch -g 
-gdwarf-5 -O2 -std=gnu11 -p -fshort-wchar -funsigned-char -fno-common 
-fno-strict-aliasing -fno-asynchronous-unwind-tables 
-fno-delete-null-pointer-checks -fno-allow-store-data-races 
-fno-stack-protector -ftrivial-auto-var-init=zero -fno-stack-clash-protection 
-fzero-call-used-regs=used-gpr -fno-inline-functions-called-once 
-falign-functions=8 -fstrict-flex-arrays=3 -fno-strict-overflow 
-fstack-check=no -fconserve-stack -fsanitize=bounds-strict -fsanitize=shift 
-fsanitize=bool -fsanitize=enum -fPIC

   * 3) And maybe watching the build messages / log for the build options that
were used (but that is probably not sufficient - it's better to inspect
the output.)
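
  Option 2) above can be partly automated by scanning the DW_AT_producer
  text for frame-pointer flags. The Python sketch below is a simplified
  illustration and is unrelated to the gist referenced in option 1); the
  function name is hypothetical, and it assumes the readelf output has
  already been captured as text.

```python
FRAME_POINTER_FLAGS = ("-fno-omit-frame-pointer", "-fomit-frame-pointer")


def frame_pointer_flags(readelf_output):
    """Return the frame-pointer related flags found in producer strings.

    Line wrapping is ignored by splitting the text on any whitespace.
    Each flag is reported once, in order of first appearance.
    """
    found = []
    for token in readelf_output.split():
        if token in FRAME_POINTER_FLAGS and token not in found:
            found.append(token)
    return found
```

  An empty result for a ppc64el binary would suggest that
  -fno-omit-frame-pointer was not passed explicitly, which is the goal of
  this change.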

  [ Where problems could occur ]

   * The dpkg modifications could have been done erroneously.
 A dpkg test build and/or builds of other packages with the modified dpkg
 version in place would show this.

   * The settings in dpkg might be overwritten by other settings/packages.
 Tests like above, would show this.

   * One may think there could be issues in an environment where some packages
 have frame-pointer enabled and others don't.
 This is fine and was confirmed by the IBM toolchain team and ours
 (as well as by a longer-running test system,
  with FP disabled in the kernel, that showed no issues - as expected).

  [ Other Info ]
   
   * These changes were implemented during the opening of the oracular series.
 The very same changes are backported to 24.04 LTS.

   * These only affect the ppc64el and s390x architectures,
 for other architectures it's a no-change upload.

   * We didn't see any fallout for these changes during the development
 on the oracular series, and therefore don't expect any f

[Kernel-packages] [Bug 2064539] Re: Revert back frame pointers for ppc64el (remove -fno-omit-frame-pointer)

2024-08-08 Thread Frank Heimes

** Description changed:

- Power's Linux ABIs all require an explicit call chain be stored on the call 
stack frames which are all accessible via the stack pointer.
- Therefore, having a (soft/simulated) frame pointer does not improve 
backtraces at all on Power.
+ SRU Justification:
  
- However, forcing a frame pointer via the -fno-omit-frame-pointer option 
negatively affects performance for multiple reasons: extra prologue/epilogue 
overhead and fewer shrink-wrapping opportunities.
- Given -fno-omit-frame-pointer does not provide any improvements (backtraces 
or otherwise) and only reduces performance, -fno-omit-frame-pointers should not 
be used on Power.
+ [ Impact ]
  
- SRU:
+  * Power's Linux ABIs all require an explicit call chain be stored on
+ the call stack frames which are all accessible via the stack pointer.
  
- these changes were implemented during the opening of the oracular
- series. The very same changes are backported to 24.04 LTS. These only
- affect the ppc64el and s390x architectures, for other architectures it's
- a no-change upload.
+  * Therefore, having a (soft/simulated) frame pointer does not improve
+ backtraces at all on Power.
  
- We didn't see any fallout for these changes during the development on
- the oracular series, and therefore don't expect any fallout or
- regressions in 24.04 LTS either.
+  * However, forcing a frame pointer via the -fno-omit-frame-pointer
+ option negatively affects performance for multiple reasons: extra
+ prologue/epilogue overhead and fewer shrink-wrapping opportunities.
+ 
+  * Given -fno-omit-frame-pointer does not provide any improvements
+ (backtraces or otherwise) and only reduces performance, -fno-omit-frame-
+ pointers should not be used on Power.
+ 
+  * So we are facing here a performance penalty without any gain - on
+ this particular platform.
+ 
+  * And sometimes (in rare cases like LP#2060108) frame pointers may even
+ lead to failed builds.
+ 
+ [ Test Plan ]
+ 
+  * Due to the above description of the impact and rationale,
+this pragmatic approach for testing is given:
+ 
+  * Build the affected packages where frame-pointers should be reverted
+using the updated dpkg package (that incl. the modified build defaults)
+on (or for) this particular platform.
+ 
+  * Now frame-pointer usage be checked in the following different ways:
+ 
+  * 1) For the ease of use (and thanks to Julian Klode), there is this python
+   test script available that allows to verify a binary in regard to
+   frame pointers:
+   https://gist.github.com/julian-klode/85e3f85c410a1b856a93dce77208
+ 
+  * 2) Another more manual way is to verify based on debug symbols like this:
+   - find and install the ddeb package
+   - maybe extract the  file (e.g. unzstd)
+   - use 'readelf -wi'
+   - and grep for 'DW_AT_produce' (build options)
+   - look for entries regarding frame-pointer
+   The output may look similar to this:
+   readelf -wi 
./usr/lib/debug/lib/modules/6.8.0-38-generic/kernel/arch/s390/crypto/aes_s390.ko
 | grep DW_AT_produce
+   <23>   DW_AT_producer: (indirect string, offset: 0x7d): GNU AS 
2.42
+   <129>   DW_AT_producer: (indirect string, offset: 0x3eef): GNU 
C11 13.2.0 -m64 -mpacked-stack -mbackchain -msoft-float -march=z13 -mtune=z16 
-mindirect-branch=thunk-extern -mfunction-return=thunk-extern 
-mindirect-branch-table -mrecord-mcount -mnop-mcount -mfentry -mzarch -g 
-gdwarf-5 -O2 -std=gnu11 -p -fshort-wchar -funsigned-char -fno-common 
-fno-strict-aliasing -fno-asynchronous-unwind-tables 
-fno-delete-null-pointer-checks -fno-allow-store-data-races 
-fno-stack-protector -ftrivial-auto-var-init=zero -fno-stack-clash-protection 
-fzero-call-used-regs=used-gpr -fno-inline-functions-called-once 
-falign-functions=8 -fstrict-flex-arrays=3 -fno-strict-overflow 
-fstack-check=no -fconserve-stack -fsanitize=bounds-strict -fsanitize=shift 
-fsanitize=bool -fsanitize=enum -fPIC
+ 
+  * 3) And maybe watching the build messages / log for the build options that
+   were used (but that is probably not sufficient - it's better to inspect
+   the output.)
+ 
+ [ Where problems could occur ]
+ 
+  * The dpkg modifications could have been done erroneously.
+A dpkg test build and/or builds of other packages with the modified dpkg
+version in place would show this.
+ 
+  * The settings in dpkg might be overwritten by other settings/packages.
+Tests like the above would show this.
+ 
+  * One may think there could be issues in an environment where some packages
+have frame-pointer enabled and others don't.
+This is fine and was confirmed by the IBM toolchain team and ours
+(as well as by a longer running test system,
+ with FP disabled in kernel, that showed no issues - as expected).
+ 
+ [ Other Info ]
+  
+  * These changes were implemented during the opening of the oracular series.
+The very same changes are backported to 24.04 LTS.
+ 
+  * T

[Kernel-packages] [Bug 2064538] Re: Revert back frame pointers for s390x (remove -fno-omit-frame-pointer but use -mbackchain)

2024-08-08 Thread Frank Heimes

** Description changed:

  SRU Justification:
  
  [ Impact ]
  
-  * The preferred way of doing stack unwinding on Linux on Z is via dwarf call 
frame information.
+  * The preferred way of doing stack unwinding on Linux on Z is via dwarf call 
frame information.
  In absence of a dwarf unwinder (as in the Linux kernel) a stack chain can be 
maintained at runtime in addition to the dwarf unwinding information.
  
-  * This allows for simple backtrace implementations, but imposes a small
+  * This allows for simple backtrace implementations, but imposes a small
  runtime overhead. For this to work, all code that might be part of
  backtrace must be built with the -mbackchain GCC option.
  
-  * The -fno-omit-frame-pointer switch is neither necessary nor helpful in this 
context.
-   Having a (soft/simulated) frame pointer does not improve backtraces at all 
on IBM Z.
+  * The -fno-omit-frame-pointer switch is neither necessary nor helpful in this 
context.
+   Having a (soft/simulated) frame pointer does not improve backtraces at all 
on IBM Z.
  
-  * However, forcing a frame pointer via the -fno-omit-frame-pointer
+  * However, forcing a frame pointer via the -fno-omit-frame-pointer
  option negatively affects performance for multiple reasons: extra
  prologue/epilogue overhead and fewer shrink-wrapping opportunities.
  
-  * Given -fno-omit-frame-pointer does not provide any improvements
+  * Given -fno-omit-frame-pointer does not provide any improvements
  (backtraces or otherwise) and only reduces performance,
  -fno-omit-frame-pointer should not be used on IBM Z.
  
-  * So we are facing here a performance penalty without any gain - on
+  * So we are facing here a performance penalty without any gain - on
  this particular platform.
  
-  * And sometimes (in rare cases like LP#2060108) frame pointers may even
+  * And sometimes (in rare cases like LP#2060108) frame pointers may even
  lead to failed builds.
  
  [ Test Plan ]
  
-  * Due to the above description of the impact and rationale,
-this pragmatic approach for testing is given:
+  * Due to the above description of the impact and rationale,
+    this pragmatic approach for testing is given:
  
-  * Build the affected packages where frame-pointers should be reverted
-using the updated dpkg package (that incl. the modified build defaults)
-on (or for) this particular platform.
+  * Build the affected packages where frame-pointers should be reverted
+    using the updated dpkg package (that incl. the modified build defaults)
+    on (or for) this particular platform.
  
-  * Now frame-pointer usage can be checked in the following different ways:
+  * Now frame-pointer usage can be checked in the following different ways:
  
-  * 1) For ease of use (and thanks to Julian Klode), there is a Python
-   test script available that can verify a binary with regard to
-   frame pointers:
-   https://gist.github.com/julian-klode/85e3f85c410a1b856a93dce77208
+  * 1) For ease of use (and thanks to Julian Klode), there is a Python
+   test script available that can verify a binary with regard to
+   frame pointers:
+   https://gist.github.com/julian-klode/85e3f85c410a1b856a93dce77208
  
-  * 2) Another more manual way is to verify based on debug symbols like this:
-   - find and install the ddeb package
-   - maybe extract the  file (e.g. unzstd)
-   - use 'readelf -wi'
-   - and grep for 'DW_AT_produce' (build options)
-   - look for entries regarding frame-pointer
-   The output may look similar to this:
-   readelf -wi ./usr/lib/debug/lib/modules/6.8.0-38-generic/kernel/arch/s390/crypto/aes_s390.ko | grep DW_AT_produce
-   <23>   DW_AT_producer: (indirect string, offset: 0x7d): GNU AS 
2.42
-   <129>   DW_AT_producer: (indirect string, offset: 0x3eef): GNU 
C11 13.2.0 -m64 -mpacked-stack -mbackchain -msoft-float -march=z13 -mtune=z16 
-mindirect-branch=thunk-extern -mfunction-return=thunk-extern 
-mindirect-branch-table -mrecord-mcount -mnop-mcount -mfentry -mzarch -g 
-gdwarf-5 -O2 -std=gnu11 -p -fshort-wchar -funsigned-char -fno-common 
-fno-strict-aliasing -fno-asynchronous-unwind-tables 
-fno-delete-null-pointer-checks -fno-allow-store-data-races 
-fno-stack-protector -ftrivial-auto-var-init=zero -fno-stack-clash-protection 
-fzero-call-used-regs=used-gpr -fno-inline-functions-called-once 
-falign-functions=8 -fstrict-flex-arrays=3 -fno-strict-overflow 
-fstack-check=no -fconserve-stack -fsanitize=bounds-strict -fsanitize=shift 
-fsanitize=bool -fsanitize=enum -fPIC
+  * 2) Another more manual way is to verify based on debug symbols like this:
+   - find and install the ddeb package
+   - maybe extract the  file (e.g. unzstd)
+   - use 'readelf -wi'
+   - and grep for 'DW_AT_produce' (build options)
+   - look for entries regarding frame-pointer
+   The output may look similar to this:
+   readelf -wi 
./usr/lib/d

[Kernel-packages] [Bug 2064538] Re: Revert back frame pointers for s390x (remove -fno-omit-frame-pointer but use -mbackchain)

2024-08-08 Thread Frank Heimes

** Description changed:

- The preferred way of doing stack unwinding on Linux on Z is via dwarf call 
frame information.
+ SRU Justification:
+ 
+ [ Impact ]
+ 
+  * The preferred way of doing stack unwinding on Linux on Z is via dwarf call 
frame information.
  In absence of a dwarf unwinder (as in the Linux kernel) a stack chain can be 
maintained at runtime in addition to the dwarf unwinding information.
- This allows for simple backtrace implementations, but imposes a small runtime 
overhead. For this to work, all code that might be part of backtrace must be 
built with the -mbackchain GCC option.
  
- The -fno-omit-frame-pointer switch is neither necessary nor helpful in this 
context.
- Having a (soft/simulated) frame pointer does not improve backtraces at all on 
IBM Z.
- However, forcing a frame pointer via the -fno-omit-frame-pointer option 
negatively affects performance for multiple reasons: extra prologue/epilogue 
overhead and fewer shrink-wrapping opportunities.
- Given -fno-omit-frame-pointer does not provide any improvements (backtraces 
or otherwise) and only reduces performance, -fno-omit-frame-pointer should not 
be used on IBM Z.
+  * This allows for simple backtrace implementations, but imposes a small
+ runtime overhead. For this to work, all code that might be part of
+ backtrace must be built with the -mbackchain GCC option.
+ 
+  * The -fno-omit-frame-pointer switch is neither necessary nor helpful in this 
context.
+   Having a (soft/simulated) frame pointer does not improve backtraces at all 
on IBM Z.
+ 
+  * However, forcing a frame pointer via the -fno-omit-frame-pointer
+ option negatively affects performance for multiple reasons: extra
+ prologue/epilogue overhead and fewer shrink-wrapping opportunities.
+ 
+  * Given -fno-omit-frame-pointer does not provide any improvements
+ (backtraces or otherwise) and only reduces performance,
+ -fno-omit-frame-pointer should not be used on IBM Z.
+ 
+  * So we are facing here a performance penalty without any gain - on
+ this particular platform.
+ 
+  * And sometimes (in rare cases like LP#2060108) frame pointers may even
+ lead to failed builds.
+ 
+ [ Test Plan ]
+ 
+  * Due to the above description of the impact and rationale,
+this pragmatic approach for testing is given:
+ 
+  * Build the affected packages where frame-pointers should be reverted
+using the updated dpkg package (that incl. the modified build defaults)
+on (or for) this particular platform.
+ 
+  * Now frame-pointer usage can be checked in the following different ways:
+ 
+  * 1) For ease of use (and thanks to Julian Klode), there is a Python
+   test script available that can verify a binary with regard to
+   frame pointers:
+   https://gist.github.com/julian-klode/85e3f85c410a1b856a93dce77208
+ 
+  * 2) Another more manual way is to verify based on debug symbols like this:
+   - find and install the ddeb package
+   - maybe extract the  file (e.g. unzstd)
+   - use 'readelf -wi'
+   - and grep for 'DW_AT_produce' (build options)
+   - look for entries regarding frame-pointer
+   The output may look similar to this:
+   readelf -wi ./usr/lib/debug/lib/modules/6.8.0-38-generic/kernel/arch/s390/crypto/aes_s390.ko | grep DW_AT_produce
+   <23>   DW_AT_producer: (indirect string, offset: 0x7d): GNU AS 
2.42
+   <129>   DW_AT_producer: (indirect string, offset: 0x3eef): GNU 
C11 13.2.0 -m64 -mpacked-stack -mbackchain -msoft-float -march=z13 -mtune=z16 
-mindirect-branch=thunk-extern -mfunction-return=thunk-extern 
-mindirect-branch-table -mrecord-mcount -mnop-mcount -mfentry -mzarch -g 
-gdwarf-5 -O2 -std=gnu11 -p -fshort-wchar -funsigned-char -fno-common 
-fno-strict-aliasing -fno-asynchronous-unwind-tables 
-fno-delete-null-pointer-checks -fno-allow-store-data-races 
-fno-stack-protector -ftrivial-auto-var-init=zero -fno-stack-clash-protection 
-fzero-call-used-regs=used-gpr -fno-inline-functions-called-once 
-falign-functions=8 -fstrict-flex-arrays=3 -fno-strict-overflow 
-fstack-check=no -fconserve-stack -fsanitize=bounds-strict -fsanitize=shift 
-fsanitize=bool -fsanitize=enum -fPIC
+ 
+  * 3) And maybe watching the build messages / log for the build options that
+   were used (but that is probably not sufficient - it's better to inspect
+   the output.)
+ 
+ [ Where problems could occur ]
+ 
+  * The dpkg modifications could have been done erroneously.
+A dpkg test build and/or builds of other packages with the modified dpkg
+version in place would show this.
+ 
+  * The settings in dpkg might be overwritten by other settings/packages.
+Tests like the above would show this.
+ 
+  * One may think there could be issues in an environment where some packages
+have frame-pointer enabled and others don't.
+This is fine and was confirmed by the IBM toolchain team and ours
+(as well as by a longer running test system,
+ with FP disabled in kernel, that

[Kernel-packages] [Bug 2064539] Re: Revert back frame pointers for ppc64el (remove -fno-omit-frame-pointer)

2024-08-08 Thread Frank Heimes

Many thanks doko, the debdiff looks reasonable and good.
I did a test build (based on v1.22.6ubuntu6.1 
https://launchpad.net/~ubuntu-toolchain-r/+archive/ubuntu/ppa/+sourcepub/16274696/+listing-archive-extra)
and wasn't able to find any indication of FPs (frame pointers) in the debug symbols anymore
(using readelf -wi and grepping for DW_AT_produce).
Same for LP#2064538.
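
For anyone who wants to repeat that check, here is a minimal sketch, assuming Python 3 and the readelf tool from binutils are installed. It scans `readelf -wi` output for DW_AT_producer entries and reports which frame-pointer related options show up. The function names and the selection of flags are my own choice for illustration; this is not Julian's script and not part of dpkg.

```python
import subprocess

# Options we care about for this check (an illustrative selection, not exhaustive).
FLAGS_OF_INTEREST = (
    "-fno-omit-frame-pointer",
    "-fomit-frame-pointer",
    "-mbackchain",
)


def read_debug_info(path):
    """Run 'readelf -wi' on a file with debug info and return its text output."""
    result = subprocess.run(
        ["readelf", "-wi", path],
        capture_output=True, text=True, check=True,
    )
    return result.stdout


def producer_flags(readelf_output):
    """Return the set of flags of interest found on DW_AT_producer lines."""
    found = set()
    for line in readelf_output.splitlines():
        if "DW_AT_producer" not in line:
            continue  # only the producer attribute records the build options
        found.update(tok for tok in line.split() if tok in FLAGS_OF_INTEREST)
    return found
```

Calling `producer_flags(read_debug_info(path))` on an extracted debug file (the path is whatever the ddeb installed) should no longer list -fno-omit-frame-pointer after a rebuild with the updated dpkg; on s390x one would expect -mbackchain to remain.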

-- 
https://bugs.launchpad.net/bugs/2064539

Title:
  Revert back frame pointers for ppc64el (remove -fno-omit-frame-
  pointer)

Status in The Ubuntu-power-systems project:
  New
Status in dpkg package in Ubuntu:
  Fix Released
Status in glibc package in Ubuntu:
  New
Status in linux package in Ubuntu:
  New
Status in dpkg source package in Noble:
  New
Status in glibc source package in Noble:
  New
Status in linux source package in Noble:
  New
Status in dpkg source package in Oracular:
  Fix Released
Status in glibc source package in Oracular:
  New
Status in linux source package in Oracular:
  New

Bug description:
  Power's Linux ABIs all require an explicit call chain be stored on the call 
stack frames which are all accessible via the stack pointer.
  Therefore, having a (soft/simulated) frame pointer does not improve 
backtraces at all on Power.

  However, forcing a frame pointer via the -fno-omit-frame-pointer option 
negatively affects performance for multiple reasons: extra prologue/epilogue 
overhead and fewer shrink-wrapping opportunities.
  Given -fno-omit-frame-pointer does not provide any improvements (backtraces 
or otherwise) and only reduces performance, -fno-omit-frame-pointer should not 
be used on Power.

  SRU:

  These changes were implemented during the opening of the oracular
  series. The very same changes are backported to 24.04 LTS. These only
  affect the ppc64el and s390x architectures, for other architectures
  it's a no-change upload.

  We didn't see any fallout for these changes during the development on
  the oracular series, and therefore don't expect any fallout or
  regressions in 24.04 LTS either.
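
To illustrate the first point of the description (the call chain is reachable from the stack pointer alone, so a frame pointer adds nothing), here is a toy model in Python. It is only a sketch under stated assumptions: memory is a plain dict of address to value, the addresses are invented, and the LR save slot at offset 16 in the caller's frame follows my reading of the ELFv2 frame layout. It is not code from the kernel, glibc, or any unwinder.

```python
LR_SAVE_OFFSET = 16  # assumed offset of the LR save slot within a frame


def walk_back_chain(memory, sp, max_frames=64):
    """Follow back-chain pointers starting at sp; return saved return addresses."""
    return_addresses = []
    for _ in range(max_frames):
        caller_sp = memory.get(sp, 0)  # word at SP+0 is the back chain
        if caller_sp == 0:             # a zero back chain ends the walk
            break
        # In this model the callee saved its return address in the caller's frame.
        return_addresses.append(memory[caller_sp + LR_SAVE_OFFSET])
        sp = caller_sp
    return return_addresses


# Invented three-frame stack: leaf at 0x1000, its caller at 0x1040, outermost at 0x1080.
example_memory = {
    0x1000: 0x1040,
    0x1040: 0x1080, 0x1040 + LR_SAVE_OFFSET: 0xAAA0,
    0x1080: 0,      0x1080 + LR_SAVE_OFFSET: 0xBBB0,
}
```

Walking `example_memory` from 0x1000 visits every frame without consulting any frame pointer register, which is the point being made above.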

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu-power-systems/+bug/2064539/+subscriptions


-- 
Mailing list: https://launchpad.net/~kernel-packages
Post to : kernel-packages@lists.launchpad.net
Unsubscribe : https://launchpad.net/~kernel-packages
More help   : https://help.launchpad.net/ListHelp

[Kernel-packages] [Bug 2076147] Re: Add 'mm: hold PTL from the first PTE while reclaiming a large folio' to fix L2 Guest hang during LTP Test

2024-08-07 Thread Frank Heimes

** Description changed:

  SRU Justification:
  
  [ Impact ]
  
-  * KVM 2nd level guest (means KVM VM that runs nested on top of a Power 10
-PowerVM hypervisor) hangs during LTP (Linux Test Project) test suite.
+  * KVM 2nd level guest (means KVM VM that runs nested on top of a Power 10
+    PowerVM hypervisor) hangs during LTP (Linux Test Project) test suite.
  
-  * It hangs with:
-"Back trace of paca->saved_r1 (0xc000000c1bc8bb00) (possibly stale) @ 
new_slab"
+  * It hangs with:
+    "Back trace of paca->saved_r1 (0xc000000c1bc8bb00) (possibly stale) @ 
new_slab"
  
-  * Diagnosing the issue points to this fix/upstream-commit:
-[commit message, by Barry Song ]
-Within try_to_unmap_one(), page_vma_mapped_walk() races with other PTE
-modifications preceded by pte clear. While iterating over PTEs of a large 
folio,
-it only starts acquiring PTL from the first valid (present) PTE.
-PTE modifications can temporarily set PTEs to pte_none.
-Consequently, the initial PTEs of a large folio might be skipped
-in try_to_unmap_one().
-For example, for an anon folio, if we skip PTE0, we may have PTE0 which is
-still present, while PTE1 ~ PTE(nr_pages - 1) are swap entries after
-try_to_unmap_one().
-So folio will be still mapped, the folio fails to be reclaimed and is put
-back to LRU in this round.
-This also breaks up PTEs optimization such as CONT-PTE on this large folio
-and may lead to accident folio_split() afterwards.
-And since a part of PTEs are now swap entries, accessing those parts will
-introduce overhead - do_swap_page.
-Although the kernel can withstand all of the above issues, the situation
-still seems quite awkward and warrants making it more ideal.
-The same race also occurs with small folios, but they have only one PTE,
-thus, it won't be possible for them to be partially unmapped.
-This patch [see below] holds PTL from PTE0, allowing us to avoid reading
-PTE values that are in the process of being transformed. With stable PTE
-values, we can ensure that this large folio is either completely reclaimed
-or that all PTEs remain untouched in this round.
-A corner case is that if we hold PTL from PTE0 and most initial PTEs have
-been really unmapped before that, we may increase the duration of holding
-PTL. Thus we only apply this optimization to folios which are still 
entirely
-mapped (not in deferred_split list). 
+  * Diagnosing the issue points to this fix/upstream-commit:
+    [commit message, by Barry Song ]
+    Within try_to_unmap_one(), page_vma_mapped_walk() races with other PTE
+    modifications preceded by pte clear. While iterating over PTEs of a large 
folio,
+    it only starts acquiring PTL from the first valid (present) PTE.
+    PTE modifications can temporarily set PTEs to pte_none.
+    Consequently, the initial PTEs of a large folio might be skipped
+    in try_to_unmap_one().
+    For example, for an anon folio, if we skip PTE0, we may have PTE0 which is
+    still present, while PTE1 ~ PTE(nr_pages - 1) are swap entries after
+    try_to_unmap_one().
+    So folio will be still mapped, the folio fails to be reclaimed and is put
+    back to LRU in this round.
+    This also breaks up PTEs optimization such as CONT-PTE on this large folio
+    and may lead to accident folio_split() afterwards.
+    And since a part of PTEs are now swap entries, accessing those parts will
+    introduce overhead - do_swap_page.
+    Although the kernel can withstand all of the above issues, the situation
+    still seems quite awkward and warrants making it more ideal.
+    The same race also occurs with small folios, but they have only one PTE,
+    thus, it won't be possible for them to be partially unmapped.
+    This patch [see below] holds PTL from PTE0, allowing us to avoid reading
+    PTE values that are in the process of being transformed. With stable PTE
+    values, we can ensure that this large folio is either completely reclaimed
+    or that all PTEs remain untouched in this round.
+    A corner case is that if we hold PTL from PTE0 and most initial PTEs have
+    been really unmapped before that, we may increase the duration of holding
+    PTL. Thus we only apply this optimization to folios which are still 
entirely
+    mapped (not in deferred_split list).
  
  [ Fix ]
  
-  * 73bc32875ee9 73bc32875ee9b1881dd780308c6793fe463fe803
-"mm: hold PTL from the first PTE while reclaiming a large folio"
+  * 73bc32875ee9 73bc32875ee9b1881dd780308c6793fe463fe803
+    "mm: hold PTL from the first PTE while reclaiming a large folio"
  
  [ Test Plan ]
  
-  * An IBM Power 10 system (where PowerVM is mandatory)
-running Ubuntu Server 24.04 (kernel 6.8) or later 
-with (nested) KVM setup (so KVM on top of PowerVM).
+  * An IBM Power 10 system (where PowerVM is mandatory)
+    running Ubuntu Server 24.04 (kernel 6.8) or later
+    with (nested) KVM setu

[Kernel-packages] [Bug 2076147] Re: Add 'mm: hold PTL from the first PTE while reclaiming a large folio' to fix L2 Guest hung during LTP Test

2024-08-07 Thread Frank Heimes

** Summary changed:

- L2 Guest hung during LTP Tests. Back trace of paca->saved_r1 
(0xc000000c1bc8bb00) (possibly stale) @ new_slab
+ Add 'mm: hold PTL from the first PTE while reclaiming a large folio' to fix 
L2 Guest hung during LTP Test

** Summary changed:

- Add 'mm: hold PTL from the first PTE while reclaiming a large folio' to fix 
L2 Guest hung during LTP Test
+ Add 'mm: hold PTL from the first PTE while reclaiming a large folio' to fix 
L2 Guest hang during LTP Test

-- 
https://bugs.launchpad.net/bugs/2076147

Title:
  Add 'mm: hold PTL from the first PTE while reclaiming a large folio'
  to fix L2 Guest hang during LTP Test

Status in The Ubuntu-power-systems project:
  Triaged
Status in linux package in Ubuntu:
  Triaged

Bug description:
  SRU Justification:

  [ Impact ]

   * KVM 2nd level guest (means KVM VM that runs nested on top of a Power 10
     PowerVM hypervisor) hangs during LTP (Linux Test Project) test suite.

   * It hangs with:
     "Back trace of paca->saved_r1 (0xc000000c1bc8bb00) (possibly stale) @ 
new_slab"

   * Diagnosing the issue points to this fix/upstream-commit:
     [commit message, by Barry Song ]
     Within try_to_unmap_one(), page_vma_mapped_walk() races with other PTE
     modifications preceded by pte clear. While iterating over PTEs of a large 
folio,
     it only starts acquiring PTL from the first valid (present) PTE.
     PTE modifications can temporarily set PTEs to pte_none.
     Consequently, the initial PTEs of a large folio might be skipped
     in try_to_unmap_one().
     For example, for an anon folio, if we skip PTE0, we may have PTE0 which is
     still present, while PTE1 ~ PTE(nr_pages - 1) are swap entries after
     try_to_unmap_one().
     So folio will be still mapped, the folio fails to be reclaimed and is put
     back to LRU in this round.
     This also breaks up PTEs optimization such as CONT-PTE on this large folio
     and may lead to accident folio_split() afterwards.
     And since a part of PTEs are now swap entries, accessing those parts will
     introduce overhead - do_swap_page.
     Although the kernel can withstand all of the above issues, the situation
     still seems quite awkward and warrants making it more ideal.
     The same race also occurs with small folios, but they have only one PTE,
     thus, it won't be possible for them to be partially unmapped.
     This patch [see below] holds PTL from PTE0, allowing us to avoid reading
     PTE values that are in the process of being transformed. With stable PTE
     values, we can ensure that this large folio is either completely reclaimed
     or that all PTEs remain untouched in this round.
     A corner case is that if we hold PTL from PTE0 and most initial PTEs have
     been really unmapped before that, we may increase the duration of holding
     PTL. Thus we only apply this optimization to folios which are still 
entirely
     mapped (not in deferred_split list).
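
The race described in the commit message can be pictured with a small toy model. This is a deliberately simplified sketch in Python, not kernel code: PTEs are just strings, the "concurrent modifier" is represented by a set of indices that read as none at the moment the walk starts, and the function name and parameters are made up for illustration.

```python
def try_to_unmap_model(nr_pages, transient_none, hold_ptl_from_first):
    """Toy model of unmapping a large folio; returns the final PTE states."""
    # All PTEs really are present; some are only briefly cleared by another CPU.
    ptes = ["present"] * nr_pages

    if hold_ptl_from_first:
        # Lock taken at PTE0: the walker only ever sees stable values.
        observed = list(ptes)
    else:
        # No lock yet: PTEs that are mid-update read as pte_none.
        observed = ["none" if i in transient_none else state
                    for i, state in enumerate(ptes)]

    # The walk starts (and locking begins) at the first PTE seen as present.
    start = next((i for i, s in enumerate(observed) if s == "present"), nr_pages)

    # From there on the lock is held, so every remaining PTE gets unmapped.
    for i in range(start, nr_pages):
        ptes[i] = "swap"

    # Skipped leading PTEs are restored by the other CPU and stay present.
    return ptes
```

With `try_to_unmap_model(4, {0}, False)` the model leaves PTE0 present while PTE1 to PTE3 become swap entries (the partially unmapped folio from the commit message); with `hold_ptl_from_first=True` all four entries are unmapped in the same round.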

  [ Fix ]

   * 73bc32875ee9 73bc32875ee9b1881dd780308c6793fe463fe803
     "mm: hold PTL from the first PTE while reclaiming a large folio"

  [ Test Plan ]

   * An IBM Power 10 system (where PowerVM is mandatory)
     running Ubuntu Server 24.04 (kernel 6.8) or later
     with (nested) KVM setup (so KVM on top of PowerVM).

   * Run LTP test suite
     Tests running: SLS(io,base)

   * Without the patch the above test will hang with
     Back trace of paca->saved_r1 (0xc000000c1bc8bb00) (possibly stale) @ 
new_slab
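
When checking a captured console log for this hang, something like the following sketch may help. It assumes Python 3 and that the log is available as text; the regular expression is derived only from the one message quoted above, and the function name is made up for illustration.

```python
import re

# Pattern built from the message quoted in this bug; other variants may exist.
HANG_SIGNATURE = re.compile(
    r"Back trace of paca->saved_r1 \((0x[0-9a-fA-F]+)\) \(possibly stale\) @ (\S+)"
)


def find_hang_signatures(log_text):
    """Return (saved_r1, symbol) pairs for each hang message found in the log."""
    # Console or mail wrapping may split the message, so collapse whitespace first.
    flat = " ".join(log_text.split())
    return [(m.group(1), m.group(2)) for m in HANG_SIGNATURE.finditer(flat)]
```

An empty result on a patched kernel after a full LTP run is consistent with the fix working, although it does not prove the absence of other hangs.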

  [ Where problems could occur ]

   * This is a common code change in the memory management sub-system,
     hence great care needs to be taken, even if it was discussed upfront
     at the https://lore.kernel.org/ mailing list and the upstream commit
     provenance shows that many eyes had a look at this.

   * The modification is relatively small with just one if statement
     (across two lines) in mm/vmscan.c.

   * This change is to assist 'try_to_unmap' to acquire page table locks (PTL)
     from the first page table entry (PTE) and to eliminate the influence of
     temporary and volatile PTE values.

   * If done wrong, it can have a negative impact, especially in the case of
     large folios: wrong hints might be given to try_to_unmap,
     which may lead to bad page swapping.

   * In case of an issue with this patch the result can also be decreased
     performance and efficiency in the page table handling - the opposite
     of what the patch is supposed to address.

   * Fortunately several developers had their eyes on this commit,
     as the provenance of the patch and the discussion at lkml shows.

  [ Other Info ]

   * The commit has been upstream since v6.10(-rc1), hence it will be included
     in oracular with the planned target kernel.

  __

  == Comment: #0 - SEETEENA THOUFEEK  - 2024-08-06 
00:20:57 ==
  +++ This bug was initially created as

[Kernel-packages] [Bug 2076147] Re: L2 Guest hung during LTP Tests. Back trace of paca->saved_r1 (0xc000000c1bc8bb00) (possibly stale) @ new_slab

2024-08-07 Thread Frank Heimes

** Changed in: ubuntu-power-systems
   Status: New => Triaged

** Changed in: linux (Ubuntu)
   Status: New => Triaged

-- 
https://bugs.launchpad.net/bugs/2076147

Title:
  L2 Guest hung during LTP Tests. Back trace of paca->saved_r1
  (0xc000000c1bc8bb00) (possibly stale) @ new_slab

Status in The Ubuntu-power-systems project:
  Triaged
Status in linux package in Ubuntu:
  Triaged

Bug description:
  SRU Justification:

  [ Impact ]

   * KVM 2nd level guest (means KVM VM that runs nested on top of a Power 10
 PowerVM hypervisor) hangs during LTP (Linux Test Project) test suite.

   * It hangs with:
 "Back trace of paca->saved_r1 (0xc000000c1bc8bb00) (possibly stale) @ 
new_slab"

   * Diagnosing the issue points to this fix/upstream-commit:
 [commit message, by Barry Song ]
 Within try_to_unmap_one(), page_vma_mapped_walk() races with other PTE
 modifications preceded by pte clear. While iterating over PTEs of a large 
folio,
 it only starts acquiring PTL from the first valid (present) PTE.
 PTE modifications can temporarily set PTEs to pte_none.
 Consequently, the initial PTEs of a large folio might be skipped
 in try_to_unmap_one().
 For example, for an anon folio, if we skip PTE0, we may have PTE0 which is
 still present, while PTE1 ~ PTE(nr_pages - 1) are swap entries after
 try_to_unmap_one().
 So folio will be still mapped, the folio fails to be reclaimed and is put
 back to LRU in this round.
 This also breaks up PTEs optimization such as CONT-PTE on this large folio
 and may lead to accident folio_split() afterwards.
 And since a part of PTEs are now swap entries, accessing those parts will
 introduce overhead - do_swap_page.
 Although the kernel can withstand all of the above issues, the situation
 still seems quite awkward and warrants making it more ideal.
 The same race also occurs with small folios, but they have only one PTE,
 thus, it won't be possible for them to be partially unmapped.
 This patch [see below] holds PTL from PTE0, allowing us to avoid reading
 PTE values that are in the process of being transformed. With stable PTE
 values, we can ensure that this large folio is either completely reclaimed
 or that all PTEs remain untouched in this round.
 A corner case is that if we hold PTL from PTE0 and most initial PTEs have
 been really unmapped before that, we may increase the duration of holding
 PTL. Thus we only apply this optimization to folios which are still 
entirely
 mapped (not in deferred_split list). 

  [ Fix ]

   * 73bc32875ee9 73bc32875ee9b1881dd780308c6793fe463fe803
 "mm: hold PTL from the first PTE while reclaiming a large folio"

  [ Test Plan ]

   * An IBM Power 10 system (where PowerVM is mandatory)
 running Ubuntu Server 24.04 (kernel 6.8) or later 
 with (nested) KVM setup (so KVM on top of PowerVM).

   * Run LTP test suite
 Tests running: SLS(io,base)

   * Without the patch the above test will hang with
 Back trace of paca->saved_r1 (0xc000000c1bc8bb00) (possibly stale) @ 
new_slab

  [ Where problems could occur ]

   * This is a common code change in the memory management sub-system,
 hence great care needs to be taken, even if it was discussed upfront
 at the https://lore.kernel.org/ mailing list and the upstream commit
 provenance shows that many eyes had a look at this.

   * The modification is relatively small with just one if statement
 (across two lines) in mm/vmscan.c.

   * This change is to assist 'try_to_unmap' to acquire page table locks (PTL)
 from the first page table entry (PTE) and to eliminate the influence of
 temporary and volatile PTE values.

   * If done wrong, it can have a negative impact, especially in the case of
 large folios: wrong hints might be given to try_to_unmap,
 which may lead to bad page swapping.

   * In case of an issue with this patch the result can also be decreased
 performance and efficiency in the page table handling - the opposite
 of what the patch is supposed to address.

   * Fortunately several developers had their eyes on this commit,
 as the provenance of the patch and the discussion at lkml shows.

  [ Other Info ]
   
   * The commit has been upstream since v6.10(-rc1), hence it will be included
 in oracular with the planned target kernel.

  __

  == Comment: #0 - SEETEENA THOUFEEK  - 2024-08-06 
00:20:57 ==
  +++ This bug was initially created as a clone of Bug #206372 +++

  ---Problem Description---
  L2 Guest hung during LTP Tests. Back trace of paca->saved_r1 
(0xc000000c1bc8bb00) (possibly stale) @ new_slab (edit)

  ---uname output---
  NA

  ---Additional Hardware Info---
  NA

  Contact Information = na

  ---Debugger Data---
  NA

  ---Patches Installed---
  NA

  ---Steps to R

[Kernel-packages] [Bug 2076147] Re: L2 Guest hung during LTP Tests. Back trace of paca->saved_r1 (0xc000000c1bc8bb00) (possibly stale) @ new_slab

2024-08-07 Thread Frank Heimes

** Description changed:

+ SRU Justification:
+ 
+ [ Impact ]
+ 
+  * KVM 2nd level guest (means KVM VM that runs nested on top of a Power 10
+PowerVM hypervisor) hangs during LTP (Linux Test Project) test suite.
+ 
+  * It hangs with:
+"Back trace of paca->saved_r1 (0xc000000c1bc8bb00) (possibly stale) @ 
new_slab"
+ 
+  * Diagnosing the issue points to this fix/upstream-commit:
+[commit message, by Barry Song ]
+Within try_to_unmap_one(), page_vma_mapped_walk() races with other PTE
+modifications preceded by pte clear. While iterating over PTEs of a large 
folio,
+it only starts acquiring PTL from the first valid (present) PTE.
+PTE modifications can temporarily set PTEs to pte_none.
+Consequently, the initial PTEs of a large folio might be skipped
+in try_to_unmap_one().
+For example, for an anon folio, if we skip PTE0, we may have PTE0 which is
+still present, while PTE1 ~ PTE(nr_pages - 1) are swap entries after
+try_to_unmap_one().
+So folio will be still mapped, the folio fails to be reclaimed and is put
+back to LRU in this round.
+This also breaks up PTEs optimization such as CONT-PTE on this large folio
+and may lead to accident folio_split() afterwards.
+And since a part of PTEs are now swap entries, accessing those parts will
+introduce overhead - do_swap_page.
+Although the kernel can withstand all of the above issues, the situation
+still seems quite awkward and warrants making it more ideal.
+The same race also occurs with small folios, but they have only one PTE,
+thus, it won't be possible for them to be partially unmapped.
+This patch [see below] holds PTL from PTE0, allowing us to avoid reading
+PTE values that are in the process of being transformed. With stable PTE
+values, we can ensure that this large folio is either completely reclaimed
+or that all PTEs remain untouched in this round.
+A corner case is that if we hold PTL from PTE0 and most initial PTEs have
+been really unmapped before that, we may increase the duration of holding
+PTL. Thus we only apply this optimization to folios which are still 
entirely
+mapped (not in deferred_split list). 
+ 
+ [ Fix ]
+ 
+  * 73bc32875ee9 73bc32875ee9b1881dd780308c6793fe463fe803
+"mm: hold PTL from the first PTE while reclaiming a large folio"
+ 
+ [ Test Plan ]
+ 
+  * An IBM Power 10 system (where PowerVM is mandatory)
+running Ubuntu Server 24.04 (kernel 6.8) or later 
+with (nested) KVM setup (so KVM on top of PowerVM).
+ 
+  * Run LTP test suite
+Tests running: SLS(io,base)
+ 
+  * Without the patch the above test will hang with
+Back trace of paca->saved_r1 (0xc000000c1bc8bb00) (possibly stale) @ new_slab
+ 
+ [ Where problems could occur ]
+ 
+  * This is a common code change in the memory management sub-system,
+hence great care needs to be taken, even if it was discussed upfront
+at the https://lore.kernel.org/ mailing list and the upstream commit
+provenance shows that many eyes had a look at this.
+ 
+  * The modification is relatively small with just one if statement
+(across two lines) in mm/vmscan.c.
+ 
+  * This change is to assist 'try_to_unmap' to acquire page table locks (PTL)
+from the first page table entry (PTE) and to eliminate the influence of
+temporary and volatile PTE values.
+ 
+  * If done wrong it can have a negative impact, especially in case of large folios,
+and wrong hints might be given to try_to_unmap,
+which may lead to bad page swapping.
+ 
+  * In case of an issue with this patch the result can also be decreased
+performance and efficiency in the page table handling - the opposite
+of what the patch is supposed to address.
+ 
+  * Fortunately several developers had their eyes on this commit,
+as the provenance of the patch and the discussion on lkml show.
+ 
+ [ Other Info ]
+  
+  * The commit is upstream since v6.10(-rc1), hence it will be included
+in oracular with the planned target kernel.
+ 
+ __
+ 
  == Comment: #0 - SEETEENA THOUFEEK - 2024-08-06 00:20:57 ==
  +++ This bug was initially created as a clone of Bug #206372 +++
  
  ---Problem Description---
  L2 Guest hung during LTP Tests. Back trace of paca->saved_r1 (0xc000000c1bc8bb00) (possibly stale) @ new_slab
  
-  
  ---uname output---
  NA
-  
+ 
  ---Additional Hardware Info---
- NA 
+ NA
  
-  
- Contact Information = na 
-  
+ Contact Information = na
+ 
  ---Debugger Data---
- NA 
-  
+ NA
+ 
  ---Patches Installed---
  NA
-  
+ 
  ---Steps to Reproduce---
-  
+ 
  Tests running: SLS(io,base)
  LPAR Config:
  
  PHYP Environment:  PowerVM
  LPAR Hostname/IP: 10.33.2.107
  Rootvg Filesystem: xfs
  Network Interface: Shiner-T
  vNIC/SR-IOV Config: n/a
  IO Type: SAN
  IO Disk Type: raw
  Multipath Enabled: No
  
-
  DUMP Config:
  ===

[Kernel-packages] [Bug 2076147] Re: L2 Guest hung during LTP Tests. Back trace of paca->saved_r1 (0xc000000c1bc8bb00) (possibly stale) @ new_slab

2024-08-06 Thread Frank Heimes

I picked the commit and started a test build of the patched kernel, which is 
currently building here:
https://launchpad.net/~fheimes/+archive/ubuntu/lp2076147

-- 
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/2076147

Title:
  L2 Guest hung during LTP Tests. Back trace of paca->saved_r1
  (0xc000000c1bc8bb00) (possibly stale) @ new_slab

Status in The Ubuntu-power-systems project:
  New
Status in linux package in Ubuntu:
  New

Bug description:
  == Comment: #0 - SEETEENA THOUFEEK - 2024-08-06 00:20:57 ==
  +++ This bug was initially created as a clone of Bug #206372 +++

  ---Problem Description---
  L2 Guest hung during LTP Tests. Back trace of paca->saved_r1 (0xc000000c1bc8bb00) (possibly stale) @ new_slab

   
  ---uname output---
  NA
   
  ---Additional Hardware Info---
  NA 

   
  Contact Information = na 
   
  ---Debugger Data---
  NA 
   
  ---Patches Installed---
  NA
   
  ---Steps to Reproduce---
   
  Tests running: SLS(io,base)
  LPAR Config:
  
  PHYP Environment:  PowerVM
  LPAR Hostname/IP: 10.33.2.107
  Rootvg Filesystem: xfs
  Network Interface: Shiner-T
  vNIC/SR-IOV Config: n/a
  IO Type: SAN
  IO Disk Type: raw
  Multipath Enabled: No
  
-
  DUMP Config:
  
  KDUMP configured: Yes
  XMON enabled no
  DUMP Available: no
   
  Machine Type = na 

  Userspace rpm: NA 
   
  The userspace tool has the following bit modes: NA 

  Userspace tool obtained from project website:  na 
   
  Userspace tool common name: NA 
   
  *Additional Instructions for na: 
  -Post a private note with access information to the machine that is currently 
in the debugger.
  -Attach ltrace and strace of userspace application.

  
  please include this commit in Ubuntu 24.04

  Upstream commit that solves these data store lockups:
  73bc32875ee9b1881dd780308c6793fe463fe803 mm: hold PTL from the first PTE while reclaiming a large folio

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu-power-systems/+bug/2076147/+subscriptions


-- 
Mailing list: https://launchpad.net/~kernel-packages
Post to : kernel-packages@lists.launchpad.net
Unsubscribe : https://launchpad.net/~kernel-packages
More help   : https://help.launchpad.net/ListHelp

[Kernel-packages] [Bug 2076147] Re: L2 Guest hung during LTP Tests. Back trace of paca->saved_r1 (0xc000000c1bc8bb00) (possibly stale) @ new_slab

2024-08-06 Thread Frank Heimes

Hello and thanks for raising this issue.

Do you have a reproducer for this issue?
Since it's common code (memory management), this needs to be handled with care, 
because it will affect all installations.
Is it correct that you faced this issue while running the LTP test suite?
Could you provide more details, such as whether LTP was running inside a KVM 
guest on P10 (I assume yes), at which test the issue occurred, how you 
invoked LTP, etc.?

Since we are asked to integrate this into 24.04, we have to follow the SRU 
process, which requires that we have this template filled out 
(https://wiki.ubuntu.com/StableReleaseUpdates#SRU_Bug_Template - we can help 
with that), and one section is about a test plan for this.

The level of information here is probably not sufficient to completely
fill out the SRU template.

-- 
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/2076147

Title:
  L2 Guest hung during LTP Tests. Back trace of paca->saved_r1
  (0xc000000c1bc8bb00) (possibly stale) @ new_slab

Status in The Ubuntu-power-systems project:
  New
Status in linux package in Ubuntu:
  New


To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu-power-systems/+bug/2076147/+subscriptions



[Kernel-packages] [Bug 2070253] Re: KVM on PowerVM: L2 Guest-Aggressively entering CEDE results in low performance. Possible tuning opportunity.

2024-08-05 Thread Frank Heimes

https://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git/commit/?id=7be6ce7043b4cf293c8826a48fd9f56931cef2cf
Does not seem to have been tagged upstream for stable updates, hence a manual 
submission is needed.

-- 
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/2070253

Title:
  KVM on PowerVM: L2 Guest-Aggressively entering CEDE results in low
  performance. Possible tuning opportunity.

Status in The Ubuntu-power-systems project:
  New
Status in linux package in Ubuntu:
  New

Bug description:
  KVM on PowerVM: L2 Guest-Aggressively entering CEDE results in low
  performance. Possible tuning opportunity.

  ---uname output---
  Linux rhel86edb1 #1 SMP Sun Jan 21 11:45:44 EST 2024 ppc64le ppc64le ppc64le GNU/Linux
   
  ---Steps to Reproduce---
  Example: run READ only Test using EDB-PGBENCH and DT7 workloads on
   1. L1-Host 
   2. L2-Guest CEDE ON
   3. L2-Guest CEDE OFF

  A significant performance drop is observed in the L2-Guest CEDE ON case
  compared with the L2-Guest CEDE OFF case.

  Note: The host and guest configurations used for the performance
  experiments are listed below.

  Location of EDB-PGBENCH: 
  #wget http://ci-http-results.aus.stglabs.ibm.com/perfTest/scripts/Bug_Scripts/pgbench_install.sh
  #chmod 777 pgbench_install.sh
  #./pgbench_install.sh -->> it will install EDB (pgbench) and run EDB on the target LPAR.

  Location of DT7 workload:

  #wget http://ci-http-results.aus.stglabs.ibm.com/perfTest/scripts/Bug_Scripts/DT7-Install.sh
  #chmod 777 DT7-Install.sh
  #./DT7-Install.sh -->> It will install DT7.

  Sample Commands : Once installation was successful run below commands
  on target lpar.

  EDB-PGBENCH Commands :

  # su - enterprisedb
  # vi t1.tc -->> copy below lines to t1.tc file . 

  ##t1.tc##
  runname=select
  SCALE=100
  runtime=300
  thread="40"
  smtlist="8"
  mode=select
  recreateinstance=yes
  recreateduringrun=yes
  warmup=no
  perf_stat=yes
  PGSQL=/usr/local/pgsql/bin
  #PGSQL=/usr/edb/as14/bin
  #PGPORT=5432
  cores=5
  ##t1.tc##

  #cp t1.tc tc/
  #./auto-run-test.sh

  DT7 Commands :

  After installation of DT7 run below command :
  #cd /root
  #./DayTrader7_Run.sh -u 20 -l 900 -i 2  

  ##
  Machine Type: Power 10  LPAR (RHEL9.3)
  gcc   : 11.4.1
  Memory: 300GB
  Test type : pgbench-edb, DT7
  ##
  KVM Host lscpu output : 

  # lscpu
  Architecture:            ppc64le
    Byte Order:            Little Endian
  CPU(s):                  96
    On-line CPU(s) list:   0-39
    Off-line CPU(s) list:  40-95
  Model name:              POWER10 (architected), altivec supported
    Model:                 2.0 (pvr 0080 0200)
    Thread(s) per core:    8
    Core(s) per socket:    5
    Socket(s):             1
    Physical sockets:      1
    Physical chips:        4
    Physical cores/chip:   12
  Virtualization features:
    Hypervisor vendor:     pHyp
    Virtualization type:   para
  Caches (sum of all):
    L1d:                   320 KiB (10 instances)
    L1i:                   480 KiB (10 instances)
    L2:                    10 MiB (10 instances)
    L3:                    40 MiB (10 instances)
  NUMA:
    NUMA node(s):          1
    NUMA node2 CPU(s):     0-39
  Vulnerabilities:
    Gather data sampling:  Not affected
    Itlb multihit:         Not affected
    L1tf:                  Not affected
    Mds:                   Not affected
    Meltdown:              Not affected
    Mmio stale data:       Not affected
    Retbleed:              Not affected
    Spec rstack overflow:  Not affected
    Spec store bypass:     Not affected
    Spectre v1:            Vulnerable, ori31 speculation barrier enabled
    Spectre v2:            Vulnerable
    Srbds:                 Not affected
    Tsx async abort:       Not affected

  
  ##

  KVM on PowerVM setup:

  KVM (Kernel Virtual Machine) is a virtualization module for Linux that
  provides the ability of virtualization to Linux i.e. it allows the
  kernel to function as a hypervisor.

  We used P10 2S4U system for this experiment.

  Workloads: DT7 and PGBENCH in details:

  DT7 is an open source benchmark application emulating an online stock trading system.
  DT7 consists of 3 components:
  1) Jmeter 
  2) WAS (WebSphere Application Server)
  3) DB2

  The DayTrader benchmark application is deployed on WAS and uses DB2
  as its backing database. Jmeter generates the requests and interacts
  with WAS, which acts as a kind of middleware.

  PGBENCH : 
  pgbench is a simple program for running benchmark tests on PostgreSQL. It 
runs the same sequence of SQL commands over and over, possibly in multiple 
concurrent database sessions, and then calculates the average transaction rate 
(transactio

[Kernel-packages] [Bug 2076147] Re: L2 Guest hung during LTP Tests. Back trace of paca->saved_r1 (0xc000000c1bc8bb00) (possibly stale) @ new_slab

2024-08-05 Thread Frank Heimes

** Package changed: kernel-package (Ubuntu) => linux (Ubuntu)

** Also affects: ubuntu-power-systems
   Importance: Undecided
   Status: New

** Changed in: ubuntu-power-systems
 Assignee: (unassigned) => Ubuntu on IBM Power Systems Bug Triage 
(ubuntu-power-triage)

** Changed in: linux (Ubuntu)
   Importance: Undecided => High

** Changed in: ubuntu-power-systems
   Importance: Undecided => High

-- 
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/2076147

Title:
  L2 Guest hung during LTP Tests. Back trace of paca->saved_r1
  (0xc000000c1bc8bb00) (possibly stale) @ new_slab

Status in The Ubuntu-power-systems project:
  New
Status in linux package in Ubuntu:
  New


To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu-power-systems/+bug/2076147/+subscriptions



[Kernel-packages] [Bug 2075721] Re: [Ubuntu24.04] virsh detach-interface is crashing the guest

2024-08-05 Thread Frank Heimes

*** This bug is a duplicate of bug 2074376 ***
https://bugs.launchpad.net/bugs/2074376

I just noticed that the Canonical kernel team already has a Launchpad bug open 
on this and has already started to work on it (it is meanwhile Fix Committed for 
noble/24.04 and oracular/24.10).
So I'm marking this bug as a duplicate of the kernel team's Launchpad bug: 
https://bugs.launchpad.net/bugs/2074376

I'll keep this bug's status updated and aligned with LP#2074376,
so that the synced IBM BZ entry will be updated accordingly.



** This bug has been marked a duplicate of bug 2074376
   Disable PCI_DYNAMIC_OF_NODES in Ubuntu

-- 
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/2075721

Title:
  [Ubuntu24.04] virsh detach-interface is crashing the guest

Status in The Ubuntu-power-systems project:
  New
Status in linux package in Ubuntu:
  New

Bug description:
  == Comment: #0 - Kowshik Jois B S - 2024-05-28 01:07:02 ==
  ---Problem Description---
  While trying virsh attach-interface and virsh detach-interface, it is observed
  that attaching an interface is successful, but trying to detach the same
  interface results in a guest crash with the trace messages below on the console.

  
  root@ubuntulp3guest1:~# [ 5363.726428] Kernel attempted to read user page 
(10ec0058) - exploit attempt? (uid: 0)
  [ 5363.726570] BUG: Unable to handle kernel data access on read at 
0x10ec0058
  [ 5363.726662] Faulting instruction address: 0xc12d4828
  [ 5363.726739] Oops: Kernel access of bad area, sig: 11 [#1]
  [ 5363.726800] LE PAGE_SIZE=64K MMU=Radix SMP NR_CPUS=2048 NUMA pSeries
  [ 5363.726880] Modules linked in: 8139too 8139cp mii qrtr cfg80211 
binfmt_misc uio_pdrv_genirq vmx_crypto uio dm_multipath nfnetlink ip_tables 
x_tables autofs4 btrfs blake2b_generic raid10 raid456 async_raid6_recov 
async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c raid1 raid0 
poly1305_p10_crypto chacha_p10_crypto libchacha crct10dif_vpmsum crc32c_vpmsum 
xhci_pci xhci_pci_renesas aes_gcm_p10_crypto
  [ 5363.727302] CPU: 0 PID: 1614 Comm: drmgr Not tainted 6.8.0-31-generic 
#31-Ubuntu
  [ 5363.727426] Hardware name: IBM pSeries (emulated by qemu) POWER10 (raw) 
0x800200 0xf06 of:SLOF,HEAD hv:linux,kvm pSeries
  [ 5363.727563] NIP:  c12d4828 LR: c12d68f0 CTR: 

  [ 5363.727653] REGS: c000149cb440 TRAP: 0300   Not tainted  
(6.8.0-31-generic)
  [ 5363.727742] MSR:  8280b033   CR: 
44088282  XER: 2004
  [ 5363.727855] CFAR: c12d68ec DAR: 10ec0058 DSISR: 4000 
IRQMASK: 0 
  [ 5363.727855] GPR00: c12d68f0 c000149cb6e0 c2254800 
10ec0048 
  [ 5363.727855] GPR04: c000149cb748   
 
  [ 5363.727855] GPR08:    
 
  [ 5363.727855] GPR12:  c3e8  
 
  [ 5363.727855] GPR16:    
 
  [ 5363.727855] GPR20:    
 
  [ 5363.727855] GPR24:   c48585a0 
c000149cb7d4 
  [ 5363.727855] GPR28: 0001 c00014de9400 10ec0048 
 
  [ 5363.728644] NIP [c12d4828] __of_changeset_entry_invert+0x10/0x1ac
  [ 5363.728732] LR [c12d68f0] __of_changeset_revert_entries+0x98/0x180
  [ 5363.728813] Call Trace:
  [ 5363.728845] [c000149cb7b0] [c12d6b60] 
of_changeset_revert+0x58/0xd8
  [ 5363.728937] [c000149cb800] [c0d0d498] 
of_pci_remove_node+0x74/0xb0
  [ 5363.729029] [c000149cb830] [c0cdbde0] 
pci_stop_bus_device+0xf4/0x138
  [ 5363.729126] [c000149cb870] [c0cdbf40] 
pci_stop_and_remove_bus_device_locked+0x34/0x64
  [ 5363.729232] [c000149cb8a0] [c0cf2950] remove_store+0xf0/0x108
  [ 5363.729311] [c000149cb8f0] [c0e88384] dev_attr_store+0x34/0x78
  [ 5363.729389] [c000149cb910] [c07f8234] sysfs_kf_write+0x70/0xa4
  [ 5363.729467] [c000149cb930] [c07f66a8] 
kernfs_fop_write_iter+0x1d0/0x2e0
  [ 5363.729558] [c000149cb980] [c06c8fc8] vfs_write+0x27c/0x558
  [ 5363.729639] [c000149cba30] [c06c9628] ksys_write+0x90/0x170
  [ 5363.729716] [c000149cba80] [c0033248] 
system_call_exception+0xf8/0x290
  [ 5363.729811] [c000149cbe50] [c000d05c] 
system_call_vectored_common+0x15c/0x2ec
  [ 5363.729903] --- interrupt: 3000 at 0x74191e15c720
  [ 5363.729964] NIP:  74191e15c720 LR: 74191e15c720 CTR: 

  [ 5363.730053] REGS: c000149cbe80 TRAP: 3000   Not tainted  
(6.8.0-31-generic)
  [ 5363.730143] MSR:  8280f033   
CR: 48088202  XER: 
  [ 5363.730257] IRQMASK: 0 
  [ 5363.730257] GPR00: 000

[Kernel-packages] [Bug 2074380] Re: [UBUNTU 22.04] kernel: s390/cpum_cf: make crypto counters upward compatible

2024-08-05 Thread Frank Heimes

** Description changed:

  SRU Justification:
  
  [ Impact ]
  
-  * The CPU Measurement Facility (CPU MF) crypto counter set
-is not listed in the device sysfs tree - it's not exported
-in the sysfs directory /sys/devices/cpum_cf/events.
+  * The CPU Measurement Facility (CPU MF) crypto counter set
+    is not listed in the device sysfs tree - it's not exported
+    in the sysfs directory /sys/devices/cpum_cf/events.
  
-  * The attribute files for each CPU-MF counter defined
-in the crypto counter set are missing.
+  * The attribute files for each CPU-MF counter defined
+    in the crypto counter set are missing.
  
-  * This is caused by the counter second version number of CPU MF
-hardware being incremented on new machines.
+  * This is caused by the counter second version number of CPU MF
+    hardware being incremented on new machines.
  
-  * This causes a sanity check to fail,
-but the counters are supported by hardware.
+  * This causes a sanity check to fail,
+    but the counters are supported by hardware.
  
-  * The solution is to remove the upper limit in counter second
-version number check.
+  * The solution is to remove the upper limit in counter second
+    version number check.
  
  [ Fix ]
  
-  * f10933cbd2df f10933cbd2dfddf6273698a45f76db9bafd8150f "s390/cpum_cf:
- make crypto counters upward compatible across machine types"
+  * f10933cbd2df f10933cbd2dfddf6273698a45f76db9bafd8150f
+"s390/cpum_cf: make crypto counters upward compatible across machine types"
  
-  * The fix was upstream accepted with kernel v6.10(-rc1).
+  * The fix was upstream accepted with kernel v6.10(-rc1).
+ 
+  * Upstream commit applies cleanly on noble master-next, 
+but needed to be backported to jammy master-next due to different code
+and context in kernel 5.15.
  
  [ Test Plan ]
  
-  * Run the following commands on a new machine generation:
-(hence only doable by IBM)
-# ls -l /sys/devices/cpum_cf/events/ | grep AES
-
-  * If the output is empty then this patch is required.
+  * Run the following commands on a new machine generation:
+    (hence only doable by IBM)
+    # ls -l /sys/devices/cpum_cf/events/ | grep AES
  
-  * With a patched kernel the output should be like:
-# ls /sys/devices/cpum_cf/events/ | grep AES
-AES_BLOCKED_CYCLES
-AES_BLOCKED_FUNCTIONS
-AES_CYCLES
-AES_FUNCTIONS
+  * If the output is empty then this patch is required.
+ 
+  * With a patched kernel the output should be like:
+    # ls /sys/devices/cpum_cf/events/ | grep AES
+    AES_BLOCKED_CYCLES
+    AES_BLOCKED_FUNCTIONS
+    AES_CYCLES
+    AES_FUNCTIONS
  
  [ Where problems could occur ]
  
-  * This affects s390x only - CPU MF is s390-specific,
-and only s390 specific code is modified.
+  * This affects s390x only - CPU MF is s390-specific,
+    and only s390 specific code is modified.
  
-  * And it furthermore is limited to the crypto counter set
-of CPU MF.
+  * And it furthermore is limited to the crypto counter set
+    of CPU MF.
  
-  * So any impact is likely limited to hardware crypto counters
-on s390x only.
+  * So any impact is likely limited to hardware crypto counters
+    on s390x only.
  
-  * In s390/kernel/perf_cpum_cf.c the else if case got changed from
-explicitly checking for 6 or 7 to >= 6 which seems to require
-attention for future 8 and more cases.
+  * In s390/kernel/perf_cpum_cf.c the else if case got changed from
+    explicitly checking for 6 or 7 to >= 6 which seems to require
+    attention for future 8 and more cases.
  
-  * In s390/kernel/perf_cpum_cf_events.c the switch (ci.csvn) statement
-was changed to an if / else if with similar logic.
-Again, attention is needed for any potential future cases >= 8.
+  * In s390/kernel/perf_cpum_cf_events.c the switch (ci.csvn) statement
+    was changed to an if / else if with similar logic.
+    Again, attention is needed for any potential future cases >= 8.
  
-  * It does not look like currently used cases (1..5 and 6..7)
-are affected by the modification, just >7.
+  * It does not look like currently used cases (1..5 and 6..7)
+    are affected by the modification, just >7.
  
-  * Test builds of patched jammy and noble s390x kernels were built
-and are available here:
-https://launchpad.net/~fheimes/+archive/ubuntu/lp2074380
+  * Test builds of patched jammy and noble s390x kernels were built
+    and are available here:
+    https://launchpad.net/~fheimes/+archive/ubuntu/lp2074380
  
  [ Other Info ]
-  
-  * Since the code/fix was upstream accepted with kernel v6.10(-rc1)
-it does not affect the current development release oracular.
  
-  * This SRU can also be seen under the umbrella of new
- hardware enablement.
+  * Since the code/fix was upstream accepted with kernel v6.10(-rc1)
+    it does not affect the current development release oracular.
  
-  * Since it requires special hw, the verification needs to be
-done by IBM.
+  * This SRU can also

[Kernel-packages] [Bug 2074380] Re: [UBUNTU 22.04] kernel: s390/cpum_cf: make crypto counters upward compatible

2024-08-05 Thread Frank Heimes

The commit applies fine to the noble master-next tree (kernel 6.8),
and a test build was triggered here: 
https://launchpad.net/~fheimes/+archive/ubuntu/lp2074380

However, the cpumf code in jammy master-next (kernel 5.15) is quite different.
git blame tells me that probably the following commits are needed as well:
1a33aee1dc24 s390/cpum_cf: remove function validate_ctr_auth() by inline code
9ae9b868aeaa s390/cpum_cf: provide counter number to validate_ctr_version()
46c4d945ea1f s390/cpum_cf: introduce static CPU counter facility information
but they (first of all 46c4d945ea1f) also do not apply cleanly.

Would you please let us know a minimal set of patches (based on your insights 
into the code) that lets us apply f10933cbd2df to the jammy master-next tree
(git clone git://git.launchpad.net/~ubuntu-kernel/ubuntu/+source/linux/+git/jammy --branch master-next --single-branch),
or alternatively provide a backport of f10933cbd2df to this tree?

-- 
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/2074380

Title:
  [UBUNTU 22.04] kernel: s390/cpum_cf: make crypto counters upward
  compatible

Status in Ubuntu on IBM z Systems:
  New
Status in linux package in Ubuntu:
  Fix Released
Status in linux source package in Jammy:
  New
Status in linux source package in Noble:
  New

Bug description:
  Description:   kernel: s390/cpum_cf: make crypto counters upward
  compatible

  Symptom:   The CPU Measurement facility crypto counter set is not
 listed in the device sysfs tree.  

  Problem:   The CPU Measurement facility crypto counter set is not
 exported in the sysfs directory
 /sys/devices/cpum_cf/events.
 The attribute files for each CPU-MF counter defined
 in the crypto counter set are missing. This is caused
 by the counter second version number of the CPU
 Measurement Facility hardware being incremented on
 new machines.  This causes a sanity check to fail,
 but the counters are supported by hardware.

  Solution:  Remove upper limit in counter second version number
 check.
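
The effect of removing the upper limit can be sketched as follows. This is a minimal Python sketch, not the kernel's C code: the function names and the returned labels are hypothetical, and the version ranges 1..5 and 6..7 are taken from the SRU description of this bug. How the kernel maps the counter second version number to the actual counter definitions is not shown here.

```python
def crypto_set_old(csvn):
    """Check with an explicit upper limit (behaviour before the fix, as described)."""
    if 1 <= csvn <= 5:
        return "csvn 1..5 counters"
    if csvn in (6, 7):
        return "csvn 6..7 counters"
    return None  # sanity check fails, no crypto counters are exported


def crypto_set_new(csvn):
    """Check without an upper limit: versions above 7 take the same branch as 6..7."""
    if 1 <= csvn <= 5:
        return "csvn 1..5 counters"
    if csvn >= 6:
        return "csvn 6..7 counters"
    return None


for csvn in (5, 7, 8):
    print(csvn, crypto_set_old(csvn), crypto_set_new(csvn))
```

For versions 1..7 both variants agree, so only a machine reporting a higher version sees a difference: the old check rejects it, the new one treats it like 6..7.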

  Reproduction:  Run command on a new machine generation:
  # ls -l /sys/devices/cpum_cf/events/ | grep AES
  # 
 If the output is empty, then this patch is required.
 The output should be:
  # ls  /sys/devices/cpum_cf/events/ | grep AES
  AES_BLOCKED_CYCLES
  AES_BLOCKED_FUNCTIONS
  AES_CYCLES
  AES_FUNCTIONS
  #
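
The reproduction check above can also be scripted. The following is a small Python sketch that does roughly what `ls /sys/devices/cpum_cf/events/ | grep AES` does; the function name aes_counters and its events_dir parameter are made up for illustration, while the sysfs path is the one from this report.

```python
from pathlib import Path


def aes_counters(events_dir="/sys/devices/cpum_cf/events"):
    """Return the sorted names of event attributes containing 'AES'."""
    directory = Path(events_dir)
    if not directory.is_dir():
        # Directory does not exist, e.g. when not running on s390x.
        return []
    return sorted(p.name for p in directory.iterdir() if "AES" in p.name)


if __name__ == "__main__":
    names = aes_counters()
    if names:
        print("\n".join(names))
    else:
        print("no AES counters exported")
```

An empty result on a new machine generation corresponds to the empty grep output described above.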

  Upstream-ID of fix:   f10933cbd2dfddf6273698a45f76db9bafd8150f

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu-z-systems/+bug/2074380/+subscriptions



[Kernel-packages] [Bug 2075721] Re: [Ubuntu24.04] virsh detach-interface is crashing the guest

2024-08-05 Thread Frank Heimes

** Package changed: ubuntu => linux

** Project changed: linux => linux (Ubuntu)

-- 
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/2075721

Title:
  [Ubuntu24.04] virsh detach-interface is crashing the guest

Status in The Ubuntu-power-systems project:
  New
Status in linux package in Ubuntu:
  New

Bug description:
  == Comment: #0 - Kowshik Jois B S - 2024-05-28 01:07:02 ==
  ---Problem Description---
  While trying virsh attach-interface and virsh detach-interface, it is observed
  that attaching an interface is successful, but trying to detach the same
  interface results in a guest crash with the trace messages below on the console.

  
  root@ubuntulp3guest1:~# [ 5363.726428] Kernel attempted to read user page 
(10ec0058) - exploit attempt? (uid: 0)
  [ 5363.726570] BUG: Unable to handle kernel data access on read at 
0x10ec0058
  [ 5363.726662] Faulting instruction address: 0xc12d4828
  [ 5363.726739] Oops: Kernel access of bad area, sig: 11 [#1]
  [ 5363.726800] LE PAGE_SIZE=64K MMU=Radix SMP NR_CPUS=2048 NUMA pSeries
  [ 5363.726880] Modules linked in: 8139too 8139cp mii qrtr cfg80211 
binfmt_misc uio_pdrv_genirq vmx_crypto uio dm_multipath nfnetlink ip_tables 
x_tables autofs4 btrfs blake2b_generic raid10 raid456 async_raid6_recov 
async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c raid1 raid0 
poly1305_p10_crypto chacha_p10_crypto libchacha crct10dif_vpmsum crc32c_vpmsum 
xhci_pci xhci_pci_renesas aes_gcm_p10_crypto
  [ 5363.727302] CPU: 0 PID: 1614 Comm: drmgr Not tainted 6.8.0-31-generic 
#31-Ubuntu
  [ 5363.727426] Hardware name: IBM pSeries (emulated by qemu) POWER10 (raw) 
0x800200 0xf06 of:SLOF,HEAD hv:linux,kvm pSeries
  [ 5363.727563] NIP:  c12d4828 LR: c12d68f0 CTR: 

  [ 5363.727653] REGS: c000149cb440 TRAP: 0300   Not tainted  
(6.8.0-31-generic)
  [ 5363.727742] MSR:  8280b033   CR: 
44088282  XER: 2004
  [ 5363.727855] CFAR: c12d68ec DAR: 10ec0058 DSISR: 4000 
IRQMASK: 0 
  [ 5363.727855] GPR00: c12d68f0 c000149cb6e0 c2254800 
10ec0048 
  [ 5363.727855] GPR04: c000149cb748   
 
  [ 5363.727855] GPR08:    
 
  [ 5363.727855] GPR12:  c3e8  
 
  [ 5363.727855] GPR16:    
 
  [ 5363.727855] GPR20:    
 
  [ 5363.727855] GPR24:   c48585a0 
c000149cb7d4 
  [ 5363.727855] GPR28: 0001 c00014de9400 10ec0048 
 
  [ 5363.728644] NIP [c12d4828] __of_changeset_entry_invert+0x10/0x1ac
  [ 5363.728732] LR [c12d68f0] __of_changeset_revert_entries+0x98/0x180
  [ 5363.728813] Call Trace:
  [ 5363.728845] [c000149cb7b0] [c12d6b60] 
of_changeset_revert+0x58/0xd8
  [ 5363.728937] [c000149cb800] [c0d0d498] 
of_pci_remove_node+0x74/0xb0
  [ 5363.729029] [c000149cb830] [c0cdbde0] 
pci_stop_bus_device+0xf4/0x138
  [ 5363.729126] [c000149cb870] [c0cdbf40] 
pci_stop_and_remove_bus_device_locked+0x34/0x64
  [ 5363.729232] [c000149cb8a0] [c0cf2950] remove_store+0xf0/0x108
  [ 5363.729311] [c000149cb8f0] [c0e88384] dev_attr_store+0x34/0x78
  [ 5363.729389] [c000149cb910] [c07f8234] sysfs_kf_write+0x70/0xa4
  [ 5363.729467] [c000149cb930] [c07f66a8] 
kernfs_fop_write_iter+0x1d0/0x2e0
  [ 5363.729558] [c000149cb980] [c06c8fc8] vfs_write+0x27c/0x558
  [ 5363.729639] [c000149cba30] [c06c9628] ksys_write+0x90/0x170
  [ 5363.729716] [c000149cba80] [c0033248] 
system_call_exception+0xf8/0x290
  [ 5363.729811] [c000149cbe50] [c000d05c] 
system_call_vectored_common+0x15c/0x2ec
  [ 5363.729903] --- interrupt: 3000 at 0x74191e15c720
  [ 5363.729964] NIP:  74191e15c720 LR: 74191e15c720 CTR: 

  [ 5363.730053] REGS: c000149cbe80 TRAP: 3000   Not tainted  
(6.8.0-31-generic)
  [ 5363.730143] MSR:  8280f033   
CR: 48088202  XER: 
  [ 5363.730257] IRQMASK: 0 
  [ 5363.730257] GPR00: 0004 7bdfb730 74191e296d00 
000b 
  [ 5363.730257] GPR04: 0be4ed58d640 0001  
0031 
  [ 5363.730257] GPR08:    
 
  [ 5363.730257] GPR12:  74191e3eb300  
 
  [ 5363.730257] GPR16:  0be4b90f2de0 0be4b90f0298 
0be4b90f2da0 
  [ 5363.730257] GPR20: 0be4b90f11b8 0be4b90eff08 7bdfb910 
0be4b90f2220 
  [ 5363.7

[Kernel-packages] [Bug 2075575] Re: kexec fails in LPAR when some cpus are disabled

2024-08-05 Thread Frank Heimes

Hello Seeteena, thanks for reporting this.

The referenced commit is upstream accepted in kernel v6.10(-rc7) and
v6.11(-rc1), hence it will be included in the planned target kernel for
'oracular' / 24.10.

I am also glad to see that the commit was tagged upstream as a stable update
for kernel v5.9 and newer
(Fixes: 7fa95f9adaee ("powerpc/64s: system call support for scv/rfscv
instructions")
Cc: sta...@vger.kernel.org # v5.9+).
That means it will be picked up automatically by the Canonical kernel teams'
" update: v upstream stable release" process, and with
that it will find its way into noble/24.04 kernel 6.8 and jammy/22.04 kernel 5.15.
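
As a rough illustration of why both kernels qualify, the sketch below checks a kernel version against the "# v5.9+" annotation quoted above. The helper names are hypothetical, and this is not how the stable process itself selects patches; it only models the version comparison.

```python
import re


def parse_version(text):
    """Turn 'v5.9', '6.8' or '6.8.0-31-generic' into a (major, minor) tuple."""
    match = re.match(r"v?(\d+)\.(\d+)", text)
    if match is None:
        raise ValueError(f"not a kernel version: {text!r}")
    return int(match.group(1)), int(match.group(2))


def covered_by_stable_tag(kernel, tag):
    """Return True if `kernel` is at or above the minimum version named
    in a '# vX.Y+' annotation (hypothetical helper, for illustration)."""
    match = re.search(r"#\s*(v\d+\.\d+)\+", tag)
    if match is None:
        raise ValueError(f"no '# vX.Y+' annotation in: {tag!r}")
    return parse_version(kernel) >= parse_version(match.group(1))


# The tag line as quoted in this mail (address obfuscated by the archive).
tag = "Cc: sta...@vger.kernel.org # v5.9+"
for series, kernel in [("jammy", "5.15"), ("noble", "6.8")]:
    print(series, kernel, covered_by_stable_tag(kernel, tag))
```

Versions are compared as integer tuples on purpose: a plain string comparison would wrongly sort "5.15" before "5.9".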

We'll use this LP bug for tracking ...

** Also affects: linux (Ubuntu Noble)
   Importance: Undecided
   Status: New

** Also affects: linux (Ubuntu Jammy)
   Importance: Undecided
   Status: New

** Also affects: linux (Ubuntu Oracular)
   Importance: High
 Assignee: Ubuntu on IBM Power Systems Bug Triage (ubuntu-power-triage)
   Status: New

** Changed in: ubuntu-power-systems
   Status: New => Triaged

** Changed in: linux (Ubuntu Oracular)
   Status: New => In Progress

** Changed in: linux (Ubuntu Jammy)
   Status: New => Triaged

** Changed in: linux (Ubuntu Noble)
   Status: New => Triaged

-- 
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/2075575

Title:
  kexec fails in LPAR when some cpus are disabled

Status in The Ubuntu-power-systems project:
  Triaged
Status in linux package in Ubuntu:
  In Progress
Status in linux source package in Jammy:
  Triaged
Status in linux source package in Noble:
  Triaged
Status in linux source package in Oracular:
  In Progress

Bug description:
  == Comment: #0 - SEETEENA THOUFEEK  - 2024-08-02 03:11:31 ==
  +++ This bug was initially created as a clone of Bug #206083 +++

  ---Problem Description---
  kexec fails in LPAR when some cpus are disabled
   
  Contact Information = sthou...@in.ibm.com 
   
  Machine Type = na 
   
  ---uname output---
  na
   
  ---Debugger---
  A debugger is not configured
   
  ---Steps to Reproduce---
   Summary:
  At L1 level, kexec fails if some of the cpus in the machine are disabled.

  
  Distros and kernel  versions used:
  1. Distro versions used

a. L1 LPAR :

b. L2 :

  
  Repro steps:
  1. Boot into an L1 lpar
  2. Disable some cpus (eg: ppc64_cpu --cores-on=3)
  3. Try to kexec. 

  
  This bug is reproducible only when we load the target kernel/initrd and use 
"kexec -e" as follows:

  kexec -l --initrd initramfs-$(uname -r).img vmlinuz-$(uname -r)
  --append="$(cat /proc/cmdline)"

  kexec -e

  
  kexec works fine if we do a normal kexec without skipping the shutdown path

  kexec --initrd initramfs-$(uname -r).img vmlinuz-$(uname -r)
  --append="$(cat /proc/cmdline)"
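
Before trying the steps above, it can help to confirm that some CPUs really are offline. The following is a minimal sketch, assuming that disabled cores show up as CPUs listed in /sys/devices/system/cpu/present but not in /sys/devices/system/cpu/online; the function names and the sample values are made up for illustration.

```python
def parse_cpulist(text):
    """Expand a sysfs CPU list such as '0-3,8,10-11' into a set of ints."""
    cpus = set()
    for part in text.strip().split(","):
        if not part:
            continue  # an empty list yields one empty field
        first, _, last = part.partition("-")
        cpus.update(range(int(first), int(last or first) + 1))
    return cpus


def offline_cpus(present_text, online_text):
    """Return the sorted CPUs that are present but not online."""
    return sorted(parse_cpulist(present_text) - parse_cpulist(online_text))


def read_offline_cpus(base="/sys/devices/system/cpu"):
    """Read both lists from sysfs; only meaningful on a Linux system."""
    with open(f"{base}/present") as present, open(f"{base}/online") as online:
        return offline_cpus(present.read(), online.read())


# Made-up sample values, not taken from the reporter's LPAR.
print(offline_cpus("0-31", "0-23"))
```

If the result is an empty list, all present CPUs are online and the failing "kexec -l" plus "kexec -e" sequence would not be expected to hit this bug.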

  
  Fix is upstream now:
  
https://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux.git/commit/?id=21a741eb75f80397e5f7d3739e24d7d75e619011

  Thanks,
  Sourabh Jain

  please include in Ubuntu

   
  Oops output:
   no
   
  Stack trace output:
   no
   
  System Dump Info:
The system is not configured to capture a system dump.
   
  *Additional Instructions for sthou...@in.ibm.com: 
  -Attach sysctl -a output to the bug.

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu-power-systems/+bug/2075575/+subscriptions


-- 
Mailing list: https://launchpad.net/~kernel-packages
Post to : kernel-packages@lists.launchpad.net
Unsubscribe : https://launchpad.net/~kernel-packages
More help   : https://help.launchpad.net/ListHelp

[Kernel-packages] [Bug 2074380] Re: [UBUNTU 22.04] kernel: s390/cpum_cf: make crypto counters upward compatible

2024-08-05 Thread Frank Heimes

Hello Thomas, thank you!
It was also my impression that a backport is probably best.
And the backport applies cleanly on jammy master-next (so yes, that is what we 
have to use for the next and upcoming kernel of an Ubuntu release).

Proceeding now with the kernel SRU ...

-- 
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/2074380

Title:
  [UBUNTU 22.04] kernel: s390/cpum_cf: make crypto counters upward
  compatible

Status in Ubuntu on IBM z Systems:
  New
Status in linux package in Ubuntu:
  Fix Released
Status in linux source package in Jammy:
  New
Status in linux source package in Noble:
  New

Bug description:
  Description:   kernel: s390/cpum_cf: make crypto counters upward
  compatible

  Symptom:   The CPU Measurement facility crypto counter set is not
 listed in the device sysfs tree.  

  Problem:   The CPU Measurement facility crypto counter set is not
 exported in the sysfs directory
 /sys/devices/cpum_cf/events.
 The attribute files for each CPU-MF counter defined
 in the crypto counter set are missing. This is caused
 by the counter second version number of the CPU
 Measurement Facility hardware being incremented on
 new machines.  This causes a sanity check to fail,
 but the counters are supported by hardware.

  Solution:  Remove upper limit in counter second version number
 check.

  Reproduction:  Run command on a new machine generation:
  # ls -l /sys/devices/cpum_cf/events/ | grep AES
  # 
 If the output is empty then this patch is required.
 The output should be:
  # ls  /sys/devices/cpum_cf/events/ | grep AES
  AES_BLOCKED_CYCLES
  AES_BLOCKED_FUNCTIONS
  AES_CYCLES
  AES_FUNCTIONS
  #

  Upstream-ID of fix:   f10933cbd2dfddf6273698a45f76db9bafd8150f

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu-z-systems/+bug/2074380/+subscriptions


-- 
Mailing list: https://launchpad.net/~kernel-packages
Post to : kernel-packages@lists.launchpad.net
Unsubscribe : https://launchpad.net/~kernel-packages
More help   : https://help.launchpad.net/ListHelp

[Kernel-packages] [Bug 2074380] Re: [UBUNTU 22.04] kernel: s390/cpum_cf: make crypto counters upward compatible

2024-08-05 Thread Frank Heimes

evices/cpum_cf/events/ | grep AES
+ AES_BLOCKED_CYCLES
+ AES_BLOCKED_FUNCTIONS
+ AES_CYCLES
+ AES_FUNCTIONS
+ #
  
  Upstream-ID of fix:   f10933cbd2dfddf6273698a45f76db9bafd8150f

** Changed in: linux (Ubuntu Jammy)
     Assignee: (unassigned) => Frank Heimes (fheimes)

** Changed in: linux (Ubuntu Noble)
 Assignee: (unassigned) => Frank Heimes (fheimes)

-- 
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/2074380

Title:
  [UBUNTU 22.04] kernel: s390/cpum_cf: make crypto counters upward
  compatible

Status in Ubuntu on IBM z Systems:
  In Progress
Status in linux package in Ubuntu:
  Fix Released
Status in linux source package in Jammy:
  In Progress
Status in linux source package in Noble:
  In Progress

Bug description:
  SRU Justification:

  [ Impact ]

   * The CPU Measurement Facility (CPU MF) crypto counter set
 is not listed in the device sysfs tree - it's not exported
 in the sysfs directory /sys/devices/cpum_cf/events.

   * The attribute files for each CPU-MF counter defined
 in the crypto counter set are missing.

   * This is caused by the counter second version number of CPU MF
 hardware being incremented on new machines.

   * This causes a sanity check to fail,
 but the counters are supported by hardware.

   * The solution is to remove the upper limit in counter second
 version number check.

  [ Fix ]

   * f10933cbd2df f10933cbd2dfddf6273698a45f76db9bafd8150f
  "s390/cpum_cf: make crypto counters upward compatible across machine
  types"

   * The fix was upstream accepted with kernel v6.10(-rc1).

  [ Test Plan ]

   * Run the following commands on a new machine generation:
 (hence only doable by IBM)
 # ls -l /sys/devices/cpum_cf/events/ | grep AES
 
   * If the output is empty then this patch is required.

   * With a patched kernel the output should be like:
 # ls /sys/devices/cpum_cf/events/ | grep AES
 AES_BLOCKED_CYCLES
 AES_BLOCKED_FUNCTIONS
 AES_CYCLES
 AES_FUNCTIONS
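
The check in the test plan can also be scripted. This is a minimal sketch, assuming the four counter names listed above are the ones to look for; the function name is hypothetical, and on machines without the cpum_cf directory everything is simply reported as missing.

```python
import os

# Counter names taken from the expected output in the test plan.
EXPECTED_AES_COUNTERS = (
    "AES_BLOCKED_CYCLES",
    "AES_BLOCKED_FUNCTIONS",
    "AES_CYCLES",
    "AES_FUNCTIONS",
)


def missing_aes_counters(events_dir="/sys/devices/cpum_cf/events"):
    """Return the expected AES counter names not found in events_dir."""
    try:
        present = set(os.listdir(events_dir))
    except FileNotFoundError:
        present = set()  # no cpum_cf events directory at all
    return [name for name in EXPECTED_AES_COUNTERS if name not in present]


if __name__ == "__main__":
    missing = missing_aes_counters()
    if missing:
        print("missing crypto counters:", ", ".join(missing))
    else:
        print("all AES crypto counters are exported")
```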

  [ Where problems could occur ]

   * This affects s390x only - CPU MF is s390-specific,
 and only s390 specific code is modified.

   * And it furthermore is limited to the crypto counter set
 of CPU MF.

   * So any impact is likely limited to hardware crypto counters
 on s390x only.

   * In s390/kernel/perf_cpum_cf.c the else if case got changed from
 explicitly checking for 6 or 7 to >= 6 which seems to require
 attention for future 8 and more cases.

   * In s390/kernel/perf_cpum_cf_events.c the switch (ci.csvn) statement
 was changed to an if / else if with similar logic.
 Again, attention is needed for any potential future cases >= 8.

   * It does not look like currently used cases (1..5 and 6..7)
 are affected by the modification, just >7.

   * Test builds of patched jammy and noble s390x kernels were built
 and are available here:
 https://launchpad.net/~fheimes/+archive/ubuntu/lp2074380
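
To make the scope of the change concrete, here is a small model of the version check as described in the points above. It is written from this description only and is not the kernel code; the function names and returned labels are hypothetical.

```python
def crypto_set_before_fix(csvn):
    """Model of the old check: only explicitly known versions match."""
    if 1 <= csvn <= 5:
        return "crypto set for csvn 1..5"
    if csvn in (6, 7):
        return "crypto set for csvn 6..7"
    return None  # unknown version: counter set is not exported


def crypto_set_after_fix(csvn):
    """Model of the new check: no upper limit on the version number."""
    if 1 <= csvn <= 5:
        return "crypto set for csvn 1..5"
    if csvn >= 6:
        return "crypto set for csvn 6..7"
    return None


for csvn in range(1, 10):
    before = crypto_set_before_fix(csvn)
    after = crypto_set_after_fix(csvn)
    note = "" if before == after else "  <- behaviour changes"
    print(csvn, before, after, note)
```

In this model only values above 7 behave differently, which matches the expectation that the currently used cases are not affected.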

  [ Other Info ]
   
   * Since the code/fix was upstream accepted with kernel v6.10(-rc1)
 it does not affect the current development release oracular.

   * This SRU can also be seen under the umbrella of new
  hardware enablement.

   * Since it requires special hw, the verification needs to be
 done by IBM.

  __

  Description:   kernel: s390/cpum_cf: make crypto counters upward
  compatible

  Symptom:   The CPU Measurement facility crypto counter set is not
     listed in the device sysfs tree.

  Problem:   The CPU Measurement facility crypto counter set is not
     exported in the sysfs directory
     /sys/devices/cpum_cf/events.
     The attribute files for each CPU-MF counter defined
     in the crypto counter set are missing. This is caused
     by the counter second version number of the CPU
     Measurement Facility hardware being incremented on
     new machines.  This causes a sanity check to fail,
     but the counters are supported by hardware.

  Solution:  Remove upper limit in counter second version number
     check.

  Reproduction:  Run command on a new machine generation:
  # ls -l /sys/devices/cpum_cf/events/ | grep AES
  #
     If the output is empty then this patch is required.
     The output should be:
  # ls  /sys/devices/cpum_cf/events/ | grep AES
  AES_BLOCKED_CYCLES
  AES_BLOCKED_FUNCTIONS
  AES_CYCLES
  AES_FUNCTIONS
  #

  Upstream-ID of fix:   f10933cbd2dfddf6273698a45f76db9bafd8150f

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu-z-systems/+bug/2074380/+
