[PATCH v7 0/1] Interface to represent PAPR firmware attributes

2021-07-22 Thread Pratik R. Sampat
RFC: https://lkml.org/lkml/2021/6/4/791
PATCH v1: https://lkml.org/lkml/2021/6/16/805
PATCH v2: https://lkml.org/lkml/2021/7/6/138
PATCH v3: https://lkml.org/lkml/2021/7/12/2799
PATCH v4: https://lkml.org/lkml/2021/7/16/532
PATCH v5: https://lkml.org/lkml/2021/7/19/247
PATCH v6: https://lkml.org/lkml/2021/7/20/36

Changelog v6-->v7
1. Optimized the free path for the pgs attributes to iterate only over
   the currently allocated attributes ("idx") instead of freeing all of them

Also implemented a POC using this interface for the powerpc-utils
ppc64_cpu --frequency command-line tool to consume this information
in userspace.
The POC for the new interface has been hosted here:
https://github.com/pratiksampat/powerpc-utils/tree/H_GET_ENERGY_SCALE_INFO_v2

Sample output from the powerpc-utils tool is as follows:

# ppc64_cpu --frequency
Power and Performance Mode: 
Idle Power Saver Status   : 
Processor Folding Status  :  --> Printed if Idle Power Saver status is supported

Platform reported frequencies --> Frequencies reported from the platform's H_CALL, i.e. the PAPR interface
min: GHz
max: GHz
static : GHz

Tool Computed frequencies
min: GHz (cpu XX)
max: GHz (cpu XX)
avg: GHz

Pratik R. Sampat (1):
  powerpc/pseries: Interface to represent PAPR firmware attributes

 .../sysfs-firmware-papr-energy-scale-info |  26 ++
 arch/powerpc/include/asm/hvcall.h |  24 +-
 arch/powerpc/kvm/trace_hv.h   |   1 +
 arch/powerpc/platforms/pseries/Makefile   |   3 +-
 .../pseries/papr_platform_attributes.c| 312 ++
 5 files changed, 364 insertions(+), 2 deletions(-)
 create mode 100644 Documentation/ABI/testing/sysfs-firmware-papr-energy-scale-info
 create mode 100644 arch/powerpc/platforms/pseries/papr_platform_attributes.c

-- 
2.31.1



[PATCH v7 1/1] powerpc/pseries: Interface to represent PAPR firmware attributes

2021-07-22 Thread Pratik R. Sampat
Adds a generic interface to represent the energy and frequency related
PAPR attributes on the system using the new H_CALL
"H_GET_ENERGY_SCALE_INFO".

The H_GET_EM_PARMS H_CALL was previously responsible for exporting this
information via lparcfg; however, the H_GET_EM_PARMS H_CALL will be
deprecated from P10 onwards.

The H_GET_ENERGY_SCALE_INFO H_CALL has the following call format:
hcall(
  uint64 H_GET_ENERGY_SCALE_INFO,  // Get energy scale info
  uint64 flags,   // Per the flag request
  uint64 firstAttributeId,// The attribute id
  uint64 bufferAddress,   // Guest physical address of the output buffer
  uint64 bufferSize   // The size in bytes of the output buffer
);

This H_CALL can either query all the attributes at once, with
firstAttributeId = 0 and flags = 0, or query a single attribute at a
time, with firstAttributeId = id and flags = 1.

The output buffer consists of the following
1. number of attributes  - 8 bytes
2. array offset to the data location - 8 bytes
3. version info  - 1 byte
4. A data array of size num attributes, which contains the following:
  a. attribute ID  - 8 bytes
  b. attribute value in number - 8 bytes
  c. attribute name in string  - 64 bytes
  d. attribute value in string - 64 bytes
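As an illustration of the layout above, a userspace-style parser for one
data-array entry could look like the sketch below. The structure and
function names are hypothetical (the real code lives in
arch/powerpc/platforms/pseries/papr_platform_attributes.c), and the real
hypervisor data is big-endian, which the kernel byte-swaps; the sketch
uses native endianness for simplicity.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define ESI_NAME_LEN 64

/* Header: 8-byte count, 8-byte array offset, 1-byte version = 17 bytes */
struct esi_header {
	uint64_t num_attrs;
	uint64_t array_offset;
	uint8_t  version;
};

/* One data-array element: 8 + 8 + 64 + 64 = 144 bytes */
struct esi_attr {
	uint64_t id;
	uint64_t value;
	char name[ESI_NAME_LEN];
	char value_desc[ESI_NAME_LEN];
};

/* Parse one attribute entry out of a raw output buffer at offset 'off'. */
static struct esi_attr esi_parse_attr(const uint8_t *buf, uint64_t off)
{
	struct esi_attr a;

	memcpy(&a.id, buf + off, 8);
	memcpy(&a.value, buf + off + 8, 8);
	memcpy(a.name, buf + off + 16, ESI_NAME_LEN);
	memcpy(a.value_desc, buf + off + 16 + ESI_NAME_LEN, ESI_NAME_LEN);
	return a;
}
```

The first entry would start right after the 17-byte header (or at
whatever array_offset the hypervisor reports).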

The new H_CALL exports information in direct string value format, hence
a new interface has been introduced in
/sys/firmware/papr/energy_scale_info to export this information to
userspace in an extensible pass-through format.

The H_CALL returns the name, numeric value and string value (if it exists).

The format of exposing the sysfs information is as follows:
/sys/firmware/papr/energy_scale_info/
   |-- <attr_id>/
     |-- desc
     |-- value
     |-- value_desc (if exists)
   |-- <attr_id>/
     |-- desc
     |-- value
     |-- value_desc (if exists)
...

The energy information that is exported is useful for userspace tools
such as powerpc-utils. Currently these tools infer the
"power_mode_data" value in the lparcfg, which in turn is obtained from
the soon-to-be-deprecated H_GET_EM_PARMS H_CALL.
On future platforms, such userspace utilities will have to look at the
data returned from the new H_CALL, populated in this new sysfs
interface, and report this information directly without needing to
interpret it.
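A minimal sketch of how such a utility could consume one attribute
directory, reading "desc" and "value" and reporting them verbatim with
no interpretation. The helper name and directory layout used here are
illustrative, not taken from the powerpc-utils sources:

```c
#include <stdio.h>
#include <string.h>

/*
 * Read one sysfs-style attribute file (e.g. "<dir>/desc") into 'out',
 * stripping the trailing newline. Returns 0 on success, -1 if the file
 * does not exist (value_desc is optional per the layout above).
 */
static int read_attr_file(const char *dir, const char *file,
			  char *out, size_t len)
{
	char path[512];
	FILE *f;

	snprintf(path, sizeof(path), "%s/%s", dir, file);
	f = fopen(path, "r");
	if (!f)
		return -1;
	if (!fgets(out, (int)len, f)) {
		fclose(f);
		return -1;
	}
	out[strcspn(out, "\n")] = '\0';	/* strip trailing newline */
	fclose(f);
	return 0;
}
```

A tool would iterate over /sys/firmware/papr/energy_scale_info/*/ and
call this for desc, value, and (if present) value_desc.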

Signed-off-by: Pratik R. Sampat 
Reviewed-by: Gautham R. Shenoy 
---
 .../sysfs-firmware-papr-energy-scale-info |  26 ++
 arch/powerpc/include/asm/hvcall.h |  24 +-
 arch/powerpc/kvm/trace_hv.h   |   1 +
 arch/powerpc/platforms/pseries/Makefile   |   3 +-
 .../pseries/papr_platform_attributes.c| 312 ++
 5 files changed, 364 insertions(+), 2 deletions(-)
 create mode 100644 Documentation/ABI/testing/sysfs-firmware-papr-energy-scale-info
 create mode 100644 arch/powerpc/platforms/pseries/papr_platform_attributes.c

diff --git a/Documentation/ABI/testing/sysfs-firmware-papr-energy-scale-info b/Documentation/ABI/testing/sysfs-firmware-papr-energy-scale-info
new file mode 100644
index ..139a576c7c9d
--- /dev/null
+++ b/Documentation/ABI/testing/sysfs-firmware-papr-energy-scale-info
@@ -0,0 +1,26 @@
+What:  /sys/firmware/papr/energy_scale_info
+Date:  June 2021
+Contact:   Linux for PowerPC mailing list 
+Description:   Directory hosting a set of platform attributes like
+   energy/frequency on Linux running as a PAPR guest.
+
+   Each file in a directory contains a platform
+   attribute hierarchy pertaining to performance/
+   energy-savings mode and processor frequency.
+
+What:  /sys/firmware/papr/energy_scale_info/<attr_id>
+   /sys/firmware/papr/energy_scale_info/<attr_id>/desc
+   /sys/firmware/papr/energy_scale_info/<attr_id>/value
+   /sys/firmware/papr/energy_scale_info/<attr_id>/value_desc
+Date:  June 2021
+Contact:   Linux for PowerPC mailing list 
+Description:   Energy, frequency attributes directory for POWERVM servers
+
+   This directory provides energy, frequency, and folding
+   information. It contains the following sysfs attributes:
+
+   - desc: String description of the attribute
+
+   - value: Numeric value of the attribute
+
+   - value_desc: String value of the attribute
diff --git a/arch/powerpc/include/asm/hvcall.h b/arch/powerpc/include/asm/hvcall.h
index e3b29eda8074..c91714ea6719 100644
--- a/arch/powerpc/include/asm/hvcall.h
+++ b/arch/powerpc/include/asm/hvcall.h
@@ -316,7 +316,8 @@
 #define H_SCM_PERFORMANCE_STATS 0x418
 #define H_RPT_INVALIDATE   0x448
 #define H_SCM_FLUSH0x44C
-#define MAX_HCALL_OPCODE   H_SCM_FLUSH
+#define H_GET_ENERGY_SCALE_INFO0x450
+#define MAX_HCALL_OPCODE   H_GET_ENERGY_SCALE_INFO
 
 /* Scope args for H_SCM_UNBIND_ALL */
 #define H_UNBIND_SCOPE_ALL (0x1)
@@ -631,6 +632,27 @@ struct hv_gpci_request_buffer {
uint8_t bytes[HGPCI_MAX_DATA_BYTES];
 } __packed;
 

Re: [PATCH v5 10/11] powerpc/pseries/iommu: Make use of DDW for indirect mapping

2021-07-22 Thread Alexey Kardashevskiy




On 22/07/2021 01:04, Frederic Barrat wrote:



On 21/07/2021 05:32, Alexey Kardashevskiy wrote:

+    struct iommu_table *newtbl;
+    int i;
+
+    for (i = 0; i < ARRAY_SIZE(pci->phb->mem_resources); i++) {
+    const unsigned long mask = IORESOURCE_MEM_64 | IORESOURCE_MEM;
+
+    /* Look for MMIO32 */
+    if ((pci->phb->mem_resources[i].flags & mask) == IORESOURCE_MEM)
+    break;
+    }
+
+    if (i == ARRAY_SIZE(pci->phb->mem_resources))
+    goto out_del_list;
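The flag test in the quoted hunk can be exercised in isolation. The
constant values below are copied from include/linux/ioport.h; this is a
standalone sketch of the predicate, not the kernel code:

```c
#include <assert.h>

/* Values as defined in include/linux/ioport.h */
#define IORESOURCE_MEM     0x00000200
#define IORESOURCE_MEM_64  0x00100000

/* Return 1 if the resource flags describe a 32-bit MMIO window. */
static int is_mmio32(unsigned long flags)
{
	const unsigned long mask = IORESOURCE_MEM_64 | IORESOURCE_MEM;

	/* MEM set with MEM_64 clear means a 32-bit MMIO resource */
	return (flags & mask) == IORESOURCE_MEM;
}
```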



So we exit and do nothing if there's no MMIO32 bar?
Isn't the intent just to figure out the MMIO32 area to reserve it 
when init'ing the table? In which case we could default to 0,0


I'm actually not clear why we are reserving this area on pseries.




If we do not reserve it, then the iommu code will allocate DMA pages
from there and these addresses are MMIO32 from the kernel's point of
view, at least. I saw crashes when (I think) a device tried DMAing to
the top 2GB of the bus space which happened to be some other device's BAR.



hmmm... then figuring out the correct range needs more work. We could 
have more than one MMIO32 bar. And they don't have to be adjacent. 


They all have to be within the MMIO32 window of a PHB and we reserve the 
entire window here.


I don't see that we are reserving any range on the initial table though
(on pseries).

True, we did not need to, as the hypervisor always took care of DMA and
MMIO32 regions to not overlap.


And in this series we do not (strictly speaking) need this either as 
phyp never allocates more than one window dynamically and that only 
window is always the second one starting from 0x800.... It 
is probably my mistake that KVM allows a new window to start from 0 - 
PAPR did not prohibit this explicitly.


And for the KVM case, we do not need to remove the default window as KVM 
can pretty much always allocate as many TCE as the VM wants. But we 
still allow removing the default window and creating a huge one instead 
at 0x0 as this way we can allow 1:1 for every single PCI device even if 
it only allows 48 (or similar but less than 64bit) DMA. Hope this makes 
sense. Thanks,



--
Alexey


Re: [PATCH 1/2] PCI/AER: Disable AER interrupt during suspend

2021-07-22 Thread Christoph Hellwig
On Thu, Jul 22, 2021 at 05:23:51PM -0500, Bjorn Helgaas wrote:
> Marking both of these as "not applicable" for now because I don't
> think we really understand what's going on.
> 
> Apparently a DMA occurs during suspend or resume and triggers an ACS
> violation.  I don't think such a DMA should occur in the first
> place.
> 
> Or maybe, since you say the problem happens right after ACS is enabled
> during resume, we're doing the ACS enable incorrectly?  Although I
> would think we should not be doing DMA at the same time we're enabling
> ACS, either.
> 
> If this really is a system firmware issue, both HP and Dell should
> have the knowledge and equipment to figure out what's going on.

DMA on resume sounds really odd.  OTOH the below mentioned case of
a DMA during suspend seems very likely in some setups.  NVMe has the
concept of a host memory buffer (HMB) that allows the PCIe device
to use arbitrary host memory for internal purposes.  Combine this
with the "Storage D3" misfeature in modern x86 platforms that forces
a slot into d3cold without consulting the driver first and you'd see
symptoms like this.  Another case would be the NVMe equivalent of
AER, which could lead to a completion without host activity.

We now have quirks in the ACPI layer and NVMe to fully shut down the
NVMe controllers on these messed up systems with the "Storage D3"
misfeature, which should avoid such "spurious" DMAs at the cost of
wearing out the device much faster.


Re: [PATCH 1/2] PCI/AER: Disable AER interrupt during suspend

2021-07-22 Thread Bjorn Helgaas
On Fri, Feb 05, 2021 at 11:17:32PM +0800, Kai-Heng Feng wrote:
> On Fri, Feb 5, 2021 at 7:28 AM Bjorn Helgaas  wrote:
> >
> > [+cc Alex]
> >
> > On Thu, Jan 28, 2021 at 12:09:37PM +0800, Kai-Heng Feng wrote:
> > > On Thu, Jan 28, 2021 at 4:51 AM Bjorn Helgaas  wrote:
> > > > On Thu, Jan 28, 2021 at 01:31:00AM +0800, Kai-Heng Feng wrote:
> > > > > Commit 50310600ebda ("iommu/vt-d: Enable PCI ACS for platform opt in
> > > > > hint") enables ACS, and some platforms lose its NVMe after resume from
> > > > > firmware:
> > > > > [   50.947816] pcieport :00:1b.0: DPC: containment event, 
> > > > > status:0x1f01 source:0x
> > > > > [   50.947817] pcieport :00:1b.0: DPC: unmasked uncorrectable 
> > > > > error detected
> > > > > [   50.947829] pcieport :00:1b.0: PCIe Bus Error: 
> > > > > severity=Uncorrected (Non-Fatal), type=Transaction Layer, (Receiver 
> > > > > ID)
> > > > > [   50.947830] pcieport :00:1b.0:   device [8086:06ac] error 
> > > > > status/mask=0020/0001
> > > > > [   50.947831] pcieport :00:1b.0:[21] ACSViol
> > > > > (First)
> > > > > [   50.947841] pcieport :00:1b.0: AER: broadcast error_detected 
> > > > > message
> > > > > [   50.947843] nvme nvme0: frozen state error detected, reset 
> > > > > controller
> > > > >
> > > > > It happens right after ACS gets enabled during resume.
> > > > >
> > > > > To prevent that from happening, disable AER interrupt and enable it on
> > > > > system suspend and resume, respectively.
> > > >
> > > > Lots of questions here.  Maybe this is what we'll end up doing, but I
> > > > am curious about why the error is reported in the first place.
> > > >
> > > > Is this a consequence of the link going down and back up?
> > >
> > > Could be. From the observations, it only happens when firmware suspend
> > > (S3) is used.
> > > Maybe it happens when it gets powered up, but I don't have equipment
> > > to debug at hardware level.
> > >
> > > If we use non-firmware suspend method, enabling ACS after resume won't
> > > trip AER and DPC.
> > >
> > > > Is it consequence of the device doing a DMA when it shouldn't?
> > >
> > > If it's doing DMA while suspending, the same error should also happen
> > > after NVMe is suspended and before PCIe port suspending.
> > > Furthermore, if non-firmware suspend method is used, there's no such
> > > issue, so it's less likely to be any DMA operation.
> > >
> > > > Are we doing something in the wrong order during suspend?  Or maybe
> > > > resume, since I assume the error is reported during resume?
> > >
> > > Yes the error is reported during resume. The suspend/resume order
> > > seems fine as non-firmware suspend doesn't have this issue.
> >
> > I really feel like we need a better understanding of what's going on
> > here.  Disabling the AER interrupt is like closing our eyes and
> > pretending that because we don't see it, it didn't happen.
> >
> > An ACS error is triggered by a DMA, right?  I'm assuming an MMIO
> > access from the CPU wouldn't trigger this error.  And it sounds like
> > the error is triggered before we even start running the driver after
> > resume.
> >
> > If we're powering up an NVMe device from D3cold and it DMAs before the
> > driver touches it, something would be seriously broken.  I doubt
> > that's what's happening.  Maybe a device could resume some previously
> > programmed DMA after powering up from D3hot.
> 
> I am not that familiar with PCIe ACS/AER/DPC, so I can't really answer
> questions you raised.
> The PCIe spec not specifying the suspend/resume order is also not helping here.
> 
> However, I really think it's a system firmware issue.
> I've seen some suspend-to-idle platforms where the NVMe can reach D3cold,
> and those are unaffected.

Marking both of these as "not applicable" for now because I don't
think we really understand what's going on.

Apparently a DMA occurs during suspend or resume and triggers an ACS
violation.  I don't think such a DMA should occur in the first
place.

Or maybe, since you say the problem happens right after ACS is enabled
during resume, we're doing the ACS enable incorrectly?  Although I
would think we should not be doing DMA at the same time we're enabling
ACS, either.

If this really is a system firmware issue, both HP and Dell should
have the knowledge and equipment to figure out what's going on.

> > Or maybe the error occurred on suspend, like if the device wasn't
> > quiesced or something, but we didn't notice it until resume?  The
> > AER error status bits are RW1CS, which means they can be preserved
> > across hot/warm/cold resets.
> >
> > Can you instrument the code to see whether the AER error status bit is
> > set before enabling ACS?  I'm not sure that merely enabling ACS (I
> > assume you mean pci_std_enable_acs(), where we write PCI_ACS_CTRL)
> > should cause an interrupt for a previously-logged error.  I suspect
> > that could happen when enabling *AER*, but I wouldn't think it would
> > happen when enabling 

[PATCH v4 0/2] KVM: PPC: Book3S HV: Nested guest state sanitising changes

2021-07-22 Thread Fabiano Rosas
This series still aims to stop contaminating the l2_hv structure with
bits that might have come from L1 state.

Patch 1 makes l2_hv read-only (mostly). It is now only changed when we
explicitly want to pass information to L1.

Patch 2 makes sure that L1 is not forwarded HFU interrupts when the
host has decided to disable any facilities (theoretical for now, since
HFSCR bits are always the same between L1/Ln).

Changes since v3:

- now passing lpcr separately into load_l2_hv_regs to solve the
  conflict with commit a19b70abc69a ("KVM: PPC: Book3S HV: Nested move
  LPCR sanitising to sanitise_hv_regs");

- patch 2 now forwards a HEAI instead of injecting a Program.

v3:

- removed the sanitise functions;
- moved the entry code into a new load_l2_hv_regs and the exit code
  into the existing save_hv_return_state;
- new patch: removes the cause bits when L0 has disabled the
  corresponding facility.

https://lkml.kernel.org/r/20210415230948.3563415-1-faro...@linux.ibm.com

v2:

- made the change more generic, not only applies to hfscr anymore;
- sanitisation is now done directly on the vcpu struct; l2_hv is left unchanged.

https://lkml.kernel.org/r/20210406214645.3315819-1-faro...@linux.ibm.com

v1:
https://lkml.kernel.org/r/20210305231055.2913892-1-faro...@linux.ibm.com

Fabiano Rosas (2):
  KVM: PPC: Book3S HV: Sanitise vcpu registers in nested path
  KVM: PPC: Book3S HV: Stop forwarding all HFUs to L1

 arch/powerpc/kvm/book3s_hv_nested.c | 125 +---
 1 file changed, 75 insertions(+), 50 deletions(-)

-- 
2.29.2



[PATCH v4 2/2] KVM: PPC: Book3S HV: Stop forwarding all HFUs to L1

2021-07-22 Thread Fabiano Rosas
If the nested hypervisor has no access to a facility because it has
been disabled by the host, it should also not be able to see the
Hypervisor Facility Unavailable interrupt that arises from one of its
guests trying to access the facility.

This patch turns a HFU that happened in L2 into a Hypervisor Emulation
Assistance interrupt and forwards it to L1 for handling. The ones that
happened because L1 explicitly disabled the facility for L2 are still
let through, along with the corresponding Cause bits in the HFSCR.

Signed-off-by: Fabiano Rosas 
---
 arch/powerpc/kvm/book3s_hv_nested.c | 27 ---
 1 file changed, 24 insertions(+), 3 deletions(-)

diff --git a/arch/powerpc/kvm/book3s_hv_nested.c b/arch/powerpc/kvm/book3s_hv_nested.c
index 3804dc50ebe8..d171a400e4d5 100644
--- a/arch/powerpc/kvm/book3s_hv_nested.c
+++ b/arch/powerpc/kvm/book3s_hv_nested.c
@@ -99,7 +99,7 @@ static void byteswap_hv_regs(struct hv_guest_state *hr)
hr->dawrx1 = swab64(hr->dawrx1);
 }
 
-static void save_hv_return_state(struct kvm_vcpu *vcpu, int trap,
+static void save_hv_return_state(struct kvm_vcpu *vcpu,
 struct hv_guest_state *hr)
 {
struct kvmppc_vcore *vc = vcpu->arch.vcore;
@@ -128,7 +128,7 @@ static void save_hv_return_state(struct kvm_vcpu *vcpu, int trap,
hr->pidr = vcpu->arch.pid;
hr->cfar = vcpu->arch.cfar;
hr->ppr = vcpu->arch.ppr;
-   switch (trap) {
+   switch (vcpu->arch.trap) {
case BOOK3S_INTERRUPT_H_DATA_STORAGE:
hr->hdar = vcpu->arch.fault_dar;
hr->hdsisr = vcpu->arch.fault_dsisr;
@@ -137,6 +137,27 @@ static void save_hv_return_state(struct kvm_vcpu *vcpu, int trap,
case BOOK3S_INTERRUPT_H_INST_STORAGE:
hr->asdr = vcpu->arch.fault_gpa;
break;
+   case BOOK3S_INTERRUPT_H_FAC_UNAVAIL:
+   {
+   u8 cause = vcpu->arch.hfscr >> 56;
+
+   WARN_ON_ONCE(cause >= BITS_PER_LONG);
+
+   if (!(hr->hfscr & (1UL << cause)))
+   break;
+
+   /*
+* We have disabled this facility, so it does not
+* exist from L1's perspective. Turn it into a HEAI.
+*/
+   vcpu->arch.trap = BOOK3S_INTERRUPT_H_EMUL_ASSIST;
+   kvmppc_load_last_inst(vcpu, INST_GENERIC, &vcpu->arch.emul_inst);
+
+   /* Don't leak the cause field */
+   hr->hfscr &= ~HFSCR_INTR_CAUSE;
+
+   fallthrough;
+   }
case BOOK3S_INTERRUPT_H_EMUL_ASSIST:
hr->heir = vcpu->arch.emul_inst;
break;
@@ -374,7 +395,7 @@ long kvmhv_enter_nested_guest(struct kvm_vcpu *vcpu)
delta_spurr = vcpu->arch.spurr - l2_hv.spurr;
delta_ic = vcpu->arch.ic - l2_hv.ic;
delta_vtb = vc->vtb - l2_hv.vtb;
-   save_hv_return_state(vcpu, vcpu->arch.trap, &l2_hv);
+   save_hv_return_state(vcpu, &l2_hv);
 
/* restore L1 state */
vcpu->arch.nested = NULL;
-- 
2.29.2



[PATCH v4 1/2] KVM: PPC: Book3S HV: Sanitise vcpu registers in nested path

2021-07-22 Thread Fabiano Rosas
As one of the arguments of the H_ENTER_NESTED hypercall, the nested
hypervisor (L1) prepares a structure containing the values of various
hypervisor-privileged registers with which it wants the nested guest
(L2) to run. Since the nested HV runs in supervisor mode it needs the
host to write to these registers.

To stop a nested HV manipulating this mechanism and using a nested
guest as a proxy to access a facility that has been made unavailable
to it, we have a routine that sanitises the values of the HV registers
before copying them into the nested guest's vcpu struct.

However, when coming out of the guest the values are copied as they
were back into L1 memory, which means that any sanitisation we did
during guest entry will be exposed to L1 after H_ENTER_NESTED returns.

This patch alters the sanitisation to take effect on the vcpu->arch
registers directly before entering and after exiting the guest,
leaving the structure that is copied back into L1 unchanged (except
when we really want L1 to access the value, e.g. the Cause bits of
the HFSCR).
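The "unchanged except the Cause bits" rule corresponds to a simple
bit-merge over the top byte of the HFSCR, as in the hunk below. A
standalone sketch of that merge (same HFSCR_INTR_CAUSE top-byte mask as
the kernel; illustrative only):

```c
#include <assert.h>
#include <stdint.h>

#define HFSCR_INTR_CAUSE (0xffULL << 56)

/*
 * Build the HFSCR value returned to L1: keep every facility bit exactly
 * as L1 supplied it (hiding any host-side sanitisation), but take the
 * interrupt-cause byte from the value the vcpu actually ran with.
 */
static uint64_t hfscr_return_to_l1(uint64_t l1_hfscr, uint64_t vcpu_hfscr)
{
	return (l1_hfscr & ~HFSCR_INTR_CAUSE) |
	       (vcpu_hfscr & HFSCR_INTR_CAUSE);
}
```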

Signed-off-by: Fabiano Rosas 
---
 arch/powerpc/kvm/book3s_hv_nested.c | 100 +++-
 1 file changed, 52 insertions(+), 48 deletions(-)

diff --git a/arch/powerpc/kvm/book3s_hv_nested.c b/arch/powerpc/kvm/book3s_hv_nested.c
index 8543ad538b0c..3804dc50ebe8 100644
--- a/arch/powerpc/kvm/book3s_hv_nested.c
+++ b/arch/powerpc/kvm/book3s_hv_nested.c
@@ -104,8 +104,17 @@ static void save_hv_return_state(struct kvm_vcpu *vcpu, int trap,
 {
struct kvmppc_vcore *vc = vcpu->arch.vcore;
 
+   /*
+* When loading the hypervisor-privileged registers to run L2,
+* we might have used bits from L1 state to restrict what the
+* L2 state is allowed to be. Since L1 is not allowed to read
+* the HV registers, do not include these modifications in the
+* return state.
+*/
+   hr->hfscr = ((~HFSCR_INTR_CAUSE & hr->hfscr) |
+(HFSCR_INTR_CAUSE & vcpu->arch.hfscr));
+
hr->dpdes = vc->dpdes;
-   hr->hfscr = vcpu->arch.hfscr;
hr->purr = vcpu->arch.purr;
hr->spurr = vcpu->arch.spurr;
hr->ic = vcpu->arch.ic;
@@ -134,49 +143,7 @@ static void save_hv_return_state(struct kvm_vcpu *vcpu, int trap,
}
 }
 
-/*
- * This can result in some L0 HV register state being leaked to an L1
- * hypervisor when the hv_guest_state is copied back to the guest after
- * being modified here.
- *
- * There is no known problem with such a leak, and in many cases these
- * register settings could be derived by the guest by observing behaviour
- * and timing, interrupts, etc., but it is an issue to consider.
- */
-static void sanitise_hv_regs(struct kvm_vcpu *vcpu, struct hv_guest_state *hr)
-{
-   struct kvmppc_vcore *vc = vcpu->arch.vcore;
-   u64 mask;
-
-   /*
-* Don't let L1 change LPCR bits for the L2 except these:
-*/
-   mask = LPCR_DPFD | LPCR_ILE | LPCR_TC | LPCR_AIL | LPCR_LD |
-   LPCR_LPES | LPCR_MER;
-
-   /*
-* Additional filtering is required depending on hardware
-* and configuration.
-*/
-   hr->lpcr = kvmppc_filter_lpcr_hv(vcpu->kvm,
-   (vc->lpcr & ~mask) | (hr->lpcr & mask));
-
-   /*
-* Don't let L1 enable features for L2 which we've disabled for L1,
-* but preserve the interrupt cause field.
-*/
-   hr->hfscr &= (HFSCR_INTR_CAUSE | vcpu->arch.hfscr);
-
-   /* Don't let data address watchpoint match in hypervisor state */
-   hr->dawrx0 &= ~DAWRX_HYP;
-   hr->dawrx1 &= ~DAWRX_HYP;
-
-   /* Don't let completed instruction address breakpt match in HV state */
-   if ((hr->ciabr & CIABR_PRIV) == CIABR_PRIV_HYPER)
-   hr->ciabr &= ~CIABR_PRIV;
-}
-
-static void restore_hv_regs(struct kvm_vcpu *vcpu, struct hv_guest_state *hr)
+static void restore_hv_regs(struct kvm_vcpu *vcpu, const struct hv_guest_state *hr)
 {
struct kvmppc_vcore *vc = vcpu->arch.vcore;
 
@@ -288,6 +255,43 @@ static int kvmhv_write_guest_state_and_regs(struct kvm_vcpu *vcpu,
 sizeof(struct pt_regs));
 }
 
+static void load_l2_hv_regs(struct kvm_vcpu *vcpu,
+   const struct hv_guest_state *l2_hv,
+   const struct hv_guest_state *l1_hv, u64 *lpcr)
+{
+   struct kvmppc_vcore *vc = vcpu->arch.vcore;
+   u64 mask;
+
+   restore_hv_regs(vcpu, l2_hv);
+
+   /*
+* Don't let L1 change LPCR bits for the L2 except these:
+*/
+   mask = LPCR_DPFD | LPCR_ILE | LPCR_TC | LPCR_AIL | LPCR_LD |
+   LPCR_LPES | LPCR_MER;
+
+   /*
+* Additional filtering is required depending on hardware
+* and configuration.
+*/
+   *lpcr = kvmppc_filter_lpcr_hv(vcpu->kvm,
+ (vc->lpcr & ~mask) | (*lpcr & mask));
+
+   /*
+* Don't let L1 enable 

Re: [PATCH] soc: fsl: qe: convert QE interrupt controller to platform_device

2021-07-22 Thread Li Yang
On Mon, Jul 19, 2021 at 1:57 AM Maxim Kochetkov  wrote:
>
> 15.07.2021 01:29, Li Yang wrote:
> >  From the original code, this should be type = "qeic".  It is not
> > defined in current binding but probably needed for backward
> > compatibility.
>
> I took these strings from this part:
>
> np = of_find_compatible_node(NULL, NULL, "fsl,qe-ic");
> if (!np) {
> np = of_find_node_by_type(NULL, "qeic");
> if (!np)
> return -ENODEV;
> }
>
> However I can't find usage of "qeic" in any dts, so I will drop this in V2

It is really a bit hard to find as it is pretty old.  But it was
really used up until this commit below in 2008.  So probably it will
be better to keep it just for backward compatibility?

commit a2dd70a11d4c9cb8a4e4bb41f53a9b430e08559b
Author: Anton Vorontsov 
Date:   Thu Jan 24 18:39:59 2008 +0300

[POWERPC] QE: get rid of most device_types and model

Now we're searching for "fsl,qe", "fsl,qe-muram", "fsl,qe-muram-data"
and "fsl,qe-ic".

Unfortunately it's still impossible to remove device_type = "qe"
from the existing device trees because older u-boots are looking for it.

Signed-off-by: Anton Vorontsov 
Signed-off-by: Kumar Gala 

Regards,
Leo


Re: [PATCH v4 0/5] bus: Make remove callback return void

2021-07-22 Thread Greg Kroah-Hartman
On Wed, Jul 21, 2021 at 12:09:41PM +0200, Greg Kroah-Hartman wrote:
> On Tue, Jul 13, 2021 at 09:35:17PM +0200, Uwe Kleine-König wrote:
> > Hello,
> > 
> > this is v4 of the final patch set for my effort to make struct
> > bus_type::remove return void.
> > 
> > The first four patches contain cleanups that make some of these
> > callbacks (more obviously) always return 0. They are acked by the
> > respective maintainers. Bjorn Helgaas explicitly asked to include the
> > pci patch (#1) into this series, so Greg taking this is fine. I assume
> > the s390 people are fine with Greg taking patches #2 to #4, too; they
> > didn't explicitly say so though.
> > 
> > The last patch actually changes the prototype and so touches quite some
> > drivers and has the potential to conflict with future developments, so I
> > consider it beneficial to put these patches into next soon. I expect
> > that it will be Greg who takes the complete series, he already confirmed
> > via irc (for v2) to look into this series.
> > 
> > The only change compared to v3 is in the fourth patch where I modified a
> > few more drivers to fix build failures. Some of them were found by build
> > bots (thanks!), some of them I found myself using a regular expression
> > search. The newly modified files are:
> > 
> >  arch/sparc/kernel/vio.c
> >  drivers/nubus/bus.c
> >  drivers/sh/superhyway/superhyway.c
> >  drivers/vlynq/vlynq.c
> >  drivers/zorro/zorro-driver.c
> >  sound/ac97/bus.c
> > 
> > Best regards
> > Uwe
> 
> Now queued up.  I can go make a git tag that people can pull from after
> 0-day is finished testing this to verify all is good, if others need it.

Ok, here's a tag that any other subsystem can pull from if they want
these changes in their tree before 5.15-rc1 is out.  I might pull it
into my char-misc-next tree as well just to keep that tree sane as it
seems to pick up new busses on a regular basis...

thanks,

greg k-h

---


The following changes since commit 2734d6c1b1a089fb593ef6a23d4b70903526fe0c:

  Linux 5.14-rc2 (2021-07-18 14:13:49 -0700)

are available in the Git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core.git 
tags/bus_remove_return_void-5.15

for you to fetch changes up to fc7a6209d5710618eb4f72a77cd81b8d694ecf89:

  bus: Make remove callback return void (2021-07-21 11:53:42 +0200)


Bus: Make remove callback return void tag

Tag for other trees/branches to pull from in order to have a stable
place to build off of if they want to add new busses for 5.15.

Signed-off-by: Greg Kroah-Hartman 


Uwe Kleine-König (5):
  PCI: endpoint: Make struct pci_epf_driver::remove return void
  s390/cio: Make struct css_driver::remove return void
  s390/ccwgroup: Drop if with an always false condition
  s390/scm: Make struct scm_driver::remove return void
  bus: Make remove callback return void

 arch/arm/common/locomo.c  | 3 +--
 arch/arm/common/sa.c  | 4 +---
 arch/arm/mach-rpc/ecard.c | 4 +---
 arch/mips/sgi-ip22/ip22-gio.c | 3 +--
 arch/parisc/kernel/drivers.c  | 5 ++---
 arch/powerpc/platforms/ps3/system-bus.c   | 3 +--
 arch/powerpc/platforms/pseries/ibmebus.c  | 3 +--
 arch/powerpc/platforms/pseries/vio.c  | 3 +--
 arch/s390/include/asm/eadm.h  | 2 +-
 arch/sparc/kernel/vio.c   | 4 +---
 drivers/acpi/bus.c| 3 +--
 drivers/amba/bus.c| 4 +---
 drivers/base/auxiliary.c  | 4 +---
 drivers/base/isa.c| 4 +---
 drivers/base/platform.c   | 4 +---
 drivers/bcma/main.c   | 6 ++
 drivers/bus/sunxi-rsb.c   | 4 +---
 drivers/cxl/core.c| 3 +--
 drivers/dax/bus.c | 4 +---
 drivers/dma/idxd/sysfs.c  | 4 +---
 drivers/firewire/core-device.c| 4 +---
 drivers/firmware/arm_scmi/bus.c   | 4 +---
 drivers/firmware/google/coreboot_table.c  | 4 +---
 drivers/fpga/dfl.c| 4 +---
 drivers/hid/hid-core.c| 4 +---
 drivers/hid/intel-ish-hid/ishtp/bus.c | 4 +---
 drivers/hv/vmbus_drv.c| 5 +
 drivers/hwtracing/intel_th/core.c | 4 +---
 drivers/i2c/i2c-core-base.c   | 5 +
 drivers/i3c/master.c  | 4 +---
 drivers/input/gameport/gameport.c | 3 +--
 drivers/input/serio/serio.c   | 3 +--
 drivers/ipack/ipack.c | 4 +---
 drivers/macintosh/macio_asic.c| 4 +---
 drivers/mcb/mcb-core.c| 4 +---
 drivers/media/pci/bt8xx/bttv-gpio.c   | 3 +--
 drivers/memstick/core/memstick.c  | 3 +--
 drivers/mfd/mcp-core.c| 

Re: [PATCH v5 6/6] powerpc/pseries: Add support for FORM2 associativity

2021-07-22 Thread Aneesh Kumar K.V
David Gibson  writes:

> On Mon, Jun 28, 2021 at 08:41:17PM +0530, Aneesh Kumar K.V wrote:
>> PAPR interface currently supports two different ways of communicating 
>> resource
>> grouping details to the OS. These are referred to as Form 0 and Form 1
>> associativity grouping. Form 0 is the older format and is now considered
>> deprecated. This patch adds another resource grouping named FORM2.
>> 
>> Signed-off-by: Daniel Henrique Barboza 
>> Signed-off-by: Aneesh Kumar K.V 
>> ---
>>  Documentation/powerpc/associativity.rst   | 103 ++
>>  arch/powerpc/include/asm/firmware.h   |   3 +-
>>  arch/powerpc/include/asm/prom.h   |   1 +
>>  arch/powerpc/kernel/prom_init.c   |   3 +-
>>  arch/powerpc/mm/numa.c| 157 ++
>>  arch/powerpc/platforms/pseries/firmware.c |   1 +
>>  6 files changed, 242 insertions(+), 26 deletions(-)
>>  create mode 100644 Documentation/powerpc/associativity.rst
>> 
>> diff --git a/Documentation/powerpc/associativity.rst 
>> b/Documentation/powerpc/associativity.rst
>> new file mode 100644
>> index ..31cc7da2c7a6
>> --- /dev/null
>> +++ b/Documentation/powerpc/associativity.rst
>> @@ -0,0 +1,103 @@
>> +
>> +NUMA resource associativity
>> +=
>> +
>> +Associativity represents the groupings of the various platform resources 
>> into
>> +domains of substantially similar mean performance relative to resources 
>> outside
>> +of that domain. Resources subsets of a given domain that exhibit better
>> +performance relative to each other than relative to other resources subsets
>> +are represented as being members of a sub-grouping domain. This performance
>> +characteristic is presented in terms of NUMA node distance within the Linux 
>> kernel.
>> +From the platform view, these groups are also referred to as domains.
>
> Pretty hard to decipher, but that's typical for PAPR.
>
>> +PAPR interface currently supports different ways of communicating these 
>> resource
>> +grouping details to the OS. These are referred to as Form 0, Form 1 and 
>> Form2
>> +associativity grouping. Form 0 is the older format and is now considered 
>> deprecated.
>
> Nit: s/older/oldest/ since there are now >2 forms.

updated.

>
>> +Hypervisor indicates the type/form of associativity used via 
>> "ibm,architecture-vec-5 property".
>> +Bit 0 of byte 5 in the "ibm,architecture-vec-5" property indicates usage of 
>> Form 0 or Form 1.
>> +A value of 1 indicates the usage of Form 1 associativity. For Form 2 
>> associativity
>> +bit 2 of byte 5 in the "ibm,architecture-vec-5" property is used.
>> +
>> +Form 0
>> +------
>> +Form 0 associativity supports only two NUMA distances (LOCAL and REMOTE).
>> +
>> +Form 1
>> +------
>> +With Form 1 a combination of ibm,associativity-reference-points, and 
>> ibm,associativity
>> +device tree properties are used to determine the NUMA distance between 
>> resource groups/domains.
>> +
>> +The “ibm,associativity” property contains a list of one or more numbers 
>> (domainID)
>> +representing the resource’s platform grouping domains.
>> +
>> +The “ibm,associativity-reference-points” property contains a list of one or 
>> more numbers
>> +(domainID index) that represents the 1-based ordinal in the associativity 
>> lists.
>> +The list of domainID indexes represents an increasing hierarchy of resource 
>> grouping.
>> +
>> +ex:
>> +{ primary domainID index, secondary domainID index, tertiary domainID 
>> index.. }
>> +
>> +Linux kernel uses the domainID at the primary domainID index as the NUMA 
>> node id.
>> +Linux kernel computes NUMA distance between two domains by recursively 
>> comparing
>> +if they belong to the same higher-level domains. For mismatch at every 
>> higher
>> +level of the resource group, the kernel doubles the NUMA distance between 
>> the
>> +comparing domains.
>> +
>> +Form 2
>> +------
>> +Form 2 associativity format adds separate device tree properties 
>> representing NUMA node distance
>> +thereby making the node distance computation flexible. Form 2 also allows 
>> flexible primary
>> +domain numbering. With numa distance computation now detached from the 
>> index value in
>> +"ibm,associativity-reference-points" property, Form 2 allows a large number 
>> of primary domain
>> +ids at the same domainID index representing resource groups of different 
>> performance/latency
>> +characteristics.
>> +
>> +Hypervisor indicates the usage of FORM2 associativity using bit 2 of byte 5 
>> in the
>> +"ibm,architecture-vec-5" property.
>> +
>> +"ibm,numa-lookup-index-table" property contains a list of one or more 
>> numbers representing
>> +the domainIDs present in the system. The offset of the domainID in this 
>> property is
>> +used as an index while computing numa distance information via 
>> "ibm,numa-distance-table".
>> +
>> +prop-encoded-array: The number N of the domainIDs encoded as with 
>> encode-int, followed by
>> +N 

Re: [PATCH v5 5/6] powerpc/pseries: Add a helper for form1 cpu distance

2021-07-22 Thread Aneesh Kumar K.V
David Gibson  writes:

> On Mon, Jun 28, 2021 at 08:41:16PM +0530, Aneesh Kumar K.V wrote:
>> This helper is only used with the dispatch trace log collection.
>> A later patch will add Form2 affinity support and this change helps
>> in keeping that simpler. Also add a comment explaining we don't expect
>> the code to be called with FORM0
>> 
>> Reviewed-by: David Gibson 
>> Signed-off-by: Aneesh Kumar K.V 
>
> What makes it a "relative_distance" rather than just a "distance"?

I added that to indicate that the function is not returning the actual
distance but a number indicative of 'near', 'far' etc. (it actually returns
1, 2 etc).

>
>> ---
>>  arch/powerpc/include/asm/topology.h   |  4 ++--
>>  arch/powerpc/mm/numa.c| 10 +-
>>  arch/powerpc/platforms/pseries/lpar.c |  4 ++--
>>  3 files changed, 13 insertions(+), 5 deletions(-)
>> 
>> diff --git a/arch/powerpc/include/asm/topology.h 
>> b/arch/powerpc/include/asm/topology.h
>> index e4db64c0e184..ac8b5ed79832 100644
>> --- a/arch/powerpc/include/asm/topology.h
>> +++ b/arch/powerpc/include/asm/topology.h
>> @@ -36,7 +36,7 @@ static inline int pcibus_to_node(struct pci_bus *bus)
>>   cpu_all_mask : \
>>   cpumask_of_node(pcibus_to_node(bus)))
>>  
>> -extern int cpu_distance(__be32 *cpu1_assoc, __be32 *cpu2_assoc);
>> +int cpu_relative_distance(__be32 *cpu1_assoc, __be32 *cpu2_assoc);
>>  extern int __node_distance(int, int);
>>  #define node_distance(a, b) __node_distance(a, b)
>>  
>> @@ -83,7 +83,7 @@ static inline void sysfs_remove_device_from_node(struct 
>> device *dev,
>>  
>>  static inline void update_numa_cpu_lookup_table(unsigned int cpu, int node) 
>> {}
>>  
>> -static inline int cpu_distance(__be32 *cpu1_assoc, __be32 *cpu2_assoc)
>> +static inline int cpu_relative_distance(__be32 *cpu1_assoc, __be32 
>> *cpu2_assoc)
>>  {
>>  return 0;
>>  }
>> diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c
>> index 7b142f79d600..c6293037a103 100644
>> --- a/arch/powerpc/mm/numa.c
>> +++ b/arch/powerpc/mm/numa.c
>> @@ -166,7 +166,7 @@ static void unmap_cpu_from_node(unsigned long cpu)
>>  }
>>  #endif /* CONFIG_HOTPLUG_CPU || CONFIG_PPC_SPLPAR */
>>  
>> -int cpu_distance(__be32 *cpu1_assoc, __be32 *cpu2_assoc)
>> +static int __cpu_form1_relative_distance(__be32 *cpu1_assoc, __be32 
>> *cpu2_assoc)
>>  {
>>  int dist = 0;
>>  
>> @@ -182,6 +182,14 @@ int cpu_distance(__be32 *cpu1_assoc, __be32 *cpu2_assoc)
>>  return dist;
>>  }
>>  
>> +int cpu_relative_distance(__be32 *cpu1_assoc, __be32 *cpu2_assoc)
>> +{
>> +/* We should not get called with FORM0 */
>> +VM_WARN_ON(affinity_form == FORM0_AFFINITY);
>> +
>> +return __cpu_form1_relative_distance(cpu1_assoc, cpu2_assoc);
>> +}
>> +
>>  /* must hold reference to node during call */
>>  static const __be32 *of_get_associativity(struct device_node *dev)
>>  {
>> diff --git a/arch/powerpc/platforms/pseries/lpar.c 
>> b/arch/powerpc/platforms/pseries/lpar.c
>> index dab356e3ff87..afefbdfe768d 100644
>> --- a/arch/powerpc/platforms/pseries/lpar.c
>> +++ b/arch/powerpc/platforms/pseries/lpar.c
>> @@ -261,7 +261,7 @@ static int cpu_relative_dispatch_distance(int 
>> last_disp_cpu, int cur_disp_cpu)
>>  if (!last_disp_cpu_assoc || !cur_disp_cpu_assoc)
>>  return -EIO;
>>  
>> -return cpu_distance(last_disp_cpu_assoc, cur_disp_cpu_assoc);
>> +return cpu_relative_distance(last_disp_cpu_assoc, cur_disp_cpu_assoc);
>>  }
>>  
>>  static int cpu_home_node_dispatch_distance(int disp_cpu)
>> @@ -281,7 +281,7 @@ static int cpu_home_node_dispatch_distance(int disp_cpu)
>>  if (!disp_cpu_assoc || !vcpu_assoc)
>>  return -EIO;
>>  
>> -return cpu_distance(disp_cpu_assoc, vcpu_assoc);
>> +return cpu_relative_distance(disp_cpu_assoc, vcpu_assoc);
>>  }
>>  
>>  static void update_vcpu_disp_stat(int disp_cpu)
>
> -- 
> David Gibson  | I'll have my music baroque, and my code
> david AT gibson.dropbear.id.au| minimalist, thank you.  NOT _the_ _other_
>   | _way_ _around_!
> http://www.ozlabs.org/~dgibson


Re: [PATCH v5 4/6] powerpc/pseries: Consolidate different NUMA distance update code paths

2021-07-22 Thread Aneesh Kumar K.V
David Gibson  writes:

> On Mon, Jun 28, 2021 at 08:41:15PM +0530, Aneesh Kumar K.V wrote:
>> The associativity details of the newly added resources are collected from
>> the hypervisor via "ibm,configure-connector" rtas call. Update the numa
>> distance details of the newly added numa node after the above call.
>> 
>> Instead of updating NUMA distance every time we lookup a node id
>> from the associativity property, add helpers that can be used
>> during boot which does this only once. Also remove the distance
>> update from node id lookup helpers.
>> 
>> Signed-off-by: Aneesh Kumar K.V 
>> ---
>>  arch/powerpc/mm/numa.c| 173 +-
>>  arch/powerpc/platforms/pseries/hotplug-cpu.c  |   2 +
>>  .../platforms/pseries/hotplug-memory.c|   2 +
>>  arch/powerpc/platforms/pseries/pseries.h  |   1 +
>>  4 files changed, 132 insertions(+), 46 deletions(-)
>> 
>> diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c
>> index 0ec16999beef..7b142f79d600 100644
>> --- a/arch/powerpc/mm/numa.c
>> +++ b/arch/powerpc/mm/numa.c
>> @@ -208,22 +208,6 @@ int __node_distance(int a, int b)
>>  }
>>  EXPORT_SYMBOL(__node_distance);
>>  
>> -static void initialize_distance_lookup_table(int nid,
>> -const __be32 *associativity)
>> -{
>> -int i;
>> -
>> -if (affinity_form != FORM1_AFFINITY)
>> -return;
>> -
>> -for (i = 0; i < max_associativity_domain_index; i++) {
>> -const __be32 *entry;
>> -
>> -entry = &associativity[be32_to_cpu(distance_ref_points[i]) - 1];
>> -distance_lookup_table[nid][i] = of_read_number(entry, 1);
>> -}
>> -}
>> -
>>  /*
>>   * Returns nid in the range [0..nr_node_ids], or -1 if no useful NUMA
>>   * info is found.
>> @@ -241,15 +225,6 @@ static int associativity_to_nid(const __be32 
>> *associativity)
>> /* POWER4 LPAR uses 0xffff as invalid node */
>> if (nid == 0xffff || nid >= nr_node_ids)
>>  nid = NUMA_NO_NODE;
>> -
>> -if (nid > 0 &&
>> -of_read_number(associativity, 1) >= 
>> max_associativity_domain_index) {
>> -/*
>> - * Skip the length field and send start of associativity array
>> - */
>> -initialize_distance_lookup_table(nid, associativity + 1);
>> -}
>> -
>>  out:
>>  return nid;
>>  }
>> @@ -287,6 +262,49 @@ int of_node_to_nid(struct device_node *device)
>>  }
>>  EXPORT_SYMBOL(of_node_to_nid);
>>  
>> +static void __initialize_form1_numa_distance(const __be32 *associativity)
>> +{
>> +int i, nid;
>> +
>> +if (affinity_form != FORM1_AFFINITY)
>
> Since this shouldn't be called on a !form1 system, this could be a WARN_ON().

The way we call functions currently, instead of doing

if (affinity_form == FORM1_AFFINITY)
__initialize_form1_numa_distance()

We avoid doing the if check in multiple places. For example,
parse_numa_properties() fetches the associativity array to find the
details of an online node and set it online. We use the same code path
to initialize the distance.

if (__vphn_get_associativity(i, vphn_assoc) == 0) {
nid = associativity_to_nid(vphn_assoc);
__initialize_form1_numa_distance(vphn_assoc);
} else {

cpu = of_get_cpu_node(i, NULL);
BUG_ON(!cpu);

associativity = of_get_associativity(cpu);
if (associativity) {
nid = associativity_to_nid(associativity);
__initialize_form1_numa_distance(associativity);
}

We avoid the if (affinity_form == FORM1_AFFINITY) check there by
moving the check inside __initialize_form1_numa_distance().


>
>> +return;
>> +
>> +if (of_read_number(associativity, 1) >= primary_domain_index) {
>> +nid = of_read_number(&associativity[primary_domain_index], 1);
>
> This computes the nid from the assoc array independently of
> associativity_to_nid, which doesn't seem like a good idea.  Wouldn't
> it be better to call assocaitivity_to_nid(), then make the next bit
> conditional on nid !== NUMA_NO_NODE?

@@ -302,9 +302,8 @@ static void __initialize_form1_numa_distance(const __be32 
*associativity)
if (affinity_form != FORM1_AFFINITY)
return;
 
-   if (of_read_number(associativity, 1) >= primary_domain_index) {
-   nid = of_read_number(&associativity[primary_domain_index], 1);
-
+   nid = associativity_to_nid(associativity);
+   if (nid != NUMA_NO_NODE) {
for (i = 0; i < distance_ref_points_depth; i++) {
const __be32 *entry;
 

>
>> +
>> +for (i = 0; i < max_associativity_domain_index; i++) {
>> +const __be32 *entry;
>> +
>> +entry = &associativity[be32_to_cpu(distance_ref_points[i])];
>> +distance_lookup_table[nid][i] =