[PATCH] powerpc/chrp: Revert "Move PHB discovery" and "Make hydra_init() static"

2021-07-16 Thread Guenter Roeck
This patch reverts commit 407d418f2fd4 ("powerpc/chrp: Move PHB
discovery") and commit 9634afa67bfd ("powerpc/chrp: Make hydra_init()
static").

Running the upstream kernel on Qemu's brand new "pegasos2" emulation
results in a variety of backtraces such as

Kernel attempted to write user page (a1) - exploit attempt? (uid: 0)
------------[ cut here ]------------
Bug: Write fault blocked by KUAP!
WARNING: CPU: 0 PID: 0 at arch/powerpc/mm/fault.c:230 do_page_fault+0x4f4/0x920
CPU: 0 PID: 0 Comm: swapper Not tainted 5.13.2 #40
NIP:  c0021824 LR: c0021824 CTR: 
REGS: c1085d50 TRAP: 0700   Not tainted  (5.13.2)
MSR:  00021032   CR: 24042254  XER: 

GPR00: c0021824 c1085e10 c0f8c520 0021 3fffefff c1085c60 c1085c58 
GPR08: 1032   c0ffb3ec 44042254   0004
GPR16:   00c4 00d0 0188c6e0 01006000 0001 40b14000
GPR24: c0ec000c 0300 0200  4200 00a1  c1085e60
NIP [c0021824] do_page_fault+0x4f4/0x920
LR [c0021824] do_page_fault+0x4f4/0x920
Call Trace:
[c1085e10] [c0021824] do_page_fault+0x4f4/0x920 (unreliable)
[c1085e50] [c0004254] DataAccess_virt+0xd4/0xe4

and the system fails to boot. Bisect points to commit 407d418f2fd4
("powerpc/chrp: Move PHB discovery"). Reverting this patch together with
commit 9634afa67bfd ("powerpc/chrp: Make hydra_init() static") fixes
the problem.

Cc: Oliver O'Halloran 
Cc: Geert Uytterhoeven 
Fixes: 407d418f2fd4 ("powerpc/chrp: Move PHB discovery")
Signed-off-by: Guenter Roeck 
---
 arch/powerpc/include/asm/hydra.h|  2 ++
 arch/powerpc/platforms/chrp/pci.c   | 11 ++-
 arch/powerpc/platforms/chrp/setup.c | 12 +++-
 3 files changed, 15 insertions(+), 10 deletions(-)

diff --git a/arch/powerpc/include/asm/hydra.h b/arch/powerpc/include/asm/hydra.h
index d024447283a0..ae02eb53d6ef 100644
--- a/arch/powerpc/include/asm/hydra.h
+++ b/arch/powerpc/include/asm/hydra.h
@@ -94,6 +94,8 @@ extern volatile struct Hydra __iomem *Hydra;
 #define HYDRA_INT_EXT7 18  /* Power Off Request */
#define HYDRA_INT_SPARE	19
 
+extern int hydra_init(void);
+
 #endif /* __KERNEL__ */
 
 #endif /* _ASMPPC_HYDRA_H */
diff --git a/arch/powerpc/platforms/chrp/pci.c b/arch/powerpc/platforms/chrp/pci.c
index 76e6256cb0a7..b2c2bf35b76c 100644
--- a/arch/powerpc/platforms/chrp/pci.c
+++ b/arch/powerpc/platforms/chrp/pci.c
@@ -131,7 +131,8 @@ static struct pci_ops rtas_pci_ops =
 
 volatile struct Hydra __iomem *Hydra = NULL;
 
-static int __init hydra_init(void)
+int __init
+hydra_init(void)
 {
struct device_node *np;
struct resource r;
@@ -313,14 +314,6 @@ chrp_find_bridges(void)
}
}
of_node_put(root);
-
-   /*
-*  "Temporary" fixes for PCI devices.
-*  -- Geert
-*/
-   hydra_init();   /* Mac I/O */
-
-   pci_create_OF_bus_map();
 }
 
 /* SL82C105 IDE Control/Status Register */
diff --git a/arch/powerpc/platforms/chrp/setup.c b/arch/powerpc/platforms/chrp/setup.c
index 3cfc382841e5..c45435aa5e36 100644
--- a/arch/powerpc/platforms/chrp/setup.c
+++ b/arch/powerpc/platforms/chrp/setup.c
@@ -334,11 +334,22 @@ static void __init chrp_setup_arch(void)
/* On pegasos, enable the L2 cache if not already done by OF */
pegasos_set_l2cr();
 
+   /* Lookup PCI host bridges */
+   chrp_find_bridges();
+
+   /*
+*  Temporary fixes for PCI devices.
+*  -- Geert
+*/
+   hydra_init();   /* Mac I/O */
+
/*
 *  Fix the Super I/O configuration
 */
sio_init();
 
+   pci_create_OF_bus_map();
+
/*
 * Print the banner, then scroll down so boot progress
 * can be printed.  -- Cort
@@ -571,7 +582,6 @@ define_machine(chrp) {
.name   = "CHRP",
.probe  = chrp_probe,
.setup_arch = chrp_setup_arch,
-   .discover_phbs  = chrp_find_bridges,
.init   = chrp_init2,
.show_cpuinfo   = chrp_show_cpuinfo,
.init_IRQ   = chrp_init_IRQ,
-- 
2.25.1



[PATCH] ibmvfc: fix command state accounting and stale response detection

2021-07-16 Thread Tyrel Datwyler
Prior to commit 1f4a4a19508d ("scsi: ibmvfc: Complete commands outside
the host/queue lock") responses to commands were completed sequentially
with the host lock held, such that a command had a basic binary state of
active or free. It was therefore a simple affair of ensuring the
associated ibmvfc_event for a VIOS response was valid by testing that it
was not already free. The lock relaxation work to complete commands
outside the lock inadvertently made the command state trinary, such that
a command is either in flight, received and being completed, or
completed and now free. This breaks the stale command detection logic, as
a command may still be marked active and have been placed on the delayed
completion list when a second stale response for the same command
arrives. This can lead to double completions and list corruption. This
issue was exposed by a recent VIOS regression where a missing memory
barrier could occasionally result in the ibmvfc client receiving a
duplicate response for the same command.

Fix the issue by introducing the atomic ibmvfc_event.active to track the
trinary state of a command. The state is explicitly set to 1 when a
command is successfully sent. The CRQ response handlers use
atomic_dec_if_positive() to test for stale responses and correctly
transition to the completion state when an active command is received.
Finally, atomic_dec_and_test() is used to sanity check transitions
when commands are freed as a result of a completion, or moved to the
purge list as a result of error handling or adapter reset.
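
For illustration, a minimal sketch of the resulting state machine (the
transitions mirror the hunks below; locking and the surrounding driver
plumbing are omitted):

	/* evt->active: 1 = in flight, 0 = being completed, -1 = free */
	atomic_set(&evt->active, -1);	/* pool init: event starts free */

	atomic_set(&evt->active, 1);	/* send path: command queued to the VIOS */

	/*
	 * Response path: only the first response wins the 1 -> 0 transition.
	 * For any other state the decrement is skipped and a non-zero value
	 * is returned, flagging the response as a stale duplicate.
	 */
	if (unlikely(atomic_dec_if_positive(&evt->active)))
		return;

	/*
	 * Free path: the state must be 0 here; decrementing to -1 marks the
	 * event free, while reaching 0 would mean freeing an in-flight event.
	 */
	BUG_ON(atomic_dec_and_test(&evt->active));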

Cc: stable@vger.kernel.org
Fixes: 1f4a4a19508d ("scsi: ibmvfc: Complete commands outside the host/queue lock")
Signed-off-by: Tyrel Datwyler 
---
 drivers/scsi/ibmvscsi/ibmvfc.c | 19 +--
 drivers/scsi/ibmvscsi/ibmvfc.h |  1 +
 2 files changed, 18 insertions(+), 2 deletions(-)

diff --git a/drivers/scsi/ibmvscsi/ibmvfc.c b/drivers/scsi/ibmvscsi/ibmvfc.c
index bee1bec49c09..935b01ee44b7 100644
--- a/drivers/scsi/ibmvscsi/ibmvfc.c
+++ b/drivers/scsi/ibmvscsi/ibmvfc.c
@@ -807,6 +807,13 @@ static int ibmvfc_init_event_pool(struct ibmvfc_host *vhost,
for (i = 0; i < size; ++i) {
		struct ibmvfc_event *evt = &pool->events[i];
 
+   /*
+* evt->active states
+*  1 = in flight
+*  0 = being completed
+* -1 = free/freed
+*/
+   atomic_set(>active, -1);
atomic_set(>free, 1);
evt->crq.valid = 0x80;
		evt->crq.ioba = cpu_to_be64(pool->iu_token + (sizeof(*evt->xfer_iu) * i));
@@ -1017,6 +1024,7 @@ static void ibmvfc_free_event(struct ibmvfc_event *evt)
 
BUG_ON(!ibmvfc_valid_event(pool, evt));
	BUG_ON(atomic_inc_return(&evt->free) != 1);
+	BUG_ON(atomic_dec_and_test(&evt->active));

	spin_lock_irqsave(&evt->queue->l_lock, flags);
	list_add_tail(&evt->queue_list, &evt->queue->free);
@@ -1072,6 +1080,12 @@ static void ibmvfc_complete_purge(struct list_head *purge_list)
  **/
 static void ibmvfc_fail_request(struct ibmvfc_event *evt, int error_code)
 {
+   /*
+* Anything we are failing should still be active. Otherwise, it
+* implies we already got a response for the command and are doing
+* something bad like double completing it.
+*/
+	BUG_ON(!atomic_dec_and_test(&evt->active));
if (evt->cmnd) {
evt->cmnd->result = (error_code << 16);
evt->done = ibmvfc_scsi_eh_done;
@@ -1723,6 +1737,7 @@ static int ibmvfc_send_event(struct ibmvfc_event *evt,
 
evt->done(evt);
} else {
+		atomic_set(&evt->active, 1);
		spin_unlock_irqrestore(&evt->queue->l_lock, flags);
ibmvfc_trc_start(evt);
}
@@ -3251,7 +3266,7 @@ static void ibmvfc_handle_crq(struct ibmvfc_crq *crq, struct ibmvfc_host *vhost,
return;
}
 
-	if (unlikely(atomic_read(&evt->free))) {
+	if (unlikely(atomic_dec_if_positive(&evt->active))) {
		dev_err(vhost->dev, "Received duplicate correlation_token 0x%08llx!\n",
			crq->ioba);
return;
@@ -3778,7 +3793,7 @@ static void ibmvfc_handle_scrq(struct ibmvfc_crq *crq, struct ibmvfc_host *vhost
return;
}
 
-	if (unlikely(atomic_read(&evt->free))) {
+	if (unlikely(atomic_dec_if_positive(&evt->active))) {
		dev_err(vhost->dev, "Received duplicate correlation_token 0x%08llx!\n",
			crq->ioba);
return;
diff --git a/drivers/scsi/ibmvscsi/ibmvfc.h b/drivers/scsi/ibmvscsi/ibmvfc.h
index 4f0f3baefae4..92fb889d7eb0 100644
--- a/drivers/scsi/ibmvscsi/ibmvfc.h
+++ b/drivers/scsi/ibmvscsi/ibmvfc.h
@@ -745,6 +745,7 @@ struct ibmvfc_event {
struct ibmvfc_target *tgt;
struct scsi_cmnd *cmnd;
atomic_t free;
+   atomic_t active;
union ibmvfc_iu *xfer_iu;
void (*done)(struct ibmvfc_event *evt);
  

Re: [PATCH v4 1/1] powerpc/pseries: Interface to represent PAPR firmware attributes

2021-07-16 Thread Fabiano Rosas
"Pratik R. Sampat"  writes:

> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 

Do you need all of these headers? Sorry to mention just now, I seem to have
dropped this comment from a previous review.

> +
> +#include "pseries.h"
> +
> +/*
> + * Flag attributes to fetch either all or one attribute from the HCALL
> + * flag = BE(0) => fetch all attributes with firstAttributeId = 0
> + * flag = BE(1) => fetch a single attribute with firstAttributeId = id
> + */
> +#define ESI_FLAGS_ALL		0
> +#define ESI_FLAGS_SINGLE PPC_BIT(0)
> +
> +#define MAX_ATTRS		3
> +
> +struct papr_attr {
> + u64 id;
> + struct kobj_attribute kobj_attr;
> +};
> +struct papr_group {
> + struct attribute_group pg;
> + struct papr_attr pgattrs[MAX_ATTRS];
> +} *pgs;
> +
> +/* /sys/firmware/papr */
> +struct kobject *papr_kobj;
> +/* /sys/firmware/papr/energy_scale_info */
> +struct kobject *esi_kobj;
> +
> +/*
> + * Extract and export the description of the energy scale attribute
> + *

Extra line here.

> + */
> +static ssize_t papr_show_desc(struct kobject *kobj,
> +struct kobj_attribute *kobj_attr,
> +char *buf)
> +{
> + struct papr_attr *pattr = container_of(kobj_attr, struct papr_attr,
> +kobj_attr);
> + struct h_energy_scale_info_hdr *t_hdr;
> + struct energy_scale_attribute *t_esi;
> + char *t_buf;
> + int ret = 0;
> +
> + t_buf = kmalloc(MAX_BUF_SZ, GFP_KERNEL);
> + if (t_buf == NULL)
> + return -ENOMEM;
> +
> + ret = plpar_hcall_norets(H_GET_ENERGY_SCALE_INFO, ESI_FLAGS_SINGLE,
> +  pattr->id, virt_to_phys(t_buf),
> +  MAX_BUF_SZ);
> +
> + if (ret != H_SUCCESS) {
> + pr_warn("hcall failed: H_GET_ENERGY_SCALE_INFO");
> + goto out;
> + }
> +
> + t_hdr = (struct h_energy_scale_info_hdr *) t_buf;
> + t_esi = (struct energy_scale_attribute *)
> + (t_buf + be64_to_cpu(t_hdr->array_offset));
> +
> + ret = snprintf(buf, sizeof(t_esi->desc), "%s\n", t_esi->desc);
> + if (ret < 0)
> + ret = -EIO;
> +out:
> + kfree(t_buf);
> +
> + return ret;
> +}
> +
> +/*
> + * Extract and export the numeric value of the energy scale attributes
> + */
> +static ssize_t papr_show_value(struct kobject *kobj,
> + struct kobj_attribute *kobj_attr,
> + char *buf)
> +{
> + struct papr_attr *pattr = container_of(kobj_attr, struct papr_attr,
> +kobj_attr);
> + struct h_energy_scale_info_hdr *t_hdr;
> + struct energy_scale_attribute *t_esi;
> + char *t_buf;
> + int ret = 0;
> +
> + t_buf = kmalloc(MAX_BUF_SZ, GFP_KERNEL);
> + if (t_buf == NULL)
> + return -ENOMEM;
> +
> + ret = plpar_hcall_norets(H_GET_ENERGY_SCALE_INFO, ESI_FLAGS_SINGLE,
> +  pattr->id, virt_to_phys(t_buf),
> +  MAX_BUF_SZ);
> +
> + if (ret != H_SUCCESS) {
> + pr_warn("hcall failed: H_GET_ENERGY_SCALE_INFO");
> + goto out;
> + }
> +
> + t_hdr = (struct h_energy_scale_info_hdr *) t_buf;
> + t_esi = (struct energy_scale_attribute *)
> + (t_buf + be64_to_cpu(t_hdr->array_offset));
> +
> + ret = snprintf(buf, sizeof(t_esi->value), "%llu\n",
> +be64_to_cpu(t_esi->value));
> + if (ret < 0)
> + ret = -EIO;
> +out:
> + kfree(t_buf);
> +
> + return ret;
> +}
> +
> +/*
> + * Extract and export the value description in string format of the energy
> + * scale attributes
> + */
> +static ssize_t papr_show_value_desc(struct kobject *kobj,
> +  struct kobj_attribute *kobj_attr,
> +  char *buf)
> +{
> + struct papr_attr *pattr = container_of(kobj_attr, struct papr_attr,
> +kobj_attr);
> + struct h_energy_scale_info_hdr *t_hdr;
> + struct energy_scale_attribute *t_esi;
> + char *t_buf;
> + int ret = 0;
> +
> + t_buf = kmalloc(MAX_BUF_SZ, GFP_KERNEL);
> + if (t_buf == NULL)
> + return -ENOMEM;
> +
> + ret = plpar_hcall_norets(H_GET_ENERGY_SCALE_INFO, ESI_FLAGS_SINGLE,
> +  pattr->id, virt_to_phys(t_buf),
> +  MAX_BUF_SZ);
> +
> + if (ret != H_SUCCESS) {
> + pr_warn("hcall failed: H_GET_ENERGY_SCALE_INFO");
> + goto out;
> + }
> +
> + t_hdr = (struct h_energy_scale_info_hdr *) t_buf;
> + t_esi = (struct energy_scale_attribute *)
> + (t_buf + 

Re: [PATCH v1 16/16] dma-mapping: Disallow .map_sg operations from returning zero on error

2021-07-16 Thread Logan Gunthorpe



On 2021-07-16 12:33 a.m., Christoph Hellwig wrote:
> On Thu, Jul 15, 2021 at 10:45:44AM -0600, Logan Gunthorpe wrote:
>> @@ -194,6 +194,8 @@ static int __dma_map_sg_attrs(struct device *dev, struct scatterlist *sg,
>>  else
>>  ents = ops->map_sg(dev, sg, nents, dir, attrs);
>>  
>> +WARN_ON_ONCE(ents == 0);
> 
> Turns this into a negative error code while we're at it, just to keep
> the callers sane?
> 

Sure thing. All the feedback makes sense, we'll fix it up and send a v2
in due course.

Thanks,

Logan


[PATCH v4 1/1] powerpc/pseries: Interface to represent PAPR firmware attributes

2021-07-16 Thread Pratik R. Sampat
Adds a generic interface to represent the energy and frequency related
PAPR attributes on the system using the new H_CALL
"H_GET_ENERGY_SCALE_INFO".

The H_GET_EM_PARMS H_CALL was previously responsible for exporting this
information in the lparcfg; however, the H_GET_EM_PARMS H_CALL
will be deprecated from P10 onwards.

The H_GET_ENERGY_SCALE_INFO H_CALL is of the following call format:
hcall(
  uint64 H_GET_ENERGY_SCALE_INFO,  // Get energy scale info
  uint64 flags,   // Per the flag request
  uint64 firstAttributeId,// The attribute id
  uint64 bufferAddress,   // Guest physical address of the output buffer
  uint64 bufferSize   // The size in bytes of the output buffer
);

This H_CALL can query either all the attributes at once with
firstAttributeId = 0, flags = 0, or query only one attribute
at a time with firstAttributeId = id, flags = 1.

The output buffer consists of the following
1. number of attributes  - 8 bytes
2. array offset to the data location - 8 bytes
3. version info  - 1 byte
4. A data array of size num attributes, which contains the following:
  a. attribute ID  - 8 bytes
  b. attribute value in number - 8 bytes
  c. attribute name in string  - 64 bytes
  d. attribute value in string - 64 bytes
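
As a rough C sketch of that layout (the struct names match the patch; the
field names and packing here are illustrative, inferred from the list
above and from the accessors used in the code, e.g. t_hdr->array_offset,
t_esi->value, t_esi->desc):

	struct h_energy_scale_info_hdr {
		__be64 num_attrs;		/* 1. number of attributes */
		__be64 array_offset;		/* 2. offset to the data array */
		u8 data_header_version;		/* 3. version info */
	} __packed;

	struct energy_scale_attribute {
		__be64 id;			/* a. attribute ID */
		__be64 value;			/* b. attribute value in number */
		u8 desc[64];			/* c. attribute name in string */
		u8 value_desc[64];		/* d. attribute value in string */
	} __packed;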

The new H_CALL exports information in direct string value format, hence
a new interface has been introduced in
/sys/firmware/papr/energy_scale_info to export this information to
userspace in an extensible pass-through format.

The H_CALL returns the name, numeric value, and string value (if it exists).

The format of exposing the sysfs information is as follows:
/sys/firmware/papr/energy_scale_info/
   |-- <id>/
     |-- desc
     |-- value
     |-- value_desc (if exists)
   |-- <id>/
     |-- desc
     |-- value
     |-- value_desc (if exists)
...

The energy information that is exported is useful for userspace tools
such as powerpc-utils. Currently these tools infer the
"power_mode_data" value in the lparcfg, which in turn is obtained from
the soon-to-be-deprecated H_GET_EM_PARMS H_CALL.
On future platforms, such userspace utilities will have to look at the
data returned from the new H_CALL being populated in this new sysfs
interface and report this information directly, without the need for
interpretation.
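
As a minimal userspace sketch of that pass-through consumption (the
attribute id "1234" is a made-up placeholder; real ids come from listing
the energy_scale_info directory):

	#include <stdio.h>

	int main(void)
	{
		char buf[64];
		/* hypothetical attribute id directory */
		FILE *f = fopen("/sys/firmware/papr/energy_scale_info/1234/value", "r");

		if (!f)
			return 1;
		if (fgets(buf, sizeof(buf), f))
			printf("value: %s", buf);	/* numeric value, newline terminated */
		fclose(f);
		return 0;
	}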

Signed-off-by: Pratik R. Sampat 
Reviewed-by: Gautham R. Shenoy 
---
 .../sysfs-firmware-papr-energy-scale-info |  26 ++
 arch/powerpc/include/asm/hvcall.h |  24 +-
 arch/powerpc/kvm/trace_hv.h   |   1 +
 arch/powerpc/platforms/pseries/Makefile   |   3 +-
 .../pseries/papr_platform_attributes.c| 331 ++
 5 files changed, 383 insertions(+), 2 deletions(-)
 create mode 100644 Documentation/ABI/testing/sysfs-firmware-papr-energy-scale-info
 create mode 100644 arch/powerpc/platforms/pseries/papr_platform_attributes.c

diff --git a/Documentation/ABI/testing/sysfs-firmware-papr-energy-scale-info b/Documentation/ABI/testing/sysfs-firmware-papr-energy-scale-info
new file mode 100644
index ..139a576c7c9d
--- /dev/null
+++ b/Documentation/ABI/testing/sysfs-firmware-papr-energy-scale-info
@@ -0,0 +1,26 @@
+What:  /sys/firmware/papr/energy_scale_info
+Date:  June 2021
+Contact:   Linux for PowerPC mailing list <linuxppc-dev@lists.ozlabs.org>
+Description:   Directory hosting a set of platform attributes like
+   energy/frequency on Linux running as a PAPR guest.
+
+   Each file in a directory contains a platform
+   attribute hierarchy pertaining to performance/
+   energy-savings mode and processor frequency.
+
+What:  /sys/firmware/papr/energy_scale_info/<id>
+   /sys/firmware/papr/energy_scale_info/<id>/desc
+   /sys/firmware/papr/energy_scale_info/<id>/value
+   /sys/firmware/papr/energy_scale_info/<id>/value_desc
+Date:  June 2021
+Contact:   Linux for PowerPC mailing list <linuxppc-dev@lists.ozlabs.org>
+Description:   Energy, frequency attributes directory for POWERVM servers
+
+   This directory provides energy, frequency, and folding
+   information. It contains the below sysfs attributes:
+
+   - desc: String description of the attribute <id>
+
+   - value: Numeric value of attribute <id>
+
+   - value_desc: String value of attribute <id>
diff --git a/arch/powerpc/include/asm/hvcall.h b/arch/powerpc/include/asm/hvcall.h
index e3b29eda8074..c91714ea6719 100644
--- a/arch/powerpc/include/asm/hvcall.h
+++ b/arch/powerpc/include/asm/hvcall.h
@@ -316,7 +316,8 @@
 #define H_SCM_PERFORMANCE_STATS 0x418
 #define H_RPT_INVALIDATE   0x448
 #define H_SCM_FLUSH0x44C
-#define MAX_HCALL_OPCODE   H_SCM_FLUSH
+#define H_GET_ENERGY_SCALE_INFO0x450
+#define MAX_HCALL_OPCODE   H_GET_ENERGY_SCALE_INFO
 
 /* Scope args for H_SCM_UNBIND_ALL */
 #define H_UNBIND_SCOPE_ALL (0x1)
@@ -631,6 +632,27 @@ struct hv_gpci_request_buffer {
uint8_t bytes[HGPCI_MAX_DATA_BYTES];
 } __packed;
 

[PATCH v4 0/1] Interface to represent PAPR firmware attributes

2021-07-16 Thread Pratik R. Sampat
RFC: https://lkml.org/lkml/2021/6/4/791
PATCH v1: https://lkml.org/lkml/2021/6/16/805
PATCH v2: https://lkml.org/lkml/2021/7/6/138
PATCH v3: https://lkml.org/lkml/2021/7/12/2799 

Changelog v3 --> v4
Based on a comment from Fabiano:
1. Resolved typo in documentation
2. Statically allocate "pgattrs" instead of kmalloc as size known at
   compile time
3. Converted sprintf calls to snprintf as size of buffer is known
4. Changed the implementation of "papr_show_desc" function to make an
   extra H_CALL as a result of making the scope of esi_hdr and
   esi_attrs local
5. Removed bailing out on version mismatch as the documentation states
   that the attribute structure can never change, only the data and that
   will not break the implementation. Hence just a warning is issued on
   version mismatch
6. Avoid passing a statically allocated buf to "pgs[idx].pg.name",
   instead used kasprintf to allocate and pass the data through

Also, have implemented a POC using this interface for the powerpc-utils'
ppc64_cpu --frequency command-line tool to utilize this information
in userspace.
The POC for the new interface has been hosted here:
https://github.com/pratiksampat/powerpc-utils/tree/H_GET_ENERGY_SCALE_INFO_v2

Sample output from the powerpc-utils tool is as follows:

# ppc64_cpu --frequency
Power and Performance Mode: 
Idle Power Saver Status   : 
Processor Folding Status  :  --> Printed if Idle power save status is supported

Platform reported frequencies --> Frequencies reported from the platform's H_CALL, i.e. the PAPR interface
min: GHz
max: GHz
static : GHz

Tool Computed frequencies
min: GHz (cpu XX)
max: GHz (cpu XX)
avg: GHz

Pratik R. Sampat (1):
  powerpc/pseries: Interface to represent PAPR firmware attributes

 .../sysfs-firmware-papr-energy-scale-info |  26 ++
 arch/powerpc/include/asm/hvcall.h |  24 +-
 arch/powerpc/kvm/trace_hv.h   |   1 +
 arch/powerpc/platforms/pseries/Makefile   |   3 +-
 .../pseries/papr_platform_attributes.c| 331 ++
 5 files changed, 383 insertions(+), 2 deletions(-)
 create mode 100644 
Documentation/ABI/testing/sysfs-firmware-papr-energy-scale-info
 create mode 100644 arch/powerpc/platforms/pseries/papr_platform_attributes.c

-- 
2.31.1



Re: [PATCH v1 16/16] dma-mapping: Disallow .map_sg operations from returning zero on error

2021-07-16 Thread Robin Murphy

On 2021-07-16 07:33, Christoph Hellwig wrote:

On Thu, Jul 15, 2021 at 10:45:44AM -0600, Logan Gunthorpe wrote:

@@ -194,6 +194,8 @@ static int __dma_map_sg_attrs(struct device *dev, struct scatterlist *sg,
else
ents = ops->map_sg(dev, sg, nents, dir, attrs);
  
+	WARN_ON_ONCE(ents == 0);


Turns this into a negative error code while we're at it, just to keep
the callers sane?


Right, by this point returning the 0 would pass through 
dma_map_sg_attrs() OK, but AFAICS dma_map_sgtable() would now get 
confused and return success but with sgt->nents = 0. Coercing it to an 
error code (which dma_map_sg_attrs() would then just change right back) 
seems sensible for the sake of easy robustness.


Robin.


Re: [PATCH v1 14/16] x86/amd_gart: return error code from gart_map_sg()

2021-07-16 Thread Robin Murphy

On 2021-07-16 07:32, Christoph Hellwig wrote:

On Thu, Jul 15, 2021 at 10:45:42AM -0600, Logan Gunthorpe wrote:

@@ -458,7 +460,7 @@ static int gart_map_sg(struct device *dev, struct scatterlist *sg, int nents,
iommu_full(dev, pages << PAGE_SHIFT, dir);
for_each_sg(sg, s, nents, i)
s->dma_address = DMA_MAPPING_ERROR;
-   return 0;
+   return ret;


While we're at it - setting the ->dma_address to DMA_MAPPING_ERROR is
not part of the ->map_sg calling convention.  Might be worth to fix
up while we're at it.


Especially since it's not even zeroing dma_length, which at least is a 
documented indicator of validity (even if it's not strictly defined for 
failure cases either).


Robin.


[PATCH v5 11/11] powerpc/pseries/iommu: Rename "direct window" to "dma window"

2021-07-16 Thread Leonardo Bras
A previous change introduced the usage of DDW as a bigger indirect DMA
mapping when the DDW available size does not map the whole partition.

As most of the code that manipulates direct mappings was reused for
indirect mappings, it's necessary to rename those identifiers and
debug/info messages to reflect that they can be used for both kinds of
mapping.

This should cause no behavioural change, just adjust naming.

Signed-off-by: Leonardo Bras 
---
 arch/powerpc/platforms/pseries/iommu.c | 91 +-
 1 file changed, 47 insertions(+), 44 deletions(-)

diff --git a/arch/powerpc/platforms/pseries/iommu.c b/arch/powerpc/platforms/pseries/iommu.c
index a67e71c49aeb..52548dfb8b45 100644
--- a/arch/powerpc/platforms/pseries/iommu.c
+++ b/arch/powerpc/platforms/pseries/iommu.c
@@ -349,7 +349,7 @@ struct dynamic_dma_window_prop {
__be32  window_shift;   /* ilog2(tce_window_size) */
 };
 
-struct direct_window {
+struct dma_win {
struct device_node *device;
const struct dynamic_dma_window_prop *prop;
struct list_head list;
@@ -369,11 +369,11 @@ struct ddw_create_response {
u32 addr_lo;
 };
 
-static LIST_HEAD(direct_window_list);
+static LIST_HEAD(dma_win_list);
 /* prevents races between memory on/offline and window creation */
-static DEFINE_SPINLOCK(direct_window_list_lock);
+static DEFINE_SPINLOCK(dma_win_list_lock);
 /* protects initializing window twice for same device */
-static DEFINE_MUTEX(direct_window_init_mutex);
+static DEFINE_MUTEX(dma_win_init_mutex);
 #define DIRECT64_PROPNAME "linux,direct64-ddr-window-info"
 #define DMA64_PROPNAME "linux,dma64-ddr-window-info"
 
@@ -713,7 +713,10 @@ static void pci_dma_bus_setup_pSeriesLP(struct pci_bus *bus)
pr_debug("pci_dma_bus_setup_pSeriesLP: setting up bus %pOF\n",
 dn);
 
-   /* Find nearest ibm,dma-window, walking up the device tree */
+   /*
+* Find nearest ibm,dma-window (default DMA window), walking up the
+* device tree
+*/
for (pdn = dn; pdn != NULL; pdn = pdn->parent) {
dma_window = of_get_property(pdn, "ibm,dma-window", NULL);
if (dma_window != NULL)
@@ -822,11 +825,11 @@ static void remove_dma_window(struct device_node *np, u32 *ddw_avail,
 
ret = rtas_call(ddw_avail[DDW_REMOVE_PE_DMA_WIN], 1, 1, NULL, liobn);
if (ret)
-   pr_warn("%pOF: failed to remove direct window: rtas returned "
+   pr_warn("%pOF: failed to remove DMA window: rtas returned "
"%d to ibm,remove-pe-dma-window(%x) %llx\n",
np, ret, ddw_avail[DDW_REMOVE_PE_DMA_WIN], liobn);
else
-   pr_debug("%pOF: successfully removed direct window: rtas 
returned "
+   pr_debug("%pOF: successfully removed DMA window: rtas returned "
"%d to ibm,remove-pe-dma-window(%x) %llx\n",
np, ret, ddw_avail[DDW_REMOVE_PE_DMA_WIN], liobn);
 }
@@ -854,37 +857,37 @@ static int remove_ddw(struct device_node *np, bool remove_prop, const char *win_
 
ret = of_remove_property(np, win);
if (ret)
-   pr_warn("%pOF: failed to remove direct window property: %d\n",
+   pr_warn("%pOF: failed to remove DMA window property: %d\n",
np, ret);
return 0;
 }
 
 static bool find_existing_ddw(struct device_node *pdn, u64 *dma_addr, int *window_shift)
 {
-   struct direct_window *window;
-   const struct dynamic_dma_window_prop *direct64;
+   struct dma_win *window;
+   const struct dynamic_dma_window_prop *dma64;
bool found = false;
 
-	spin_lock(&direct_window_list_lock);
+	spin_lock(&dma_win_list_lock);
	/* check if we already created a window and dupe that config if so */
-	list_for_each_entry(window, &direct_window_list, list) {
+	list_for_each_entry(window, &dma_win_list, list) {
if (window->device == pdn) {
-   direct64 = window->prop;
-   *dma_addr = be64_to_cpu(direct64->dma_base);
-   *window_shift = be32_to_cpu(direct64->window_shift);
+   dma64 = window->prop;
+   *dma_addr = be64_to_cpu(dma64->dma_base);
+   *window_shift = be32_to_cpu(dma64->window_shift);
found = true;
break;
}
}
-	spin_unlock(&direct_window_list_lock);
+	spin_unlock(&dma_win_list_lock);
 
return found;
 }
 
-static struct direct_window *ddw_list_new_entry(struct device_node *pdn,
-						const struct dynamic_dma_window_prop *dma64)
+static struct dma_win *ddw_list_new_entry(struct device_node *pdn,
+					  const struct dynamic_dma_window_prop *dma64)
 {
-   struct direct_window *window;
+   struct dma_win *window;
 
window = 

[PATCH v5 10/11] powerpc/pseries/iommu: Make use of DDW for indirect mapping

2021-07-16 Thread Leonardo Bras
So far it's assumed possible to map the guest RAM 1:1 to the bus, which
works with a small number of devices. SRIOV changes this, as the user can
configure hundreds of VFs, and since phyp preallocates TCEs and does not
allow IOMMU pages bigger than 64K, it has to limit the number of TCEs
per PE to limit the waste of physical pages.

As of today, if the assumed direct mapping is not possible, DDW creation
is skipped and the default DMA window "ibm,dma-window" is used instead.

By using DDW, indirect mapping can get more TCEs than are available for
the default DMA window, and also gain access to much larger pagesizes
(16MB as implemented in qemu vs 4k from the default DMA window), causing a
significant increase in the maximum amount of memory that can be IOMMU
mapped at the same time.

Indirect mapping will only be used if direct mapping is not a
possibility.

For indirect mapping, it's necessary to re-create the iommu_table with
the new DMA window parameters, so iommu_alloc() can use it.

Removing the default DMA window for using DDW with indirect mapping
is only allowed if there is no current IOMMU memory allocated in
the iommu_table. enable_ddw() is aborted otherwise.

Even though there won't be both direct and indirect mappings at the
same time, we can't reuse the DIRECT64_PROPNAME property name, or else
an older kexec()ed kernel can assume direct mapping, and skip
iommu_alloc(), causing undesirable behavior.
So a new property name DMA64_PROPNAME "linux,dma64-ddr-window-info"
was created to represent a DDW that does not allow direct mapping.

Signed-off-by: Leonardo Bras 
---
 arch/powerpc/platforms/pseries/iommu.c | 87 +-
 1 file changed, 72 insertions(+), 15 deletions(-)

diff --git a/arch/powerpc/platforms/pseries/iommu.c b/arch/powerpc/platforms/pseries/iommu.c
index 22d251e15b61..a67e71c49aeb 100644
--- a/arch/powerpc/platforms/pseries/iommu.c
+++ b/arch/powerpc/platforms/pseries/iommu.c
@@ -375,6 +375,7 @@ static DEFINE_SPINLOCK(direct_window_list_lock);
 /* protects initializing window twice for same device */
 static DEFINE_MUTEX(direct_window_init_mutex);
 #define DIRECT64_PROPNAME "linux,direct64-ddr-window-info"
+#define DMA64_PROPNAME "linux,dma64-ddr-window-info"
 
 static int tce_clearrange_multi_pSeriesLP(unsigned long start_pfn,
unsigned long num_pfn, const void *arg)
@@ -925,6 +926,7 @@ static int find_existing_ddw_windows(void)
return 0;
 
find_existing_ddw_windows_named(DIRECT64_PROPNAME);
+   find_existing_ddw_windows_named(DMA64_PROPNAME);
 
return 0;
 }
@@ -1211,14 +1213,17 @@ static bool enable_ddw(struct pci_dev *dev, struct device_node *pdn)
struct ddw_create_response create;
int page_shift;
u64 win_addr;
+   const char *win_name;
struct device_node *dn;
u32 ddw_avail[DDW_APPLICABLE_SIZE];
struct direct_window *window;
struct property *win64;
bool ddw_enabled = false;
struct failed_ddw_pdn *fpdn;
-   bool default_win_removed = false;
+   bool default_win_removed = false, direct_mapping = false;
bool pmem_present;
+   struct pci_dn *pci = PCI_DN(pdn);
+   struct iommu_table *tbl = pci->table_group->tables[0];
 
dn = of_find_node_by_type(NULL, "ibm,pmemory");
pmem_present = dn != NULL;
@@ -1227,6 +1232,7 @@ static bool enable_ddw(struct pci_dev *dev, struct device_node *pdn)
	mutex_lock(&direct_window_init_mutex);
 
+	if (find_existing_ddw(pdn, &dev->dev.archdata.dma_offset, &len)) {
+   direct_mapping = (len >= max_ram_len);
ddw_enabled = true;
goto out_unlock;
}
@@ -1307,8 +1313,7 @@ static bool enable_ddw(struct pci_dev *dev, struct device_node *pdn)
  query.page_size);
goto out_failed;
}
-   /* verify the window * number of ptes will map the partition */
-   /* check largest block * page size > max memory hotplug addr */
+
/*
 * The "ibm,pmemory" can appear anywhere in the address space.
 * Assuming it is still backed by page structs, try MAX_PHYSMEM_BITS
@@ -1324,13 +1329,25 @@ static bool enable_ddw(struct pci_dev *dev, struct device_node *pdn)
dev_info(>dev, "Skipping ibm,pmemory");
}
 
+   /* check if the available block * number of ptes will map everything */
if (query.largest_available_block < (1ULL << (len - page_shift))) {
		dev_dbg(&dev->dev,
			"can't map partition max 0x%llx with %llu %llu-sized pages\n",
1ULL << len,
query.largest_available_block,
1ULL << page_shift);
-   goto out_failed;
+
+		/* DDW + IOMMU on single window may fail if there is any allocation */
+   if (default_win_removed && iommu_table_in_use(tbl)) {
+   

[PATCH v5 09/11] powerpc/pseries/iommu: Find existing DDW with given property name

2021-07-16 Thread Leonardo Bras
At the moment pseries stores information about the created directly
mapped DDW window in DIRECT64_PROPNAME.

With the objective of implementing indirect DMA mapping with DDW, it's
necessary to have another property name to make sure kexec'ing into older
kernels does not break, as it would if we reuse DIRECT64_PROPNAME.

In order to have this, find_existing_ddw_windows() needs to be able to
look for different property names.

Extract find_existing_ddw_windows() into find_existing_ddw_windows_named()
and call it with the current property name.

Signed-off-by: Leonardo Bras 
Reviewed-by: Alexey Kardashevskiy 
---
 arch/powerpc/platforms/pseries/iommu.c | 25 +++--
 1 file changed, 15 insertions(+), 10 deletions(-)

diff --git a/arch/powerpc/platforms/pseries/iommu.c b/arch/powerpc/platforms/pseries/iommu.c
index 17c6f4706e76..22d251e15b61 100644
--- a/arch/powerpc/platforms/pseries/iommu.c
+++ b/arch/powerpc/platforms/pseries/iommu.c
@@ -895,24 +895,21 @@ static struct direct_window *ddw_list_new_entry(struct device_node *pdn,
return window;
 }
 
-static int find_existing_ddw_windows(void)
+static void find_existing_ddw_windows_named(const char *name)
 {
int len;
struct device_node *pdn;
struct direct_window *window;
-   const struct dynamic_dma_window_prop *direct64;
-
-   if (!firmware_has_feature(FW_FEATURE_LPAR))
-   return 0;
+   const struct dynamic_dma_window_prop *dma64;
 
-   for_each_node_with_property(pdn, DIRECT64_PROPNAME) {
-   direct64 = of_get_property(pdn, DIRECT64_PROPNAME, &len);
-   if (!direct64 || len < sizeof(*direct64)) {
-   remove_ddw(pdn, true, DIRECT64_PROPNAME);
+   for_each_node_with_property(pdn, name) {
+   dma64 = of_get_property(pdn, name, &len);
+   if (!dma64 || len < sizeof(*dma64)) {
+   remove_ddw(pdn, true, name);
continue;
}
 
-   window = ddw_list_new_entry(pdn, direct64);
+   window = ddw_list_new_entry(pdn, dma64);
if (!window)
break;
 
@@ -920,6 +917,14 @@ static int find_existing_ddw_windows(void)
		list_add(&window->list, &direct_window_list);
		spin_unlock(&direct_window_list_lock);
}
+}
+
+static int find_existing_ddw_windows(void)
+{
+   if (!firmware_has_feature(FW_FEATURE_LPAR))
+   return 0;
+
+   find_existing_ddw_windows_named(DIRECT64_PROPNAME);
 
return 0;
 }
-- 
2.32.0



[PATCH v5 08/11] powerpc/pseries/iommu: Update remove_dma_window() to accept property name

2021-07-16 Thread Leonardo Bras
Update remove_dma_window() so it can be used to remove DDW with a given
property name.

This enables the creation of new property names for DDW, so we can
have different usage for it, like indirect mapping.

Signed-off-by: Leonardo Bras 
Reviewed-by: Alexey Kardashevskiy 
---
 arch/powerpc/platforms/pseries/iommu.c | 21 +++--
 1 file changed, 11 insertions(+), 10 deletions(-)

diff --git a/arch/powerpc/platforms/pseries/iommu.c b/arch/powerpc/platforms/pseries/iommu.c
index 108c3dcca686..17c6f4706e76 100644
--- a/arch/powerpc/platforms/pseries/iommu.c
+++ b/arch/powerpc/platforms/pseries/iommu.c
@@ -830,31 +830,32 @@ static void remove_dma_window(struct device_node *np, u32 *ddw_avail,
np, ret, ddw_avail[DDW_REMOVE_PE_DMA_WIN], liobn);
 }
 
-static void remove_ddw(struct device_node *np, bool remove_prop)
+static int remove_ddw(struct device_node *np, bool remove_prop, const char *win_name)
 {
struct property *win;
u32 ddw_avail[DDW_APPLICABLE_SIZE];
int ret = 0;
 
+   win = of_find_property(np, win_name, NULL);
+   if (!win)
+   return -EINVAL;
+
ret = of_property_read_u32_array(np, "ibm,ddw-applicable",
 _avail[0], DDW_APPLICABLE_SIZE);
if (ret)
-   return;
-
-   win = of_find_property(np, DIRECT64_PROPNAME, NULL);
-   if (!win)
-   return;
+   return 0;
 
if (win->length >= sizeof(struct dynamic_dma_window_prop))
remove_dma_window(np, ddw_avail, win);
 
if (!remove_prop)
-   return;
+   return 0;
 
ret = of_remove_property(np, win);
if (ret)
pr_warn("%pOF: failed to remove direct window property: %d\n",
np, ret);
+   return 0;
 }
 
 static bool find_existing_ddw(struct device_node *pdn, u64 *dma_addr, int *window_shift)
@@ -907,7 +908,7 @@ static int find_existing_ddw_windows(void)
for_each_node_with_property(pdn, DIRECT64_PROPNAME) {
		direct64 = of_get_property(pdn, DIRECT64_PROPNAME, &len);
if (!direct64 || len < sizeof(*direct64)) {
-   remove_ddw(pdn, true);
+   remove_ddw(pdn, true, DIRECT64_PROPNAME);
continue;
}
 
@@ -1382,7 +1383,7 @@ static bool enable_ddw(struct pci_dev *dev, struct device_node *pdn)
kfree(win64);
 
 out_remove_win:
-   remove_ddw(pdn, true);
+   remove_ddw(pdn, true, DIRECT64_PROPNAME);
 
 out_failed:
if (default_win_removed)
@@ -1547,7 +1548,7 @@ static int iommu_reconfig_notifier(struct notifier_block *nb, unsigned long acti
 * we have to remove the property when releasing
 * the device node.
 */
-   remove_ddw(np, false);
+   remove_ddw(np, false, DIRECT64_PROPNAME);
if (pci && pci->table_group)
iommu_pseries_free_group(pci->table_group,
np->full_name);
-- 
2.32.0



[PATCH v5 07/11] powerpc/pseries/iommu: Reorganize iommu_table_setparms*() with new helper

2021-07-16 Thread Leonardo Bras
Add a new helper, iommu_table_setparms_common(), and use it in
iommu_table_setparms() and iommu_table_setparms_lpar() to avoid duplicated
code.

Also, setting tbl->it_ops was happening outside iommu_table_setparms*(),
so move it to the new helper. Since we need the iommu_table_ops to be
declared before use, declare iommu_table_lpar_multi_ops and
iommu_table_pseries_ops before their respective iommu_table_setparms*().

Signed-off-by: Leonardo Bras 
---
 arch/powerpc/platforms/pseries/iommu.c | 72 ++
 1 file changed, 38 insertions(+), 34 deletions(-)

diff --git a/arch/powerpc/platforms/pseries/iommu.c b/arch/powerpc/platforms/pseries/iommu.c
index 7ca79a04fa52..108c3dcca686 100644
--- a/arch/powerpc/platforms/pseries/iommu.c
+++ b/arch/powerpc/platforms/pseries/iommu.c
@@ -501,6 +501,24 @@ static int tce_setrange_multi_pSeriesLP_walk(unsigned long start_pfn,
return tce_setrange_multi_pSeriesLP(start_pfn, num_pfn, arg);
 }
 
+static void iommu_table_setparms_common(struct iommu_table *tbl, unsigned long busno,
+					unsigned long liobn, unsigned long win_addr,
+					unsigned long window_size, unsigned long page_shift,
+					void *base, struct iommu_table_ops *table_ops)
+{
+   tbl->it_busno = busno;
+   tbl->it_index = liobn;
+   tbl->it_offset = win_addr >> page_shift;
+   tbl->it_size = window_size >> page_shift;
+   tbl->it_page_shift = page_shift;
+   tbl->it_base = (unsigned long)base;
+   tbl->it_blocksize = 16;
+   tbl->it_type = TCE_PCI;
+   tbl->it_ops = table_ops;
+}
+
+struct iommu_table_ops iommu_table_pseries_ops;
+
 static void iommu_table_setparms(struct pci_controller *phb,
 struct device_node *dn,
 struct iommu_table *tbl)
@@ -509,8 +527,13 @@ static void iommu_table_setparms(struct pci_controller *phb,
const unsigned long *basep;
const u32 *sizep;
 
-   node = phb->dn;
+   /* Test if we are going over 2GB of DMA space */
+   if (phb->dma_window_base_cur + phb->dma_window_size > SZ_2G) {
+   udbg_printf("PCI_DMA: Unexpected number of IOAs under this 
PHB.\n");
+   panic("PCI_DMA: Unexpected number of IOAs under this PHB.\n");
+   }
 
+   node = phb->dn;
basep = of_get_property(node, "linux,tce-base", NULL);
sizep = of_get_property(node, "linux,tce-size", NULL);
if (basep == NULL || sizep == NULL) {
@@ -519,33 +542,18 @@ static void iommu_table_setparms(struct pci_controller *phb,
return;
}
 
-   tbl->it_base = (unsigned long)__va(*basep);
+	iommu_table_setparms_common(tbl, phb->bus->number, 0, phb->dma_window_base_cur,
+				    phb->dma_window_size, IOMMU_PAGE_SHIFT_4K,
+				    __va(*basep), &iommu_table_pseries_ops);
 
if (!is_kdump_kernel())
memset((void *)tbl->it_base, 0, *sizep);
 
-   tbl->it_busno = phb->bus->number;
-   tbl->it_page_shift = IOMMU_PAGE_SHIFT_4K;
-
-   /* Units of tce entries */
-   tbl->it_offset = phb->dma_window_base_cur >> tbl->it_page_shift;
-
-   /* Test if we are going over 2GB of DMA space */
-	if (phb->dma_window_base_cur + phb->dma_window_size > 0x80000000ul) {
-		udbg_printf("PCI_DMA: Unexpected number of IOAs under this PHB.\n");
-   panic("PCI_DMA: Unexpected number of IOAs under this PHB.\n");
-   }
-
phb->dma_window_base_cur += phb->dma_window_size;
-
-   /* Set the tce table size - measured in entries */
-   tbl->it_size = phb->dma_window_size >> tbl->it_page_shift;
-
-   tbl->it_index = 0;
-   tbl->it_blocksize = 16;
-   tbl->it_type = TCE_PCI;
 }
 
+struct iommu_table_ops iommu_table_lpar_multi_ops;
+
 /*
  * iommu_table_setparms_lpar
  *
@@ -557,17 +565,13 @@ static void iommu_table_setparms_lpar(struct pci_controller *phb,
  struct iommu_table_group *table_group,
  const __be32 *dma_window)
 {
-   unsigned long offset, size;
+   unsigned long offset, size, liobn;
 
-	of_parse_dma_window(dn, dma_window, &tbl->it_index, &offset, &size);
+	of_parse_dma_window(dn, dma_window, &liobn, &offset, &size);
+
+	iommu_table_setparms_common(tbl, phb->bus->number, liobn, offset, size, IOMMU_PAGE_SHIFT_4K, NULL,
+				    &iommu_table_lpar_multi_ops);
 
-   tbl->it_busno = phb->bus->number;
-   tbl->it_page_shift = IOMMU_PAGE_SHIFT_4K;
-   tbl->it_base   = 0;
-   tbl->it_blocksize  = 16;
-   tbl->it_type = TCE_PCI;
-   tbl->it_offset = offset >> tbl->it_page_shift;
-   tbl->it_size = size >> tbl->it_page_shift;
 
table_group->tce32_start = offset;
table_group->tce32_size = size;
@@ -647,7 +651,7 @@ static void 

[PATCH v5 06/11] powerpc/pseries/iommu: Add ddw_property_create() and refactor enable_ddw()

2021-07-16 Thread Leonardo Bras
Code used to create a ddw property that was previously scattered in
enable_ddw() is now gathered in ddw_property_create(), which deals with
allocating and filling the property, leaving it ready for
of_add_property(), which now occurs in sequence.

This created an opportunity to reorganize the second part of enable_ddw():

Without this patch enable_ddw() does, in order:
kzalloc() property & members, create_ddw(), fill ddwprop inside property,
ddw_list_new_entry(), do tce_setrange_multi_pSeriesLP_walk in all memory,
of_add_property(), and list_add().

With this patch enable_ddw() does, in order:
create_ddw(), ddw_property_create(), of_add_property(),
ddw_list_new_entry(), do tce_setrange_multi_pSeriesLP_walk in all memory,
and list_add().

This change requires of_remove_property() in case anything fails after
of_add_property(), but we get to do tce_setrange_multi_pSeriesLP_walk
in all memory, which looks like the most expensive operation, only if
everything else succeeds.
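
As a condensed sketch of the resulting control flow (error labels after
out_remove_win are illustrative; see the diff below for the real body):

	ret = create_ddw(dev, ddw_avail, &create, page_shift, len);
	if (ret != 0)
		goto out_failed;

	win64 = ddw_property_create(DIRECT64_PROPNAME, create.liobn,
				    win_addr, page_shift, len);
	if (!win64)
		goto out_remove_win;		/* undo create_ddw() */

	ret = of_add_property(pdn, win64);
	if (ret)
		goto out_free_win;		/* free property, undo create_ddw() */

	window = ddw_list_new_entry(pdn, win64->value);
	if (!window)
		goto out_del_prop;		/* of_remove_property() now required */

	/* the expensive TCE setup walk runs only after everything else worked */
	walk_system_ram_range(0, memblock_end_of_DRAM() >> PAGE_SHIFT,
			      win64->value, tce_setrange_multi_pSeriesLP_walk);
	list_add(&window->list, &direct_window_list);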

Signed-off-by: Leonardo Bras 
Reviewed-by: Alexey Kardashevskiy 
---
 arch/powerpc/platforms/pseries/iommu.c | 93 --
 1 file changed, 57 insertions(+), 36 deletions(-)

diff --git a/arch/powerpc/platforms/pseries/iommu.c b/arch/powerpc/platforms/pseries/iommu.c
index b34b473bbdc1..7ca79a04fa52 100644
--- a/arch/powerpc/platforms/pseries/iommu.c
+++ b/arch/powerpc/platforms/pseries/iommu.c
@@ -1153,6 +1153,35 @@ static int iommu_get_page_shift(u32 query_page_size)
return 0;
 }
 
+static struct property *ddw_property_create(const char *propname, u32 liobn, u64 dma_addr,
+					    u32 page_shift, u32 window_shift)
+{
+   struct dynamic_dma_window_prop *ddwprop;
+   struct property *win64;
+
+   win64 = kzalloc(sizeof(*win64), GFP_KERNEL);
+   if (!win64)
+   return NULL;
+
+   win64->name = kstrdup(propname, GFP_KERNEL);
+   ddwprop = kzalloc(sizeof(*ddwprop), GFP_KERNEL);
+   win64->value = ddwprop;
+   win64->length = sizeof(*ddwprop);
+   if (!win64->name || !win64->value) {
+   kfree(win64->name);
+   kfree(win64->value);
+   kfree(win64);
+   return NULL;
+   }
+
+   ddwprop->liobn = cpu_to_be32(liobn);
+   ddwprop->dma_base = cpu_to_be64(dma_addr);
+   ddwprop->tce_shift = cpu_to_be32(page_shift);
+   ddwprop->window_shift = cpu_to_be32(window_shift);
+
+   return win64;
+}
+
 /*
  * If the PE supports dynamic dma windows, and there is space for a table
  * that can map all pages in a linear offset, then setup such a table,
@@ -1171,12 +1200,12 @@ static bool enable_ddw(struct pci_dev *dev, struct device_node *pdn)
struct ddw_query_response query;
struct ddw_create_response create;
int page_shift;
+   u64 win_addr;
struct device_node *dn;
u32 ddw_avail[DDW_APPLICABLE_SIZE];
struct direct_window *window;
struct property *win64;
bool ddw_enabled = false;
-   struct dynamic_dma_window_prop *ddwprop;
struct failed_ddw_pdn *fpdn;
bool default_win_removed = false;
bool pmem_present;
@@ -1293,72 +1322,64 @@ static bool enable_ddw(struct pci_dev *dev, struct device_node *pdn)
1ULL << page_shift);
goto out_failed;
}
-   win64 = kzalloc(sizeof(struct property), GFP_KERNEL);
-   if (!win64) {
-		dev_info(&dev->dev,
-   "couldn't allocate property for 64bit dma window\n");
-   goto out_failed;
-   }
-   win64->name = kstrdup(DIRECT64_PROPNAME, GFP_KERNEL);
-   win64->value = ddwprop = kmalloc(sizeof(*ddwprop), GFP_KERNEL);
-   win64->length = sizeof(*ddwprop);
-   if (!win64->name || !win64->value) {
-		dev_info(&dev->dev,
-   "couldn't allocate property name and value\n");
-   goto out_free_prop;
-   }
 
	ret = create_ddw(dev, ddw_avail, &create, page_shift, len);
if (ret != 0)
-   goto out_free_prop;
-
-   ddwprop->liobn = cpu_to_be32(create.liobn);
-   ddwprop->dma_base = cpu_to_be64(((u64)create.addr_hi << 32) |
-   create.addr_lo);
-   ddwprop->tce_shift = cpu_to_be32(page_shift);
-   ddwprop->window_shift = cpu_to_be32(len);
+   goto out_failed;
 
dev_dbg(>dev, "created tce table LIOBN 0x%x for %pOF\n",
  create.liobn, dn);
 
-   window = ddw_list_new_entry(pdn, ddwprop);
+   win_addr = ((u64)create.addr_hi << 32) | create.addr_lo;
+   win64 = ddw_property_create(DIRECT64_PROPNAME, create.liobn, win_addr,
+   page_shift, len);
+   if (!win64) {
+		dev_info(&dev->dev,
+			 "couldn't allocate property, property name, or value\n");
+   goto out_remove_win;
+   }
+
+   ret = of_add_property(pdn, win64);
+   if (ret) {
+ 

[PATCH v5 05/11] powerpc/pseries/iommu: Allow DDW windows starting at 0x00

2021-07-16 Thread Leonardo Bras
enable_ddw() currently returns the address of the DMA window, which is
considered invalid if it has the value 0x00.

Also, an address returned from find_existing_ddw() is only considered
valid if it's not 0x00.

Changing this behavior makes sense, given the users of enable_ddw() only
need to know if direct mapping is possible. It can also allow a DMA window
starting at 0x00 to be used.

This will be helpful for using a DDW with indirect mapping, as the window
address will be different than 0x00, but it will not map the whole
partition.

Signed-off-by: Leonardo Bras 
Reviewed-by: Alexey Kardashevskiy 
---
 arch/powerpc/platforms/pseries/iommu.c | 36 +-
 1 file changed, 18 insertions(+), 18 deletions(-)

diff --git a/arch/powerpc/platforms/pseries/iommu.c b/arch/powerpc/platforms/pseries/iommu.c
index 712d1667144a..b34b473bbdc1 100644
--- a/arch/powerpc/platforms/pseries/iommu.c
+++ b/arch/powerpc/platforms/pseries/iommu.c
@@ -853,25 +853,26 @@ static void remove_ddw(struct device_node *np, bool remove_prop)
np, ret);
 }
 
-static u64 find_existing_ddw(struct device_node *pdn, int *window_shift)
+static bool find_existing_ddw(struct device_node *pdn, u64 *dma_addr, int *window_shift)
 {
struct direct_window *window;
const struct dynamic_dma_window_prop *direct64;
-   u64 dma_addr = 0;
+   bool found = false;
 
	spin_lock(&direct_window_list_lock);
/* check if we already created a window and dupe that config if so */
	list_for_each_entry(window, &direct_window_list, list) {
if (window->device == pdn) {
direct64 = window->prop;
-   dma_addr = be64_to_cpu(direct64->dma_base);
+   *dma_addr = be64_to_cpu(direct64->dma_base);
*window_shift = be32_to_cpu(direct64->window_shift);
+   found = true;
break;
}
}
	spin_unlock(&direct_window_list_lock);
 
-   return dma_addr;
+   return found;
 }
 
 static struct direct_window *ddw_list_new_entry(struct device_node *pdn,
@@ -1161,20 +1162,20 @@ static int iommu_get_page_shift(u32 query_page_size)
  * pdn: the parent pe node with the ibm,dma_window property
  * Future: also check if we can remap the base window for our base page size
  *
- * returns the dma offset for use by the direct mapped DMA code.
+ * returns true if can map all pages (direct mapping), false otherwise..
  */
-static u64 enable_ddw(struct pci_dev *dev, struct device_node *pdn)
+static bool enable_ddw(struct pci_dev *dev, struct device_node *pdn)
 {
int len = 0, ret;
int max_ram_len = order_base_2(ddw_memory_hotplug_max());
struct ddw_query_response query;
struct ddw_create_response create;
int page_shift;
-   u64 dma_addr;
struct device_node *dn;
u32 ddw_avail[DDW_APPLICABLE_SIZE];
struct direct_window *window;
struct property *win64;
+   bool ddw_enabled = false;
struct dynamic_dma_window_prop *ddwprop;
struct failed_ddw_pdn *fpdn;
bool default_win_removed = false;
@@ -1186,9 +1187,10 @@ static u64 enable_ddw(struct pci_dev *dev, struct device_node *pdn)
 
	mutex_lock(&direct_window_init_mutex);
 
-	dma_addr = find_existing_ddw(pdn, &len);
-   if (dma_addr != 0)
+	if (find_existing_ddw(pdn, &dev->dev.archdata.dma_offset, &len)) {
+   ddw_enabled = true;
goto out_unlock;
+   }
 
/*
 * If we already went through this for a previous function of
@@ -1342,7 +1344,8 @@ static u64 enable_ddw(struct pci_dev *dev, struct device_node *pdn)
	list_add(&window->list, &direct_window_list);
	spin_unlock(&direct_window_list_lock);
 
-   dma_addr = be64_to_cpu(ddwprop->dma_base);
+   dev->dev.archdata.dma_offset = be64_to_cpu(ddwprop->dma_base);
+   ddw_enabled = true;
goto out_unlock;
 
 out_free_window:
@@ -1374,10 +1377,10 @@ static u64 enable_ddw(struct pci_dev *dev, struct device_node *pdn)
 * as RAM, then we failed to create a window to cover persistent
 * memory and need to set the DMA limit.
 */
-   if (pmem_present && dma_addr && (len == max_ram_len))
-   dev->dev.bus_dma_limit = dma_addr + (1ULL << len);
+   if (pmem_present && ddw_enabled && (len == max_ram_len))
+		dev->dev.bus_dma_limit = dev->dev.archdata.dma_offset + (1ULL << len);
 
-   return dma_addr;
+   return ddw_enabled;
 }
 
 static void pci_dma_dev_setup_pSeriesLP(struct pci_dev *dev)
@@ -1456,11 +1459,8 @@ static bool iommu_bypass_supported_pSeriesLP(struct pci_dev *pdev, u64 dma_mask)
break;
}
 
-   if (pdn && PCI_DN(pdn)) {
-   pdev->dev.archdata.dma_offset = enable_ddw(pdev, pdn);
-   if (pdev->dev.archdata.dma_offset)
-   return true;
-   }
+   if (pdn && 

[PATCH v5 04/11] powerpc/pseries/iommu: Add ddw_list_new_entry() helper

2021-07-16 Thread Leonardo Bras
There are two functions creating direct_window_list entries in a
similar way, so create a ddw_list_new_entry() helper to avoid duplication
and simplify those functions.

Signed-off-by: Leonardo Bras 
Reviewed-by: Alexey Kardashevskiy 
---
 arch/powerpc/platforms/pseries/iommu.c | 32 +-
 1 file changed, 21 insertions(+), 11 deletions(-)

diff --git a/arch/powerpc/platforms/pseries/iommu.c b/arch/powerpc/platforms/pseries/iommu.c
index 33d82865d6e6..712d1667144a 100644
--- a/arch/powerpc/platforms/pseries/iommu.c
+++ b/arch/powerpc/platforms/pseries/iommu.c
@@ -874,6 +874,21 @@ static u64 find_existing_ddw(struct device_node *pdn, int *window_shift)
return dma_addr;
 }
 
+static struct direct_window *ddw_list_new_entry(struct device_node *pdn,
+						const struct dynamic_dma_window_prop *dma64)
+{
+   struct direct_window *window;
+
+   window = kzalloc(sizeof(*window), GFP_KERNEL);
+   if (!window)
+   return NULL;
+
+   window->device = pdn;
+   window->prop = dma64;
+
+   return window;
+}
+
 static int find_existing_ddw_windows(void)
 {
int len;
@@ -886,18 +901,15 @@ static int find_existing_ddw_windows(void)
 
for_each_node_with_property(pdn, DIRECT64_PROPNAME) {
		direct64 = of_get_property(pdn, DIRECT64_PROPNAME, &len);
-   if (!direct64)
-   continue;
-
-   window = kzalloc(sizeof(*window), GFP_KERNEL);
-   if (!window || len < sizeof(struct dynamic_dma_window_prop)) {
-   kfree(window);
+   if (!direct64 || len < sizeof(*direct64)) {
remove_ddw(pdn, true);
continue;
}
 
-   window->device = pdn;
-   window->prop = direct64;
+   window = ddw_list_new_entry(pdn, direct64);
+   if (!window)
+   break;
+
		spin_lock(&direct_window_list_lock);
		list_add(&window->list, &direct_window_list);
		spin_unlock(&direct_window_list_lock);
@@ -1307,7 +1319,7 @@ static u64 enable_ddw(struct pci_dev *dev, struct device_node *pdn)
dev_dbg(>dev, "created tce table LIOBN 0x%x for %pOF\n",
  create.liobn, dn);
 
-   window = kzalloc(sizeof(*window), GFP_KERNEL);
+   window = ddw_list_new_entry(pdn, ddwprop);
if (!window)
goto out_clear_window;
 
@@ -1326,8 +1338,6 @@ static u64 enable_ddw(struct pci_dev *dev, struct device_node *pdn)
goto out_free_window;
}
 
-   window->device = pdn;
-   window->prop = ddwprop;
	spin_lock(&direct_window_list_lock);
	list_add(&window->list, &direct_window_list);
	spin_unlock(&direct_window_list_lock);
-- 
2.32.0



[PATCH v5 03/11] powerpc/pseries/iommu: Add iommu_pseries_alloc_table() helper

2021-07-16 Thread Leonardo Bras
Creates a helper to allow allocating a new iommu_table without the need
to reallocate the iommu_group.

This will be helpful for replacing the iommu_table for the new DMA window,
after we remove the old one with iommu_tce_table_put().

Signed-off-by: Leonardo Bras 
Reviewed-by: Alexey Kardashevskiy 
---
 arch/powerpc/platforms/pseries/iommu.c | 25 ++---
 1 file changed, 14 insertions(+), 11 deletions(-)

diff --git a/arch/powerpc/platforms/pseries/iommu.c b/arch/powerpc/platforms/pseries/iommu.c
index b1b8d12bab39..33d82865d6e6 100644
--- a/arch/powerpc/platforms/pseries/iommu.c
+++ b/arch/powerpc/platforms/pseries/iommu.c
@@ -53,28 +53,31 @@ enum {
DDW_EXT_QUERY_OUT_SIZE = 2
 };
 
-static struct iommu_table_group *iommu_pseries_alloc_group(int node)
+static struct iommu_table *iommu_pseries_alloc_table(int node)
 {
-   struct iommu_table_group *table_group;
struct iommu_table *tbl;
 
-   table_group = kzalloc_node(sizeof(struct iommu_table_group), GFP_KERNEL,
-  node);
-   if (!table_group)
-   return NULL;
-
tbl = kzalloc_node(sizeof(struct iommu_table), GFP_KERNEL, node);
if (!tbl)
-   goto free_group;
+   return NULL;
 
	INIT_LIST_HEAD_RCU(&tbl->it_group_list);
	kref_init(&tbl->it_kref);
+   return tbl;
+}
 
-   table_group->tables[0] = tbl;
+static struct iommu_table_group *iommu_pseries_alloc_group(int node)
+{
+   struct iommu_table_group *table_group;
+
+   table_group = kzalloc_node(sizeof(*table_group), GFP_KERNEL, node);
+   if (!table_group)
+   return NULL;
 
-   return table_group;
+   table_group->tables[0] = iommu_pseries_alloc_table(node);
+   if (table_group->tables[0])
+   return table_group;
 
-free_group:
kfree(table_group);
return NULL;
 }
-- 
2.32.0



[PATCH v5 02/11] powerpc/kernel/iommu: Add new iommu_table_in_use() helper

2021-07-16 Thread Leonardo Bras
Having a function to check if the iommu table has any allocation helps
decide if a tbl can be reset for using a new DMA window.

It should be enough to replace all instances of !bitmap_empty(tbl...).

iommu_table_in_use() skips reserved memory, so we don't need to worry about
releasing it before testing. This causes iommu_table_release_pages() to
become unnecessary, given it is only used to remove reserved memory for
testing.

Also, only allow storing reserved memory values in tbl if they are valid
in the table, so there is no need to check it in the new helper.

Signed-off-by: Leonardo Bras 
Reviewed-by: Alexey Kardashevskiy 
---
 arch/powerpc/include/asm/iommu.h |  1 +
 arch/powerpc/kernel/iommu.c  | 65 
 2 files changed, 34 insertions(+), 32 deletions(-)

diff --git a/arch/powerpc/include/asm/iommu.h b/arch/powerpc/include/asm/iommu.h
index deef7c94d7b6..bf3b84128525 100644
--- a/arch/powerpc/include/asm/iommu.h
+++ b/arch/powerpc/include/asm/iommu.h
@@ -154,6 +154,7 @@ extern int iommu_tce_table_put(struct iommu_table *tbl);
  */
 extern struct iommu_table *iommu_init_table(struct iommu_table *tbl,
int nid, unsigned long res_start, unsigned long res_end);
+bool iommu_table_in_use(struct iommu_table *tbl);
 
 #define IOMMU_TABLE_GROUP_MAX_TABLES   2
 
diff --git a/arch/powerpc/kernel/iommu.c b/arch/powerpc/kernel/iommu.c
index 2af89a5e379f..b10bf58ae467 100644
--- a/arch/powerpc/kernel/iommu.c
+++ b/arch/powerpc/kernel/iommu.c
@@ -690,32 +690,24 @@ static void iommu_table_reserve_pages(struct iommu_table *tbl,
if (tbl->it_offset == 0)
set_bit(0, tbl->it_map);
 
-   tbl->it_reserved_start = res_start;
-   tbl->it_reserved_end = res_end;
-
-   /* Check if res_start..res_end isn't empty and overlaps the table */
-   if (res_start && res_end &&
-   (tbl->it_offset + tbl->it_size < res_start ||
-res_end < tbl->it_offset))
-   return;
+   if (res_start < tbl->it_offset)
+   res_start = tbl->it_offset;
 
-   for (i = tbl->it_reserved_start; i < tbl->it_reserved_end; ++i)
-   set_bit(i - tbl->it_offset, tbl->it_map);
-}
+   if (res_end > (tbl->it_offset + tbl->it_size))
+   res_end = tbl->it_offset + tbl->it_size;
 
-static void iommu_table_release_pages(struct iommu_table *tbl)
-{
-   int i;
+   /* Check if res_start..res_end is a valid range in the table */
+   if (res_start >= res_end) {
+   tbl->it_reserved_start = tbl->it_offset;
+   tbl->it_reserved_end = tbl->it_offset;
+   return;
+   }
 
-   /*
-* In case we have reserved the first bit, we should not emit
-* the warning below.
-*/
-   if (tbl->it_offset == 0)
-   clear_bit(0, tbl->it_map);
+   tbl->it_reserved_start = res_start;
+   tbl->it_reserved_end = res_end;
 
for (i = tbl->it_reserved_start; i < tbl->it_reserved_end; ++i)
-   clear_bit(i - tbl->it_offset, tbl->it_map);
+   set_bit(i - tbl->it_offset, tbl->it_map);
 }
 
 /*
@@ -779,6 +771,22 @@ struct iommu_table *iommu_init_table(struct iommu_table *tbl, int nid,
return tbl;
 }
 
+bool iommu_table_in_use(struct iommu_table *tbl)
+{
+   unsigned long start = 0, end;
+
+   /* ignore reserved bit0 */
+   if (tbl->it_offset == 0)
+   start = 1;
+   end = tbl->it_reserved_start - tbl->it_offset;
+   if (find_next_bit(tbl->it_map, end, start) != end)
+   return true;
+
+   start = tbl->it_reserved_end - tbl->it_offset;
+   end = tbl->it_size;
+   return find_next_bit(tbl->it_map, end, start) != end;
+}
+
 static void iommu_table_free(struct kref *kref)
 {
struct iommu_table *tbl;
@@ -795,10 +803,8 @@ static void iommu_table_free(struct kref *kref)
 
iommu_debugfs_del(tbl);
 
-   iommu_table_release_pages(tbl);
-
/* verify that table contains no entries */
-   if (!bitmap_empty(tbl->it_map, tbl->it_size))
+   if (iommu_table_in_use(tbl))
pr_warn("%s: Unexpected TCEs\n", __func__);
 
/* free bitmap */
@@ -1099,18 +1105,13 @@ int iommu_take_ownership(struct iommu_table *tbl)
for (i = 0; i < tbl->nr_pools; i++)
spin_lock_nest_lock(&tbl->pools[i].lock, &tbl->large_pool.lock);
 
-   iommu_table_release_pages(tbl);
-
-   if (!bitmap_empty(tbl->it_map, tbl->it_size)) {
+   if (iommu_table_in_use(tbl)) {
pr_err("iommu_tce: it_map is not empty");
ret = -EBUSY;
-   /* Undo iommu_table_release_pages, i.e. restore bit#0, etc */
-   iommu_table_reserve_pages(tbl, tbl->it_reserved_start,
-   tbl->it_reserved_end);
-   } else {
-   memset(tbl->it_map, 0xff, sz);
}
 
+   memset(tbl->it_map, 0xff, sz);
+
for (i = 0; i < 

[PATCH v5 01/11] powerpc/pseries/iommu: Replace hard-coded page shift

2021-07-16 Thread Leonardo Bras
Some functions assume the IOMMU page size can only be 4K (pageshift == 12).
Update them to accept any page size passed, so we can use 64K pages.

In the process, some defines like TCE_SHIFT were made obsolete and
removed.

IODA3 Revision 3.0_prd1 (OpenPower Foundation), Figures 3.4 and 3.5, shows
a 52-bit RPN with a 12-bit pageshift, so there should be no need to use
TCE_RPN_MASK, which masks out any bit above 40 in the rpn. Its usage is
removed from tce_build_pSeries(), tce_build_pSeriesLP(), and
tce_buildmulti_pSeriesLP().

Most places had a tbl struct, so using tbl->it_page_shift was simple.
tce_free_pSeriesLP() was a special case, since its callers do not always
have a tbl struct, so adding a tceshift parameter seemed the right thing
to do.
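
A small userspace sketch of the encoding change (illustrative values; the
TCE flag values mirror asm/tce.h):

#include <stdint.h>
#include <stdio.h>

#define TCE_PCI_READ   0x1ULL   /* as in asm/tce.h */
#define TCE_PCI_WRITE  0x2ULL

int main(void)
{
        uint64_t pa = 0x123450000ULL;  /* example physical address */
        unsigned int tceshift = 16;    /* 64K IOMMU pages; was hard-coded 12 */
        uint64_t rpn = pa >> tceshift;
        uint64_t tce = TCE_PCI_READ | TCE_PCI_WRITE | (rpn << tceshift);

        printf("rpn=0x%llx tce=0x%llx\n",
               (unsigned long long)rpn, (unsigned long long)tce);
        return 0;
}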

Signed-off-by: Leonardo Bras 
Reviewed-by: Alexey Kardashevskiy 
---
 arch/powerpc/include/asm/tce.h |  8 --
 arch/powerpc/platforms/pseries/iommu.c | 39 +++---
 2 files changed, 23 insertions(+), 24 deletions(-)

diff --git a/arch/powerpc/include/asm/tce.h b/arch/powerpc/include/asm/tce.h
index db5fc2f2262d..0c34d2756d92 100644
--- a/arch/powerpc/include/asm/tce.h
+++ b/arch/powerpc/include/asm/tce.h
@@ -19,15 +19,7 @@
 #define TCE_VB 0
 #define TCE_PCI1
 
-/* TCE page size is 4096 bytes (1 << 12) */
-
-#define TCE_SHIFT  12
-#define TCE_PAGE_SIZE  (1 << TCE_SHIFT)
-
 #define TCE_ENTRY_SIZE 8   /* each TCE is 64 bits */
-
-#define TCE_RPN_MASK   0xfful  /* 40-bit RPN (4K pages) */
-#define TCE_RPN_SHIFT  12
 #define TCE_VALID  0x800   /* TCE valid */
 #define TCE_ALLIO  0x400   /* TCE valid for all lpars */
 #define TCE_PCI_WRITE  0x2 /* write from PCI allowed */
diff --git a/arch/powerpc/platforms/pseries/iommu.c b/arch/powerpc/platforms/pseries/iommu.c
index 0c55b991f665..b1b8d12bab39 100644
--- a/arch/powerpc/platforms/pseries/iommu.c
+++ b/arch/powerpc/platforms/pseries/iommu.c
@@ -107,6 +107,8 @@ static int tce_build_pSeries(struct iommu_table *tbl, long index,
u64 proto_tce;
__be64 *tcep;
u64 rpn;
+   const unsigned long tceshift = tbl->it_page_shift;
+   const unsigned long pagesize = IOMMU_PAGE_SIZE(tbl);
 
proto_tce = TCE_PCI_READ; // Read allowed
 
@@ -117,10 +119,10 @@ static int tce_build_pSeries(struct iommu_table *tbl, long index,
 
while (npages--) {
/* can't move this out since we might cross MEMBLOCK boundary */
-   rpn = __pa(uaddr) >> TCE_SHIFT;
-   *tcep = cpu_to_be64(proto_tce | (rpn & TCE_RPN_MASK) << TCE_RPN_SHIFT);
+   rpn = __pa(uaddr) >> tceshift;
+   *tcep = cpu_to_be64(proto_tce | rpn << tceshift);
 
-   uaddr += TCE_PAGE_SIZE;
+   uaddr += pagesize;
tcep++;
}
return 0;
@@ -146,7 +148,7 @@ static unsigned long tce_get_pseries(struct iommu_table *tbl, long index)
return be64_to_cpu(*tcep);
 }
 
-static void tce_free_pSeriesLP(unsigned long liobn, long, long);
+static void tce_free_pSeriesLP(unsigned long liobn, long, long, long);
 static void tce_freemulti_pSeriesLP(struct iommu_table*, long, long);
 
 static int tce_build_pSeriesLP(unsigned long liobn, long tcenum, long tceshift,
@@ -166,12 +168,12 @@ static int tce_build_pSeriesLP(unsigned long liobn, long tcenum, long tceshift,
proto_tce |= TCE_PCI_WRITE;
 
while (npages--) {
-   tce = proto_tce | (rpn & TCE_RPN_MASK) << tceshift;
+   tce = proto_tce | rpn << tceshift;
rc = plpar_tce_put((u64)liobn, (u64)tcenum << tceshift, tce);
 
if (unlikely(rc == H_NOT_ENOUGH_RESOURCES)) {
ret = (int)rc;
-   tce_free_pSeriesLP(liobn, tcenum_start,
+   tce_free_pSeriesLP(liobn, tcenum_start, tceshift,
   (npages_start - (npages + 1)));
break;
}
@@ -205,10 +207,11 @@ static int tce_buildmulti_pSeriesLP(struct iommu_table *tbl, long tcenum,
long tcenum_start = tcenum, npages_start = npages;
int ret = 0;
unsigned long flags;
+   const unsigned long tceshift = tbl->it_page_shift;
 
if ((npages == 1) || !firmware_has_feature(FW_FEATURE_PUT_TCE_IND)) {
return tce_build_pSeriesLP(tbl->it_index, tcenum,
-  tbl->it_page_shift, npages, uaddr,
+  tceshift, npages, uaddr,
   direction, attrs);
}
 
@@ -225,13 +228,13 @@ static int tce_buildmulti_pSeriesLP(struct iommu_table *tbl, long tcenum,
if (!tcep) {
local_irq_restore(flags);
return tce_build_pSeriesLP(tbl->it_index, tcenum,
-  

[PATCH v5 00/11] DDW + Indirect Mapping

2021-07-16 Thread Leonardo Bras
So far it's assumed possible to map the guest RAM 1:1 to the bus, which
works with a small number of devices. SRIOV changes this, as the user can
configure hundreds of VFs, and since phyp preallocates TCEs and does not
allow IOMMU pages bigger than 64K, it has to limit the number of TCEs per
PE to limit the waste of physical pages.

As of today, if the assumed direct mapping is not possible, DDW creation
is skipped and the default DMA window "ibm,dma-window" is used instead.

Using a DDW instead of the default DMA window may allow expanding the
amount of memory that can be DMA-mapped, given that the number of pages
(TCEs) may stay the same (or increase) and the default DMA window offers
only 4K pages, while a DDW may offer larger pages (4K, 64K, 16M, ...).
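
For a back-of-the-envelope sense of scale, an illustrative calculation
(the TCE count is made up, not taken from this series):

#include <stdio.h>

int main(void)
{
        unsigned long long tces = 1ULL << 21;  /* e.g. 2M preallocated TCEs */

        printf("4K pages:  %llu GiB\n", tces * (4ULL << 10) >> 30);   /* 8 */
        printf("64K pages: %llu GiB\n", tces * (64ULL << 10) >> 30);  /* 128 */
        return 0;
}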

Patch #1 replaces hard-coded 4K page size with a variable containing the
correct page size for the window.

Patch #2 introduces iommu_table_in_use() and replaces manual bit-field
checking where it's used. It will be used to abort enable_ddw() if there
is any current IOMMU allocation while attempting single-window indirect
mapping.

Patch #3 introduces iommu_pseries_alloc_table(), which will be helpful
when indirect mapping needs to replace the iommu_table.

Patch #4 adds helpers for adding DDWs in the list.

Patch #5 refactors enable_ddw() so it returns whether direct mapping is
possible, instead of the DMA offset. This helps the next patches on
indirect DMA mapping and also allows DMA windows starting at 0x00, as
sketched below.
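
A sketch of the changed contract (hedged: the signature below follows this
description, not necessarily the final code):

/*
 * Before: enable_ddw() returned a u64 DMA window offset, where 0 had to
 * double as "no DDW", ruling out a window at bus address 0x00.
 * After: success is reported separately, so offset 0x00 becomes valid.
 */
static bool enable_ddw(struct pci_dev *dev, struct device_node *pdn);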

Patch #6 brings a new helper to simplify enable_ddw(), allowing some
reorganization for introducing the indirect mapping DDW.

Patch #7 adds a new helper, _iommu_table_setparms(), and uses it in the
other *setparms*() functions to fill the iommu_table. It will also be
used for creating a new iommu_table for indirect mapping.

Patch #8 updates remove_dma_window() to accept different property names,
so we can introduce a new property for indirect mapping.

Patch #9 extracts find_existing_ddw_windows() into
find_existing_ddw_windows_named(), and calls it with the property name.
This will be useful when the property for indirect mapping is created,
so we can search the device tree for both properties.

Patch #10:
Instead of destroying the created DDW if it doesn't map the whole
partition, make use of it instead of the default DMA window as it improves
performance. Also, update the iommu_table and re-generate the pools.
It introduces a new property name for DDW with indirect DMA mapping.

Patch #11:
Does some renaming of 'direct window' to 'dma window', given that the
DDW created can now also be used for indirect mapping if direct mapping
is not available.

All patches were tested on an LPAR with a virtio-net interface that
allows the default DMA window and DDW to coexist.

Changes since v4:
- Solve conflicts with new upstream versions
- Avoid unnecessary code moving by doing variable declaration before definition
- Rename _iommu_table_setparms to iommu_table_setparms_common and change the
  base parameter from unsigned long to void * to avoid unnecessary casting.
- Fix breaking case for existing direct-mapping.
- Fix IORESOURCE_MEM bound issue
- Move new tbl to pci->table_group->tables[1] instead of replacing [0]
v4 Link: https://patchwork.ozlabs.org/project/linuxppc-dev/list/?series=241597&state=%2A&archive=both

Changes since v3:
- Fixed inverted free order at ddw_property_create()
- Updated goto tag naming
v3 Link: https://patchwork.ozlabs.org/project/linuxppc-dev/list/?series=240287&state=%2A&archive=both

Changes since v2:
- Some patches got removed from the series and sent by themselves,
- New tbl created for DDW + indirect mapping reserves MMIO32 space,
- Improved reserved area algorithm,
- Improved commit messages,
- Removed define for default DMA window prop name,
- Avoided some unnecessary renaming,
- Removed some unnecessary empty lines,
- Changed some code moving to forward declarations.
v2 Link: http://patchwork.ozlabs.org/project/linuxppc-dev/list/?series=201210&state=%2A&archive=both


Leonardo Bras (11):
  powerpc/pseries/iommu: Replace hard-coded page shift
  powerpc/kernel/iommu: Add new iommu_table_in_use() helper
  powerpc/pseries/iommu: Add iommu_pseries_alloc_table() helper
  powerpc/pseries/iommu: Add ddw_list_new_entry() helper
  powerpc/pseries/iommu: Allow DDW windows starting at 0x00
  powerpc/pseries/iommu: Add ddw_property_create() and refactor
enable_ddw()
  powerpc/pseries/iommu: Reorganize iommu_table_setparms*() with new
helper
  powerpc/pseries/iommu: Update remove_dma_window() to accept property
name
  powerpc/pseries/iommu: Find existing DDW with given property name
  powerpc/pseries/iommu: Make use of DDW for indirect mapping
  powerpc/pseries/iommu: Rename "direct window" to "dma window"

 arch/powerpc/include/asm/iommu.h   |   1 +
 arch/powerpc/include/asm/tce.h |   8 -
 arch/powerpc/kernel/iommu.c|  65 ++--
 arch/powerpc/platforms/pseries/iommu.c | 481 +++--
 4 files changed, 330 insertions(+), 225 deletions(-)

-- 
2.32.0



Re: [PATCH v1 10/16] s390/pci: return error code from s390_dma_map_sg()

2021-07-16 Thread Niklas Schnelle
On Thu, 2021-07-15 at 10:45 -0600, Logan Gunthorpe wrote:
> From: Martin Oliveira 
> 
> The .map_sg() op now expects an error code instead of zero on failure.
> 
> So propagate the error from __s390_dma_map_sg() up.
> 
> Signed-off-by: Martin Oliveira 
> Signed-off-by: Logan Gunthorpe 
> Cc: Niklas Schnelle 
> Cc: Gerald Schaefer 
> Cc: Heiko Carstens 
> Cc: Vasily Gorbik 
> Cc: Christian Borntraeger 
> ---
>  arch/s390/pci/pci_dma.c | 12 +++-
>  1 file changed, 7 insertions(+), 5 deletions(-)
> 
> diff --git a/arch/s390/pci/pci_dma.c b/arch/s390/pci/pci_dma.c
> index ebc9a49523aa..c78b02012764 100644
> --- a/arch/s390/pci/pci_dma.c
> +++ b/arch/s390/pci/pci_dma.c
> @@ -487,7 +487,7 @@ static int s390_dma_map_sg(struct device *dev, struct scatterlist *sg,
>   unsigned int max = dma_get_max_seg_size(dev);
>   unsigned int size = s->offset + s->length;
>   unsigned int offset = s->offset;
> - int count = 0, i;
> + int count = 0, i, ret;
>  
>   for (i = 1; i < nr_elements; i++) {
>   s = sg_next(s);
> @@ -497,8 +497,9 @@ static int s390_dma_map_sg(struct device *dev, struct scatterlist *sg,
>  
>   if (s->offset || (size & ~PAGE_MASK) ||
>   size + s->length > max) {
> - if (__s390_dma_map_sg(dev, start, size,
> -   &dma->dma_address, dir))
> + ret = __s390_dma_map_sg(dev, start, size,
> + &dma->dma_address, dir);
> + if (ret)
>   goto unmap;
>  
>   dma->dma_address += offset;
> @@ -511,7 +512,8 @@ static int s390_dma_map_sg(struct device *dev, struct scatterlist *sg,
>   }
>   size += s->length;
>   }
> - if (__s390_dma_map_sg(dev, start, size, &dma->dma_address, dir))
> + ret = __s390_dma_map_sg(dev, start, size, &dma->dma_address, dir);
> + if (ret)
>   goto unmap;
>  
>   dma->dma_address += offset;
> @@ -523,7 +525,7 @@ static int s390_dma_map_sg(struct device *dev, struct scatterlist *sg,
>   s390_dma_unmap_pages(dev, sg_dma_address(s), sg_dma_len(s),
>dir, attrs);
>  
> - return 0;
> + return ret;
>  }
>  
>  static void s390_dma_unmap_sg(struct device *dev, struct scatterlist *sg,

So the error codes we return are -ENOMEM if allocating a DMA translation
entry fails, and -EINVAL if the DMA translation table hasn't been
initialized or the caller tries to map zero-sized memory. Are these the
error codes you would expect? If yes, then this change looks good to me.

Acked-by: Niklas Schnelle 
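
For reference, a hedged sketch of what this means for callers going
through the sgtable API (not code from this patch):

struct sg_table sgt;    /* assume already allocated and populated */
int ret;

/*
 * dma_map_sgtable() forwards the negative errno from ->map_sg
 * (for s390: -ENOMEM or -EINVAL) instead of flattening it to zero.
 */
ret = dma_map_sgtable(dev, &sgt, DMA_TO_DEVICE, 0);
if (ret)
        return ret;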



Re: [PATCH] crypto: DRBG - select SHA512

2021-07-16 Thread Herbert Xu
Stephan Mueller  wrote:
> With the switch to use HMAC(SHA-512) as the default DRBG type, the
> configuration must now also select SHA-512.
> 
> Fixes: 9b7b94683a9b "crypto: DRBG - switch to HMAC SHA512 DRBG as default DRBG"
> Reported-by: Sachin Sant 
> Signed-off-by: Stephan Mueller 
> ---
> crypto/Kconfig | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)

Patch applied.  Thanks.
-- 
Email: Herbert Xu 
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt


Re: [PATCH v3 1/1] powerpc/pseries: Interface to represent PAPR firmware attributes

2021-07-16 Thread Pratik Sampat




On 16/07/21 1:16 am, Fabiano Rosas wrote:

Pratik Sampat  writes:


Hello,

On 12/07/21 9:13 pm, Fabiano Rosas wrote:

"Pratik R. Sampat"  writes:

Hi, have you seen Documentation/core-api/kobject.rst, particularly the
part that says:

"When you see a sysfs directory full of other directories, generally each
 of those directories corresponds to a kobject in the same kset."

Taking a look at samples/kobject/kset-example.c, it seems to provide an
overall structure that is closer to what other modules do when creating
sysfs entries. It uses less dynamic allocations and deals a bit better
with cleaning up the state afterwards.


Thank you for pointing me towards this example, the kset approach is
interesting and the example indeed does handle cleanups better.

Currently, we use "machine_device_initcall()" to register this
functionality, do you suggest I convert this into a tristate module
instead where I can include a "module_exit" for cleanups?

Ugh.. I was hoping we could get away with having all cleanups done at
kobject release time. But now I see that it is not called unless we
decrement the reference count. Nevermind then.
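
For reference, the pattern in question, as a minimal kernel-style sketch
along the lines of samples/kobject/kset-example.c: the release callback
only runs once kobject_put() drops the last reference.

static void foo_release(struct kobject *kobj)
{
        struct foo_obj *foo = container_of(kobj, struct foo_obj, kobj);

        kfree(foo);             /* runs only when the refcount hits zero */
}

/* teardown path */
kobject_put(&foo->kobj);        /* last ref dropped -> foo_release() */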


Sure I can keep the current approach as-is, while incorporating the rest of your
comments.


+   ret = plpar_hcall_norets(H_GET_ENERGY_SCALE_INFO, ESI_FLAGS_ALL, 0,
+virt_to_phys(esi_buf), MAX_BUF_SZ);
+   esi_hdr = (struct h_energy_scale_info_hdr *) esi_buf;
+   if (ret != H_SUCCESS || esi_hdr->data_header_version != ESI_VERSION) {

I really dislike this. If you want to bail due to version change, then
at least include in the ABI document that we might not give the
userspace any data at all.

My only concern with having a version check is that the attribute list
can change, and the attributes themselves may change as well. If that is
the case, then in a newer version, if we do not bail out, we may parse
data into our structs incorrectly.

Sure, that is a valid concern. But the documentation for the header
version field says:

   "Version of the Header. The header will be always backward compatible,
   and changes will not impact the Array of attributes. Current version =
   0x01"

I guess this is a bit vague still, but I understood that:

1- header elements continue to exist at the same position;
2- the format of the array of attributes will not change.

Are you saying that my interpretation above is not correct or that you
don't trust the HV to enforce it?


Thanks for the clarification.
I understand now that my interpretation was incorrect. The version change
should not break anything, as our kernel code acts just as a pass-through.


My argument hinges only on the point that we should give no data at all
instead of junk or incorrect data.

I agree. I just don't think it would be possible to end up with
incorrect data, unless the HV has a bug.


Maybe I could make this check after the return check and give out a
version mismatch message like the following?
pr_warn("hcall failed: H_GET_ENERGY_SCALE_INFO VER MISMATCH - EXP: 0x%x, REC: 0x%x",
  ESI_VERSION, esi_hdr->data_header_version);

Yes, this will help with debug if we ever end up in this situation.


Understood. In case of a version mismatch where new IDs are introduced, it
can help userspace know that something has changed.


+   pr_warn("hcall failed: H_GET_ENERGY_SCALE_INFO");
+   goto out;
+   }
+
+   num_attrs = be64_to_cpu(esi_hdr->num_attrs);
+   /*
+* Typecast the energy buffer to the attribute structure at the offset
+* specified in the buffer
+*/

I think the code is now simple enough that this comment could be
removed.

ack


+   esi_attrs = (struct energy_scale_attribute *)
+   (esi_buf + be64_to_cpu(esi_hdr->array_offset));
+
+   pgs = kcalloc(num_attrs, sizeof(*pgs), GFP_KERNEL);

This is never freed.


+   if (!pgs)
+   goto out_pgs;
+
+   papr_kobj = kobject_create_and_add("papr", firmware_kobj);
+   if (!papr_kobj) {
+   pr_warn("kobject_create_and_add papr failed\n");
+   goto out_kobj;
+   }
+
+   esi_kobj = kobject_create_and_add("energy_scale_info", papr_kobj);
+   if (!esi_kobj) {
+   pr_warn("kobject_create_and_add energy_scale_info failed\n");
+   goto out_ekobj;
+   }
+
+   for (idx = 0; idx < num_attrs; idx++) {
+   char buf[4];
+   bool show_val_desc = true;
+
+   pgs[idx].pgattrs = kcalloc(MAX_ATTRS,
+  sizeof(*pgs[idx].pgattrs),
+  GFP_KERNEL);
+   if (!pgs[idx].pgattrs)
+   goto out_kobj;
+
+   pgs[idx].pg.attrs = kcalloc(MAX_ATTRS + 1,
+   sizeof(*pgs[idx].pg.attrs),
+   GFP_KERNEL);

I think the kobject code expects this to be statically 

Re: [PATCH v1 16/16] dma-mapping: Disallow .map_sg operations from returning zero on error

2021-07-16 Thread Christoph Hellwig
On Thu, Jul 15, 2021 at 10:45:44AM -0600, Logan Gunthorpe wrote:
> @@ -194,6 +194,8 @@ static int __dma_map_sg_attrs(struct device *dev, struct scatterlist *sg,
>   else
>   ents = ops->map_sg(dev, sg, nents, dir, attrs);
>  
> + WARN_ON_ONCE(ents == 0);

Turn this into a negative error code while we're at it, just to keep
the callers sane?
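
One possible shape of that suggestion, sketched (the errno choice is
illustrative):

ents = ops->map_sg(dev, sg, nents, dir, attrs);
if (WARN_ON_ONCE(ents == 0))    /* legacy "failure" return */
        ents = -EINVAL;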


Re: [PATCH v1 14/16] x86/amd_gart: return error code from gart_map_sg()

2021-07-16 Thread Christoph Hellwig
On Thu, Jul 15, 2021 at 10:45:42AM -0600, Logan Gunthorpe wrote:
> @@ -458,7 +460,7 @@ static int gart_map_sg(struct device *dev, struct scatterlist *sg, int nents,
>   iommu_full(dev, pages << PAGE_SHIFT, dir);
>   for_each_sg(sg, s, nents, i)
>   s->dma_address = DMA_MAPPING_ERROR;
> - return 0;
> + return ret;

Setting the ->dma_address to DMA_MAPPING_ERROR is not part of the
->map_sg calling convention. Might be worth fixing up while we're
at it.


Re: [PATCH v1 04/16] dma-iommu: Return error code from iommu_dma_map_sg()

2021-07-16 Thread Christoph Hellwig
Careful here. What do all these errors from the low-level code mean
here?  I think we need to clearly standardize on what we actually
return from ->map_sg, possibly document what the callers expect and
can do, and enforce that only those errors are reported.
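
A hedged sketch of the kind of enforcement being asked for, assuming the
documented set ends up as -EINVAL/-ENOMEM/-EIO (illustrative only):

if (ret < 0) {
        switch (ret) {
        case -EINVAL:           /* invalid request or arguments */
        case -ENOMEM:           /* resource exhaustion */
                break;          /* documented; callers may act on these */
        default:
                ret = -EIO;     /* clamp everything else */
        }
}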


Re: [PATCH v1 01/16] dma-mapping: Allow map_sg() ops to return negative error codes

2021-07-16 Thread Christoph Hellwig
On Thu, Jul 15, 2021 at 10:45:29AM -0600, Logan Gunthorpe wrote:
> +  * dma_map_sgtable() will return the error code returned and convert
> +  * a zero return (for legacy implementations) into -EINVAL.
> +  *
> +  * dma_map_sg() will always return zero on any negative or zero
> +  * return to satisfy its own calling convention.
>*/

I don't think this belongs here.

> +EXPORT_SYMBOL(dma_map_sgtable);

EXPORT_SYMBOL_GPL, please.
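
The resulting split, as a hedged caller-facing sketch: dma_map_sgtable()
reports the errno, while dma_map_sg() keeps its historical zero-on-failure
contract.

/* sketch of dma_map_sg_attrs() on top of the common helper */
int ents = __dma_map_sg_attrs(dev, sg, nents, dir, attrs);

if (ents < 0)
        ents = 0;       /* legacy convention: 0 means failure */
return ents;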