Re: [Xen-devel] [RFC PATCH v2 07/26] ARM: GICv3 ITS: introduce host LPI array

2017-01-20 Thread Stefano Stabellini
On Fri, 20 Jan 2017, Julien Grall wrote:
> Hi Stefano,
> 
> Sorry for the late answer, still going through my e-mail backlog.
> 
> On 06/01/2017 21:20, Stefano Stabellini wrote:
> > On Fri, 6 Jan 2017, Andre Przywara wrote:
> > > > It is also possible to end up calling mapti with a nonexistent eventid
> > > > for host_devid. Could that be a problem?
> > > 
> > > Not at all. Actually there is no such thing as a "nonexistent event ID",
> > > because the event ID will be written by the device as the payload to the
> > > MSI doorbell address, probably because it learned about it from the
> > > driver. So if we provision an ITTE with an event ID which the device
> > > will never send, that LPI will just never fire.
> > > Since Xen (in contrast to the driver in the domain) has no idea how many
> > > and which MSIs the device will use, we just allocate a bunch of them.
> > > The upper limit (32 at the moment) is something we probably still need
> > > to think about, though.
> > > I tried to learn a limit from Linux ("nvecs" in its_create_device()
> > > seems to be the source), but couldn't find anything useful other than 32.
> > > We will learn about exceeding this limit as soon as a domain tries to
> > > map a virtual LPI with an event ID higher than 31; by then, however, it
> > > is too late to fix. We can bark when this happens, so that during our
> > > testing we learn whether any device ever does this and gather some
> > > heuristic data.
> > > 
> > > Eventually it all boils down to Xen getting more information from Dom0
> > > about the required number of MSIs. We could then even limit the
> > > allocation to fewer than 32, if that helps.
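To make the "bark" concrete, here is a purely illustrative sketch (not code
from this series; the helper name and its arguments are hypothetical) of how
a virtual ITS MAPTI handler could warn about and refuse an event ID beyond
the provisioned block of 32 host LPIs:

#include <xen/errno.h>
#include <xen/lib.h>

/* Same value the patch later defines as LPI_BLOCK. */
#define LPI_BLOCK   32

/* Hypothetical check, called from the (virtual) MAPTI emulation path. */
static int check_guest_eventid(uint32_t vdevid, uint32_t eventid)
{
    if ( eventid >= LPI_BLOCK )
    {
        printk(XENLOG_G_WARNING
               "vITS: device %#x wants event ID %u, only %u provisioned\n",
               vdevid, eventid, LPI_BLOCK);
        return -EINVAL;
    }

    return 0;
}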
> > 
> > Originally Julien and I thought that Xen should map events up to the
> > theoretical maximum for each device, but we realized that they were
> > too many: an MSI-X capable device can generate up to 2048 different
> > events.
> > 
> > Xen needs to find out the exact number of events for each device. The
> > information can either be provided by the guest, or the hypervisor needs
> > to figure it out on its own.
> > 
> > With Julien's PCI Passthrough work, Xen will be able to read the
> > number of events a device is capable of generating, so in the long term
> > this problem should be easy to solve. But Julien's work might land one
> > or two Xen releases after ITS.
> > 
> > In the meantime, we can extend an existing PHYSDEVOP hypercall or add a
> > new one. Julien, do you agree?
> 
> PHYSDEVOPs are a stable ABI and some are already in use by Linux, even on
> ARM.
> 
> I am not in favor of adding a PHYSDEVOP which may not be necessary in a
> couple of releases. Furthermore, there are already some issues with how
> devices are added to the ITS. Indeed, the PHYSDEVOP operations will provide
> the RID (which can be deduced from the BDF), but the DeviceID may not be
> equal to the RID on some platforms. This is where IORT (on ACPI) and
> msi-map (on DT) come in. The plumbing will be added during the PCI work.
> 
> So I would be more in favor of hardcoding the device info (DeviceID,
> maximum number of MSIs) per platform until we get PCI passthrough working.

I think that would be OK for a first ITS implementation in Xen. We can
defer the PHYSDEVOP until later, after you complete PCI passthrough. I also
agree that we'll probably end up with a PHYSDEVOP anyway, but at least
we'll have a better idea of what we need.
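For reference, a minimal sketch of what such hardcoded per-platform data
could look like; the structure name and the example entries are hypothetical,
not taken from the series, and the real values would have to come from each
platform's documentation:

/* Hypothetical per-platform table: which DeviceIDs exist and how many
 * MSIs (events) each of them can use at most. */
struct its_device_info {
    uint32_t deviceid;       /* DeviceID as seen by the ITS */
    unsigned int nr_events;  /* maximum number of MSIs the device generates */
};

/* Example entries for one board; purely made up for illustration. */
static const struct its_device_info example_platform_devices[] = {
    { .deviceid = 0x0100, .nr_events = 4  },
    { .deviceid = 0x0200, .nr_events = 32 },
};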

___
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel


Re: [Xen-devel] [RFC PATCH v2 07/26] ARM: GICv3 ITS: introduce host LPI array

2017-01-20 Thread Julien Grall

Hi Stefano,

Sorry for the late answer, still going through my e-mail backlog.

On 06/01/2017 21:20, Stefano Stabellini wrote:

> On Fri, 6 Jan 2017, Andre Przywara wrote:
> > > It is also possible to end up calling mapti with a nonexistent eventid
> > > for host_devid. Could that be a problem?
> > 
> > Not at all. Actually there is no such thing as a "nonexistent event ID",
> > because the event ID will be written by the device as the payload to the
> > MSI doorbell address, probably because it learned about it from the
> > driver. So if we provision an ITTE with an event ID which the device
> > will never send, that LPI will just never fire.
> > Since Xen (in contrast to the driver in the domain) has no idea how many
> > and which MSIs the device will use, we just allocate a bunch of them.
> > The upper limit (32 at the moment) is something we probably still need
> > to think about, though.
> > I tried to learn a limit from Linux ("nvecs" in its_create_device()
> > seems to be the source), but couldn't find anything useful other than 32.
> > We will learn about exceeding this limit as soon as a domain tries to
> > map a virtual LPI with an event ID higher than 31; by then, however, it
> > is too late to fix. We can bark when this happens, so that during our
> > testing we learn whether any device ever does this and gather some
> > heuristic data.
> > 
> > Eventually it all boils down to Xen getting more information from Dom0
> > about the required number of MSIs. We could then even limit the
> > allocation to fewer than 32, if that helps.
> 
> Originally Julien and I thought that Xen should map events up to the
> theoretical maximum for each device, but we realized that they were
> too many: an MSI-X capable device can generate up to 2048 different
> events.
> 
> Xen needs to find out the exact number of events for each device. The
> information can either be provided by the guest, or the hypervisor needs
> to figure it out on its own.
> 
> With Julien's PCI Passthrough work, Xen will be able to read the
> number of events a device is capable of generating, so in the long term
> this problem should be easy to solve. But Julien's work might land one
> or two Xen releases after ITS.
> 
> In the meantime, we can extend an existing PHYSDEVOP hypercall or add a
> new one. Julien, do you agree?


PHYSDEVOPs are a stable ABI and some are already in use by Linux, even on
ARM.

I am not in favor of adding a PHYSDEVOP which may not be necessary in a
couple of releases. Furthermore, there are already some issues with how
devices are added to the ITS. Indeed, the PHYSDEVOP operations will provide
the RID (which can be deduced from the BDF), but the DeviceID may not be
equal to the RID on some platforms. This is where IORT (on ACPI) and
msi-map (on DT) come in. The plumbing will be added during the PCI work.

So I would be more in favor of hardcoding the device info (DeviceID,
maximum number of MSIs) per platform until we get PCI passthrough working.


Note that the PCI work will not include any thoughts on platform devices
supporting MSIs. So we may end up adding a new PHYSDEVOP.
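Purely as an illustration of the kind of information such a hypothetical
(not yet existing) hypercall would have to carry, it could be as small as:

/* Hypothetical sketch only -- no such PHYSDEVOP exists today. Dom0 would
 * use it to tell Xen about an MSI-capable (PCI or platform) device. */
struct physdev_msi_device_info {
    uint32_t deviceid;   /* DeviceID to use on the host ITS */
    uint32_t nr_events;  /* number of MSIs/events the device can generate */
};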


Cheers,

--
Julien Grall

___
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel


Re: [Xen-devel] [RFC PATCH v2 07/26] ARM: GICv3 ITS: introduce host LPI array

2017-01-06 Thread Stefano Stabellini
On Fri, 6 Jan 2017, Andre Przywara wrote:
> >> +/* LPIs on the host always go to a guest, so no struct irq_desc for them. */
> >> +union host_lpi {
> >> +uint64_t data;
> >> +struct {
> >> +uint64_t virt_lpi:32;
> >> +uint64_t dom_id:16;
> >> +uint64_t vcpu_id:16;
> >> +};
> >> +};
> > 
> > Just go with a regular struct
> > 
> > struct host_lpi {
> > uint32_t virt_lpi;
> > uint16_t dom_id;
> > uint16_t vcpu_id;
> > };
> > 
> > The aarch64 C ABI guarantees the alignments of the fields.
> 
> Yes, I will get rid of the bitfields. But the actual purpose of the
> union is to allow lock-free atomic access. I just see now that I failed
> to document that, sorry!
> 
> We can't afford to have a lock for the actual data here, so the idea was
> to use the naturally atomic access a native data type would give us.
> In case we want to write multiple members, we assemble them in a local
> copy and then write the uint64_t variable into the actual location.
> Similarly for reading. A single member can be updated directly.
> Since the architecture guarantees atomic access for an aligned memory
> access to/from a GPR, I think this is safe.
> I am not sure whether we need to use the atomic {read,write} accessors
> here. I tried it: the resulting assembly is identical, and the source
> doesn't look too bad either, so I guess I will change them over, just to
> be safe?

Yes, it will also work as documentation.
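For illustration, a minimal sketch of the write side of that pattern,
assuming Xen's write_atomic() helper and the union host_lpi from the patch;
the host_lpi_set() name is made up:

/* Assemble the new state in a local copy, then publish it with one
 * aligned 64-bit store, so readers never see a half-updated entry. */
static void host_lpi_set(union host_lpi *hlpi, uint32_t virt_lpi,
                         uint16_t dom_id, uint16_t vcpu_id)
{
    union host_lpi new_lpi;

    new_lpi.virt_lpi = virt_lpi;
    new_lpi.dom_id = dom_id;
    new_lpi.vcpu_id = vcpu_id;

    write_atomic(&hlpi->data, new_lpi.data);
}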


> > It is also possible to end up calling mapti with a nonexistent eventid
> > for host_devid. Could that be a problem?
> 
> Not at all. Actually there is no such thing as a "nonexistent event ID",
> because the event ID will be written by the device as the payload to the
> MSI doorbell address, probably because it learned about it from the
> driver. So if we provision an ITTE with an event ID which the device
> will never send, that LPI will just never fire.
> Since Xen (in contrast to the driver in the domain) has no idea how many
> and which MSIs the device will use, we just allocate a bunch of them.
> The upper limit (32 at the moment) is something we probably still need
> to think about, though.
> I tried to learn a limit from Linux ("nvecs" in its_create_device()
> seems to be the source), but couldn't find anything useful other than 32.
> We will learn about exceeding this limit as soon as a domain tries to
> map a virtual LPI with an event ID higher than 31; by then, however, it
> is too late to fix. We can bark when this happens, so that during our
> testing we learn whether any device ever does this and gather some
> heuristic data.
> 
> Eventually it all boils down to Xen getting more information from Dom0
> about the required number of MSIs. We could then even limit the
> allocation to fewer than 32, if that helps.

Originally Julien and I thought that Xen should map events up to the
theoretical maximum for each device, but we realized that they were
too many: an MSI-X capable device can generate up to 2048 different
events.

Xen needs to find out the exact number of events for each device. The
information can either be provided by the guest, or the hypervisor needs
to figure it out on its own.

With Julien's PCI Passthrough work, Xen will be able to read the
number of events a device is capable of generating, so in the long term
this problem should be easy to solve. But Julien's work might land one
or two Xen releases after ITS.

In the meantime, we can extend an existing PHYSDEVOP hypercall or add a
new one. Julien, do you agree?

___
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel


Re: [Xen-devel] [RFC PATCH v2 07/26] ARM: GICv3 ITS: introduce host LPI array

2017-01-06 Thread Andre Przywara
Hi,

On 05/01/17 18:56, Stefano Stabellini wrote:
> On Thu, 22 Dec 2016, Andre Przywara wrote:
>> The number of LPIs on a host can be potentially huge (millions),
>> although in practice it will mostly be reasonable. So prematurely allocating
>> an array of struct irq_desc's for each LPI is not an option.
>> However Xen itself does not care about LPIs, as every LPI will be injected
>> into a guest (Dom0 for now).
>> Create a dense data structure (8 Bytes) for each LPI which holds just
>> enough information to determine the virtual IRQ number and the VCPU into
>> which the LPI needs to be injected.
>> Also to not artificially limit the number of LPIs, we create a 2-level
>> table for holding those structures.
>> This patch introduces functions to initialize these tables and to
>> create, lookup and destroy entries for a given LPI.
>> We allocate and access LPI information in a way that does not require
>> a lock.
>>
>> Signed-off-by: Andre Przywara 
>> ---
>>  xen/arch/arm/gic-its.c| 233 
>> +-
>>  xen/include/asm-arm/gic-its.h |   1 +
>>  2 files changed, 233 insertions(+), 1 deletion(-)
>>
>> diff --git a/xen/arch/arm/gic-its.c b/xen/arch/arm/gic-its.c
>> index e157c6b..e7ddd90 100644
>> --- a/xen/arch/arm/gic-its.c
>> +++ b/xen/arch/arm/gic-its.c
>> @@ -18,21 +18,36 @@
>>  
>>  #include 
>>  #include 
>> +#include 
>>  #include 
>>  #include 
>>  #include 
>>  #include 
>>  #include 
>>  #include 
>> +#include 
>>  #include 
>>  #include 
>>  #include 
>>  #include 
>>  
>> +/* LPIs on the host always go to a guest, so no struct irq_desc for them. */
>> +union host_lpi {
>> +uint64_t data;
>> +struct {
>> +uint64_t virt_lpi:32;
>> +uint64_t dom_id:16;
>> +uint64_t vcpu_id:16;
>> +};
>> +};
> 
> Just go with a regular struct
> 
> struct host_lpi {
> uint32_t virt_lpi;
> uint16_t dom_id;
> uint16_t vcpu_id;
> };
> 
> The aarch64 C ABI guarantees the alignments of the fields.

Yes, I will get rid of the bitfields. But the actual purpose of the
union is to allow lock-free atomic access. I just see now that I failed
to document that, sorry!

We can't afford to have a lock for the actual data here, so the idea was
to use the naturally atomic access a native data type would give us.
In case we want to write multiple members, we assemble them in a local
copy and then write the uint64_t variable into the actual location.
Similarly for reading. A single member can be updated directly.
Since the architecture guarantees atomic access for an aligned memory
access to/from a GPR, I think this is safe.
I am not sure whether we need to use the atomic {read,write} accessors
here. I tried it: the resulting assembly is identical, and the source
doesn't look too bad either, so I guess I will change them over, just to
be safe?

Please note that this is not about access ordering due to concurrent
accesses (so no barriers); we just need to guarantee that a host_lpi's
state is consistent with respect to its members.
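A matching sketch of the read side, again assuming Xen's read_atomic()
helper (the function name is only illustrative):

/* Take a consistent snapshot of an entry with a single 64-bit load and
 * let the caller look at the members of the local copy. */
static union host_lpi host_lpi_get(union host_lpi *hlpi)
{
    union host_lpi snapshot;

    snapshot.data = read_atomic(&hlpi->data);

    return snapshot;
}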

>>  /* Global state */
>>  static struct {
>>  uint8_t *lpi_property;
>> +union host_lpi **host_lpis;
>>  unsigned int host_lpi_bits;
>> +/* Protects allocation and deallocation of host LPIs, but not the access */
>> +spinlock_t host_lpis_lock;
>>  } lpi_data;
>>  
>>  /* Physical redistributor address */
>> @@ -43,6 +58,19 @@ static DEFINE_PER_CPU(uint64_t, rdist_id);
>>  static DEFINE_PER_CPU(void *, pending_table);
>>  
>>  #define MAX_PHYS_LPIS   (BIT_ULL(lpi_data.host_lpi_bits) - 8192)
>> +#define HOST_LPIS_PER_PAGE  (PAGE_SIZE / sizeof(union host_lpi))
>> +
>> +static union host_lpi *gic_get_host_lpi(uint32_t plpi)
>> +{
>> +if ( plpi < 8192 || plpi >= MAX_PHYS_LPIS + 8192 )
>> +return NULL;
>> +
>> +plpi -= 8192;
>> +if ( !lpi_data.host_lpis[plpi / HOST_LPIS_PER_PAGE] )
>> +return NULL;
>> +
>> +return &lpi_data.host_lpis[plpi / HOST_LPIS_PER_PAGE][plpi % HOST_LPIS_PER_PAGE];
>> +}
>>  
>>  #define ITS_CMD_QUEUE_SZSZ_64K
>>  
>> @@ -96,6 +124,20 @@ static int its_send_cmd_sync(struct host_its *its, int 
>> cpu)
>>  return its_send_command(its, cmd);
>>  }
>>  
>> +static int its_send_cmd_mapti(struct host_its *its,
>> +  uint32_t deviceid, uint32_t eventid,
>> +  uint32_t pintid, uint16_t icid)
>> +{
>> +uint64_t cmd[4];
>> +
>> +cmd[0] = GITS_CMD_MAPTI | ((uint64_t)deviceid << 32);
>> +cmd[1] = eventid | ((uint64_t)pintid << 32);
>> +cmd[2] = icid;
>> +cmd[3] = 0x00;
>> +
>> +return its_send_command(its, cmd);
>> +}
>> +
>>  static int its_send_cmd_mapc(struct host_its *its, int collection_id, int 
>> cpu)
>>  {
>>  uint64_t cmd[4];
>> @@ -124,6 +166,19 @@ static int its_send_cmd_mapd(struct host_its *its, 
>> uint32_t deviceid,
>>  return its_send_command(its, cmd);
>>  }
>>  
>> +static 

Re: [Xen-devel] [RFC PATCH v2 07/26] ARM: GICv3 ITS: introduce host LPI array

2017-01-05 Thread Stefano Stabellini
On Thu, 22 Dec 2016, Andre Przywara wrote:
> The number of LPIs on a host can be potentially huge (millions),
> although in practice it will mostly be reasonable. So prematurely allocating
> an array of struct irq_desc's for each LPI is not an option.
> However Xen itself does not care about LPIs, as every LPI will be injected
> into a guest (Dom0 for now).
> Create a dense data structure (8 Bytes) for each LPI which holds just
> enough information to determine the virtual IRQ number and the VCPU into
> which the LPI needs to be injected.
> Also to not artificially limit the number of LPIs, we create a 2-level
> table for holding those structures.
> This patch introduces functions to initialize these tables and to
> create, lookup and destroy entries for a given LPI.
> We allocate and access LPI information in a way that does not require
> a lock.
> 
> Signed-off-by: Andre Przywara 
> ---
>  xen/arch/arm/gic-its.c| 233 
> +-
>  xen/include/asm-arm/gic-its.h |   1 +
>  2 files changed, 233 insertions(+), 1 deletion(-)
> 
> diff --git a/xen/arch/arm/gic-its.c b/xen/arch/arm/gic-its.c
> index e157c6b..e7ddd90 100644
> --- a/xen/arch/arm/gic-its.c
> +++ b/xen/arch/arm/gic-its.c
> @@ -18,21 +18,36 @@
>  
>  #include 
>  #include 
> +#include 
>  #include 
>  #include 
>  #include 
>  #include 
>  #include 
>  #include 
> +#include 
>  #include 
>  #include 
>  #include 
>  #include 
>  
> +/* LPIs on the host always go to a guest, so no struct irq_desc for them. */
> +union host_lpi {
> +uint64_t data;
> +struct {
> +uint64_t virt_lpi:32;
> +uint64_t dom_id:16;
> +uint64_t vcpu_id:16;
> +};
> +};

Just go with a regular struct

struct host_lpi {
uint32_t virt_lpi;
uint16_t dom_id;
uint16_t vcpu_id;
};

The aarch64 C ABI guarantees the alignments of the fields.
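Whichever layout is chosen in the end, a compile-time check can document
the 8-byte assumption that both the lock-free access and the
HOST_LPIS_PER_PAGE arithmetic rely on. A minimal sketch, assuming Xen's
BUILD_BUG_ON and the union host_lpi (or struct) from the patch:

static void __init host_lpi_build_assertions(void)
{
    /* One host LPI entry must stay exactly 64 bits wide. */
    BUILD_BUG_ON(sizeof(union host_lpi) != sizeof(uint64_t));
}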


>  /* Global state */
>  static struct {
>  uint8_t *lpi_property;
> +union host_lpi **host_lpis;
>  unsigned int host_lpi_bits;
> +/* Protects allocation and deallocation of host LPIs, but not the access */
> +spinlock_t host_lpis_lock;
>  } lpi_data;
>  
>  /* Physical redistributor address */
> @@ -43,6 +58,19 @@ static DEFINE_PER_CPU(uint64_t, rdist_id);
>  static DEFINE_PER_CPU(void *, pending_table);
>  
>  #define MAX_PHYS_LPIS   (BIT_ULL(lpi_data.host_lpi_bits) - 8192)
> +#define HOST_LPIS_PER_PAGE  (PAGE_SIZE / sizeof(union host_lpi))
> +
> +static union host_lpi *gic_get_host_lpi(uint32_t plpi)
> +{
> +if ( plpi < 8192 || plpi >= MAX_PHYS_LPIS + 8192 )
> +return NULL;
> +
> +plpi -= 8192;
> +if ( !lpi_data.host_lpis[plpi / HOST_LPIS_PER_PAGE] )
> +return NULL;
> +
> +return &lpi_data.host_lpis[plpi / HOST_LPIS_PER_PAGE][plpi % HOST_LPIS_PER_PAGE];
> +}
>  
>  #define ITS_CMD_QUEUE_SZSZ_64K
>  
> @@ -96,6 +124,20 @@ static int its_send_cmd_sync(struct host_its *its, int 
> cpu)
>  return its_send_command(its, cmd);
>  }
>  
> +static int its_send_cmd_mapti(struct host_its *its,
> +  uint32_t deviceid, uint32_t eventid,
> +  uint32_t pintid, uint16_t icid)
> +{
> +uint64_t cmd[4];
> +
> +cmd[0] = GITS_CMD_MAPTI | ((uint64_t)deviceid << 32);
> +cmd[1] = eventid | ((uint64_t)pintid << 32);
> +cmd[2] = icid;
> +cmd[3] = 0x00;
> +
> +return its_send_command(its, cmd);
> +}
> +
>  static int its_send_cmd_mapc(struct host_its *its, int collection_id, int 
> cpu)
>  {
>  uint64_t cmd[4];
> @@ -124,6 +166,19 @@ static int its_send_cmd_mapd(struct host_its *its, 
> uint32_t deviceid,
>  return its_send_command(its, cmd);
>  }
>  
> +static int its_send_cmd_inv(struct host_its *its,
> +uint32_t deviceid, uint32_t eventid)
> +{
> +uint64_t cmd[4];
> +
> +cmd[0] = GITS_CMD_INV | ((uint64_t)deviceid << 32);
> +cmd[1] = eventid;
> +cmd[2] = 0x00;
> +cmd[3] = 0x00;
> +
> +return its_send_command(its, cmd);
> +}
> +
>  /* Set up the (1:1) collection mapping for the given host CPU. */
>  void gicv3_its_setup_collection(int cpu)
>  {
> @@ -366,21 +421,181 @@ uint64_t gicv3_lpi_get_proptable()
>  static unsigned int max_lpi_bits = CONFIG_MAX_HOST_LPI_BITS;
>  integer_param("max_lpi_bits", max_lpi_bits);
>  
> +/* Allocate the 2nd level array for host LPIs. This one holds pointers
> + * to the page with the actual "union host_lpi" entries. Our LPI limit
> + * avoids excessive memory usage.
> + */
>  int gicv3_lpi_init_host_lpis(unsigned int hw_lpi_bits)
>  {
> +int nr_lpi_ptrs;
> +
>  lpi_data.host_lpi_bits = min(hw_lpi_bits, max_lpi_bits);
>  
> +spin_lock_init(&lpi_data.host_lpis_lock);
> +
> +nr_lpi_ptrs = MAX_PHYS_LPIS / (PAGE_SIZE / sizeof(union host_lpi));
> +lpi_data.host_lpis = xzalloc_array(union host_lpi *, nr_lpi_ptrs);
> +if ( !lpi_data.host_lpis )
> +return 

[Xen-devel] [RFC PATCH v2 07/26] ARM: GICv3 ITS: introduce host LPI array

2016-12-22 Thread Andre Przywara
The number of LPIs on a host can be potentially huge (millions),
although in practice it will mostly be reasonable. So prematurely allocating
an array of struct irq_desc's for each LPI is not an option.
However Xen itself does not care about LPIs, as every LPI will be injected
into a guest (Dom0 for now).
Create a dense data structure (8 Bytes) for each LPI which holds just
enough information to determine the virtual IRQ number and the VCPU into
which the LPI needs to be injected.
Also to not artificially limit the number of LPIs, we create a 2-level
table for holding those structures.
This patch introduces functions to initialize these tables and to
create, lookup and destroy entries for a given LPI.
We allocate and access LPI information in a way that does not require
a lock.

Signed-off-by: Andre Przywara 
---
 xen/arch/arm/gic-its.c| 233 +-
 xen/include/asm-arm/gic-its.h |   1 +
 2 files changed, 233 insertions(+), 1 deletion(-)

diff --git a/xen/arch/arm/gic-its.c b/xen/arch/arm/gic-its.c
index e157c6b..e7ddd90 100644
--- a/xen/arch/arm/gic-its.c
+++ b/xen/arch/arm/gic-its.c
@@ -18,21 +18,36 @@
 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
 #include 
 
+/* LPIs on the host always go to a guest, so no struct irq_desc for them. */
+union host_lpi {
+uint64_t data;
+struct {
+uint64_t virt_lpi:32;
+uint64_t dom_id:16;
+uint64_t vcpu_id:16;
+};
+};
+
 /* Global state */
 static struct {
 uint8_t *lpi_property;
+union host_lpi **host_lpis;
 unsigned int host_lpi_bits;
+/* Protects allocation and deallocation of host LPIs, but not the access */
+spinlock_t host_lpis_lock;
 } lpi_data;
 
 /* Physical redistributor address */
@@ -43,6 +58,19 @@ static DEFINE_PER_CPU(uint64_t, rdist_id);
 static DEFINE_PER_CPU(void *, pending_table);
 
 #define MAX_PHYS_LPIS   (BIT_ULL(lpi_data.host_lpi_bits) - 8192)
+#define HOST_LPIS_PER_PAGE  (PAGE_SIZE / sizeof(union host_lpi))
+
+static union host_lpi *gic_get_host_lpi(uint32_t plpi)
+{
+if ( plpi < 8192 || plpi >= MAX_PHYS_LPIS + 8192 )
+return NULL;
+
+plpi -= 8192;
+if ( !lpi_data.host_lpis[plpi / HOST_LPIS_PER_PAGE] )
+return NULL;
+
+return &lpi_data.host_lpis[plpi / HOST_LPIS_PER_PAGE][plpi % HOST_LPIS_PER_PAGE];
+}
 
 #define ITS_CMD_QUEUE_SZSZ_64K
 
@@ -96,6 +124,20 @@ static int its_send_cmd_sync(struct host_its *its, int cpu)
 return its_send_command(its, cmd);
 }
 
+static int its_send_cmd_mapti(struct host_its *its,
+  uint32_t deviceid, uint32_t eventid,
+  uint32_t pintid, uint16_t icid)
+{
+uint64_t cmd[4];
+
+cmd[0] = GITS_CMD_MAPTI | ((uint64_t)deviceid << 32);
+cmd[1] = eventid | ((uint64_t)pintid << 32);
+cmd[2] = icid;
+cmd[3] = 0x00;
+
+return its_send_command(its, cmd);
+}
+
 static int its_send_cmd_mapc(struct host_its *its, int collection_id, int cpu)
 {
 uint64_t cmd[4];
@@ -124,6 +166,19 @@ static int its_send_cmd_mapd(struct host_its *its, 
uint32_t deviceid,
 return its_send_command(its, cmd);
 }
 
+static int its_send_cmd_inv(struct host_its *its,
+uint32_t deviceid, uint32_t eventid)
+{
+uint64_t cmd[4];
+
+cmd[0] = GITS_CMD_INV | ((uint64_t)deviceid << 32);
+cmd[1] = eventid;
+cmd[2] = 0x00;
+cmd[3] = 0x00;
+
+return its_send_command(its, cmd);
+}
+
 /* Set up the (1:1) collection mapping for the given host CPU. */
 void gicv3_its_setup_collection(int cpu)
 {
@@ -366,21 +421,181 @@ uint64_t gicv3_lpi_get_proptable()
 static unsigned int max_lpi_bits = CONFIG_MAX_HOST_LPI_BITS;
 integer_param("max_lpi_bits", max_lpi_bits);
 
+/* Allocate the 2nd level array for host LPIs. This one holds pointers
+ * to the page with the actual "union host_lpi" entries. Our LPI limit
+ * avoids excessive memory usage.
+ */
 int gicv3_lpi_init_host_lpis(unsigned int hw_lpi_bits)
 {
+int nr_lpi_ptrs;
+
 lpi_data.host_lpi_bits = min(hw_lpi_bits, max_lpi_bits);
 
+spin_lock_init(&lpi_data.host_lpis_lock);
+
+nr_lpi_ptrs = MAX_PHYS_LPIS / (PAGE_SIZE / sizeof(union host_lpi));
+lpi_data.host_lpis = xzalloc_array(union host_lpi *, nr_lpi_ptrs);
+if ( !lpi_data.host_lpis )
+return -ENOMEM;
+
 printk("GICv3: using at most %lld LPIs on the host.\n", MAX_PHYS_LPIS);
 
 return 0;
 }
 
+#define INVALID_DOMID ((uint16_t)~0)
+#define LPI_BLOCK   32
+
+/* Must be called with host_lpis_lock held. */
+static int find_unused_host_lpi(int start, uint32_t *index)
+{
+int chunk;
+uint32_t i = *index;
+
+for ( chunk = start; chunk < MAX_PHYS_LPIS / HOST_LPIS_PER_PAGE; chunk++ )
+{
+/* If we hit an unallocated chunk, use entry 0 in that one. */
+if ( !lpi_data.host_lpis[chunk] )
+
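Purely for illustration (this is not the continuation of the patch above),
the idea of the scan could look roughly like this, reusing LPI_BLOCK,
INVALID_DOMID and the lpi_data definitions quoted earlier; like the real
helper, it must be called with host_lpis_lock held:

/* Illustrative sketch only. Walk the chunk pointers and return the first
 * free LPI_BLOCK-aligned offset; blocks are handed out as a whole, so
 * checking the first entry of each block is enough here. */
static int example_find_free_block(uint32_t *offset)
{
    unsigned int chunk, i;

    for ( chunk = 0; chunk < MAX_PHYS_LPIS / HOST_LPIS_PER_PAGE; chunk++ )
    {
        union host_lpi *page = lpi_data.host_lpis[chunk];

        if ( !page )
        {
            /* Unallocated chunk: use entry 0, the caller maps the page. */
            *offset = chunk * HOST_LPIS_PER_PAGE;
            return 0;
        }

        for ( i = 0; i < HOST_LPIS_PER_PAGE; i += LPI_BLOCK )
        {
            if ( page[i].dom_id == INVALID_DOMID )
            {
                *offset = chunk * HOST_LPIS_PER_PAGE + i;
                return 0;
            }
        }
    }

    return -ENODEV;
}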