Re: [PATCH v2 6/6] x86, apicv: Add Posted Interrupt supporting

2012-11-25 Thread Gleb Natapov
On Wed, Nov 21, 2012 at 04:09:39PM +0800, Yang Zhang wrote:
> Posted Interrupt allows APICv interrupts to be injected into the guest directly
> without any vmexit.
> 
> - When delivering an interrupt to the guest, if the target vcpu is running,
>   update the Posted-interrupt requests bitmap and send a notification event
>   to the vcpu. Then the vcpu will handle this interrupt automatically,
>   without any software involvement.
> 
Looks like you are allocating one irq vector per vcpu per pcpu and then
migrating or reallocating it when a vcpu moves from one pcpu to another.
This is not scalable, and migrating the irq slows things down.
What's wrong with allocating one global vector for posted interrupts
during vmx initialization and using it for all vcpus?

> - If the target vcpu is not running, or there is already a notification event
>   pending in the vcpu, do nothing. The interrupt will be handled the old
>   way.
> 
> Signed-off-by: Yang Zhang 
> ---
>  arch/x86/include/asm/kvm_host.h |3 +
>  arch/x86/include/asm/vmx.h  |4 +
>  arch/x86/kernel/apic/io_apic.c  |  138 
>  arch/x86/kvm/lapic.c|   31 ++-
>  arch/x86/kvm/lapic.h|8 ++
>  arch/x86/kvm/vmx.c  |  192 
> +--
>  arch/x86/kvm/x86.c  |2 +
>  include/linux/kvm_host.h|1 +
>  virt/kvm/kvm_main.c |2 +
>  9 files changed, 372 insertions(+), 9 deletions(-)
> 
> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index 8e07a86..1145894 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -683,9 +683,12 @@ struct kvm_x86_ops {
>   void (*enable_irq_window)(struct kvm_vcpu *vcpu);
>   void (*update_cr8_intercept)(struct kvm_vcpu *vcpu, int tpr, int irr);
>   int (*has_virtual_interrupt_delivery)(struct kvm_vcpu *vcpu);
> + int (*has_posted_interrupt)(struct kvm_vcpu *vcpu);
>   void (*update_irq)(struct kvm_vcpu *vcpu);
>   void (*set_eoi_exitmap)(struct kvm_vcpu *vcpu, int vector,
>   int need_eoi, int global);
> + int (*send_nv)(struct kvm_vcpu *vcpu, int vector);
> + void (*pi_migrate)(struct kvm_vcpu *vcpu);
>   int (*set_tss_addr)(struct kvm *kvm, unsigned int addr);
>   int (*get_tdp_level)(void);
>   u64 (*get_mt_mask)(struct kvm_vcpu *vcpu, gfn_t gfn, bool is_mmio);
> diff --git a/arch/x86/include/asm/vmx.h b/arch/x86/include/asm/vmx.h
> index 1003341..7b9e1d0 100644
> --- a/arch/x86/include/asm/vmx.h
> +++ b/arch/x86/include/asm/vmx.h
> @@ -152,6 +152,7 @@
>  #define PIN_BASED_EXT_INTR_MASK 0x0001
>  #define PIN_BASED_NMI_EXITING   0x0008
>  #define PIN_BASED_VIRTUAL_NMIS  0x0020
> +#define PIN_BASED_POSTED_INTR   0x0080
>  
>  #define VM_EXIT_SAVE_DEBUG_CONTROLS 0x0002
>  #define VM_EXIT_HOST_ADDR_SPACE_SIZE0x0200
> @@ -174,6 +175,7 @@
>  /* VMCS Encodings */
>  enum vmcs_field {
>   VIRTUAL_PROCESSOR_ID= 0x,
> + POSTED_INTR_NV  = 0x0002,
>   GUEST_ES_SELECTOR   = 0x0800,
>   GUEST_CS_SELECTOR   = 0x0802,
>   GUEST_SS_SELECTOR   = 0x0804,
> @@ -208,6 +210,8 @@ enum vmcs_field {
>   VIRTUAL_APIC_PAGE_ADDR_HIGH = 0x2013,
>   APIC_ACCESS_ADDR= 0x2014,
>   APIC_ACCESS_ADDR_HIGH   = 0x2015,
> + POSTED_INTR_DESC_ADDR   = 0x2016,
> + POSTED_INTR_DESC_ADDR_HIGH  = 0x2017,
>   EPT_POINTER = 0x201a,
>   EPT_POINTER_HIGH= 0x201b,
>   EOI_EXIT_BITMAP0= 0x201c,
> diff --git a/arch/x86/kernel/apic/io_apic.c b/arch/x86/kernel/apic/io_apic.c
> index 1817fa9..97cb8ee 100644
> --- a/arch/x86/kernel/apic/io_apic.c
> +++ b/arch/x86/kernel/apic/io_apic.c
> @@ -3277,6 +3277,144 @@ int arch_setup_dmar_msi(unsigned int irq)
>  }
>  #endif
>  
> +static int
> +pi_set_affinity(struct irq_data *data, const struct cpumask *mask,
> +   bool force)
> +{
> + unsigned int dest;
> + struct irq_cfg *cfg = (struct irq_cfg *)data->chip_data;
> + if (cpumask_equal(cfg->domain, mask))
> + return IRQ_SET_MASK_OK;
> +
> + if (__ioapic_set_affinity(data, mask, &dest))
> + return -1;
> +
> + return IRQ_SET_MASK_OK;
> +}
> +
> +static void pi_mask(struct irq_data *data)
> +{
> + ;
> +}
> +
> +static void pi_unmask(struct irq_data *data)
> +{
> + ;
> +}
> +
> +static struct irq_chip pi_chip = {
> + .name   = "POSTED-INTR",
> + .irq_ack= ack_apic_edge,
> + .irq_unmask = pi_unmask,
> + .irq_mask   = pi_mask,
> + .irq_set_affinity   = pi_set_affinity,
> +};
> +
> +int arch_pi_migrate(int irq, int cpu)
> +{
> + struct irq_data *data = irq_get_irq_data(irq);
> + struct ir

RE: [PATCH v2 6/6] x86, apicv: Add Posted Interrupt supporting

2012-11-25 Thread Zhang, Yang Z
Gleb Natapov wrote on 2012-11-25:
> On Wed, Nov 21, 2012 at 04:09:39PM +0800, Yang Zhang wrote:
>> Posted Interrupt allows vAPICV interrupts to inject into guest directly
>> without any vmexit.
>> 
>> - When delivering a interrupt to guest, if target vcpu is running,
>>   update Posted-interrupt requests bitmap and send a notification event
>>   to the vcpu. Then the vcpu will handle this interrupt automatically,
>>   without any software involvemnt.
> Looks like you allocating one irq vector per vcpu per pcpu and then
> migrate it or reallocate when vcpu move from one pcpu to another.
> This is not scalable and migrating irq migration slows things down.
> What's wrong with allocating one global vector for posted interrupt
> during vmx initialization and use it for all vcpus?

Consider the following situation:
If vcpu A is running when a notification event that belongs to vcpu B arrives,
then, since the vector matches vcpu A's notification vector, the event will
be consumed by vcpu A (even though it does nothing) and the interrupt cannot
be handled in time.

>> - If target vcpu is not running or there already a notification event
>>   pending in the vcpu, do nothing. The interrupt will be handled by old
>>   way.
>> Signed-off-by: Yang Zhang 
>> ---
>>  arch/x86/include/asm/kvm_host.h |3 +
>>  arch/x86/include/asm/vmx.h  |4 +
>>  arch/x86/kernel/apic/io_apic.c  |  138 
>>  arch/x86/kvm/lapic.c|   31 ++-
>>  arch/x86/kvm/lapic.h|8 ++
>>  arch/x86/kvm/vmx.c  |  192 +--
>>  arch/x86/kvm/x86.c  |2 +
>>  include/linux/kvm_host.h|1 +
>>  virt/kvm/kvm_main.c |2 +
>>  9 files changed, 372 insertions(+), 9 deletions(-)
>> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
>> index 8e07a86..1145894 100644
>> --- a/arch/x86/include/asm/kvm_host.h
>> +++ b/arch/x86/include/asm/kvm_host.h
>> @@ -683,9 +683,12 @@ struct kvm_x86_ops {
>>  void (*enable_irq_window)(struct kvm_vcpu *vcpu);
>>  void (*update_cr8_intercept)(struct kvm_vcpu *vcpu, int tpr, int irr);
>>  int (*has_virtual_interrupt_delivery)(struct kvm_vcpu *vcpu);
>> +int (*has_posted_interrupt)(struct kvm_vcpu *vcpu);
>>  void (*update_irq)(struct kvm_vcpu *vcpu);
>>  void (*set_eoi_exitmap)(struct kvm_vcpu *vcpu, int vector,
>>  int need_eoi, int global);
>> +int (*send_nv)(struct kvm_vcpu *vcpu, int vector);
>> +void (*pi_migrate)(struct kvm_vcpu *vcpu);
>>  int (*set_tss_addr)(struct kvm *kvm, unsigned int addr);
>>  int (*get_tdp_level)(void);
>>  u64 (*get_mt_mask)(struct kvm_vcpu *vcpu, gfn_t gfn, bool is_mmio);
>> diff --git a/arch/x86/include/asm/vmx.h b/arch/x86/include/asm/vmx.h
>> index 1003341..7b9e1d0 100644
>> --- a/arch/x86/include/asm/vmx.h
>> +++ b/arch/x86/include/asm/vmx.h
>> @@ -152,6 +152,7 @@
>>  #define PIN_BASED_EXT_INTR_MASK 0x0001
>>  #define PIN_BASED_NMI_EXITING   0x0008
>>  #define PIN_BASED_VIRTUAL_NMIS  0x0020
>> +#define PIN_BASED_POSTED_INTR   0x0080
>> 
>>  #define VM_EXIT_SAVE_DEBUG_CONTROLS 0x0002
>>  #define VM_EXIT_HOST_ADDR_SPACE_SIZE0x0200
>> @@ -174,6 +175,7 @@
>>  /* VMCS Encodings */
>>  enum vmcs_field {
>>  VIRTUAL_PROCESSOR_ID= 0x,
>> +POSTED_INTR_NV  = 0x0002,
>>  GUEST_ES_SELECTOR   = 0x0800,
>>  GUEST_CS_SELECTOR   = 0x0802,
>>  GUEST_SS_SELECTOR   = 0x0804,
>> @@ -208,6 +210,8 @@ enum vmcs_field {
>>  VIRTUAL_APIC_PAGE_ADDR_HIGH = 0x2013,
>>  APIC_ACCESS_ADDR= 0x2014,
>>  APIC_ACCESS_ADDR_HIGH   = 0x2015,
>> +POSTED_INTR_DESC_ADDR   = 0x2016,
>> +POSTED_INTR_DESC_ADDR_HIGH  = 0x2017,
>>  EPT_POINTER = 0x201a,
>>  EPT_POINTER_HIGH= 0x201b,
>>  EOI_EXIT_BITMAP0= 0x201c,
>> diff --git a/arch/x86/kernel/apic/io_apic.c b/arch/x86/kernel/apic/io_apic.c
>> index 1817fa9..97cb8ee 100644
>> --- a/arch/x86/kernel/apic/io_apic.c
>> +++ b/arch/x86/kernel/apic/io_apic.c
>> @@ -3277,6 +3277,144 @@ int arch_setup_dmar_msi(unsigned int irq)
>>  }
>>  #endif
>> +static int
>> +pi_set_affinity(struct irq_data *data, const struct cpumask *mask,
>> +  bool force)
>> +{
>> +unsigned int dest;
>> +struct irq_cfg *cfg = (struct irq_cfg *)data->chip_data;
>> +if (cpumask_equal(cfg->domain, mask))
>> +return IRQ_SET_MASK_OK;
>> +
>> +if (__ioapic_set_affinity(data, mask, &dest))
>> +return -1;
>> +
>> +return IRQ_SET_MASK_OK;
>> +}
>> +
>> +static void pi_mask(struct irq_data *data)
>> +{
>> +;
>> +}
>> +
>> +static void pi_unmask(struct irq_data *data)
>> +{
>> +;
>> +

Re: [PATCH v2 6/6] x86, apicv: Add Posted Interrupt supporting

2012-11-26 Thread Gleb Natapov
On Mon, Nov 26, 2012 at 03:51:04AM +, Zhang, Yang Z wrote:
> Gleb Natapov wrote on 2012-11-25:
> > On Wed, Nov 21, 2012 at 04:09:39PM +0800, Yang Zhang wrote:
> >> Posted Interrupt allows vAPICV interrupts to inject into guest directly
> >> without any vmexit.
> >> 
> >> - When delivering a interrupt to guest, if target vcpu is running,
> >>   update Posted-interrupt requests bitmap and send a notification event
> >>   to the vcpu. Then the vcpu will handle this interrupt automatically,
> >>   without any software involvemnt.
> > Looks like you allocating one irq vector per vcpu per pcpu and then
> > migrate it or reallocate when vcpu move from one pcpu to another.
> > This is not scalable and migrating irq migration slows things down.
> > What's wrong with allocating one global vector for posted interrupt
> > during vmx initialization and use it for all vcpus?
> 
> Consider the following situation: 
> If vcpu A is running when notification event which belong to vcpu B is 
> arrived, since the vector match the vcpu A's notification vector, then this 
> event will be consumed by vcpu A(even it do nothing) and the interrupt cannot 
> be handled in time.
The exact same situation is possible with your code. vcpu B can be
migrated from a pcpu and vcpu A will take its place and will be assigned
the same vector as vcpu B. But I fail to see why this is a problem. vcpu
A will ignore the PI since its pir will be empty, and vcpu B should detect
the new event during the next vmentry.

> 
> >> - If target vcpu is not running or there already a notification event
> >>   pending in the vcpu, do nothing. The interrupt will be handled by old
> >>   way.
> >> Signed-off-by: Yang Zhang 
> >> ---
> >>  arch/x86/include/asm/kvm_host.h |3 +
> >>  arch/x86/include/asm/vmx.h  |4 +
> >>  arch/x86/kernel/apic/io_apic.c  |  138 
> >>  arch/x86/kvm/lapic.c|   31 ++-
> >>  arch/x86/kvm/lapic.h|8 ++
> >>  arch/x86/kvm/vmx.c  |  192 +--
> >>  arch/x86/kvm/x86.c  |2 +
> >>  include/linux/kvm_host.h|1 +
> >>  virt/kvm/kvm_main.c |2 +
> >>  9 files changed, 372 insertions(+), 9 deletions(-)
> >> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> >> index 8e07a86..1145894 100644
> >> --- a/arch/x86/include/asm/kvm_host.h
> >> +++ b/arch/x86/include/asm/kvm_host.h
> >> @@ -683,9 +683,12 @@ struct kvm_x86_ops {
> >>  void (*enable_irq_window)(struct kvm_vcpu *vcpu);
> >>  void (*update_cr8_intercept)(struct kvm_vcpu *vcpu, int tpr, int irr);
> >>  int (*has_virtual_interrupt_delivery)(struct kvm_vcpu *vcpu);
> >> +int (*has_posted_interrupt)(struct kvm_vcpu *vcpu);
> >>  void (*update_irq)(struct kvm_vcpu *vcpu);
> >>  void (*set_eoi_exitmap)(struct kvm_vcpu *vcpu, int vector,
> >>  int need_eoi, int global);
> >> +  int (*send_nv)(struct kvm_vcpu *vcpu, int vector);
> >> +  void (*pi_migrate)(struct kvm_vcpu *vcpu);
> >>int (*set_tss_addr)(struct kvm *kvm, unsigned int addr);
> >>int (*get_tdp_level)(void);
> >>u64 (*get_mt_mask)(struct kvm_vcpu *vcpu, gfn_t gfn, bool is_mmio);
> >> diff --git a/arch/x86/include/asm/vmx.h b/arch/x86/include/asm/vmx.h
> >> index 1003341..7b9e1d0 100644
> >> --- a/arch/x86/include/asm/vmx.h
> >> +++ b/arch/x86/include/asm/vmx.h
> >> @@ -152,6 +152,7 @@
> >>  #define PIN_BASED_EXT_INTR_MASK 0x0001
> >>  #define PIN_BASED_NMI_EXITING   0x0008
> >>  #define PIN_BASED_VIRTUAL_NMIS  0x0020
> >> +#define PIN_BASED_POSTED_INTR   0x0080
> >> 
> >>  #define VM_EXIT_SAVE_DEBUG_CONTROLS 0x0002
> >>  #define VM_EXIT_HOST_ADDR_SPACE_SIZE0x0200
> >> @@ -174,6 +175,7 @@
> >>  /* VMCS Encodings */
> >>  enum vmcs_field {
> >>  VIRTUAL_PROCESSOR_ID= 0x,
> >> +POSTED_INTR_NV  = 0x0002,
> >>  GUEST_ES_SELECTOR   = 0x0800,
> >>  GUEST_CS_SELECTOR   = 0x0802,
> >>  GUEST_SS_SELECTOR   = 0x0804,
> >> @@ -208,6 +210,8 @@ enum vmcs_field {
> >>  VIRTUAL_APIC_PAGE_ADDR_HIGH = 0x2013,
> >>  APIC_ACCESS_ADDR= 0x2014,
> >>APIC_ACCESS_ADDR_HIGH   = 0x2015,
> >> +  POSTED_INTR_DESC_ADDR   = 0x2016,
> >> +  POSTED_INTR_DESC_ADDR_HIGH  = 0x2017,
> >>EPT_POINTER = 0x201a,
> >>EPT_POINTER_HIGH= 0x201b,
> >>EOI_EXIT_BITMAP0= 0x201c,
> >> diff --git a/arch/x86/kernel/apic/io_apic.c 
> >> b/arch/x86/kernel/apic/io_apic.c
> >> index 1817fa9..97cb8ee 100644
> >> --- a/arch/x86/kernel/apic/io_apic.c
> >> +++ b/arch/x86/kernel/apic/io_apic.c
> >> @@ -3277,6 +3277,144 @@ int arch_setup_dmar_msi(unsigned int irq)
> >>  }
> >>  #endif
> >> +static int
> >> +pi_set_affinity(struct irq_data *data, co

RE: [PATCH v2 6/6] x86, apicv: Add Posted Interrupt supporting

2012-11-26 Thread Zhang, Yang Z
Gleb Natapov wrote on 2012-11-26:
> On Mon, Nov 26, 2012 at 03:51:04AM +, Zhang, Yang Z wrote:
>> Gleb Natapov wrote on 2012-11-25:
>>> On Wed, Nov 21, 2012 at 04:09:39PM +0800, Yang Zhang wrote:
 Posted Interrupt allows vAPICV interrupts to inject into guest directly
 without any vmexit.
 
 - When delivering a interrupt to guest, if target vcpu is running,
   update Posted-interrupt requests bitmap and send a notification event
   to the vcpu. Then the vcpu will handle this interrupt automatically,
   without any software involvemnt.
>>> Looks like you allocating one irq vector per vcpu per pcpu and then
>>> migrate it or reallocate when vcpu move from one pcpu to another.
>>> This is not scalable and migrating irq migration slows things down.
>>> What's wrong with allocating one global vector for posted interrupt
>>> during vmx initialization and use it for all vcpus?
>> 
>> Consider the following situation:
>> If vcpu A is running when notification event which belong to vcpu B is 
>> arrived,
> since the vector match the vcpu A's notification vector, then this event
> will be consumed by vcpu A(even it do nothing) and the interrupt cannot
> be handled in time. The exact same situation is possible with your code.
> vcpu B can be migrated from pcpu and vcpu A will take its place and will
> be assigned the same vector as vcpu B. But I fail to see why is this a
No, the on bit will be set to prevent notification events when vcpu B starts
migration. And it only frees the vector right before it starts to run on another pcpu.

> problem. vcpu A will ignore PI since pir will be empty and vcpu B should
> detect new event during next vmentry.
Yes, but the next vmentry may happen a long time later, and the interrupt
cannot be serviced until then. In the current way, it will cause a vmexit
and re-schedule vcpu B.
 
>> 
>>> 
 +  if (!cfg) {
 +  free_irq_at(irq, NULL);
 +  return 0;
 +  }
 +
 +  raw_spin_lock_irqsave(&vector_lock, flags);
 +  if (!__assign_irq_vector(irq, cfg, mask))
 +  ret = irq;
 +  raw_spin_unlock_irqrestore(&vector_lock, flags);
 +
 +  if (ret) {
 +  irq_set_chip_data(irq, cfg);
 +  irq_clear_status_flags(irq, IRQ_NOREQUEST);
 +  } else {
 +  free_irq_at(irq, cfg);
 +  }
 +  return ret;
 +}
>>> 
>>> This function is mostly cut&paste of create_irq_nr().
>> 
>> Yes, this function allow to allocate vector from specified cpu.
>> 
> Does not justify code duplication.
ok. will change it in next version.
 
 
if (kvm_x86_ops->has_virtual_interrupt_delivery(vcpu))
apic->vid_enabled = true;
 +
 +  if (kvm_x86_ops->has_posted_interrupt(vcpu))
 +  apic->pi_enabled = true;
 +
>>> This is global state, no need per apic variable.
Even though all vcpus use the same setting, according to the SDM, apicv really
is a per-apic variable.
Anyway, if you think we should not put it here, where is the best place?

 @@ -1555,6 +1611,11 @@ static void vmx_vcpu_load(struct kvm_vcpu
> *vcpu,
>>> int cpu)
struct desc_ptr *gdt = &__get_cpu_var(host_gdt);
unsigned long sysenter_esp;
 +  if (enable_apicv_pi && to_vmx(vcpu)->pi)
 +  pi_set_on(to_vmx(vcpu)->pi);
 +
>>> Why?
>> 
>> Here means the vcpu start migration. So we should prevent the
>> notification event until migration end.
>> 
> You check for IN_GUEST_MODE while sending notification. Why is this not
For interrupts from emulated devices, it is enough. But a VT-d device doesn't
know that the vcpu is migrating, so we set the on bit to suppress the
notification event while the target vcpu is migrating.

> enough? Also why vmx_vcpu_load() call means that vcpu start migration?
I think the following check can ensure the vcpu is in migration; am I wrong?
if (vmx->loaded_vmcs->cpu != cpu)
{
if (enable_apicv_pi && to_vmx(vcpu)->pi)
pi_set_on(to_vmx(vcpu)->pi);
}

 +  kvm_make_request(KVM_REQ_POSTED_INTR, vcpu);
 +
kvm_make_request(KVM_REQ_TLB_FLUSH, vcpu);  
 local_irq_disable();
list_add(&vmx->loaded_vmcs->loaded_vmcss_on_cpu_link, @@ -1582,6
  +1643,8 @@ static void vmx_vcpu_put(struct kvm_vcpu *vcpu)
vcpu->cpu = -1; kvm_cpu_vmxoff();   }
 +  if (enable_apicv_pi && to_vmx(vcpu)->pi)
 +  pi_set_on(to_vmx(vcpu)->pi);
>>> Why?
>> 
>> When the vcpu is scheduled out, there is no need to send a notification
>> event to it; just setting the PIR and waking it up is enough.
> Same as above. When the vcpu is scheduled out it will not be in IN_GUEST_MODE
Right.

> mode. Also, in this case we probably should set the bit directly in the IRR
> and leave the PIR alone.
From the hypervisor's view, the IRR and the PIR are the same. On each vmentry,
if PI is enabled, the effective IRR equals (IRR | PIR). So there is no
difference between setting the IRR or the PIR if the target vcpu is not running.
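The "IRR equals (IRR | PIR)" equivalence can be written out as a short sketch (function and array names are hypothetical; real KVM operates on the vAPIC page):

```c
#include <stdint.h>

#define APIC_IRR_WORDS 8   /* 256 vectors / 32 bits per word */

/* Fold posted requests into the IRR, as conceptually happens on each
 * vmentry when posted interrupts are enabled.  Afterwards it is
 * indistinguishable whether a vector was set in the IRR or the PIR. */
static void sync_pir_to_irr(uint32_t irr[APIC_IRR_WORDS],
                            uint32_t pir[APIC_IRR_WORDS])
{
    for (int i = 0; i < APIC_IRR_WORDS; i++) {
        irr[i] |= pir[i];   /* merge posted requests */
        pir[i] = 0;         /* and clear them */
    }
}
```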

> 
>> 
  }
>

Re: [PATCH v2 6/6] x86, apicv: Add Posted Interrupt supporting

2012-11-26 Thread Gleb Natapov
On Mon, Nov 26, 2012 at 12:29:54PM +, Zhang, Yang Z wrote:
> Gleb Natapov wrote on 2012-11-26:
> > On Mon, Nov 26, 2012 at 03:51:04AM +, Zhang, Yang Z wrote:
> >> Gleb Natapov wrote on 2012-11-25:
> >>> On Wed, Nov 21, 2012 at 04:09:39PM +0800, Yang Zhang wrote:
>  Posted Interrupt allows vAPICV interrupts to inject into guest directly
>  without any vmexit.
>  
>  - When delivering a interrupt to guest, if target vcpu is running,
>    update Posted-interrupt requests bitmap and send a notification event
>    to the vcpu. Then the vcpu will handle this interrupt automatically,
>    without any software involvemnt.
> >>> Looks like you allocating one irq vector per vcpu per pcpu and then
> >>> migrate it or reallocate when vcpu move from one pcpu to another.
> >>> This is not scalable and migrating irq migration slows things down.
> >>> What's wrong with allocating one global vector for posted interrupt
> >>> during vmx initialization and use it for all vcpus?
> >> 
> >> Consider the following situation:
> >> If vcpu A is running when notification event which belong to vcpu B is 
> >> arrived,
> > since the vector match the vcpu A's notification vector, then this event
> > will be consumed by vcpu A(even it do nothing) and the interrupt cannot
> > be handled in time. The exact same situation is possible with your code.
> > vcpu B can be migrated from pcpu and vcpu A will take its place and will
> > be assigned the same vector as vcpu B. But I fail to see why is this a
> No, the on bit will be set to prevent notification event when vcpu B start 
> migration. And it only free the vector before it going to run in another 
> pcpu. 
There is a race: the sender checks the on bit, vcpu B migrates to another pcpu
and starts running there, vcpu A takes vcpu B's vector, the sender sends the
PI, and vcpu A gets it.


> 
> > problem. vcpu A will ignore PI since pir will be empty and vcpu B should
> > detect new event during next vmentry.
> Yes, but the next vmentry may happen long time later and interrupt cannot be 
> serviced until next vmentry. In current way, it will cause vmexit and 
> re-schedule the vcpu B.
Vmentry will happen when the scheduler decides that the vcpu can run. There
is no problem here. What you probably want to say is that the vcpu may not be
aware of the interrupt happening, since it was migrated to a different pcpu
just after the PI IPI was sent and thus missed it. But then PIR interrupts
should be processed during vmentry on another pcpu:

Sender: Guest:

set pir
set on
if (vcpu in guest mode on pcpu1)
   vmexit on pcpu1
   vmentry on pcpu2
   process pir, deliver interrupt
send PI IPI to pcpu1
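The interleaving in the diagram can be played out in a tiny simulation (all names hypothetical): the sender posts to pir, sets on, and targets the pcpu it last observed; even when the vcpu has meanwhile re-entered on another pcpu, the pending pir bits are folded in at that vmentry, so the interrupt is not lost.

```c
#include <stdbool.h>
#include <stdint.h>

struct vcpu_sim {
    uint64_t pir;   /* posted requests (one word, for brevity) */
    bool on;        /* outstanding-notification bit */
    uint64_t irr;   /* vAPIC IRR */
    int pcpu;       /* pcpu the vcpu currently runs on */
};

/* Sender: set pir, set on, send the PI IPI to the pcpu it last saw the
 * vcpu on.  The return value is the (possibly stale) IPI target. */
static int sender_post(struct vcpu_sim *v, int vector, int observed_pcpu)
{
    v->pir |= 1ull << vector;
    v->on = true;
    return observed_pcpu;   /* may race with vcpu migration */
}

/* Guest: every vmentry drains pir, regardless of which pcpu the
 * notification IPI actually hit. */
static void vmentry(struct vcpu_sim *v, int pcpu)
{
    v->pcpu = pcpu;
    if (v->on) {
        v->on = false;
        v->irr |= v->pir;
        v->pir = 0;
    }
}
```

The stale IPI target is harmless precisely because delivery is completed by the vmentry-time drain, not by the IPI itself.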


>  
> >> 
> >>> 
>  +if (!cfg) {
>  +free_irq_at(irq, NULL);
>  +return 0;
>  +}
>  +
>  +raw_spin_lock_irqsave(&vector_lock, flags);
>  +if (!__assign_irq_vector(irq, cfg, mask))
>  +ret = irq;
>  +raw_spin_unlock_irqrestore(&vector_lock, flags);
>  +
>  +if (ret) {
>  +irq_set_chip_data(irq, cfg);
>  +irq_clear_status_flags(irq, IRQ_NOREQUEST);
>  +} else {
>  +free_irq_at(irq, cfg);
>  +}
>  +return ret;
>  +}
> >>> 
> >>> This function is mostly cut&paste of create_irq_nr().
> >> 
> >> Yes, this function allow to allocate vector from specified cpu.
> >> 
> > Does not justify code duplication.
> ok. will change it in next version.
>  
Please use a single global vector in the next version.

>  
>   if (kvm_x86_ops->has_virtual_interrupt_delivery(vcpu))
>   apic->vid_enabled = true;
>  +
>  +if (kvm_x86_ops->has_posted_interrupt(vcpu))
>  +apic->pi_enabled = true;
>  +
> >>> This is global state, no need per apic variable.
> Even all vcpus use the same setting, but according to SDM, apicv really is a 
> per apic variable.
It is not per vapic in our implementation and this is what is
important here.

> Anyway, if you think we should not put it here, where is the best place?
It is not needed, just use has_posted_interrupt(vcpu) instead.

> 
>  @@ -1555,6 +1611,11 @@ static void vmx_vcpu_load(struct kvm_vcpu
> > *vcpu,
> >>> int cpu)
>   struct desc_ptr *gdt = &__get_cpu_var(host_gdt);
>   unsigned long sysenter_esp;
>  +if (enable_apicv_pi && to_vmx(vcpu)->pi)
>  +pi_set_on(to_vmx(vcpu)->pi);
>  +
> >>> Why?
> >> 
> >> Here means the vcpu start migration. So we should prevent the
> >> notification event until migration end.
> >> 
> > You check for IN_GUEST_MODE while sending notification. Why is this not
> For interrupt from emulated device, it enough. But VT-d device doesn't kn

RE: [PATCH v2 6/6] x86, apicv: Add Posted Interrupt supporting

2012-11-26 Thread Zhang, Yang Z
Gleb Natapov wrote on 2012-11-26:
> On Mon, Nov 26, 2012 at 12:29:54PM +, Zhang, Yang Z wrote:
>> Gleb Natapov wrote on 2012-11-26:
>>> On Mon, Nov 26, 2012 at 03:51:04AM +, Zhang, Yang Z wrote:
 Gleb Natapov wrote on 2012-11-25:
> On Wed, Nov 21, 2012 at 04:09:39PM +0800, Yang Zhang wrote:
>> Posted Interrupt allows vAPICV interrupts to inject into guest directly
>> without any vmexit.
>> 
>> - When delivering a interrupt to guest, if target vcpu is running,
>>   update Posted-interrupt requests bitmap and send a notification
>>   event to the vcpu. Then the vcpu will handle this interrupt
>>   automatically, without any software involvemnt.
> Looks like you allocating one irq vector per vcpu per pcpu and then
> migrate it or reallocate when vcpu move from one pcpu to another.
> This is not scalable and migrating irq migration slows things down.
> What's wrong with allocating one global vector for posted interrupt
> during vmx initialization and use it for all vcpus?
 
 Consider the following situation:
 If vcpu A is running when notification event which belong to vcpu B is
> arrived,
>>> since the vector match the vcpu A's notification vector, then this event
>>> will be consumed by vcpu A(even it do nothing) and the interrupt cannot
>>> be handled in time. The exact same situation is possible with your code.
>>> vcpu B can be migrated from pcpu and vcpu A will take its place and will
>>> be assigned the same vector as vcpu B. But I fail to see why is this a
>> No, the on bit will be set to prevent notification event when vcpu B start
> migration. And it only free the vector before it going to run in another pcpu.
> There is a race. Sender check on bit, vcpu B migrate to another pcpu and
> starts running there, vcpu A takes vpuc's B vector, sender send PI vcpu
> A gets it.
Yes, it does exist. But I think it should be OK even if this happens.

>> 
>>> problem. vcpu A will ignore PI since pir will be empty and vcpu B should
>>> detect new event during next vmentry.
>> Yes, but the next vmentry may happen long time later and interrupt cannot be
> serviced until next vmentry. In current way, it will cause vmexit and 
> re-schedule
> the vcpu B.
> Vmentry will happen when scheduler will decide that vcpu can run. There
I don't know how the scheduler can know the vcpu can run in this case; can you
elaborate?
I thought it may have problems with a global vector in some cases (maybe I am
wrong, since I am not familiar with the KVM scheduler):
If the target VCPU is idle, and its notification event is consumed by another
VCPU, then how can the scheduler know the vcpu is ready to run? Even if there is
a way for the scheduler to know, then when? Isn't it too late?
If the notification event arrives in the hypervisor, then how does the handler
know which VCPU the notification event belongs to?

> is no problem here. What you probably want to say is that vcpu may not be
> aware of interrupt happening since it was migrated to different vcpu
> just after PI IPI was sent and thus missed it. But than PIR interrupts
> should be processed during vmentry on another pcpu:
> 
> Sender: Guest:
> 
> set pir
> set on
> if (vcpu in guest mode on pcpu1)
>vmexit on pcpu1
>vmentry on pcpu2
>process pir, deliver interrupt
>   send PI IPI to pcpu1

>> 
 
> 
>> +if (!cfg) {
>> +free_irq_at(irq, NULL);
>> +return 0;
>> +}
>> +
>> +raw_spin_lock_irqsave(&vector_lock, flags);
>> +if (!__assign_irq_vector(irq, cfg, mask))
>> +ret = irq;
>> +raw_spin_unlock_irqrestore(&vector_lock, flags);
>> +
>> +if (ret) {
>> +irq_set_chip_data(irq, cfg);
>> +irq_clear_status_flags(irq, IRQ_NOREQUEST);
>> +} else {
>> +free_irq_at(irq, cfg);
>> +}
>> +return ret;
>> +}
> 
> This function is mostly cut&paste of create_irq_nr().
 
 Yes, this function allow to allocate vector from specified cpu.
 
>>> Does not justify code duplication.
>> ok. will change it in next version.
>> 
> Please use single global vector in the next version.
> 
>> 
>>  if (kvm_x86_ops->has_virtual_interrupt_delivery(vcpu))
>>  apic->vid_enabled = true;
>> +
>> +if (kvm_x86_ops->has_posted_interrupt(vcpu))
>> +apic->pi_enabled = true;
>> +
> This is global state, no need per apic variable.
>> Even all vcpus use the same setting, but according to SDM, apicv really is a 
>> per
> apic variable.
> It is not per vapic in our implementation and this is what is
> important here.
> 
>> Anyway, if you think we should not put it here, where is the best place?
> It is not needed

Re: [PATCH v2 6/6] x86, apicv: Add Posted Interrupt supporting

2012-11-27 Thread Gleb Natapov
On Tue, Nov 27, 2012 at 03:38:05AM +, Zhang, Yang Z wrote:
> Gleb Natapov wrote on 2012-11-26:
> > On Mon, Nov 26, 2012 at 12:29:54PM +, Zhang, Yang Z wrote:
> >> Gleb Natapov wrote on 2012-11-26:
> >>> On Mon, Nov 26, 2012 at 03:51:04AM +, Zhang, Yang Z wrote:
>  Gleb Natapov wrote on 2012-11-25:
> > On Wed, Nov 21, 2012 at 04:09:39PM +0800, Yang Zhang wrote:
> >> Posted Interrupt allows vAPICV interrupts to inject into guest directly
> >> without any vmexit.
> >> 
> >> - When delivering a interrupt to guest, if target vcpu is running,
> >>   update Posted-interrupt requests bitmap and send a notification
> >>   event to the vcpu. Then the vcpu will handle this interrupt
> >>   automatically, without any software involvemnt.
> > Looks like you allocating one irq vector per vcpu per pcpu and then
> > migrate it or reallocate when vcpu move from one pcpu to another.
> > This is not scalable and migrating irq migration slows things down.
> > What's wrong with allocating one global vector for posted interrupt
> > during vmx initialization and use it for all vcpus?
>  
>  Consider the following situation:
>  If vcpu A is running when notification event which belong to vcpu B is
> > arrived,
> >>> since the vector match the vcpu A's notification vector, then this event
> >>> will be consumed by vcpu A(even it do nothing) and the interrupt cannot
> >>> be handled in time. The exact same situation is possible with your code.
> >>> vcpu B can be migrated from pcpu and vcpu A will take its place and will
> >>> be assigned the same vector as vcpu B. But I fail to see why is this a
> >> No, the on bit will be set to prevent notification event when vcpu B start
> > migration. And it only free the vector before it going to run in another 
> > pcpu.
> > There is a race. Sender check on bit, vcpu B migrate to another pcpu and
> > starts running there, vcpu A takes vpuc's B vector, sender send PI vcpu
> > A gets it.
> Yes, it do exist. But I think it should be ok even this happens.
> 
Then it is OK to use a global PI vector. The race should be dealt with
anyway.

> >> 
> >>> problem. vcpu A will ignore PI since pir will be empty and vcpu B should
> >>> detect new event during next vmentry.
> >> Yes, but the next vmentry may happen long time later and interrupt cannot 
> >> be
> > serviced until next vmentry. In current way, it will cause vmexit and 
> > re-schedule
> > the vcpu B.
> > Vmentry will happen when scheduler will decide that vcpu can run. There
> I don't know how scheduler can know the vcpu can run in this case, can you 
> elaborate it? 
> I thought it may have problems with global vector in some cases(maybe I am 
> wrong, since I am not familiar with KVM scheduler):
> If target VCPU is in idle, and this notification event is consumed by other 
> VCPU, then how can scheduler know the vcpu is ready to run? Even there is a 
> way for scheduler to know, then when? Isn't it too late?
> If notification event arrived in hypervisor, then how the handler know which 
> VCPU the notification event belong to?
When a vcpu is idle, its thread sleeps inside the host kernel (see
virt/kvm/kvm_main.c:kvm_vcpu_block()). To get it out of sleep
you should call kvm_vcpu_kick(), but only after changing the vcpu
state to make it runnable. arch/x86/kvm/x86.c:kvm_arch_vcpu_runnable()
checks whether a vcpu is runnable. Notice that we call kvm_cpu_has_interrupt()
there, which checks the apic IRR but not the PIR, so it is not enough to set
a bit in the PIR and call kvm_vcpu_kick() to wake up the vcpu.
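That observation can be condensed into a toy model (names hypothetical): a runnable check that consults only the IRR will not see a bare pir bit, so the sender must make the interrupt visible in the IRR before kicking a blocked vcpu.

```c
#include <stdbool.h>
#include <stdint.h>

struct vcpu_model {
    uint64_t pir;   /* posted-interrupt requests */
    uint64_t irr;   /* apic IRR, consulted by the runnable check */
};

/* Mirrors the shape of kvm_arch_vcpu_runnable()'s interrupt test:
 * only the IRR is consulted, never the PIR. */
static bool vcpu_runnable(const struct vcpu_model *v)
{
    return v->irr != 0;
}

/* A wakeup path that works for a blocked vcpu: sync pir into irr and
 * add the new vector before the (elided) kick, so the vcpu is seen as
 * runnable. */
static bool deliver_and_check(struct vcpu_model *v, int vector)
{
    v->irr |= v->pir | (1ull << vector);
    v->pir = 0;
    return vcpu_runnable(v);
}
```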

> 
> > is no problem here. What you probably want to say is that vcpu may not be
> > aware of interrupt happening since it was migrated to different vcpu
> > just after PI IPI was sent and thus missed it. But than PIR interrupts
> > should be processed during vmentry on another pcpu:
> > 
> > Sender: Guest:
> > 
> > set pir
> > set on
> > if (vcpu in guest mode on pcpu1)
> >vmexit on pcpu1
> >vmentry on pcpu2
> >process pir, deliver interrupt
> > send PI IPI to pcpu1
> 
> >> 
>  
> > 
> >> +  if (!cfg) {
> >> +  free_irq_at(irq, NULL);
> >> +  return 0;
> >> +  }
> >> +
> >> +  raw_spin_lock_irqsave(&vector_lock, flags);
> >> +  if (!__assign_irq_vector(irq, cfg, mask))
> >> +  ret = irq;
> >> +  raw_spin_unlock_irqrestore(&vector_lock, flags);
> >> +
> >> +  if (ret) {
> >> +  irq_set_chip_data(irq, cfg);
> >> +  irq_clear_status_flags(irq, IRQ_NOREQUEST);
> >> +  } else {
> >> +  free_irq_at(irq, cfg);
> >> +  }
> >> +  return ret;
> >> +}
> > 
> > This function is mostly cut&paste of create_irq_nr().
>  
>  Yes, this function allow to allocate vector from spe

RE: [PATCH v2 6/6] x86, apicv: Add Posted Interrupt supporting

2012-11-27 Thread Zhang, Yang Z
Gleb Natapov wrote on 2012-11-27:
> On Tue, Nov 27, 2012 at 03:38:05AM +, Zhang, Yang Z wrote:
>> Gleb Natapov wrote on 2012-11-26:
>>> On Mon, Nov 26, 2012 at 12:29:54PM +, Zhang, Yang Z wrote:
 Gleb Natapov wrote on 2012-11-26:
> On Mon, Nov 26, 2012 at 03:51:04AM +, Zhang, Yang Z wrote:
>> Gleb Natapov wrote on 2012-11-25:
>>> On Wed, Nov 21, 2012 at 04:09:39PM +0800, Yang Zhang wrote:
 Posted Interrupt allows APICv interrupts to be injected into the guest
 directly without any vmexit.
 
 - When delivering an interrupt to the guest, if the target vcpu is running,
   update the posted-interrupt requests bitmap and send a notification
   event to the vcpu. Then the vcpu will handle this interrupt
   automatically, without any software involvement.
>>> It looks like you are allocating one irq vector per vcpu per pcpu and then
>>> migrating or reallocating it when a vcpu moves from one pcpu to another.
>>> This is not scalable, and migrating the irq slows things down.
>>> What's wrong with allocating one global vector for posted interrupts
>>> during vmx initialization and using it for all vcpus?
>> 
>> Consider the following situation:
>> If vcpu A is running when a notification event which belongs to vcpu B
>>> arrives,
> since the vector matches vcpu A's notification vector, this event
> will be consumed by vcpu A (even though it does nothing) and the interrupt
> cannot be handled in time. The exact same situation is possible with your
> code: vcpu B can be migrated off the pcpu, and vcpu A will take its place
> and be assigned the same vector as vcpu B. But I fail to see why this is a
 No, the ON bit will be set to suppress the notification event when vcpu B
>>> starts migration. And it only frees the vector before vcpu B starts
>>> running on another pcpu.
>>> There is a race: the sender checks the ON bit, vcpu B migrates to another
>>> pcpu and starts running there, vcpu A takes vcpu B's vector, the sender
>>> sends the PI IPI, and vcpu A gets it.
>> Yes, it does exist. But I think it should be ok even if this happens.
>> 
> Then it is OK to use a global PI vector. The race should be dealt with
> anyway.
Or using a lock could deal with it too.
 
 
> problem. vcpu A will ignore the PI since its pir will be empty, and vcpu B
> should detect the new event during the next vmentry.
 Yes, but the next vmentry may happen a long time later, and the interrupt
> cannot be
>>> serviced until then. In the current approach, it will cause a vmexit and
>>> re-schedule vcpu B. Vmentry will happen when the scheduler decides
>>> that the vcpu can run. There
>> I don't know how the scheduler can know the vcpu can run in this case; can
>> you elaborate? I think there may be problems with a global vector in
>> some cases (maybe I am wrong, since I am not familiar with the KVM
>> scheduler): if the target VCPU is idle, and this notification event is
>> consumed by another VCPU,
> then how can the scheduler know the vcpu is ready to run? Even if there is a
> way for the scheduler to know, then when? Isn't it too late?
>> If a notification event arrives in the hypervisor, how does the handler know
> which VCPU the notification event belongs to?
> When vcpu is idle its thread sleeps inside host kernel (see
> virt/kvm/kvm_main.c:kvm_vcpu_block()). To get it out of sleep
> you should call kvm_vcpu_kick(), but only after changing vcpu
> state to make it runnable. arch/x86/kvm/x86.c:kvm_arch_vcpu_runnable()
> checks if vcpu is runnable. Notice that we call kvm_cpu_has_interrupt()
> there which checks apic IRR, but not PIR, so it is not enough to set
> bit in PIR and call kvm_vcpu_kick() to wake up vcpu.
Sorry, I cannot understand it. As you said, we need to call kvm_vcpu_kick()
when the awaited event happens in order to wake up the blocked vcpu, but this
event is consumed by another vcpu without any chance for us to kick it. Then
how will it move from the blocked list to the run queue?
BTW, what I am talking about is the interrupt-from-VT-d case. For virtual
interrupts, I think a global vector is ok.
Also, the second problem is about the VT-d case too:
when a cpu is running in VMX root mode and a notification event arrives,
since all VCPUs use the same notification vector, we cannot distinguish which
VCPU the notification event is meant for. And so we cannot put the right
VCPU on the run queue.
 
>> 
>>> is no problem here. What you probably want to say is that the vcpu may not
>>> be aware of the interrupt happening since it was migrated to a different
>>> pcpu just after the PI IPI was sent and thus missed it. But then the PIR
>>> interrupts should be processed during vmentry on another pcpu:
>>> 
>>> Sender:                          Guest:
>>> 
>>> set pir
>>> set on
>>> if (vcpu in guest mode on pcpu1)
>>>                                  vmexit on pcpu1
>>>                                  vmentry on pcpu2
>>>                                  process pir, deliver interrupt
>>> send PI IPI to pcpu1
>> 
 
>> 
>>> 
>>

RE: [PATCH v2 6/6] x86, apicv: Add Posted Interrupt supporting

2012-11-27 Thread Veruca Salt




Re: [PATCH v2 6/6] x86, apicv: Add Posted Interrupt supporting

2012-11-27 Thread Gleb Natapov
On Tue, Nov 27, 2012 at 11:10:20AM +, Zhang, Yang Z wrote:
> Gleb Natapov wrote on 2012-11-27:
> > On Tue, Nov 27, 2012 at 03:38:05AM +, Zhang, Yang Z wrote:
> >> Gleb Natapov wrote on 2012-11-26:
> >>> On Mon, Nov 26, 2012 at 12:29:54PM +, Zhang, Yang Z wrote:
>  Gleb Natapov wrote on 2012-11-26:
> > On Mon, Nov 26, 2012 at 03:51:04AM +, Zhang, Yang Z wrote:
> >> Gleb Natapov wrote on 2012-11-25:
> >>> On Wed, Nov 21, 2012 at 04:09:39PM +0800, Yang Zhang wrote:
>  Posted Interrupt allows APICv interrupts to be injected into the guest
>  directly without any vmexit.
>  
>  - When delivering an interrupt to the guest, if the target vcpu is running,
>    update the posted-interrupt requests bitmap and send a notification
>    event to the vcpu. Then the vcpu will handle this interrupt
>    automatically, without any software involvement.
> >>> It looks like you are allocating one irq vector per vcpu per pcpu and
> >>> then migrating or reallocating it when a vcpu moves from one pcpu to
> >>> another. This is not scalable, and migrating the irq slows things down.
> >>> What's wrong with allocating one global vector for posted interrupts
> >>> during vmx initialization and using it for all vcpus?
> >> 
> >> Consider the following situation:
> >> If vcpu A is running when a notification event which belongs to vcpu B
> >>> arrives,
> > since the vector matches vcpu A's notification vector, this event
> > will be consumed by vcpu A (even though it does nothing) and the interrupt
> > cannot be handled in time. The exact same situation is possible with your
> > code: vcpu B can be migrated off the pcpu, and vcpu A will take its place
> > and be assigned the same vector as vcpu B. But I fail to see why this is a
>  No, the ON bit will be set to suppress the notification event when vcpu B
>  starts
> >>> migration. And it only frees the vector before vcpu B starts running on
> >>> another pcpu.
> >>> There is a race: the sender checks the ON bit, vcpu B migrates to
> >>> another pcpu and starts running there, vcpu A takes vcpu B's vector,
> >>> the sender sends the PI IPI, and vcpu A gets it.
> >> Yes, it does exist. But I think it should be ok even if this happens.
> >> 
> > Then it is OK to use a global PI vector. The race should be dealt with
> > anyway.
> Or using a lock could deal with it too.
>  
The last thing we want is to hold a lock while injecting an interrupt. Also,
VT-d HW will not be able to take the lock.

>  
> > problem. vcpu A will ignore the PI since its pir will be empty, and vcpu B
> > should detect the new event during the next vmentry.
>  Yes, but the next vmentry may happen a long time later, and the interrupt
> > cannot be
> >>> serviced until then. In the current approach, it will cause a vmexit and
> >>> re-schedule vcpu B. Vmentry will happen when the scheduler decides
> >>> that the vcpu can run. There
> >> I don't know how the scheduler can know the vcpu can run in this case;
> >> can you elaborate? I think there may be problems with a global vector in
> >> some cases (maybe I am wrong, since I am not familiar with the KVM
> >> scheduler): if the target VCPU is idle, and this notification event is
> >> consumed by another VCPU,
> > then how can the scheduler know the vcpu is ready to run? Even if there is
> > a way for the scheduler to know, then when? Isn't it too late?
> >> If a notification event arrives in the hypervisor, how does the handler
> >> know
> > which VCPU the notification event belongs to?
> > When vcpu is idle its thread sleeps inside host kernel (see
> > virt/kvm/kvm_main.c:kvm_vcpu_block()). To get it out of sleep
> > you should call kvm_vcpu_kick(), but only after changing vcpu
> > state to make it runnable. arch/x86/kvm/x86.c:kvm_arch_vcpu_runnable()
> > checks if vcpu is runnable. Notice that we call kvm_cpu_has_interrupt()
> > there which checks apic IRR, but not PIR, so it is not enough to set
> > bit in PIR and call kvm_vcpu_kick() to wake up vcpu.
> Sorry, I cannot understand it. As you said, we need to call kvm_vcpu_kick()
> when the awaited event happens in order to wake up the blocked vcpu, but
> this event is consumed by another vcpu without any chance for us to kick it.
> Then how will it move from the blocked list to the run queue?
> BTW, what I am talking about is the interrupt-from-VT-d case. For virtual
> interrupts, I think a global vector is ok.
> Also, the second problem is about the VT-d case too:
> when a cpu is running in VMX root mode and a notification event arrives,
> since all VCPUs use the same notification vector, we cannot distinguish
> which VCPU the notification event is meant for. And so we cannot put the
> right VCPU on the run queue.
>  
There is no VT-d code in the proposed patches (is there a spec available on
how VT-d integrates with PI?), so the discussion is purely theoretical. A
VT-d device will have to be reprogrammed to generate a regular interrupt when
the vcpu thread goes to sleep. This interrupt will be injected by VFIO