Re: [RFC PATCH v3 13/20] x86: DMA support for memory encryption

2016-11-15 Thread Radim Krčmář
2016-11-15 11:02-0600, Tom Lendacky:
> On 11/15/2016 8:39 AM, Radim Krčmář wrote:
>> 2016-11-09 18:37-0600, Tom Lendacky:
>>> Since DMA addresses will effectively look like 48-bit addresses when the
>>> memory encryption mask is set, SWIOTLB is needed if the DMA mask of the
>>> device performing the DMA does not support 48 bits. SWIOTLB will be
>>> initialized to create un-encrypted bounce buffers for use by these devices.
>>>
>>> Signed-off-by: Tom Lendacky <thomas.lenda...@amd.com>
>>> ---
>>> diff --git a/arch/x86/kernel/pci-swiotlb.c b/arch/x86/kernel/pci-swiotlb.c
>>> @@ -64,13 +66,15 @@ static struct dma_map_ops swiotlb_dma_ops = {
>>>   * pci_swiotlb_detect_override - set swiotlb to 1 if necessary
>>>   *
>>>   * This returns non-zero if we are forced to use swiotlb (by the boot
>>> - * option).
>>> + * option). If memory encryption is enabled then swiotlb will be set
>>> + * to 1 so that bounce buffers are allocated and used for devices that
>>> + * do not support the addressing range required for the encryption mask.
>>>   */
>>>  int __init pci_swiotlb_detect_override(void)
>>>  {
>>> int use_swiotlb = swiotlb | swiotlb_force;
>>>  
>>> -   if (swiotlb_force)
>>> +   if (swiotlb_force || sme_me_mask)
>>> swiotlb = 1;
>>>  
>>> return use_swiotlb;
>> 
>> We want to return 1 even if only sme_me_mask is 1, because the return
>> value is used for detection.  The following would be less obscure, IMO:
>> 
>>  if (swiotlb_force || sme_me_mask)
>>  swiotlb = 1;
>> 
>>  return swiotlb;
> 
> If we do that then all DMA would go through the swiotlb bounce buffers.

No, that is decided, for example, in swiotlb_map_page(), and we need to
call pci_swiotlb_init() to register that function.

> By setting swiotlb to 1 we indicate that the bounce buffers will be
> needed for those devices that can't support the addressing range when
> the encryption bit is set (48 bit DMA). But if the device can support
> the addressing range we won't use the bounce buffers.

If we return 0 here, then pci_swiotlb_init() will not be called =>
dma_ops won't be set to swiotlb_dma_ops => we won't use bounce buffers.
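
(A minimal sketch of that per-device decision, assuming the usual
swiotlb_map_page()/dma_capable() pattern; the exact code in the tree
may differ:)

  /*
   * swiotlb only bounces when the device cannot reach the page
   * directly; swiotlb = 1 merely registers these dma_ops, it does
   * not force bouncing for capable devices.
   */
  dma_addr_t dev_addr = phys_to_dma(dev, page_to_phys(page)) + offset;

  if (dma_capable(dev, dev_addr, size) && !swiotlb_force)
          return dev_addr;        /* direct DMA, no bounce buffer */

  /* otherwise fall back to a bounce buffer via map_single() */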

>> We set up an encrypted swiotlb and then decrypt it, but sometimes set it up
>> decrypted (late_alloc) ... why isn't the swiotlb set up decrypted
>> directly?
> 
> When swiotlb is allocated in swiotlb_init(), it is too early to make
> use of the api to change the page attributes. Because of this,
> the callback to make those changes is needed.

Thanks. (I don't know page table setup well enough to see a lesser evil. :])

>>> @@ -541,7 +583,7 @@ static phys_addr_t
>>>  map_single(struct device *hwdev, phys_addr_t phys, size_t size,
>>>enum dma_data_direction dir)
>>>  {
>>> -   dma_addr_t start_dma_addr = phys_to_dma(hwdev, io_tlb_start);
>>> +   dma_addr_t start_dma_addr = swiotlb_phys_to_dma(hwdev, io_tlb_start);
>> 
>> We have decrypted io_tlb_start before, so shouldn't its physical address
>> be saved without the sme bit?  (Which changes a lot ...)
> 
> I'm not sure what you mean here, can you elaborate a bit more?

The C-bit (sme bit) is a part of the physical address.
If we know that a certain physical page should be accessed as
unencrypted (the bounce buffer) then the C-bit is 0.
I'm wondering why we save the physical address with the C-bit set when
we know that it can't be accessed that way (because we remove it every
time).
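
For illustration, a sketch of the two address flavors under discussion;
__phys_to_dma() is a stand-in name, not necessarily the helper the
patch series uses:

  /* Encrypted view: the C-bit (sme_me_mask) is part of the address. */
  dma_addr_t phys_to_dma(struct device *dev, phys_addr_t paddr)
  {
          return __phys_to_dma(dev, paddr) | sme_me_mask;
  }

  /* The bounce buffers were decrypted, so devices must see them with
   * the C-bit clear. */
  dma_addr_t swiotlb_phys_to_dma(struct device *dev, phys_addr_t paddr)
  {
          return __phys_to_dma(dev, paddr) & ~sme_me_mask;
  }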

The naming is a bit confusing, because physical addresses are actually
virtualized by SME -- maybe we should be calling them SME addresses?

Re: [RFC PATCH v3 13/20] x86: DMA support for memory encryption

2016-11-15 Thread Radim Krčmář
2016-11-09 18:37-0600, Tom Lendacky:
> Since DMA addresses will effectively look like 48-bit addresses when the
> memory encryption mask is set, SWIOTLB is needed if the DMA mask of the
> device performing the DMA does not support 48 bits. SWIOTLB will be
> initialized to create un-encrypted bounce buffers for use by these devices.
> 
> Signed-off-by: Tom Lendacky 
> ---
> diff --git a/arch/x86/kernel/pci-nommu.c b/arch/x86/kernel/pci-nommu.c
> @@ -30,7 +30,7 @@ static dma_addr_t nommu_map_page(struct device *dev, struct page *page,
>enum dma_data_direction dir,
>unsigned long attrs)
>  {
> - dma_addr_t bus = page_to_phys(page) + offset;
> + dma_addr_t bus = phys_to_dma(dev, page_to_phys(page)) + offset;
>   WARN_ON(size == 0);
>   if (!check_addr("map_single", dev, bus, size))
>   return DMA_ERROR_CODE;
> diff --git a/arch/x86/kernel/pci-swiotlb.c b/arch/x86/kernel/pci-swiotlb.c
> @@ -12,6 +12,8 @@
>  int swiotlb __read_mostly;
>  
>  void *x86_swiotlb_alloc_coherent(struct device *hwdev, size_t size,
> @@ -64,13 +66,15 @@ static struct dma_map_ops swiotlb_dma_ops = {
>   * pci_swiotlb_detect_override - set swiotlb to 1 if necessary
>   *
>   * This returns non-zero if we are forced to use swiotlb (by the boot
> - * option).
> + * option). If memory encryption is enabled then swiotlb will be set
> + * to 1 so that bounce buffers are allocated and used for devices that
> + * do not support the addressing range required for the encryption mask.
>   */
>  int __init pci_swiotlb_detect_override(void)
>  {
>   int use_swiotlb = swiotlb | swiotlb_force;
>  
> - if (swiotlb_force)
> + if (swiotlb_force || sme_me_mask)
>   swiotlb = 1;
>  
>   return use_swiotlb;

We want to return 1 even if only sme_me_mask is 1, because the return
value is used for detection.  The following would be less obscure, IMO:

if (swiotlb_force || sme_me_mask)
swiotlb = 1;

return swiotlb;

> diff --git a/init/main.c b/init/main.c
> @@ -598,6 +602,15 @@ asmlinkage __visible void __init start_kernel(void)
>*/
>   locking_selftest();
>  
> + /*
> +  * This needs to be called before any devices perform DMA
> +  * operations that might use the swiotlb bounce buffers.
> +  * This call will mark the bounce buffers as un-encrypted so
> +  * that their usage will not cause "plain-text" data to be
> +  * decrypted when accessed.
> +  */
> + mem_encrypt_init();

(Comments below are connected to the reason why we call this.)

> diff --git a/lib/swiotlb.c b/lib/swiotlb.c
> @@ -159,6 +171,31 @@ void swiotlb_print_info(void)
> +/*
> + * If memory encryption is active, the DMA address for an encrypted page may
> + * be beyond the range of the device. If bounce buffers are required, be sure
> + * that they are not on an encrypted page. This should be called before the
> + * iotlb area is used.
> + */
> +void __init swiotlb_clear_encryption(void)
> +{
> + void *vaddr;
> + unsigned long bytes;
> +
> + if (no_iotlb_memory || !io_tlb_start || late_alloc)

io_tlb_start seems redundant -- when can !no_iotlb_memory &&
!io_tlb_start happen?

Is the order of calls
  1) swiotlb init
  2) SME init
  3) swiotlb late init 
?

We set up an encrypted swiotlb and then decrypt it, but sometimes set it up
decrypted (late_alloc) ... why isn't the swiotlb set up decrypted
directly?

> + return;
> +
> + vaddr = phys_to_virt(io_tlb_start);
> + bytes = PAGE_ALIGN(io_tlb_nslabs << IO_TLB_SHIFT);
> + swiotlb_set_mem_unenc(vaddr, bytes);
> + memset(vaddr, 0, bytes);
> +
> + vaddr = phys_to_virt(io_tlb_overflow_buffer);
> + bytes = PAGE_ALIGN(io_tlb_overflow);
> + swiotlb_set_mem_unenc(vaddr, bytes);
> + memset(vaddr, 0, bytes);
> +}
> +
> @@ -541,7 +583,7 @@ static phys_addr_t
>  map_single(struct device *hwdev, phys_addr_t phys, size_t size,
>  enum dma_data_direction dir)
>  {
> - dma_addr_t start_dma_addr = phys_to_dma(hwdev, io_tlb_start);
> + dma_addr_t start_dma_addr = swiotlb_phys_to_dma(hwdev, io_tlb_start);

We have decrypted io_tlb_start before, so shouldn't its physical address
be saved without the sme bit?  (Which changes a lot ...)

Thanks.


Re: [PATCH v8 03/13] KVM: Define a new interface kvm_intr_is_single_vcpu()

2015-09-18 Thread Radim Krčmář
2015-09-17 23:18+, Wu, Feng:
>> From: Paolo Bonzini [mailto:pbonz...@redhat.com]
>> On 17/09/2015 17:58, Radim Krčmář wrote:
>>> xAPIC addresses are only 8 bits long so they always get delivered to x2APIC
>>> cluster 0, where the first 16 bits work like xAPIC flat logical mode.
>> 
>> Ok, I was wondering whether this was the correct interpretation.  Thanks!
> 
> Paolo, I don't think Radim clarified your concern, right? Since mda is
> 8 bits, mda >> 16 is wrong -- this is your concern, right?

In case it was:  mda is u32 so the bitshift is defined by C.
(xAPIC destinations in KVM's x2APIC mode are stored in lowest 8 bits of
 mda, hence the cluster is always 0.)
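
Concretely (a worked example, values mine):

  u32 mda = 0x0f;              /* 8-bit xAPIC logical destination */
  u32 cluster = mda >> 16;     /* == 0: well defined, always cluster 0 */
  u32 bitmask = mda & 0xffff;  /* == 0x0f: the within-cluster bitmask */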

Or am I still missing the point?

Re: [PATCH v8 03/13] KVM: Define a new interface kvm_intr_is_single_vcpu()

2015-09-17 Thread Radim Krčmář
2015-09-17 16:24+0200, Paolo Bonzini:
> I think kvm_apic_match_logical_addr for MSI and IOAPIC interrupts is
> buggy in x2apic mode.
> 
> It does:
> 
> if (apic_x2apic_mode(apic))
> return ((logical_id >> 16) == (mda >> 16))
>    && (logical_id & mda & 0xffff) != 0;
> 
> But mda is only 8-bits for MSI and IOAPIC interrupts.
> 
> Radim, should kvm_apic_mda also handle the !ipi && x2apic_mda && dest_id
> != APIC_BROADCAST case?  It never triggers with Linux because it uses
> only the physical mode (that's not super-easy to see;
> ioapic_set_affinity looks for the RTEs in irq_data->chip_data and that
> is allocated with kzalloc).

KVM handles that case, it's just convoluted.
(I wish we scrapped the IR-less x2APIC mode.)

For interrupts from MSI and IOxAPIC:
- Flat logical interrupts are delivered as if we had natural
  (CPU0<->bit0, CPU1<->bit1, ...) flat logical xAPIC for first 8 VCPUs.
- Cluster logical mostly doesn't work; it's interpreted like flat logical.
  I didn't care about xAPIC cluster because Linux, the sole user of our
  paravirtualized x2APIC, doesn't configure it.

I'll paste kvm_apic_mda() source for better explanation:

  static u32 kvm_apic_mda(unsigned int dest_id, struct kvm_lapic *source,
struct kvm_lapic *target)
  {
bool ipi = source != NULL;
bool x2apic_mda = apic_x2apic_mode(ipi ? source : target);
  
if (!ipi && dest_id == APIC_BROADCAST && x2apic_mda)
return X2APIC_BROADCAST;
  
return x2apic_mda ? dest_id : SET_APIC_DEST_FIELD(dest_id);
  }

MSI/IOxAPIC interrupt means that source is NULL and if the target is in
x2APIC mode, the original 'dest_id' is returned as mda => a flat logical
xAPIC to 0x0f will get interpreted as (cluster) logical x2APIC 0xf in
kvm_apic_match_logical_addr().
xAPIC addresses are only 8 bits long so they always get delivered to x2APIC
cluster 0, where the first 16 bits work like xAPIC flat logical mode.
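
A worked example of that path, using the functions quoted above (the
destination value is mine, for illustration):

  u32 dest_id = 0x0f;               /* flat logical xAPIC bitmask     */
  u32 mda = dest_id;                /* kvm_apic_mda(): returned as-is
                                       (source == NULL, x2APIC target,
                                       dest_id != APIC_BROADCAST)     */
  bool cluster0 = (mda >> 16) == 0; /* true: x2APIC cluster 0         */
  u32 matched = mda & 0xffff;       /* 0x0f: vCPUs 0-3 of cluster 0,
                                       i.e. flat logical behavior     */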


Re: [v3 13/26] KVM: Define a new interface kvm_find_dest_vcpu() for VT-d PI

2015-01-09 Thread Radim Krčmář
2015-01-09 16:18+0100, Paolo Bonzini:
> On 09/01/2015 16:12, Radim Krčmář wrote:
>>> The chipset doesn't support it. :(
>> 
>> I meant that we need to recompute PI entries for lowest priority
>> interrupts every time guest's TPR changes.
>> 
>> Luckily, Linux doesn't use TPR, but other OS might be a reason to drop
>> lowest priority from PI optimizations.  (Or make it more complicated.)
> 
> Doing vector hashing is a possibility as well.  I would like to know
> what existing chipsets do in practice, then we can mimic it.

When looking at /proc/interrupts from time to time, I have only seen
interrupts landing on the first CPU of the set.

We could also distinguish between AMD and Intel ...
AMD should deliver to the highest APIC ID.
(If we still need to decide after focus processor and APR checks.)

I'll try to check using a more trustworthy approach.

Re: [v3 13/26] KVM: Define a new interface kvm_find_dest_vcpu() for VT-d PI

2015-01-09 Thread Radim Krčmář
2014-12-12 23:14+0800, Feng Wu:
> This patch defines a new interface, kvm_find_dest_vcpu(), for
> VT-d PI, which returns the destination vCPU of an
> interrupt for guests.
> 
> Since VT-d PI cannot handle broadcast/multicast interrupts,
> here we only handle Fixed and Lowest priority interrupts.
> 
> The current method of handling guest lowest priority interrupts
> is to use a counter 'apic_arb_prio' for each vCPU; we choose the
> vCPU with the smallest 'apic_arb_prio' and then increase it by 1.
> However, for VT-d PI, we cannot re-use this, since we no longer
> have control over 'apic_arb_prio' with posted interrupt direct
> delivery by hardware.
> 
> Here, we introduce a way similar to 'apic_arb_prio' to handle
> guest lowest priority interrupts when VT-d PI is used. Here is the
> idea:
> - Each vCPU has a counter 'round_robin_counter'.
> - When a guest sets an interrupt to lowest priority, we choose
>   the vCPU with the smallest 'round_robin_counter' as the destination,
>   then increase it.

There are two points relevant to this patch in KVM's new implementation
(KVM: x86: amend APIC lowest priority arbitration,
 https://lkml.org/lkml/2015/1/9/362):

1) lowest priority depends on TPR
2) there is no need for balancing

(1) has to be considered with PI as well.
I kept (2) to avoid whining from people building on that behaviour, but
lowest priority backed by PI could be transparent without it.

The patch below removes the balancing, but I am not sure this is a price
we can allow ourselves to pay ... what are your opinions?


---8---
KVM: x86: don't balance lowest priority interrupts

Balancing is not mandated by specification and real hardware most likely
doesn't do it.  We break backward compatibility to allow optimizations.
(Posted interrupts can deliver to only one fixed destination.)

Signed-off-by: Radim Krčmář <rkrc...@redhat.com>
---
 arch/x86/include/asm/kvm_host.h | 1 -
 arch/x86/kvm/lapic.c| 8 ++--
 2 files changed, 2 insertions(+), 7 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 97a5dd0222c8..aa4bd8286232 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -370,7 +370,6 @@ struct kvm_vcpu_arch {
 	u64 apic_base;
 	struct kvm_lapic *apic;	/* kernel irqchip context */
 	unsigned long apic_attention;
-	int32_t apic_arb_prio;
 	int mp_state;
 	u64 ia32_misc_enable_msr;
 	bool tpr_access_reporting;
diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
index 5b9d8c589bba..eb85af8e8fc0 100644
--- a/arch/x86/kvm/lapic.c
+++ b/arch/x86/kvm/lapic.c
@@ -749,7 +749,6 @@ static int __apic_accept_irq(struct kvm_lapic *apic, int delivery_mode,
 			      trig_mode, vector);
 	switch (delivery_mode) {
 	case APIC_DM_LOWEST:
-		vcpu->arch.apic_arb_prio++;
 	case APIC_DM_FIXED:
 		/* FIXME add logic for vcpu on reset */
 		if (unlikely(!apic_enabled(apic)))
@@ -837,11 +836,9 @@ int kvm_apic_compare_prio(struct kvm_vcpu *vcpu1, struct kvm_vcpu *vcpu2)
 	 *  - uses the APR register (which also considers ISR and IRR),
 	 *  - chooses the highest APIC ID when APRs are identical,
 	 *  - and allows a focus processor.
-	 * XXX: pseudo-balancing with apic_arb_prio is a KVM-specific feature
 	 */
-	int tpr = kvm_apic_get_reg(vcpu1->arch.apic, APIC_TASKPRI) -
-		  kvm_apic_get_reg(vcpu2->arch.apic, APIC_TASKPRI);
-	return tpr ? : vcpu1->arch.apic_arb_prio - vcpu2->arch.apic_arb_prio;
+	return kvm_apic_get_reg(vcpu1->arch.apic, APIC_TASKPRI) -
+	       kvm_apic_get_reg(vcpu2->arch.apic, APIC_TASKPRI);
 }
 
 static void kvm_ioapic_send_eoi(struct kvm_lapic *apic, int vector)
@@ -1595,7 +1592,6 @@ void kvm_lapic_reset(struct kvm_vcpu *vcpu)
 	vcpu->arch.pv_eoi.msr_val = 0;
 	apic_update_ppr(apic);
 
-	vcpu->arch.apic_arb_prio = 0;
 	vcpu->arch.apic_attention = 0;
 
 	apic_debug("%s: vcpu=%p, id=%d, base_msr="
-- 
2.2.0
