Re: [PATCH v2] x86/kexec: Add EFI config table identity mapping for kexec kernel

2023-07-07 Thread Joerg Roedel
On Fri, Jul 07, 2023 at 12:23:59PM +0800, Baoquan He wrote:
> I am wondering why we don't detect the cpu type and return early inside
> sev_enable() if it's Intel cpu.
> 
> We can't rely on CONFIG_AMD_MEM_ENCRYPT to decide if the code need be
> executed or not because we usually enable them all in distros.

Looking at the code in head_64.S, by the time sev_enable() runs the SEV
bit should already be set in sev_status. Maybe use that to detect
whether SEV is enabled and bail out early?

Regards,

-- 
Jörg Rödel
jroe...@suse.de

SUSE Software Solutions Germany GmbH
Frankenstraße 146
90461 Nürnberg
Germany

(HRB 36809, AG Nürnberg)
Geschäftsführer: Ivo Totev, Andrew Myers, Andrew McDonald, Boudien Moerman


___
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec


Re: [PATCH v2 08/12] x86/sev: Park APs on AP Jump Table with GHCB protocol version 2

2022-01-27 Thread Joerg Roedel
On Fri, Nov 12, 2021 at 05:33:05PM +0100, Borislav Petkov wrote:
> On Mon, Sep 13, 2021 at 05:55:59PM +0200, Joerg Roedel wrote:
> > +"ljmpl *%0" : :
> > +"m" (real_mode_header->sev_real_ap_park_asm),
> > +"b" (sev_es_jump_table_pa >> 4));
> 
> In any case, this asm needs comments: why those regs, why
> sev_es_jump_table_pa >> 4 in rbx (I found later in the patch why) and so
> on.

Turned out the jump_table_pa is not used in asm code anymore. It was a
left-over from a previous version of the patch, it is removed now.

> > +SYM_INNER_LABEL(sev_ap_park_paging_off, SYM_L_GLOBAL)
> 
> Global symbol but used only in this file. .L-prefix then?

It needs to be a global symbol so the pa_ variant can be generated.

Regards,

-- 
Jörg Rödel
jroe...@suse.de

SUSE Software Solutions Germany GmbH
Maxfeldstr. 5
90409 Nürnberg
Germany
 
(HRB 36809, AG Nürnberg)
Geschäftsführer: Ivo Totev


___
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec


Re: [PATCH v2 07/12] x86/sev: Setup code to park APs in the AP Jump Table

2022-01-26 Thread Joerg Roedel
On Wed, Nov 10, 2021 at 05:37:32PM +0100, Borislav Petkov wrote:
> On Mon, Sep 13, 2021 at 05:55:58PM +0200, Joerg Roedel wrote:
> >  extern unsigned char real_mode_blob[];
> > diff --git a/arch/x86/include/asm/sev-ap-jumptable.h 
> > b/arch/x86/include/asm/sev-ap-jumptable.h
> > new file mode 100644
> > index ..1c8b2ce779e2
> > --- /dev/null
> > +++ b/arch/x86/include/asm/sev-ap-jumptable.h
> 
> Why a separate header? arch/x86/include/asm/sev.h looks small enough.

The header is included in assembly, so I made a separate one.

> > +void __init sev_es_setup_ap_jump_table_data(void *base, u32 pa)
> 
> Why is this a separate function?
> 
> It is all part of the jump table setup.

Right, but the sev_es_setup_ap_jump_table_blob() function is already
pretty big and I wanted to keep things readable.

> 
> > +   return 0;
> 
> Why are you returning 0 here and below?

This is in an initcall and it returns just 0 when the environment is not
ready to setup the ap jump table. Returning non-zero would create a
warning message in the caller for something that is not a bug in the
kernel.

> > + * This file contains the source code for the binary blob which gets 
> > copied to
> > + * the SEV-ES AP Jumptable to park APs while offlining CPUs or booting a 
> > new
> 
> I've seen "Jumptable", "Jump Table" and "jump table" at least. I'd say, do
> the last one everywhere pls.

Fair, sorry for my english being too german :) I changed everything to
'jump table'.

> > +   /* Reset remaining registers */
> > +   movl$0, %esp
> > +   movl$0, %eax
> > +   movl$0, %ebx
> > +   movl$0, %edx
> 
> All 4: use xor

XOR changes EFLAGS, can not use them here.

> > +
> > +   /* Reset CR0 to get out of protected mode */
> > +   movl$0x6010, %ecx
> 
> Another magic naked number.

This is the CR0 reset value. I have updated the comment to make this
more clear.

Thanks,

-- 
Jörg Rödel
jroe...@suse.de

SUSE Software Solutions Germany GmbH
Maxfeldstr. 5
90409 Nürnberg
Germany
 
(HRB 36809, AG Nürnberg)
Geschäftsführer: Ivo Totev


___
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec


Re: [PATCH v2 03/12] x86/sev: Save and print negotiated GHCB protocol version

2022-01-26 Thread Joerg Roedel
On Wed, Nov 03, 2021 at 03:27:23PM +0100, Borislav Petkov wrote:
> On Mon, Sep 13, 2021 at 05:55:54PM +0200, Joerg Roedel wrote:
> > From: Joerg Roedel 
> > 
> > Save the results of the GHCB protocol negotiation into a data structure
> > and print information about versions supported and used to the kernel
> > log.
> 
> Which is useful for?

For easier debugging, I added a sentence about that to the changelog.

> > +struct sev_ghcb_protocol_info {
> 
> Too long a name - ghcb_info is perfectly fine.

Changed, thanks.

-- 
Jörg Rödel
jroe...@suse.de

SUSE Software Solutions Germany GmbH
Maxfeldstr. 5
90409 Nürnberg
Germany
 
(HRB 36809, AG Nürnberg)
Geschäftsführer: Ivo Totev


___
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec


Re: [PATCH v2 01/12] kexec: Allow architecture code to opt-out at runtime

2021-11-02 Thread Joerg Roedel
Hi again,

On Mon, Nov 01, 2021 at 04:11:42PM -0500, Eric W. Biederman wrote:
> I seem to remember the consensus when this was reviewed that it was
> unnecessary and there is already support for doing something like
> this at a more fine grained level so we don't need a new kexec hook.

Forgot to state to problem again which these patches solve:

Currently a Linux kernel running as an SEV-ES guest has no way to
successfully kexec into a new kernel. The normal SIPI sequence to reset
the non-boot VCPUs does not work in SEV-ES guests and special code is
needed in Linux to safely hand over the VCPUs from one kernel to the
next. What happens currently is that the kexec'ed kernel will just hang.

The code which implements the VCPU hand-over is also included in this
patch-set, but it requires a certain level of Hypervisor support which
is not available everywhere.

To make it clear to the user that kexec will not work in their
environment, it is best to disable the respected syscalls. This is what
the hook is needed for.

Regards,

-- 
Jörg Rödel
jroe...@suse.de

SUSE Software Solutions Germany GmbH
Maxfeldstr. 5
90409 Nürnberg
Germany
 
(HRB 36809, AG Nürnberg)
Geschäftsführer: Ivo Totev


___
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec


Re: [PATCH v2 01/12] kexec: Allow architecture code to opt-out at runtime

2021-11-02 Thread Joerg Roedel
On Mon, Nov 01, 2021 at 04:11:42PM -0500, Eric W. Biederman wrote:
> I seem to remember the consensus when this was reviewed that it was
> unnecessary and there is already support for doing something like
> this at a more fine grained level so we don't need a new kexec hook.

It was a discussion, no consenus :)

I still think it is better to solve this in generic code for everybody
to re-use than with an hack in the architecture hooks.

More and more platforms which enable confidential computing features
may need this hook in the future.

Regards,

-- 
Jörg Rödel
jroe...@suse.de

SUSE Software Solutions Germany GmbH
Maxfeldstr. 5
90409 Nürnberg
Germany
 
(HRB 36809, AG Nürnberg)
Geschäftsführer: Ivo Totev

___
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec


Re: [PATCH v2 00/12] x86/sev: KEXEC/KDUMP support for SEV-ES guests

2021-09-13 Thread Joerg Roedel
On Mon, Sep 13, 2021 at 09:02:38AM -0700, Dave Hansen wrote:
> On 9/13/21 8:55 AM, Joerg Roedel wrote:
> > This does not work under SEV-ES, because the hypervisor has no access
> > to the vCPU registers and can't make modifications to them. So an
> > SEV-ES guest needs to reset the vCPU itself and park it using the
> > AP-reset-hold protocol. Upon wakeup the guest needs to jump to
> > real-mode and to the reset-vector configured in the AP-Jump-Table.
> 
> How does this end up looking to an end user that tries to kexec() from a
> an SEV-ES kernel?  Does it just hang?

Yes, the kexec will just hang. This patch-set contains code to disable
the kexec syscalls in situations where it would not work for that
reason.

Actually with the changes to the decompressor in this patch-set the
kexec'ed kernel could boot, but would fail to bring up all the APs.

Regards,

Joerg

___
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec


Re: [PATCH 04/12] x86/sev: Do not hardcode GHCB protocol version

2021-07-31 Thread Joerg Roedel
Hi Tom,

On Wed, Jul 21, 2021 at 04:17:38PM -0500, Tom Lendacky wrote:
> On 7/21/21 9:20 AM, Joerg Roedel wrote:
> > /* Fill in protocol and format specifiers */
> > -   ghcb->protocol_version = GHCB_PROTOCOL_MAX;
> > +   ghcb->protocol_version = sev_get_ghcb_proto_ver();
> 
> So this probably needs better clarification in the spec, but the GHCB
> version field is for the GHCB structure layout. So if you don't plan to
> use the XSS field that was added for version 2 of the layout, then you
> should report the GHCB structure version as 1.

Yeah, this makes sense, exept for the struct field-name ;) But anyway, I
keep the version field at 1 for now.

Regards,

Joerg

___
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec


Re: [PATCH 11/12] x86/sev: Handle CLFLUSH MMIO events

2021-07-31 Thread Joerg Roedel
Hi Sean,

On Fri, Jul 30, 2021 at 10:42:30PM +, Sean Christopherson wrote:
> On Wed, Jul 21, 2021, Joerg Roedel wrote:
> This wording can be misread as "the hypervisor is responsible for _all_ cache
> management".  Maybe just:
> 
>   /*
>* Ignore CLFLUSHes - the hyperivsor is responsible for cache
>* management of emulated MMIO.
>*/

Right, will update the comment, thanks.

> Side topic, out of curisoity, what's mapping/accessing emulated MMIO as 
> non-UC?

The CLFLUSHes happen when the kexec'ed kernel maps the VGA framebuffer
as unencrypted. Initially it is mapped encrypted and before re-mapping
the kernel flushes the range from the caches.

I have not investigated why this doesn't happen on the first boot,
though.

Regards,

Joerg

___
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec


Re: [PATCH 2/2] x86/kexec/64: Forbid kexec when running as an SEV-ES guest

2021-05-06 Thread Joerg Roedel
On Thu, May 06, 2021 at 12:42:03PM -0500, Eric W. Biederman wrote:
> I don't understand this.
> 
> Fundamentally kexec is about doing things more or less inspite of
> what the firmware is doing.
> 
> I don't have any idea what a SEV-ES is.  But the normal x86 boot doesn't
> do anything special.  Is cross cpu IPI emulation buggy?

Under SEV-ES the normal SIPI-based sequence to re-initialize a CPU does
not work anymore. An SEV-ES guest is a special virtual machine where the
hardware encrypts the guest memory and the guest register state. The
hypervisor can't make any modifications to the guests registers at
runtime. Therefore it also can't emulate a SIPI sequence and reset the
vCPU.

The guest kernel has to reset the vCPU itself and hand it over from the
old kernel to the kexec'ed kernel. This isn't currently implemented and
therefore kexec needs to be disabled when running as an SEV-ES guest.

Implementing this also requires an extension to the guest-hypervisor
protocol (the GHCB Spec[1]) which is only available in version 2. So a
guest running on a hypervisor supporting only version 1 will never
properly support kexec.

> What is the actual problem you are trying to avoid?

Currently, if someone tries kexec in an SEV-ES guest, the kexec'ed
kernel will only be able to bring up the boot CPU, not the others. The
others will wake up with the old kernels CPU state in the new kernels
memory and do undefined things, most likely triple-fault because their
page-table is not existent anymore.

So since kexec currently does not work as expected under SEV-ES, it is
better to hide it until everything is implemented so it can do what the
user expects.

> And yes for a temporary hack the suggestion of putting code into
> machine_kexec_prepare seems much more reasonable so we don't have to
> carry special case infrastructure for the forseeable future.

As I said above, for protocol version 1 it will stay disabled, so it is
not only a temporary hack.

Regards,

Joerg

[1] https://developer.amd.com/wp-content/resources/56421.pdf


___
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec


Re: [PATCH 1/2] kexec: Allow architecture code to opt-out at runtime

2021-05-06 Thread Joerg Roedel
On Thu, May 06, 2021 at 03:43:23PM +, Sean Christopherson wrote:
> This misses kexec_file_load.

Right, thanks, I will fix that in the next version.

> Also, is a new hook really needed?  E.g. the SEV-ES check be shoved
> into machine_kexec_prepare().  The downside is that we'd do a fair
> amount of work before detecting failure, but that doesn't seem hugely
> problematic.

That could work, but I think its more user-friendly to just claim that
the syscalls are not supported at all.

Regards,

Joerg

___
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec


Re: [PATCHv7] x86/kdump: bugfix, make the behavior of crashkernel=X consistent with kaslr

2019-02-25 Thread Joerg Roedel
On Fri, Feb 22, 2019 at 02:00:26PM +0100, Borislav Petkov wrote:
> On Fri, Feb 22, 2019 at 09:42:41AM +0100, Joerg Roedel wrote:
> > The current default of 256MB was found by experiments on a bigger
> > number of machines, to create a reasonable default that is at least
> > likely to be sufficient of an average machine.
> 
> Exactly, and this is what makes sense.
> 
> The code should try the requested reservation and if it fails, it should
> try high allocation with default swiotlb size because we need to reserve
> *some* range.

Right, makes sense. While at it, maybe it is time to move the default
allocation policy to 'high' again. The change was reverted six years ago
because it broke old kexec tools, but those are probably out-of-service
now. I think this change would make the whole crashdump allocation
process less fragile.

Regards,

Joerg


___
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec


Re: [PATCHv7] x86/kdump: bugfix, make the behavior of crashkernel=X consistent with kaslr

2019-02-22 Thread Joerg Roedel
On Fri, Feb 22, 2019 at 10:11:01AM +0800, Dave Young wrote:
> In case people have a lot of devices need more swiotlb, then he manually
> set the ,high with ,low together.

The option to specify the high and low values for the crashkernel are
important for certain machines. The point is that swiotlb already
allocates 64MB of low memory by default. But that memory is only used
for 32bit DMA-mask devices that want to DMA into high memory. There are
drivers just allocating GFP_DMA32 memory, which also ends up in the low
region (but not swiotlb), that is why the previous default of 72MB low
memory was not enough, it only left 8MB of GFP_DMA32 memory. The current
default of 256MB was found by experiments on a bigger number of
machines, to create a reasonable default that is at least likely to be
sufficient of an average machine.

There is no way today for the kernel to find an optimum value for the
amount of low memory required to successfully create a crash dump. It
depends on the amount of devices in the system and how the drivers
for them are written. The drivers have no way to report back their
requirements, and even if they had, at the time the allocation happens
no driver is loaded yet.

So it is up to the system administrator to find workable values for the
high and low memory requirements, even using experiments as a last
resort.

Regards,

Joerg

___
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec


Re: [PATCH 00/17] Fixes and Cleanups for the Intel VT-d driver

2015-06-08 Thread Joerg Roedel
On Mon, Jun 08, 2015 at 04:06:45PM +0800, Li, ZhenHua wrote:
 Finished testing on my HP huge system, SuperDome X.
 It works well.

Thanks for testing this, Zhen-Hua. Might I add your Tested-by to the
patches?


Joerg


___
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec


Re: [PATCH v8 01/10] iommu/vt-d: Update iommu_attach_domain() and its callers

2015-01-12 Thread Joerg Roedel
On Mon, Jan 12, 2015 at 03:06:19PM +0800, Li, Zhen-Hua wrote:
 Allow specification of the domain-id for the new domain.
 This patch only adds the 'did' parameter to iommu_attach_domain()
 and modifies all of its callers to specify the default value of -1
 which says no did specified, allocate a new one.

I think its better to keep the old iommu_attach_domain() interface in
place and introduce a new function (like iommu_attach_domain_with_id()
or something) which has the additional parameter. Then you can rewrite
iommu_attach_domain():

iommu_attach_domai(...)
{
return iommu_attach_domain_with_id(..., -1);
}

This way you don't have to update all the callers of
iommu_attach_domain() and the interface is more readable.


Joerg


___
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec


Re: [PATCH v8 02/10] iommu/vt-d: Items required for kdump

2015-01-12 Thread Joerg Roedel
On Mon, Jan 12, 2015 at 11:15:38AM -0500, Vivek Goyal wrote:
 On Mon, Jan 12, 2015 at 05:06:46PM +0100, Joerg Roedel wrote:
  On Mon, Jan 12, 2015 at 10:29:19AM -0500, Vivek Goyal wrote:
   Kdump has the notion of backup region. Where certain parts of old kernels
   memory can be moved to a different location (first 640K on x86 as of now)
   and new kernel can make use of this memory now.
   
   So we will have to just make sure that no parts of this old page table
   fall into backup region.
  
  Uuh, looks like the 'iommu-with-kdump-issue' isn't complicated enough
  yet ;)
  Sadly, your above statement is true for all hardware-accessible data
  structures in IOMMU code. I think about how we can solve this, is there
  an easy way to allocate memory that is not in any backup region?
 
 Hmm..., there does not seem to be any easy way to do this. In fact, as of
 now, kernel does not even know where is backup region. All these details are
 managed by user space completely (except for new kexec_file_load() syscall).
 
 That means we are left with ugly options now.
 
 - Define per arch kexec backup regions in kernel and export it to user
   space and let kexec-tools make use of that deinition (instead of
   defining its own). That way memory allocation code in kernel can look
   at this backup area and skip it for certain allocations.

Yes, that makes sense. In fact, I think all allocations for DMA memory
need to take this into account to avoid potentially serious data
corruption.
If any memory for a disk superblock gets allocated in backup memory and
a kdump happens, the new kernel might zero out that area and the disk
controler then writes the zeroes to disk instead of the superblock.


Joerg


___
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec


Re: [PATCH v8 02/10] iommu/vt-d: Items required for kdump

2015-01-12 Thread Joerg Roedel
On Mon, Jan 12, 2015 at 03:06:20PM +0800, Li, Zhen-Hua wrote:
 +
 +#ifdef CONFIG_CRASH_DUMP
 +
 +/*
 + * Fix Crashdump failure caused by leftover DMA through a hardware IOMMU
 + *
 + * Fixes the crashdump kernel to deal with an active iommu and legacy
 + * DMA from the (old) panicked kernel in a manner similar to how legacy
 + * DMA is handled when no hardware iommu was in use by the old kernel --
 + * allow the legacy DMA to continue into its current buffers.
 + *
 + * In the crashdump kernel, this code:
 + * 1. skips disabling the IOMMU's translating of IO Virtual Addresses (IOVA).
 + * 2. Do not re-enable IOMMU's translating.
 + * 3. In kdump kernel, use the old root entry table.
 + * 4. Leaves the current translations in-place so that legacy DMA will
 + *continue to use its current buffers.
 + * 5. Allocates to the device drivers in the crashdump kernel
 + *portions of the iova address ranges that are different
 + *from the iova address ranges that were being used by the old kernel
 + *at the time of the panic.
 + *
 + */

It looks like you are still copying the io-page-tables from the old
kernel into the kdump kernel, is that right? With the approach that was
proposed you only need to copy over the context entries 1-1. They are
still pointing to the page-tables in the old kernels memory (which is
just fine).

The root-entry of the old kernel is also re-used, and when the kdump
kernel starts to use a device, its context entry is updated to point to
a newly allocated page-table.


Joerg


___
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec


Re: [PATCH v8 02/10] iommu/vt-d: Items required for kdump

2015-01-12 Thread Joerg Roedel
On Mon, Jan 12, 2015 at 10:29:19AM -0500, Vivek Goyal wrote:
 Kdump has the notion of backup region. Where certain parts of old kernels
 memory can be moved to a different location (first 640K on x86 as of now)
 and new kernel can make use of this memory now.
 
 So we will have to just make sure that no parts of this old page table
 fall into backup region.

Uuh, looks like the 'iommu-with-kdump-issue' isn't complicated enough
yet ;)
Sadly, your above statement is true for all hardware-accessible data
structures in IOMMU code. I think about how we can solve this, is there
an easy way to allocate memory that is not in any backup region?

Thanks,

Joerg


___
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec


Re: [PATCH 0/5] iommu/vt-d: Fix crash dump failure caused by legacy DMA/IO

2014-12-12 Thread Joerg Roedel
On Fri, Dec 12, 2014 at 10:25:31AM +0800, Li, ZhenHua wrote:
 Sorry I have no plan yet.
 Could you send me your logs on your AMD system?

 On 12/10/2014 04:46 PM, Baoquan He wrote:
 This issue happens on AMD iommu too, do you have any plans or
 thoughts on that?

I think the best approach for now is to get a prove-of-concept on the
VT-d driver. If it works there the way we expect, we can implement the
same handling in the AMD driver. But I see no reason to hold back the
VT-d patches until it is also fixed for AMD systems.


Joerg



___
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec


Re: [PATCH 0/5] iommu/vt-d: Fix crash dump failure caused by legacy DMA/IO

2014-12-01 Thread Joerg Roedel
On Mon, Dec 01, 2014 at 02:31:38PM +0800, Li, ZhenHua wrote:
 After I implement these two steps, there comes a new fault:
 
 [1.594890] dmar: DRHD: handling fault status reg 2
 [1.594894] dmar: INTR-REMAP: Request device [[41:00.0] fault index 4d
 [1.594894] INTR-REMAP:[fault reason 34] Present field in the IRTE entry
 is clear
 
 It is caused by similar reason, so I will fix it like fixing the DMAR
 faults: Do NOT disable and re-enable the interrupt remapping, try to
 use data from old kernel.

Yes, that sounds right, thanks. Looking forward to your patches.


Joerg


___
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec


Re: [PATCH 0/5] iommu/vt-d: Fix crash dump failure caused by legacy DMA/IO

2014-11-17 Thread Joerg Roedel
On Fri, Nov 14, 2014 at 02:27:44PM +0800, Li, ZhenHua wrote:
 I am working following  your directions:
 
 1.  If the VT-d driver finds the IOMMU enabled, it reuses its root entry
 table, and do NOT disable-enable iommu. Other data will be copied.
 
 2. When a device driver issues the first dma_map command for a
 device, we assign a new and empty page-table, thus removing all
 mappings from the old kernel for the device.
 
 Please let me know if I get something wrong.

Yes, this sounds right. Happily waiting for patches :)


Joerg


___
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec


Re: [PATCH 0/8] iommu/vt-d: Fix crash dump failure caused by legacy DMA/IO

2014-10-22 Thread Joerg Roedel
Hi Bjorn,

On Tue, Oct 21, 2014 at 08:16:46PM -0600, Bjorn Helgaas wrote:
 I was looking at Zhen-Hua's recent patches, trying to figure out if I
 need to do anything with them.  Resetting devices in the old kernel
 seems like a non-starter.  Resetting devices in the new kernel, ...,
 well, maybe.  It seems ugly, and it seems like the sort of problem
 that IOMMUs are designed to solve.

Actually resetting the devices in the kdump kernel would be one of the
better solutions for this problem. When we have a generic way to stop
all in-flight DMA from the PCI endpoints we could safely disable and
then re-enable the IOMMU.

 On Wed, Jul 2, 2014 at 7:32 AM, Joerg Roedel j...@8bytes.org wrote:
  That is a solution to prevent the in-flight DMA failures. But what
  happens when there is some in-flight DMA to a disk to write some inodes
  or a new superblock. Then this scratch address-space may cause
  filesystem corruption at worst.
 
 This in-flight DMA is from a device programmed by the old kernel, and
 it would be reading data from the old kernel's buffers.  I think
 you're suggesting that we might want that DMA read to complete so the
 device can update filesystem metadata?

Well, it is not about updating filesystem metadata. In the kdump kernel
we have these options:

1) Disable the IOMMU. Problem here is, that DMA is now
   untranslated, so that any in-flight DMA might read or write
   from a random location in memory, corrupting the kdump or
   even the new kexec kernel memory. So this is a non-starter.

2) Re-program the IOMMU to block all DMA. This is safer as it
   does not corrupt any data in memory. But some devices react
   very poorly on a master abort from the IOMMU, so bad that the
   driver in the kdump kernel fails to initialize that device.
   In this case taking dump itself might fail (and often does,
   according to reports)

3) To prevent master aborts like in option (2), David suggested
   to map the whole DMA address space to a scratch page. This
   also prevents system memory corruption and the master aborts.
   But the problem is, that in-flight DMA will now read all
   zeros. This can corrupt disk and network data, at worst it
   nulls out the superblocks of your filesystem and you lose all
   data. So this is not an option too.

What we currently do in the VT-d driver is a mixture of (1) and (2). The
VT-d driver disables the IOMMU hardware (opening a race window for
memory data corruption), re-initializes it to reject any ongoing DMA
(which causes master aborts for in-flight DMA) and re-enables the IOMMU
hardware.

But this strategy fails in heavy IO environments quite often and we look
into ways to solve the problem, or at least improve the current
situation as good as we can.

I talked to David about this at LPC and we came up with basically this
procedure:

1. If the VT-d driver finds the IOMMU enabled, it reuses its
   root-context table. This way the IOMMU must not be disabled
   and re-enabled, eliminating the race we have now.
   Other data structures like the context-entries are copied
   over from the old kernel.  We basically keep all mappings
   from the old kernel, allowing any in-flight DMA to succeed.
   No memory or disk data corruption.
   The issue here is, that we modify data from the old kernel
   which is about to be dumped. But there are ways to work
   around that.

2. When a device driver issues the first dma_map command for a
   device, we assign a new and empty page-table, thus removing
   all mappings from the old kernel for the device.
   Rationale is, that at this point the device driver should
   have reset the device to a point where all in-flight DMA is
   canceled.

This approach goes into the same direction as Bill Sumners patch-set,
which Li took over. But it goes not as far as keeping the old mappings
while the kdump kernel is still working with the devices (which might
introduce new issues and corner cases).

  So with this in mind I would prefer initially taking over the
  page-tables from the old kernel before the device drivers re-initialize
  the devices.
 
 This makes the dump kernel more dependent on data from the old kernel,
 which we obviously want to avoid when possible.

Sure, but this is not really possible here (unless we have a generic and
reliable way to reset all PCI endpoint devices and cancel all in-flight
DMA before we disable the IOMMU in the kdump kernel).
Otherwise we always risk data corruption somewhere, in system memory or
on disk.

 I didn't find the previous discussion where pointing every virtual bus
 address at the same physical scratch page was proposed.  Why was that
 better than programming the IOMMU to reject every DMA?

As I said, the problem

Re: [PATCH 0/8] iommu/vt-d: Fix crash dump failure caused by legacy DMA/IO

2014-07-02 Thread Joerg Roedel
Hi David,

On Wed, Apr 30, 2014 at 11:49:33AM +0100, David Woodhouse wrote:
 There could be all kinds of existing mappings in the DMA page tables,
 and I'm not sure it's safe to preserve them. What prevents the crashdump
 kernel from trying to use any of the physical pages which are
 accessible, and which could thus be corrupted by stray DMA?
 
 In fact, the old kernel could even have set up 1:1 passthrough mappings
 for some devices, which would then be able to DMA *anywhere*. Surely we
 need to prevent that?

Ideally we would prevent that, yes. But the problem is that a failed DMA
transaction might put the device into an unrecoverable state. Usually
any in-flight DMA transactions should only target buffers set up by the
previous kernel and not corrupt any data.

 After the last round of this patchset, we discussed a potential
 improvement where you point every virtual bus address at the *same*
 physical scratch page.

That is a solution to prevent the in-flight DMA failures. But what
happens when there is some in-flight DMA to a disk to write some inodes
or a new superblock. Then this scratch address-space may cause
filesystem corruption at worst.

So with this in mind I would prefer initially taking over the
page-tables from the old kernel before the device drivers re-initialize
the devices.


Joerg



___
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec


Re: [PATCHv3 3/6] Crashdump-Accepting-Active-IOMMU-Domain-Interfaces

2014-03-04 Thread Joerg Roedel
On Fri, Jan 10, 2014 at 03:07:29PM -0700, Bill Sumner wrote:
 +context_get_entry(struct context_entry *context_addr,
 + struct intel_iommu *iommu, u32 bus, int devfn)
 +{
 + unsigned long long q;   /* quadword scratch */
 + struct root_entry *root_phys;   /* Phys adr (root table entry) */
 + struct root_entry  root_temp;   /* Local copy of root_entry */
 + struct context_entry *context_phys; /* Phys adr */
 +
 + if (pr_dbg.domain_get)
 + pr_debug(ENTER %s B:D:F=%2.2x:%2.2x:%1.1x 
 context_entry:0x%llx intel_iommu:0x%llx\n,
 + __func__, bus, devfn3, devfn7,
 + (u64)context_addr, (u64)iommu);
 +
 + if (bus  255)  /* Sanity check */
 + return -5;
 + if (devfn  255 || devfn  0)   /* Sanity check */
 + return -6;
 +
 + q = readq(iommu-reg + DMAR_RTADDR_REG);
 + pr_debug(IOMMU %d: DMAR_RTADDR_REG:0x%16.16llx\n, iommu-seq_id, q);
 + if (!q)
 + return -1;
 +
 + root_phys = (struct root_entry *) q;/* Adr(base of vector) */
 + root_phys += bus;   /* Adr(entry we want) */
 +
 + oldcopy(root_temp, root_phys, sizeof(root_temp));
 +
 + pr_debug(root_temp.val:0x%llx .rsvd1:0x%llx root_phys:0x%llx\n,
 + root_temp.val, root_temp.rsvd1, (u64)root_phys);
 +
 + if (!root_present(root_temp))
 + return -2;
 +
 + pr_debug(B:D:F=%2.2x:%2.2x:%1.1x root_temp.val: %llx .rsvd1: %llx\n,
 + bus, devfn  3, devfn  7, root_temp.val, root_temp.rsvd1);
 +
 + if (root_temp.rsvd1)/* If (root_entry is bad) */
 + return -3;
 +
 + context_phys = get_context_phys_from_root(root_temp);
 + if (!context_phys)
 + return -4;

What do all these magic return values mean? 



___
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec


Re: [PATCHv3 0/6] Crashdump Accepting Active IOMMU

2014-03-04 Thread Joerg Roedel
Hi Bill,

On Fri, Jan 10, 2014 at 03:07:26PM -0700, Bill Sumner wrote:
 Bill Sumner (6):
   Crashdump-Accepting-Active-IOMMU-Flags-and-Prototype
   Crashdump-Accepting-Active-IOMMU-Utility-functions
   Crashdump-Accepting-Active-IOMMU-Domain-Interfaces
   Crashdump-Accepting-Active-IOMMU-Copy-Translations
   Crashdump-Accepting-Active-IOMMU-Debug-Print-IOMMU
   Crashdump-Accepting-Active-IOMMU-Call-From-Mainline
 
  drivers/iommu/intel-iommu.c | 1293 
 ---
  1 file changed, 1225 insertions(+), 68 deletions(-)

This is a pretty big change, before merging I would like to get more
eyes on it. Please make sure you get more reviews on the code and the
general concept. A few things I noticed while looking at it:

* You put a lot of code into #define CONFIG_CRASH_DUMP. Isn't the same
  approach necessary to for KEXEC?

* Please get rid of all the pr_debug stuff.

* I think it makes sense to put the functions to access the IOMMU
  configuration values of the old kernel into a new file. This is better
  than adding over 1000 new lines to intel-iommu.c

* I have seen your new single-patch for this change. A single patch with
  more than 1200 changed lines is not something anyone is willing to
  review. Please split the patches again in a meaningful and bisectable
  way.


Thanks,

Joerg



___
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec


Re: [PATCH v2] intel-iommu: Quiesce devices before disabling IOMMU

2013-09-24 Thread Joerg Roedel
On Wed, Sep 18, 2013 at 03:09:01PM +0900, Takao Indoh wrote:
 + /*
 +  *  In the case of kdump, ioremap is needed because root-entry table
 +  *  exists in first kernel's memory area which is not mapped in second
 +  *  kernel
 +  */
 + root = (struct root_entry *)ioremap(addr, PAGE_SIZE);
 + if (!root)
 + return;
 +
 + for (bus = 0; bus  ROOT_ENTRY_NR; bus++) {
 + if (!root_present(root[bus]))
 + continue;
 +
 + context = (struct context_entry *)ioremap(
 + root[bus].val  VTD_PAGE_MASK, PAGE_SIZE);
 + if (!context)
 + continue;
 +
 + for (devfn = 0; devfn  CONTEXT_ENTRY_NR; devfn++) {
 + if (!context_present(context[devfn]))
 + continue;
 +
 + dev = pci_get_domain_bus_and_slot(segment, bus, devfn);
 + if (!dev)
 + continue;
 +
 + if (!pci_reset_bus(dev-bus)) /* go to next bus */
 + break;
 + else /* Try per-function reset */
 + pci_reset_function(dev);
 +
 + }
 + iounmap(context);
 + }
 + iounmap(root);

I am not convinced that this is the right approach. If a device wasn't
translated by VT-d in the old kernel doesn't mean it will not be
translated in the new kernel. How about unconditionally resetting all
PCI busses and/or functions here before IOMMU initialization proceeds?


Joerg



___
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec


Re: [PATCH] intel-iommu: Synchronize gcmd value with global command register

2013-04-15 Thread Joerg Roedel
On Mon, Apr 15, 2013 at 06:00:13PM +0900, Takao Indoh wrote:
 On DMAR initialization during kdump boot, do you guys agree to change
 order like this?
 
 Current order:
 (1) Disable translation
 (2) PCI initialization
 (3) Make pgtable and enable translation.
 
 Order I'm proposing:
 (1) PCI initialization
 (2) Disable translation
 (3) Make pgtable and enable translation.

I think this should work. In fact, translation only needs to be disabled
while the IOMMU hardware is reprogrammed to the new data structures the
kdump kernel set up.

 As Joerg said, if we need to consider the case that kdump kernel is
 compiled without dma-remapping(CONFIG_INTEL_IOMMU is off?), I can update
 my patch to handle such a case properly by adding ifdef
 CONFIG_INTEL_IOMMU.

Thinking about it, this case is a bit more special. If the kdump kernel
has no IOMMU driver at all there is also no code to disable it. So
unless we want to force the kdump kernel to have the driver when the
first kernel had it, I think we also want some quirk (depending on
!CONFIG_INTEL_IOMMU) that disables translation at boot.

I know, its complicated :)


Joerg



___
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec


Re: [PATCH] intel-iommu: Synchronize gcmd value with global command register

2013-04-05 Thread Joerg Roedel
On Wed, Apr 03, 2013 at 09:24:39AM +0100, David Woodhouse wrote:
 On Wed, 2013-04-03 at 16:11 +0900, Takao Indoh wrote:
  Yeah, you are right. I forgot such a case.
 
 If you disable translation and there's some device still doing DMA, it's
 going to scribble over random areas of memory. You really want to have
 translation enabled and all the page tables *cleared*, during kexec. I
 think it's fair to insist that the secondary kernel should use the IOMMU
 if the first one did.

Do we really need to insist on that? The IOMMU initialization on x86
happens after the kernel scanned and enumerated the PCI bus. While doing
this the kernel (at least it should) disables all devices it finds. So
when the IOMMU init code runs we should be safe from any in-flight DMA
and can either disable translation or re-initialize it for the kdump
kernel. Until then translation needs to stay enabled of course, so that
the old page-tables are still used and in-flight DMA doesn't corrupt
any data.


Joerg



___
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec


Re: [PATCH] intel-iommu: Synchronize gcmd value with global command register

2013-04-02 Thread Joerg Roedel
On Mon, Apr 01, 2013 at 02:45:18PM +0900, Takao Indoh wrote:
 Current flow on kdump boot
 enable_IR
   intel_enable_irq_remapping
 iommu_disable_irq_remapping  == IRES/QIES/TES disabled here
 dmar_disable_qi  == do nothing
 dmar_enable_qi   == QIES enabled
 intel_setup_irq_remapping== IRES enabled

But what we want to do here in the kdumo case is to disable translation
too, right? Because the former kernel might have translation and
irq-remapping enabled and the kdump kernel might be compiled without
support for dma-remapping. So if we don't disable translation here too
the kdump kernel is unable to do DMA.

I agree that disabling translation should be a bit more explicit instead
of the current code.


Joerg



___
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec


Re: [PATCH] intel-iommu: Synchronize gcmd value with global command register

2013-03-27 Thread Joerg Roedel
On Wed, Mar 27, 2013 at 02:02:44PM +0900, Takao Indoh wrote:
 The root cause of this problem is mismatch between iommu-gcmd and
 global command register in the case of kdump. At boot time, initial
 value of iommu-gcmd is zero as I wrote above, but actual global command
 register is *not* zero because some bits like IRE/TE/QIE are already set
 in *first* kernel. Therefore this patch synchronize them to fix this
 problem.

Ok, I understand, but I still don't see why this is a problem. There is
no point for the kdump kernel to preserve hardware state from the
previous kernel. So I think the way it is implemented is correct.


Joerg



___
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec


Re: [PATCH] intel-iommu: Synchronize gcmd value with global command register

2013-03-26 Thread Joerg Roedel
On Thu, Mar 21, 2013 at 10:32:36AM +0900, Takao Indoh wrote:
 In this function, clearing IRE bit in iommu-gcmd and writing it to
 global command register. But initial value of iommu-gcmd is zero, so
 this writel means clearing all bits in global command register.

Seems weird. Why is the value of gcmd zero in your case? The usage of
this register is well encapsulated by the different parts of the VT-d
driver. There are other places which enable/disable translation and qpi
the same way it is done with interrupt remapping. So it looks to me that
it is unlikely that gcmd is really zero in your case.

Can you explain that more and also describe what the actual misbehavior
is you are trying to fix here?

Thanks,

Joerg



___
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec


Re: [PATCH 0/4] Export offsets of VMCS fields as note information for kdump

2012-04-11 Thread Joerg Roedel
Hi,

On Wed, Apr 11, 2012 at 09:39:43AM +0800, zhangyanfei wrote:
 The problem is that VMCS internal is hidden by Intel in its
 specification. So, we reverse engineering it in the way implemented in
 this patch set.

Have you made sure this layout is the same on all uarchitectures that
implment VMX?


Joerg

-- 
AMD Operating System Research Center

Advanced Micro Devices GmbH Einsteinring 24 85609 Dornach
General Managers: Alberto Bozzo
Registration: Dornach, Landkr. Muenchen; Registerger. Muenchen, HRB Nr. 43632


___
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec


Re: [PATCH 0/4] AMD IOMMU kdump fix plus cleanups (v2)

2010-04-07 Thread Joerg Roedel
On Fri, Apr 02, 2010 at 06:27:51PM -0700, Chris Wright wrote:
 Series includes the following patches:
 
   x86/amd-iommu: enable iommu before attaching devices
   x86/amd-iommu: warn when issuing command to uninitialized cmd buffer
   Revert x86: disable IOMMUs on kernel crash
   x86/amd-iommu: use for_each_pci_dev
 
 First one is the primary bug fix.
 
 v2
 - add disable_iommus on failure path
 - move removal of CMD_BUFFER_UNINITIALIZED to iommu_enable_command_buffer()
 - fix ;; typo in patch 2
 - add Revert x86: disable IOMMUs on kernel crash
 - add x86/amd-iommu: use for_each_pci_dev

Applied all, thanks.


___
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec


Re: [PATCH 3/4] Revert x86: disable IOMMUs on kernel crash

2010-04-04 Thread Joerg Roedel
On Sat, Apr 03, 2010 at 10:44:22AM -0700, Eric W. Biederman wrote:
 Joerg Roedel j...@8bytes.org writes:

  Hmm, I think for this we need to change the gart code too and disable
  the gart before its initialization runs to not re-introduce issues fixed
  in commit bc2cea6a34fdb30f118ec75db39a46a191870607, no?
 
 That is a different code path with a different set of assumptions and
 restrictions.  On a normal kexec of course we want to do an orderly shutdown.

Thats another problem with this patch. It introduces a difference
between the panic-shutdown kexec and the ordinary kexec.

 For the gart with a little luck we can just ignore it on kexec on
 panic.

The commit I mentioned above already proves this assumption wrong.

 Unlike a virtualization capable iommu it doesn't prevent access
 to devices, when it is enabled.   Worst case is that we have to start
 including iommu=off for gart systems.

No no no. This is a maintenance nightmare for almost everybody. Where do
you want to Document this special cases that 'if kernel uses gart then
and only then boot the kexec kernel with iommu=off'.
Always passing iommu=off to the kexec kernel doesn't work too for
obvious reasons.

 The best case is that we can figure out how to have the gart code
 reinitialize itself sanely, starting from some arbitrary point.

Yes, that is missing in this patch. But to keep changes small and don't
bother with the gart code at all I suggest to remove the shutdown
routine from the amd-iommu code only and not the whole shutdown call in
the machine_crash_shutdown path.

Joerg


___
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec


Re: [PATCH 3/4] Revert x86: disable IOMMUs on kernel crash

2010-04-04 Thread Joerg Roedel
On Sun, Apr 04, 2010 at 09:24:30AM +0200, Bernhard Walle wrote:
 Am 03.04.10 19:49, schrieb Eric W. Biederman:
  Not a problem.  We require a lot of things of the kdump kernel,
  and it is immediately apparent in a basic sanity test.
 
 Also, in most cases (for example: distribution kernels), the kdump
 kernel nowadays is identical to the running kernel. So, if the running
 kernel has IOMMU support, the kdump kernel also has.

Yes, I know. But is that a requirement for kexec? If not we still
potentially have this problem. It is a smaller problem than
data-corruption caused by in-flight dma because most
people^Wdistributions indeed use the same kernel for normal boot and
kexec, thats true.

Joerg


___
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec


Re: [PATCH 3/4] Revert x86: disable IOMMUs on kernel crash

2010-04-04 Thread Joerg Roedel
On Sun, Apr 04, 2010 at 02:44:36AM -0700, Eric W. Biederman wrote:
 Joerg Roedel j...@8bytes.org writes:
 
  On Sun, Apr 04, 2010 at 09:24:30AM +0200, Bernhard Walle wrote:
  Am 03.04.10 19:49, schrieb Eric W. Biederman:
   Not a problem.  We require a lot of things of the kdump kernel,
   and it is immediately apparent in a basic sanity test.
  
  Also, in most cases (for example: distribution kernels), the kdump
  kernel nowadays is identical to the running kernel. So, if the running
  kernel has IOMMU support, the kdump kernel also has.
 
  Yes, I know. But is that a requirement for kexec?
 
 For normal kexec no.  That path is expected to do a clean hardware
 shutdown.
 
 For kexec on panic aka kdump the requirement is that your your crash
 kernel be able to initialize your hardware from any state it can be
 put in.

Ok, if you show me where this is documented for everybody then I am
probably convinced :-)
We should fixup the gart initialization anyway.

Joerg


___
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec


Re: [PATCH 3/4] Revert x86: disable IOMMUs on kernel crash

2010-04-03 Thread Joerg Roedel
On Fri, Apr 02, 2010 at 06:27:54PM -0700, Chris Wright wrote:
 This effectively reverts commit 61d047be99757fd9b0af900d7abce9a13a337488.
 
 Disabling the IOMMU can potetially allow DMA transactions to
 complete without being translated.  Leave it enabled, and allow
 crash kernel to do the IOMMU reinitialization properly.
 
 Cc: Joerg Roedel joerg.roe...@amd.com
 Cc: Eric Biederman ebied...@xmission.com
 Cc: Neil Horman nhor...@tuxdriver.com
 Cc: Vivek Goyal vgo...@redhat.com
 Signed-off-by: Chris Wright chr...@sous-sol.org
 ---
  arch/x86/kernel/crash.c |6 --
  1 file changed, 6 deletions(-)
 
 --- a/arch/x86/kernel/crash.c
 +++ b/arch/x86/kernel/crash.c
 @@ -27,7 +27,6 @@
  #include asm/cpu.h
  #include asm/reboot.h
  #include asm/virtext.h
 -#include asm/x86_init.h
  
  #if defined(CONFIG_SMP)  defined(CONFIG_X86_LOCAL_APIC)
  
 @@ -103,10 +102,5 @@ void native_machine_crash_shutdown(struc
  #ifdef CONFIG_HPET_TIMER
   hpet_disable();
  #endif
 -
 -#ifdef CONFIG_X86_64
 - x86_platform.iommu_shutdown();
 -#endif
 -
   crash_save_cpu(regs, safe_smp_processor_id());

Hmm, I think for this we need to change the gart code too and disable
the gart before its initialization runs to not re-introduce issues fixed
in commit bc2cea6a34fdb30f118ec75db39a46a191870607, no?

Joerg


___
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec


Re: [PATCH 1/2] x86/amd-iommu: enable iommu before attaching devices

2010-04-03 Thread Joerg Roedel
On Fri, Apr 02, 2010 at 11:59:32AM -0400, Vivek Goyal wrote:
 1. kernel crashes, we leave IOMMU enabled.

True for everything except gart and amd iommu.

   a. So during this small window when iommu is disabled and we enable
  it back, any inflight DMA will passthrough possibly to an
  unintended physical address as translation is disabled and it
  can corrupt the kdump kenrel.

Right.

   b. Even after enabling the iommu, I guess we will continue to
  use cached DTE, and translation information to handle any
  in-flight DMA. The difference is that now iommus are enabled
  so any in-flight DMA should go to the address as intended in
  first kenrel and should not corrupt anything.

Right.

 
 3. Once iommus are enabled again, we allocated and initilize protection
domains. We attach devices to domains. In the process we flush the
DTE, PDE and IO TLBs.
 
   c. Looks like do_attach-set_dte_entry(), by default gives write
  permission (IW) to all the devices. I am assuming that at
  this point of time translation is enabled and possibly unity
  mapped.

No, The IW bit in the DTE must be set because all write permission bits
(DTE and page tabled) are ANDed to determine if a device can write to a
particular address. So as long as the paging mode is unequal to zero the
hardware will walk the page-table first to find out if the device has
write permission. With paging mode == 0 your statement about read-write
unity-mapping is true. This is used for a pass-through domain (iommu=pt)
btw.

Joerg


___
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec


Re: [PATCH 3/4] Revert x86: disable IOMMUs on kernel crash

2010-04-03 Thread Joerg Roedel
On Fri, Apr 02, 2010 at 06:27:54PM -0700, Chris Wright wrote:
 This effectively reverts commit 61d047be99757fd9b0af900d7abce9a13a337488.
 
 Disabling the IOMMU can potetially allow DMA transactions to
 complete without being translated.  Leave it enabled, and allow
 crash kernel to do the IOMMU reinitialization properly.
 
 Cc: Joerg Roedel joerg.roe...@amd.com
 Cc: Eric Biederman ebied...@xmission.com
 Cc: Neil Horman nhor...@tuxdriver.com
 Cc: Vivek Goyal vgo...@redhat.com
 Signed-off-by: Chris Wright chr...@sous-sol.org
 ---
  arch/x86/kernel/crash.c |6 --
  1 file changed, 6 deletions(-)
 
 --- a/arch/x86/kernel/crash.c
 +++ b/arch/x86/kernel/crash.c
 @@ -27,7 +27,6 @@
  #include asm/cpu.h
  #include asm/reboot.h
  #include asm/virtext.h
 -#include asm/x86_init.h
  
  #if defined(CONFIG_SMP)  defined(CONFIG_X86_LOCAL_APIC)
  
 @@ -103,10 +102,5 @@ void native_machine_crash_shutdown(struc
  #ifdef CONFIG_HPET_TIMER
   hpet_disable();
  #endif
 -
 -#ifdef CONFIG_X86_64
 - x86_platform.iommu_shutdown();
 -#endif
 -
   crash_save_cpu(regs, safe_smp_processor_id());

Another problem: This also breaks if the kdump kernel has no
iommu-support.

Joerg


___
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec


Re: [PATCH 1/2] x86/amd-iommu: enable iommu before attaching devices

2010-04-02 Thread Joerg Roedel
On Thu, Apr 01, 2010 at 06:23:53PM -0700, Chris Wright wrote:
 Hit another kdump problem as reported by Neil Horman.  When initializaing
 the IOMMU, we attach devices to their domains before the IOMMU is
 fully (re)initialized.  Attaching a device will issue some important
 invalidations.  In the context of the newly kexec'd kdump kernel, the
 IOMMU may have stale cached data from the original kernel.  Because we
 do the attach too early, the invalidation commands are placed in the new
 command buffer before the IOMMU is updated w/ that buffer.  This leaves
 the stale entries in the kdump context and can renders device unusable.
 Simply enable the IOMMU before we do the attach.
 
 Cc: Neil Horman nhor...@tuxdriver.com
 Cc: Vivek Goyal vgo...@redhat.com
 Signed-off-by: Chris Wright chr...@sous-sol.org
 ---
  arch/x86/kernel/amd_iommu_init.c |4 ++--
  1 file changed, 2 insertions(+), 2 deletions(-)
 
 --- a/arch/x86/kernel/amd_iommu_init.c
 +++ b/arch/x86/kernel/amd_iommu_init.c
 @@ -1288,6 +1288,8 @@ static int __init amd_iommu_init(void)
   if (ret)
   goto free;
  
 + enable_iommus();
 +
   if (iommu_pass_through)
   ret = amd_iommu_init_passthrough();
   else
 @@ -1300,8 +1302,6 @@ static int __init amd_iommu_init(void)
  
   amd_iommu_init_notifier();
  
 - enable_iommus();
 -
   if (iommu_pass_through)
   goto out;

Ok, good to know this fixes the problem. One issue: If the
initialization of the domains fails the iommu hardware needs to be
disabled again in the free path.

Thanks,

Joerg


___
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec


Re: [PATCH] amd iommu: force flush of iommu prior during shutdown

2010-04-01 Thread Joerg Roedel
Hi Neil,

first some general words about the problem you discovered: The problem
is not caused by in-flight DMA. The problem is that the IOMMU hardware
has cached the old DTE entry for the device (including the old
page-table root pointer) and is using it still when the kdump kernel has
booted. We had this problem once and fixed it by flushing a DTE in the
IOMMU before it is used for the first time. This seems to be broken
now. Which kernel have you seen this on?

I am back in office next tuesday and will look into this problem too.

On Wed, Mar 31, 2010 at 04:27:45PM -0400, Neil Horman wrote:
 So I'm officially rescinding this patch.

Yeah, the right solution to this problem is to find out why every DTE is
not longer flushed before first use.

 It apparently just covered up the problem, rather than solved it
 outright.  This is going to take some more thought on my part.  I read
 the code a bit closer, and the amd iommu on boot up currently marks
 all its entries as valid and having a valid translation (because if
 they're marked as invalid they're passed through untranslated which
 strikes me as dangerous, since it means a dma address treated as a bus
 address could lead to memory corruption.  The saving grace is that
 they are marked as non-readable and non-writeable, so any device doing
 a dma after the reinit should get logged (which it does), and then
 target aborted (which should effectively squash the translation)

Right. The default for all devices is to forbid DMA.

 I'm starting to wonder if:
 
 1) some dmas are so long lived they start aliasing new dmas that get mapped in
 the kdump kernel leading to various erroneous behavior

At least not in this case. Even when this is true the DMA would target
memory of the crashed kernel and not the kdump area. This is not even
memory corruption because the device will write to memory the driver has
allocated for it.

 2) a slew of target aborts to some hardware results in them being in an
 inconsistent state

Thats indeed true. I have seen that with ixgbe cards for example. They
seem to be really confused after an target abort.

 I'm going to try marking the dev table on shutdown such that all devices have 
 no
 read/write permissions to see if that changes the situation.  It should I 
 think
 give me a pointer as to weather (1) or (2) is the more likely problem.

Probably not. You still need to flush the old entries out of the IOMMU.

Thanks,

Joerg


___
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec


Re: [PATCH] amd iommu: force flush of iommu prior during shutdown

2010-04-01 Thread Joerg Roedel
On Thu, Apr 01, 2010 at 10:47:36AM -0400, Neil Horman wrote:
 On Thu, Apr 01, 2010 at 04:29:02PM +0200, Joerg Roedel wrote:
  I am back in office next tuesday and will look into this problem too.
  
 Thank you.

Just took a look and I think the problem is that the devices are
attached to domains before the IOMMU hardware is enabled. This happens
in the function prealloc_protection_domains(). The attach code issues
the dte-invalidate commands but they are not executed because the
hardware is off. I will verify this when I have access to hardware
again.
The possible fix will be to enable the hardware earlier in the
initialization path.

  Right. The default for all devices is to forbid DMA.
  
 Thanks, glad to know I read that right, took me a bit to understand it :)

I should probably add a comment :-)

  Thats indeed true. I have seen that with ixgbe cards for example. They
  seem to be really confused after an target abort.
  
 Yeah, this part worries me, target aborts lead to various brain dead hardware
 pieces.  What are you thoughts on leaving the iommu on through a reboot to 
 avoid
 this issue (possibly resetting any pci device that encounters a target abort, 
 as
 noted in the error log on the iommu?

This would only prevent possible data corruption. When the IOMMU is off
the devices will not get a target abort but will only write to different
physical memory locations. The window where a target abort can happen
starts when the kdump kernel re-enables the IOMMU and ends when the new
driver for that device attaches. This is a small window but there is not
a lot we can do to avoid this small time window.

Joerg


___
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec


Re: [PATCH] amd iommu: force flush of iommu prior during shutdown

2010-04-01 Thread Joerg Roedel
On Thu, Apr 01, 2010 at 01:11:49PM -0400, Neil Horman wrote:
 On Thu, Apr 01, 2010 at 05:56:43PM +0200, Joerg Roedel wrote:

  The possible fix will be to enable the hardware earlier in the
  initialization path.
  
 That sounds like a reasonable theory, I'll try hack something together
 shortly.

Great. So the problem might be already fixed when I am back in the
office ;-)

  This would only prevent possible data corruption. When the IOMMU is off
  the devices will not get a target abort but will only write to different
  physical memory locations. The window where a target abort can happen
  starts when the kdump kernel re-enables the IOMMU and ends when the new
  driver for that device attaches. This is a small window but there is not
  a lot we can do to avoid this small time window.
  
 Can you explain this a bit further please?  From what I read, when the iommu 
 is
 disabled, AIUI it does no translations.  That means that any dma addresses 
 which
 the driver mapped via the iommu prior to a crash that are stored in devices 
 will
 just get strobed on the bus without any translation.  If those dma address do
 not lay on top of any physical ram, won't that lead to bus errors, and
 transaction aborts?  Worse, if those dma addresses do lie on top of real
 physical addresses, won't we get corruption in various places?  Or am I 
 missing
 part of how that works?

Hm, the device address may not be a valid host physical address, thats
true. But the problem with the small time-window when the IOMMU hardware
is re-programmed from the kdump kernel still exists.
I need to think about other possible side-effects of leaving the IOMMU
enabled on shutdown^Wboot into a kdump kernel.

Joerg


___
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec