Re: [PATCH v2] x86/kexec: Add EFI config table identity mapping for kexec kernel
On Fri, Jul 07, 2023 at 12:23:59PM +0800, Baoquan He wrote: > I am wondering why we don't detect the cpu type and return early inside > sev_enable() if it's Intel cpu. > > We can't rely on CONFIG_AMD_MEM_ENCRYPT to decide if the code need be > executed or not because we usually enable them all in distros. Looking at the code in head_64.S, by the time sev_enable() runs the SEV bit should already be set in sev_status. Maybe use that to detect whether SEV is enabled and bail out early? Regards, -- Jörg Rödel jroe...@suse.de SUSE Software Solutions Germany GmbH Frankenstraße 146 90461 Nürnberg Germany (HRB 36809, AG Nürnberg) Geschäftsführer: Ivo Totev, Andrew Myers, Andrew McDonald, Boudien Moerman ___ kexec mailing list kexec@lists.infradead.org http://lists.infradead.org/mailman/listinfo/kexec
Re: [PATCH v2 08/12] x86/sev: Park APs on AP Jump Table with GHCB protocol version 2
On Fri, Nov 12, 2021 at 05:33:05PM +0100, Borislav Petkov wrote: > On Mon, Sep 13, 2021 at 05:55:59PM +0200, Joerg Roedel wrote: > > +"ljmpl *%0" : : > > +"m" (real_mode_header->sev_real_ap_park_asm), > > +"b" (sev_es_jump_table_pa >> 4)); > > In any case, this asm needs comments: why those regs, why > sev_es_jump_table_pa >> 4 in rbx (I found later in the patch why) and so > on. Turned out the jump_table_pa is not used in asm code anymore. It was a left-over from a previous version of the patch, it is removed now. > > +SYM_INNER_LABEL(sev_ap_park_paging_off, SYM_L_GLOBAL) > > Global symbol but used only in this file. .L-prefix then? It needs to be a global symbol so the pa_ variant can be generated. Regards, -- Jörg Rödel jroe...@suse.de SUSE Software Solutions Germany GmbH Maxfeldstr. 5 90409 Nürnberg Germany (HRB 36809, AG Nürnberg) Geschäftsführer: Ivo Totev ___ kexec mailing list kexec@lists.infradead.org http://lists.infradead.org/mailman/listinfo/kexec
Re: [PATCH v2 07/12] x86/sev: Setup code to park APs in the AP Jump Table
On Wed, Nov 10, 2021 at 05:37:32PM +0100, Borislav Petkov wrote: > On Mon, Sep 13, 2021 at 05:55:58PM +0200, Joerg Roedel wrote: > > extern unsigned char real_mode_blob[]; > > diff --git a/arch/x86/include/asm/sev-ap-jumptable.h > > b/arch/x86/include/asm/sev-ap-jumptable.h > > new file mode 100644 > > index ..1c8b2ce779e2 > > --- /dev/null > > +++ b/arch/x86/include/asm/sev-ap-jumptable.h > > Why a separate header? arch/x86/include/asm/sev.h looks small enough. The header is included in assembly, so I made a separate one. > > +void __init sev_es_setup_ap_jump_table_data(void *base, u32 pa) > > Why is this a separate function? > > It is all part of the jump table setup. Right, but the sev_es_setup_ap_jump_table_blob() function is already pretty big and I wanted to keep things readable. > > > + return 0; > > Why are you returning 0 here and below? This is in an initcall and it returns just 0 when the environment is not ready to setup the ap jump table. Returning non-zero would create a warning message in the caller for something that is not a bug in the kernel. > > + * This file contains the source code for the binary blob which gets > > copied to > > + * the SEV-ES AP Jumptable to park APs while offlining CPUs or booting a > > new > > I've seen "Jumptable", "Jump Table" and "jump table" at least. I'd say, do > the last one everywhere pls. Fair, sorry for my english being too german :) I changed everything to 'jump table'. > > + /* Reset remaining registers */ > > + movl$0, %esp > > + movl$0, %eax > > + movl$0, %ebx > > + movl$0, %edx > > All 4: use xor XOR changes EFLAGS, can not use them here. > > + > > + /* Reset CR0 to get out of protected mode */ > > + movl$0x6010, %ecx > > Another magic naked number. This is the CR0 reset value. I have updated the comment to make this more clear. Thanks, -- Jörg Rödel jroe...@suse.de SUSE Software Solutions Germany GmbH Maxfeldstr. 5 90409 Nürnberg Germany (HRB 36809, AG Nürnberg) Geschäftsführer: Ivo Totev ___ kexec mailing list kexec@lists.infradead.org http://lists.infradead.org/mailman/listinfo/kexec
Re: [PATCH v2 03/12] x86/sev: Save and print negotiated GHCB protocol version
On Wed, Nov 03, 2021 at 03:27:23PM +0100, Borislav Petkov wrote: > On Mon, Sep 13, 2021 at 05:55:54PM +0200, Joerg Roedel wrote: > > From: Joerg Roedel > > > > Save the results of the GHCB protocol negotiation into a data structure > > and print information about versions supported and used to the kernel > > log. > > Which is useful for? For easier debugging, I added a sentence about that to the changelog. > > +struct sev_ghcb_protocol_info { > > Too long a name - ghcb_info is perfectly fine. Changed, thanks. -- Jörg Rödel jroe...@suse.de SUSE Software Solutions Germany GmbH Maxfeldstr. 5 90409 Nürnberg Germany (HRB 36809, AG Nürnberg) Geschäftsführer: Ivo Totev ___ kexec mailing list kexec@lists.infradead.org http://lists.infradead.org/mailman/listinfo/kexec
Re: [PATCH v2 01/12] kexec: Allow architecture code to opt-out at runtime
Hi again, On Mon, Nov 01, 2021 at 04:11:42PM -0500, Eric W. Biederman wrote: > I seem to remember the consensus when this was reviewed that it was > unnecessary and there is already support for doing something like > this at a more fine grained level so we don't need a new kexec hook. Forgot to state to problem again which these patches solve: Currently a Linux kernel running as an SEV-ES guest has no way to successfully kexec into a new kernel. The normal SIPI sequence to reset the non-boot VCPUs does not work in SEV-ES guests and special code is needed in Linux to safely hand over the VCPUs from one kernel to the next. What happens currently is that the kexec'ed kernel will just hang. The code which implements the VCPU hand-over is also included in this patch-set, but it requires a certain level of Hypervisor support which is not available everywhere. To make it clear to the user that kexec will not work in their environment, it is best to disable the respected syscalls. This is what the hook is needed for. Regards, -- Jörg Rödel jroe...@suse.de SUSE Software Solutions Germany GmbH Maxfeldstr. 5 90409 Nürnberg Germany (HRB 36809, AG Nürnberg) Geschäftsführer: Ivo Totev ___ kexec mailing list kexec@lists.infradead.org http://lists.infradead.org/mailman/listinfo/kexec
Re: [PATCH v2 01/12] kexec: Allow architecture code to opt-out at runtime
On Mon, Nov 01, 2021 at 04:11:42PM -0500, Eric W. Biederman wrote: > I seem to remember the consensus when this was reviewed that it was > unnecessary and there is already support for doing something like > this at a more fine grained level so we don't need a new kexec hook. It was a discussion, no consenus :) I still think it is better to solve this in generic code for everybody to re-use than with an hack in the architecture hooks. More and more platforms which enable confidential computing features may need this hook in the future. Regards, -- Jörg Rödel jroe...@suse.de SUSE Software Solutions Germany GmbH Maxfeldstr. 5 90409 Nürnberg Germany (HRB 36809, AG Nürnberg) Geschäftsführer: Ivo Totev ___ kexec mailing list kexec@lists.infradead.org http://lists.infradead.org/mailman/listinfo/kexec
Re: [PATCH v2 00/12] x86/sev: KEXEC/KDUMP support for SEV-ES guests
On Mon, Sep 13, 2021 at 09:02:38AM -0700, Dave Hansen wrote: > On 9/13/21 8:55 AM, Joerg Roedel wrote: > > This does not work under SEV-ES, because the hypervisor has no access > > to the vCPU registers and can't make modifications to them. So an > > SEV-ES guest needs to reset the vCPU itself and park it using the > > AP-reset-hold protocol. Upon wakeup the guest needs to jump to > > real-mode and to the reset-vector configured in the AP-Jump-Table. > > How does this end up looking to an end user that tries to kexec() from a > an SEV-ES kernel? Does it just hang? Yes, the kexec will just hang. This patch-set contains code to disable the kexec syscalls in situations where it would not work for that reason. Actually with the changes to the decompressor in this patch-set the kexec'ed kernel could boot, but would fail to bring up all the APs. Regards, Joerg ___ kexec mailing list kexec@lists.infradead.org http://lists.infradead.org/mailman/listinfo/kexec
Re: [PATCH 04/12] x86/sev: Do not hardcode GHCB protocol version
Hi Tom, On Wed, Jul 21, 2021 at 04:17:38PM -0500, Tom Lendacky wrote: > On 7/21/21 9:20 AM, Joerg Roedel wrote: > > /* Fill in protocol and format specifiers */ > > - ghcb->protocol_version = GHCB_PROTOCOL_MAX; > > + ghcb->protocol_version = sev_get_ghcb_proto_ver(); > > So this probably needs better clarification in the spec, but the GHCB > version field is for the GHCB structure layout. So if you don't plan to > use the XSS field that was added for version 2 of the layout, then you > should report the GHCB structure version as 1. Yeah, this makes sense, exept for the struct field-name ;) But anyway, I keep the version field at 1 for now. Regards, Joerg ___ kexec mailing list kexec@lists.infradead.org http://lists.infradead.org/mailman/listinfo/kexec
Re: [PATCH 11/12] x86/sev: Handle CLFLUSH MMIO events
Hi Sean, On Fri, Jul 30, 2021 at 10:42:30PM +, Sean Christopherson wrote: > On Wed, Jul 21, 2021, Joerg Roedel wrote: > This wording can be misread as "the hypervisor is responsible for _all_ cache > management". Maybe just: > > /* >* Ignore CLFLUSHes - the hyperivsor is responsible for cache >* management of emulated MMIO. >*/ Right, will update the comment, thanks. > Side topic, out of curisoity, what's mapping/accessing emulated MMIO as > non-UC? The CLFLUSHes happen when the kexec'ed kernel maps the VGA framebuffer as unencrypted. Initially it is mapped encrypted and before re-mapping the kernel flushes the range from the caches. I have not investigated why this doesn't happen on the first boot, though. Regards, Joerg ___ kexec mailing list kexec@lists.infradead.org http://lists.infradead.org/mailman/listinfo/kexec
Re: [PATCH 2/2] x86/kexec/64: Forbid kexec when running as an SEV-ES guest
On Thu, May 06, 2021 at 12:42:03PM -0500, Eric W. Biederman wrote: > I don't understand this. > > Fundamentally kexec is about doing things more or less inspite of > what the firmware is doing. > > I don't have any idea what a SEV-ES is. But the normal x86 boot doesn't > do anything special. Is cross cpu IPI emulation buggy? Under SEV-ES the normal SIPI-based sequence to re-initialize a CPU does not work anymore. An SEV-ES guest is a special virtual machine where the hardware encrypts the guest memory and the guest register state. The hypervisor can't make any modifications to the guests registers at runtime. Therefore it also can't emulate a SIPI sequence and reset the vCPU. The guest kernel has to reset the vCPU itself and hand it over from the old kernel to the kexec'ed kernel. This isn't currently implemented and therefore kexec needs to be disabled when running as an SEV-ES guest. Implementing this also requires an extension to the guest-hypervisor protocol (the GHCB Spec[1]) which is only available in version 2. So a guest running on a hypervisor supporting only version 1 will never properly support kexec. > What is the actual problem you are trying to avoid? Currently, if someone tries kexec in an SEV-ES guest, the kexec'ed kernel will only be able to bring up the boot CPU, not the others. The others will wake up with the old kernels CPU state in the new kernels memory and do undefined things, most likely triple-fault because their page-table is not existent anymore. So since kexec currently does not work as expected under SEV-ES, it is better to hide it until everything is implemented so it can do what the user expects. > And yes for a temporary hack the suggestion of putting code into > machine_kexec_prepare seems much more reasonable so we don't have to > carry special case infrastructure for the forseeable future. As I said above, for protocol version 1 it will stay disabled, so it is not only a temporary hack. Regards, Joerg [1] https://developer.amd.com/wp-content/resources/56421.pdf ___ kexec mailing list kexec@lists.infradead.org http://lists.infradead.org/mailman/listinfo/kexec
Re: [PATCH 1/2] kexec: Allow architecture code to opt-out at runtime
On Thu, May 06, 2021 at 03:43:23PM +, Sean Christopherson wrote: > This misses kexec_file_load. Right, thanks, I will fix that in the next version. > Also, is a new hook really needed? E.g. the SEV-ES check be shoved > into machine_kexec_prepare(). The downside is that we'd do a fair > amount of work before detecting failure, but that doesn't seem hugely > problematic. That could work, but I think its more user-friendly to just claim that the syscalls are not supported at all. Regards, Joerg ___ kexec mailing list kexec@lists.infradead.org http://lists.infradead.org/mailman/listinfo/kexec
Re: [PATCHv7] x86/kdump: bugfix, make the behavior of crashkernel=X consistent with kaslr
On Fri, Feb 22, 2019 at 02:00:26PM +0100, Borislav Petkov wrote: > On Fri, Feb 22, 2019 at 09:42:41AM +0100, Joerg Roedel wrote: > > The current default of 256MB was found by experiments on a bigger > > number of machines, to create a reasonable default that is at least > > likely to be sufficient of an average machine. > > Exactly, and this is what makes sense. > > The code should try the requested reservation and if it fails, it should > try high allocation with default swiotlb size because we need to reserve > *some* range. Right, makes sense. While at it, maybe it is time to move the default allocation policy to 'high' again. The change was reverted six years ago because it broke old kexec tools, but those are probably out-of-service now. I think this change would make the whole crashdump allocation process less fragile. Regards, Joerg ___ kexec mailing list kexec@lists.infradead.org http://lists.infradead.org/mailman/listinfo/kexec
Re: [PATCHv7] x86/kdump: bugfix, make the behavior of crashkernel=X consistent with kaslr
On Fri, Feb 22, 2019 at 10:11:01AM +0800, Dave Young wrote: > In case people have a lot of devices need more swiotlb, then he manually > set the ,high with ,low together. The option to specify the high and low values for the crashkernel are important for certain machines. The point is that swiotlb already allocates 64MB of low memory by default. But that memory is only used for 32bit DMA-mask devices that want to DMA into high memory. There are drivers just allocating GFP_DMA32 memory, which also ends up in the low region (but not swiotlb), that is why the previous default of 72MB low memory was not enough, it only left 8MB of GFP_DMA32 memory. The current default of 256MB was found by experiments on a bigger number of machines, to create a reasonable default that is at least likely to be sufficient of an average machine. There is no way today for the kernel to find an optimum value for the amount of low memory required to successfully create a crash dump. It depends on the amount of devices in the system and how the drivers for them are written. The drivers have no way to report back their requirements, and even if they had, at the time the allocation happens no driver is loaded yet. So it is up to the system administrator to find workable values for the high and low memory requirements, even using experiments as a last resort. Regards, Joerg ___ kexec mailing list kexec@lists.infradead.org http://lists.infradead.org/mailman/listinfo/kexec
Re: [PATCH 00/17] Fixes and Cleanups for the Intel VT-d driver
On Mon, Jun 08, 2015 at 04:06:45PM +0800, Li, ZhenHua wrote: Finished testing on my HP huge system, SuperDome X. It works well. Thanks for testing this, Zhen-Hua. Might I add your Tested-by to the patches? Joerg ___ kexec mailing list kexec@lists.infradead.org http://lists.infradead.org/mailman/listinfo/kexec
Re: [PATCH v8 01/10] iommu/vt-d: Update iommu_attach_domain() and its callers
On Mon, Jan 12, 2015 at 03:06:19PM +0800, Li, Zhen-Hua wrote: Allow specification of the domain-id for the new domain. This patch only adds the 'did' parameter to iommu_attach_domain() and modifies all of its callers to specify the default value of -1 which says no did specified, allocate a new one. I think its better to keep the old iommu_attach_domain() interface in place and introduce a new function (like iommu_attach_domain_with_id() or something) which has the additional parameter. Then you can rewrite iommu_attach_domain(): iommu_attach_domai(...) { return iommu_attach_domain_with_id(..., -1); } This way you don't have to update all the callers of iommu_attach_domain() and the interface is more readable. Joerg ___ kexec mailing list kexec@lists.infradead.org http://lists.infradead.org/mailman/listinfo/kexec
Re: [PATCH v8 02/10] iommu/vt-d: Items required for kdump
On Mon, Jan 12, 2015 at 11:15:38AM -0500, Vivek Goyal wrote: On Mon, Jan 12, 2015 at 05:06:46PM +0100, Joerg Roedel wrote: On Mon, Jan 12, 2015 at 10:29:19AM -0500, Vivek Goyal wrote: Kdump has the notion of backup region. Where certain parts of old kernels memory can be moved to a different location (first 640K on x86 as of now) and new kernel can make use of this memory now. So we will have to just make sure that no parts of this old page table fall into backup region. Uuh, looks like the 'iommu-with-kdump-issue' isn't complicated enough yet ;) Sadly, your above statement is true for all hardware-accessible data structures in IOMMU code. I think about how we can solve this, is there an easy way to allocate memory that is not in any backup region? Hmm..., there does not seem to be any easy way to do this. In fact, as of now, kernel does not even know where is backup region. All these details are managed by user space completely (except for new kexec_file_load() syscall). That means we are left with ugly options now. - Define per arch kexec backup regions in kernel and export it to user space and let kexec-tools make use of that deinition (instead of defining its own). That way memory allocation code in kernel can look at this backup area and skip it for certain allocations. Yes, that makes sense. In fact, I think all allocations for DMA memory need to take this into account to avoid potentially serious data corruption. If any memory for a disk superblock gets allocated in backup memory and a kdump happens, the new kernel might zero out that area and the disk controler then writes the zeroes to disk instead of the superblock. Joerg ___ kexec mailing list kexec@lists.infradead.org http://lists.infradead.org/mailman/listinfo/kexec
Re: [PATCH v8 02/10] iommu/vt-d: Items required for kdump
On Mon, Jan 12, 2015 at 03:06:20PM +0800, Li, Zhen-Hua wrote: + +#ifdef CONFIG_CRASH_DUMP + +/* + * Fix Crashdump failure caused by leftover DMA through a hardware IOMMU + * + * Fixes the crashdump kernel to deal with an active iommu and legacy + * DMA from the (old) panicked kernel in a manner similar to how legacy + * DMA is handled when no hardware iommu was in use by the old kernel -- + * allow the legacy DMA to continue into its current buffers. + * + * In the crashdump kernel, this code: + * 1. skips disabling the IOMMU's translating of IO Virtual Addresses (IOVA). + * 2. Do not re-enable IOMMU's translating. + * 3. In kdump kernel, use the old root entry table. + * 4. Leaves the current translations in-place so that legacy DMA will + *continue to use its current buffers. + * 5. Allocates to the device drivers in the crashdump kernel + *portions of the iova address ranges that are different + *from the iova address ranges that were being used by the old kernel + *at the time of the panic. + * + */ It looks like you are still copying the io-page-tables from the old kernel into the kdump kernel, is that right? With the approach that was proposed you only need to copy over the context entries 1-1. They are still pointing to the page-tables in the old kernels memory (which is just fine). The root-entry of the old kernel is also re-used, and when the kdump kernel starts to use a device, its context entry is updated to point to a newly allocated page-table. Joerg ___ kexec mailing list kexec@lists.infradead.org http://lists.infradead.org/mailman/listinfo/kexec
Re: [PATCH v8 02/10] iommu/vt-d: Items required for kdump
On Mon, Jan 12, 2015 at 10:29:19AM -0500, Vivek Goyal wrote: Kdump has the notion of backup region. Where certain parts of old kernels memory can be moved to a different location (first 640K on x86 as of now) and new kernel can make use of this memory now. So we will have to just make sure that no parts of this old page table fall into backup region. Uuh, looks like the 'iommu-with-kdump-issue' isn't complicated enough yet ;) Sadly, your above statement is true for all hardware-accessible data structures in IOMMU code. I think about how we can solve this, is there an easy way to allocate memory that is not in any backup region? Thanks, Joerg ___ kexec mailing list kexec@lists.infradead.org http://lists.infradead.org/mailman/listinfo/kexec
Re: [PATCH 0/5] iommu/vt-d: Fix crash dump failure caused by legacy DMA/IO
On Fri, Dec 12, 2014 at 10:25:31AM +0800, Li, ZhenHua wrote: Sorry I have no plan yet. Could you send me your logs on your AMD system? On 12/10/2014 04:46 PM, Baoquan He wrote: This issue happens on AMD iommu too, do you have any plans or thoughts on that? I think the best approach for now is to get a prove-of-concept on the VT-d driver. If it works there the way we expect, we can implement the same handling in the AMD driver. But I see no reason to hold back the VT-d patches until it is also fixed for AMD systems. Joerg ___ kexec mailing list kexec@lists.infradead.org http://lists.infradead.org/mailman/listinfo/kexec
Re: [PATCH 0/5] iommu/vt-d: Fix crash dump failure caused by legacy DMA/IO
On Mon, Dec 01, 2014 at 02:31:38PM +0800, Li, ZhenHua wrote: After I implement these two steps, there comes a new fault: [1.594890] dmar: DRHD: handling fault status reg 2 [1.594894] dmar: INTR-REMAP: Request device [[41:00.0] fault index 4d [1.594894] INTR-REMAP:[fault reason 34] Present field in the IRTE entry is clear It is caused by similar reason, so I will fix it like fixing the DMAR faults: Do NOT disable and re-enable the interrupt remapping, try to use data from old kernel. Yes, that sounds right, thanks. Looking forward to your patches. Joerg ___ kexec mailing list kexec@lists.infradead.org http://lists.infradead.org/mailman/listinfo/kexec
Re: [PATCH 0/5] iommu/vt-d: Fix crash dump failure caused by legacy DMA/IO
On Fri, Nov 14, 2014 at 02:27:44PM +0800, Li, ZhenHua wrote: I am working following your directions: 1. If the VT-d driver finds the IOMMU enabled, it reuses its root entry table, and do NOT disable-enable iommu. Other data will be copied. 2. When a device driver issues the first dma_map command for a device, we assign a new and empty page-table, thus removing all mappings from the old kernel for the device. Please let me know if I get something wrong. Yes, this sounds right. Happily waiting for patches :) Joerg ___ kexec mailing list kexec@lists.infradead.org http://lists.infradead.org/mailman/listinfo/kexec
Re: [PATCH 0/8] iommu/vt-d: Fix crash dump failure caused by legacy DMA/IO
Hi Bjorn, On Tue, Oct 21, 2014 at 08:16:46PM -0600, Bjorn Helgaas wrote: I was looking at Zhen-Hua's recent patches, trying to figure out if I need to do anything with them. Resetting devices in the old kernel seems like a non-starter. Resetting devices in the new kernel, ..., well, maybe. It seems ugly, and it seems like the sort of problem that IOMMUs are designed to solve. Actually resetting the devices in the kdump kernel would be one of the better solutions for this problem. When we have a generic way to stop all in-flight DMA from the PCI endpoints we could safely disable and then re-enable the IOMMU. On Wed, Jul 2, 2014 at 7:32 AM, Joerg Roedel j...@8bytes.org wrote: That is a solution to prevent the in-flight DMA failures. But what happens when there is some in-flight DMA to a disk to write some inodes or a new superblock. Then this scratch address-space may cause filesystem corruption at worst. This in-flight DMA is from a device programmed by the old kernel, and it would be reading data from the old kernel's buffers. I think you're suggesting that we might want that DMA read to complete so the device can update filesystem metadata? Well, it is not about updating filesystem metadata. In the kdump kernel we have these options: 1) Disable the IOMMU. Problem here is, that DMA is now untranslated, so that any in-flight DMA might read or write from a random location in memory, corrupting the kdump or even the new kexec kernel memory. So this is a non-starter. 2) Re-program the IOMMU to block all DMA. This is safer as it does not corrupt any data in memory. But some devices react very poorly on a master abort from the IOMMU, so bad that the driver in the kdump kernel fails to initialize that device. In this case taking dump itself might fail (and often does, according to reports) 3) To prevent master aborts like in option (2), David suggested to map the whole DMA address space to a scratch page. This also prevents system memory corruption and the master aborts. But the problem is, that in-flight DMA will now read all zeros. This can corrupt disk and network data, at worst it nulls out the superblocks of your filesystem and you lose all data. So this is not an option too. What we currently do in the VT-d driver is a mixture of (1) and (2). The VT-d driver disables the IOMMU hardware (opening a race window for memory data corruption), re-initializes it to reject any ongoing DMA (which causes master aborts for in-flight DMA) and re-enables the IOMMU hardware. But this strategy fails in heavy IO environments quite often and we look into ways to solve the problem, or at least improve the current situation as good as we can. I talked to David about this at LPC and we came up with basically this procedure: 1. If the VT-d driver finds the IOMMU enabled, it reuses its root-context table. This way the IOMMU must not be disabled and re-enabled, eliminating the race we have now. Other data structures like the context-entries are copied over from the old kernel. We basically keep all mappings from the old kernel, allowing any in-flight DMA to succeed. No memory or disk data corruption. The issue here is, that we modify data from the old kernel which is about to be dumped. But there are ways to work around that. 2. When a device driver issues the first dma_map command for a device, we assign a new and empty page-table, thus removing all mappings from the old kernel for the device. Rationale is, that at this point the device driver should have reset the device to a point where all in-flight DMA is canceled. This approach goes into the same direction as Bill Sumners patch-set, which Li took over. But it goes not as far as keeping the old mappings while the kdump kernel is still working with the devices (which might introduce new issues and corner cases). So with this in mind I would prefer initially taking over the page-tables from the old kernel before the device drivers re-initialize the devices. This makes the dump kernel more dependent on data from the old kernel, which we obviously want to avoid when possible. Sure, but this is not really possible here (unless we have a generic and reliable way to reset all PCI endpoint devices and cancel all in-flight DMA before we disable the IOMMU in the kdump kernel). Otherwise we always risk data corruption somewhere, in system memory or on disk. I didn't find the previous discussion where pointing every virtual bus address at the same physical scratch page was proposed. Why was that better than programming the IOMMU to reject every DMA? As I said, the problem
Re: [PATCH 0/8] iommu/vt-d: Fix crash dump failure caused by legacy DMA/IO
Hi David, On Wed, Apr 30, 2014 at 11:49:33AM +0100, David Woodhouse wrote: There could be all kinds of existing mappings in the DMA page tables, and I'm not sure it's safe to preserve them. What prevents the crashdump kernel from trying to use any of the physical pages which are accessible, and which could thus be corrupted by stray DMA? In fact, the old kernel could even have set up 1:1 passthrough mappings for some devices, which would then be able to DMA *anywhere*. Surely we need to prevent that? Ideally we would prevent that, yes. But the problem is that a failed DMA transaction might put the device into an unrecoverable state. Usually any in-flight DMA transactions should only target buffers set up by the previous kernel and not corrupt any data. After the last round of this patchset, we discussed a potential improvement where you point every virtual bus address at the *same* physical scratch page. That is a solution to prevent the in-flight DMA failures. But what happens when there is some in-flight DMA to a disk to write some inodes or a new superblock. Then this scratch address-space may cause filesystem corruption at worst. So with this in mind I would prefer initially taking over the page-tables from the old kernel before the device drivers re-initialize the devices. Joerg ___ kexec mailing list kexec@lists.infradead.org http://lists.infradead.org/mailman/listinfo/kexec
Re: [PATCHv3 3/6] Crashdump-Accepting-Active-IOMMU-Domain-Interfaces
On Fri, Jan 10, 2014 at 03:07:29PM -0700, Bill Sumner wrote: +context_get_entry(struct context_entry *context_addr, + struct intel_iommu *iommu, u32 bus, int devfn) +{ + unsigned long long q; /* quadword scratch */ + struct root_entry *root_phys; /* Phys adr (root table entry) */ + struct root_entry root_temp; /* Local copy of root_entry */ + struct context_entry *context_phys; /* Phys adr */ + + if (pr_dbg.domain_get) + pr_debug(ENTER %s B:D:F=%2.2x:%2.2x:%1.1x context_entry:0x%llx intel_iommu:0x%llx\n, + __func__, bus, devfn3, devfn7, + (u64)context_addr, (u64)iommu); + + if (bus 255) /* Sanity check */ + return -5; + if (devfn 255 || devfn 0) /* Sanity check */ + return -6; + + q = readq(iommu-reg + DMAR_RTADDR_REG); + pr_debug(IOMMU %d: DMAR_RTADDR_REG:0x%16.16llx\n, iommu-seq_id, q); + if (!q) + return -1; + + root_phys = (struct root_entry *) q;/* Adr(base of vector) */ + root_phys += bus; /* Adr(entry we want) */ + + oldcopy(root_temp, root_phys, sizeof(root_temp)); + + pr_debug(root_temp.val:0x%llx .rsvd1:0x%llx root_phys:0x%llx\n, + root_temp.val, root_temp.rsvd1, (u64)root_phys); + + if (!root_present(root_temp)) + return -2; + + pr_debug(B:D:F=%2.2x:%2.2x:%1.1x root_temp.val: %llx .rsvd1: %llx\n, + bus, devfn 3, devfn 7, root_temp.val, root_temp.rsvd1); + + if (root_temp.rsvd1)/* If (root_entry is bad) */ + return -3; + + context_phys = get_context_phys_from_root(root_temp); + if (!context_phys) + return -4; What do all these magic return values mean? ___ kexec mailing list kexec@lists.infradead.org http://lists.infradead.org/mailman/listinfo/kexec
Re: [PATCHv3 0/6] Crashdump Accepting Active IOMMU
Hi Bill, On Fri, Jan 10, 2014 at 03:07:26PM -0700, Bill Sumner wrote: Bill Sumner (6): Crashdump-Accepting-Active-IOMMU-Flags-and-Prototype Crashdump-Accepting-Active-IOMMU-Utility-functions Crashdump-Accepting-Active-IOMMU-Domain-Interfaces Crashdump-Accepting-Active-IOMMU-Copy-Translations Crashdump-Accepting-Active-IOMMU-Debug-Print-IOMMU Crashdump-Accepting-Active-IOMMU-Call-From-Mainline drivers/iommu/intel-iommu.c | 1293 --- 1 file changed, 1225 insertions(+), 68 deletions(-) This is a pretty big change, before merging I would like to get more eyes on it. Please make sure you get more reviews on the code and the general concept. A few things I noticed while looking at it: * You put a lot of code into #define CONFIG_CRASH_DUMP. Isn't the same approach necessary to for KEXEC? * Please get rid of all the pr_debug stuff. * I think it makes sense to put the functions to access the IOMMU configuration values of the old kernel into a new file. This is better than adding over 1000 new lines to intel-iommu.c * I have seen your new single-patch for this change. A single patch with more than 1200 changed lines is not something anyone is willing to review. Please split the patches again in a meaningful and bisectable way. Thanks, Joerg ___ kexec mailing list kexec@lists.infradead.org http://lists.infradead.org/mailman/listinfo/kexec
Re: [PATCH v2] intel-iommu: Quiesce devices before disabling IOMMU
On Wed, Sep 18, 2013 at 03:09:01PM +0900, Takao Indoh wrote: + /* + * In the case of kdump, ioremap is needed because root-entry table + * exists in first kernel's memory area which is not mapped in second + * kernel + */ + root = (struct root_entry *)ioremap(addr, PAGE_SIZE); + if (!root) + return; + + for (bus = 0; bus ROOT_ENTRY_NR; bus++) { + if (!root_present(root[bus])) + continue; + + context = (struct context_entry *)ioremap( + root[bus].val VTD_PAGE_MASK, PAGE_SIZE); + if (!context) + continue; + + for (devfn = 0; devfn CONTEXT_ENTRY_NR; devfn++) { + if (!context_present(context[devfn])) + continue; + + dev = pci_get_domain_bus_and_slot(segment, bus, devfn); + if (!dev) + continue; + + if (!pci_reset_bus(dev-bus)) /* go to next bus */ + break; + else /* Try per-function reset */ + pci_reset_function(dev); + + } + iounmap(context); + } + iounmap(root); I am not convinced that this is the right approach. If a device wasn't translated by VT-d in the old kernel doesn't mean it will not be translated in the new kernel. How about unconditionally resetting all PCI busses and/or functions here before IOMMU initialization proceeds? Joerg ___ kexec mailing list kexec@lists.infradead.org http://lists.infradead.org/mailman/listinfo/kexec
Re: [PATCH] intel-iommu: Synchronize gcmd value with global command register
On Mon, Apr 15, 2013 at 06:00:13PM +0900, Takao Indoh wrote: On DMAR initialization during kdump boot, do you guys agree to change order like this? Current order: (1) Disable translation (2) PCI initialization (3) Make pgtable and enable translation. Order I'm proposing: (1) PCI initialization (2) Disable translation (3) Make pgtable and enable translation. I think this should work. In fact, translation only needs to be disabled while the IOMMU hardware is reprogrammed to the new data structures the kdump kernel set up. As Joerg said, if we need to consider the case that kdump kernel is compiled without dma-remapping(CONFIG_INTEL_IOMMU is off?), I can update my patch to handle such a case properly by adding ifdef CONFIG_INTEL_IOMMU. Thinking about it, this case is a bit more special. If the kdump kernel has no IOMMU driver at all there is also no code to disable it. So unless we want to force the kdump kernel to have the driver when the first kernel had it, I think we also want some quirk (depending on !CONFIG_INTEL_IOMMU) that disables translation at boot. I know, its complicated :) Joerg ___ kexec mailing list kexec@lists.infradead.org http://lists.infradead.org/mailman/listinfo/kexec
Re: [PATCH] intel-iommu: Synchronize gcmd value with global command register
On Wed, Apr 03, 2013 at 09:24:39AM +0100, David Woodhouse wrote: On Wed, 2013-04-03 at 16:11 +0900, Takao Indoh wrote: Yeah, you are right. I forgot such a case. If you disable translation and there's some device still doing DMA, it's going to scribble over random areas of memory. You really want to have translation enabled and all the page tables *cleared*, during kexec. I think it's fair to insist that the secondary kernel should use the IOMMU if the first one did. Do we really need to insist on that? The IOMMU initialization on x86 happens after the kernel scanned and enumerated the PCI bus. While doing this the kernel (at least it should) disables all devices it finds. So when the IOMMU init code runs we should be safe from any in-flight DMA and can either disable translation or re-initialize it for the kdump kernel. Until then translation needs to stay enabled of course, so that the old page-tables are still used and in-flight DMA doesn't corrupt any data. Joerg ___ kexec mailing list kexec@lists.infradead.org http://lists.infradead.org/mailman/listinfo/kexec
Re: [PATCH] intel-iommu: Synchronize gcmd value with global command register
On Mon, Apr 01, 2013 at 02:45:18PM +0900, Takao Indoh wrote: Current flow on kdump boot enable_IR intel_enable_irq_remapping iommu_disable_irq_remapping == IRES/QIES/TES disabled here dmar_disable_qi == do nothing dmar_enable_qi == QIES enabled intel_setup_irq_remapping== IRES enabled But what we want to do here in the kdumo case is to disable translation too, right? Because the former kernel might have translation and irq-remapping enabled and the kdump kernel might be compiled without support for dma-remapping. So if we don't disable translation here too the kdump kernel is unable to do DMA. I agree that disabling translation should be a bit more explicit instead of the current code. Joerg ___ kexec mailing list kexec@lists.infradead.org http://lists.infradead.org/mailman/listinfo/kexec
Re: [PATCH] intel-iommu: Synchronize gcmd value with global command register
On Wed, Mar 27, 2013 at 02:02:44PM +0900, Takao Indoh wrote: The root cause of this problem is mismatch between iommu-gcmd and global command register in the case of kdump. At boot time, initial value of iommu-gcmd is zero as I wrote above, but actual global command register is *not* zero because some bits like IRE/TE/QIE are already set in *first* kernel. Therefore this patch synchronize them to fix this problem. Ok, I understand, but I still don't see why this is a problem. There is no point for the kdump kernel to preserve hardware state from the previous kernel. So I think the way it is implemented is correct. Joerg ___ kexec mailing list kexec@lists.infradead.org http://lists.infradead.org/mailman/listinfo/kexec
Re: [PATCH] intel-iommu: Synchronize gcmd value with global command register
On Thu, Mar 21, 2013 at 10:32:36AM +0900, Takao Indoh wrote: In this function, clearing IRE bit in iommu-gcmd and writing it to global command register. But initial value of iommu-gcmd is zero, so this writel means clearing all bits in global command register. Seems weird. Why is the value of gcmd zero in your case? The usage of this register is well encapsulated by the different parts of the VT-d driver. There are other places which enable/disable translation and qpi the same way it is done with interrupt remapping. So it looks to me that it is unlikely that gcmd is really zero in your case. Can you explain that more and also describe what the actual misbehavior is you are trying to fix here? Thanks, Joerg ___ kexec mailing list kexec@lists.infradead.org http://lists.infradead.org/mailman/listinfo/kexec
Re: [PATCH 0/4] Export offsets of VMCS fields as note information for kdump
Hi, On Wed, Apr 11, 2012 at 09:39:43AM +0800, zhangyanfei wrote: The problem is that VMCS internal is hidden by Intel in its specification. So, we reverse engineering it in the way implemented in this patch set. Have you made sure this layout is the same on all uarchitectures that implment VMX? Joerg -- AMD Operating System Research Center Advanced Micro Devices GmbH Einsteinring 24 85609 Dornach General Managers: Alberto Bozzo Registration: Dornach, Landkr. Muenchen; Registerger. Muenchen, HRB Nr. 43632 ___ kexec mailing list kexec@lists.infradead.org http://lists.infradead.org/mailman/listinfo/kexec
Re: [PATCH 0/4] AMD IOMMU kdump fix plus cleanups (v2)
On Fri, Apr 02, 2010 at 06:27:51PM -0700, Chris Wright wrote: Series includes the following patches: x86/amd-iommu: enable iommu before attaching devices x86/amd-iommu: warn when issuing command to uninitialized cmd buffer Revert x86: disable IOMMUs on kernel crash x86/amd-iommu: use for_each_pci_dev First one is the primary bug fix. v2 - add disable_iommus on failure path - move removal of CMD_BUFFER_UNINITIALIZED to iommu_enable_command_buffer() - fix ;; typo in patch 2 - add Revert x86: disable IOMMUs on kernel crash - add x86/amd-iommu: use for_each_pci_dev Applied all, thanks. ___ kexec mailing list kexec@lists.infradead.org http://lists.infradead.org/mailman/listinfo/kexec
Re: [PATCH 3/4] Revert x86: disable IOMMUs on kernel crash
On Sat, Apr 03, 2010 at 10:44:22AM -0700, Eric W. Biederman wrote: Joerg Roedel j...@8bytes.org writes: Hmm, I think for this we need to change the gart code too and disable the gart before its initialization runs to not re-introduce issues fixed in commit bc2cea6a34fdb30f118ec75db39a46a191870607, no? That is a different code path with a different set of assumptions and restrictions. On a normal kexec of course we want to do an orderly shutdown. Thats another problem with this patch. It introduces a difference between the panic-shutdown kexec and the ordinary kexec. For the gart with a little luck we can just ignore it on kexec on panic. The commit I mentioned above already proves this assumption wrong. Unlike a virtualization capable iommu it doesn't prevent access to devices, when it is enabled. Worst case is that we have to start including iommu=off for gart systems. No no no. This is a maintenance nightmare for almost everybody. Where do you want to Document this special cases that 'if kernel uses gart then and only then boot the kexec kernel with iommu=off'. Always passing iommu=off to the kexec kernel doesn't work too for obvious reasons. The best case is that we can figure out how to have the gart code reinitialize itself sanely, starting from some arbitrary point. Yes, that is missing in this patch. But to keep changes small and don't bother with the gart code at all I suggest to remove the shutdown routine from the amd-iommu code only and not the whole shutdown call in the machine_crash_shutdown path. Joerg ___ kexec mailing list kexec@lists.infradead.org http://lists.infradead.org/mailman/listinfo/kexec
Re: [PATCH 3/4] Revert x86: disable IOMMUs on kernel crash
On Sun, Apr 04, 2010 at 09:24:30AM +0200, Bernhard Walle wrote: Am 03.04.10 19:49, schrieb Eric W. Biederman: Not a problem. We require a lot of things of the kdump kernel, and it is immediately apparent in a basic sanity test. Also, in most cases (for example: distribution kernels), the kdump kernel nowadays is identical to the running kernel. So, if the running kernel has IOMMU support, the kdump kernel also has. Yes, I know. But is that a requirement for kexec? If not we still potentially have this problem. It is a smaller problem than data-corruption caused by in-flight dma because most people^Wdistributions indeed use the same kernel for normal boot and kexec, thats true. Joerg ___ kexec mailing list kexec@lists.infradead.org http://lists.infradead.org/mailman/listinfo/kexec
Re: [PATCH 3/4] Revert x86: disable IOMMUs on kernel crash
On Sun, Apr 04, 2010 at 02:44:36AM -0700, Eric W. Biederman wrote: Joerg Roedel j...@8bytes.org writes: On Sun, Apr 04, 2010 at 09:24:30AM +0200, Bernhard Walle wrote: Am 03.04.10 19:49, schrieb Eric W. Biederman: Not a problem. We require a lot of things of the kdump kernel, and it is immediately apparent in a basic sanity test. Also, in most cases (for example: distribution kernels), the kdump kernel nowadays is identical to the running kernel. So, if the running kernel has IOMMU support, the kdump kernel also has. Yes, I know. But is that a requirement for kexec? For normal kexec no. That path is expected to do a clean hardware shutdown. For kexec on panic aka kdump the requirement is that your your crash kernel be able to initialize your hardware from any state it can be put in. Ok, if you show me where this is documented for everybody then I am probably convinced :-) We should fixup the gart initialization anyway. Joerg ___ kexec mailing list kexec@lists.infradead.org http://lists.infradead.org/mailman/listinfo/kexec
Re: [PATCH 3/4] Revert x86: disable IOMMUs on kernel crash
On Fri, Apr 02, 2010 at 06:27:54PM -0700, Chris Wright wrote: This effectively reverts commit 61d047be99757fd9b0af900d7abce9a13a337488. Disabling the IOMMU can potetially allow DMA transactions to complete without being translated. Leave it enabled, and allow crash kernel to do the IOMMU reinitialization properly. Cc: Joerg Roedel joerg.roe...@amd.com Cc: Eric Biederman ebied...@xmission.com Cc: Neil Horman nhor...@tuxdriver.com Cc: Vivek Goyal vgo...@redhat.com Signed-off-by: Chris Wright chr...@sous-sol.org --- arch/x86/kernel/crash.c |6 -- 1 file changed, 6 deletions(-) --- a/arch/x86/kernel/crash.c +++ b/arch/x86/kernel/crash.c @@ -27,7 +27,6 @@ #include asm/cpu.h #include asm/reboot.h #include asm/virtext.h -#include asm/x86_init.h #if defined(CONFIG_SMP) defined(CONFIG_X86_LOCAL_APIC) @@ -103,10 +102,5 @@ void native_machine_crash_shutdown(struc #ifdef CONFIG_HPET_TIMER hpet_disable(); #endif - -#ifdef CONFIG_X86_64 - x86_platform.iommu_shutdown(); -#endif - crash_save_cpu(regs, safe_smp_processor_id()); Hmm, I think for this we need to change the gart code too and disable the gart before its initialization runs to not re-introduce issues fixed in commit bc2cea6a34fdb30f118ec75db39a46a191870607, no? Joerg ___ kexec mailing list kexec@lists.infradead.org http://lists.infradead.org/mailman/listinfo/kexec
Re: [PATCH 1/2] x86/amd-iommu: enable iommu before attaching devices
On Fri, Apr 02, 2010 at 11:59:32AM -0400, Vivek Goyal wrote: 1. kernel crashes, we leave IOMMU enabled. True for everything except gart and amd iommu. a. So during this small window when iommu is disabled and we enable it back, any inflight DMA will passthrough possibly to an unintended physical address as translation is disabled and it can corrupt the kdump kenrel. Right. b. Even after enabling the iommu, I guess we will continue to use cached DTE, and translation information to handle any in-flight DMA. The difference is that now iommus are enabled so any in-flight DMA should go to the address as intended in first kenrel and should not corrupt anything. Right. 3. Once iommus are enabled again, we allocated and initilize protection domains. We attach devices to domains. In the process we flush the DTE, PDE and IO TLBs. c. Looks like do_attach-set_dte_entry(), by default gives write permission (IW) to all the devices. I am assuming that at this point of time translation is enabled and possibly unity mapped. No, The IW bit in the DTE must be set because all write permission bits (DTE and page tabled) are ANDed to determine if a device can write to a particular address. So as long as the paging mode is unequal to zero the hardware will walk the page-table first to find out if the device has write permission. With paging mode == 0 your statement about read-write unity-mapping is true. This is used for a pass-through domain (iommu=pt) btw. Joerg ___ kexec mailing list kexec@lists.infradead.org http://lists.infradead.org/mailman/listinfo/kexec
Re: [PATCH 3/4] Revert x86: disable IOMMUs on kernel crash
On Fri, Apr 02, 2010 at 06:27:54PM -0700, Chris Wright wrote: This effectively reverts commit 61d047be99757fd9b0af900d7abce9a13a337488. Disabling the IOMMU can potetially allow DMA transactions to complete without being translated. Leave it enabled, and allow crash kernel to do the IOMMU reinitialization properly. Cc: Joerg Roedel joerg.roe...@amd.com Cc: Eric Biederman ebied...@xmission.com Cc: Neil Horman nhor...@tuxdriver.com Cc: Vivek Goyal vgo...@redhat.com Signed-off-by: Chris Wright chr...@sous-sol.org --- arch/x86/kernel/crash.c |6 -- 1 file changed, 6 deletions(-) --- a/arch/x86/kernel/crash.c +++ b/arch/x86/kernel/crash.c @@ -27,7 +27,6 @@ #include asm/cpu.h #include asm/reboot.h #include asm/virtext.h -#include asm/x86_init.h #if defined(CONFIG_SMP) defined(CONFIG_X86_LOCAL_APIC) @@ -103,10 +102,5 @@ void native_machine_crash_shutdown(struc #ifdef CONFIG_HPET_TIMER hpet_disable(); #endif - -#ifdef CONFIG_X86_64 - x86_platform.iommu_shutdown(); -#endif - crash_save_cpu(regs, safe_smp_processor_id()); Another problem: This also breaks if the kdump kernel has no iommu-support. Joerg ___ kexec mailing list kexec@lists.infradead.org http://lists.infradead.org/mailman/listinfo/kexec
Re: [PATCH 1/2] x86/amd-iommu: enable iommu before attaching devices
On Thu, Apr 01, 2010 at 06:23:53PM -0700, Chris Wright wrote: Hit another kdump problem as reported by Neil Horman. When initializaing the IOMMU, we attach devices to their domains before the IOMMU is fully (re)initialized. Attaching a device will issue some important invalidations. In the context of the newly kexec'd kdump kernel, the IOMMU may have stale cached data from the original kernel. Because we do the attach too early, the invalidation commands are placed in the new command buffer before the IOMMU is updated w/ that buffer. This leaves the stale entries in the kdump context and can renders device unusable. Simply enable the IOMMU before we do the attach. Cc: Neil Horman nhor...@tuxdriver.com Cc: Vivek Goyal vgo...@redhat.com Signed-off-by: Chris Wright chr...@sous-sol.org --- arch/x86/kernel/amd_iommu_init.c |4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) --- a/arch/x86/kernel/amd_iommu_init.c +++ b/arch/x86/kernel/amd_iommu_init.c @@ -1288,6 +1288,8 @@ static int __init amd_iommu_init(void) if (ret) goto free; + enable_iommus(); + if (iommu_pass_through) ret = amd_iommu_init_passthrough(); else @@ -1300,8 +1302,6 @@ static int __init amd_iommu_init(void) amd_iommu_init_notifier(); - enable_iommus(); - if (iommu_pass_through) goto out; Ok, good to know this fixes the problem. One issue: If the initialization of the domains fails the iommu hardware needs to be disabled again in the free path. Thanks, Joerg ___ kexec mailing list kexec@lists.infradead.org http://lists.infradead.org/mailman/listinfo/kexec
Re: [PATCH] amd iommu: force flush of iommu prior during shutdown
Hi Neil, first some general words about the problem you discovered: The problem is not caused by in-flight DMA. The problem is that the IOMMU hardware has cached the old DTE entry for the device (including the old page-table root pointer) and is using it still when the kdump kernel has booted. We had this problem once and fixed it by flushing a DTE in the IOMMU before it is used for the first time. This seems to be broken now. Which kernel have you seen this on? I am back in office next tuesday and will look into this problem too. On Wed, Mar 31, 2010 at 04:27:45PM -0400, Neil Horman wrote: So I'm officially rescinding this patch. Yeah, the right solution to this problem is to find out why every DTE is not longer flushed before first use. It apparently just covered up the problem, rather than solved it outright. This is going to take some more thought on my part. I read the code a bit closer, and the amd iommu on boot up currently marks all its entries as valid and having a valid translation (because if they're marked as invalid they're passed through untranslated which strikes me as dangerous, since it means a dma address treated as a bus address could lead to memory corruption. The saving grace is that they are marked as non-readable and non-writeable, so any device doing a dma after the reinit should get logged (which it does), and then target aborted (which should effectively squash the translation) Right. The default for all devices is to forbid DMA. I'm starting to wonder if: 1) some dmas are so long lived they start aliasing new dmas that get mapped in the kdump kernel leading to various erroneous behavior At least not in this case. Even when this is true the DMA would target memory of the crashed kernel and not the kdump area. This is not even memory corruption because the device will write to memory the driver has allocated for it. 2) a slew of target aborts to some hardware results in them being in an inconsistent state Thats indeed true. I have seen that with ixgbe cards for example. They seem to be really confused after an target abort. I'm going to try marking the dev table on shutdown such that all devices have no read/write permissions to see if that changes the situation. It should I think give me a pointer as to weather (1) or (2) is the more likely problem. Probably not. You still need to flush the old entries out of the IOMMU. Thanks, Joerg ___ kexec mailing list kexec@lists.infradead.org http://lists.infradead.org/mailman/listinfo/kexec
Re: [PATCH] amd iommu: force flush of iommu prior during shutdown
On Thu, Apr 01, 2010 at 10:47:36AM -0400, Neil Horman wrote: On Thu, Apr 01, 2010 at 04:29:02PM +0200, Joerg Roedel wrote: I am back in office next tuesday and will look into this problem too. Thank you. Just took a look and I think the problem is that the devices are attached to domains before the IOMMU hardware is enabled. This happens in the function prealloc_protection_domains(). The attach code issues the dte-invalidate commands but they are not executed because the hardware is off. I will verify this when I have access to hardware again. The possible fix will be to enable the hardware earlier in the initialization path. Right. The default for all devices is to forbid DMA. Thanks, glad to know I read that right, took me a bit to understand it :) I should probably add a comment :-) Thats indeed true. I have seen that with ixgbe cards for example. They seem to be really confused after an target abort. Yeah, this part worries me, target aborts lead to various brain dead hardware pieces. What are you thoughts on leaving the iommu on through a reboot to avoid this issue (possibly resetting any pci device that encounters a target abort, as noted in the error log on the iommu? This would only prevent possible data corruption. When the IOMMU is off the devices will not get a target abort but will only write to different physical memory locations. The window where a target abort can happen starts when the kdump kernel re-enables the IOMMU and ends when the new driver for that device attaches. This is a small window but there is not a lot we can do to avoid this small time window. Joerg ___ kexec mailing list kexec@lists.infradead.org http://lists.infradead.org/mailman/listinfo/kexec
Re: [PATCH] amd iommu: force flush of iommu prior during shutdown
On Thu, Apr 01, 2010 at 01:11:49PM -0400, Neil Horman wrote: On Thu, Apr 01, 2010 at 05:56:43PM +0200, Joerg Roedel wrote: The possible fix will be to enable the hardware earlier in the initialization path. That sounds like a reasonable theory, I'll try hack something together shortly. Great. So the problem might be already fixed when I am back in the office ;-) This would only prevent possible data corruption. When the IOMMU is off the devices will not get a target abort but will only write to different physical memory locations. The window where a target abort can happen starts when the kdump kernel re-enables the IOMMU and ends when the new driver for that device attaches. This is a small window but there is not a lot we can do to avoid this small time window. Can you explain this a bit further please? From what I read, when the iommu is disabled, AIUI it does no translations. That means that any dma addresses which the driver mapped via the iommu prior to a crash that are stored in devices will just get strobed on the bus without any translation. If those dma address do not lay on top of any physical ram, won't that lead to bus errors, and transaction aborts? Worse, if those dma addresses do lie on top of real physical addresses, won't we get corruption in various places? Or am I missing part of how that works? Hm, the device address may not be a valid host physical address, thats true. But the problem with the small time-window when the IOMMU hardware is re-programmed from the kdump kernel still exists. I need to think about other possible side-effects of leaving the IOMMU enabled on shutdown^Wboot into a kdump kernel. Joerg ___ kexec mailing list kexec@lists.infradead.org http://lists.infradead.org/mailman/listinfo/kexec